Emu3.5 comes from BAAI, aka the Beijing Academy of Artificial Intelligence. It's a big multimodal model built to handle both text and images.
It natively outputs images at 480p or 720p resolution in various aspect ratios.
You can get it in three versions:
Emu3.5. Handles most tasks like mixed text-image generation and editing.
Emu3.5-Image. Focuses more on making high-quality images.
Emu3.5-VisionTokenizer. Used for converting visuals into tokens the model can understand.
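To get a feel for what "converting visuals into tokens" means, here's a toy nearest-neighbor sketch of vector-quantized tokenization. The real Emu3.5-VisionTokenizer is a learned neural codec, so treat this purely as an illustration of the idea: each image patch gets snapped to the closest entry in a codebook, and the entry's index becomes a discrete token.

```python
import numpy as np

# Toy VQ-style visual tokenizer (illustrative only; not the actual
# Emu3.5-VisionTokenizer, which is a learned neural model).
def tokenize_patches(patches, codebook):
    """Map each patch vector to the index of its nearest codebook entry."""
    # patches: (num_patches, dim); codebook: (codebook_size, dim)
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # one discrete token per patch

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))    # 8-entry codebook, 4-dim patch vectors
patches = codebook[[3, 1, 3]] + 0.01  # fake patches sitting near entries 3, 1, 3
tokens = tokenize_patches(patches, codebook)
print(tokens.tolist())  # → [3, 1, 3]
```

Once images are flattened into integer tokens like this, the transformer can treat them exactly like words in a sentence.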
It runs on a decoder-only transformer with 64 layers and roughly 34 billion parameters. BAAI says it was pre-trained on over 10 trillion tokens, mostly sequential video frames paired with their transcripts. Pre-training was followed by supervised fine-tuning and reinforcement learning to sharpen how it handles interleaved text and image content.
To speed up inference, they added something called Discrete Diffusion Adaptation (DiDA). It swaps one-by-one autoregressive image decoding for parallel prediction of many image tokens at once, so you get images much faster without losing much accuracy.
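The speedup intuition is simple to sketch in back-of-envelope form: if an image is N tokens and you can commit k tokens per forward pass instead of 1, you cut the number of passes by roughly a factor of k. The grid size and tokens-per-step below are made-up numbers for illustration; the actual DiDA schedule in the paper may differ.

```python
import math

# Back-of-envelope decoding-pass comparison (illustrative numbers only).
def autoregressive_passes(num_tokens):
    return num_tokens  # one forward pass per image token

def parallel_passes(num_tokens, tokens_per_step):
    # commit several tokens per pass, diffusion-style
    return math.ceil(num_tokens / tokens_per_step)

n = 32 * 32  # assume a 1024-token image grid
print(autoregressive_passes(n))  # → 1024 passes, one token at a time
print(parallel_passes(n, 64))    # → 16 passes at 64 tokens per step
```

Fewer forward passes means less wall-clock time per image, which is the whole point of the adaptation.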
The paper pitches Emu3.5 as the first real shot at native large-scale vision-language generation: it's trained to output interleaved sequences of images and text that stay consistent over time and make sense together. That lets it do stuff like:
Visual storytelling. Think picture-based lessons or creative scenes.
Visual step-by-step guides. It can show how things work or how to do stuff in clear steps.
Simulated worlds. You can move through or control made-up scenes like you're exploring or tweaking a 3D world.
They benchmarked it across these tasks and report that it beats Gemini 2.5 Flash Image, which is still closed-source.
If you're trying to run it yourself... good luck. It's smaller than Hunyuan-Image-3.0 (an 80B model), but 34B still isn't small: in fp16 the weights alone are on the order of 68 GB. You'll need serious GPU hardware, and probably have to split the model across cards or use tricks like quantization. Maybe - just maybe - you can get it going on a single 5090 if you squeeze it down to 4-bit.
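To see why quantization is the make-or-break trick here, a quick weights-only estimate (activations and KV cache add more on top, so these are lower bounds):

```python
# Rough VRAM math for a 34B-parameter model, weights only.
params = 34e9
for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name}: ~{gib:.0f} GiB")
```

At 4-bit the weights come to roughly 16 GiB, which is why a 32 GB RTX 5090 is plausible; at fp16 (~63 GiB) a single consumer card is out of the question.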
If you'd like to access this model, you can explore the following possibilities: