Tencent released HunyuanImage 3.0 at the end of September 2025, calling it the largest open-source text-to-image model to date. It has 80 billion total parameters, with 13 billion activated per inference step, and Tencent claims it's on par with the top closed-source models.
The model is built on Tencent's in-house multimodal LLM and post-trained specifically for image generation, which gives it some solid capabilities.
It can reason over general world knowledge, follow very long prompts, and render accurate text inside images.
Instead of the older DiT approach, it uses a Mixture-of-Experts (MoE) architecture combined with Transfusion, a technique that unifies diffusion and LLM training in a single framework. It sits on top of Hunyuan-A13B and was trained on a huge corpus: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text.
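That "80B total, 13B active" split is what MoE routing buys you: a gate picks a few experts per token, so only a fraction of the weights run on any given forward pass. Here's a minimal toy sketch of top-k expert routing in NumPy; the shapes, gate, and expert count are made up for illustration and are nothing like HunyuanImage 3.0's actual internals.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts, gate_w, top_k=2):
    """Route a token to its top-k experts and mix their outputs.

    x: (d,) token vector; experts: list of (d, d) weight matrices;
    gate_w: (n_experts, d) gating matrix. Toy illustration only.
    """
    logits = gate_w @ x                  # one gating score per expert
    top = np.argsort(logits)[-top_k:]    # indices of the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()             # softmax over the chosen experts
    # Only the selected experts actually run: these are the
    # "active" parameters for this token.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

d, n_experts = 8, 4
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
gate_w = rng.standard_normal((n_experts, d))
x = rng.standard_normal(d)
y = moe_forward(x, experts, gate_w, top_k=2)
print(y.shape)  # (8,)
```

With top_k=2 of 4 experts, only half the expert weights touch each token, which is the same idea behind activating 13B of 80B parameters at scale.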
That mix lets the model handle many tasks in one system: dense in-image text, multi-panel comics, emojis, and educational illustrations, reportedly in far less time than earlier versions.
Running the model locally requires at least 80GB of VRAM.
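Before downloading an 80B-parameter checkpoint, it's worth checking whether your GPU actually clears that bar. A quick sketch, with the PyTorch query shown as an assumption (commented out) so the helper itself stays dependency-free:

```python
def meets_vram_requirement(total_bytes, required_gb=80):
    """Return True if the reported VRAM covers the model's stated 80GB need."""
    return total_bytes >= required_gb * 1024**3

# Assumed query path, if you have PyTorch with CUDA installed:
# import torch
# total = torch.cuda.get_device_properties(0).total_memory
# print(meets_vram_requirement(total))

print(meets_vram_requirement(48 * 1024**3))  # False: a 48GB card falls short
print(meets_vram_requirement(80 * 1024**3))  # True: an 80GB card clears the bar
```

Note this checks a single device; whether one 80GB card suffices or the weights must be sharded across several GPUs depends on the precision and loading strategy you use.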
If you'd like to try the model, here are a few ways to access it: