DreamOmni 2 edits and generates images from a mix of text and image instructions, so you can tweak an existing photo or start from scratch using words and example pictures together. It handles concrete requests like “swap the lantern with the dog” as well as more abstract ones like “match the lighting”, “use this hairstyle”, or “make it look like an oil painting”.
You can give it several reference images at once and point to them as “image 1” or “image 2”; an index-based scheme keeps the references distinct so they don’t blur together or get mixed up. A vision-language model is also used to help it make sense of long or detailed user prompts.
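As a rough illustration of that “image 1” / “image 2” convention, here is a small, hypothetical helper that checks a prompt’s image references before handing it to the model. It is not part of DreamOmni 2’s code, just a sketch of the idea.

```python
import re

def build_instruction(template: str, num_images: int) -> str:
    """Check that every "image N" mentioned in the prompt actually exists."""
    # Collect all indices the prompt refers to, e.g. "image 2" -> 2.
    cited = {int(n) for n in re.findall(r"image\s+(\d+)", template, re.IGNORECASE)}
    missing = sorted(n for n in cited if n < 1 or n > num_images)
    if missing:
        raise ValueError(
            f"Prompt cites image(s) {missing}, but only {num_images} were provided"
        )
    return template

# Example: three inputs, with the prompt pointing at each one by index.
prompt = build_instruction(
    "Give the person in image 1 the hairstyle from image 2 and the lighting of image 3.",
    num_images=3,
)
```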
The model comes from the dvlab-research team, including Bin Xia and Bohao Peng, working with The Chinese University of Hong Kong and ByteDance. The code and model files are on GitHub under the Apache-2.0 license.
Everything is already released: the model, weights, scripts, and a demo page with examples such as swapping objects, changing lighting, transferring art styles or fonts, copying poses or facial expressions, and more.
It uses both words and pictures: you write what you want and point to example images.
It both edits existing images and generates new ones, with the same setup handling both.
You can feed it more than one reference image, up to five if needed.
It can change textures, patterns, hairstyles, poses, and styles, not just swap objects.
It tags each input image with an index so elements are copied from the right reference, keeping the results clean.
It is trained with a vision-language model, which helps it understand longer, more detailed instructions.
One model handles both editing and generation; which mode applies depends on whether you keep parts of the input image.
It makes square images, 1024×1024 pixels by default.
When editing, it leaves the rest of the image untouched and only changes what you ask for (see the usage sketch below).
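The repository ships its own inference scripts, but roughly speaking a call could look like the sketch below. The pipeline object, argument names, and file paths are placeholders rather than DreamOmni 2’s actual API; the defaults reflect the 1024×1024 square output mentioned above.

```python
from PIL import Image

def run_edit(pipeline, source_path, reference_paths, instruction,
             height=1024, width=1024):
    """Edit the source image following the instruction, using extra references.

    Regions not mentioned in the instruction are expected to stay untouched;
    the output defaults to a 1024x1024 square image.
    """
    # Image 1 is the source; images 2, 3, ... are the references, in order.
    images = [Image.open(source_path)] + [Image.open(p) for p in reference_paths]
    return pipeline(images=images, prompt=instruction, height=height, width=width)

# Example usage (the pipe object and file names are assumptions):
# result = run_edit(
#     pipe,
#     "room.png",
#     ["lamp_reference.png"],
#     "Replace the lantern in image 1 with the lamp from image 2.",
# )
```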
Benchmarks in the paper show it outperforms other tools at preserving identity and pose, and at handling tricky changes like lighting or material. The authors have also released a test set with 205 editing tasks and 114 generation tasks.
If you'd like to try the model yourself, here are the options: