This is an AI video generation comparison for an image-to-video prompt:
A couple sits at a small white iron table outside café. They hold hands and look at each other. The shot stays steady at 24fps with a light film-like grain. It starts focused on the couple, then shifts to the back. A man with a suitcase walks into view. His face shows shock. That changes the mood fast. The street has striped awnings, café chairs, and a busy but quiet flow of people. You hear street sounds, some footsteps, and soft clinks of dishes. No traffic noise or music. While all this hap...
Tested: October 4, 2025
The JSON prompt worked better than in Veo 3. There's rack focus, natural movement, and contextual understanding.
Tested: October 4, 2025
I've tried several variations of this prompt and image with Veo 3. It doesn't seem to like doing rack focus for this scene.
Tested: October 4, 2025
Hailuo seems to benefit from more context, so I slightly expanded the prompt, and it yielded better results than the original. Now the man stops and looks instead of walking past the table as before.
Tested: October 4, 2025
Vidu doesn't have sound but has good prompt following.
Tested: October 4, 2025
Used the same context-rich prompt and it's good.
Tested: October 5, 2025
Awkward, but not in the intended way. The husband just keeps walking, unbothered. The couple moves in to kiss but stops midway for no apparent reason (likely censorship?) in both generations I've run for this prompt.
Tested: October 8, 2025
That's funny.)) Multi-character dialogue still seems very raw.
Tested: October 22, 2025
Multi-character dialogue + lip sync + prompt following is still very challenging. Used a single image as reference.
Tested: October 24, 2025
JSON prompt worked beautifully, and multi-character dialogue is flawless. Nice cinematic camera motion.
Does the couple share an intimate gaze toward each other?
Does the rack focus smoothly shift from the couple to the background pedestrian?
Is the man in the background holding a suitcase clearly visible after focus shift?
Does his expression register as shocked/unsettled when revealed?
Does the audio include gentle café ambience (murmurs, cutlery, footsteps) without loud traffic or music?
Are the dialogue lines delivered clearly without subtitles or text overlays?
Check out the results from Wan (Online Platform) (Wan2.5 Preview) vs Google Gemini App (Veo 3 Fast) vs Freepik (Hailuo 02) vs Vidu AI (Vidu Q2 Cinematic) vs Freepik (Kling 2.5 Turbo) vs PixVerse (PixVerse V5) vs GROK (Grok Imagine v0.9) vs Vidu AI (Vidu Q2 Reference-to-Video) vs LTX Studio (LTX-2) for similar or identical prompts side-by-side.