Ovi is an open-source model built for short, human-focused clips, and it generates 5-second videos at 24 FPS quickly. Its twin-backbone design runs two matched Diffusion Transformer models side by side: one for video, one for audio. The two branches exchange timing and semantic information as they go, which is why things like lip movement and speech line up well. Temporal alignment is handled by a scaled RoPE scheme that keeps audio tokens in sync with video frames.
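To make the timing idea concrete, here's a minimal sketch (not Ovi's actual implementation, and the token counts are illustrative): if the video branch has N_video latent frames and the audio branch has N_audio latent tokens covering the same clip, the audio positions can be rescaled by N_video / N_audio before computing rotary angles, so both streams index the same temporal axis.

```python
import torch

def scaled_rope_positions(n_video_tokens: int, n_audio_tokens: int) -> torch.Tensor:
    """Map audio token indices onto the video frame timeline.

    Illustrative only: both modalities span the same clip duration,
    so audio positions are rescaled by n_video / n_audio.
    """
    audio_idx = torch.arange(n_audio_tokens, dtype=torch.float32)
    return audio_idx * (n_video_tokens / n_audio_tokens)

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE: one rotation angle per (position, frequency) pair."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions, freqs)  # shape: (n_tokens, dim // 2)

# Illustrative numbers only: 30 video latent frames vs. 150 audio latent tokens
# for the same 5-second clip share one temporal axis after rescaling.
video_angles = rope_angles(torch.arange(30, dtype=torch.float32), dim=64)
audio_angles = rope_angles(scaled_rope_positions(30, 150), dim=64)
```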
The output is a 5-second, 720×720 clip at 24 fps in which voices, background sound, and visuals lock together naturally. Generation runs from a text prompt alone, or you can add an image to guide the visual style.
Audio tops out at 16 kHz, which can make it sound a bit flat at times. The team is already looking at fixes, such as splitting generation into chunks or releasing smaller, faster variants later on.
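For a sense of scale, a quick back-of-the-envelope calculation of what one clip contains under the specs above (5 s, 24 fps, 720×720 video; 16 kHz audio):

```python
# Rough size of one Ovi clip, using the published specs.
duration_s = 5
fps = 24
height = width = 720
audio_sr = 16_000  # Hz

n_frames = duration_s * fps                   # 120 video frames
pixels_per_clip = n_frames * height * width   # 62,208,000 pixels total
n_audio_samples = duration_s * audio_sr       # 80,000 audio samples

print(n_frames, pixels_per_clip, n_audio_samples)  # 120 62208000 80000
```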
Sound tag usage (see the example after this list):
<S> ... <E> → Speech (converted into spoken audio)
<AUDCAP> ... <ENDAUDCAP> → Background audio / effects
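Here's an illustration of how those tags compose into a prompt. Only the tag syntax comes from Ovi's format; the scene description itself is made up:

```python
# Illustrative prompt using Ovi's sound tags; the scene text is invented.
prompt = (
    "A street performer looks into the camera and says "
    "<S>Welcome to the show, everyone!<E> "
    "<AUDCAP>Light crowd chatter and distant traffic.<ENDAUDCAP>"
)
```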
In human evaluations, people preferred Ovi's results over other open models such as JavisDiT and UniVerse-1, with the sync rated as more natural and smoother overall. The trade-off: raw video quality takes a slight hit compared to models like Wan 2.2.
The team behind it includes Chetwin Low and Weimin Wang (who led the work) from Character AI, and Calder Katyal from Yale. They've published a full project site with demos, the paper, and resources so others can experiment with it, all open under the Apache-2.0 license.
If you'd like to access this model, you can explore the following possibilities: