Ovi is an open-source model built for short, human-focused clips, and it generates 5-second videos at 24 FPS quickly. Its twin-backbone design runs two matched Diffusion Transformer models side by side: one for video, one for audio. The two branches exchange timing and semantic information as they go, which is why things like lip movement and speech line up well. Temporal alignment is handled by a scaled RoPE scheme that keeps audio tokens in sync with video frames.
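To make the timing idea concrete, here's a minimal sketch (not Ovi's actual implementation, and the token counts are illustrative): if the video branch has N_video latent frames and the audio branch has N_audio latent tokens covering the same clip, the audio positions can be rescaled by N_video / N_audio before computing rotary angles, so both streams index the same temporal axis.

```python
import torch

def scaled_rope_positions(n_video_tokens: int, n_audio_tokens: int) -> torch.Tensor:
    """Map audio token indices onto the video frame timeline.

    Illustrative only: both modalities span the same clip duration,
    so audio positions are rescaled by n_video / n_audio.
    """
    audio_idx = torch.arange(n_audio_tokens, dtype=torch.float32)
    return audio_idx * (n_video_tokens / n_audio_tokens)

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE: one rotation angle per (position, frequency) pair."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions, freqs)  # shape: (n_tokens, dim // 2)

# Illustrative numbers only: 30 video latent frames vs. 150 audio latent tokens
# for the same 5-second clip share one temporal axis after rescaling.
video_angles = rope_angles(torch.arange(30, dtype=torch.float32), dim=64)
audio_angles = rope_angles(scaled_rope_positions(30, 150), dim=64)
```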
The output is a 5-second, 720×720 clip at 24 fps in which voices, background sound, and visuals lock together naturally. Generation runs from a text prompt alone, or you can add an image to guide the visual style.
Audio tops out at 16 kHz, which can make it sound a bit flat at times. The team is already looking at fixes, such as splitting generation into chunks or releasing smaller, faster variants later on.
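For a sense of scale, a quick back-of-the-envelope calculation of what one clip contains under the specs above (5 s, 24 fps, 720×720 video; 16 kHz audio):

```python
# Rough size of one Ovi clip, using the published specs.
duration_s = 5
fps = 24
height = width = 720
audio_sr = 16_000  # Hz

n_frames = duration_s * fps                   # 120 video frames
pixels_per_clip = n_frames * height * width   # 62,208,000 pixels total
n_audio_samples = duration_s * audio_sr       # 80,000 audio samples

print(n_frames, pixels_per_clip, n_audio_samples)  # 120 62208000 80000
```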
Sound tag usage (see the example after this list):
<S> ... <E> → Speech (converted into spoken audio)
<AUDCAP> ... <ENDAUDCAP> → Background audio / effects
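Here's an illustration of how those tags compose into a prompt. Only the tag syntax comes from Ovi's format; the scene description itself is made up:

```python
# Illustrative prompt using Ovi's sound tags; the scene text is invented.
prompt = (
    "A street performer looks into the camera and says "
    "<S>Welcome to the show, everyone!<E> "
    "<AUDCAP>Light crowd chatter and distant traffic.<ENDAUDCAP>"
)
```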
In human evaluations, people preferred Ovi's results over other open models such as JavisDiT and UniVerse-1, with the sync rated as more natural and smoother overall. The trade-off: raw video quality takes a slight hit compared to models like Wan 2.2.
The team behind it includes Chetwin Low and Weimin Wang (who led the work) from Character AI, and Calder Katyal from Yale. They've published a full project site with demos, the paper, and resources so others can experiment with it, all open under the Apache-2.0 license.
If you'd like to access this model, you can explore the following possibilities: