VibeVoice by Microsoft is an open-source text-to-speech (TTS) model that turns your text into podcast-length, multi-speaker audio.

You can generate podcasts up to 45 minutes long with up to 4 different voices that sound like they’re actually talking to each other. If you’ve got a short audio recording, just 5 to 20 seconds long, it can use that to copy a voice.
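If you want to check a reference clip before using it, here is a minimal sketch. It only covers the 5 to 20 second range mentioned above. The function names and the WAV-only check are illustrative assumptions and are not part of VibeVoice itself.

```python
import wave

# Range taken from the description above: 5 to 20 seconds of reference audio.
MIN_REF_SECONDS = 5.0
MAX_REF_SECONDS = 20.0


def wav_duration_seconds(path):
    """Return the length of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(duration_seconds):
    """Hypothetical helper: True if the clip length is inside the stated range."""
    return MIN_REF_SECONDS <= duration_seconds <= MAX_REF_SECONDS
```

For a clip on disk you could call `is_usable_reference(wav_duration_seconds("my_voice.wav"))`, where `my_voice.wav` is a placeholder file name.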

It supports English and Mandarin and stays efficient by compressing speech into tokens at a very low rate of 7.5 Hz.
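To get a feel for what a 7.5 Hz rate means for long audio, here is a rough back-of-the-envelope calculation. It assumes that 7.5 Hz means 7.5 speech frames per second of audio, which is an interpretation and not something this page spells out. The function name is made up for the example.

```python
def frames_for_duration(minutes, frame_rate_hz=7.5):
    """Number of speech frames needed for `minutes` of audio.

    Assumption: one frame is produced every 1 / frame_rate_hz seconds.
    """
    seconds = minutes * 60
    return round(seconds * frame_rate_hz)


# A 45-minute podcast and a 90-minute podcast at 7.5 frames per second.
print(frames_for_duration(45))  # 20250
print(frames_for_duration(90))  # 40500
```

Under that assumption, even a 90-minute session stays in the tens of thousands of frames, which is why a low rate helps with very long audio.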

The speech it makes isn’t flat. It adjusts how it sounds based on the mood of the text, so if a line is angry, sad or cheerful, it picks that up and changes how it talks. You don’t have to do anything special. Just write the script and it figures it out.

VibeVoice officially works in English and Mandarin, but people have used it with other languages like Spanish, German, Japanese and Korean, and the results hold up. If your reference audio has background music, it even tries to copy that too.

There are a few versions to pick from.

The Large 7B model sounds better but only goes up to 45 minutes. The 1.5B model is smaller and faster and can go up to 90 minutes.
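The trade-off between the two versions can be written down as a small rule of thumb. This is only a sketch based on the limits quoted above. The function name and the rule "prefer the better-sounding model whenever it fits" are assumptions, not official guidance.

```python
# Maximum length per version, as quoted in the text above.
MAX_MINUTES = {
    "VibeVoice-Large": 45,  # 7B, better quality
    "VibeVoice-1.5B": 90,   # smaller and faster
}


def pick_variant(target_minutes):
    """Hypothetical helper: choose a version for a target audio length."""
    if target_minutes <= 0:
        raise ValueError("target length must be positive")
    if target_minutes <= MAX_MINUTES["VibeVoice-Large"]:
        return "VibeVoice-Large"
    if target_minutes <= MAX_MINUTES["VibeVoice-1.5B"]:
        return "VibeVoice-1.5B"
    raise ValueError("longer than either version supports in one go")
```

So a 30-minute episode would go to the Large model, while a 60-minute one would need the 1.5B model.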

VibeVoice‑Large was initially released under the MIT Licence, but Microsoft took it down a couple of days later. Still, that version remains available for download from other sources.

You’ll need around 17 to 24 GB of GPU memory (VRAM) to run VibeVoice‑Large.
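As a rough sanity check on that figure, you can estimate how much memory the weights alone would take. This sketch assumes 7 billion parameters stored as 16-bit values (2 bytes each), which this page does not confirm. Everything beyond the weights, such as activations and caches for long audio, comes on top.

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored with the same precision.
    Billions of parameters times bytes per parameter gives GB directly.
    """
    return float(params_billions * bytes_per_param)


print(weight_memory_gb(7))     # 14.0 (16-bit weights)
print(weight_memory_gb(7, 4))  # 28.0 (32-bit weights)
```

Under these assumptions the weights come to about 14 GB, which fits with the 17 to 24 GB range once working memory is added.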

In listening tests, people preferred VibeVoice‑Large’s output over other top voice tools like ElevenLabs v3 and Gemini 2.5 Pro TTS.

Where To Find VibeVoice‑Large

If you'd like to access this model, you can explore the following possibilities:

Weights Spaces GitHub Licence

Hugging Face

Other Models by Microsoft

VibeVoice-1.5B
