VibeVoice by Microsoft is an open-source TTS model that turns your text into podcast-length multi-speaker audio with ease.
You can generate podcasts up to 90 minutes long, depending on the model version, with as many as 4 different voices that sound like they’re actually talking to each other. If you’ve got a short audio recording, just 5 to 20 seconds long, it can use that to copy a voice.
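Since the reference clip is meant to be short, it can help to check its length before using it. The snippet below is a minimal sketch, not part of VibeVoice itself: it assumes the clip is an uncompressed WAV file, and the helper names (clip_duration_seconds, is_usable_reference) are made up for illustration. The 5 and 20 second bounds are simply the figures quoted above.

```python
import wave

# Bounds taken from the text above: reference clips of roughly 5 to 20 seconds.
MIN_REF_SECONDS = 5.0
MAX_REF_SECONDS = 20.0


def clip_duration_seconds(path: str) -> float:
    """Return the length of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(seconds: float) -> bool:
    """Hypothetical helper: True if the clip length is within the 5-20 s range."""
    return MIN_REF_SECONDS <= seconds <= MAX_REF_SECONDS
```

For example, is_usable_reference(clip_duration_seconds("my_voice.wav")) tells you whether a recording (here a placeholder file name) is in the suggested range before you hand it to the model.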
It supports English and Mandarin and stays efficient by compressing speech into tokens at a very low frame rate of 7.5 Hz.
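To get a feel for why such a low frame rate matters, here is a small back-of-the-envelope sketch. It only multiplies duration by the 7.5 Hz rate quoted above, using the 45 and 90 minute limits mentioned in this article. The function name speech_frames is made up for illustration, and the count covers speech frames only, not the text of the script.

```python
FRAME_RATE_HZ = 7.5  # frame rate quoted in the text above


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of speech frames needed to cover `minutes` of audio."""
    seconds = minutes * 60
    return round(seconds * frame_rate_hz)


for minutes in (45, 90):
    # 45 min -> 20,250 frames; 90 min -> 40,500 frames
    print(f"{minutes} min -> {speech_frames(minutes):,} frames")
```

Even a 90-minute episode comes to only about forty thousand frames, which is what makes very long generations manageable.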
The speech it makes isn’t flat. It adjusts how it sounds based on the mood of the text. So if a line is angry, sad, or cheerful, it picks that up and changes its delivery. You don’t have to do anything special. Just write the script and it figures it out.
Although only English and Mandarin are officially supported, people have used it with other languages like Spanish, German, Japanese, and Korean, and the results hold up. If your reference audio has background music, it even tries to copy that too.
There are a few versions to pick from.
The large 7B model sounds better but only goes up to 45 minutes. The 1.5B model is smaller and faster and can go up to 90 minutes.
You’ll need around 17 to 24 GB of GPU memory to run VibeVoice-Large.
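The trade-off between the two versions can be summed up as a simple rule of thumb. The sketch below is only an illustration built from the numbers in this article: pick_variant is a made-up helper, the returned strings are just labels, and it uses the upper end of the 17 to 24 GB range as a cautious cutoff for the large model. The memory needs of the 1.5B model aren’t stated here, so the function doesn’t check them.

```python
from typing import Optional

LARGE_MAX_MINUTES = 45   # length limit quoted for the 7B model
SMALL_MAX_MINUTES = 90   # length limit quoted for the 1.5B model
LARGE_MEMORY_GB = 24     # cautious cutoff: upper end of the 17-24 GB range


def pick_variant(minutes: float, memory_gb: float) -> Optional[str]:
    """Hypothetical rule of thumb for choosing a version."""
    if minutes <= LARGE_MAX_MINUTES and memory_gb >= LARGE_MEMORY_GB:
        return "VibeVoice-Large"   # better quality, shorter limit
    if minutes <= SMALL_MAX_MINUTES:
        return "VibeVoice-1.5B"    # memory needs not covered here
    return None                    # longer than either version supports
```

So a 30-minute show on a 24 GB card would go to the large model, while a 60-minute show falls back to the 1.5B model regardless of how much memory you have.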
In listening tests, people preferred VibeVoice-Large’s output over other top voice tools like ElevenLabs v3 and Gemini 2.5 Pro TTS.
If you'd like to access this model, you can explore the following possibilities: