VibeVoice by Microsoft is an open-source TTS model that turns your text into podcast-length multi-speaker audio with ease.
You can include up to 4 speakers in one script, which makes it a good fit for podcasts or stories with dialogue. If you’ve got a short audio recording, just 5 to 20 seconds long, it can use that to clone a voice.
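As a rough illustration of those two limits, here is a minimal Python sketch that checks a script and a reference clip before you hand them to the model. The "Speaker N: text" line format and all function names here are assumptions made for this example, not part of VibeVoice itself. Only the 4-speaker cap and the 5 to 20 second range come from the text above.

```python
import re
import wave

MAX_SPEAKERS = 4             # speaker limit mentioned above
CLIP_SECONDS = (5.0, 20.0)   # reference clip range mentioned above

# Assumed script format for this sketch: "Speaker 1: Hello there"
LINE_RE = re.compile(r"^\s*(Speaker \d+)\s*:\s*(.+)$")


def parse_script(text):
    """Split a script into (speaker, line) pairs and enforce the speaker cap."""
    turns = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        match = LINE_RE.match(raw)
        if match is None:
            raise ValueError(f"unrecognised line: {raw!r}")
        turns.append((match.group(1), match.group(2).strip()))
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers found, limit is {MAX_SPEAKERS}")
    return turns


def clip_duration_ok(seconds):
    """True if a reference clip length falls inside the 5 to 20 second range."""
    low, high = CLIP_SECONDS
    return low <= seconds <= high


def wav_duration(path):
    """Length of an uncompressed WAV file in seconds (standard library only)."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()
```

You would call parse_script on your dialogue text and clip_duration_ok(wav_duration("voice.wav")) on each reference recording, where "voice.wav" is a placeholder path.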
The speech it makes isn’t flat. It adjusts its delivery to the mood of the text, so if a line is angry, sad or cheerful, it picks that up and changes how it talks. You don’t have to do anything special. Just write the script and it figures it out.
VibeVoice officially works in English and Mandarin, but people have used it with other languages like Spanish, German, Japanese and Korean, and the results hold up. If your reference audio has background music, it tries to copy that too.
There are a few versions to pick from.
The 1.5B model is smaller and faster and can generate up to 90 minutes of audio. The larger 7B model sounds better but only goes up to 45 minutes.
You’ll need around 7 GB of VRAM for the 1.5B model.
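To make the trade-off concrete, here is a small Python sketch that picks a model size from the target length. The figures are the ones quoted above. The pick_model helper and the MODELS table are hypothetical names for this example, and the VRAM need for the 7B model is left as None because it isn't stated here.

```python
# Figures taken from the text above; vram_gb is None where no number is given.
MODELS = {
    "1.5B": {"max_minutes": 90, "vram_gb": 7},
    "7B": {"max_minutes": 45, "vram_gb": None},
}


def pick_model(minutes, prefer_quality=True):
    """Return the model size that can cover `minutes` of audio.

    Hypothetical helper: prefers the 7B model when the target length fits
    and quality is the priority, otherwise falls back to the 1.5B model.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    if prefer_quality and minutes <= MODELS["7B"]["max_minutes"]:
        return "7B"
    if minutes <= MODELS["1.5B"]["max_minutes"]:
        return "1.5B"
    raise ValueError("longer than either model's stated limit; split the script")


print(pick_model(30))                        # 7B
print(pick_model(60))                        # 1.5B
print(pick_model(30, prefer_quality=False))  # 1.5B
```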
If you'd like to access this model, you can explore the following possibilities: