
VibeVoice‑Large audio model

Name: VibeVoice
Version: Large
Variant: 7B‑Preview
Also Known As: VibeVoice‑7B‑Preview, VibeVoice 7B
Creator: Microsoft

VibeVoice by Microsoft is an open-source text-to-speech (TTS) model that turns a written script into podcast-length, multi-speaker audio.

You can generate podcasts up to 45 minutes long with up to 4 different voices that sound like they’re actually talking to each other. If you’ve got a short audio recording, just 5 to 20 seconds long, it can use that to copy a voice.
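The limits above (a 5 to 20 second reference clip, at most 4 voices) are easy to check before you start a long generation. Here is a minimal sketch using only the Python standard library. The "Name: text" script layout and all function names are assumptions made for illustration, not VibeVoice's own API, and the clip check only works on WAV files.

```python
import io
import wave

# Limits taken from the description above.
MIN_REF_SECONDS = 5.0
MAX_REF_SECONDS = 20.0
MAX_SPEAKERS = 4


def wav_duration_seconds(wav_file) -> float:
    """Return the duration of a WAV file (path or file-like object) in seconds."""
    with wave.open(wav_file, "rb") as w:
        return w.getnframes() / w.getframerate()


def reference_clip_ok(wav_file) -> bool:
    """True if the clip length falls inside the 5 to 20 second range."""
    return MIN_REF_SECONDS <= wav_duration_seconds(wav_file) <= MAX_REF_SECONDS


def count_speakers(script: str) -> int:
    """Count distinct labels in lines shaped like 'Name: text' (hypothetical format)."""
    speakers = set()
    for line in script.splitlines():
        label, sep, text = line.partition(":")
        if sep and label.strip() and text.strip():
            speakers.add(label.strip())
    return len(speakers)


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory silent mono 16-bit WAV, only used to try the checks."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


script = "Alice: Welcome back to the show.\nBob: Glad to be here.\nAlice: Let's start."
print(count_speakers(script) <= MAX_SPEAKERS)
print(reference_clip_ok(make_silent_wav(10)))
```

The speaker count splits on the first colon of each line, so any line that is not dialogue but happens to contain a colon would be miscounted. Adapt the parsing to whatever script format your setup actually expects.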

It supports English and Mandarin, and it stays efficient by using continuous speech tokenizers that run at a very low frame rate of 7.5 Hz.
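To get a feel for what 7.5 Hz means, here is a rough back-of-the-envelope calculation. It assumes that 7.5 Hz means 7.5 audio frames per second of generated speech, and the helper name is made up for this sketch. It ignores text tokens and any other overhead, so treat the numbers as an estimate only.

```python
FRAME_RATE_HZ = 7.5  # frame rate quoted in the description above


def frames_for_minutes(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Estimate the number of audio frames for a given length of speech."""
    return round(minutes * 60 * frame_rate_hz)


print(frames_for_minutes(1))   # 450
print(frames_for_minutes(45))  # 20250
```

Under that assumption, a full 45-minute podcast comes out to roughly 20,000 audio frames, which helps explain how such long outputs stay manageable.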

The speech it makes isn’t flat. It adjusts how it sounds based on the mood of the text, so if a line is angry, sad, or cheerful, it picks that up and changes how it talks. You don’t have to do anything special. Just write the script and it figures it out.

VibeVoice officially works in English and Mandarin, but people have used it with other languages such as Spanish, German, Japanese, and Korean, and the results hold up. If your reference audio has background music, it even tries to copy that too.

There are a few versions to pick from.

This large 7B model sounds better, but it only goes up to 45 minutes. The 1.5B model is smaller and faster and can go up to 90 minutes.

You’ll need around 17 to 24 GB of GPU memory (VRAM) to run VibeVoice‑Large.

In testing, people preferred VibeVoice‑Large’s output over other top voice tools such as ElevenLabs v3 and Gemini 2.5 Pro TTS.

VibeVoice‑Large Examples

Generated on September 3, 2025 on the official demo page (https://0bcd5baf6c08e24956.gradio.live) with 1 speaker (Frank) and the default CFG of 1.3.

Where To Find VibeVoice‑Large

If you'd like to access this model, you can explore the following possibilities:
