VibeVoice-1.5B audio model

Name: VibeVoice
Variant: 1.5B
Also Known As: VibeVoice 1.5B, VibeVoice Fast, VibeVoice Small
Creator: Microsoft

VibeVoice by Microsoft is an open-source TTS model that turns your text into podcast-length multi-speaker audio with ease.

You can include up to 4 speakers in one script, which makes it a good fit for podcasts or stories with dialogue. Given a short reference recording, just 5 to 20 seconds long, it can clone that voice.
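As a rough illustration of those two limits, the sketch below formats a list of dialogue turns into a script and checks a reference clip's length before you use it. It relies only on the Python standard library and does not call VibeVoice itself. The "Speaker N:" labelling and all function names here are assumptions made for the example, not a confirmed input format or API.

```python
import wave

MAX_SPEAKERS = 4            # speaker limit stated above
CLIP_RANGE_S = (5.0, 20.0)  # reference clip length stated above, in seconds


def build_script(turns):
    """Format (speaker_name, text) pairs as 'Speaker N: text' lines.

    The 'Speaker N:' labelling is an assumed convention for this sketch.
    Speakers are numbered in order of first appearance.
    """
    numbers = {}  # speaker name -> 1-based speaker number
    lines = []
    for name, text in turns:
        if name not in numbers:
            if len(numbers) == MAX_SPEAKERS:
                raise ValueError(f"more than {MAX_SPEAKERS} speakers in one script")
            numbers[name] = len(numbers) + 1
        lines.append(f"Speaker {numbers[name]}: {text.strip()}")
    return "\n".join(lines)


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clip_length_ok(seconds):
    """True if a reference clip falls inside the 5 to 20 second range."""
    low, high = CLIP_RANGE_S
    return low <= seconds <= high
```

For example, build_script([("Ana", "Welcome back."), ("Ben", "Glad to be here.")]) returns two lines labelled Speaker 1 and Speaker 2, and clip_length_ok(wav_duration_seconds("reference.wav")) tells you whether a hypothetical reference.wav is a suitable length for voice cloning.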

The speech it produces isn’t flat. It adjusts its delivery to the mood of the text, so if a line is angry, sad, or cheerful, it picks that up and changes how it speaks. You don’t have to do anything special. Just write the script and it figures it out.

VibeVoice officially supports English and Mandarin, but people have used it with other languages such as Spanish, German, Japanese, and Korean, and report that the results hold up. If your reference audio has background music, it even tries to copy that too.

There are a few versions to pick from.

The 1.5B model is smaller and faster, and can generate up to 90 minutes of audio. The larger 7B model sounds better, but it only goes up to 45 minutes.

You’ll need around 7 GB of VRAM for the 1.5B model.

Model Performance Editor’s Rating

No editor performance evaluations available for this model yet.

VibeVoice-1.5B Examples

Tested at https://huggingface.co/spaces/akhaliq/VibeVoice-1.5B with the default voice and CFG 1.3. Generated on September 3, 2025.