Fish Audio S2 Pro is a text-to-speech model released in mid-March 2026. It generates spoken audio from written prompts and lets the user guide tone, emotion, and delivery from inside the text.

The model has about 5B parameters. It is released under the Fish Audio Research License: free for research and non-commercial use, with a paid license required for commercial use. In practice this works like a freemium setup.

The model supports more than 80 languages and was trained on more than 10 million hours of audio. It targets low-latency speech generation with streaming output.

Fish Audio built S2 Pro as a speech system that gives fine-grained control over how a voice sounds. Instead of picking from a small list of emotions, you type instructions for tone, pacing, or mood directly into the prompt, so the prompt acts a bit like a script for a voice actor.

Example prompt: [whisper nervously] I don't think this is a good idea...

The model reads tags like [whisper], [laughing], or [angry tone] and changes how the words sound. You can even adjust tone word by word, which gives very precise voice control.
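To make the tag idea concrete, here is a minimal Python sketch that splits a prompt into (tag, text) segments. It only illustrates the bracket syntax shown in the examples above. The helper name split_tagged_prompt, the regular expression, and the rule that a tag applies to the text that follows it until the next tag are all assumptions for illustration, not Fish Audio's actual parsing logic.

```python
import re

# Hypothetical helper: the bracket syntax follows the examples in the text,
# but this parsing logic is an assumption, not Fish Audio's implementation.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tagged_prompt(prompt):
    """Return a list of (tag, text) pairs; tag is None for untagged text.

    Assumption: a tag applies to the text after it, up to the next tag.
    """
    segments = []
    current_tag = None
    position = 0
    for match in TAG_PATTERN.finditer(prompt):
        # Text between the previous tag and this one belongs to the previous tag.
        text = prompt[position:match.start()].strip()
        if text:
            segments.append((current_tag, text))
        current_tag = match.group(1).strip()
        position = match.end()
    # Whatever follows the last tag belongs to that tag.
    tail = prompt[position:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments


if __name__ == "__main__":
    for tag, text in split_tagged_prompt(
        "Hello [laughing] that was funny [angry tone] stop it"
    ):
        print(tag, "->", text)
```

A splitter like this can be handy for checking a script before sending it to the model, for example to spot a tag with no text after it.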

Under the hood, the system uses a dual autoregressive setup. A larger model predicts semantic codes that capture what is being said, and a smaller model reconstructs the acoustic detail. This split design helps keep audio quality high while still running fast.
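The control flow of such a two-model split can be sketched in a few lines of Python. This is a toy under stated assumptions: semantic_step and acoustic_step are hypothetical stand-ins that return deterministic numbers rather than real model outputs, and the layout (one semantic code per frame, followed by a fixed number of detail codes) is one plausible reading of the description, not the confirmed S2 Pro design.

```python
CODE_RANGE = 1024  # Assumed vocabulary size per code; purely illustrative.


def semantic_step(history):
    """Hypothetical stand-in for the larger model.

    Picks the next semantic code from the frames generated so far.
    """
    return (len(history) * 7 + 3) % CODE_RANGE


def acoustic_step(semantic_code, previous_codes):
    """Hypothetical stand-in for the smaller model.

    Picks the next detail code within a frame, conditioned on the semantic
    code and on the detail codes already produced for that frame.
    """
    return (semantic_code + 31 * (len(previous_codes) + 1)) % CODE_RANGE


def generate_frames(num_frames, detail_codes_per_frame):
    """Outer loop runs over time (frames); inner loop fills in detail codes."""
    frames = []
    for _ in range(num_frames):
        semantic = semantic_step(frames)
        details = []
        for _ in range(detail_codes_per_frame):
            details.append(acoustic_step(semantic, details))
        frames.append([semantic] + details)
    return frames


if __name__ == "__main__":
    for frame in generate_frames(num_frames=3, detail_codes_per_frame=9):
        print(frame)
```

The point of the sketch is the loop shape: the expensive model runs once per frame, while the cheap model runs several times inside each frame.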

Another goal was fast streaming speech. With optimized inference through SGLang, the model can start producing audio in about 100 ms and then keeps generating speech in real time.
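Time to first audio is simply the delay between requesting speech and receiving the first audio chunk. The Python sketch below measures it for any iterator of chunks. The fake_speech_stream generator is a made-up stand-in that sleeps and yields dummy bytes. It does not call the real model or SGLang, so the numbers it prints say nothing about S2 Pro itself.

```python
import time


def time_to_first_chunk(stream):
    """Measure the seconds elapsed until an iterator yields its first chunk."""
    start = time.perf_counter()
    first_chunk = next(stream)
    latency = time.perf_counter() - start
    return latency, first_chunk


def fake_speech_stream(num_chunks, delay_seconds):
    """Hypothetical stand-in for a streaming TTS client (not a real API)."""
    for index in range(num_chunks):
        time.sleep(delay_seconds)  # Pretend the model is generating audio.
        yield bytes([index]) * 4   # Dummy 4-byte "audio" chunk.


if __name__ == "__main__":
    stream = fake_speech_stream(num_chunks=3, delay_seconds=0.1)
    latency, chunk = time_to_first_chunk(stream)
    print(f"first chunk after {latency * 1000:.0f} ms")
    for chunk in stream:  # The remaining chunks keep arriving afterwards.
        pass
```

With a real streaming client, the same measurement would wrap whatever iterator the client returns.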

The system also handles multi-speaker dialogue and voice cloning. A short reference clip of about 10–30 seconds is enough to clone a voice. You can also generate a conversation between different speakers in a single prompt.

Main abilities:
Text to speech: turns text into spoken audio.
Voice cloning: copies a voice from a short sample.
Multi-speaker dialogue: makes conversations with several voices.
Multilingual speech: works with more than 80 languages.
Streaming generation: produces speech in real time.

Key details:
Model size: around 5B parameters.
Architecture: dual autoregressive transformer.
Audio codec: RVQ codec with 10 codebooks at a frame rate of around 21 Hz.
Streaming speed: time to first audio of about 100 ms.
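The codec figures translate directly into how many codes the model has to produce. As a rough worked example in Python, taking the frame rate as exactly 21 Hz (the text only says "around 21 Hz") and assuming every frame carries one code per codebook:

```python
CODEBOOKS = 10       # From the key details above.
FRAME_RATE_HZ = 21   # "Around 21 Hz"; treated as exact for this estimate.


def codes_for_duration(seconds, frame_rate_hz=FRAME_RATE_HZ, codebooks=CODEBOOKS):
    """Return (frames, total_codes) for a clip of the given length.

    Assumes one code per codebook in every frame.
    """
    frames = round(seconds * frame_rate_hz)
    return frames, frames * codebooks


if __name__ == "__main__":
    for seconds in (1, 10, 30):
        frames, codes = codes_for_duration(seconds)
        print(f"{seconds:>2} s of audio: {frames} frames, {codes} codes")
```

Under these assumptions, one second of audio is about 21 frames, or about 210 codes in total.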

The model weights are on Hugging Face and the code lives on GitHub under the Fish Speech project. There is also an online playground and an API for developers who want to plug the system into apps.

Running the model locally usually needs around 10–16 GB of VRAM or more, depending on how the model is compressed (quantized). The project also includes tools for fine-tuning and streaming speech generation.
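A back-of-the-envelope calculation shows how the precision of the weights affects memory use. The Python sketch below estimates weight storage only, using the parameter count from the text and standard byte widths for common numeric formats. It ignores activations, caches, and the audio codec, so real VRAM use will be higher than these figures. Which formats are actually offered for S2 Pro is not stated here.

```python
PARAMETERS = 5_000_000_000  # "Around 5B parameters" from the text.

# Standard storage sizes per parameter for common numeric formats.
BYTES_PER_PARAMETER = {
    "fp32": 4,
    "fp16/bf16": 2,
    "int8": 1,
    "int4": 0.5,
}


def weight_memory_gb(parameters, bytes_per_parameter):
    """Estimate memory for the weights alone, in decimal gigabytes."""
    return parameters * bytes_per_parameter / 1e9


if __name__ == "__main__":
    for name, width in BYTES_PER_PARAMETER.items():
        print(f"{name:>9}: {weight_memory_gb(PARAMETERS, width):.1f} GB")
```

At 16-bit precision the weights alone come to about 10 GB, which lines up with the lower end of the range quoted above.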
