Qwen3-TTS audio model

Name: Qwen
Version: 3
Variant: TTS
Also Known As: Qwen 3 TTS
Licence: Apache License 2.0
Creator: Alibaba

Qwen3-TTS is a text-to-speech model made by the Qwen team at Alibaba Cloud. It turns written text into natural-sounding speech.

It can clone voices, design new ones, and speak 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian), and it lets you adjust tone, pace, and rhythm. It supports real-time use and is released under the open Apache 2.0 license, so anyone can use or build on it.

It runs on a wide range of hardware: a Raspberry Pi with a GPU works, and so does a Mac or even a phone. To clone a voice, supply a voice clip and a transcript of what is said in it. In a few minutes, you’ve got a cloned voice.

It won’t sound exactly like the real thing... but it’s close.

Here’s how it works:

Dual-track model. One part turns text into sound tokens. The other controls timing and speaking style.
Built for speed. A 12 Hz tokenizer lets it start speaking in under 100 ms, which suits live apps.
Trained on tons of audio. Millions of hours of multi-language speech give it solid speaking skills.
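
The speed point can be put into rough numbers. This is a minimal sketch, assuming the "12Hz" in the model names means about 12 token frames per second of audio; the exact rate and the number of tokens per frame are not stated here, and the function names are only illustrative.

```python
import math

# Nominal frame rate taken from the "12Hz" in the model names (an assumption).
FRAME_RATE_HZ = 12


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Length of audio covered by one token frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of token frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    print(f"{frame_duration_ms():.1f} ms of audio per frame")
    print(frames_for(3.0), "frames for a 3-second clip")
    print(frames_for(60.0), "frames for one minute of speech")
```

Under that assumption, one frame covers roughly 83 ms of audio, so a model that can emit sound after its first frame or so would fit the "under 100 ms" figure. A low frame rate also keeps sequences short: a minute of speech is only a few hundred frames.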

Here’s what it can do:

Realistic speech. The audio sounds close to how people actually talk.
Voice cloning. Clone a voice from as little as 3 seconds of audio.
Voice creation. Build new voices by typing things like “nervous teenage male voice.”
Multi-language support. Handles 10 languages, including English and Chinese.
Live response. It’s fast enough to use in real-time chats or narration.
Control settings. You can adjust emotion, speed, pitch, and rhythm.
Performance holds up well. Tests show it produces strong, clear audio even on average hardware.
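
Each of the three ways of getting a voice needs different inputs: cloning needs a voice clip plus its transcript, voice creation needs a typed description, and a preset needs a voice name. The sketch below checks a request for the inputs its mode requires. `SpeechRequest`, its field names, and the mode labels are hypothetical and are not part of any Qwen API; they only illustrate the inputs described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeechRequest:
    """Hypothetical container for one synthesis job (illustration only)."""
    text: str
    mode: str  # "clone", "design", or "preset" (labels invented for this sketch)
    ref_audio_path: Optional[str] = None     # clone: short recording of the target voice
    ref_transcript: Optional[str] = None     # clone: what is said in that recording
    voice_description: Optional[str] = None  # design: e.g. "nervous teenage male voice"
    preset_voice: Optional[str] = None       # preset: name of a built-in voice
    style_prompt: Optional[str] = None       # optional tone or pace instructions

    def missing_fields(self) -> List[str]:
        """Return the names of required inputs that are not set for this mode."""
        required = {
            "clone": ["ref_audio_path", "ref_transcript"],
            "design": ["voice_description"],
            "preset": ["preset_voice"],
        }
        if self.mode not in required:
            raise ValueError(f"unknown mode: {self.mode!r}")
        return [name for name in required[self.mode] if not getattr(self, name)]


if __name__ == "__main__":
    req = SpeechRequest(text="Hello there.", mode="clone", ref_audio_path="sample.wav")
    print(req.missing_fields())  # ['ref_transcript']
```

Checking inputs before calling a model makes the most common mistake, a clip without its transcript, show up immediately rather than as poor output.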

Model options:

Qwen3-TTS-12Hz-1.7B-VoiceDesign. Good for building voices from scratch in many languages.
Qwen3-TTS-12Hz-1.7B-CustomVoice. Comes with 9 high-quality presets and lets you tweak the style.
Qwen3-TTS-12Hz-1.7B-Base. Quick voice cloning from short clips.
Qwen3-TTS-12Hz-0.6B-CustomVoice. Smaller model for general voice output with lower resource use.
Qwen3-TTS-12Hz-0.6B-Base. Fast, lightweight voice cloning that runs easily on most machines.

All of these models support real-time speech, and the bigger ones give you more control and options for voice design.
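
The list above can be turned into a small lookup for choosing a model. This is a sketch only: the model names come from the list, while the task labels ("design", "preset", "clone") and the `pick_model` helper are invented for illustration.

```python
# Model names are taken from the list above; the "task" labels are assumptions
# made for this sketch, not official terminology.
MODELS = {
    "Qwen3-TTS-12Hz-1.7B-VoiceDesign": {"size": "1.7B", "task": "design"},
    "Qwen3-TTS-12Hz-1.7B-CustomVoice": {"size": "1.7B", "task": "preset"},
    "Qwen3-TTS-12Hz-1.7B-Base": {"size": "1.7B", "task": "clone"},
    "Qwen3-TTS-12Hz-0.6B-CustomVoice": {"size": "0.6B", "task": "preset"},
    "Qwen3-TTS-12Hz-0.6B-Base": {"size": "0.6B", "task": "clone"},
}


def pick_model(task: str, prefer_small: bool = False) -> str:
    """Return a model name for `task`, preferring the requested size if it exists."""
    candidates = [name for name, info in MODELS.items() if info["task"] == task]
    if not candidates:
        raise ValueError(f"no model listed for task {task!r}")
    wanted = "0.6B" if prefer_small else "1.7B"
    for name in candidates:
        if MODELS[name]["size"] == wanted:
            return name
    # Fall back to whatever size is available (e.g. no 0.6B VoiceDesign is listed).
    return candidates[0]


if __name__ == "__main__":
    print(pick_model("clone", prefer_small=True))
    print(pick_model("design", prefer_small=True))
```

Note that no 0.6B VoiceDesign model appears in the list, so a request for a small voice-design model falls back to the 1.7B variant.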

Model Performance Editor’s Rating

No editor performance evaluations available for this model yet.

User Ratings

Categories: Censorship (lower = less censorship, higher = stricter filtering), Creativity, Expressiveness, Generation Speed, ID preservation, Prompt Following, Realism.

Qwen3-TTS Examples

Very impressive. The emotional tone fit the content very well. Voice: Vivian. Generated on January 25, 2026

Same voice (Vivian), but without any prompt (style instructions). The output is far worse than when you guide the tone. Generated on January 25, 2026

It didn't perform the [laugh], but at least it didn't read the tag aloud either, and it almost produced a [sigh]. Voice: Ryan. Quote from "The Big Lebowski". Generated on January 24, 2026
