Qwen3-TTS is a powerful text-to-speech model made by the Qwen team at Alibaba Cloud. It turns written words into natural-sounding speech.
It can clone voices, design new ones, speak ten languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian), and adjust tone, pace, and rhythm. It supports real-time use and is released under the open Apache 2.0 license, so anyone can use or build on it.
You can run it on almost any system: a Raspberry Pi with a GPU works, and so does a Mac or even a phone. To clone a voice, supply a short voice clip and its transcript. In a few minutes, you’ve got a cloned voice.
It won’t sound exactly like the real thing... but it’s close.
Here’s how it works:
- Dual-track model. One part turns text into sound tokens. The other controls timing and speaking style.
- Built for speed. Its 12 Hz tokenizer lets it start speaking in under 100 ms, which suits live apps.
- Trained on tons of audio. Millions of hours of multi-language speech give it solid speaking skills.
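To make the 12 Hz figure concrete, here's a rough back-of-the-envelope sketch. It assumes the "12Hz" in the model names means about 12 token frames per second of audio. The function names are made up for illustration and aren't part of any Qwen API.

```python
import math

# Assumption: "12Hz" means roughly 12 token frames per second of audio.
TOKEN_RATE_HZ = 12


def frame_duration_ms(rate_hz: float = TOKEN_RATE_HZ) -> float:
    """Audio covered by a single token frame, in milliseconds."""
    return 1000.0 / rate_hz


def frames_for(seconds: float, rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Token frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * rate_hz)


if __name__ == "__main__":
    print(round(frame_duration_ms(), 1))  # 83.3 ms of audio per frame
    print(frames_for(3))                  # 36 frames for a 3-second clip
    print(frames_for(60))                 # 720 frames for a minute of speech
```

Under that assumption, each frame covers about 83 ms of audio, so the model only has to produce a dozen or so frames per second of speech. That's one plausible reason a low frame rate fits live use.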
Here’s what it can do:
- Realistic speech. The audio sounds close to how people actually talk.
- Voice cloning. Copy any voice using just 3 seconds of audio.
- Voice creation. Build new voices by typing things like “nervous teenage male voice.”
- Multi-language support. Handles 10 languages, including English and Chinese.
- Live response. It’s fast enough to use in real-time chats or narration.
- Control settings. You can adjust emotion, speed, pitch, and rhythm.
- Performance holds up well. Tests show it makes strong, clear audio even on average gear.
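Since voice cloning needs a clip and a transcript, it can help to sanity-check both before handing them to the model. Here's a minimal sketch using only Python's standard `wave` module. It assumes the clip is an uncompressed WAV file and treats the 3-second figure from the list above as a minimum, which is an assumption. `clip_seconds` and `ready_to_clone` are hypothetical helper names, not part of Qwen3-TTS.

```python
import wave

# Assumption: the "3 seconds of audio" figure is treated as a minimum length.
MIN_CLIP_SECONDS = 3.0


def clip_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def ready_to_clone(path: str, transcript: str) -> bool:
    """True if the clip is long enough and the transcript isn't blank."""
    long_enough = clip_seconds(path) >= MIN_CLIP_SECONDS
    has_text = bool(transcript.strip())
    return long_enough and has_text
```

Something like `ready_to_clone("me.wav", "Hi, this is my voice.")` would return `False` for a clip that's too short or a transcript that's empty, so you can catch bad input before spending time on generation.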
Model options:
- Qwen3-TTS-12Hz-1.7B-VoiceDesign. Good for building voices from scratch in many languages.
- Qwen3-TTS-12Hz-1.7B-CustomVoice. Comes with 9 high-quality presets and lets you tweak the style.
- Qwen3-TTS-12Hz-1.7B-Base. Quick voice cloning from short clips.
- Qwen3-TTS-12Hz-0.6B-CustomVoice. Smaller model for general voice output with lower resource use.
- Qwen3-TTS-12Hz-0.6B-Base. Fast, light voice cloning that runs easily on most machines.
All of these support real-time speech, and the bigger ones give you more control and options for voice design.
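If you're picking a model in code, the list above boils down to a small lookup table. This is only a sketch: the model names come from the list, but the task labels ("design", "custom", "clone") and the `pick_model` helper are invented here for illustration.

```python
# Model names are from the list above; the task labels are made up for this sketch.
MODELS = {
    ("design", "1.7B"): "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    ("custom", "1.7B"): "Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("clone", "1.7B"): "Qwen3-TTS-12Hz-1.7B-Base",
    ("custom", "0.6B"): "Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ("clone", "0.6B"): "Qwen3-TTS-12Hz-0.6B-Base",
}


def pick_model(task: str, size: str = "1.7B") -> str:
    """Look up a model name by task and size, or raise ValueError."""
    try:
        return MODELS[(task, size)]
    except KeyError:
        raise ValueError(f"no {size} model listed for task '{task}'") from None


if __name__ == "__main__":
    print(pick_model("clone", "0.6B"))  # Qwen3-TTS-12Hz-0.6B-Base
    print(pick_model("design"))         # Qwen3-TTS-12Hz-1.7B-VoiceDesign
```

Note that voice design only appears at the 1.7B size in the list, so `pick_model("design", "0.6B")` raises an error rather than guessing.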