Kyutai TTS
Kyutai TTS-1.6B is a fast, open-source streaming text-to-speech model with live voice output in English and French. It starts speaking about 220 ms after text arrives and stays stable on long texts.
Overview
Kyutai TTS-1.6B is fast and open. It starts talking about 220 ms after the first text arrives. You don't need to send the full sentence first: it speaks as text comes in. And it's free.
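To make "speaks as text comes in" concrete, here is a minimal sketch of the client side of that idea: text fragments arrive from some upstream source (for example an LLM), and whole words are forwarded as soon as they are complete instead of waiting for the sentence to end. The function name `words_from_fragments` is hypothetical and this is not Kyutai's API; it only illustrates the kind of incremental input a streaming TTS can consume.

```python
import re
from typing import Iterable, Iterator


def words_from_fragments(fragments: Iterable[str]) -> Iterator[str]:
    """Yield whole words as soon as they are complete.

    Hypothetical helper, not part of any Kyutai library. A word is
    considered complete once whitespace follows it.
    """
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        # Split on whitespace; the last piece may still be a partial word,
        # so it stays in the buffer until more text arrives.
        *complete, buffer = re.split(r"\s+", buffer)
        for word in complete:
            if word:
                yield word
    # The input has ended, so whatever is left is the final word.
    if buffer:
        yield buffer


if __name__ == "__main__":
    incoming = ["Hel", "lo wor", "ld, how", " are you?"]
    for word in words_from_fragments(incoming):
        print(word)  # each word could be sent to the TTS right away
```

Each yielded word could be handed to a streaming TTS client immediately, which is what keeps the time to first audio short.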
Kyutai built it on a stacked Transformer architecture and uses the Mimi audio codec. It handles English and French. You also get word-level timestamps, so your application knows exactly which word was spoken and when.
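One place word-level timestamps help is interruption handling in a voice agent: if the user cuts in, you can work out which words were actually heard. The sketch below assumes the timestamps arrive as (word, start, end) triples in seconds. That layout and the names `WordTiming` and `spoken_so_far` are assumptions for illustration, not the model's documented output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordTiming:
    """Hypothetical record for one word-level timestamp."""
    word: str
    start_s: float  # when the word starts in the audio, in seconds
    end_s: float    # when the word ends in the audio, in seconds


def spoken_so_far(timings: List[WordTiming], playback_s: float) -> str:
    """Return the words fully played back by time `playback_s`."""
    return " ".join(t.word for t in timings if t.end_s <= playback_s)


if __name__ == "__main__":
    timings = [
        WordTiming("Hello", 0.00, 0.40),
        WordTiming("there", 0.45, 0.80),
        WordTiming("friend", 0.85, 1.30),
    ]
    # The user interrupts 0.9 s into playback: "friend" was not finished.
    print(spoken_so_far(timings, 0.9))
```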
The team behind it previously worked on Moshi (an experimental conversational AI) and Hibiki. The same people have now released this model publicly through Hugging Face and GitHub. They didn’t just tweak an existing design either: they made it stream, both in and out. That’s perfect if you’re building something like live AI agents or assistants.
It doesn't just talk fast. It sounds natural too. That’s thanks to a trick called delayed-streams modeling. It gives the system a tiny buffer so future words can help shape how it pronounces what it says now.
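The buffering idea can be shown with a toy alignment. In this sketch the text stream and the audio stream share one time axis, and the audio stream lags the text stream by `delay` steps, so by the time audio for a word is produced, the next `delay` words have already been read. This is a deliberate simplification: it uses one word per step and an arbitrary delay value, whereas the real model works on codec frames. The function name and the padding marker are hypothetical.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical filler for steps where a stream has nothing


def delayed_streams(text_tokens: List[str], delay: int) -> List[Tuple[int, str, str]]:
    """Pair each time step with the text read and the word being voiced.

    Returns (step, text_in, audio_for) tuples. Audio for word i is produced
    at step i + delay, after words up to i + delay have been read.
    """
    steps = []
    total_steps = len(text_tokens) + delay
    for t in range(total_steps):
        # The text stream runs ahead and is padded once the text ends.
        text_in = text_tokens[t] if t < len(text_tokens) else PAD
        # The audio stream lags behind and is padded until the delay elapses.
        audio_for = text_tokens[t - delay] if t >= delay else PAD
        steps.append((t, text_in, audio_for))
    return steps


if __name__ == "__main__":
    for step, text_in, audio_for in delayed_streams(["the", "cat", "sat"], delay=2):
        print(f"step {step}: reads {text_in!r}, voices {audio_for!r}")
```

A larger delay gives more lookahead for pronunciation and intonation at the cost of a longer wait before the first audio, which is the trade-off behind the latency figures quoted on this page.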
How does it hold up under pressure? Pretty well. You can run 32 streams at once on a decent GPU with just 350 ms of latency. It handles long texts past 30 seconds without falling apart, something most models still struggle with. It even lets you clone voices using precomputed voice embeddings (though the model that produces those embeddings isn't open-source).
If you want to try it, at the time of writing a demo is available at https://unmute.sh, where you can talk to it in real time.
Licenses: Model weights under CC‑BY‑4.0; code under Apache‑2.0 (Rust backend) and MIT (Python)
Supported Languages
- English
- French
Tags
- Freeware
- Creative Commons Attribution (CC BY)
- PC-based
- #Voice & Audio
This tool is free to use when installed locally and is offered under Creative Commons Attribution (CC BY).
This page was last updated on July 5, 2025 at 9:13 AM