CSM by Sesame AI Labs
CSM by Sesame AI Labs blends speech and text processing in a single model, using RVQ tokens for natural, high-quality, low-latency speech generation.
Overview
CSM comes from Sesame AI Labs, built to push conversational AI forward by handling speech and text processing in one pass. Regular text-to-speech (TTS) systems turn words into sound but miss key details like tone shifts and pauses, which makes them sound robotic. CSM addresses that by using Residual Vector Quantization (RVQ) tokens to capture those speech traits in real time.
CSM runs on two AI models: one handles the mixed text and audio input, and the other generates the speech.
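Here is a rough sketch of that two-model split, assuming a transformer backbone over interleaved text and audio tokens. All names and sizes are illustrative, not Sesame's actual architecture, and the real decoder is itself a small autoregressive transformer, simplified here to one linear head per RVQ codebook:

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Multimodal model: attends over interleaved text + audio tokens."""
    def __init__(self, vocab_size=65536, d_model=512, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):                    # (batch, seq)
        return self.encoder(self.embed(tokens))   # (batch, seq, d_model)

class AudioDecoder(nn.Module):
    """Maps each backbone state to logits over every RVQ codebook."""
    def __init__(self, d_model=512, codebook_size=1024, n_codebooks=8):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, codebook_size) for _ in range(n_codebooks)
        )

    def forward(self, hidden):                    # (batch, seq, d_model)
        return torch.stack([h(hidden) for h in self.heads], dim=2)

tokens = torch.randint(0, 65536, (1, 32))         # fake interleaved ids
logits = AudioDecoder()(Backbone()(tokens))
print(logits.shape)                               # (1, 32, 8, 1024)
```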
Instead of working with raw audio, CSM converts everything into RVQ tokens, which keeps voices natural while staying efficient.
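For a feel of what RVQ tokens are, here is a minimal sketch: each stage quantizes whatever residual the previous stage left behind, so one audio frame becomes a short stack of code indices. The codebooks below are random placeholders, not a trained audio codec:

```python
import torch

torch.manual_seed(0)
n_stages, codebook_size, dim = 4, 256, 16
codebooks = [torch.randn(codebook_size, dim) for _ in range(n_stages)]

def rvq_encode(frame):
    residual, codes = frame, []
    for cb in codebooks:
        idx = torch.cdist(residual[None], cb).argmin()  # nearest entry
        codes.append(idx.item())
        residual = residual - cb[idx]                   # quantize the leftover
    return codes

def rvq_decode(codes):
    return sum(cb[i] for cb, i in zip(codebooks, codes))

frame = torch.randn(dim)
codes = rvq_encode(frame)
print(codes, torch.norm(frame - rvq_decode(codes)).item())
```

Each extra stage shrinks the reconstruction error, which is why a handful of small codebooks can stand in for raw waveform samples.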
Rather than processing every single frame during training, CSM trains on only 1/16th of them, cutting memory use while keeping quality high.
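A toy version of that frame-subsampling idea, assuming the decoder loss is simply computed on a random 1/16th subset of frames; the shapes and loss setup are placeholders, not Sesame's published training recipe:

```python
import torch
import torch.nn.functional as F

batch, seq, n_levels, codebook_size = 2, 128, 8, 1024
logits = torch.randn(batch, seq, n_levels, codebook_size, requires_grad=True)
targets = torch.randint(0, codebook_size, (batch, seq, n_levels))

keep = torch.rand(batch, seq) < 1 / 16            # ~1 frame in 16 survives
loss = F.cross_entropy(
    logits[keep].reshape(-1, codebook_size),      # only the sampled frames
    targets[keep].reshape(-1),
)
loss.backward()
print(f"trained on {int(keep.sum())} of {batch * seq} frames")
```

In a real setup the decoder would only be run on the sampled frames in the first place, which is where the memory savings come from.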
Older models take multiple steps to produce sound; CSM generates it in a single stage, which makes it great for fast AI conversations.
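The difference is easy to see in a toy generation loop: every iteration emits a complete frame of RVQ codes that could be decoded to audio right away, with no second refinement model to wait for. `next_frame` here is a random stand-in for a real forward pass over the text and audio history:

```python
import torch

n_levels, codebook_size = 8, 1024

def next_frame(history):
    return torch.randint(0, codebook_size, (n_levels,))

history = []
for t in range(4):
    frame = next_frame(history)    # one pass per frame, no second stage
    history.append(frame)          # a streaming system could decode and
                                   # play this frame while producing the next
print(torch.stack(history).shape)  # (4, 8)
```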
Model Sizes
CSM comes in three versions (written out as a config snippet after this list):
- Tiny: 1B base, 100M decoder.
- Small: 3B base, 250M decoder.
- Medium: 8B base, 300M decoder.
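The same tiers as a plain config table; the dict name and keys are just for illustration:

```python
CSM_SIZES = {
    "tiny":   {"base": "1B", "decoder": "100M"},
    "small":  {"base": "3B", "decoder": "250M"},
    "medium": {"base": "8B", "decoder": "300M"},
}

for name, cfg in CSM_SIZES.items():
    print(f"{name}: {cfg['base']} base + {cfg['decoder']} decoder")
```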
All three were trained on a massive dataset of 1 million hours of English speech, so the voices sound natural and expressive. CSM has limited non-English support, however, as most of its training data is in English. The developers plan to scale up the dataset and expand support to 20+ languages; future improvements will also include better turn-taking mechanics for truly seamless conversations.
Supported Languages
- English
Tags
- Freeware
- Apache License 2.0
- PC-based
- #Voice & Audio
Features
- Multi-Character Dialogue
- Pre-Built Voices
- Voices with Emotions
Community Reaction
The Reddit reaction to CSM's release is mostly negative, with many feeling misled. The demo had sparked a lot of excitement, but the open-source version is missing key features, leading some to think the omissions were deliberate. A lot of users believe investor pressure shaped the decision, with some guessing the team saw a chance to cash in instead of fully delivering on what was promised.
One big annoyance is that CSM seems to be just a more advanced text-to-speech model rather than the true speech-to-speech system many were hoping for. Some remember earlier claims that no text-based step was needed, yet the released version still depends on transcription and outside language models to hold a conversation. The model does show strong awareness of context, but to some it doesn't live up to the game-changing hype, feeling more like a polished TTS system than a major leap in conversational AI.
Tech discussions suggest the demo was using extra tools, probably a transcription system and a lightweight LLM, to give the impression of real-time AI chat. The open-source release only includes the 1B model, not the full 8B multimodal setup. Even so, a few users think the tech still has promise, and that with enough work an open-source rival could match or even improve on Sesame's model.
Even with the letdown, some people believe the open-source crowd will take CSM's base and build something better. Others think Sesame AI Labs will either get bought out or shut down before long. Right now the mood is mostly disappointment, but the tech itself might still drive future AI breakthroughs.