CSM by Sesame AI Labs
CSM by Sesame AI Labs blends speech and text processing in a single model, using RVQ tokens for natural, high-quality, low-latency speech generation.
Overview
CSM comes from Sesame AI Labs, built to push conversational AI forward by handling speech and text processing in one pass. Regular text-to-speech (TTS) systems turn words into sound but miss key details like tone shifts and pauses, which makes them sound robotic. CSM addresses that by using Residual Vector Quantization (RVQ) tokens to capture those speech traits in real time.
CSM runs on two AI models: one handles the mixed text and audio input, and the other generates the speech.
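Here is a rough sketch of that two-model split, assuming a transformer backbone over interleaved text and audio tokens. All names and sizes are illustrative, not Sesame's actual architecture, and the real decoder is itself a small autoregressive transformer, simplified here to one linear head per RVQ codebook:

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Multimodal model: attends over interleaved text + audio tokens."""
    def __init__(self, vocab_size=65536, d_model=512, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):                    # (batch, seq)
        return self.encoder(self.embed(tokens))   # (batch, seq, d_model)

class AudioDecoder(nn.Module):
    """Maps each backbone state to logits over every RVQ codebook."""
    def __init__(self, d_model=512, codebook_size=1024, n_codebooks=8):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, codebook_size) for _ in range(n_codebooks)
        )

    def forward(self, hidden):                    # (batch, seq, d_model)
        return torch.stack([h(hidden) for h in self.heads], dim=2)

tokens = torch.randint(0, 65536, (1, 32))         # fake interleaved ids
logits = AudioDecoder()(Backbone()(tokens))
print(logits.shape)                               # (1, 32, 8, 1024)
```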
Instead of working with raw audio, CSM converts everything into RVQ tokens, which keeps voices natural while staying efficient.
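For a feel of what RVQ tokens are, here is a minimal sketch: each stage quantizes whatever residual the previous stage left behind, so one audio frame becomes a short stack of code indices. The codebooks below are random placeholders, not a trained audio codec:

```python
import torch

torch.manual_seed(0)
n_stages, codebook_size, dim = 4, 256, 16
codebooks = [torch.randn(codebook_size, dim) for _ in range(n_stages)]

def rvq_encode(frame):
    residual, codes = frame, []
    for cb in codebooks:
        idx = torch.cdist(residual[None], cb).argmin()  # nearest entry
        codes.append(idx.item())
        residual = residual - cb[idx]                   # quantize the leftover
    return codes

def rvq_decode(codes):
    return sum(cb[i] for cb, i in zip(codebooks, codes))

frame = torch.randn(dim)
codes = rvq_encode(frame)
print(codes, torch.norm(frame - rvq_decode(codes)).item())
```

Each extra stage shrinks the reconstruction error, which is why a handful of small codebooks can stand in for raw waveform samples.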
Rather than processing every single frame during training, CSM trains on only 1/16th of them, cutting memory use while keeping quality high.
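A toy version of that frame-subsampling idea, assuming the decoder loss is simply computed on a random 1/16th subset of frames; the shapes and loss setup are placeholders, not Sesame's published training recipe:

```python
import torch
import torch.nn.functional as F

batch, seq, n_levels, codebook_size = 2, 128, 8, 1024
logits = torch.randn(batch, seq, n_levels, codebook_size, requires_grad=True)
targets = torch.randint(0, codebook_size, (batch, seq, n_levels))

keep = torch.rand(batch, seq) < 1 / 16            # ~1 frame in 16 survives
loss = F.cross_entropy(
    logits[keep].reshape(-1, codebook_size),      # only the sampled frames
    targets[keep].reshape(-1),
)
loss.backward()
print(f"trained on {int(keep.sum())} of {batch * seq} frames")
```

In a real setup the decoder would only be run on the sampled frames in the first place, which is where the memory savings come from.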
Older models take multiple steps to produce sound; CSM generates it in a single stage, which makes it great for fast AI conversations.
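The difference is easy to see in a toy generation loop: every iteration emits a complete frame of RVQ codes that could be decoded to audio right away, with no second refinement model to wait for. `next_frame` here is a random stand-in for a real forward pass over the text and audio history:

```python
import torch

n_levels, codebook_size = 8, 1024

def next_frame(history):
    return torch.randint(0, codebook_size, (n_levels,))

history = []
for t in range(4):
    frame = next_frame(history)    # one pass per frame, no second stage
    history.append(frame)          # a streaming system could decode and
                                   # play this frame while producing the next
print(torch.stack(history).shape)  # (4, 8)
```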
Model Sizes
CSM comes in three versions (written out as a config snippet after this list):
- Tiny: 1B base, 100M decoder.
- Small: 3B base, 250M decoder.
- Medium: 8B base, 300M decoder.
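The same tiers as a plain config table; the dict name and keys are just for illustration:

```python
CSM_SIZES = {
    "tiny":   {"base": "1B", "decoder": "100M"},
    "small":  {"base": "3B", "decoder": "250M"},
    "medium": {"base": "8B", "decoder": "300M"},
}

for name, cfg in CSM_SIZES.items():
    print(f"{name}: {cfg['base']} base + {cfg['decoder']} decoder")
```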
All three were trained on a massive dataset of 1 million hours of English speech, so the voices sound natural and expressive. CSM has limited non-English support, however, as most of its training data is in English. The developers plan to scale up the dataset and expand support to 20+ languages; future improvements will also include better turn-taking mechanics for truly seamless conversations.
Supported Languages
- English
Tags
- Freeware
- Apache License 2.0
- PC-based
- #Voice & Audio
Features
- Multi-Character Dialogue
- Pre-Built Voices
- Voices with Emotions
Community Reaction
The Reddit reaction to CSM's release is mostly negative, with many feeling misled. The demo had sparked a lot of excitement, but the open-source version is missing key features, leading some to think the omissions were deliberate. A lot of users believe investor pressure shaped the decision, with some guessing the team saw a chance to cash in instead of fully delivering on what was promised.
One big annoyance is that CSM seems to be just a more advanced text-to-speech model rather than the true speech-to-speech system many were hoping for. Some remember earlier claims that no text-based step was needed, yet the released version still depends on transcription and outside language models to hold a conversation. The model does show strong awareness of context, but to some it doesn't live up to the game-changing hype, feeling more like a polished TTS system than a major leap in conversational AI.
Tech discussions suggest the demo was using extra tools, probably a transcription system and a lightweight LLM, to give the impression of real-time AI chat. The open-source release only includes the 1B model, not the full 8B multimodal setup. Even so, a few users think the tech still has promise, and that with enough work an open-source rival could match or even improve on Sesame's model.
Even with the letdown, some people believe the open-source crowd will take CSM's base and build something better. Others think Sesame AI Labs will either get bought out or shut down before long. Right now the mood is mostly disappointment, but the tech itself might still drive future AI breakthroughs.