Maya1 could be a solid pick if you want a voice model you can run on your own gear. It's built for generating voices that sound emotional and humanlike. You control how it talks with short natural-language descriptions and emotion tags placed right in the text.
It’s made by Maya Research and has 3 billion parameters under the hood. You tell it what kind of voice you want, like “young woman with a British accent, upbeat,” and you can throw in things like <sigh> or <laugh> to shape the tone.
Supported emotion tags: <laugh>, <sigh>, <whisper>, <angry>, <giggle>, <chuckle>, <gasp>, <cry>, and 12+ more.
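To make the idea concrete, here's a minimal sketch of how a Maya1-style prompt might be put together: a voice description plus tagged text. The exact wrapping format is defined by the model card, so treat the `<description="...">` layout and the sample text here as illustrative assumptions rather than the official spec.

```python
# Illustrative only: assembling a voice description and emotion-tagged text
# into one conditioning string. The exact format Maya1 expects lives in its
# model card; this layout is an assumption for demonstration.
voice_description = "young woman with a British accent, upbeat"

text = (
    "I honestly didn't expect that to work <laugh> "
    "but here we are. <sigh> Let's keep going."
)

prompt = f'<description="{voice_description}"> {text}'
print(prompt)
```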
Maya1 spits out 24 kHz mono audio through a neural codec called SNAC, which sidesteps raw waveform generation and keeps things fast and light for streaming. It's open source (Apache 2.0), so you can run it yourself if you've got a strong GPU or drop it into something you're building.
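If you want to poke at the codec side on its own, the published `snac` package (`pip install snac`) ships a 24 kHz checkpoint you can round-trip audio through. This is just the codec, not Maya1 itself, and the dummy input below is only there to show the tensor shapes involved.

```python
# Round-trip one second of audio through SNAC, the codec Maya1 targets.
# "hubertsiuzdak/snac_24khz" is the public 24 kHz SNAC checkpoint,
# separate from the Maya1 model weights.
import torch
from snac import SNAC

codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# One second of dummy mono audio at 24 kHz: (batch, channels, samples).
audio = torch.randn(1, 1, 24000)

with torch.inference_mode():
    codes = codec.encode(audio)          # hierarchical coarse-to-fine code tensors
    reconstructed = codec.decode(codes)  # back to a 24 kHz waveform

print([c.shape for c in codes], reconstructed.shape)
```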
They dropped it publicly in late 2025 and it's been making the rounds in AI circles. Maya Research says they’re part of South Park Commons and focus on voice AI.
The model is trained in two stages: first on a large amount of English audio to learn general speech patterns, then fine-tuned on cleaner studio clips annotated with tags for emotions, accents, and voice types. Architecturally it's a decoder-only transformer in the LLaMA style, and instead of predicting raw audio it predicts SNAC codec tokens. Your text and voice cues get turned into tokens, which are then decoded into clean audio. A rough sketch of that flow is below.
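Here's a hedged sketch of that text → codec-token → audio flow. The Hugging Face id `maya-research/maya1` and the `tokens_to_snac_codes()` helper are placeholders I'm assuming for illustration; the real token layout and decoding steps come from the model's own docs and scripts, not this snippet.

```python
# Sketch, not the official pipeline: generate codec tokens with the LM,
# then decode them to audio with SNAC. Placeholder names are marked below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

model_id = "maya-research/maya1"  # assumed Hugging Face id
tok = AutoTokenizer.from_pretrained(model_id)
lm = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

prompt = '<description="young woman, British accent, upbeat"> Hello there <laugh>'
inputs = tok(prompt, return_tensors="pt")

with torch.inference_mode():
    out = lm.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.8)

def tokens_to_snac_codes(token_ids):
    """Placeholder: map generated token ids back to SNAC's hierarchical codes.
    The actual offsets and interleaving are model-specific and not shown here."""
    raise NotImplementedError

codes = tokens_to_snac_codes(out[0, inputs["input_ids"].shape[1]:])
with torch.inference_mode():
    waveform = codec.decode(codes)  # 24 kHz mono audio tensor
```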
Maya1 is meant to fill the gap between stiff open-source models that don't show much feeling and big-name TTS systems that charge per second. You can use it for game characters, podcasts, video narration, accessibility tools, voicebots, whatever. It's built to run on a single decent GPU (think 16 GB+ cards like an RTX 4090 or better), supports streaming, and plays nicely with common tooling for caching and browser audio.
If you'd like to try the model, here are a few ways to get started: