MagpieTTS Multilingual 357M is an open multilingual text-to-speech model released in April 2026. It turns text into spoken audio in nine languages. It is built for people who want a practical TTS model they can run with NVIDIA NeMo or use through NVIDIA’s hosted tools. One smaller model handles several languages and voices, so teams do not need to manage a set of separate single-language models.
The model comes from NVIDIA through its NeMo Speech team and the wider NVIDIA speech stack. It sits next to other speech models like Parakeet ASR, Canary speech translation, and Nemotron-Speech-Streaming, as well as NVIDIA’s enterprise NIM tools.
NVIDIA says it can be used for commercial work under the NVIDIA Open Model License. NVIDIA also has a hosted and deployable NIM version with API access, a free development tier and paid or enterprise paths through NVIDIA AI Enterprise.
This model is part of the NeMo Speech stack, NVIDIA’s broader speech AI platform covering speech recognition, translation, and TTS. That makes it more useful for teams building real voice products, because it plugs into actively maintained code and deployment tools instead of sitting alone as a research release.
The model uses a transformer-based text-to-speech design. It predicts discrete audio codec tokens and then turns those into waveform audio. In simple terms, it does not generate raw sound one sample at a time. It first produces a compressed speech representation, then rebuilds the final audio through a codec stage. NVIDIA says it uses an encoder-decoder transformer, multi-codebook prediction, and refinement steps to help timing and sound quality.
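To make the two-stage idea concrete, here is a toy sketch of the codec-token pipeline: a stand-in "transformer" maps text to a grid of discrete tokens (one row per codebook), and a stand-in "codec decoder" turns those tokens back into waveform samples. All names, sizes, and the duration heuristic below are illustrative assumptions, not MagpieTTS's actual configuration or API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only -- not the real model config.
N_CODEBOOKS = 4       # multi-codebook prediction: several parallel token streams
CODEBOOK_SIZE = 1024  # entries per codebook
FRAME_RATE = 86       # assumed codec frames per second
SAMPLE_RATE = 22050   # ~22 kHz target audio
HOP = SAMPLE_RATE // FRAME_RATE  # waveform samples reconstructed per codec frame

def predict_codec_tokens(text: str) -> np.ndarray:
    """Stand-in for the encoder-decoder transformer: map text to a
    (n_codebooks, n_frames) grid of discrete codec token ids."""
    n_frames = max(1, len(text)) * 2  # toy duration heuristic
    return rng.integers(0, CODEBOOK_SIZE, size=(N_CODEBOOKS, n_frames))

def decode_tokens_to_audio(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the neural codec decoder: look up a vector per token,
    sum contributions across codebooks, and flatten to waveform samples."""
    # One random "codebook table" per codebook (a real codec learns these).
    books = rng.standard_normal((N_CODEBOOKS, CODEBOOK_SIZE, HOP))
    frames = sum(books[b, tokens[b]] for b in range(N_CODEBOOKS))
    return frames.reshape(-1)  # (n_frames * HOP,) waveform samples

tokens = predict_codec_tokens("Hello world")
audio = decode_tokens_to_audio(tokens)
print(tokens.shape, audio.shape)
```

The point is the shape of the problem: the transformer's output is a small integer grid rather than raw audio, which is why a separate codec stage is needed to recover a ~22 kHz waveform.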
The open model card lists five speaker options... Sofia, Aria, Jason, Leo and John Van Stan. NVIDIA also points to uses like voice agents and offline speech generation. The hosted NIM offering goes further, claiming more native voices and emotional speech features. But that seems tied to NVIDIA’s service layer, not to the open checkpoint by itself.
NVIDIA’s docs for Magpie-TTS in general mention features like long-form inference and voice cloning with audio conditioning. But the specific multilingual 357M checkpoint on Hugging Face is described as a fixed multi-speaker multilingual TTS model. So the safer read is this... the open checkpoint works best as a multilingual TTS model with preset voices, while the more advanced options seem to live in NVIDIA’s wider Magpie and NIM stack.
The output format is .wav. The open example uses NVIDIA’s NanoCodec checkpoint tagged 22khz, so the practical pipeline appears to target roughly 22 kHz audio. That part is an inference from the codec’s name, not a spec stated directly on the model card.
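If you want to sanity-check that assumption downstream, writing and inspecting a 22.05 kHz mono .wav takes only the Python standard library. The sine wave here is a placeholder for real model output, and 22050 Hz is the assumed rate inferred from the codec name:

```python
import math
import struct
import wave

SAMPLE_RATE = 22050  # assumed from the "22khz" NanoCodec name, not a stated spec
DURATION_S = 0.5

# Placeholder waveform: a quiet 440 Hz sine instead of real TTS output.
n = int(SAMPLE_RATE * DURATION_S)
samples = [int(32767 * 0.3 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
           for t in range(n)]

with wave.open("out.wav", "wb") as wf:
    wf.setnchannels(1)           # mono
    wf.setsampwidth(2)           # 16-bit PCM
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(struct.pack(f"<{n}h", *samples))
```

Opening a generated file with `wave.open(..., "rb")` and checking `getframerate()` is a quick way to confirm what sample rate your pipeline actually produced.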
If you'd like to access this model, here are the main options: