On January 15th, 2026, NVIDIA dropped PersonaPlex-7B-v1, a new speech-to-speech AI that talks and listens at the same time, much more like how people actually chat. No more awkward wait-your-turn pattern like with regular voice assistants.
It’s a 7-billion-parameter Transformer-based model built to keep conversations natural and smooth. Instead of chaining speech recognition, then text, then voice synthesis as separate steps, it does it all in one go: the model listens and talks in a continuous flow, predicting speech and text tokens together as the conversation unfolds.
It uses two audio streams, one carrying your voice in and the other carrying its response out, and both share the same Transformer backbone. So it can respond while you’re still talking, toss in backchannels like “uh-huh”, or cut in if needed, and it all happens fast.
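To make the full-duplex part concrete, here’s a conceptual sketch of one pass through that loop. This is not the actual PersonaPlex or Moshi API; `model.step`, `model.codec`, `mic`, and `speaker` are hypothetical stand-ins, and the ~80 ms frame size assumes Mimi’s 12.5 frames-per-second rate.

```python
# Conceptual sketch of a full-duplex decode loop. Function names are
# hypothetical stand-ins, NOT the real PersonaPlex/Moshi API.

def full_duplex_loop(model, mic, speaker):
    state = model.new_state()
    while True:
        # 1. Grab the next ~80 ms of microphone audio and tokenize it
        #    with the neural codec (Mimi runs at about 12.5 frames/sec).
        user_tokens = model.codec.encode(mic.read_frame())

        # 2. One forward pass: the shared backbone sees the user's stream
        #    and jointly predicts its own next audio tokens AND a text token
        #    for the same time step.
        out_audio_tokens, out_text_token, state = model.step(user_tokens, state)

        # 3. Decode the model's audio tokens back to a waveform and play
        #    them immediately -- no waiting for the user to stop talking.
        speaker.play(model.codec.decode(out_audio_tokens))
```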
You can tweak its voice and persona with two prompt types: voice prompts shape how it sounds (timbre and tone), while text prompts set its role and background.
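Roughly, you can picture the two prompt types like this. The names here are hypothetical stand-ins, not the documented interface; check the model card for the real one.

```python
# Hypothetical illustration of the two prompt types; the argument
# names are stand-ins, not the documented PersonaPlex interface.
persona = {
    # Voice prompt: a short reference clip whose timbre and tone the model imitates.
    "voice_prompt": "warm_narrator_10s.wav",
    # Text prompt: the role and background the model should play.
    "text_prompt": "You are a patient cooking instructor who keeps answers short.",
}

session = model.start_conversation(**persona)  # hypothetical entry point
```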
Some quick tech bits: it runs on the Moshi Transformer architecture with the Mimi neural codec handling the speech side, operating on 24 kHz audio. It was trained on a mix of real and synthetic conversations, and it responds fast enough to feel almost like talking to a real person.
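One practical consequence of the 24 kHz detail: any audio you feed in (say, a voice prompt clip) should be resampled to 24 kHz first. Here’s one way to do that with torchaudio; the filenames and the mono downmix are my assumptions.

```python
import torchaudio

# Load an arbitrary clip; torchaudio returns (channels, samples) plus the file's rate.
waveform, sr = torchaudio.load("voice_prompt.wav")

# Downmix to mono -- conversational speech codecs like Mimi work on a single channel.
waveform = waveform.mean(dim=0, keepdim=True)

# Resample to the 24 kHz rate the model's codec expects.
if sr != 24_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=24_000)

torchaudio.save("voice_prompt_24k.wav", waveform, 24_000)
```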
It’s free to grab on Hugging Face, but you have to agree to the terms first. And while the release is listed under MIT, the Moshi model it’s built on is CC-BY-4.0, so make of that what you will. You’ll also need a beefy GPU to run it (a 7B model in 16-bit precision takes roughly 14 GB of VRAM for the weights alone: 7 billion parameters × 2 bytes), or you can rent one from a GPU cloud.
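Since the repo is gated behind a terms-of-use click, downloading it programmatically takes an access token. Here’s a minimal sketch with huggingface_hub; the repo id is my guess at the naming, so confirm it on the actual model page.

```python
from huggingface_hub import login, snapshot_download

# You must accept the terms on the model page first, then authenticate
# with a Hugging Face access token (or set the HF_TOKEN env variable).
login(token="hf_...")  # placeholder token

# Repo id is an assumption -- confirm the exact name on Hugging Face.
local_dir = snapshot_download(repo_id="nvidia/personaplex-7b-v1")
print(f"Model files downloaded to {local_dir}")
```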
If you'd like to access this model, here are a few options to explore: