OmniVoice is an open-source text-to-speech model built for broad language coverage. It can generate speech in hundreds of languages and clone a voice from a short clip. The team calls it omnilingual and says it supports over 600 languages, with 646 listed on Hugging Face.

Researchers from Xiaomi Corp., including Han Zhu, built the model. The code is published under the k2-fsa group, which is tied to tools like Kaldi, k2, icefall and sherpa. So it comes from an established speech technology group.

Inside, the model uses a diffusion-style, non-autoregressive design. That sounds heavy, but the idea is simple: generate speech fast while keeping it clear. It maps text directly to acoustic tokens, using random masking during training and a pre-trained language model to improve clarity.
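The random masking step can be pictured with a toy sketch. This is not OmniVoice's training code: the token IDs, the `MASK_ID` placeholder, the `mask_tokens` helper and the masking ratio are all hypothetical and only illustrate the general idea of hiding part of a token sequence so a model can learn to fill it back in.

```python
import random

MASK_ID = -1  # hypothetical placeholder standing in for a masked acoustic token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK_ID.

    Returns the masked copy and the sorted positions that were masked.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    n_mask = round(len(tokens) * mask_ratio)
    # Pick distinct positions to hide; the original list is left untouched.
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK_ID
    return masked, positions


if __name__ == "__main__":
    rng = random.Random(0)
    acoustic_tokens = [17, 42, 8, 99, 23, 5, 61, 70]  # made-up token IDs
    masked, positions = mask_tokens(acoustic_tokens, mask_ratio=0.5, rng=rng)
    print("masked sequence:", masked)
    print("masked positions:", positions)
```

In a masked-prediction setup of this kind, the model would be asked to predict the original tokens at the masked positions, which is what lets it fill in many positions in parallel at generation time instead of one token after another.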

The team says they trained it on about 581,000 hours of multilingual data. They claim strong results across English, Chinese and other languages. Still, the model is new, so these claims need more independent testing. There is also a noted limitation: voice design training mostly used Chinese and English, so some lower-resource languages may be less stable.

The model supports a few main features. Text to speech is its core job, and it comes with a Python API. It can adjust accents such as British or American English, along with some Chinese dialects. It can clone voices from short clips or create new ones from traits like age or pitch. It also picks a voice automatically if no speaker is given.

Output is an audio file, usually .wav at 24 kHz. The model can run in different modes: you can clone a voice from a 3 to 10 second clip, or design one from text traits. You can also adjust speed, duration and pronunciation, including with phonemes or pinyin. Non-speech tags such as [laughter] can be added too.
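Before cloning, it can help to confirm that a reference clip actually falls in the 3 to 10 second range. The sketch below uses only Python's standard `wave` module and does not touch the OmniVoice API; the helper names `clip_info` and `is_usable_reference` are made up for this example, and it assumes an uncompressed PCM .wav file.

```python
import wave

MIN_SECONDS = 3.0   # lower bound of the reference clip length
MAX_SECONDS = 10.0  # upper bound of the reference clip length


def clip_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return rate, frames / rate


def is_usable_reference(path):
    """True if the clip length falls within the 3 to 10 second range."""
    _, seconds = clip_info(path)
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

The same `clip_info` helper can be pointed at a generated file to check that its sample rate is the expected 24,000 Hz.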

Speed is a big selling point. The project reports generation about 40 times faster than real time under certain setups, though actual speed depends on hardware.
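To make the 40 times figure concrete, here is the arithmetic as a small sketch. The function names are made up for this example, and the 60 second clip is just an illustration: at a 40 times speedup, one minute of audio would take about 1.5 seconds to generate, which corresponds to a real-time factor of 0.025.

```python
def synthesis_seconds(audio_seconds, speedup):
    """Time needed to generate audio at a given faster-than-real-time speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def real_time_factor(synthesis_s, audio_s):
    """Real-time factor: generation time divided by audio length (lower is faster)."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return synthesis_s / audio_s


if __name__ == "__main__":
    audio_s = 60.0   # one minute of speech, chosen for illustration
    speedup = 40.0   # the reported figure, valid only under certain setups
    synth_s = synthesis_seconds(audio_s, speedup)
    print(f"generation time: {synth_s:.2f} s")
    print(f"real-time factor: {real_time_factor(synth_s, audio_s):.3f}")
```

To measure the factor on your own hardware, time a generation call and pass the elapsed seconds and the length of the resulting audio to `real_time_factor`.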

For hardware, exact VRAM requirements are not clearly listed. The model files are a few GB in size, so it should run on consumer machines. Still, memory use at run time exceeds the size of the model file. A rough guess is around 4 to 6 GB of VRAM for GPU use, though that is not an official figure. Apple Silicon support is mentioned too.

Supported Languages

Afrikaans
Akan
Albanian
Amharic
Arabic
Aragonese
Armenian
Assamese
Azerbaijani
Bashkir
Basque
Belarusian
Bengali
Bosnian
Breton
Bulgarian
Burmese
Catalan
Chichewa
Chinese
Chuvash
Cornish
Croatian
Czech
Danish
Divehi
Dutch
English
Esperanto
Estonian
Filipino
Finnish
French
Galician
Ganda
Georgian
German
Greek
Gujarati
Haitian
Hausa
Hebrew
Herero
Hindi
Hungarian
Icelandic
Ido
Igbo
Indonesian
Interlingua
Inupiaq
Irish
Italian
Japanese
Javanese
Kannada
Kashmiri
Kazakh
Kinyarwanda
Korean
Kurdish
Lao
Latvian
Lingala
Lithuanian
Luxembourgish
Macedonian
Malagasy
Malay
Malayalam
Maltese
Manx
Marathi
Mongolian
Ndonga
Nepali
Norwegian
Norwegian Bokmål
Norwegian Nynorsk
Occitan
Oriya
Oromo
Ossetian
Pali
Panjabi
Pashto
Persian
Polish
Portuguese
Quechua
Romanian
Romansh
Russian
Sanskrit
Sardinian
Serbian
Shona
Sindhi
Sinhala
Slovak
Slovenian
Somali
Spanish
Swahili
Swedish
Tajik
Tamil
Tatar
Telugu
Thai
Tibetan
Tigrinya
Tswana
Turkish
Turkmen
Twi
Ukrainian
Urdu
Uzbek
Vietnamese
Welsh
Western Frisian
Wolof
Xhosa
Yiddish
Yoruba
Zulu
