OmniVoice audio model

Name: OmniVoice
Licence: Apache License 2.0
Creator: Researchers from Xiaomi Corp.

OmniVoice is an open-source text-to-speech model that works across many languages. It can generate speech in hundreds of languages and clone a voice from a short clip. The team calls it omnilingual and says it supports over 600 languages, with 646 listed on Hugging Face.

Researchers from Xiaomi Corp., including Han Zhu and others, built the model. The code is published under the k2-fsa group, which is tied to tools like Kaldi, k2, icefall and sherpa, so it comes from a known speech technology group.

Inside, the model uses a diffusion-style system with a non-autoregressive setup. That sounds heavy, but the idea is simple: generate speech fast while keeping it clear. It maps text straight into acoustic tokens, using random masking during training and a pre-trained language model to help clarity.
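To make the random masking idea concrete, here is a minimal sketch in plain Python. It is not taken from the OmniVoice code: the `MASK` id, the `random_mask` helper, the token values and the 50% ratio are all assumptions chosen for illustration. It only shows the general technique of hiding a random subset of acoustic tokens, which a masked model would then be trained to fill back in.

```python
import random

MASK = -1  # hypothetical mask token id, chosen only for this illustration


def random_mask(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the masked copy and the sorted list of masked positions.
    """
    n_mask = round(mask_ratio * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)


rng = random.Random(0)  # fixed seed so the demo is repeatable
acoustic_tokens = [17, 4, 99, 23, 8, 42, 5, 61]  # made-up token ids
masked, positions = random_mask(acoustic_tokens, 0.5, rng)

print(masked)     # half of the entries are now MASK
print(positions)  # the positions a model would have to predict
```

In a setup like this, all masked positions can be predicted in parallel instead of one token at a time, which is the usual reason non-autoregressive models are fast.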

The team says they trained it on about 581,000 hours of multilingual data. They claim strong results across English, Chinese and other languages. Still, it is new, so these claims need more outside testing. There is also a noted limit: voice design training mostly used Chinese and English, so some smaller languages may feel less stable.

The model supports a few main features. It has a Python API, and text-to-speech is its core job. It can adjust accents, such as British or American English, and some Chinese dialects. It can clone voices from short clips or create new ones from traits like age or pitch. It also picks a voice automatically if no speaker is given.

Outputs are audio files, usually .wav at 24 kHz. It can run in different modes. You can clone a voice from a clip of 3 to 10 seconds, or design one from text traits. It also lets you tweak speed, duration and pronunciation, even using phonemes or pinyin. Small extras like [laughter] can be added too.
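Before cloning, it can help to check that a reference clip really falls in the 3 to 10 second range mentioned above. The sketch below uses only Python's standard `wave` module and does not call OmniVoice at all. The helper names `clip_info` and `is_usable_reference` are made up for this example, and the demo clip is silent audio built in memory at 24 kHz.

```python
import io
import wave


def clip_info(wav_file):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV path or file object."""
    with wave.open(wav_file, "rb") as w:
        rate = w.getframerate()
        return rate, w.getnframes() / rate


def is_usable_reference(duration_s, min_s=3.0, max_s=10.0):
    """True if the clip length is inside the 3 to 10 second window."""
    return min_s <= duration_s <= max_s


# Build a 5-second silent mono 16-bit clip at 24 kHz in memory for the demo.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000 * 5)
buf.seek(0)

rate, duration = clip_info(buf)
print(rate, duration, is_usable_reference(duration))  # 24000 5.0 True
```

The same `clip_info` call works on a file path, so it can also be used to confirm the sample rate of a generated .wav file.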

Speed is a big point. The project reports generation about 40 times faster than real time under certain setups, though actual speed depends on hardware.
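As a quick worked example of what "40 times faster than real time" means, the arithmetic below converts between audio length, synthesis time and the speed-up factor. The function names are made up for this sketch, and the 60-second clip is just an example figure, not a reported benchmark.

```python
def synthesis_time(audio_seconds, speedup_factor):
    """Seconds needed to generate the audio at a given speed-up over real time."""
    return audio_seconds / speedup_factor


def speedup(audio_seconds, synthesis_seconds):
    """How many times faster than real time a measured run was."""
    return audio_seconds / synthesis_seconds


# At the reported 40x, one minute of speech would take 1.5 seconds to generate.
print(synthesis_time(60.0, 40.0))  # 1.5

# Going the other way: 60 s of audio produced in 1.5 s is a 40x speed-up.
print(speedup(60.0, 1.5))          # 40.0
```

To measure this on your own hardware, time the generation call and divide the length of the output audio by the elapsed time.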

For hardware, exact VRAM needs are not clearly listed. The model files are a few GB in size, so it should run on consumer machines. Still, memory use at runtime goes beyond the size of the model file alone. A rough guess is around 4 to 6 GB of VRAM for GPU use, though that is not official. Apple Silicon support is mentioned too.

Key Features
Supported Languages
  • Afrikaans
  • Akan
  • Albanian
  • Amharic
  • Arabic
  • Aragonese
  • Armenian
  • Assamese
  • Azerbaijani
  • Bashkir
  • Basque
  • Belarusian
  • Bengali
  • Bosnian
  • Breton
  • Bulgarian
  • Burmese
  • Catalan
  • Chichewa
  • Chinese
  • Chuvash
  • Cornish
  • Croatian
  • Czech
  • Danish
  • Divehi
  • Dutch
  • English
  • Esperanto
  • Estonian
  • Filipino
  • Finnish
  • French
  • Galician
  • Ganda
  • Georgian
  • German
  • Greek
  • Gujarati
  • Haitian
  • Hausa
  • Hebrew
  • Herero
  • Hindi
  • Hungarian
  • Icelandic
  • Ido
  • Igbo
  • Indonesian
  • Interlingua
  • Inupiaq
  • Irish
  • Italian
  • Japanese
  • Javanese
  • Kannada
  • Kashmiri
  • Kazakh
  • Kinyarwanda
  • Korean
  • Kurdish
  • Lao
  • Latvian
  • Lingala
  • Lithuanian
  • Luxembourgish
  • Macedonian
  • Malagasy
  • Malay
  • Malayalam
  • Maltese
  • Manx
  • Marathi
  • Mongolian
  • Ndonga
  • Nepali
  • Norwegian
  • Norwegian Bokmål
  • Norwegian Nynorsk
  • Occitan
  • Oriya
  • Oromo
  • Ossetian
  • Pali
  • Panjabi
  • Pashto
  • Persian
  • Polish
  • Portuguese
  • Quechua
  • Romanian
  • Romansh
  • Russian
  • Sanskrit
  • Sardinian
  • Serbian
  • Shona
  • Sindhi
  • Sinhala
  • Slovak
  • Slovenian
  • Somali
  • Spanish
  • Swahili
  • Swedish
  • Tajik
  • Tamil
  • Tatar
  • Telugu
  • Thai
  • Tibetan
  • Tigrinya
  • Tswana
  • Turkish
  • Turkmen
  • Twi
  • Ukrainian
  • Urdu
  • Uzbek
  • Vietnamese
  • Welsh
  • Western Frisian
  • Wolof
  • Xhosa
  • Yiddish
  • Yoruba
  • Zulu

Where To Find OmniVoice

If you'd like to access this model, you can explore the following possibilities: