k2-fsa OmniVoice Turns Text To Speech In 600 Languages Offline

OmniVoice is an open-source text-to-speech system that converts written text into spoken audio in more than six hundred languages. The software can clone a voice from a short reference clip and lets users adjust vocal traits without specialized audio equipment.
k2-fsa released this tool to help creators generate natural speech directly on personal computers. This approach removes the need for cloud services while keeping private recordings completely offline.
Model size: 3 GB. GPU VRAM: requirements vary.
Architecture and generation modes
- Clone voices using reference clips as short as three seconds.
- Build custom speaker profiles by changing age, pitch, accent, and delivery style.
- Add sound effects like laughter or sighs directly into the input text.
- Fix difficult pronunciations using simple phonetic spelling guides.
- Process large batches of scripts across multiple graphics cards.
Content creators and independent studios will find these options useful for producing localized voiceovers without hiring large teams. The system runs entirely on personal hardware once installed, allowing users to test scripts and adjust outputs immediately before final export.
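The multi-GPU batch processing mentioned in the list above can be pictured as splitting a list of scripts into one work queue per graphics card. The following is a minimal sketch under stated assumptions: it uses only the Python standard library, the name `shard_scripts` is hypothetical, and it does not call OmniVoice or reflect how the project actually schedules work.

```python
from typing import List, Sequence


def shard_scripts(scripts: Sequence[str], num_devices: int) -> List[List[str]]:
    """Split scripts round-robin into one list per device (hypothetical helper)."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    shards: List[List[str]] = [[] for _ in range(num_devices)]
    for index, script in enumerate(scripts):
        # Script 0 goes to device 0, script 1 to device 1, and so on, wrapping around.
        shards[index % num_devices].append(script)
    return shards


if __name__ == "__main__":
    scripts = ["Line one.", "Line two.", "Line three.", "Line four.", "Line five."]
    for device_id, shard in enumerate(shard_scripts(scripts, num_devices=2)):
        # A real worker would load the model on this device and synthesize each script here.
        print(f"device {device_id}: {shard}")
```

Round-robin assignment keeps the per-device counts within one script of each other. It ignores script length, so a real scheduler might balance by character count instead.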
Technical design and performance tradeoffs
The project team replaced the conventional two-stage speech pipeline with a single-stage design that maps text straight to acoustic tokens, removing the intermediate step that slows down audio generation.
"Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens," the development team noted in a technical paper.
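To make "multi-codebook acoustic tokens" more concrete, the sketch below models one utterance as a grid of integer token IDs with one row per audio frame and one column per codebook. It is illustrative only: the class and function names, the frame rate, and the codebook count are all hypothetical values chosen for the example, not figures from the paper or the OmniVoice code.

```python
from dataclasses import dataclass
from typing import List

PLACEHOLDER_ID = -1  # hypothetical marker for a position the model has not filled yet


@dataclass
class AcousticTokenGrid:
    """One utterance as a frames x codebooks grid of integer token IDs."""

    tokens: List[List[int]]  # tokens[frame][codebook]

    @property
    def num_frames(self) -> int:
        return len(self.tokens)

    @property
    def num_codebooks(self) -> int:
        return len(self.tokens[0]) if self.tokens else 0

    @property
    def total_positions(self) -> int:
        return self.num_frames * self.num_codebooks


def empty_grid(duration_s: float, frames_per_second: int, num_codebooks: int) -> AcousticTokenGrid:
    """Allocate the grid a single-stage model would fill in directly from text."""
    num_frames = round(duration_s * frames_per_second)
    return AcousticTokenGrid([[PLACEHOLDER_ID] * num_codebooks for _ in range(num_frames)])


if __name__ == "__main__":
    # Assumed numbers for illustration: 2 seconds of audio, 25 frames/s, 8 codebooks.
    grid = empty_grid(duration_s=2.0, frames_per_second=25, num_codebooks=8)
    print(grid.num_frames, grid.num_codebooks, grid.total_positions)
```

A two-stage pipeline would first predict an intermediate token sequence and then convert it into a grid like this one. The single-stage design described in the quote predicts the grid from text without that intermediate step.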
Users should keep styling limitations in mind: the voice design feature currently performs best with English and Mandarin. In less common languages, combining specific accent prompts with synthesized speech may produce uneven pacing.
Audio producers can access the primary repository on GitHub, review the complete technical paper, or grab the model weights from Hugging Face.