Aratako Brings MioTTS Inference for Fast Local Voice Cloning

MioTTS Inference is a text-to-speech system that uses large language models to generate natural-sounding speech. The project offers multiple model sizes ranging from 0.1B to 2.6B parameters, letting users trade speed against quality to suit their hardware.

Developer Aratako created this tool to address the need for efficient, high-quality audio synthesis on consumer hardware. It includes a custom neural audio codec called MioCodec, designed to reduce latency while maintaining audio fidelity.

MioTTS Inference's key features and flexibility

  • Zero-shot voice cloning from short reference audio clips.
  • Bilingual support for English and Japanese speech synthesis.
  • Compatible with popular inference frameworks like llama.cpp, Ollama, and vLLM.
  • REST API access for easy integration into existing workflows.
  • Best-of-N selection for improved audio output quality.
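
Best-of-N selection, in general, means generating several candidates and keeping the one that scores highest under some quality measure. The article does not say how MioTTS scores its candidates, so the following is only a generic sketch: `generate` and `score` are hypothetical placeholders, not functions from the project.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[], T], score: Callable[[T], float], n: int) -> T:
    """Generate n candidates and return the one with the highest score.

    `generate` and `score` are hypothetical stand-ins: in a TTS setting,
    `generate` might synthesize one audio clip and `score` might rate it
    (higher is better). Neither reflects the actual MioTTS implementation.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    # max() returns the first candidate with the highest score on ties.
    return max(candidates, key=score)


if __name__ == "__main__":
    # Toy demonstration with numbers standing in for audio clips.
    fake_outputs = iter([3, 9, 4])
    winner = best_of_n(lambda: next(fake_outputs), lambda x: float(x), n=3)
    print(winner)
```

The trade-off is that generation cost grows linearly with N, so larger values buy quality at the expense of latency.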

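Content creators and developers working on multilingual projects can use this tool for voiceovers, dialogue generation, or accessibility features. The range of model sizes makes it practical for systems with limited resources: even the smallest model achieves a real-time factor (RTF) of 0.04 to 0.05. Since RTF is synthesis time divided by the duration of the audio produced, that corresponds to generating speech roughly 20 to 25 times faster than it plays back.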
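
For readers unfamiliar with the metric, the real-time factor is conventionally the time spent synthesizing divided by the length of the resulting audio, so values below 1.0 mean faster than real time. The short sketch below shows the arithmetic; the timing figures in the usage section are illustrative assumptions, not measurements from MioTTS.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_playback(rtf: float) -> float:
    """Return how many times faster than playback a given RTF is."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


if __name__ == "__main__":
    # Illustrative numbers only: 0.5 s of compute for a 10 s clip.
    rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=10.0)
    print(f"RTF: {rtf:.2f}")
    print(f"Speedup over playback: {speedup_over_playback(rtf):.0f}x")
```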

Developer notes and installation

The developer, who primarily works in Japanese, is actively seeking feedback on English prosody quality. Aratako explained the motivation behind the project:

'The main focus was to achieve high-fidelity audio at the 0.1B parameter scale.'

This goal led Aratako to build MioCodec as a separate component that streamlines the generation pipeline. Licenses vary by model, depending on the underlying base model: some are available under Apache 2.0, others under the Falcon-LLM or LFM Open licenses.

Installation requires cloning the repository and setting up dependencies, with flash-attention recommended for optimal performance. Users can run the speech synthesis API alongside their preferred LLM inference server, making it compatible with existing setups.

Get MioTTS Inference on GitHub or try the interactive demo on Hugging Face. Model weights are available in the MioTTS collection.