OpenMOSS MOSS-TTS Speech Studio for home GPUs

MOSS-TTS Family is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high-fidelity audio generation across complex real-world scenarios, including long-form speech, multi-speaker dialogue, voice design, sound effects, and real-time streaming text-to-speech.

The project breaks down audio generation into five production-ready models that work independently or together as a complete pipeline. This approach solves the problem of needing different tools for different audio tasks, giving creators a unified system for everything from voice cloning to environmental sound design.

Model size: 8B parameters. GPU VRAM required: 8GB.

What MOSS-TTS can do

  • Generate stable long-form speech lasting tens of minutes without quality degradation.
  • Clone voices from a single reference audio sample with high accuracy.
  • Create custom voices from text prompts without requiring reference audio.
  • Produce environmental sound effects for games, films, and interactive media.
  • Support 20 languages including Chinese, English, Japanese, German, and French.

Game developers and content creators working on limited hardware can benefit from these models. The 8B flagship model now runs on consumer GPUs with 8GB of VRAM thanks to recent optimizations, making professional-grade audio generation accessible without enterprise hardware.
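To see why fitting an 8B model into 8GB of VRAM takes optimization work, it helps to estimate how much memory the weights alone occupy at different numeric precisions. The sketch below is back-of-the-envelope arithmetic only. The precision table, the function names, and the reading of "8GB" as 8 GiB are assumptions made for illustration, and the source does not say which techniques (quantization, offloading, or something else) the project actually uses.

```python
# Rough estimate of weight memory for a model with a given parameter count.
# Assumption: exactly 8e9 parameters; activations, caches, and the audio
# codec are ignored, so real usage will be higher than these figures.

# Hypothetical precision table: bytes needed to store one parameter.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the size of the weights in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


def fits(num_params: float, bytes_per_param: float, vram_gib: float) -> bool:
    """True if the weights alone fit in the given amount of VRAM."""
    return weight_memory_gib(num_params, bytes_per_param) <= vram_gib


if __name__ == "__main__":
    params = 8e9   # 8B parameters, as stated in the article
    vram = 8.0     # assumption: "8GB" treated as 8 GiB
    for name, size in BYTES_PER_PARAM.items():
        gib = weight_memory_gib(params, size)
        verdict = "fits" if fits(params, size, vram) else "does not fit"
        print(f"{name:>10}: {gib:6.2f} GiB of weights, {verdict} in {vram:.0f} GiB")
```

At 16-bit precision the weights alone come to roughly 15 GiB, well over an 8GB card, while 8-bit storage lands just under the limit with little headroom left for everything else. This only illustrates the scale of the problem; it is not a description of how MOSS-TTS achieves its memory savings.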

Recent updates and improvements to the model

The development team has made significant strides in accessibility and performance this month. A llama.cpp implementation now enables PyTorch-free inference, allowing lightweight on-device deployment through GGUF weights and ONNX Runtime. SGLang backend support was also added, delivering roughly three times the generation throughput.

According to the project documentation, 'the 8B model fits onto 8GB GPUs' after significant VRAM optimization. Fine-tuning tutorials are now available for both the MossTTSDelay and MossTTSLocal architectures, letting users customize models for specific use cases. The models perform well on benchmarks, with MOSS-TTSD v1.0 outperforming closed-source competitors such as Doubao and Gemini 2.5 Pro in subjective evaluations.

You can find MOSS-TTS on GitHub and explore the models on the project's Hugging Face page.