MOSS-TTS-v1.5 Lands With Precise Pause Controls And 31-Language Synthesis

    
By vramkickedin | May 31, 2026 at 6:04 pm | 2 min read

MOSS-TTS-v1.5 is an upgraded open-source text-to-speech model from the OpenMOSS team, building on their earlier 1.0 release. It keeps zero-shot voice cloning, long-form generation, and multilingual capabilities while delivering more stable and natural-sounding speech. The model now supports 31 languages and adds precise controls like explicit pause markers and language tags.

The OpenMOSS team built this version to improve voice cloning consistency and multilingual performance. One focus was reliability when the reference audio is much longer than the text to be spoken. The model runs locally on consumer GPUs, so privacy-conscious users can generate speech without depending on a cloud service.

Explicit pause control and language tagging

Key improvements in v1.5

Stronger multilingual synthesis with language tags.
More consistent voice cloning across repeats.
Explicit pause markers like [pause 3.2s].
Better handling of long references with short text.
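For readers who script their voiceovers, here is a minimal Python sketch of how text with pause markers might be prepared before it is handed to the model. Only the `[pause 3.2s]` form comes from the release notes; the helper names (`pause_marker`, `join_with_pauses`, `total_pause_seconds`) are hypothetical, and the one-decimal precision and the spaces around each marker are assumptions rather than documented behavior.

```python
import re

# Matches markers in the form quoted in the release notes, e.g. "[pause 3.2s]".
PAUSE_RE = re.compile(r"\[pause (\d+(?:\.\d+)?)s\]")


def pause_marker(seconds: float) -> str:
    """Format a pause marker in the style shown above (hypothetical helper)."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    # One decimal place mirrors the example; other precisions are untested here.
    return f"[pause {seconds:.1f}s]"


def join_with_pauses(segments: list[str], seconds: float) -> str:
    """Join text segments, placing the same pause marker between each pair."""
    marker = pause_marker(seconds)
    return f" {marker} ".join(segment.strip() for segment in segments)


def total_pause_seconds(script: str) -> float:
    """Sum the durations of all pause markers found in a script."""
    return sum(float(value) for value in PAUSE_RE.findall(script))


script = join_with_pauses(["Welcome back.", "Today we test pauses."], 3.2)
print(script)
print(total_pause_seconds(script))
```

Summing the markers is a quick way to estimate how much silence a script adds on top of the spoken audio. The resulting string would then be passed to whatever synthesis call the project exposes, which is not shown here.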

MOSS-TTS-v1.5 suits serious hobbyists and small studios that need local, private text-to-speech in multiple languages. You can produce studio-quality voiceovers and character voices without sending data to external APIs, and the improved cloning stability makes it easier to keep a voice consistent across long audio projects.

Development and upgrade notes

MOSS-TTS-v1.5 uses the same API as MOSS-TTS 1.0, so upgrading should be straightforward. Installing FlashAttention 2 is optional but can speed up inference on supported GPUs. For the best multilingual results, always specify a language tag; omitting it may slightly reduce quality in some languages.
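Because FlashAttention 2 is optional, a loading script can check for it and fall back when it is missing. The sketch below assumes the model is loaded through a Hugging Face Transformers-style `from_pretrained` call that accepts an `attn_implementation` argument; that is an assumption about this project, not something the release notes confirm. `flash_attn` is the import name of the FlashAttention 2 package, and `pick_attn_implementation` is a hypothetical helper.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return an attention backend name based on what is installed.

    "flash_attention_2" and "sdpa" are values Hugging Face Transformers
    accepts for attn_implementation; whether MOSS-TTS-v1.5 takes the same
    argument is an assumption.
    """
    # find_spec returns None when the package cannot be found.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn = pick_attn_implementation()
print(f"Using attention backend: {attn}")
```

The returned string would be passed along when loading the model. A supported GPU is still required for FlashAttention 2 to help, so an installed package alone does not guarantee a speedup.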

“MOSS-TTS-v1.5 is continued from MOSS-TTS 1.0. It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching.” — Source: Reddit
