Rednote-Hilab Drops Dots.tts, a 2B-Param Speech Model That Clones Voices Natively

    
By vramkickedin | June 17, 2026 at 3:07 pm | 2 min read

Dots.tts is a new 2-billion-parameter text-to-speech model that converts text directly into high-fidelity 48 kHz audio without relying on discrete audio codec tokens. The system operates fully end-to-end, using an autoregressive backbone to predict continuous latent audio patches one step at a time. It supports zero-shot voice cloning, meaning it can mimic a speaker’s voice from just a short reference recording.

RedNote’s HiLab team released the model under the permissive Apache 2.0 license, including the full code, pretrained weights, and two specialized variants. The base checkpoint was trained on about 1.5 million hours of speech and serves as the recommended foundation for fine-tuning. The release also provides a self-corrective-aligned variant aimed at the best cloning fidelity and a MeanFlow-distilled variant that cuts inference to just 4 sampling steps.

Continuous architecture and zero-shot cloning

Key Features

2 billion parameters, end-to-end autoregressive design.
No discrete tokenizer; fully continuous latent space.
48 kHz audio synthesis with BigVGAN-style decoder.
Zero-shot voice cloning from a few seconds of audio.
BPE text input, no phoneme preprocessing needed.
Flow-matching head with classifier-free guidance control.
Fine-tuning code and three released checkpoints included.
Streaming audio generation for low-latency applications.
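
To make the "flow-matching head with classifier-free guidance" item concrete, here is a minimal sketch of how a continuous patch could be sampled by integrating a guided velocity field. It is not the dots.tts implementation: `toy_velocity`, `guided_velocity`, and `sample_patch` are hypothetical names, the velocity field is a closed-form stand-in for a learned network, and plain Euler integration is used rather than MeanFlow. The default of 4 steps only echoes the step count mentioned for the distilled checkpoint.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned flow-matching head.

    For a straight-line path from noise to a single target point, the
    ideal velocity at time t is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def guided_velocity(x, t, cond_target, uncond_target, scale):
    """Classifier-free guidance: start from the unconditional prediction
    and extrapolate toward the conditional one by `scale`."""
    v_cond = toy_velocity(x, t, cond_target)
    v_uncond = toy_velocity(x, t, uncond_target)
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def sample_patch(cond_target, uncond_target, steps=4, scale=1.0, seed=0):
    """Euler-integrate the guided velocity from t=0 (noise) toward t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in cond_target]  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # the last evaluated time is 1 - dt, so 1 - t is never zero
        v = guided_velocity(x, t, cond_target, uncond_target, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # A 2-dimensional "latent patch" for illustration only.
    print(sample_patch([1.0, -2.0], [0.0, 0.0], steps=4, scale=1.0))
    print(sample_patch([1.0, -2.0], [0.0, 0.0], steps=4, scale=2.0))
```

In this toy setting a guidance scale of 1.0 lands on the conditional target, while a larger scale pushes the sample further away from the unconditional one. In a real model the scale would typically trade off adherence to the conditioning against naturalness.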

Voice actors, content creators, and small studios can use dots.tts to quickly generate natural-sounding speech that matches a target speaker’s timbre, all on local hardware. The Apache 2.0 license permits commercial use, while the fully local pipeline helps privacy-conscious professionals keep sensitive voice data off the cloud. Fine-tuning support also lets advanced users adapt the model to niche accents or specific vocal styles.

Developer notes and known limitations

The team highlights that high-fidelity voice cloning can be misused for impersonation or disinformation, so they ask users to pair the tool with consent policies and synthetic-speech detection. On the multilingual MiniMax benchmark, dots.tts maintains strong speaker similarity across all languages, but word error rates are higher for script-divergent or under-represented languages like Arabic, Hindi, and Turkish. The current release focuses on speech only; singing and combined speech‑and‑sound generation are not covered.

"dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94% / 1.30% / 6.60% and SIM scores of 81.0 / 77.1 / 79.5 on the zh / en / zh-hard test sets, respectively." — Source: GitHub
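
For readers unfamiliar with the two metrics in that quote, the sketch below shows how they are commonly computed. It is illustrative and not the benchmark's own scoring code: WER is taken as word-level edit distance divided by reference length, and SIM is assumed to be cosine similarity between speaker embeddings. The function names are hypothetical, the embedding model is not specified in the article, and whitespace splitting would not suit Chinese text, where scoring usually works on characters.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

In the first call, one substituted word out of six gives a WER of 1/6, roughly 16.7%. The SIM figures in the quote appear to be on a 0 to 100 scale, which would correspond to the cosine value multiplied by 100.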

Project Links