Qwen Launches Qwen3 TTS Multilingual Text-to-Speech AI

    
        By vramkickedin    
     | 
    
            January 30, 2026 at 3:17 pm        
    
     | 
    
        2 min read

Qwen has introduced Qwen3 TTS, a versatile text-to-speech series trained on over 5 million hours of speech data across 10 different languages. The new AI technology delivers exceptional capabilities in voice cloning, generation, and control with unprecedented low-latency streaming performance.

Advanced Audio Capabilities

The Qwen3 TTS models showcase remarkable features that set them apart from existing text-to-speech technologies:

Support for 3-second voice cloning
Multilingual generation across 10 languages
Ultra-low first-packet latency (97-101 ms)
Natural language-based voice design and control
Streaming text input and audio output
Supports speaker-consistent multilingual generation

Model Family and How It Generates

The model family includes 5 models and a tokenizer, offering developers very flexible options for voice generation. Variants of the models include:

Qwen3-TTS-12Hz-1.7B-CustomVoice
Qwen3-TTS-12Hz-1.7B-VoiceDesign
Qwen3-TTS-12Hz-1.7B-Base
Qwen3-TTS-12Hz-0.6B-CustomVoice
and Qwen3-TTS-12Hz-0.6B-Base

'We evaluate the model’s ability to clone unseen voices by measuring content consistency—specifically Word Error Rate (WER)—on the public Seed-TTS,' says one developer, noting their Zero-Shot Speech Generation. Another states, regarding the Multilingual Speech Generation 'To assess linguistic versatility, we examine both content intelligibility and speaker similarity in a zero-shot multilingual setting using the multilingual test set'.