Qwen Launches Qwen3 ASR 1.7B with Top Accuracy

    
By vramkickedin | February 21, 2026 at 1:37 am | 3 min read

Qwen has unveiled the Qwen3-ASR family, which comprises two automatic speech recognition models, Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, plus a companion alignment model, Qwen3-ForcedAligner-0.6B. The ASR models support language identification and speech recognition for 52 languages and dialects, building on the audio understanding capabilities of the Qwen3-Omni model.

Like the Qwen3-TTS lineup, the models are released under the Apache 2.0 license. The new family targets a range of real-world scenarios, from long-form transcription to singing voice recognition. The 1.7B version targets high accuracy, while the smaller 0.6B version prioritizes efficiency for on-device applications.

Qwen3-ASR's core features & capabilities

Supports 30 languages and 22 Chinese dialects.
Runs in under 8 GB of VRAM.
Offers unified streaming and offline inference.
Recognizes complex audio types, including singing voice.
Handles audio up to 20 minutes (1,200 seconds) long.
Integrates a non-autoregressive forced alignment model for timestamp prediction in 11 languages.
Uses a dynamic flash attention window ranging from 1 s to 8 s for flexible processing.
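
To illustrate what the 20-minute limit means in practice, here is a minimal sketch of how a longer recording could be cut into pieces that each fit within it. The helper name `split_into_segments` is hypothetical and not part of any Qwen tooling, and the fixed-boundary cutting is an assumption made for simplicity.

```python
# Hypothetical helper, not part of any Qwen library: split a long recording
# into consecutive segments that each stay within the stated 1,200-second limit.

def split_into_segments(total_seconds, max_seconds=1200.0):
    """Return (start, end) pairs, in seconds, that cover the whole recording."""
    if total_seconds < 0 or max_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_seconds must be > 0")
    segments = []
    start = 0.0
    while start < total_seconds:
        # Each segment ends at the limit or at the end of the audio,
        # whichever comes first.
        end = min(start + max_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    # A 50-minute (3,000-second) recording needs three segments.
    for start, end in split_into_segments(3000):
        print(f"{start:7.1f}s -> {end:7.1f}s")
```

Cutting at fixed boundaries can split a word in half, so a real pipeline would more likely cut at pauses or overlap adjacent segments.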

Benchmark results & performance metrics

Performance testing shows that Qwen3-ASR-0.6B achieves an average time-to-first-token (TTFT) as low as 92 ms under specific concurrency settings. In high-demand environments, the compact model can transcribe 2,000 seconds of speech per second of wall-clock time at a concurrency of 128.
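
To put the throughput figure in perspective, the short sketch below converts it into a real-time factor and an average per-stream speedup. The function names are illustrative, and the backlog estimate assumes the quoted throughput is sustained, which real workloads may not achieve.

```python
# Back-of-the-envelope arithmetic on the reported figure:
# 2,000 seconds of speech per second of wall-clock time at concurrency 128.

def real_time_factor(audio_seconds, wall_seconds):
    """Wall-clock seconds spent per second of audio (lower is faster)."""
    return wall_seconds / audio_seconds


def per_stream_speedup(audio_seconds, wall_seconds, concurrency):
    """Average speed of one stream relative to real time."""
    return audio_seconds / wall_seconds / concurrency


def estimated_wall_time(backlog_seconds, throughput):
    """Idealized time to clear a backlog at a sustained throughput."""
    return backlog_seconds / throughput


if __name__ == "__main__":
    print(real_time_factor(2000, 1))            # aggregate real-time factor
    print(per_stream_speedup(2000, 1, 128))     # average speedup per stream
    print(estimated_wall_time(10 * 3600, 2000)) # seconds for a 10-hour backlog
```

Under these assumptions the aggregate real-time factor is 0.0005, each of the 128 streams runs at roughly 15.6 times real time on average, and a 10-hour backlog would take about 18 seconds.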

The architecture uses an AuT encoder, an attention-encoder-decoder (AED) based model that downsamples Fbank features by a factor of 8 to yield a 12.5 Hz token rate. This design allows for robust handling of both short chunks for streaming and long inputs for offline analysis. The training pipeline used 40 million hours of pseudo-labeled ASR data during the encoder pretraining stage alone.
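
The token rate follows from simple arithmetic, sketched below. The 100 Hz Fbank frame rate (a 10 ms hop) is an assumption inferred from the two stated numbers, since 12.5 Hz times 8 equals 100 Hz; the article does not state it directly. The function names are illustrative.

```python
# Assumption: Fbank features arrive at 100 frames per second (10 ms hop).
# This is inferred from the stated 8x downsampling and 12.5 Hz token rate.
FBANK_FRAME_RATE_HZ = 100.0
DOWNSAMPLE_FACTOR = 8


def token_rate_hz(frame_rate_hz=FBANK_FRAME_RATE_HZ, factor=DOWNSAMPLE_FACTOR):
    """Encoder output tokens per second of audio."""
    return frame_rate_hz / factor


def encoder_tokens(audio_seconds):
    """Approximate number of encoder tokens for a clip of the given length."""
    return audio_seconds * token_rate_hz()


if __name__ == "__main__":
    print(token_rate_hz())       # tokens per second of audio
    print(encoder_tokens(1))     # a 1-second attention window
    print(encoder_tokens(8))     # an 8-second attention window
    print(encoder_tokens(1200))  # the 20-minute maximum
```

By this estimate, the 1 s to 8 s attention window spans about 12.5 to 100 tokens, and a full 20-minute clip produces about 15,000 encoder tokens.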

Developer analysis & insights

The development team emphasizes the internal evaluation process used to test these models against real-world challenges. The project paper notes that the authors built internal benchmarks covering complex acoustic environments, dialects, and speech from elderly speakers and children to validate performance beyond open-source datasets.

'The experiments reveal that the 1.7B version achieves state-of-the-art performance among open-sourced ASR models and is competitive with the strongest proprietary APIs while the 0.6B version offers the best accuracy–efficiency trade-off,'

the authors state in the technical report. They further highlight that the Qwen3-ForcedAligner-0.6B

'delivers highly accurate forced-alignment timestamps and inherits the key capabilities of Qwen3-ASR, including multilingual and long-form speech support, enabling scalable labeling of speech-transcript pairs.'

Taken together, the internal benchmarks and the alignment tooling suggest a focus on practical deployment rather than benchmark scores alone.