Qwen Launches Qwen3 ASR 1.7B with Top Accuracy

Qwen has released the Qwen3-ASR family: two automatic speech recognition models, Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, plus a forced-alignment model, Qwen3-ForcedAligner-0.6B. The ASR models support language identification and speech recognition for 52 languages and dialects, building on the audio understanding capabilities of the Qwen3-Omni model.
As with the Qwen3-TTS lineup, the models are released under the Apache 2.0 license. The family targets real-world scenarios ranging from long-form transcription to singing voice recognition. The 1.7B version targets high performance, while the smaller 0.6B version prioritizes efficiency for on-device applications.
Qwen3-ASR's core features & capabilities
- Supports 30 languages and 22 Chinese dialects.
- Runs in under 8 GB of VRAM.
- Unified streaming and offline inference.
- Recognition capabilities for complex audio types.
- Up to 20 minutes (1200 seconds) of audio length.
- Integration of a non-autoregressive forced alignment model for timestamp prediction in 11 languages.
- Utilization of a dynamic flash attention window ranging from 1s to 8s for flexible processing.
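The 20-minute ceiling in the list above implies that longer recordings need to be split before transcription. Below is a minimal sketch of that planning step. The helper `plan_segments` is hypothetical (it is not part of any Qwen API), and it splits at fixed boundaries purely for illustration.

```python
from typing import List, Tuple

# 20-minute per-input limit quoted in the feature list (1200 seconds).
MAX_SEGMENT_SECONDS = 1200.0


def plan_segments(
    total_seconds: float,
    max_seconds: float = MAX_SEGMENT_SECONDS,
) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds, each span at most max_seconds long.

    Hypothetical helper: boundaries are fixed, so a word may be cut in half.
    """
    if total_seconds < 0 or max_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_seconds must be > 0")

    segments: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    # A 50-minute recording (3000 s) needs three segments.
    for start, end in plan_segments(3000.0):
        print(f"{start:7.1f} s -> {end:7.1f} s")
```

A production pipeline would more likely cut at silences or overlap neighboring segments so that words at a boundary are not lost. The fixed split here only illustrates the arithmetic of the limit.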
Benchmark results & performance metrics
Performance testing shows that Qwen3-ASR-0.6B achieves an average time-to-first-token (TTFT) as low as 92 ms under specific concurrency settings. At a concurrency of 128, this compact model can transcribe 2,000 seconds of speech per second of wall-clock time.
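To put the throughput figure in more familiar terms, the short calculation below converts it into a real-time factor (RTF) and a rough per-stream speed. It assumes the 2,000 seconds is aggregate throughput shared evenly across the 128 concurrent requests. That even split is a simplification for illustration, not something the report states.

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """RTF = processing time / audio duration (lower is faster)."""
    return wall_seconds / audio_seconds


def per_stream_speedup(aggregate_speedup: float, concurrency: int) -> float:
    """Average speed of one stream relative to real time.

    Assumption: the aggregate throughput is divided evenly across streams.
    """
    return aggregate_speedup / concurrency


if __name__ == "__main__":
    audio_seconds = 2000.0  # seconds of speech transcribed (from the article)
    wall_seconds = 1.0      # wall-clock seconds taken (from the article)
    concurrency = 128       # concurrent requests (from the article)

    rtf = real_time_factor(audio_seconds, wall_seconds)
    speedup = per_stream_speedup(audio_seconds / wall_seconds, concurrency)
    print(f"aggregate RTF: {rtf}")
    print(f"average per-stream speed: {speedup}x real time")
```

Under these assumptions the aggregate RTF is 0.0005 and each stream runs at roughly 15.6 times real time on average. Put differently, a single deployment at this load would process about 2,000 hours of audio per wall-clock hour.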
The architecture employs an AuT encoder, an attention-encoder-decoder (AED) based model that applies 8x downsampling to Fbank features, yielding a 12.5 Hz token rate. This design handles both short chunks for streaming and long audio for offline analysis. The training pipeline used 40 million hours of pseudo-labeled ASR data during the encoder pretraining stage alone.
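The 12.5 Hz figure follows directly from the downsampling factor if the Fbank features are computed at 100 frames per second, that is, with a 10 ms frame shift. That frame rate is a common default and an assumption here, since the article does not state it. A small sketch of the arithmetic:

```python
# Assumption: Fbank frames are produced every 10 ms (100 Hz); not stated in the article.
FBANK_FRAME_RATE_HZ = 100.0
# 8x downsampling, as described in the article.
DOWNSAMPLE_FACTOR = 8


def token_rate_hz(
    frame_rate_hz: float = FBANK_FRAME_RATE_HZ,
    downsample: int = DOWNSAMPLE_FACTOR,
) -> float:
    """Encoder output tokens per second of audio."""
    return frame_rate_hz / downsample


def tokens_for(duration_seconds: float) -> float:
    """Number of encoder tokens covering the given audio duration."""
    return duration_seconds * token_rate_hz()


if __name__ == "__main__":
    rate = token_rate_hz()
    print(f"token rate: {rate} Hz")
    print(f"audio per token: {1000.0 / rate} ms")
    print(f"20-minute input: {tokens_for(1200.0)} tokens")
    print(f"1 s window: {tokens_for(1.0)} tokens, 8 s window: {tokens_for(8.0)} tokens")
```

With these numbers each token covers 80 ms of audio, and a full 20-minute input corresponds to 15,000 encoder tokens. If the 1 s to 8 s attention window is measured in audio time, it spans roughly 12.5 to 100 tokens.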
Developers' analysis & insights
The development team emphasizes the internal evaluation process used to benchmark these models against real-world challenges. The project paper notes that the authors built internal benchmarks covering complex acoustic environments, dialects, and speech from elderly speakers and children to validate performance beyond open-source datasets.
'The experiments reveal that the 1.7B version achieves state-of-the-art performance among open-sourced ASR models and is competitive with the strongest proprietary APIs while the 0.6B version offers the best accuracy–efficiency trade-off,' the authors state in the technical report. They further highlight that the Qwen3-ForcedAligner-0.6B 'delivers highly accurate forced-alignment timestamps and inherits the key capabilities of Qwen3-ASR, including multilingual and long-form speech support, enabling scalable labeling of speech-transcript pairs.'
This comprehensive approach suggests a focus on practical deployment utility rather than just benchmark scores.
Learn more about Qwen3 ASR 1.7B
- Read their project paper here.
- Access the Qwen3 ASR 1.7B model on Hugging Face here.