Qwen Launches Qwen3-ASR-1.7B with Top Accuracy


Qwen has unveiled the Qwen3-ASR family: two automatic speech recognition models, Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, plus a forced-alignment model, Qwen3-ForcedAligner-0.6B. The ASR models support language identification and speech recognition for 52 languages and dialects, building on the audio understanding capabilities of the Qwen3-Omni model.

Like the Qwen3-TTS lineup, the models are released under the Apache 2.0 license. The new family targets a range of real-world scenarios, from long-form transcription to singing voice recognition. The 1.7B version targets high accuracy, while the smaller 0.6B version prioritizes efficiency for on-device applications.

Qwen3-ASR's core features & capabilities

  • Supports 30 languages and 22 Chinese dialects.
  • Runs in under 8 GB of VRAM.
  • Unified streaming and offline inference.
  • Recognition of complex audio types, such as singing voice.
  • Audio inputs of up to 20 minutes (1,200 seconds).
  • A non-autoregressive forced-alignment model for timestamp prediction in 11 languages.
  • A dynamic flash attention window ranging from 1 s to 8 s for flexible processing.
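To make the attention window concrete, here is a minimal sketch of one common way to restrict attention to a fixed time span: a block-local mask in which each token only attends to tokens in the same window. Whether Qwen3-ASR lays out its windows exactly this way is an assumption. The helper names are hypothetical, and the 12.5 Hz token rate is the figure quoted in the benchmark section.

```python
TOKEN_RATE_HZ = 12.5  # encoder token rate quoted in the article


def window_tokens(window_seconds: float, token_rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Whole number of encoder tokens covered by one window (truncated, at least 1)."""
    return max(1, int(window_seconds * token_rate_hz))


def block_attention_mask(num_tokens: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Assumption for illustration: windows are non-overlapping blocks,
    so two tokens see each other only if they fall in the same block.
    """
    return [
        [i // window == j // window for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


short_window = window_tokens(2.0)  # 2 s window -> 25 tokens
long_window = window_tokens(8.0)   # 8 s window -> 100 tokens

# 4 seconds of audio (50 tokens) split into two 2-second blocks.
mask = block_attention_mask(50, short_window)
print(short_window, long_window)
print(mask[0][24], mask[0][25])  # same block vs. next block
```

A short window keeps latency low for streaming chunks, while a long window gives each token more context for offline transcription.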

Benchmark results & performance metrics

Performance testing shows that Qwen3-ASR-0.6B achieves an average time-to-first-token (TTFT) as low as 92 ms under specific concurrency settings. Under heavy load, the compact model can transcribe 2,000 seconds of speech per second of wall-clock time at a concurrency of 128.
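As a rough worked example, the throughput figure can be converted into a real-time factor (RTF) and an average per-request speed. The sketch below uses only the numbers quoted above. The function names are hypothetical, and splitting the load evenly across all 128 requests is an assumption.

```python
# Figures quoted in the article for Qwen3-ASR-0.6B.
AUDIO_SECONDS_PER_WALL_SECOND = 2000  # aggregate throughput
CONCURRENCY = 128                     # simultaneous requests


def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    return wall_seconds / audio_seconds


def per_stream_speedup(aggregate_speedup: float, concurrency: int) -> float:
    """Average speed of one request, assuming the load is shared evenly."""
    return aggregate_speedup / concurrency


aggregate_rtf = real_time_factor(AUDIO_SECONDS_PER_WALL_SECOND, 1.0)
stream_speedup = per_stream_speedup(AUDIO_SECONDS_PER_WALL_SECOND, CONCURRENCY)

print(f"aggregate RTF: {aggregate_rtf}")                      # 0.0005
print(f"per-stream speed-up: {stream_speedup}x real time")    # 15.625x
```

In other words, under these assumptions each of the 128 concurrent requests would still run at roughly 15 times real time.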

The architecture uses an AuT encoder, an attention-encoder-decoder (AED) based model that applies 8x downsampling to Fbank features, yielding a 12.5 Hz token rate. This design handles both short chunks for streaming and long-form audio for offline analysis. The encoder pretraining stage alone used 40 million hours of pseudo-labeled ASR data.
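The 12.5 Hz figure follows from simple arithmetic if the Fbank features are computed with the conventional 10 ms frame shift (100 frames per second). That frame shift is an assumption here, since the article does not state it. The sketch below also estimates how many encoder tokens a maximum-length 20-minute input would produce.

```python
FBANK_FRAME_SHIFT_MS = 10  # assumption: conventional 10 ms hop, not stated in the article
DOWNSAMPLING_FACTOR = 8    # from the article
MAX_AUDIO_SECONDS = 1200   # from the article (20 minutes)


def token_rate_hz(frame_shift_ms: float, downsampling: int) -> float:
    """Encoder tokens per second after downsampling the Fbank frames."""
    frames_per_second = 1000 / frame_shift_ms
    return frames_per_second / downsampling


def encoder_tokens(audio_seconds: float, rate_hz: float) -> int:
    """Approximate number of encoder tokens for a clip of the given length."""
    return int(audio_seconds * rate_hz)


rate = token_rate_hz(FBANK_FRAME_SHIFT_MS, DOWNSAMPLING_FACTOR)
max_tokens = encoder_tokens(MAX_AUDIO_SECONDS, rate)

print(rate)        # 12.5
print(max_tokens)  # 15000
```

At this rate each encoder token covers 80 ms of audio, so a full 20-minute recording comes to about 15,000 encoder tokens.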

Developers' analysis & insights

The development team emphasizes the internal evaluation process used to test these models against real-world challenges. According to the technical report, the authors built internal benchmarks covering complex acoustic environments, dialects, and speech from older adults and children to validate performance beyond open-source datasets.

'The experiments reveal that the 1.7B version achieves state-of-the-art performance among open-sourced ASR models and is competitive with the strongest proprietary APIs while the 0.6B version offers the best accuracy–efficiency trade-off,'

the authors state in the technical report. They further highlight that the Qwen3-ForcedAligner-0.6B

'delivers highly accurate forced-alignment timestamps and inherits the key capabilities of Qwen3-ASR, including multilingual and long-form speech support, enabling scalable labeling of speech-transcript pairs.'

This comprehensive approach suggests a focus on practical deployment utility rather than just benchmark scores.
