Yuriyvnv Refines Dutch Speech Data With WAVe Update

By vramkickedin | March 26, 2026 at 9:49 pm | 2 min read

WAVe-1B-Multimodal-NL is a 1-billion-parameter model that checks the quality of synthetic speech at the word level. It examines how well spoken audio matches its written transcript, catching errors that sentence-level methods often miss.

Developer yuriyvnv created this tool to solve a specific problem in speech technology: synthetic audio used for training speech recognition systems often contains subtle flaws. The model was trained on CommonVoice 16.1 Dutch data, with five different corruption strategies used during training so that it learns to recognize quality issues.

How WAVe improves speech data quality

Detects mispronunciations and timing errors in synthetic speech.
Identifies prosody issues that sentence-level assessments miss.
Provides per-word quality scores for precise filtering.
Reduces ASR training steps by 34% compared to baseline methods.
Requires 30% less synthetic data while maintaining performance.
Outputs both overall quality scores and detailed alignment metrics.

Speech recognition developers working with Dutch language data can use this model to automatically filter their synthetic training datasets. Instead of manually reviewing thousands of audio samples, users can run audio through WAVe and receive a quality score between 0 and 1, with scores above 0.8 considered safe for training.
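The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own tooling: it assumes the quality scores have already been computed and stored next to each clip, and the `filter_by_quality` helper, the `path` and `score` field names, and the example file names are all hypothetical.

```python
def filter_by_quality(samples, threshold=0.8):
    """Keep samples whose overall quality score is strictly above `threshold`.

    `samples` is a list of dicts with a precomputed "score" in [0, 1].
    The dict layout is an assumption made for this sketch.
    """
    kept = []
    for sample in samples:
        score = sample["score"]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {sample['path']}: {score}")
        if score > threshold:
            kept.append(sample)
    return kept


# Hypothetical scores for three synthetic clips.
samples = [
    {"path": "clip_001.wav", "score": 0.91},
    {"path": "clip_002.wav", "score": 0.42},
    {"path": "clip_003.wav", "score": 0.80},
]

kept = filter_by_quality(samples)
print([s["path"] for s in kept])
```

Because the article describes scores "above 0.8" as safe, the comparison here is strict, so a clip scoring exactly 0.80 is dropped. Whether the boundary value should be kept is a choice left to the user.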

Behind the development

The model combines two pre-trained systems: XLM-RoBERTa for text understanding and Wav2Vec2-BERT 2.0 for audio processing. A custom alignment module connects these encoders to measure how well each word in the transcript matches its corresponding audio segment. According to the project description:
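To make the idea of word-level matching concrete, here is a toy sketch of one generic way to score each word against audio frames, using cosine similarity. It is not WAVe's custom alignment module, whose internals the article does not describe. The function names are hypothetical, the vectors are hand-written stand-ins for encoder outputs, and the sketch assumes text and audio embeddings already live in the same vector space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_scores(word_vecs, frame_vecs):
    """For each word vector, return its best cosine match over all audio frames."""
    if not frame_vecs:
        raise ValueError("frame_vecs must not be empty")
    return [max(cosine(w, f) for f in frame_vecs) for w in word_vecs]


def overall_score(word_vecs, frame_vecs):
    """Average the per-word scores into a single utterance-level score."""
    scores = word_scores(word_vecs, frame_vecs)
    return sum(scores) / len(scores)


# Toy 2-D stand-ins for text-encoder and audio-encoder outputs.
word_vecs = [[1.0, 0.0], [0.0, 1.0]]
frame_vecs = [[2.0, 0.0], [1.0, 1.0]]

print(word_scores(word_vecs, frame_vecs))
print(overall_score(word_vecs, frame_vecs))
```

In this toy case the first word has a perfectly matching frame while the second only partially matches, so its per-word score is lower. A low score for a single word is the kind of signal a sentence-level embedding would average away.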

'this model catches mispronunciations, timing errors, and prosody issues in synthetic data that sentence-level embeddings miss entirely.'

The developer notes that the model achieves a clean similarity score of 0.77 while pushing corrupt similarity down to 0.36, a gap of 0.41 that creates clear separation between good and bad samples.

WAVe-1B-Multimodal-NL is available on Hugging Face, and the source code is available on GitHub.