OpenMOSS introduces MOVA for Video Audio Sync

    
By vramkickedin | February 22, 2026 at 10:07 pm | 2 min read

OpenMOSS released MOVA on January 29, 2026. It is an open-source foundation model that generates video and audio together, in sync. The system uses a Mixture-of-Experts (MoE) architecture with 32 billion total parameters, of which 18 billion are active during inference, and it targets Image-Text to Video-Audio (IT2VA) tasks. The release includes model weights, inference code, training pipelines, and LoRA fine-tuning scripts, with the aim of advancing research in joint multimodal modeling.

MOVA's features & technical capabilities

Native bimodal generation, producing video and audio.
Available in 360p and 720p models.
Multilingual speech with state-of-the-art lip-sync.
MoE architecture with 18B active parameters out of 32B.
Asymmetric dual-tower design combining a 14B video DiT and a 1.3B audio DiT.
Fully open source!
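
The parameter counts in the list above can be cross-checked with a few lines of arithmetic. This is an illustrative sketch only: the constant and function names are invented for this example, and the inputs are the rounded figures quoted in this article, not values read from the released checkpoints.

```python
# Rounded parameter counts quoted in the article, in billions.
TOTAL_PARAMS_B = 32.0
ACTIVE_PARAMS_B = 18.0
VIDEO_DIT_B = 14.0
AUDIO_DIT_B = 1.3


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of all parameters that are active during inference."""
    return active_b / total_b


def tower_ratio(video_b: float, audio_b: float) -> float:
    """How many times larger the video tower is than the audio tower."""
    return video_b / audio_b


def active_outside_towers(active_b: float, video_b: float, audio_b: float) -> float:
    """Active parameters not covered by the two listed towers."""
    return active_b - (video_b + audio_b)


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    ratio = tower_ratio(VIDEO_DIT_B, AUDIO_DIT_B)
    rest = active_outside_towers(ACTIVE_PARAMS_B, VIDEO_DIT_B, AUDIO_DIT_B)
    print(f"active share of parameters: {share:.2%}")
    print(f"video tower vs audio tower: {ratio:.1f}x")
    print(f"active parameters outside the two towers: {rest:.1f}B")
```

On these figures, a little over half of the parameters are active per step, and the video tower is roughly 10.8 times the size of the audio tower, which is what "asymmetric" refers to. The two listed towers add up to 15.3B, so about 2.7B of the 18B active parameters sit elsewhere; the article does not break that remainder down.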

Benchmark Results & Performance Metrics

Independent evaluations on Verse-Bench report that MOVA-720p achieves an LSE-D score of 7.094 and an LSE-C score of 7.452 when using Dual CFG. For these lip-sync metrics, a lower LSE-D and a higher LSE-C indicate better synchronization. The model processes video latents and audio latents through separate VAEs, specifically the Wan2.1 video VAE and a DAC-style audio VAE from HunyuanVideo-Foley, to ensure high-fidelity output.

Inference benchmarks on an RTX 4090 GPU show that component-wise offloading requires 48 GB of VRAM at 37.5 seconds per step, whereas layerwise group offloading cuts VRAM usage to approximately 12 GB but lengthens the step time to 42.3 seconds. The data pipeline retains 26.39% of the raw video duration after quality filtering and alignment checks.
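
To put the offloading trade-off and the retention rate into concrete terms, here is a small sketch that works through the numbers quoted above. The class and function names are hypothetical, and the step count and raw-footage hours in the usage section are assumptions chosen for illustration, not values from the MOVA release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffloadProfile:
    name: str
    vram_gb: float
    step_seconds: float


# Figures quoted in the article for an RTX 4090.
COMPONENT_WISE = OffloadProfile("component-wise", 48.0, 37.5)
LAYERWISE_GROUP = OffloadProfile("layerwise group", 12.0, 42.3)

# Share of raw video duration kept by the data pipeline (26.39%).
RETENTION_RATE = 0.2639


def vram_saving(baseline: OffloadProfile, candidate: OffloadProfile) -> float:
    """Fraction of VRAM saved by the candidate relative to the baseline."""
    return 1.0 - candidate.vram_gb / baseline.vram_gb


def slowdown(baseline: OffloadProfile, candidate: OffloadProfile) -> float:
    """Fractional increase in step time relative to the baseline."""
    return candidate.step_seconds / baseline.step_seconds - 1.0


def total_minutes(profile: OffloadProfile, num_steps: int) -> float:
    """Wall-clock minutes for num_steps sampling steps."""
    return profile.step_seconds * num_steps / 60.0


def retained_hours(raw_hours: float, rate: float = RETENTION_RATE) -> float:
    """Hours of video left after filtering and alignment checks."""
    return raw_hours * rate


if __name__ == "__main__":
    steps = 50          # assumed step count, for illustration only
    raw_hours = 1000.0  # assumed amount of raw footage, for illustration only

    print(f"VRAM saved: {vram_saving(COMPONENT_WISE, LAYERWISE_GROUP):.0%}")
    print(f"step-time increase: {slowdown(COMPONENT_WISE, LAYERWISE_GROUP):.1%}")
    for profile in (COMPONENT_WISE, LAYERWISE_GROUP):
        minutes = total_minutes(profile, steps)
        print(f"{profile.name}: {minutes:.2f} min for {steps} steps")
    print(f"retained footage: {retained_hours(raw_hours):.1f} of {raw_hours:.0f} hours")
```

On the quoted figures, layerwise group offloading cuts VRAM use by 75% in exchange for a step that is about 13% slower, so it is the natural choice when memory, not time, is the constraint.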

Insights from the OpenMOSS team

The technical report highlights a fundamental issue in current generation pipelines:

'Audio is indispensable for real-world video, yet generation models have largely overlooked audio components.'

The authors argue that

'current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality.'

To address this, OpenMOSS designed an asymmetric dual-tower architecture that couples pre-trained video and audio towers. This allows the model to generate high-quality, synchronized content while maintaining efficient interaction between modalities.