OpenMOSS introduces MOVA for Video Audio Sync

OpenMOSS released MOVA on January 29, 2026, an open-source foundation model designed to generate video and its synchronized audio together. The system uses a Mixture-of-Experts (MoE) architecture with 32 billion total parameters, 18 billion of which are active during inference, and targets Image-Text to Video-Audio (IT2VA) tasks. The release includes model weights, inference code, training pipelines, and LoRA fine-tuning scripts, with the aim of advancing research in joint multimodal modeling.
MOVA's features & technical capabilities
- Native bimodal generation, producing video and audio.
- Available in 360p and 720p models.
- Multilingual speech with state-of-the-art lip-sync.
- MoE architecture with 18B active parameters out of 32B.
- Asymmetric dual-tower design combining a 14B video DiT and a 1.3B audio DiT.
- Fully open source!
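The parameter figures in the list above imply that a little over half of the model is exercised on each inference pass. A minimal sketch of that arithmetic, using only the 32B total and 18B active figures quoted in this article (the constant and function names are illustrative and not taken from MOVA's code):

```python
# Figures quoted in the article; names are illustrative, not MOVA's code.
TOTAL_PARAMS_B = 32.0   # total parameters, in billions
ACTIVE_PARAMS_B = 18.0  # parameters active during inference, in billions


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of parameters that are active per inference pass."""
    if total_b <= 0:
        raise ValueError("total parameter count must be positive")
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
inactive_b = TOTAL_PARAMS_B - ACTIVE_PARAMS_B

print(f"Active share: {fraction:.2%}")                       # Active share: 56.25%
print(f"Inactive per pass: {inactive_b:.0f}B parameters")    # Inactive per pass: 14B parameters
```

This is why an MoE model's compute cost per step tracks the active count rather than the total, even though all 32B parameters still have to be stored somewhere.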
Benchmark Results & Performance Metrics
Independent evaluations on Verse-Bench show that MOVA-720p achieves an LSE-D score (lip-sync error distance, lower is better) of 7.094 and an LSE-C score (lip-sync error confidence, higher is better) of 7.452 when using Dual CFG. The model encodes video and audio latents with separate VAEs, namely the Wan2.1 video VAE and a DAC-style audio VAE from HunyuanVideo-Foley, to ensure high-fidelity output.
Inference benchmarks on an RTX 4090 GPU show that component-wise offloading requires 48 GB of VRAM at 37.5 seconds per step, whereas layerwise group offloading cuts VRAM usage to approximately 12 GB but lengthens the step time to 42.3 seconds. On the data side, the pipeline retains 26.39% of raw video duration after quality filtering and alignment checks.
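To make the trade-off between the two offloading modes concrete, the sketch below computes the relative VRAM saving and per-step slowdown from the figures reported above. The class and function names are illustrative, and the 50-step sampling run is a hypothetical value chosen only for the example; this article does not state how many sampling steps MOVA uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffloadProfile:
    """Reported inference figures for one offloading mode."""
    name: str
    vram_gb: float
    step_seconds: float


# Figures as reported in the article.
component_wise = OffloadProfile("component-wise", vram_gb=48.0, step_seconds=37.5)
layerwise_group = OffloadProfile("layerwise group", vram_gb=12.0, step_seconds=42.3)


def relative_change(baseline: float, value: float) -> float:
    """Signed change of `value` relative to `baseline` (-0.75 means 75% lower)."""
    return (value - baseline) / baseline


vram_change = relative_change(component_wise.vram_gb, layerwise_group.vram_gb)
step_change = relative_change(component_wise.step_seconds, layerwise_group.step_seconds)

# Hypothetical step count, for illustration only (not from the article).
ASSUMED_STEPS = 50
extra_minutes = (
    ASSUMED_STEPS * (layerwise_group.step_seconds - component_wise.step_seconds) / 60
)

print(f"VRAM change: {vram_change:+.1%}")        # VRAM change: -75.0%
print(f"Step time change: {step_change:+.1%}")   # Step time change: +12.8%
print(f"Extra time over {ASSUMED_STEPS} steps: {extra_minutes:.1f} min")
```

On these numbers, layerwise group offloading trades a 75% reduction in VRAM for a step that is about 12.8% slower.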
Insights from the OpenMOSS team
The technical report highlights a fundamental issue in current generation pipelines:
'Audio is indispensable for real-world video, yet generation models have largely overlooked audio components.'
The authors argue that 'current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality.'
To address this, OpenMOSS designed an asymmetric dual-tower architecture that couples pre-trained video and audio towers. This allows the model to generate high-quality, synchronized content while maintaining efficient interaction between modalities.
Start using MOVA now
- Read MOVA's paper.
- Access both models on Hugging Face.
- Check out MOVA's GitHub.