LongCat-Video-Avatar-1.5 Materializes Studio-Quality Talking Heads Locally


LongCat-Video-Avatar-1.5 is a new open-source model for generating talking avatar videos from audio paired with text or image references. It produces realistic human characters, stylized animations, and coordinated multi-person conversations with synchronized lip movements. The release puts a premium on production-grade stability and fast local inference.

The Meituan LongCat team built this upgraded framework to improve audio-driven video synthesis. It swaps the previous audio encoder for Whisper-Large, delivering noticeably smoother lip dynamics. The project supports local deployment for tasks like Audio-Text-to-Video, Audio-Image-to-Video, and video continuation.

Stable lip sync and production-ready avatars

Key Features
  • Enhanced lip sync using the Whisper-Large audio encoder.
  • Full-body stability and strict identity consistency.
  • Works with anime, animals, and multi-person scenes.
  • Fast 8-step inference with DMD2 distillation.
  • Supports audio-text-to-video and audio-image-to-video workflows.
  • Handles single and dual-stream audio inputs.

Small studios and privacy-conscious creators can generate lifelike avatar videos without relying on cloud services. Hobbyists with capable GPUs benefit from INT8 quantization, which lowers memory use, and from step distillation, which cuts the number of sampling steps per clip. Marketing teams, educators, and content producers will find it practical for consistent branded or training videos.
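To make the INT8 point concrete, here is a rough back-of-the-envelope sketch of how weight precision affects memory. The parameter count below is a hypothetical placeholder, not a figure from the release, and the estimate covers weights only. Activations, quantization scales, and the audio encoder would add to it.

```python
# Bytes needed to store a single weight at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # Hypothetical parameter count, chosen only for illustration.
    example_params = 10_000_000_000
    for dtype in ("fp32", "bf16", "int8"):
        gib = weight_memory_gib(example_params, dtype)
        print(f"{dtype}: {gib:.1f} GiB")
```

Whatever the real model size turns out to be, INT8 weights take half the space of BF16 weights, which is often the difference between fitting on a single consumer GPU and not.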

What developers should know

Version 1.5 replaces the older Wav2Vec2 audio encoder with OpenAI’s Whisper-Large, significantly improving lip sync quality. The model hasn’t been exhaustively evaluated for every downstream application, so developers should verify safety and accuracy in their own scenarios. It is released under the MIT License, but the license explicitly does not grant any rights to Meituan trademarks or patents.
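For readers wondering how an audio encoder drives lip motion, the following is a minimal sketch of one common way to align encoder features with video frames. It is not taken from the LongCat code. The only fixed fact used is that Whisper's encoder emits one feature vector per 20 ms of audio (50 per second). The function name, the 25 fps frame rate, and the context width are assumptions made for illustration.

```python
# Whisper's encoder produces one feature vector per 20 ms of audio.
WHISPER_FEATURES_PER_SECOND = 50


def audio_window_for_frame(frame_index: int, video_fps: float, context: int = 2) -> range:
    """Return the encoder feature indices centered on a video frame's timestamp.

    `video_fps` and `context` are illustrative assumptions, not LongCat settings.
    """
    # Convert the frame's timestamp into the nearest audio feature index.
    center = round(frame_index * WHISPER_FEATURES_PER_SECOND / video_fps)
    # Clamp at zero so the first frames do not request negative indices.
    start = max(0, center - context)
    return range(start, center + context + 1)


if __name__ == "__main__":
    # Frame 10 at an assumed 25 fps sits at 0.4 s, i.e. feature index 20.
    print(list(audio_window_for_frame(10, video_fps=25)))  # [18, 19, 20, 21, 22]
```

Giving each frame a small window of neighboring audio features, rather than a single vector, lets a model see the sounds just before and after the frame, which is one reason encoder quality matters for smooth lip dynamics.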

“We are excited to announce the release of LongCat-Video-Avatar 1.5, an upgraded open-source framework that prioritizes extreme empirical optimization and production-readiness for audio-driven human video generation.” — Source: Hugging Face