daVinci-MagiHuman Conjures Expressive Talking Videos from Text

daVinci-MagiHuman is an open-source audio-video generation model that creates synchronized video and audio content from text prompts. The model uses a single-stream Transformer architecture to process text, video, and audio together through self-attention, avoiding the complexity of multi-stream designs.

Developed by GAIR and Sand.ai, this 15-billion-parameter model focuses on human-centric generation tasks. It produces expressive facial performances, natural speech coordination, realistic body motion, and precise audio-video synchronization while supporting multiple languages.

Model size: 15B parameters. GPU VRAM: requirements vary.

daVinci-MagiHuman key features

  • Generates synchronized video and audio from text prompts using a unified architecture.
  • Supports six languages: Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French.
  • Produces a 5-second 256p video in 2 seconds on a single H100 GPU.
  • Upscales content to 1080p resolution in approximately 38 seconds.
  • Achieves an 80% win rate against Ovi 1.1 and a 60.9% win rate against LTX 2.3 in human evaluations.
  • Includes base model, distilled model, super-resolution model, and inference code.
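
To put the speed figures above in perspective, the short Python sketch below converts them into a real-time factor, meaning seconds of video produced per second of wall-clock time. The function name is made up for illustration, and the sketch assumes the 38-second figure refers to the same 5-second clip, which the list does not state outright.

```python
def realtime_factor(clip_seconds: float, wall_seconds: float) -> float:
    """Seconds of video produced per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return clip_seconds / wall_seconds


CLIP_SECONDS = 5.0  # clip length quoted for the 256p figure

# 5 s of video in 2 s of compute: faster than real time.
rtf_256p = realtime_factor(CLIP_SECONDS, 2.0)

# Assumption: the ~38 s figure covers the same 5 s clip at 1080p.
rtf_1080p = realtime_factor(CLIP_SECONDS, 38.0)

print(f"256p:  {rtf_256p:.2f}x real time")   # 2.50x
print(f"1080p: {rtf_1080p:.2f}x real time")  # 0.13x
```

Under that assumption, the low-resolution pass runs 2.5 times faster than playback, while the 1080p path takes roughly 7.6 seconds of compute per second of video.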

Content creators working on avatar-style videos or multilingual presentations may find this tool useful for generating realistic talking-head content. The model's ability to coordinate facial expressions with spoken dialogue makes it suitable for applications requiring natural human performance, such as educational materials, virtual presenters, or dubbed content across supported languages.

Architecture design choices

The research team designed daVinci-MagiHuman with a 'sandwich architecture' where the first and last 4 layers use modality-specific projections while the middle 32 layers share parameters across modalities. The model does not use explicit timestep embeddings, instead inferring the denoising state directly from input latents. This approach simplifies training and inference infrastructure.
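
The layer layout described above can be sketched in a few lines of plain Python. This is only an illustration of the bookkeeping, not the project's code: the class and function names are hypothetical, and it reads "first and last 4 layers" as 4 layers at each end, for 40 layers in total.

```python
from dataclasses import dataclass
from typing import List

MODALITIES = ("text", "video", "audio")


@dataclass(frozen=True)
class LayerSpec:
    index: int
    shared: bool  # True: one set of projections serves every modality


def build_sandwich(n_edge: int = 4, n_shared: int = 32) -> List[LayerSpec]:
    """Edge layers are modality-specific; the middle block is shared."""
    total = 2 * n_edge + n_shared
    return [
        LayerSpec(i, shared=(n_edge <= i < n_edge + n_shared))
        for i in range(total)
    ]


def projection_key(layer: LayerSpec, modality: str) -> str:
    """Name of the projection set a token of this modality would use."""
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    suffix = "shared" if layer.shared else modality
    return f"layer{layer.index}.{suffix}"


layers = build_sandwich()

# Distinct projection sets: 8 edge layers x 3 modalities + 32 shared layers.
n_projection_sets = len(
    {projection_key(layer, m) for layer in layers for m in MODALITIES}
)

print(len(layers), n_projection_sets)  # 40 56
```

In this reading, audio, video, and text tokens only diverge at the edges of the stack, and the bulk of the network treats them as one sequence.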

For efficiency, the developers combined several techniques including model distillation, latent-space super-resolution, and a Turbo VAE decoder. The paper notes that the single-stream design 'avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure.'

Users should note that first-time runs will be slower due to model compilation and cache warmup before reaching the reported inference speeds.
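
Anyone benchmarking the model locally may want to exclude those warmup runs from their measurements. The helper below is a generic timing sketch, not part of the daVinci-MagiHuman codebase; `fn` stands in for whatever generation call you use, and the default warmup and run counts are arbitrary.

```python
import time
from statistics import median
from typing import Callable, List


def time_steady_state(
    fn: Callable[[], object], warmup: int = 2, runs: int = 5
) -> float:
    """Median wall-clock seconds per call, ignoring the first `warmup` calls."""
    if warmup < 0 or runs < 1:
        raise ValueError("warmup must be >= 0 and runs must be >= 1")

    # Untimed calls absorb one-off costs such as compilation and cache fills.
    for _ in range(warmup):
        fn()

    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


# Stand-in workload; replace the lambda with a real generation call.
seconds = time_steady_state(lambda: sum(range(100_000)), warmup=1, runs=3)
print(f"steady-state median: {seconds:.4f} s")
```

For GPU workloads, the timed call should block until the device has finished (for example by synchronizing inside `fn`), otherwise the clock may stop before the work is done.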

Check out the code for daVinci-MagiHuman on GitHub or download the model weights from Hugging Face. Read the full research paper on arXiv.