NVlabs LongLive: Real-Time AI Video Magic

Technical demonstration of LongLive frame sink

LongLive is a framework from NVlabs that generates long videos in real time with interactive prompting. The system produces videos up to 240 seconds while maintaining visual consistency across frames.

Built on a frame-level autoregressive architecture, LongLive achieves 20.7 FPS on a single NVIDIA H100 GPU. The model was trained in just 32 GPU-days, a notably efficient timeline for video generation systems.
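
As a rough back-of-the-envelope check on these figures, the sketch below converts the reported 20.7 FPS into a per-frame time budget and estimates how long a clip would take to generate. The 16 FPS playback rate is an assumption made purely for illustration, since the output frame rate is not stated here, and the function names are hypothetical.

```python
GEN_FPS = 20.7        # reported generation throughput on a single H100
PLAYBACK_FPS = 16.0   # ASSUMPTION: output frame rate, not stated in the article


def frame_budget_ms(gen_fps: float) -> float:
    """Average wall-clock time available per generated frame, in milliseconds."""
    return 1000.0 / gen_fps


def generation_time_s(video_seconds: float, playback_fps: float, gen_fps: float) -> float:
    """Wall-clock seconds needed to generate a clip of the given length."""
    total_frames = video_seconds * playback_fps
    return total_frames / gen_fps


if __name__ == "__main__":
    print(f"{frame_budget_ms(GEN_FPS):.1f} ms per frame")
    print(f"{generation_time_s(240, PLAYBACK_FPS, GEN_FPS):.1f} s to generate a 240 s clip")
```

Under that assumption, generation keeps pace with playback whenever the generation rate is at least the playback rate, which is the practical meaning of "real time" in this context.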

Model size: 1.3B parameters. GPU VRAM: requirements vary.

Key framework capabilities

  • Real-time video generation at 20.7 frames per second
  • Support for videos up to 240 seconds in length
  • Dynamic prompt switching during generation
  • Self-training method requiring no video datasets
  • Smooth transitions between user prompts mid-video
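
To make dynamic prompt switching concrete, here is a minimal sketch of a prompt schedule: a sorted list of (start time, prompt) pairs, with a lookup that returns whichever prompt is active at a given moment in the video. The `prompt_at` helper and the schedule format are hypothetical and are not part of LongLive's actual interface.

```python
from bisect import bisect_right


def prompt_at(schedule, t):
    """Return the prompt active at time t (seconds).

    schedule: list of (start_second, prompt) pairs, sorted by start_second.
    """
    starts = [start for start, _ in schedule]
    idx = bisect_right(starts, t) - 1  # last entry whose start is <= t
    if idx < 0:
        raise ValueError("t is earlier than the first scheduled prompt")
    return schedule[idx][1]


# Illustrative schedule: the user changes the prompt twice mid-video.
schedule = [
    (0.0, "a cat walking through a garden"),
    (20.0, "the cat starts to run"),
    (45.0, "the cat jumps onto a wall"),
]

if __name__ == "__main__":
    for t in (0.0, 19.9, 20.0, 100.0):
        print(t, "->", prompt_at(schedule, t))
```

In an interactive session the schedule would not be known in advance; new entries would be appended as the user types, and the generator would pick up the latest prompt on its next frame.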

Researchers and developers experimenting with interactive video applications can use LongLive for real-time content generation. The self-training approach also makes it useful for teams without access to large video datasets.

How LongLive Works Differently

The research team notes that LongLive "relies only on a set of prompts to teach the model with interaction ability." This eliminates the need for curated video training data.

The framework uses three technical approaches: a KV-recache mechanism for prompt changes, streaming long tuning to align training with inference, and short window attention with frame-level sinks. User studies showed a 41% improvement in overall video quality compared to baseline methods.
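
The toy sketch below illustrates two of these ideas under stated assumptions. It keeps a bounded cache made of a few permanent "sink" frames plus a short rolling window of recent frames, and on a prompt change it recomputes the cached entries under the new prompt instead of discarding them. The `encode` function is a stand-in for the real key/value computation, all class and method names are hypothetical, and whether LongLive also refreshes its sink entries on a prompt switch is an assumption here. This is not the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CacheEntry:
    frame: int    # index of the frame this entry belongs to
    prompt: str   # prompt the entry was computed under


def encode(frame: int, prompt: str) -> CacheEntry:
    """Stand-in for computing a frame's keys/values conditioned on a prompt."""
    return CacheEntry(frame, prompt)


class RollingCache:
    """Frame-level sinks plus a short window of the most recent frames."""

    def __init__(self, window: int, num_sink: int):
        self.num_sink = num_sink
        self.sinks = []                     # earliest frames, never evicted
        self.recent = deque(maxlen=window)  # oldest entry drops out automatically

    def append(self, frame: int, prompt: str) -> None:
        entry = encode(frame, prompt)
        if len(self.sinks) < self.num_sink:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def visible(self):
        """Frame indices the next frame would attend to."""
        return [e.frame for e in self.sinks] + [e.frame for e in self.recent]

    def recache(self, new_prompt: str) -> None:
        """Recompute every cached entry under the new prompt (ASSUMED behaviour)."""
        self.sinks = [encode(e.frame, new_prompt) for e in self.sinks]
        self.recent = deque(
            (encode(e.frame, new_prompt) for e in self.recent),
            maxlen=self.recent.maxlen,
        )


if __name__ == "__main__":
    cache = RollingCache(window=3, num_sink=1)
    for f in range(11):
        cache.append(f, "a cat walking")
    print(cache.visible())      # sink frame plus the three most recent frames
    cache.recache("a cat running")
    print({e.prompt for e in cache.sinks + list(cache.recent)})
```

The cache never holds more than `num_sink + window` entries regardless of video length, which is what keeps per-frame cost flat. Recomputing rather than clearing the cache keeps the same frames in context across a prompt change while dropping stale conditioning on the old prompt.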

LongLive is released under a CC-BY-NC 4.0 license for non-commercial research and development.

Read the research paper on arXiv, explore the Hugging Face model page, or visit the GitHub repository.