Lance Unifies Image And Video Generation And Editing In One Lightweight Model

Lance is a new open-source AI model that handles image and video understanding, generation, and editing in a single model. It was trained entirely from scratch on only 128 A100 GPUs, a modest budget for a model of this scope. With just 3 billion active parameters, it performs competitively against much larger models.
Researchers at ByteDance created Lance to unify multiple media tasks without relying on massive model size or separate tools. They designed a dual-stream architecture with decoupled pathways so that understanding and generation skills can improve together without interfering with each other. The team also released ready-to-use scripts, benchmarks, and a Gradio demo to make testing easier.
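The idea of decoupled pathways can be illustrated with a toy sketch. This is not Lance's implementation: the stream names, the `make_linear` helper, and the `dual_stream_forward` function are all hypothetical, and the "pathways" are trivial per-token transforms standing in for real network blocks. It only shows the routing principle, where each token is processed by the pathway that matches its stream so the two sets of weights never share updates.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Pathway = Callable[[Vector], Vector]


def make_linear(scale: float, bias: float) -> Pathway:
    """Return a toy per-token transform standing in for one pathway's weights."""
    def apply(x: Vector) -> Vector:
        return [scale * v + bias for v in x]
    return apply


def dual_stream_forward(
    tokens: Sequence[Vector],
    streams: Sequence[str],
    pathways: Dict[str, Pathway],
) -> List[Vector]:
    """Route every token through the pathway named by its stream tag."""
    if len(tokens) != len(streams):
        raise ValueError("tokens and streams must have the same length")
    outputs: List[Vector] = []
    for token, stream in zip(tokens, streams):
        if stream not in pathways:
            raise KeyError(f"unknown stream: {stream}")
        outputs.append(pathways[stream](token))
    return outputs


# Hypothetical pathways: separate parameters for understanding and generation.
PATHWAYS: Dict[str, Pathway] = {
    "understanding": make_linear(2.0, 0.0),
    "generation": make_linear(1.0, -1.0),
}

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
streams = ["understanding", "generation", "understanding"]
outputs = dual_stream_forward(tokens, streams, PATHWAYS)
print(outputs)
```

Because each stream owns its own transform, changing the generation pathway in this sketch leaves the outputs of understanding tokens untouched, which is the property the decoupled design is meant to provide.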
What Lance can do
- Text-to-image and text-to-video generation.
- Image and video editing with natural language.
- Visual question answering for images and videos.
- Detailed video captioning and understanding.
- Only 3B active parameters with strong benchmark scores.
- Trained from scratch on just 128 A100 GPUs.
- Dual-stream MoE reduces task interference.
- Ready-to-run scripts and Gradio demo.
This release is useful for developers and researchers who want a single lightweight model to handle multiple visual AI tasks. Hobbyists with capable hardware can experiment with image and video tools without juggling separate models. With a 40GB VRAM requirement and an open-source license, it is within reach for local and private AI workflows.
How it performs and what’s next
The model outperforms other open-source unified models at its scale, especially on video generation benchmarks such as VBench, where it scored 85.11. Lance does not rely on an external LLM for prompt rewriting, which simplifies the pipeline compared to some competitors. The training recipe uses a staged approach and modality-aware positional encoding to reduce interference between different visual tokens.
"Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities." — Source: arXiv Paper
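As a rough illustration of what modality-aware positional encoding can mean, the sketch below tags every token with its modality and a (time, height, width) index, so text, image, and video tokens do not share a single flat position counter. The layout is an assumption chosen for illustration, not the scheme described in the Lance paper, and the names `text_positions`, `visual_positions`, and `build_positions` are hypothetical.

```python
from typing import List, Tuple

# (modality, t, h, w): a hypothetical position record for one token.
Position = Tuple[str, int, int, int]


def text_positions(n_tokens: int, start: int = 0) -> List[Position]:
    """Text tokens advance along the time axis only."""
    return [("text", start + i, 0, 0) for i in range(n_tokens)]


def visual_positions(
    modality: str, frames: int, height: int, width: int, start: int = 0
) -> List[Position]:
    """Visual tokens get a frame index plus a 2D grid index within the frame."""
    return [
        (modality, start + t, h, w)
        for t in range(frames)
        for h in range(height)
        for w in range(width)
    ]


def build_positions(n_text: int, frames: int, height: int, width: int) -> List[Position]:
    """Positions for a text prompt followed by one image (frames == 1) or video."""
    if n_text < 0 or frames < 1 or height < 1 or width < 1:
        raise ValueError("n_text must be >= 0; frames, height, width must be >= 1")
    modality = "image" if frames == 1 else "video"
    text = text_positions(n_text)
    # Visual tokens start after the text so the time axis stays monotonic.
    visual = visual_positions(modality, frames, height, width, start=n_text)
    return text + visual


positions = build_positions(n_text=2, frames=2, height=1, width=2)
for p in positions:
    print(p)
```

In a real model these records would feed an embedding or rotary scheme rather than being printed. The point of the sketch is only that the modality tag and the separate spatial axes let the model tell an image patch from a video patch or a word at the same sequence offset.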