Fudan-FUXI Unveils Omni-Video 2 AI Tool

Omni-Video 2 is a unified video editing and generation framework that combines a text-to-video diffusion model with vision-language understanding. The system can generate videos from text descriptions and edit existing footage with precise control over changes. It supports text-to-video creation, video-to-video editing, and mixed-condition generation all within a single pipeline.
Fudan-FUXI developed this tool to solve a common problem in video AI: turning short, simple prompts into detailed, accurate edits. The model uses a vision-language component that reads source videos and editing instructions, then predicts exactly what the final result should look like. This approach converts vague requests into specific instructions about content, attributes, and motion changes.
Model size: 69.2GB. GPU VRAM requirements vary.
What Omni-Video 2 can do
- Generate videos from text descriptions with high-quality output
- Edit existing videos with precise control over specific elements
- Remove or add objects while preserving background details
- Change backgrounds and environments smoothly
- Handle complex motion editing across multiple frames
- Process multi-element transformations including lighting and appearance changes
Video editors and content creators working on complex projects may find this tool useful for making detailed changes without manually editing each frame. The ability to understand and execute compositional instructions means users can request multiple changes at once, such as adjusting lighting while also modifying specific objects, rather than running separate edits sequentially.
Technical design choices
The team built Omni-Video 2 to scale efficiently by connecting pretrained multimodal language models directly to video diffusion models. A lightweight adapter injects conditional tokens into the system, allowing it to reuse existing generative capabilities without requiring a complete rebuild. According to the researchers:
'we scale up Omni-Video 2 to a 14B video diffusion model on meticulously curated training data with quality.'
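To make the adapter idea concrete, here is a minimal sketch in plain Python of how a small projection layer could map language-model output tokens into the width a diffusion model expects for conditioning, then append them to the existing text-conditioning tokens. This is not the authors' implementation: the function names, the dimensions, the random weights, and the choice to concatenate tokens are all assumptions made for illustration.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Return a random (in_dim x out_dim) weight matrix as nested lists."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]

def project(tokens, weight):
    """Apply the same linear map to every token vector."""
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]

def build_condition(text_tokens, vlm_tokens, adapter_weight):
    """Project VLM tokens to the conditioning width and append them.

    Concatenation is an assumed injection scheme, chosen for simplicity.
    """
    adapted = project(vlm_tokens, adapter_weight)
    return text_tokens + adapted

# Hypothetical sizes, far smaller than any real model.
VLM_DIM, COND_DIM = 8, 4
adapter = make_linear(VLM_DIM, COND_DIM)
text_tokens = [[0.0] * COND_DIM for _ in range(3)]  # 3 text-conditioning tokens
vlm_tokens = [[1.0] * VLM_DIM for _ in range(2)]    # 2 language-model output tokens

condition = build_condition(text_tokens, vlm_tokens, adapter)
print(len(condition), len(condition[0]))  # 5 4
```

The point of the sketch is that only the small adapter matrix is new; both pretrained models keep their own weights, which is what lets such a design reuse existing generative capabilities.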
The model was tested on the FiVE benchmark for fine-grained editing and VBench for generation tasks, showing strong performance in following complex instructions while maintaining competitive quality for video generation.
The 69.2GB model size means users will need substantial storage and likely a high-end GPU for local inference. OmniVideo2-A14B is available on Hugging Face, with full details on the project page and in the paper on arXiv.
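For a rough sense of scale, the arithmetic below estimates how much memory the raw weights of a 14B-parameter model occupy at common numeric precisions. It is a back-of-the-envelope sketch, not a published requirement: it assumes the weights are stored densely at a single precision, and it ignores every other component in the checkpoint as well as the activation memory used during inference.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for raw weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

PARAMS = 14e9  # the 14B figure quoted for the video diffusion model

# Bytes per parameter for a few common storage precisions.
for name, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gb(PARAMS, nbytes):.0f} GB")
# fp32: 56 GB
# fp16/bf16: 28 GB
# int8: 14 GB
```

Actual VRAM use depends on the inference code, resolution, clip length, and any offloading or quantization, so these figures are best read as a floor for the diffusion weights alone.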