Keye-VL-2.0-30B-A3B Brings Native Agent Tools To Long Video Ai

Futuristic hourglass with sparse glowing nodes flowing through wireframe polygons.

Keye-VL-2.0-30B-A3B is a new open-source multimodal model designed to understand long videos and perform agent tasks like code execution and web search. It uses a sparse attention mechanism called DSA to process up to 256,000 tokens efficiently, keeping reasoning accurate across hour‑long footage. The model also integrates built‑in tool use, making it the first in the Keye series to ship with native agent capabilities.

Kwai‑Keye built this 30B‑class flagship by combining a custom vision encoder with a large language model optimized via custom kernels and parallelism. The team curated a data pipeline that strengthens OCR, chart, and table understanding, while synthetic chain‑of‑thought data improves reasoning continuity. In benchmarks, it matches or surpasses Gemini‑3‑Flash on fine‑grained temporal grounding and outperforms much larger open‑source competitors on long‑video understanding.

Efficient long‑video reasoning

Key Features
  • Precise event localization in hour‑long videos.
  • Near‑lossless reasoning over 256K token context.
  • Sparse attention keeps computation costs low.
  • Built‑in agent tools for code, search, and APIs.
  • Beats 200B+ parameter models on VideoMME V2.
  • Custom Docker image and SGLang server support.

This model fits developers and researchers who need to analyze large video archives locally without sending data to the cloud. Privacy‑conscious professionals can run complex visual queries on consumer‑grade GPUs thanks to the efficient inference stack. Small agencies can leverage its agent features to automate tasks like searching through footage or generating code from visual inputs.

What makes it different

The system is the first multimodal model to deploy DeepSeek Sparse Attention in a production release, achieving strong efficiency gains while keeping reasoning intact. It improves accuracy as the input frame count grows—unlike most competitors that degrade—with scores rising from 35.3% at 64 frames to 42.4% at 512 frames on VideoMME V2. Setup is straightforward through a pre‑built Docker image or a source build from the custom SGLang branch, with clear examples provided for image and video inference.

“As the first multi-modal model to land DSA in production, Keye-VL-2.0-30B-A3B delivers nearly lossless reasoning over 256K ultra-long context.” — Source: Hugging Face