Marlin-2B Pins Down Every Second Of Your Video


Marlin-2B is a new open-source video language model that extracts structured descriptions and second‑precise timestamps from video footage. It answers the two questions developers most often ask about a video: what is happening, and when.

NemoStation built Marlin by fine‑tuning the Qwen3.5-2B model on a carefully curated dataset with an efficient two‑stage training pipeline. The resulting model posts strong results on dense captioning and temporal grounding benchmarks.

Efficient video analysis on consumer hardware

Key Features
  • Top accuracy on CaReBench dense captioning.
  • Matches Gemini Flash on temporal grounding.
  • Runs on a single consumer GPU.
  • Simple .caption() and .find() methods.
  • Structured output with second‑precise timestamps.

The model is ideal for developers and small teams who need to analyze video locally without sending data to cloud services. Privacy‑conscious professionals can run it on their own hardware to pull structured event timelines from security footage, meeting recordings, or user‑generated videos. Hobbyists and researchers also benefit from a competitive video understanding tool that fits on a single gaming‑class GPU.
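To make the idea of a structured event timeline concrete, here is a minimal Python sketch of how such output might be stored and queried. The Event fields, the sample data, and the events_between helper are all hypothetical and chosen for illustration; Marlin's actual output schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One timeline entry (hypothetical schema, not Marlin's own)."""
    start_s: int       # start time in whole seconds
    end_s: int         # end time in whole seconds
    description: str   # what is happening in that span


def fmt(seconds):
    """Format a second count as MM:SS."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def events_between(timeline, start_s, end_s):
    """Return the events that overlap the window [start_s, end_s]."""
    return [e for e in timeline if e.start_s <= end_s and e.end_s >= start_s]


# Made-up sample data standing in for model output.
timeline = [
    Event(0, 12, "A delivery van pulls into the driveway."),
    Event(12, 31, "The driver carries a box to the front door."),
    Event(31, 40, "The van reverses out and leaves."),
]

for e in events_between(timeline, 10, 20):
    print(f"{fmt(e.start_s)}-{fmt(e.end_s)}  {e.description}")
```

Storing times as plain seconds keeps window queries to a simple comparison, and formatting is left to display time.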

What’s under the hood

The training data combines public video annotations with dense re‑annotations produced by Gemini‑3‑Flash and targeted human review, yielding about 400,000 high‑quality clip‑level examples. A two‑stage process fine‑tuned the model, first with supervised learning and then with SimPO preference optimization on a single H100 GPU. SimPO is reference‑free, so no separate reference model was needed. The team plans to publish a full recipe paper. One minor artifact remains: every raw response begins with a <think> token, though the built‑in helper methods strip it automatically.
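Two details from this paragraph lend themselves to a short Python sketch. The first function computes the SimPO loss for one preference pair: the reward is the length‑normalized log‑probability of a response under the model being trained, which is why no reference model enters the calculation. The second shows one way a leading <think> token could be removed from a raw response. The beta and gamma values are illustrative rather than Marlin's settings, and strip_think is a hypothetical helper, not the library's own code.

```python
import math
import re


def simpo_loss(logp_chosen, logp_rejected, beta=2.0, gamma=0.5):
    """SimPO loss for a single (chosen, rejected) pair.

    Each argument is a list of per-token log-probabilities under the
    policy being trained. beta and gamma are illustrative defaults.
    """
    # Implicit reward: beta times the average token log-probability.
    r_chosen = beta * sum(logp_chosen) / len(logp_chosen)
    r_rejected = beta * sum(logp_rejected) / len(logp_rejected)
    margin = r_chosen - r_rejected - gamma
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Matches a leading "<think>", plus everything up to "</think>" if a
# closing tag is present. Whether Marlin emits a closing tag is an assumption.
_THINK_PREFIX = re.compile(r"^\s*<think>(?:.*?</think>)?\s*", re.DOTALL)


def strip_think(raw):
    """Remove a leading <think> token (hypothetical helper)."""
    return _THINK_PREFIX.sub("", raw, count=1)


print(simpo_loss([-0.1, -0.2], [-1.0, -2.0, -3.0]))
print(strip_think("<think>\n\n</think>\n\nA dog runs across the yard."))
```

The loss shrinks as the chosen response's average log‑probability pulls ahead of the rejected one by more than the margin gamma.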

“Marlin is a 2B video VLM tuned for the two questions developers actually ask their videos: what is happening, and when?” (Source: Hugging Face)