MiniCPM-V-4.6 Packs Private Visual AI Into Phones


MiniCPM-V-4.6 is a new open-source multimodal model that brings image and video understanding directly to smartphones and small computers. It answers questions about photos and video clips without a cloud connection by pairing a compact vision encoder with a 0.8-billion-parameter language model. The release delivers accurate visual reasoning while staying small enough to run privately on everyday hardware.

OpenBMB, the team behind VoxCPM2 and MiniCPM-o 4.5, built this version to slash the heavy compute that vision-language tasks usually require. They focused on keeping strong accuracy while shrinking the model’s cost so it can serve edge devices like phones and tablets. The result is a lightweight tool that helps developers and power users build private AI applications without sacrificing capability.

Efficient vision model for phones and small GPUs

What it can do
  • Scores 13 on the Artificial Analysis Intelligence Index at 19x lower token cost than Qwen3.5-0.8B.
  • Surpasses Ministral 3 3B on key benchmarks.
  • Matches 2B-level performance on many vision tasks.
  • Cuts visual encoding compute by over 50%.
  • Supports mixed 4x and 16x token compression.
  • Deploys on iOS, Android, and HarmonyOS.
  • Works with vLLM, Ollama, llama.cpp, and more.
  • Ships in multiple quantized formats for local use.
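
As a rough illustration of what the 4x and 16x compression modes mean for prompt length, the sketch below treats compression as a plain division of the visual token count. The patch count of 1024 and the helper name `visual_tokens` are hypothetical and are not taken from the release; the model's actual merging scheme may differ.

```python
import math


def visual_tokens(num_patches: int, compression: int) -> int:
    """Visual tokens left after merging `compression` patches into one (rounded up)."""
    if compression not in (4, 16):
        raise ValueError("this sketch only models the 4x and 16x modes")
    return math.ceil(num_patches / compression)


patches = 1024  # hypothetical patch count for one image, chosen for easy arithmetic
for ratio in (4, 16):
    print(f"{ratio}x -> {visual_tokens(patches, ratio)} visual tokens")
# 4x -> 256 visual tokens
# 16x -> 64 visual tokens
```

Under that assumption, the 16x mode hands the language model a quarter of the visual tokens the 4x mode does, which is where the savings in memory and latency would come from.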

This model fits privacy-conscious professionals who want to analyze images without sending data to the cloud. Small agencies can integrate it into local tools for document review, visual search, or automated captioning. Hobbyists with consumer GPUs get fast, private video understanding inside a memory-friendly footprint.

Developer notes and hardware tips

The model is fully open source and includes adaptation code for three mobile platforms: iOS, Android, and HarmonyOS. Video decoding may need the PyAV library if the default torchcodec package triggers CUDA version conflicts. Token throughput is about 1.5 times that of the base Qwen3.5-0.8B model, and visual encoding needs less than half the usual compute. No roadmap has been posted yet, but the team provides quantized GGUF, BNB, AWQ, and GPTQ variants to help users fit the model onto a wide range of hardware.
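
For readers who hit the torchcodec conflict, here is a minimal sketch of pulling a handful of evenly spaced frames out of a clip with PyAV. The function names `sample_indices` and `decode_frames` and the default of 8 frames are assumptions made for illustration; the release's own preprocessing code may sample frames differently, and handing the frames to the model is not shown.

```python
def sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to `num_samples` evenly spaced frame indices (centers of equal bins)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(i * step + step / 2) for i in range(num_samples)]


def decode_frames(path: str, num_samples: int = 8):
    """Decode evenly spaced frames with PyAV and return them as PIL images."""
    import av  # PyAV; imported lazily so sample_indices works without it installed

    with av.open(path) as container:
        total = container.streams.video[0].frames  # 0 if the file stores no frame count
        if total == 0:
            # Fall back to decoding everything, then sampling from the full list.
            images = [frame.to_image() for frame in container.decode(video=0)]
            return [images[i] for i in sample_indices(len(images), num_samples)]
        wanted = set(sample_indices(total, num_samples))
        return [
            frame.to_image()
            for i, frame in enumerate(container.decode(video=0))
            if i in wanted
        ]
```

Calling `decode_frames("clip.mp4")` on a local file (the path is a placeholder) would return a short list of PIL images. `to_image()` requires Pillow to be installed alongside PyAV.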

"MiniCPM-V 4.6 scores 13 on the Artificial Analysis Intelligence Index benchmark, outperforming Qwen3.5-0.8B's score of 10 with 19x fewer token cost, and Qwen3.5-0.8B-Thinking's score of 11 with 43x fewer token cost." — Source: Hugging Face