OpenBMB Debuts MiniCPM-o-4_5: Real-Time Vision and Voice AI

MiniCPM-o-4.5 is a multimodal AI model that processes vision, speech, and live streaming inputs in real time. Developed by OpenBMB, this 9-billion-parameter model can see, listen, and speak simultaneously through its full-duplex architecture, so input and output streams do not block each other during a conversation.
The model is built using SigLip2, Whisper-medium, CosyVoice2, and Qwen3-8B components in an end-to-end design. It achieves an OpenCompass score of 77.6 on visual benchmarks, surpassing GPT-4o and approaching Gemini 2.5 Flash in performance.
Model size: 19GB. GPU VRAM requirements: vary.
MiniCPM-o-4_5 core capabilities
- Full-duplex multimodal live streaming that processes video and audio.
- Bilingual real-time speech conversation with configurable voices in English and Chinese.
- High-resolution image processing up to 1.8 million pixels and video at 10 frames per second.
- Strong OCR performance on document parsing.
- Support for more than 30 languages, with trustworthy behavior reported to match Gemini 2.5 Flash.
- Quantized models available in int4 and GGUF formats for efficient local deployment.
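To make the 1.8-million-pixel and 10-frames-per-second limits in the list above concrete, here is a minimal Python sketch of how a client might prepare inputs to stay within them. It is not the model's own preprocessing pipeline, which the article does not describe; the function names `fit_to_pixel_budget` and `frame_stride` are hypothetical, and only the two limits come from the list.

```python
import math

# Limits taken from the capability list; everything else is illustrative.
MAX_PIXELS = 1_800_000  # "up to 1.8 million pixels"
MAX_FPS = 10            # "video at 10 frames per second"


def fit_to_pixel_budget(width: int, height: int,
                        max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Scale (width, height) down, keeping aspect ratio, to fit max_pixels.

    Hypothetical helper: images already within the budget are returned as-is.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height <= max_pixels:
        return width, height
    # Area scales with the square of the linear scale factor.
    scale = math.sqrt(max_pixels / (width * height))
    # Round down so the resized area never exceeds the budget.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


def frame_stride(source_fps: float, max_fps: float = MAX_FPS) -> int:
    """Return N such that keeping every Nth frame stays at or below max_fps."""
    if source_fps <= 0:
        raise ValueError("source_fps must be positive")
    return max(1, math.ceil(source_fps / max_fps))


if __name__ == "__main__":
    w, h = fit_to_pixel_budget(3840, 2160)  # a 4K frame exceeds the budget
    print(f"resized to {w}x{h} ({w * h} pixels)")
    print(f"keep every {frame_stride(30)}th frame of a 30 fps stream")
```

Rounding down on both axes trades a few pixels of resolution for a guarantee that the resized frame never exceeds the stated budget.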
Users running local AI systems on consumer hardware may find this model useful for building voice assistants, document analysis tools, or real-time video conversation applications. The llama.cpp-omni implementation supports half-duplex speech conversation on Apple M3/M4/M5 chips with at least 16GB RAM or NVIDIA GPUs with 12GB memory.
Deployment and model performance notes
The development team notes that the full-duplex omni-modal live streaming capability is an experimental feature that still needs improvement. Speech synthesis can occasionally mispronounce characters in full-duplex mode, and the model may sometimes mix English and Chinese in its responses.
The team has optimized the model for various hardware configurations. In bf16 format, MiniCPM-o-4.5 requires 19GB of GPU memory and decodes at 154.3 tokens per second. The int4 quantized version reduces memory usage to 11GB and decodes at 212.3 tokens per second. According to the project documentation:
'with a fully C++ implementation of MiniCPM-o 4.5 and quantized weights, llama.cpp-omni supports half-duplex speech realtime conversation and full-duplex omnimodal live streaming.'
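The bf16 and int4 figures quoted above can be compared with a little arithmetic. The following Python sketch uses only the four numbers from the article; the `pick_variant` helper is hypothetical and compares the reported weight footprint alone, ignoring any extra memory a real deployment would need for context or audio and vision buffers.

```python
# Figures reported in the article for MiniCPM-o-4.5.
BF16 = {"memory_gb": 19.0, "tokens_per_s": 154.3}
INT4 = {"memory_gb": 11.0, "tokens_per_s": 212.3}

# Fraction of GPU memory saved by int4 relative to bf16.
memory_saving = 1 - INT4["memory_gb"] / BF16["memory_gb"]

# Decoding speedup of int4 relative to bf16.
speedup = INT4["tokens_per_s"] / BF16["tokens_per_s"]


def pick_variant(available_gb: float):
    """Hypothetical helper: choose a variant by reported memory use only.

    Returns "bf16", "int4", or None if neither reported footprint fits.
    Assumption: no headroom is reserved beyond the reported figures.
    """
    if available_gb >= BF16["memory_gb"]:
        return "bf16"
    if available_gb >= INT4["memory_gb"]:
        return "int4"
    return None


if __name__ == "__main__":
    print(f"int4 memory saving: {memory_saving:.0%}")
    print(f"int4 decoding speedup: {speedup:.2f}x")
    print(f"variant for a 12GB GPU: {pick_variant(12)}")
```

On these reported numbers, int4 uses roughly 42% less memory than bf16 and decodes about 1.38 times as fast.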
Get MiniCPM-o-4_5 on GitHub and Hugging Face.