Ming UniVision: AI Images Transformed

# Ming UniVision: Unified Image AI with Continuous Visual Tokenization
inclusionAI has unveiled Ming UniVision, a groundbreaking multimodal large language model (MLLM) that unifies image understanding, generation, and editing within a single continuous autoregressive framework.
Project Foundation
On October 7, 2025, inclusionAI released Ming UniVision, introducing MingTok — the first continuous visual tokenizer designed to seamlessly support both image understanding and generation. The project aims to eliminate traditional quantization challenges by creating a unified, continuous latent space for visual representation.
Core Technical Innovation
Ming UniVision introduces three key technical breakthroughs:
- First continuous unified tokenizer for vision tasks
- Next-token prediction framework across understanding and generation
- 3.5× faster training convergence compared to existing models
Technical Architecture
The model's architecture features a three-stage approach:
- Low-level Encoder: Converts images into compact, continuous latent codes
- Semantic Decoder: Transforms latent codes into high-dimensional semantic features
- Pixel Decoder: Ensures high-fidelity image reconstruction
Performance Benchmarks
In comprehensive evaluations, Ming UniVision demonstrated competitive performance:
- Multimodal Understanding: Achieved 78.5 on MMBench
- Visual Generation: Scored 0.85 overall on GenEval benchmark
- Image Reconstruction: Reached 31.09 PSNR with low rFID of 0.38
Key Capabilities
The model supports advanced multimodal interactions including:
- Iterative image enhancement
- Seamless understanding and generation
- Multi-round in-context editing
- Direct feature-to-feature transformations
Learn More about Ming UniVision
- Explore the project details on their Project Page.
- The full technical paper is available Here.
- Source code can be found on their GitHub Repository.