Ming UniVision: AI Images Transformed

Screenshot of Ming-UniVision example by inclusionAI

# Ming UniVision: Unified Image AI with Continuous Visual Tokenization

inclusionAI has unveiled Ming UniVision, a groundbreaking multimodal large language model (MLLM) that unifies image understanding, generation, and editing within a single continuous autoregressive framework.

Project Foundation

On October 7, 2025, inclusionAI released Ming UniVision, introducing MingTok — the first continuous visual tokenizer designed to seamlessly support both image understanding and generation. The project aims to eliminate traditional quantization challenges by creating a unified, continuous latent space for visual representation.

Core Technical Innovation

Ming UniVision introduces three key technical breakthroughs:

  • First continuous unified tokenizer for vision tasks
  • Next-token prediction framework across understanding and generation
  • 3.5× faster training convergence compared to existing models

Technical Architecture

The model's architecture features a three-stage approach:

  • Low-level Encoder: Converts images into compact, continuous latent codes
  • Semantic Decoder: Transforms latent codes into high-dimensional semantic features
  • Pixel Decoder: Ensures high-fidelity image reconstruction

Performance Benchmarks

In comprehensive evaluations, Ming UniVision demonstrated competitive performance:

  • Multimodal Understanding: Achieved 78.5 on MMBench
  • Visual Generation: Scored 0.85 overall on GenEval benchmark
  • Image Reconstruction: Reached 31.09 PSNR with low rFID of 0.38

Key Capabilities

The model supports advanced multimodal interactions including:

  • Iterative image enhancement
  • Seamless understanding and generation
  • Multi-round in-context editing
  • Direct feature-to-feature transformations

Learn More about Ming UniVision