InclusionAI brings Any to Any with Ming-flash-omni-2.0 LLM

Ming-flash-omni-2.0 is a unified multimodal model from inclusionAI that processes images, text, audio, and video and generates both speech and images. Built on a Mixture-of-Experts (MoE) architecture with 100 billion total parameters, of which about 6 billion are active per token, the model handles these tasks within a single framework.

This release addresses the need for open-source alternatives to proprietary multimodal systems. inclusionAI has made all code and model weights publicly available, allowing researchers and developers to build on the technology without relying on closed platforms.

Model size: 209GB. GPU VRAM: requirements vary.
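
For a rough sense of scale, the figures above can be turned into a back-of-the-envelope estimate. The sketch below takes the total parameter count as 100 billion and computes the active fraction, the implied bytes per parameter, and the minimum number of GPUs needed just to hold 209GB of weights. The per-GPU memory sizes in the usage note are illustrative assumptions, and the estimate ignores activations, caches, and framework overhead, so real requirements will be higher.

```python
import math

# Figures quoted in the article.
ACTIVE_PARAMS_B = 6    # billions of active parameters
TOTAL_PARAMS_B = 100   # billions of total parameters
CHECKPOINT_GB = 209    # published model size in GB


def active_fraction(active_b, total_b):
    """Share of the parameters that take part in a forward pass."""
    return active_b / total_b


def bytes_per_param(checkpoint_gb, total_params_b):
    """GB divided by billions of parameters gives bytes per parameter."""
    return checkpoint_gb / total_params_b


def min_gpus_for_weights(checkpoint_gb, gpu_memory_gb):
    """Lower bound on GPU count: weights only, no activations or overhead."""
    return math.ceil(checkpoint_gb / gpu_memory_gb)
```

With these helpers, active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B) gives 0.06, bytes_per_param(CHECKPOINT_GB, TOTAL_PARAMS_B) gives roughly 2 bytes per parameter (consistent with 16-bit weights), and min_gpus_for_weights(CHECKPOINT_GB, 80) gives 3 for hypothetical 80GB cards.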

What Ming-flash-omni-2.0 can do

  • Expert-level visual recognition of plants, animals, landmarks, and cultural artifacts.
  • Unified acoustic synthesis combining speech, audio, and music in a single pipeline.
  • Zero-shot voice cloning with controllable attributes like emotion and timbre.
  • High-dynamic image generation, editing, and object removal.
  • Streaming video conversation with free modality switching.

Developers building applications that require multiple AI capabilities can run this single model instead of maintaining separate systems for vision, speech, and generation tasks. The unified architecture removes the need for task-specific fine-tuning, streamlining development workflows for complex multimodal projects.

Technical approach to Ming Flash

The system uses dedicated encoders to extract tokens from different input types, which are processed through modality-specific routers within the MoE framework. An advanced audio decoder handles natural-sounding speech synthesis, while integrated image generation capabilities support context-aware editing.
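
To make the routing idea concrete, here is a minimal toy sketch of modality-specific routers in front of a shared pool of experts. It is not inclusionAI's implementation: the linear router, the top-2 selection, the gate renormalization, the toy experts, and every name in the code are assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ModalityRouter:
    """Hypothetical linear router: one weight vector per expert."""

    def __init__(self, dim, num_experts, seed):
        rng = random.Random(seed)
        self.weights = [[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(num_experts)]

    def route(self, token, top_k):
        # Score every expert, keep the top_k, renormalize their gates to sum to 1.
        scores = [sum(w * x for w, x in zip(row, token)) for row in self.weights]
        probs = softmax(scores)
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
        norm = sum(probs[i] for i in top)
        return [(i, probs[i] / norm) for i in top]


def moe_forward(token, modality, routers, experts, top_k=2):
    """Pick the router for the token's modality, then mix the chosen experts."""
    gates = routers[modality].route(token, top_k)
    out = [0.0] * len(token)
    for idx, gate in gates:
        expert_out = experts[idx](token)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, gates


def make_expert(scale):
    """Toy expert that just scales its input (stand-in for a feed-forward block)."""
    return lambda x: [scale * v for v in x]


DIM, NUM_EXPERTS = 4, 8
experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]
# One router per modality; all routers share the same expert pool.
routers = {name: ModalityRouter(DIM, NUM_EXPERTS, seed)
           for seed, name in enumerate(["text", "image", "audio", "video"])}
```

Calling moe_forward([0.5, -1.0, 0.25, 2.0], "audio", routers, experts) returns the mixed output along with the two (expert index, gate) pairs that the audio router selected. Only the selected experts run, which is how an MoE model keeps the active parameter count well below the total.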

According to the research paper:

'our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support.'

The provided code includes utilities for distributing the model across multiple GPUs. Installation requires flash_attention_2 support on compatible NVIDIA hardware.
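
As an illustration of what splitting a model across GPUs involves, the sketch below builds a mapping from layer names to GPU indices by dividing the layers as evenly as possible. It is not the utility shipped with the project: the function name, the module names (model.layers.N, model.embed_tokens, model.norm, lm_head), and the choice to pin the non-layer modules to GPU 0 are all assumptions. A dictionary of this shape is the kind of value Hugging Face loaders accept for their device_map argument.

```python
def build_device_map(num_layers, num_gpus,
                     layer_prefix="model.layers",
                     pinned=("model.embed_tokens", "model.norm", "lm_head")):
    """Assign transformer layers to GPUs in contiguous, near-equal chunks.

    All module names are hypothetical placeholders; check the real
    checkpoint for the actual names before using a map like this.
    """
    if num_layers <= 0 or num_gpus <= 0:
        raise ValueError("num_layers and num_gpus must be positive")

    # Keep embeddings, final norm, and output head together on GPU 0.
    device_map = {name: 0 for name in pinned}

    # The first `extra` GPUs each take one layer more than the rest.
    base, extra = divmod(num_layers, num_gpus)
    layer = 0
    for gpu in range(num_gpus):
        count = base + (1 if gpu < extra else 0)
        for _ in range(count):
            device_map[f"{layer_prefix}.{layer}"] = gpu
            layer += 1
    return device_map
```

For example, build_device_map(10, 4) places layers 0 to 2 on GPU 0, layers 3 to 5 on GPU 1, layers 6 and 7 on GPU 2, and layers 8 and 9 on GPU 3. Contiguous chunks keep the number of cross-GPU transfers per forward pass to one per boundary.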

Getting started