Tongyi-MAI Z Image Is Finally Here

A showcase of Z Image editing capabilities

Z Image is a 6-billion-parameter text-to-image model from Tongyi-MAI that generates high-quality images on consumer hardware. The long-awaited model requires less than 16GB of VRAM, making it accessible without enterprise-grade equipment.

Released on November 27, 2025, Z Image challenges the industry trend of ever-larger models by prioritizing efficient training. The full training run cost approximately $630,000—a fraction of what comparable models typically require.

Model size: 6B parameters
GPU VRAM: under 16GB required

What Z Image Offers

  • Photorealistic image generation from text prompts
  • Bilingual support for Chinese and English text rendering
  • Z-Image-Turbo variant for sub-second image generation
  • Z-Image-Edit for instruction-based image manipulation
  • Scalable Single-Stream Diffusion Transformer architecture

Content creators, small agencies, and hobbyists with mid-range GPUs can use Z Image for marketing visuals, concept art, and image editing tasks. The bilingual capability also benefits teams working across English and Chinese-speaking markets.

Why the Team Took a Different Approach

The Z Image team states that leading open-source models now range from 20B to 80B parameters, making them "impractical for inference and fine-tuning on consumer-grade hardware."

Their goal was to prove that principled design can match brute-force scaling.
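To see why parameter count matters on consumer GPUs, here is a back-of-the-envelope estimate of how much memory the weights alone would occupy. It is only a sketch: the article does not state which numeric precision the models use, so the 2 bytes per parameter (bf16 or fp16) is an assumption, as is the 16 GiB budget used as a stand-in for the "under 16GB" figure. The estimate ignores the text encoder, VAE, and activations, all of which add to real-world usage.

```python
# Rough weight-memory estimate for models of different sizes.
# Assumption (not from the article): weights stored at 2 bytes per parameter (bf16/fp16).
BYTES_PER_PARAM = 2
GIB = 1024 ** 3
VRAM_BUDGET_GIB = 16  # hypothetical consumer-GPU budget


def weight_memory_gib(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


# Parameter counts quoted in the article; the labels are illustrative.
models = {
    "Z Image (6B)": 6e9,
    "20B model": 20e9,
    "80B model": 80e9,
}

estimates = {name: weight_memory_gib(n) for name, n in models.items()}

for name, gib in estimates.items():
    verdict = "fits" if gib < VRAM_BUDGET_GIB else "does not fit"
    print(f"{name}: ~{gib:.1f} GiB of weights, {verdict} in {VRAM_BUDGET_GIB} GiB")

# Prints:
# Z Image (6B): ~11.2 GiB of weights, fits in 16 GiB
# 20B model: ~37.3 GiB of weights, does not fit in 16 GiB
# 80B model: ~149.0 GiB of weights, does not fit in 16 GiB
```

Under these assumptions a 6B model leaves a few GiB of headroom on a 16 GiB card, while a 20B model already exceeds it by more than a factor of two before any quantization or offloading.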

Z-Image-Turbo currently ranks 8th on the Artificial Analysis Text-to-Image Leaderboard, the highest position among open-source models. The team attributes this result to training on real-world data rather than synthetic distillation.

Z Image provides a practical option for users who want competitive image quality without investing in expensive hardware.

Read the research paper on arXiv, browse the Hugging Face page, or visit the GitHub repository. The model is also available through Comfy.