Alibaba-PAI Launches Z Image Turbo Fun ControlNet Union 2.1

Figure: Canny control example for Z Image Turbo Fun ControlNet Union 2.1

# Z Image Turbo Fun ControlNet Union 2.1: Advanced Image Generation Update

Alibaba-PAI has released version 2.1 of its Z Image Turbo Fun ControlNet Union model. The update addresses limitations of the earlier control models and improves image generation across a range of control conditions.

Key Model Enhancements

The updated model introduces several critical improvements:

  • Added a new lite model with Control Latents applied on 5 layers (only 1.9GB)
  • Resolved mask randomness and overfitting issues in previous control models
  • Restructured dataset with multi-resolution control images (512 to 1536)
  • Improved training schedules for better image generation consistency
  • Supports multiple control conditions including Canny, HED, Depth, Pose, and MLSD
  • Inpainting mode now fully supported

Technical Performance Optimization

During development, the team found that adding ControlNet degraded the base model's output. 'During testing, we found that applying ControlNet to Z-Image-Turbo caused the model to lose its acceleration capability and become blurry,' the developers noted. To counter this, they performed 8-step distillation on the version 2.1 model, which they report performs substantially better.

Model Configuration Details

The new version includes multiple model variants:

  • Z-Image-Turbo-Fun-ControlNet-Union-2.1-2601-8steps: Enhanced mask diversity and training schedule
  • Z-Image-Turbo-Fun-ControlNet-Tile-2.1-2601-8steps: Improved resolution and training approach
  • Z-Image-Turbo-Fun-ControlNet-Union-2.1-lite-2601-8steps: Lighter model suitable for lower-spec machines
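
As a rough illustration of how the three variants divide the work, the sketch below maps a control task to one of the names listed above. The `pick_variant` helper and its `low_spec` flag are hypothetical and not part of the release; the names are copied from the list, and whether they correspond to file names or repository IDs is not stated here.

```python
# Variant names as listed in the article.
UNION = "Z-Image-Turbo-Fun-ControlNet-Union-2.1-2601-8steps"
TILE = "Z-Image-Turbo-Fun-ControlNet-Tile-2.1-2601-8steps"
UNION_LITE = "Z-Image-Turbo-Fun-ControlNet-Union-2.1-lite-2601-8steps"

# Control conditions the article attributes to the Union model.
UNION_CONDITIONS = {"canny", "hed", "depth", "pose", "mlsd", "inpaint"}


def pick_variant(task: str, low_spec: bool = False) -> str:
    """Hypothetical helper: choose a variant name for a control task.

    `low_spec` is an illustrative flag for machines with limited memory;
    the article gives no concrete hardware threshold.
    """
    task = task.strip().lower()
    if task == "tile":
        # The article lists no lite Tile variant, so low_spec is ignored here.
        return TILE
    if task in UNION_CONDITIONS:
        return UNION_LITE if low_spec else UNION
    raise ValueError(f"unsupported control task: {task!r}")


if __name__ == "__main__":
    print(pick_variant("canny"))
    print(pick_variant("depth", low_spec=True))
    print(pick_variant("tile"))
```

The lite variant is only chosen when explicitly requested, since the article positions it as a fallback for lower-spec machines rather than the default.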

Training and Implementation Insights

The 2.0 model was trained on a dataset of 1 million high-quality images covering both general and human-centric content. Training ran for 70,000 steps at a resolution of 1328 in BFloat16 precision; version 2.1 was trained for an additional 11,000 steps.

Recommended Usage

Developers recommend using a control_context_scale between 0.65 and 0.90 for optimal results. 'For better stability, we highly recommend using a detailed prompt,' the documentation advises. The model supports 8-step inference and offers improved generation quality compared to previous versions.
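
The recommendations above can be captured in a small settings helper. This is a minimal sketch, not the project's API: `ControlSettings`, `clamp_control_scale`, `make_settings`, and the default scale of 0.75 are all assumptions made for illustration. Only the 0.65 to 0.90 range, the 8-step setting, and the advice to use a detailed prompt come from the article.

```python
from dataclasses import dataclass

# Range the article recommends for control_context_scale.
RECOMMENDED_SCALE_MIN = 0.65
RECOMMENDED_SCALE_MAX = 0.90


@dataclass(frozen=True)
class ControlSettings:
    """Hypothetical container for generation settings (not a released class)."""
    prompt: str
    control_context_scale: float
    num_inference_steps: int = 8  # the 2.1 models are distilled for 8 steps


def clamp_control_scale(scale: float) -> float:
    """Pull a requested scale into the recommended 0.65-0.90 range."""
    return max(RECOMMENDED_SCALE_MIN, min(RECOMMENDED_SCALE_MAX, scale))


def make_settings(prompt: str, scale: float = 0.75) -> ControlSettings:
    """Build settings; 0.75 is an assumed default inside the recommended range."""
    if not prompt.strip():
        raise ValueError("a detailed prompt is recommended; got an empty one")
    return ControlSettings(
        prompt=prompt,
        control_context_scale=clamp_control_scale(scale),
    )


if __name__ == "__main__":
    settings = make_settings("a red-brick lighthouse on a rocky coast at dusk", 1.2)
    print(settings)
```

Clamping rather than rejecting out-of-range values is a design choice for this sketch; a stricter caller could raise an error instead.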

Learn More

The models are available on Alibaba-PAI's Hugging Face page.