Breathe Life into Masks with Z-Image-SAM-ControlNet

    
By vramkickedin | April 7, 2026 at 1:58 am | 2 min read

Z-Image-SAM-ControlNet is a new control model designed to transform segmented images into photorealistic pictures. It functions as a ControlNet for the Tongyi-MAI/Z-Image base model, allowing users to guide image generation using specific structural maps.

Created by developer neuralvfx, this tool bridges the gap between simple segmentation masks and high-quality visual output. It processes input from Segment Anything Model (SAM) style images to produce detailed renders while maintaining the original composition.

Model size: 19 GB. GPU VRAM requirements vary.

Functional features and control capabilities

Converts segmented input images into photorealistic outputs.
Trained natively at a resolution of 1024x1024 pixels.
Performs best when inference is run at 1.5k resolution or higher.
Trained on a dataset of 200,000 segmented images sourced from laion2b-squareish.
Includes a specific model patch for integration with ComfyUI workflows.
Includes Python code for use with the Hugging Face Diffusers library.
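
The resolution guidance in the list above can be turned into a small sizing helper. This is a minimal sketch, not the developer's code. It assumes that "1.5k" means roughly 1536 pixels on the shorter side and that image dimensions should be multiples of 16; neither detail is stated in the source. The function name `control_image_size` is hypothetical.

```python
def _ceil_div(a: int, b: int) -> int:
    """Integer division that rounds up instead of down."""
    return -(-a // b)


def control_image_size(width: int, height: int,
                       min_side: int = 1536, multiple: int = 16) -> tuple[int, int]:
    """Return a target (width, height) for a control image.

    Assumptions (not from the source): "1.5k" is read as 1536 px on the
    shorter side, and both dimensions are snapped up to a multiple of 16.
    Images that are already large enough are never downscaled.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    short = min(width, height)
    if short >= min_side:
        new_w, new_h = width, height
    else:
        # Scale both sides by min_side / short, using integer math to
        # avoid floating-point rounding surprises.
        new_w = _ceil_div(width * min_side, short)
        new_h = _ceil_div(height * min_side, short)

    # Snap up to the nearest multiple; this can shift the aspect ratio
    # by a few pixels on non-square inputs.
    new_w = _ceil_div(new_w, multiple) * multiple
    new_h = _ceil_div(new_h, multiple) * multiple
    return new_w, new_h
```

Under these assumptions, a 1024x1024 segmentation map would be upscaled to 1536x1536, while a 2048x2048 map would be left unchanged. When resizing a flat-colored segmentation map, nearest-neighbor resampling is usually the safer choice, since smoother filters blend neighboring segment colors along the boundaries.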

Digital artists and small creative teams can use this model to streamline their rendering process. Instead of painting details manually, they can use simple segmentation maps to define object placement, letting the AI handle textures and lighting. This approach saves significant time for anyone needing to visualize concepts quickly or iterate on composition layouts.

Technical notes and usage

The developer mentioned that the training set consisted of 200,000 images, which is smaller than typical datasets used for similar models. However, the model maintains strong adherence to the control input.

"This is on the smaller side for ControlNet training, but the control holds up surprisingly well!" the developer stated.

Users can install the model in ComfyUI by placing the weights in the model patches folder, or clone the repository to use it with Diffusers. The developer recommends scaling control images to at least 1.5k resolution for the best structural accuracy. The workflow also includes auto-segmentation options, so users can generate the necessary input masks directly within the interface.
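
The ComfyUI install step amounts to copying one weights file into the right folder. The sketch below shows one way to script that, under stated assumptions: the default subfolder `models/model_patches` is a guess at the "model patches folder" mentioned above, and the function name `install_model_patch` is hypothetical. Check your own ComfyUI installation for the actual folder before relying on it.

```python
import shutil
from pathlib import Path


def install_model_patch(weights_file: str, comfyui_root: str,
                        subfolder: str = "models/model_patches") -> Path:
    """Copy a downloaded weights file into a ComfyUI model patches folder.

    The default `subfolder` is an assumption, not something the source
    confirms; pass a different value if your install uses another layout.
    """
    src = Path(weights_file)
    if not src.is_file():
        raise FileNotFoundError(f"weights file not found: {src}")

    dest_dir = Path(comfyui_root) / subfolder
    dest_dir.mkdir(parents=True, exist_ok=True)

    dest = dest_dir / src.name
    if not dest.exists():
        # The file is large, so skip the copy if it is already in place.
        shutil.copy2(src, dest)
    return dest
```

The function returns the destination path, so a caller can print it to confirm where the weights ended up. Restarting ComfyUI after the copy is generally needed before newly added files show up in its loader nodes.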

Get control with Z-Image-SAM-ControlNet on Hugging Face.