Qwen3.5-4B-Base-ZitGen-V1 Transforms Images Into Text Prompts

    
        By vramkickedin    
     | 
    
            April 23, 2026 at 2:43 pm        
    
     | 
    
        2 min read

Qwen3.5-4B-Base-ZitGen-V1 is a lightweight, fine-tuned model built to convert images into detailed text instructions for AI generators. It focuses on producing highly specific commands optimized for Z-Image Turbo workflows.

Independent developer lolzinventor created this release to address the manual effort typically required for prompt engineering. Artists and workflow designers can use it to quickly generate generation-ready text from reference artwork.

Model Size: from 4.8GB & VRAM GPU: requirements vary

Core capabilities for automated prompt generation

Converts visual references into detailed, generation-ready text instructions.
Utilizes a custom dataset built through multi-step AI image comparison.
Runs efficiently through standard llama-server inference tools.
Supports direct integration into ComfyUI automation pipelines.

Creative professionals working with image generation pipelines can plug this model directly into local servers to automate captioning tasks. By removing the need for manual prompt drafting, users can maintain consistent output quality while scaling their daily production volume.

Behind the iterative training process

The dataset powering this model relies on a fully automated loop instead of manual human labeling. An initial prompt generates an image, which the system then compares to the original target to identify visual gaps. A language model writes a revised text instruction to close those gaps, and the cycle repeats four to six times until the output closely matches the goal.

Developers filtered the resulting prompts to remove formatting errors before final training.

"What makes this fine-tune unique is that the dataset (images + prompts) were generated by LLMs tasked with using the ComfyUI API to regenerate a target image,"

noted the developer in a post. This self-correcting loop ensures the final model learns prompt structures that actually work with specific image engines.

You can download the fully trained weights and test the image-to-prompt workflow directly on Hugging Face.