Yovecent Activates UDM-GRPO To Smooth Image Creation

[Image: a cluster of small square tiles merging into a smooth, luminous pearl.]

Yovecent has released UDM-GRPO, an open-source framework that combines uniform discrete diffusion with reinforcement learning for text-to-image generation. The system stabilizes training and improves output quality by treating the fully rendered image as the primary optimization target.

Developed alongside researchers from BAAI, the project addresses the instability that typically occurs when standard reinforcement learning is applied to discrete diffusion networks. Teams building local generation tools can now integrate policy updates without sacrificing training stability.

Model size: 4.51 GB. GPU VRAM: requirements vary.

Stabilizing reinforcement learning for discrete image models

  • Treats the fully rendered image as the primary action for reliable optimization.
  • Reconstructs generation trajectories through the forward diffusion process so they match the pretraining distribution.
  • Implements a reduced-step training approach to cut down computation time.
  • Removes guidance requirements during policy updates to streamline workflows.
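As a rough illustration of the first point, here is a minimal sketch of how a GRPO-style update might score a group of final images generated from one prompt. It assumes the usual group-relative normalization (reward minus group mean, divided by group standard deviation). The function name and the reward values are hypothetical and are not taken from the UDM-GRPO code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of final images.

    Each reward scores a fully rendered image, so the whole image is
    treated as a single action (assumed GRPO-style normalization).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every image gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical reward scores for four final images from the same prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Images scoring above the group average receive a positive advantage and are reinforced, while those below average are discouraged.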

Creators generating marketing visuals or experimenting with local creative pipelines can use these adjustments to reduce training overhead. Teams running models on desktop hardware can expect accuracy gains on standard benchmarks without rewriting core architectures.

Addressing training shifts in discrete networks

The team observed that directly applying standard reinforcement learning to discrete diffusion models causes unstable training and only marginal quality improvements. Aligning probability paths with the pretraining distribution and simplifying guidance steps helps maintain steady progress across evaluation suites.
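The path-alignment idea can be sketched with the forward process of a uniform discrete diffusion model, in which each token of the clean sample is either kept or replaced by a token drawn uniformly from the vocabulary. The example below is a simplified illustration under that assumption. The function names, token ids, and keep probabilities are hypothetical, and the actual noise schedule used by UDM-GRPO may differ.

```python
import random

def forward_corrupt(clean_tokens, keep_prob, vocab_size, rng):
    """Keep each token with probability keep_prob; otherwise resample it
    uniformly from the vocabulary (uniform discrete forward process)."""
    noisy = []
    for tok in clean_tokens:
        if rng.random() < keep_prob:
            noisy.append(tok)
        else:
            noisy.append(rng.randrange(vocab_size))
    return noisy

def reconstruct_trajectory(clean_tokens, keep_probs, vocab_size, seed=0):
    """Rebuild noisy states from the final clean sample, one per noise level.

    Each state is drawn independently from the forward marginal given the
    clean sample, rather than replayed from the sampler's own steps.
    """
    rng = random.Random(seed)
    return [forward_corrupt(clean_tokens, p, vocab_size, rng) for p in keep_probs]

# Hypothetical token ids for a tiny "image" and three noise levels.
clean = [3, 7, 7, 1, 0, 5]
trajectory = reconstruct_trajectory(clean, keep_probs=[1.0, 0.5, 0.0], vocab_size=8)
```

Because the noisy states come from the same corruption process used in pretraining, the policy is updated on inputs that resemble what it saw before reinforcement learning began.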

"Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution,"
noted the developers.

You can explore the scripts on GitHub, download weights from Hugging Face, or read the full research in the technical paper.