ComfyUI-PiD Bypasses VAE for One-Step Pixel Diffusion Upscaling

ComfyUI-PiD is a new custom node that brings NVIDIA’s Pixel Diffusion Decoder (PiD) directly into ComfyUI workflows. Instead of using a traditional VAE to decode latent images, the tool performs a conditional pixel diffusion process that combines decoding with upscaling in a single step. It supports multiple backbones, including Z-Image, Flux, and SD3, and can turn a small generated base into a sharp 2K or 4K output.
Developer Merserk created this community wrapper to make PiD accessible to local AI users who want higher-quality image enlargement inside ComfyUI. The node is designed for workflows where standard VAE upscaling falls short, offering an alternative that blends resolution increase and detail refinement. Because everything runs on a user’s own hardware, no cloud upload is required.
Pixel diffusion decoding with staged VRAM control
- Direct decode from latent to image.
- Staged workflow cuts VRAM demand significantly.
- Subprocess sampling frees CUDA memory after use.
- Auto-downloads checkpoints and assets on first run.
- Supports Z-Image, Flux, SD3, and more backbones.
- KSampler Capture grabs intermediate latents and sigma.
- Sequential block offload for extra memory savings.
This node is ideal for AI artists and creators who work with ComfyUI locally. It lets them produce high-resolution images without leaving their familiar interface or juggling separate upscaling tools. Privacy-focused users will appreciate that all processing stays on their own machine and never touches external servers.
Developer notes and known limits
Merserk notes that this is an unofficial community wrapper, so it is not an official NVIDIA or ComfyUI project. PiD sampling runs in a separate Python process to release CUDA memory, but very large outputs still demand significant VRAM. Users should review NVIDIA’s model card for license and usage terms before any commercial work.
"PiD is NVIDIA’s Pixel Diffusion Decoder approach: instead of a normal VAE decode, it treats latent-to-image decoding as conditional pixel diffusion, combining decode + upscale into one step." — Source: Reddit