Nvidia PiD Fuses Upscaling And Decoding For Instant 4K Images
Nvidia has released PiD, a pixel diffusion decoder that speeds up high‑resolution image generation from latent models. It reformulates the standard decoder as a conditional diffusion model that denoises directly in pixel space, merging upscaling and decoding into a single stage. The result is a plug‑in module that turns the latent of a small image into a crisp 2K or 4K picture in one decoding stage (four denoising steps in the released distilled checkpoints), saving time and memory.
Nvidia’s research team created PiD to solve the slow, resource‑hungry decoding step in latent diffusion and autoregressive image models. Traditional decoders only try to invert the encoder instead of adding new details, but PiD uses a generative approach to synthesize extra sharpness. This release provides pretrained checkpoints for popular backbones like Flux, Stable Diffusion 3, and DINOv2, making it easy to swap in a faster, higher‑quality decoder.
A single‑pass decoder for crisp 4K images
- 4x and 8x upscaling with 4‑step distillation.
- Works with Flux, SD3, DINOv2, and SigLIP backbones.
- Decodes latents of 512px images to 2048px output in one pass.
- Peaks at 13 GB of VRAM for that 512px‑to‑2048px decode on a consumer RTX 5090.
- Handles multiple aspect ratios natively.
- 2K and 4K variants provided for each backbone.
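The upscaling factors in the list above translate into output sizes by simple multiplication. The sketch below is only illustrative arithmetic, not part of PiD: the helper names `output_size` and `megapixels` are hypothetical, and the article does not say which base resolution each checkpoint expects, so the 512px input is taken from the 512px‑to‑2048px example.

```python
def output_size(width, height, scale):
    """Return the decoded pixel size for a source image size and upscaling factor."""
    if scale not in (4, 8):
        # Only the 4x and 8x factors are mentioned in the article.
        raise ValueError("expected a 4x or 8x upscaling factor")
    return width * scale, height * scale


def megapixels(width, height):
    """Pixel count in millions, useful for comparing decode workloads."""
    return width * height / 1_000_000


if __name__ == "__main__":
    for scale in (4, 8):
        w, h = output_size(512, 512, scale)
        print(f"{scale}x: 512x512 -> {w}x{h} ({megapixels(w, h):.1f} MP)")
```

An 8x decode produces four times as many pixels as a 4x decode of the same input, which is why memory and latency figures quoted for 2K output should not be assumed to carry over to 4K.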
AI artists, local AI enthusiasts, and small studios that need high‑resolution outputs without renting cloud servers will benefit the most. PiD plugs into existing Flux or SD3 workflows and generates 2K or 4K images on a single consumer GPU like an RTX 5090 in seconds. Privacy‑conscious professionals who keep all generation on their own hardware can also take advantage of its low‑memory, single‑pass design.
What’s next and known trade‑offs
The project is strictly for non‑commercial research at this stage, since the license prohibits any commercial use. PiD checkpoints are distributed only as distilled 4‑step EMA weights, so you cannot retrain them with the provided files alone. On the plus side, the repository bundles all required encoder weights (the Flux and SD3 VAEs plus the DINOv2 and SigLIP vision encoders), meaning a single download gives you everything needed to run the decoder end‑to‑end.
“PiD decodes latents of 512 × 512 images into 2048 × 2048 pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about 6× faster than cascaded diffusion‑based super‑resolution pipelines with better visual fidelity.” — Source: arXiv
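To put the quoted latencies in perspective, they can be converted into throughput. This is a rough back‑of‑the‑envelope sketch: the function names are hypothetical, the 1‑second figure is only an upper bound ("under 1 second"), and the implied baseline assumes the "about 6×" comparison refers to the same hardware, which the quote does not confirm.

```python
def throughput(latency_s, width=2048, height=2048):
    """Return (images per second, megapixels per second) for one decode latency."""
    images_per_s = 1.0 / latency_s
    mp_per_s = width * height / 1e6 * images_per_s
    return images_per_s, mp_per_s


def implied_baseline_latency(latency_s, speedup=6.0):
    """Latency a pipeline 'speedup' times slower would need.

    Assumption: the quoted ~6x figure applies on the same GPU.
    """
    return latency_s * speedup


if __name__ == "__main__":
    for name, latency in (("GB200, 210 ms", 0.210), ("RTX 5090, upper bound 1 s", 1.0)):
        ips, mps = throughput(latency)
        print(f"{name}: {ips:.2f} images/s, {mps:.1f} MP/s")
    print(f"Implied cascaded baseline at 210 ms: {implied_baseline_latency(0.210):.2f} s")
```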