ComfyUI-FeatherOps Gives AMD RDNA3 GPUs Up To A 50% Diffusion Speed Boost

ComfyUI-FeatherOps is a new custom node for ComfyUI that accelerates diffusion model inference on AMD RDNA3 and RDNA3.5 GPUs, including the integrated GPU in Strix Halo APUs. It uses a hand-written HIP kernel to perform fast mixed-precision matrix multiplication, pairing fp16 inputs with fp8 weights even when the GPU lacks native fp8 hardware. The result is faster image and video generation in popular ComfyUI workflows without changing your model files.
The project was built from scratch by developer woct0rdho, who wrote the kernel in HIP with intrinsics and inline assembly rather than relying on libraries like CK or Triton. It tackles a known performance gap: standard fp16 operations on AMD consumer GPUs often fail to fully utilize the matrix (WMMA) hardware. By packing weights in fp8 format, loading them into fast on-chip memory, and upcasting on the fly during the compute loop, FeatherOps cuts VRAM bandwidth usage and improves instruction-level parallelism, which is especially helpful for the large matrix multiplications found in diffusion transformer models.
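To make the "upcast on the fly" idea concrete, here is a minimal pure-Python sketch of a dot product that keeps weights as one-byte fp8 values and widens each one only at the moment it is multiplied. It assumes the E4M3 ("fn") fp8 encoding, which the article does not confirm, and the names `fp8_e4m3_to_float` and `mixed_dot` are hypothetical. It illustrates the arithmetic only and is not the project's HIP kernel.

```python
import math


def fp8_e4m3_to_float(byte: int) -> float:
    """Decode one fp8 byte, assuming E4M3 'fn' layout: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # the only NaN pattern; this format has no infinities
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of -6
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def mixed_dot(activations, packed_weights: bytes, scale: float = 1.0) -> float:
    """Dot product of wide activations with fp8 weights stored one byte each.

    Each weight is widened right before the multiply, so the stored weight
    buffer stays at one byte per element. `scale` is an assumed per-tensor
    dequantization factor.
    """
    acc = 0.0
    for a, w in zip(activations, packed_weights):
        acc += a * fp8_e4m3_to_float(w)
    return acc * scale


weights = bytes([0x38, 0x40, 0xB8])  # encodes 1.0, 2.0, -1.0 under E4M3
result = mixed_dot([1.0, 2.0, 0.5], weights)
```

Stored this way, a weight takes one byte instead of the two bytes an fp16 value needs, which is where the reduction in memory traffic comes from.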
Designed for AMD RDNA3 GPU owners
- Custom HIP kernel free of Tensile and Triton abstractions.
- Loads fp8 weights and upcasts during computation.
- Prepacked weight layout speeds VRAM-to-LDS transfers.
- Works on Strix Halo and other RDNA3 GPUs.
- Reported to reach 43 TFLOPS with torch.compile, dispatch overhead included.
- Supports LoRA and torch.compile for extra throughput.
- 30–50% speedup reported on Wan and Anima models.
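The "prepacked weight layout" point can be illustrated with a small sketch. The article does not describe the actual layout FeatherOps uses, so the tile shape, the ordering, and the function name `prepack_tiles` below are all hypothetical. The general idea shown is reordering a row-major weight matrix once, ahead of time, so that each tile a workgroup would load sits in one contiguous run of bytes.

```python
def prepack_tiles(weights: bytes, rows: int, cols: int,
                  tile_rows: int, tile_cols: int) -> bytes:
    """Reorder a flat row-major byte matrix into tile-major order.

    Hypothetical layout for illustration: tiles are visited left to right,
    top to bottom, and each tile is stored row by row.
    """
    if len(weights) != rows * cols:
        raise ValueError("weights length must equal rows * cols")
    if rows % tile_rows or cols % tile_cols:
        raise ValueError("matrix dimensions must be multiples of the tile size")

    out = bytearray()
    for tile_r in range(0, rows, tile_rows):
        for tile_c in range(0, cols, tile_cols):
            # Copy one tile; afterwards its bytes are contiguous in `out`
            for r in range(tile_r, tile_r + tile_rows):
                start = r * cols + tile_c
                out += weights[start:start + tile_cols]
    return bytes(out)


matrix = bytes(range(16))               # 4x4 matrix, one byte per fp8 weight
packed = prepack_tiles(matrix, 4, 4, 2, 2)
```

After packing, the first four bytes hold the entire top-left 2×2 tile, so a loader can fetch it as a single contiguous read rather than two strided ones.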
This node is meant for professionals and advanced hobbyists who run ComfyUI locally on AMD RDNA3 hardware such as Strix Halo or the RX 7000 series. They can generate images and videos noticeably faster without switching to Nvidia GPUs or offloading to the cloud. Because all computation stays on the local machine, privacy-conscious users keep full control over their data while cutting wait times.
Developer insights and limitations
The developer notes that FeatherOps currently speeds up only non-attention linear layers, so its benefits stack with a separate FlashAttention installation. A dedicated fp16 × fp16 kernel and improved fp8 quantization are planned for future updates, and optimization of attention operations is also under investigation. The kernel was tested specifically on Strix Halo but should work across RDNA3 GPUs, although actual speedups may vary with driver and PyTorch versions.
“I've tried PC sampling and thread tracing but I could not fully understand the bottleneck.” — Source: GitHub