Lucebox-Hub Supercharges AMD Strix Halo With DFlash And PFlash

    
By vramkickedin | May 19, 2026 at 7:39 pm

Lucebox-hub is a collection of hand-tuned LLM inference servers that push consumer GPUs to their limits. The latest release adds DFlash speculative decoding and PFlash speculative prefill for AMD Ryzen AI MAX+ 395 iGPUs with 128 GB of unified memory. A 27-billion-parameter model now runs more than twice as fast as it does under llama.cpp with HIP on the same AMD hardware.

Luce-Org creates these per-chip optimizations to make local AI private, fast, and free from vendor lock-in. Their DFlash and PFlash techniques, originally built for NVIDIA’s RTX 3090, now work on Strix Halo, delivering the same speedup. Anyone with a Strix Halo mini PC can now handle long-context tasks that previously felt out of reach.

Strix Halo systems gain DFlash and PFlash

Key Features

2.23x decode speedup over llama.cpp HIP.
3.05x faster prefill at 16K context.
End-to-end workload finishes 2.5x faster in wall-clock time.
Fits models up to 100 GiB in 128 GB of RAM.
Uses ROCm 7.2.2 with a simple CMake build.
A future BSA kernel is expected to cut prefill time further.
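The decode, prefill, and end-to-end figures above can be related with simple Amdahl-style arithmetic. The sketch below is an illustration only: it assumes the workload consists of nothing but prefill and decode and that the 3.05x and 2.23x figures apply to the same run, neither of which the release states. The function names are hypothetical.

```python
def end_to_end_speedup(prefill_fraction, prefill_speedup, decode_speedup):
    """Combined speedup for a workload made up only of prefill and decode.

    prefill_fraction is the share of *baseline* wall-clock time spent in
    prefill (an assumed input, not a number from the release).
    """
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be in [0, 1]")
    # New time, expressed as a fraction of the baseline time.
    new_time = (prefill_fraction / prefill_speedup
                + (1.0 - prefill_fraction) / decode_speedup)
    return 1.0 / new_time


def implied_prefill_fraction(total_speedup, prefill_speedup, decode_speedup):
    """Solve the formula above for prefill_fraction."""
    inv_total = 1.0 / total_speedup
    inv_prefill = 1.0 / prefill_speedup
    inv_decode = 1.0 / decode_speedup
    return (inv_decode - inv_total) / (inv_decode - inv_prefill)


if __name__ == "__main__":
    # Figures quoted in the list above.
    f = implied_prefill_fraction(2.5, prefill_speedup=3.05, decode_speedup=2.23)
    print(f"implied baseline prefill share: {f:.2f}")  # about 0.40
    print(f"check: {end_to_end_speedup(f, 3.05, 2.23):.2f}x")
```

Under those assumptions, a 2.5x end-to-end gain would correspond to a baseline that spent roughly 40% of its time in prefill. The combined figure always lies between the two component speedups, which is consistent with the numbers quoted.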

This release targets developers and researchers using AMD Ryzen AI MAX+ systems who want fast, private LLM inference. It lets you run a 27B model interactively and process long documents without cloud costs. The 128 GB of unified memory also opens the door to 100B+ MoE models on a single device.
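To get a feel for which models fit in the quoted 100 GiB budget, a back-of-the-envelope estimate of weight size is enough. This is a rough sketch: the bits-per-weight values are assumptions (the release does not name a quantization format), the helper names are hypothetical, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_footprint_gib(params_billion, bits_per_weight):
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


def fits(params_billion, bits_per_weight, budget_gib=100.0):
    """True if the weights alone fit in the budget (100 GiB per the article)."""
    return weight_footprint_gib(params_billion, bits_per_weight) <= budget_gib


if __name__ == "__main__":
    # Parameter counts from the article; bit widths are illustrative only.
    for params, bits in [(27, 4), (27, 16), (122, 4), (122, 8)]:
        size = weight_footprint_gib(params, bits)
        verdict = "fits" if fits(params, bits) else "does not fit"
        print(f"{params}B @ {bits}-bit: {size:6.1f} GiB ({verdict})")
```

With these assumed bit widths, a 122B-parameter model would fit at 4 bits per weight (about 57 GiB) but not at 8 bits (about 114 GiB), which suggests why 100B+ models on this hardware depend on aggressive quantization.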

Developer notes and upcoming plans

The HIP port currently runs prefill without the BSA scoring kernel, which leaves a performance gap of about 3.4x until a rocWMMA-native sparse attention kernel arrives. Planned updates include multi-row decode GEMV and tile shape tuning for Strix Halo, plus support for large MoE models such as Qwen3.5-122B-A10B. Multi-GPU and Vulkan paths are not yet supported on this HIP stack.