Lucebox-Hub Supercharges AMD Strix Halo With DFlash And PFlash

Lucebox-hub is a collection of hand-tuned LLM inference servers that push consumer GPUs to their limits. The latest release adds DFlash speculative decoding and PFlash speculative prefill for AMD Ryzen AI MAX+ 395 iGPUs with 128 GB of unified memory. A 27-billion-parameter model now runs more than twice as fast as it does under llama.cpp's HIP backend on the same AMD hardware.

Luce-Org creates these per-chip optimizations to make local AI private, fast, and free from vendor lock-in. Their DFlash and PFlash techniques, originally built for NVIDIA’s RTX 3090, now work on Strix Halo and bring the same kind of speedup to AMD hardware. Anyone with a Strix Halo machine can now handle long-context tasks that previously felt out of reach.
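For readers new to speculative decoding, here is a minimal sketch of the generic draft-and-verify idea in its greedy form. It is not DFlash's implementation, which this article does not describe. The names target_next, draft_next, and k are hypothetical, and the two "models" are toy functions chosen only so the loop can run.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy draft-and-verify decoding (generic sketch, not DFlash itself).

    target_next and draft_next are hypothetical callables that map a token
    list to the next token. The output matches greedy_decode(target_next, ...);
    the draft only reduces how many target verification passes are needed.
    """
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The target checks every proposed position. A real engine scores
        #    all positions in one batched forward pass; this sketch calls the
        #    target per position for clarity and counts the batch as one pass.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if expected == draft[i]:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # keep the target's correction
                break
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


# Toy stand-ins: the target counts upward mod 10; the draft agrees except
# after a 5, where it guesses wrong.
def toy_target(tokens):
    return (tokens[-1] + 1) % 10


def toy_draft(tokens):
    return 0 if tokens[-1] == 5 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_draft, [0], 12, k=4)
    print(out, "target passes:", passes)
```

The speedup in this scheme comes from accepting several tokens per target pass. How many get accepted depends on how often the draft agrees with the target, which is workload dependent.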

Strix Halo systems gain DFlash and PFlash

Key Features
  • 2.23x decode speedup over llama.cpp HIP.
  • 3.05x faster prefill at 16K context.
  • 2.5x faster end-to-end wall-clock time.
  • Fits models up to 100 GiB in 128 GB RAM.
  • Uses ROCm 7.2.2 with simple CMake build.
  • Planned BSA kernel to cut prefill time further.
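The decode, prefill, and end-to-end figures above are related by simple arithmetic: the combined speedup is a time-weighted blend of the per-phase speedups. The sketch below assumes a workload made up only of prefill and decode. The prefill_fraction parameter is hypothetical, since the article does not state the workload mix.

```python
def end_to_end_speedup(prefill_fraction, prefill_speedup=3.05, decode_speedup=2.23):
    """Combine per-phase speedups into one wall-clock speedup.

    prefill_fraction is the share of *baseline* wall-clock time spent in
    prefill (an assumed input; the article does not report it). The default
    speedups are the figures quoted in the feature list.
    """
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be in [0, 1]")
    decode_fraction = 1.0 - prefill_fraction
    # New time, as a fraction of baseline time, summed over both phases.
    new_time = prefill_fraction / prefill_speedup + decode_fraction / decode_speedup
    return 1.0 / new_time


if __name__ == "__main__":
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"prefill share {frac:.2f} -> {end_to_end_speedup(frac):.2f}x")
```

The result always falls between 2.23x and 3.05x. Under this model, a 2.5x end-to-end figure corresponds to a baseline that spends roughly 40% of its time in prefill. That is an inference from the arithmetic, not a number from the source.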

This release targets developers and researchers using AMD Ryzen AI MAX+ systems who want fast, private LLM inference. It lets you run a 27B model interactively and process long documents without cloud costs. The generous 128 GB of unified memory also opens the door to 100B+ MoE models on a single device.
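A quick way to sanity-check whether a model fits is to estimate the size of its weights from the parameter count and the bits stored per weight. The sketch below covers weights only and ignores the KV cache, activations, and quantization metadata, so real footprints are larger. The bit widths are illustrative assumptions, not the formats Lucebox-hub uses, and the 100 GiB budget is the figure from the feature list.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_size_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB (weights only, no overhead)."""
    return n_params * bits_per_weight / 8 / GIB


def fits(n_params, bits_per_weight, budget_gib=100.0):
    """True if the weights alone fit within the given memory budget."""
    return weight_size_gib(n_params, bits_per_weight) <= budget_gib


if __name__ == "__main__":
    # Parameter counts taken from the model names in the article. An MoE
    # model must keep all experts resident, so its total count is used.
    for name, n_params in (("27B dense", 27e9), ("122B MoE", 122e9)):
        for bits in (16, 8, 4):
            size = weight_size_gib(n_params, bits)
            verdict = "fits" if fits(n_params, bits) else "does not fit"
            print(f"{name} @ {bits}-bit: {size:6.1f} GiB, {verdict} in 100 GiB")
```

By this estimate a 27B model fits even at 16 bits per weight, while a 122B model needs roughly 4-bit weights to stay under the budget.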

Developer notes and upcoming plans

The HIP port currently runs prefill without the BSA scoring kernel, leaving about a 3.4x performance gap until a rocWMMA-native sparse attention kernel arrives. Planned updates include multi-row decode GEMV and tile shape tuning for Strix Halo, plus support for large MoE models such as Qwen3.5-122B-A10B. Multi-GPU and Vulkan paths are not yet included in this HIP stack.

"That is 2.23x faster decode and 3.05x faster prefill than llama.cpp HIP on the same silicon." — Source: Reddit