Antirez Shrinks DeepSeek V4 Locally With Deepseek-V4-GGUF

A new set of quantized files for DeepSeek V4 Flash, called Deepseek-V4-GGUF, shrinks the model enough to run on high-end consumer hardware. It is a set of GGUF files, created by well-known developer antirez and tuned specifically for the DS4 inference engine. By using mixed-precision quantization, the files let users run the reasoning model on a single machine instead of a data center cluster.
Antirez designed two main versions to match different hardware budgets. He applied lighter compression to the components that every token passes through, such as the router and attention projections, while aggressively shrinking the numerous routed experts, each of which handles only a fraction of tokens. This approach preserves much of the model’s original behavior while slashing memory requirements.
A quantization recipe built for local hardware
- Q2 version fits on 128 GB Mac machines.
- Q4 version needs around 256 GB of RAM.
- Router and attention projections kept at Q8_0 precision.
- Routed experts compressed to IQ2_XXS or Q4_K.
- Embeddings and learned router weights stored at F16.
- Norms, sinks, and bias tensors left at F32.
- Hash-routing tables kept as I32 integers.
- Optional MTP file for faster speculative decoding.
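To see why the choice of expert quantization dominates the file size, the recipe above can be sketched as a small Python estimate. The bits-per-weight figures follow the standard GGUF block layouts, but the parameter counts are round placeholder numbers chosen only for illustration. They are not the real DeepSeek V4 Flash tensor sizes, and the grouping of tensors into four classes is a simplification of the list above.

```python
# Bits per weight for the GGUF tensor types named in the recipe.
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "I32": 32.0,
    "F16": 16.0,
    "Q8_0": 8.5,        # 32 weights stored in 34 bytes
    "Q4_K": 4.5,        # 256 weights stored in 144 bytes
    "IQ2_XXS": 2.0625,  # 256 weights stored in 66 bytes
}

# HYPOTHETICAL parameter counts per tensor class (placeholders, not measured).
PLACEHOLDER_PARAMS = {
    "routed_experts": 100e9,
    "router_and_attention": 5e9,
    "embeddings": 1e9,
    "norms_sinks_biases": 0.01e9,
}


def recipe(expert_type):
    """Return the tensor-class -> quantization-type mapping for one variant."""
    return {
        "routed_experts": expert_type,     # the only class that differs
        "router_and_attention": "Q8_0",
        "embeddings": "F16",
        "norms_sinks_biases": "F32",
    }


def estimate_gb(mapping, params=PLACEHOLDER_PARAMS):
    """Estimate file size in GB (10^9 bytes), ignoring metadata overhead."""
    total_bits = sum(params[name] * BITS_PER_WEIGHT[qtype]
                     for name, qtype in mapping.items())
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    for label, expert_type in (("Q2", "IQ2_XXS"), ("Q4", "Q4_K")):
        print(label, round(estimate_gb(recipe(expert_type)), 1), "GB (placeholder sizes)")
```

With these placeholder sizes the routed experts account for most of the total in both variants, so switching them from 4-bit to 2-bit roughly halves the estimate while the higher-precision tensors barely move it.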
The files target privacy-conscious professionals and serious hobbyists with Apple Silicon Macs or workstations packing lots of unified memory. Anyone who wants to avoid sending data to the cloud can now query the model entirely offline. The smaller Q2 file opens the door for local code analysis and complex reasoning tasks on a single powerful computer.
What the developer says about the project
The two GGUF files are byte-for-byte identical in every component except the routed experts, which use 2-bit IQ2_XXS quantization in one file and 4-bit Q4_K in the other. This asymmetric strategy relies on the idea that each expert processes only a fraction of tokens, so heavier compression there causes less quality loss. The files are released under the MIT license: copyright in the base model belongs to DeepSeek, and the GGUF versions are redistributed under the original release terms.
"The motivation behind the asymmetry: the routed experts are the majority of the parameter count but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of router, projections, or shared experts." — Source: Hugging Face
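The "fraction of tokens" argument can be illustrated with a toy routing simulation. This is only a sketch: the expert count, the top-k value, and the uniform random routing are assumptions for illustration, not DeepSeek V4 Flash's actual configuration or router behavior. A learned router would be less evenly balanced than this.

```python
import random


def expert_token_fractions(num_experts, top_k, num_tokens, seed=0):
    """Route each token to top_k distinct experts chosen uniformly at random
    and return, per expert, the fraction of tokens it processed."""
    rng = random.Random(seed)
    counts = [0] * num_experts
    for _ in range(num_tokens):
        for expert in rng.sample(range(num_experts), top_k):
            counts[expert] += 1
    return [count / num_tokens for count in counts]


# HYPOTHETICAL configuration: 64 routed experts, 4 active per token.
fractions = expert_token_fractions(num_experts=64, top_k=4, num_tokens=10_000)
mean_fraction = sum(fractions) / len(fractions)

if __name__ == "__main__":
    # The router, attention projections, and shared experts see every token
    # (fraction 1.0); each routed expert sees about top_k / num_experts.
    print("mean fraction per routed expert:", mean_fraction)
    print("busiest routed expert:", max(fractions))
```

Under these assumptions each routed expert touches roughly one token in sixteen, whereas the always-active components touch all of them, which is the intuition behind spending the precision budget on the latter.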