Llmfan46 has released Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved, a modified version of Qwen3.6-35B-A3B that cuts unwanted refusals by 88% while keeping all 19 multi-token prediction (MTP) layers fully intact. The model uses an abliteration […]
Beellama.cpp is a fork of the popular llama.cpp project that squeezes extra speed and memory efficiency out of local GGUF model inference. It adds DFlash speculative decoding, TurboQuant KV-cache compression, […]
ds4.pinokio is a new launcher and browser interface that brings the massive DeepSeek V4 Flash AI model to Apple Silicon Macs. It builds on the ds4.c Metal-only inference engine created […]
The NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16 release packs three distinct reasoning model sizes — 30 billion, 23 billion, and 12 billion parameters — into a single checkpoint file. Rather than requiring separate training runs, […]
Coming across tokens-per-second benchmarks is easy, but truly understanding what "47 tok/s" feels like while you work is much harder. A new open-source tool called Tokenspeed solves this problem by […]
ExLlamaV3 is an inference library that lets you run large language models on consumer graphics cards. It introduces the EXL3 quantization format, which compresses models to very low bitrates while […]
Unsloth has released Qwen3.6-27B-GGUF-MTP, a quantized model file that preserves the multi-token prediction (MTP) layers from Qwen’s latest 27-billion-parameter language model. This GGUF format makes it possible to run the […]
The new release, needle, is a tiny 26-million-parameter open-source AI model purpose-built for function calling, or tool use. It interprets a user's plain-text query and outputs a structured […]
Lucebox-hub is a collection of hand-tuned LLM inference servers that push consumer GPUs to their limits. The latest release adds DFlash speculative decoding and PFlash speculative prefill for AMD Ryzen […]