News

May 23, 2026
Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved Cuts Refusals by 88%

Llmfan46 has released Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved, a modified version of Qwen3.6-35B-A3B that cuts unwanted refusals by 88% while keeping all 19 multi-token prediction (MTP) layers fully intact. The model uses an abliteration […]

May 23, 2026
Anbeeld Supercharges Local AI With Beellama.cpp Speed Overhaul

Beellama.cpp is a fork of the popular llama.cpp project that squeezes extra speed and memory efficiency out of local GGUF model inference. It adds DFlash speculative decoding, TurboQuant KV-cache compression, […]

May 23, 2026
ds4.pinokio Slots a Full DeepSeek V4 Brain Into Apple Silicon Macs With One Click

ds4.pinokio is a new launcher and browser interface that brings the massive DeepSeek V4 Flash AI model to Apple Silicon Macs. It builds on the ds4.c Metal-only inference engine created […]

May 23, 2026
NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16 Unfolds Three Models

The NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16 release packs three distinct reasoning model sizes — 30 billion, 23 billion, and 12 billion parameters — into a single checkpoint file. Rather than requiring separate training runs, […]

May 23, 2026
Tokenspeed Streams Fake Tokens To Let You Feel LLM Speed

Tokens-per-second benchmarks are easy to find, but knowing what "47 tok/s" actually feels like while you work is much harder. A new open-source tool called Tokenspeed solves this problem by […]

May 19, 2026
ExLlamaV3 Supercharges Home AI with Triple-Speed DFlash Decoding

ExLlamaV3 is an inference library that lets you run large language models on consumer graphics cards. It introduces the EXL3 quantization format, which compresses models to very low bitrates while […]

May 19, 2026
Unsloth Drops Qwen3.6-27B-GGUF-MTP For 2x Faster Local AI

Unsloth has released Qwen3.6-27B-GGUF-MTP, a quantized model file that preserves the multi-token prediction (MTP) layers from Qwen’s latest 27-billion-parameter language model. This GGUF format makes it possible to run the […]

May 19, 2026
Tiny AI Needle Stitches Seamless Tool Calling For Budget Phones

The new release, needle, is a tiny 26-million-parameter open-source AI model purpose-built for function calling, or tool use. It interprets a user's plain text query and outputs a structured […]

May 19, 2026
Lucebox-Hub Supercharges AMD Strix Halo With DFlash And PFlash

Lucebox-hub is a collection of hand-tuned LLM inference servers that push consumer GPUs to their limits. The latest release adds DFlash speculative decoding and PFlash speculative prefill for AMD Ryzen […]
