Liquid AI Drops LFM2.5-8B-A1B: A Lean AI That Runs Locally

LFM2.5-8B-A1B is an open model with 8.3 billion total parameters, of which only 1.5 billion are active per token, so it can run assistant workloads efficiently on local hardware. It handles multi-step tool calls, structured outputs, and 128,000-token contexts across nine languages. The model is a drop-in upgrade that moves real-world agent behavior from proof-of-concept to daily-driver status on consumer GPUs and even CPUs.
Liquid AI built this release by scaling pre-training to 38 trillion tokens and adding a large reinforcement learning stage on top of its LFM2 architecture. That investment lifted the score on the AA-Omniscience Index, which penalizes hallucinated answers and where higher is better, from a near-failing −78 to a more usable −25, while also improving instruction following and math reasoning. The gains make it competitive with models that activate twice as many parameters, without giving up the memory savings that small labs and privacy-focused workflows need.
Fast, low-memory inference that fits prosumer hardware
- 8.3B total parameters, only 1.5B active per token.
- 128,000 token context window.
- Outperforms larger dense and MoE models on agent tasks.
- Up to 18.5K output tokens per second on one GPU.
- Day-one support for llama.cpp, MLX, vLLM, and SGLang.
- 91.8% on IFEval instruction following benchmark.
- 64% on BFCLv3 function‑calling benchmark.
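To put the parameter counts above in perspective, here is a back-of-the-envelope sketch of weight memory at a few precisions. The bytes-per-parameter figures are idealized assumptions (real quantized formats carry some overhead), and the estimate ignores the KV cache and activations, so treat the numbers as a lower bound rather than a measured footprint.

```python
# Rough weight-memory estimate for a sparsely activated model.
# Parameter counts come from the article; bytes per parameter are
# idealized assumptions, not measured values for any specific format.
TOTAL_PARAMS = 8.3e9   # all weights must be resident in memory
ACTIVE_PARAMS = 1.5e9  # weights actually used for each token

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params: float, precision: str) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def active_fraction() -> float:
    """Share of the weights that participate in each token's forward pass."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_gb(TOTAL_PARAMS, precision):.2f} GB of weights")
    print(f"active per token: {active_fraction():.1%} of parameters")
```

The split matters because memory scales with the total parameter count, while per-token compute scales with the active count. That is why a model of this shape can need several gigabytes of storage yet decode at the speed of a much smaller one.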
This model fits anyone running a local personal assistant on a single GPU, a MacBook, or a privacy‑sensitive edge device away from the cloud. Hobbyists and small agencies get a reasoning‑tuned companion that chains multiple API calls without ballooning VRAM, while the fast CPU path helps shops that can’t dedicate a GPU to every task. Professionals who need tool‑heavy workflows in regulated environments gain a model that stays on their own metal and still follows complex multi‑step instructions.
What’s under the hood and where it still trips
The hybrid backbone mixes gated short convolutions with a small number of grouped-query attention blocks, a design that delivers twice the CPU decode speed of similarly sized transformers. The model now supports function calling with Pythonic tool-call tokens for cleaner integration, and fine-tuning is strongly advised for domain-specific work. Heavy coding tasks and pure knowledge-base QA without retrieval remain weak spots, where a larger dense model or a paired retriever would be a better fit.
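As an illustration of what handling Pythonic tool calls can look like on the application side, the sketch below parses a string such as `[get_weather(city="Paris")]` with Python's `ast` module and dispatches it to a registered function. The exact output format and the special tokens wrapping the call are not specified in this article, so the input shape here is an assumption, and `get_weather`, `TOOLS`, `parse_pythonic_calls`, and `dispatch` are hypothetical names used only for the example.

```python
import ast


def parse_pythonic_calls(text: str) -> list[dict]:
    """Parse '[f(a=1), g(b="x")]' into [{"name": ..., "arguments": {...}}].

    Assumes any special tokens around the call list were already stripped.
    Only keyword arguments with literal values are accepted, so no model
    output is ever executed as code.
    """
    body = ast.parse(text.strip(), mode="eval").body
    nodes = body.elts if isinstance(body, ast.List) else [body]
    calls = []
    for node in nodes:
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("expected a plain function call")
        if node.args:
            raise ValueError("positional arguments are not handled in this sketch")
        arguments = {}
        for kw in node.keywords:
            if kw.arg is None:
                raise ValueError("**kwargs unpacking is not handled in this sketch")
            # literal_eval only accepts constants, lists, dicts, and similar.
            arguments[kw.arg] = ast.literal_eval(kw.value)
        calls.append({"name": node.func.id, "arguments": arguments})
    return calls


def get_weather(city: str, unit: str = "celsius") -> str:
    """Hypothetical stub tool; a real one would call a weather API."""
    return f"weather for {city} in {unit}"


# Hypothetical allow-list: only functions registered here can be invoked.
TOOLS = {"get_weather": get_weather}


def dispatch(text: str) -> list:
    """Run each parsed call against the allow-list and collect results."""
    results = []
    for call in parse_pythonic_calls(text):
        if call["name"] not in TOOLS:
            raise KeyError(f"unknown tool: {call['name']}")
        results.append(TOOLS[call["name"]](**call["arguments"]))
    return results


if __name__ == "__main__":
    print(dispatch('[get_weather(city="Paris", unit="celsius")]'))
```

Parsing with `ast` instead of `eval` keeps the model's output as data: a call that names an unregistered function or passes a non-literal argument is rejected before anything runs.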
“On-device personal assistant: Designed to power real-life applications, chaining tool calls, and following complex instructions on all devices.” — Source: Hugging Face