KeyLM-75M Proves Less Is More with Just 18B Tokens

Eclipse-Senpai has released KeyLM-75M, a compact 75-million-parameter language model trained from scratch on roughly 18 billion tokens. The base model produces plain text completions and is accompanied by an instruction-tuned version for chat and simple task following. An Apache 2.0 license allows both hobbyists and small businesses to use, modify, and redistribute the weights freely.
Eclipse-Senpai is a solo community developer. Training used only a small fraction of the data typical for models of this size: the comparable SmolLM-135M, for example, was trained on over 600 billion tokens. The goal is to demonstrate that useful, lightweight language models can be built with far less compute while still being practical on consumer GPUs.
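To put the data gap in numbers, the short calculation below compares tokens seen per parameter for the two models. It uses only the figures quoted in this article, which are approximate; the SmolLM-135M token count is stated as "over 600 billion", so its ratio is a lower bound.

```python
# Figures as quoted in the article; all are approximate.
keylm_params = 75e6
keylm_tokens = 18e9
smollm_params = 135e6
smollm_tokens = 600e9  # "over 600 billion", so treat this as a lower bound


def tokens_per_param(tokens, params):
    """Training tokens seen per model parameter."""
    return tokens / params


keylm_ratio = tokens_per_param(keylm_tokens, keylm_params)
smollm_ratio = tokens_per_param(smollm_tokens, smollm_params)
data_fraction = keylm_tokens / smollm_tokens

print(f"KeyLM-75M:   {keylm_ratio:.0f} tokens per parameter")
print(f"SmolLM-135M: {smollm_ratio:.0f} tokens per parameter (at least)")
print(f"KeyLM-75M saw at most {data_fraction:.1%} of SmolLM-135M's token count")
```

On these figures, KeyLM-75M saw about 240 tokens per parameter against more than 4,400 for SmolLM-135M, or roughly 3% of the total data.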
Efficient inference on everyday hardware
- 75 million parameters, 24 transformer layers.
- Trained from scratch on 18B tokens.
- Instruction-tuned chat version available separately.
- GGUF quantized format for CPU deployment.
- 2048-token context window with RoPE embeddings.
- Grouped-query attention for faster generation.
- Apache 2.0 open source license.
- English-only, out-of-the-box text completion.
This model suits privacy-conscious professionals who want a fully offline, inspectable language tool that runs comfortably on a single consumer GPU or even a laptop. Hobbyists and students can study its from-scratch training recipe and use the tiny footprint to experiment with fine-tuning or local assistants without cloud costs. Small agencies may find it useful for lightweight automation like drafting short copy, summarizing, or generating simple structured text.
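A rough sense of that footprint comes from multiplying the parameter count by the storage width of each weight. The sketch below assumes idealized widths of 32, 16, 8, and 4 bits per weight. Real GGUF quantization formats store extra per-block metadata, and inference also needs memory for activations and the KV cache, so actual usage will be higher.

```python
PARAMS = 75_000_000  # parameter count from the article

# Assumed storage widths in bits per weight (idealized; real quantized
# files carry additional scale metadata and will be somewhat larger).
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_megabytes(params, bits):
    """Raw weight storage in megabytes (1 MB = 10**6 bytes)."""
    return params * bits / 8 / 1e6


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_megabytes(PARAMS, bits):.1f} MB")
```

Under these assumptions the weights alone take about 150 MB at 16-bit precision and under 40 MB at 4-bit.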
Honest performance and known limits
The developer is upfront that KeyLM-75M possesses minimal factual knowledge and scores near random on knowledge-intensive benchmarks. Instruction tuning only adds instruction-following ability; reasoning and factual accuracy remain largely unchanged. Safety alignment is absent, so users need to add their own output filtering before any public-facing use.
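As a starting point for that kind of output filtering, a minimal sketch might wrap the model's generated text in a pattern check before it reaches the user. Everything here is hypothetical and not part of the KeyLM release: the function name, the placeholder message, and the two example patterns are illustrative only, and a public-facing deployment would likely need a much broader policy or a dedicated moderation model.

```python
import re

# Hypothetical blocklist for illustration only; not from the KeyLM project.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),             # resembles a US SSN
    re.compile(r"\bpassword\s*[:=]", re.IGNORECASE),  # leaked credential hint
]

REFUSAL = "[output withheld by filter]"


def filter_output(text: str) -> str:
    """Return text unchanged, or a placeholder if any blocked pattern matches."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            return REFUSAL
    return text


print(filter_output("Here is a short product description."))
print(filter_output("The admin password: hunter2"))
```

Withholding the whole response rather than redacting the matching span is the more conservative choice, at the cost of occasionally discarding harmless text.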
“On IFEval (instruction following) the 75M instruct model scores slightly higher than the original SmolLM-135M-Instruct at about half the parameters and a fraction of the training data.” — Source: Reddit