Tiny AI Needle Stitches Seamless Tool Calling For Budget Phones

    
By vramkickedin | May 19, 2026 at 7:57 pm | 2 min read

The new release, needle, is a tiny 26-million-parameter open-source AI model purpose-built for function calling, also known as tool use. It interprets a user's plain-text query and outputs a structured JSON command that triggers a specific tool, such as checking the weather or controlling smart home devices. The project strips the architecture down to attention only, removing all feed-forward networks, and argues that massive models are overkill for this specific retrieval-and-assembly task.
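To make the "plain text in, JSON command out" idea concrete, here is a minimal sketch of what an app could do with such a command once the model has produced it. The JSON shape (`name` plus `arguments`), the tool names, and the stub functions are all assumptions made for illustration; the article does not specify needle's actual output schema.

```python
import json

# Hypothetical tools: names and signatures are illustrative, not part of needle.
def get_weather(city: str) -> str:
    return f"(stub) weather lookup for {city}"

def set_light(room: str, on: bool) -> str:
    state = "on" if on else "off"
    return f"(stub) {room} light turned {state}"

TOOLS = {"get_weather": get_weather, "set_light": set_light}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call and run the matching registered function."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    return TOOLS[name](**args)

# Assumed model output for the query "What's the weather in Lagos?"
example_output = '{"name": "get_weather", "arguments": {"city": "Lagos"}}'
result = dispatch(example_output)
print(result)
```

Validating the tool name and argument shape before calling anything is a sensible default with a small model, since a malformed command should fail loudly rather than trigger the wrong device.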

The team at cactus-compute created needle by distilling knowledge from Google’s Gemini 3.1 model. They conducted investigations that challenged the common belief that on-device agents require large models for tool calling. Their solution is an experimental, hyper-efficient system designed to run smoothly on budget phones, smartwatches, and glasses.

Built for pure attention, no MLPs

Key Features

26 million parameters using encoder-decoder architecture.
Achieves 6000 tokens/sec prefill on consumer devices.
Decode speed hits 1200 tokens per second.
Removes all FFN layers for maximum efficiency.
Finetune locally on Mac or PC via UI.
Generates training data automatically using Gemini.
Supports INT4 precision during training.
Pre-trained on 200B tokens in 27 hours.
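A few of the figures above are easier to judge with some back-of-envelope arithmetic. The sketch below rests on stated assumptions: the FFN share uses a textbook transformer layer (four attention projections, a 4x feed-forward expansion, no biases or gating), and the width of 512, the 300-token prompt, and the 40-token output are hypothetical values, not needle's published dimensions or workloads.

```python
def ffn_share(d_model: int, ffn_multiplier: int = 4) -> float:
    """Fraction of per-layer weights held by the FFN in a textbook layer."""
    attention = 4 * d_model * d_model             # Q, K, V, and output projections
    ffn = 2 * ffn_multiplier * d_model * d_model  # up and down projections
    return ffn / (attention + ffn)

def tokens_per_second(total_tokens: float, hours: float) -> float:
    """Aggregate training throughput implied by a token count and wall time."""
    return total_tokens / (hours * 3600)

def call_latency_seconds(prompt_tokens: int, output_tokens: int,
                         prefill_tps: float = 6000, decode_tps: float = 1200) -> float:
    """Prefill time plus decode time at the speeds quoted in the list above."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Hypothetical width of 512: the FFN holds about two thirds of the layer weights.
print(f"FFN share of a textbook layer: {ffn_share(512):.2f}")

# 200B tokens in 27 hours works out to roughly 2.06 million tokens per second.
print(f"Pre-training throughput: {tokens_per_second(200e9, 27):,.0f} tokens/sec")

# Hypothetical 300-token prompt and 40-token JSON reply: about 0.083 seconds.
print(f"Estimated call latency: {call_latency_seconds(300, 40):.3f} s")
```

The throughput figure is an aggregate across whatever training hardware was used, which the article does not describe.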

This tool is for developers and hobbyists who want to build agentic experiences on resource-limited hardware. Privacy-focused professionals will benefit from keeping tool-calling data local, eliminating the need to send queries to a cloud service. Small teams can easily adapt needle to their own custom APIs by running a finetuning process directly on a personal laptop.

What the developers say

The developers note that while needle beats much larger models like Qwen-0.6B on single-shot function calling, its narrow design means it lacks the conversational depth of general-purpose alternatives. Small models can be temperamental, so they strongly recommend testing with your own tools and finetuning with at least 120 examples per tool to avoid overfitting. The team believes the "no FFN" (no feed-forward network) design generalizes to any task where a model has access to external structured knowledge, with more experimental results planned for future publication.
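The 120-examples-per-tool guideline is easy to check before starting a finetuning run. The following is a small sketch under assumptions: the record layout (a dict with `query` and `tool` keys), the tool names, and the helper `underrepresented_tools` are hypothetical and do not reflect needle's actual training data format.

```python
from collections import Counter

MIN_EXAMPLES_PER_TOOL = 120  # minimum quoted from the developers' recommendation

def underrepresented_tools(examples, tool_names, minimum=MIN_EXAMPLES_PER_TOOL):
    """Return {tool: count} for every tool with fewer than `minimum` examples."""
    counts = Counter(example["tool"] for example in examples)
    # Counter yields 0 for tools that never appear in the dataset.
    return {name: counts[name] for name in tool_names if counts[name] < minimum}

# Hypothetical dataset: 150 weather examples, 40 light examples, no timer examples.
dataset = (
    [{"query": f"weather query {i}", "tool": "get_weather"} for i in range(150)]
    + [{"query": f"light query {i}", "tool": "set_light"} for i in range(40)]
)

gaps = underrepresented_tools(dataset, ["get_weather", "set_light", "set_timer"])
for tool, count in gaps.items():
    print(f"{tool}: {count} examples, {MIN_EXAMPLES_PER_TOOL - count} more needed")
```

Passing the full list of tool names, rather than inferring it from the data, means a tool with zero examples is reported instead of silently skipped.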

"We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led to an observation: agentic experiences are built upon tool calling, and massive models are overkill for it." — Source: Reddit

Project Links