StepFun Delivers Step-3.7-Flash MoE Vision Model for Local AI Agents

    
        By vramkickedin    
     | 
    
            May 31, 2026 at 10:30 pm        
    
     | 
    
        3 min read

Step-3.7-Flash is a 198-billion-parameter vision‑language model that uses a sparse mixture‑of‑experts design to activate only about 11 billion parameters per token. It handles images and text natively through a 1.8‑billion‑parameter vision encoder paired with a 196‑billion‑parameter language backbone, delivering up to 400 tokens per second with a 256k‑token context window. Developers can choose from three reasoning levels—low, medium, and high—to balance speed, cost, and depth for different workloads.

The model was created by StepFun to tackle agentic workflows that combine perception, search, and reasoning at scale. It was purpose‑built for high‑throughput production environments where autonomous agents must parse long documents, run multi‑step verification loops, or manage concurrent coding tasks. That focus on reliable tool orchestration and instruction‑following sets it apart from generic chat assistants.

Built for agentic accuracy and speed

Key Features

Sparse MoE architecture activates only 11B parameters.
256k context window with three reasoning levels.
Vision encoder reads UI wireframes and data charts.
Tops SimpleVQA (Search) benchmark at 79.2.
Leads ClawEval‑1.1 agent‑workflow benchmark at 67.1.
Second place on SWE‑Bench PRO code‑repair test.
Up to 400 tokens per second throughput.
Runs locally on Mac Studio with 128GB memory.

This tool suits privacy‑conscious professionals and small agencies that need vision‑language AI without sending sensitive documents to the cloud. Hobbyists with high‑memory workstations can also experiment with a model that achieves frontier‑level visual grounding and autonomous coding task performance. By running locally via vLLM, SGLang, or llama.cpp, users keep full control over data while tapping into the same capabilities offered through StepFun’s API.

What you need for local deployment

Running Step‑3.7‑Flash on your own hardware demands at least 128 GB of unified memory—machines like a Mac Studio, NVIDIA DGX Station, or AMD Ryzen AI Max+ 395‑based system are the realistic minimum. The model ships in several quantized formats (including FP8, BF16, and Q4_K_S GGUF) and works with popular serving frameworks, but expect to set aside around 120 GB of VRAM or unified memory for the GGUF version plus runtime overhead. StepFun also offers a hosted API through its platform and partners like OpenRouter and NVIDIA NIM for teams that prefer not to manage the infrastructure.

“Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding.” — Source: Hugging Face

Project Links