ai-sage sparks fast local AI with GigaChat 3.1 Lightning

Yellow nodes connected to resemble a lightning bolt

GigaChat 3.1 Lightning is a compact language model built for fast local inference on consumer hardware. It uses a Mixture-of-Experts architecture with 10 billion total parameters, but only activates 1.8 billion during inference to keep processing efficient.

The model was created by ai-sage and released under an MIT license. It is aimed primarily at English and Russian speakers, though its training data covers 14 languages, which makes it usable for multilingual assistant workloads.

Speed and efficiency features

  • 10B total parameters with only 1.8B active at inference for reduced compute needs.
  • 256k context window supports long conversations and document analysis.
  • Native FP8 training lets FP8 inference run faster than BF16 in some cases.
  • Multi-Token Prediction generates multiple tokens per forward pass.
  • Multi-head Latent Attention compresses memory usage for long contexts.
  • Strong tool calling, with a score of 0.76 on the BFCLv3 benchmark.
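
To put the parameter counts in perspective, here is a rough sketch of the arithmetic. It assumes the usual storage costs of 2 bytes per weight for BF16 and 1 byte for FP8, and it ignores activations, the KV cache, and runtime overhead. The function names are illustrative and are not part of any GigaChat tooling.

```python
# Back-of-the-envelope numbers for a Mixture-of-Experts model with
# 10B total parameters and 1.8B active at inference (figures from the article).
TOTAL_PARAMS = 10e9
ACTIVE_PARAMS = 1.8e9

# Assumed storage cost per weight; quantized GGUF files may use fewer bits.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def active_fraction(total_params, active_params):
    """Share of the weights that take part in inference at any one time."""
    return active_params / total_params


def weight_memory_gb(params, bytes_per_param):
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


print(f"active share of weights: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.0%}")
for precision, size in BYTES_PER_PARAM.items():
    print(f"{precision}: ~{weight_memory_gb(TOTAL_PARAMS, size):.0f} GB for all weights")
```

The point of the split is that compute scales with the active parameters, while all experts still have to be stored, so memory scales with the total.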

Small agencies and hobbyists building chatbots or assistants can benefit from this model's balance of capability and speed. Its Multi-head Latent Attention compresses the KV cache into a latent representation, which cuts memory usage during long conversations. Users running local AI for coding help, reasoning tasks, or function calling workflows get a 256k context window that can take substantial input without truncation.

Training and development choices

The ai-sage team pretrained this model from scratch using their own compute rather than fine-tuning an existing model. As the developers note,

'Both models are pretrained from scratch using our own data and compute — thus, it's not a DeepSeek finetune.'

The training corpus included roughly 5.5 trillion synthetic tokens covering mathematics, code, and question-answer pairs.

Native FP8 training during the preference optimization stage allows the model to run faster in FP8 than in BF16 in some cases. Throughput tests show FP8 combined with Multi-Token Prediction delivers 38% faster inference compared to the baseline. GGUF builds are available on Hugging Face and run with llama.cpp for deployment on consumer GPUs.
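
As a rough guide to what a 38% speedup means in practice, the sketch below converts a throughput gain into time saved. It assumes the figure refers to tokens per second. The baseline rate of 100 tokens per second is a made-up placeholder, not a measured result.

```python
def time_saved_fraction(speedup_pct):
    """Fraction of generation time saved when throughput rises by speedup_pct percent."""
    return 1 - 1 / (1 + speedup_pct / 100)


def generation_seconds(n_tokens, tokens_per_second):
    """Time to generate n_tokens at a steady throughput."""
    return n_tokens / tokens_per_second


BASELINE_TPS = 100.0              # hypothetical baseline throughput
FASTER_TPS = BASELINE_TPS * 1.38  # 38% higher throughput, per the article
N_TOKENS = 1_000

print(f"baseline: {generation_seconds(N_TOKENS, BASELINE_TPS):.2f} s")
print(f"faster:   {generation_seconds(N_TOKENS, FASTER_TPS):.2f} s")
print(f"time saved: {time_saved_fraction(38):.1%}")
```

A 38% gain in throughput shortens generation time by a smaller percentage, because time is the reciprocal of throughput.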

Get GigaChat3.1-10B-A1.8B-GGUF on Hugging Face. View the full GigaChat 3.1 collection for additional model variants.