NVIDIA Unlocks Speed with New gpt-oss-puzzle-88B Model

NVIDIA has released gpt-oss-puzzle-88B, a large language model built for efficient deployment on H100-class hardware. The model uses a Mixture-of-Experts architecture with 88 billion parameters and is designed to handle reasoning tasks faster than its parent model while maintaining accuracy.
This release targets production environments where speed and memory efficiency matter. NVIDIA created it using Puzzle, a framework that automatically optimizes neural network architecture after training. The result is a model that takes up less memory and runs faster than the original, making it practical for companies running AI at scale.
Model Size: 88B parameters & VRAM GPU: requirements vary
Performance and efficiency features
- Delivers up to 2.82× faster throughput on a single H100 GPU compared to the parent model.
- Supports context lengths up to 128K tokens for processing long documents.
- Offers three reasoning effort modes: low, medium, and high for balancing speed versus depth.
- Reduces KV-cache memory footprint by approximately 40% through selective window attention.
- Ready for commercial use under NVIDIA's licensing terms.
Teams deploying chatbots or reasoning systems can adjust the effort level based on their needs. The low setting provides quick responses for simple questions, while the high setting enables multi-step reasoning for complex problems. This flexibility helps organizations control costs when running the model in production.
How NVIDIA built gpt-oss-puzzle-88B
The development team used a technique called neural architecture search to reshape the model after its initial training. This process removed unnecessary experts from certain layers while keeping important ones intact. According to the project documentation, the model
'achieves 1.63× throughput improvement in long-context scenarios on an 8×H100 node'
compared to the original version.
Knowledge distillation helped the smaller model learn from its larger parent, followed by reinforcement learning to sharpen reasoning skills. The team also applied MXFP4 quantization to expert weights and FP8 scaling to the key-value cache, which effectively doubles the token capacity that fits in memory.
Get gpt-oss-puzzle-88B on Hugging Face.