NVIDIA Unlocks Speed with New gpt-oss-puzzle-88B Model

    
        By vramkickedin    
     | 
    
            March 31, 2026 at 4:09 pm        
    
     | 
    
        2 min read

NVIDIA has released gpt-oss-puzzle-88B, a large language model built for efficient deployment on H100-class hardware. The model uses a Mixture-of-Experts architecture with 88 billion parameters and is designed to handle reasoning tasks faster than its parent model while maintaining accuracy.

This release targets production environments where speed and memory efficiency matter. NVIDIA created it using Puzzle, a framework that automatically optimizes neural network architecture after training. The result is a model that takes up less memory and runs faster than the original, making it practical for companies running AI at scale.

Model Size: 88B parameters & VRAM GPU: requirements vary

Performance and efficiency features

Delivers up to 2.82× faster throughput on a single H100 GPU compared to the parent model.
Supports context lengths up to 128K tokens for processing long documents.
Offers three reasoning effort modes: low, medium, and high for balancing speed versus depth.
Reduces KV-cache memory footprint by approximately 40% through selective window attention.
Ready for commercial use under NVIDIA's licensing terms.

Teams deploying chatbots or reasoning systems can adjust the effort level based on their needs. The low setting provides quick responses for simple questions, while the high setting enables multi-step reasoning for complex problems. This flexibility helps organizations control costs when running the model in production.

How NVIDIA built gpt-oss-puzzle-88B

The development team used a technique called neural architecture search to reshape the model after its initial training. This process removed unnecessary experts from certain layers while keeping important ones intact. According to the project documentation, the model

'achieves 1.63× throughput improvement in long-context scenarios on an 8×H100 node'

compared to the original version.

Knowledge distillation helped the smaller model learn from its larger parent, followed by reinforcement learning to sharpen reasoning skills. The team also applied MXFP4 quantization to expert weights and FP8 scaling to the key-value cache, which effectively doubles the token capacity that fits in memory.

Get gpt-oss-puzzle-88B on Hugging Face.