NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16 Unfolds Three Models


The NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16 release packs three distinct reasoning model sizes — 30 billion, 23 billion, and 12 billion parameters — into a single checkpoint file. Rather than requiring separate training runs, the smaller 23B and 12B models are built directly inside the 30B parent and can be extracted zero-shot with a provided script. This design cuts storage use, simplifies deployment, and allows the same set of weights to serve different speed and accuracy needs.

NVIDIA, which also released the Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 and Lyra 2.0 models, created this elastic model by applying a novel post-training method to Nemotron 3 Nano 30B. The entire family was produced with just 160 billion additional tokens, roughly 0.6% of the original pretraining budget. A learnable router maps any parameter budget to the optimal nested configuration, so all three variants share the same underlying weights.
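
To make the nesting idea concrete, here is a toy sketch in Python. It is not NVIDIA's method: a fixed lookup table stands in for the learned router, the "keep fractions" are invented for illustration, and the weights are a tiny list of lists rather than real tensors. It only shows how a smaller model can be read out of a parent's weights without retraining.

```python
# Toy illustration only. CONFIGS, pick_config and slice_weights are hypothetical
# names; the keep fractions are made up and do not come from the release.

# (label, nominal parameters in billions, fraction of rows/columns kept)
CONFIGS = [
    ("12B", 12, 0.5),
    ("23B", 23, 0.75),
    ("30B", 30, 1.0),
]

def pick_config(budget_b):
    """Stand-in for the router: largest configuration that fits the budget."""
    fitting = [c for c in CONFIGS if c[1] <= budget_b]
    if not fitting:
        raise ValueError(f"no configuration fits a {budget_b}B budget")
    return max(fitting, key=lambda c: c[1])

def slice_weights(parent, keep_fraction):
    """Copy the leading rows and columns of the parent matrix into a child."""
    rows = max(1, int(len(parent) * keep_fraction))
    cols = max(1, int(len(parent[0]) * keep_fraction))
    return [row[:cols] for row in parent[:rows]]

# A 4x4 "parent" weight matrix with recognisable values.
PARENT = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

label, _, keep = pick_config(25)      # a 25B budget selects the 23B entry
child = slice_weights(PARENT, keep)   # 3x3 block taken from the parent
print(label, child)
```

Every value in the child comes straight from the parent, which is the sense in which the variants share weights. The real router may choose which components to keep in ways a simple leading-block slice does not capture.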

One checkpoint, three model sizes

Key features
  • One BF16 file holds 30B, 23B, and 12B models.
  • Zero-shot slicing script extracts smaller variants instantly.
  • All three sizes require only 58.9 GB total storage.
  • Elastic budget control can boost throughput up to 1.9×.
  • 12B variant runs on an RTX 5080 or RTX Pro 6000.
  • BF16 accuracy matches or exceeds the original 30B.
  • FP8 and NVFP4 quantized versions recover more than 97% of BF16 accuracy.
  • Commercial use allowed; six languages supported.
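
For a rough sense of the storage saving, the arithmetic can be sketched as follows. It assumes 2 bytes per parameter for BF16, uses the nominal 30B, 23B, and 12B counts rather than exact ones, ignores file metadata, and treats the 58.9 GB figure as decimal gigabytes, so the result is an estimate only.

```python
# Back-of-the-envelope estimate; all inputs are nominal figures.
BYTES_PER_PARAM_BF16 = 2  # BF16 stores each parameter in 16 bits

def bf16_size_gb(params_billions):
    """Approximate BF16 weight size in GB for a nominal parameter count."""
    return params_billions * BYTES_PER_PARAM_BF16

# Three separate checkpoints at nominal sizes.
separate_gb = sum(bf16_size_gb(p) for p in (30, 23, 12))

# Size quoted for the single elastic checkpoint.
elastic_gb = 58.9

savings = 1 - elastic_gb / separate_gb
print(f"separate: {separate_gb} GB, elastic: {elastic_gb} GB, saved: {savings:.0%}")
```

Under these assumptions, three standalone BF16 checkpoints would take about 130 GB, so the single file saves a little over half the space.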

If you’ve ever wanted to work with a 30B-class model on a single consumer GPU, the 12B version fits comfortably on an RTX 5080. Developers who need to test different model sizes for various tasks can extract them from one download instead of managing three separate checkpoints. Small teams also get a practical way to trade off speed and quality — using a smaller model for heavy reasoning work and the 30B for final answers.

Current limits and what’s next

The dynamic elastic budget control feature, which swaps model sizes during a single generation, currently requires a custom inference path and isn’t yet available in standard vLLM. The extracted 23B and 12B models work normally with vLLM after being sliced out of the parent checkpoint. Native vLLM integration for seamless mid-generation switching is under active development by NVIDIA.
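
The idea behind budget control can be approximated with two ordinary calls: a smaller model drafts the reasoning, and the larger one writes the final answer. The sketch below is only an approximation of that flow, not NVIDIA's custom inference path. The function `two_phase_generate` and both stub models are hypothetical, and the real feature switches sizes inside a single generation rather than across two calls.

```python
from typing import Callable, Tuple

# Hypothetical signature: (prompt, max_new_tokens) -> generated text
Generate = Callable[[str, int], str]

def two_phase_generate(
    prompt: str,
    think_model: Generate,
    answer_model: Generate,
    think_tokens: int = 64,
    answer_tokens: int = 32,
) -> Tuple[str, str]:
    """Draft reasoning with one model, then produce the answer with another."""
    reasoning = think_model(prompt, think_tokens)
    # The answer model sees the original prompt plus the drafted reasoning.
    final = answer_model(prompt + reasoning, answer_tokens)
    return reasoning, final

# Stub models so the sketch runs without any inference engine installed.
def small_stub(prompt: str, max_new_tokens: int) -> str:
    return " [draft reasoning]"

def large_stub(prompt: str, max_new_tokens: int) -> str:
    saw_draft = "[draft reasoning]" in prompt
    return "final answer" + (" (used draft)" if saw_draft else "")

reasoning, final = two_phase_generate("What is 2 + 2?", small_stub, large_stub)
print(reasoning, "->", final)
```

In practice the two stubs would be replaced by calls into whatever engine serves the extracted models, which the article says work normally with vLLM once sliced out.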

“Think of this as like scalable video coding, you have a UHD stream, but strip some layers and you have a HD, or SD stream, it's all a single file stream, not multiple ones.” — Source: Reddit