BitCPM4-CANN-8B Slashes Memory Use 6x While Keeping 95% Smarts

    
By vramkickedin | May 28, 2026 at 11:52 am | 2 min read

BitCPM4-CANN-8B is a new 8-billion-parameter language model whose weights are constrained to just three possible values, cutting inference memory roughly sixfold compared with its full-precision counterpart. The model was trained entirely on Huawei Ascend NPUs with a custom system that applies quantization-aware training throughout the whole training process. Despite the extreme compression, it retains 95.7% of the performance of its full-precision MiniCPM4 counterpart across 11 benchmarks.

The release comes from the BitCPM team at OpenBMB, who built the first publicly reported 1.58-bit training pipeline on domestic (Chinese-made) NPU hardware at this scale. They produced a family of four ternary models, ranging from 0.5B to 8B parameters, and made them openly available on Hugging Face. The effort provides reusable low-bit training infrastructure for the Ascend ecosystem and shows that aggressively quantized models can be used without specialized inference libraries.
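The "1.58-bit" label follows from the ternary format: a weight that can take one of three values carries log2(3), or about 1.58, bits of information. The short Python sketch below works through that arithmetic and compares weight-only memory at different bit widths. The parameter count is a nominal 8 billion and every weight is assumed to be ternary; both are simplifying assumptions for illustration, not figures from the release.

```python
import math

# Information content of one ternary weight: log2(3) bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

# Assumption: a nominal 8e9 parameters, all of them ternary.
# The real model's exact count and layer-by-layer precision are not given here.
NUM_PARAMS = 8_000_000_000


def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GiB, at a given bit width."""
    return num_params * bits_per_weight / 8 / 2**30


fp16_gib = weight_memory_gib(NUM_PARAMS, 16)       # 16-bit floating point
two_bit_gib = weight_memory_gib(NUM_PARAMS, 2)     # simple 2-bit packing
ideal_gib = weight_memory_gib(NUM_PARAMS, BITS_PER_TERNARY_WEIGHT)

print(f"bits per ternary weight: {BITS_PER_TERNARY_WEIGHT:.3f}")
print(f"16-bit weights:          {fp16_gib:.2f} GiB")
print(f"2-bit packed weights:    {two_bit_gib:.2f} GiB ({fp16_gib / two_bit_gib:.1f}x smaller)")
print(f"ideal ternary weights:   {ideal_gib:.2f} GiB ({fp16_gib / ideal_gib:.1f}x smaller)")
```

These ratios are weight-only upper bounds under the stated assumptions. The roughly sixfold saving reported for the model is lower, and the article does not break down where the difference comes from.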

Pseudo-quantized models for drop-in use

Key Features

Compresses weights to ternary values {-1, 0, 1}.
About six times less inference memory required.
Trained natively on Huawei Ascend 910B NPUs.
Only 5% training throughput overhead versus full-precision.
Pseudo-quantized format works with standard Transformers code.
Family includes 0.5B, 1B, 3B, and 8B sizes.

These models are meant for developers and hobbyists who run large models on consumer GPUs or edge devices with limited memory. The memory savings let them handle longer contexts or run more serving replicas on the same hardware. Privacy-conscious professionals also gain a route to fully local inference that avoids cloud dependencies.

Training details and current limitations

The released models use a pseudo-quantized format: the weights are stored in standard floating-point, but their values are already restricted to the ternary levels learned during training. This design lets anyone load them immediately with the Hugging Face Transformers library, with no special quantization libraries or custom kernels needed. The team notes that the smallest 0.5B variant retains only 90.1% of full-precision performance, showing that extreme quantization is more damaging when a model’s capacity is already limited.
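To make the pseudo-quantized idea concrete, here is a minimal pure-Python sketch. It assumes an absmean ternary quantizer of the kind described for BitNet b1.58; the article does not say which quantizer BitCPM uses, and the function names are hypothetical. The sketch shows that fake-quantized weights are ordinary floats limited to three levels, and that packing them at 2 bits each is a separate, lossless step.

```python
import random


def fake_quantize_ternary(weights: list[float]) -> tuple[list[float], float]:
    """Assumed absmean scheme: snap each weight to -scale, 0, or +scale."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    levels = [max(-1, min(1, round(w / scale))) for w in weights]
    # The result is still a list of floats: this is the "fake" part.
    return [q * scale for q in levels], scale


def pack_ternary(fq: list[float], scale: float) -> bytes:
    """Pack fake-quantized weights at 2 bits each (4 weights per byte)."""
    codes = [round(w / scale) + 1 for w in fq]   # map {-1, 0, 1} to {0, 1, 2}
    codes += [1] * (-len(codes) % 4)             # pad with the code for 0
    return bytes(
        codes[i] | (codes[i + 1] << 2) | (codes[i + 2] << 4) | (codes[i + 3] << 6)
        for i in range(0, len(codes), 4)
    )


def unpack_ternary(packed: bytes, scale: float, n: int) -> list[float]:
    """Inverse of pack_ternary: recover the n fake-quantized float weights."""
    codes = [(byte >> shift) & 0b11 for byte in packed for shift in (0, 2, 4, 6)]
    return [(c - 1) * scale for c in codes[:n]]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in for one tensor
fq, scale = fake_quantize_ternary(weights)
packed = pack_ternary(fq, scale)
restored = unpack_ternary(packed, scale, len(fq))

print("distinct levels:", sorted({round(w / scale) for w in fq}))
print("bytes at 16 bits per weight:", len(fq) * 2)   # assumes 2-byte float storage
print("bytes at 2 bits per weight: ", len(packed))
print("round trip is lossless:", restored == fq)
```

Storing the ternary levels as floats keeps the checkpoint loadable by unmodified code, but the storage cost per weight stays that of a float until a packing step like the one above is applied.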

"The models in this repository are in pseudo-quantized (fake quantization) format. This means the weights are stored in standard floating-point format with ternary values already applied during training. " — Source: Hugging Face
