New Project APEX Shrinks Heavy AI Files For Quick PC Use

    
By vramkickedin | April 10, 2026 at 10:38 am | 2 min read

APEX is a new model compression method that reduces file size while preserving output accuracy. By setting precision according to the role each layer and tensor plays in the model, the technique matches high-fidelity results in roughly half the storage space.

Created by the developers behind the LocalAI platform, this release removes the hardware barriers that block private AI deployment. Users can now run complex reasoning tasks on standard office graphics cards instead of depending on rented cloud servers.

Model size: varies | GPU VRAM: 13-24 GB required

Optimized compression for sparse expert models

Assigns different precision levels to model edges and shared components.
Reproduces full-quality text generation at significantly reduced file sizes.
Uses varied training data to improve real-world task accuracy and reduce output drift.
Increases token generation speeds across every available configuration tier.
Integrates with standard inference software without requiring code modifications.
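The role-based precision idea in the list above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the bit widths, the reading of "edges" as the first and last few layers, the `edge_width` parameter, and the tensor naming scheme are not taken from the APEX release.

```python
import re

# Hypothetical bit widths per role; the article does not publish APEX's actual values.
ROLE_BITS = {
    "edge": 8,           # first and last blocks (assumed meaning of "model edges")
    "shared": 8,         # components every token passes through
    "routed_expert": 4,  # sparse experts that only some tokens activate
}

def tensor_role(name: str, n_layers: int, edge_width: int = 2) -> str:
    """Classify a tensor by its name. The naming scheme is illustrative only."""
    match = re.match(r"blk\.(\d+)\.", name)
    if match is None:
        return "shared"  # tensors outside any block, e.g. embeddings
    layer = int(match.group(1))
    if layer < edge_width or layer >= n_layers - edge_width:
        return "edge"
    if "_exps" in name:
        return "routed_expert"
    return "shared"

def assign_bits(names, n_layers):
    """Map each tensor name to the bit width of its role."""
    return {name: ROLE_BITS[tensor_role(name, n_layers)] for name in names}

names = [
    "token_embd.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.10.ffn_gate_exps.weight",
    "blk.10.attn_q.weight",
    "blk.31.ffn_down_exps.weight",
]
bits = assign_bits(names, n_layers=32)
for name, width in bits.items():
    print(f"{name}: {width}-bit")
```

In this sketch only the routed experts in the middle of the network drop to the lower width, which is where most of the parameters of a sparse expert model tend to sit.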

Professionals managing private datasets can deploy these compressed archives directly on local workstations. Security-focused teams running sensitive projects gain reliable processing speeds without transmitting information to external networks.

Balancing compression ratios and output consistency

The engineering group discovered that applying identical compression across all model parts wastes memory on inactive components. Extensive testing showed that lowering precision for rarely used weights while protecting active layers maintains stable performance. The researchers evaluated over twenty-five configuration combinations before finalizing five distinct storage tiers.
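A rough back-of-the-envelope calculation shows why uniform precision is costly for sparse expert models. The parameter counts and bit widths below are invented for illustration and do not describe any published APEX tier.

```python
def model_size_gb(param_counts, bits_per_role):
    """Return storage in GB (10**9 bytes) for parameters grouped by role."""
    total_bits = sum(param_counts[role] * bits_per_role[role] for role in param_counts)
    return total_bits / 8 / 1e9

# Hypothetical sparse expert model: most parameters sit in routed experts.
params = {"routed_expert": 28e9, "shared_and_edge": 2e9}

uniform = model_size_gb(params, {"routed_expert": 8, "shared_and_edge": 8})
mixed = model_size_gb(params, {"routed_expert": 4, "shared_and_edge": 8})

print(f"uniform 8-bit: {uniform:.1f} GB, mixed: {mixed:.1f} GB")
# prints "uniform 8-bit: 30.0 GB, mixed: 16.0 GB"
```

Because the rarely used experts dominate the parameter count in this made-up model, halving only their precision brings the total close to half, while the always-active parts keep their full width.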

KL divergence metrics provided a clearer view of model stability during these trials. 'KL divergence tells a story perplexity doesn't,' the creators explained in a forum announcement. Open-source scripts will arrive soon for users who want to replicate the experiments on different architectures.
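A small toy example illustrates how KL divergence can reveal drift that perplexity misses. Perplexity looks only at the probability given to the correct token, while KL divergence compares the whole output distribution of the compressed model against the original. The distributions below are made up, and this is not the creators' evaluation script.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def perplexity(probs_of_correct_tokens):
    """Perplexity from the probabilities a model gave to the correct tokens."""
    nll = -sum(math.log(p) for p in probs_of_correct_tokens) / len(probs_of_correct_tokens)
    return math.exp(nll)

# Made-up next-token distributions over a four-token vocabulary.
reference = [0.5, 0.3, 0.1, 0.1]   # original model
compressed = [0.5, 0.1, 0.3, 0.1]  # compressed model
correct = 0                        # index of the true next token

ppl_reference = perplexity([reference[correct]])
ppl_compressed = perplexity([compressed[correct]])
kl = kl_divergence(reference, compressed)

print(f"perplexity: {ppl_reference:.3f} vs {ppl_compressed:.3f}")
print(f"KL divergence: {kl:.4f} nats")
```

Both models give the correct token a probability of 0.5, so their perplexity is identical, yet the KL divergence is about 0.22 nats because the compressed model has reshuffled probability among the other tokens. That reshuffling changes what gets sampled during generation even though a perplexity score would not show it.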

You can examine the technical methodology on GitHub or retrieve the optimized weights from Hugging Face.