Torch-Nvenc-Compress Turns Idle Video Chips Into AI Data Superchargers

A video encoder chip made of clear glass and delicate orange circuitry, connected by a cable to two GPU blocks.

Torch-Nvenc-Compress is a new open-source library that uses a GPU’s idle video encoding chip to compress machine learning data. Instead of serving video streams, the normally dormant NVENC hardware compresses tensors like large language model (LLM) memory caches and image-generation activations. This shrinks the data before it travels across slow network cables, effectively speeding up the connection between two computers.
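The compress-before-send idea can be illustrated with a small round trip. The sketch below is not the library's API and does not touch NVENC: it uses Python's built-in zlib as a software stand-in for the hardware codec, and the function names and sample data are hypothetical. It only shows the shape of the pattern: serialize the values, compress them losslessly, ship the smaller payload, and restore the exact original on the other side.

```python
import array
import zlib


def compress_for_transfer(values):
    """Serialize float32 values and compress them losslessly.

    zlib is a stand-in here; the real project uses the GPU's video encoder.
    """
    raw = array.array("f", values).tobytes()
    return raw, zlib.compress(raw, 6)


def decompress_after_transfer(payload):
    """Invert compress_for_transfer on the receiving machine."""
    restored = array.array("f")
    restored.frombytes(zlib.decompress(payload))
    return restored


# Hypothetical, highly regular data standing in for a tensor.
values = [float(i % 16) for i in range(4096)]

raw, payload = compress_for_transfer(values)
restored = decompress_after_transfer(payload)

ratio = len(raw) / len(payload)
print(f"raw={len(raw)} bytes, sent={len(payload)} bytes, ratio={ratio:.1f}x")
print("lossless round trip:", list(restored) == values)
```

The ratio printed here depends entirely on the toy data and says nothing about what the hardware codec achieves on real tensors.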

Independent developer Shootthesound created the project to solve a specific hardware problem. Nvidia removed its fast NVLink bridge from consumer graphics cards like the 4090 and 5090, making it hard to pool their memory for large models. By pairing this compression codec with a simple Thunderbolt cable, the project aims to deliver NVLink-class effective bandwidth, so users can link two or more consumer GPUs into a single training rig without buying expensive workstation cards.

Reclaiming pooled VRAM on a budget

What the tool delivers
  • 6x lossless compression on diffusion model data.
  • Near 3x lossless compression on LLM caches.
  • Sub-millisecond encoding per frame on a 5090.
  • Zero-copy operation directly from GPU memory.
  • Runs concurrently with main GPU compute tasks.
  • Links two consumer GPUs over a Thunderbolt cable.
  • Measured 5x faster transfers over slow residential broadband.

This project directly helps local AI hobbyists and privacy-conscious professionals who run models at home. It makes it feasible to split a massive 70-billion-parameter model across two consumer GPUs without the training process grinding to a halt. A user with a desktop and a laptop can pool their graphics memory to run models that would otherwise require a single graphics card costing over $7,000.

What the developer says about current limits

The codec’s core compression and speed results are all measured and reproducible, but some end-to-end tests are still pending. The wall-clock time for distributed training over a Thunderbolt cable is currently a projection, calculated from measured bandwidth and compression ratios, and has not yet been validated on physical hardware. The developer is also looking for community help to run benchmarks on dual-GPU setups and integrate the library into real long-context LLM workflows.
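To make that kind of projection concrete, here is a rough sketch of the arithmetic. It is not the developer's model. The 40 Gbit/s figure is Thunderbolt's nominal link rate (usable throughput is lower), while the 100 MB payload and the 1 ms encode cost are illustrative assumptions. Only the 3x and 6x ratios come from the figures reported above. Decode time, link latency, and overlap with compute are ignored.

```python
def effective_gbit_s(link_gbit_s, ratio):
    """Effective bandwidth when every byte sent stands for `ratio` raw bytes."""
    return link_gbit_s * ratio


def projected_transfer_ms(payload_mb, link_gbit_s, ratio, encode_ms=0.0):
    """Estimate time to ship one payload: encode it, then send the compressed bytes."""
    # 1 Gbit/s = 125 MB/s = 0.125 MB/ms (decimal units)
    link_mb_per_ms = link_gbit_s * 1000 / 8 / 1000
    return encode_ms + (payload_mb / ratio) / link_mb_per_ms


LINK_GBIT_S = 40.0   # assumption: nominal Thunderbolt rate, not a measured value
PAYLOAD_MB = 100.0   # assumption: hypothetical tensor payload
ENCODE_MS = 1.0      # assumption: placeholder encode cost for the whole payload

baseline_ms = projected_transfer_ms(PAYLOAD_MB, LINK_GBIT_S, ratio=1.0)
for label, ratio in [("LLM cache, ~3x", 3.0), ("diffusion data, 6x", 6.0)]:
    ms = projected_transfer_ms(PAYLOAD_MB, LINK_GBIT_S, ratio, ENCODE_MS)
    print(f"{label}: {ms:.2f} ms vs {baseline_ms:.2f} ms uncompressed, "
          f"effective {effective_gbit_s(LINK_GBIT_S, ratio):.0f} Gbit/s")
```

A calculation like this shows why the numbers look promising on paper, and also why the developer wants hardware validation: real links rarely sustain their nominal rate, and encode and decode costs grow with payload size.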

"This project has been months of independent research and engineering — designing the PCA + codec pipeline, validating it across 1,735 FLUX captures, writing the direct Video Codec SDK bindings from scratch" — Source: GitHub