Compression

Research papers, repositories, and articles about compression

chopratejas/headroom

Headroom compresses tool outputs, logs, and RAG chunks before they reach the model, often cutting tokens by 60–95%. It works as a library, a proxy, and an MCP server, so you can cut running costs without sacrificing answer quality.

27,419

ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

ROSE improves one-shot pruning so teams can shrink models aggressively with less quality loss. It targets cheaper deployment on GPUs and even consumer hardware.

Mingluo Su, Huan Wang

NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices

NanoFLUX distills large image generators into much smaller models that still follow prompts well on phones. It uses distillation losses to preserve visual quality while cutting memory and compute.

Ruchika Chavhan, Malcolm Chadwick

KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs

KV-CoRE systematically measures how well key-value caches can be compressed with low-rank methods across different layers and tasks. It helps engineers identify where cache compression will save memory without wrecking accuracy.

Jian Chen, Zhuoran Wang