Compression
Research papers, repositories, and articles about compression
chopratejas/headroom
Headroom compresses tool outputs, logs, and RAG chunks before they reach the model, often cutting tokens by 60–95%. It is available as a library, a proxy, and an MCP server, so you can reduce running costs without sacrificing answer quality.
ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning
ROSE, a reordered variant of SparseGPT, improves one-shot pruning so teams can shrink models aggressively with less quality loss. It directly targets cheaper deployment on GPUs and even on consumer hardware.
NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices
NanoFLUX distills large text-to-image generators into much smaller models that still follow prompts well on phones. It uses purpose-built distillation loss functions to preserve visual quality while cutting memory and compute.
KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs
KV-CoRE systematically measures how well key-value caches can be compressed with low-rank methods across different layers and tasks. It helps engineers identify where cache compression will save memory without a significant loss in accuracy.