Efficiency

Research papers, repositories, and articles about efficiency


ggml-org/llama.cpp

llama.cpp keeps pushing local LLM performance on CPUs and small GPUs. It’s still the reference for running big models on modest hardware. If you care about running models cheaply or on-device, track every major change here.

115,330

FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

FlashPrefill discovers sparse attention patterns during the prefill phase and drops low-importance connections on the fly. It reports huge speedups on 256K-token contexts while still matching baseline accuracy.

Qihang Fan, Huaibo Huang
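
To make the thresholding idea behind FlashPrefill concrete, here is a minimal Python sketch of threshold-based block selection. It is an illustration only, not FlashPrefill's actual kernel: the function name `keep_blocks`, the relative threshold `tau`, and the assumption that each key block already has a non-negative importance score are all hypothetical.

```python
from typing import List


def keep_blocks(block_scores: List[float], tau: float = 0.1) -> List[int]:
    """Return indices of key blocks whose score is at least tau * (max score).

    Assumes non-negative scores; `tau` is a hypothetical relative threshold.
    """
    if not block_scores:
        return []
    peak = max(block_scores)
    return [i for i, s in enumerate(block_scores) if s >= tau * peak]


# Hypothetical importance scores for four key blocks of one query block.
scores = [0.9, 0.05, 0.3, 0.01]
kept = keep_blocks(scores, tau=0.1)  # blocks 0 and 2 clear the threshold
```

A relative threshold lets the number of kept blocks vary from query to query, whereas a fixed top-k budget keeps the same count even when most blocks are unimportant.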

DFlash: Block Diffusion for Flash Speculative Decoding

DFlash pairs a small diffusion drafter with a large LLM so that text is drafted and verified in multi-token blocks rather than one token at a time. It’s currently one of the highest-upvoted speedup methods on Hugging Face.

Jian Chen, Yesheng Liang

DFlash: Block Diffusion for Flash Speculative Decoding

DFlash uses a small diffusion model to draft whole blocks of tokens in parallel, then lets a larger model quickly verify them. It keeps output quality while giving over 6x faster generation than standard decoding on common LLMs.

Jian Chen, Yesheng Liang
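
The draft-then-verify loop that DFlash builds on can be sketched in a few lines of Python. This is a toy version of greedy speculative decoding under stated assumptions, not DFlash's implementation: the drafted block is passed in as a plain list (DFlash would produce it with a diffusion model), the acceptance rule is exact match against the target's greedy token, and `speculative_step` and `toy_target` are hypothetical names. A real system verifies the whole block in one parallel forward pass; the loop below calls the target once per position only for clarity.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(
    prefix: List[Token], draft_block: List[Token], target_next: NextToken
) -> List[Token]:
    """Accept drafted tokens until the target disagrees, then take the target's token."""
    accepted: List[Token] = []
    for tok in draft_block:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
    # Whole block accepted: the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(seq: List[Token]) -> Token:
    """Hypothetical stand-in for the large model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


partial = speculative_step([0], [1, 2, 7, 4], toy_target)  # [1, 2, 3]
full = speculative_step([0], [1, 2, 3, 4], toy_target)     # [1, 2, 3, 4, 5]
```

Every step emits at least one token that the target itself would have chosen, so under this acceptance rule the output matches plain greedy decoding; the speedup comes from how many drafted tokens survive per step.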

Over-Searching in Search-Augmented Large Language Models

This work shows that search-augmented models often call tools even when search hurts answers or wastes tokens. It introduces a cost-aware metric and mitigation tricks, so teams can dial back needless retrieval instead of just adding more context.

Roy Xie, Deepak Gopinath

SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

SPOT lets a model "pause" and think over selected spans instead of every token. It aims to keep reasoning strong while cutting compute and revealing what the model is thinking about.

Yunlong Chu, Minglai Shao

Make Your LVLM KV Cache More Lightweight

Targets the memory blow-up that vision tokens cause in large vision–language models at inference time. Uses a prompt-aware method, LightKV, to merge redundant vision tokens before decoding. If you ship LVLMs, this is a concrete way to cut GPU memory and costs without killing quality. ([arxiv.org](https://arxiv.org/list/cs.CV/pastweek?show=100))

Anonymous (ICLR and TMLR drafts; arXiv metadata lists named authors)

TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning

TIME teaches dialogue models to emit short "thinking" blocks only when time gaps or context shifts actually demand deeper reasoning. Models keep answers compact while still reasoning hard when conversations get tricky or span days instead of seconds.

Susmit Das

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Video2LoRA watches a video once, then predicts LoRA weights that let a frozen vision-language model handle that video efficiently. You keep quality while cutting the visual token budget and time-to-first-answer. This is very relevant if video-heavy agents are choking your GPU bill.

Manan Suri, Sarvesh Baskar

RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference

RRAttention keeps only a fraction of attention blocks by rotating which positions each head looks at in a round-robin pattern. It recovers almost full-attention accuracy while skipping about half the computation at 128K-token context lengths.

Siran Liu, Guoxia Wang
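
The round-robin rotation described for RRAttention can be illustrated with a small Python sketch. It shows only the general idea of shifting a strided selection by head index; it is not the paper's selection rule, and `round_robin_blocks`, `stride`, and the block granularity are assumptions made for illustration.

```python
from typing import List


def round_robin_blocks(num_blocks: int, num_heads: int, stride: int) -> List[List[int]]:
    """For each head, keep every `stride`-th block, shifted by the head index."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return [
        [b for b in range(num_blocks) if b % stride == h % stride]
        for h in range(num_heads)
    ]


selection = round_robin_blocks(num_blocks=8, num_heads=4, stride=2)
# Head 0 keeps [0, 2, 4, 6]; head 1 keeps [1, 3, 5, 7]; heads 2 and 3 repeat the cycle.
covered = sorted({b for head in selection for b in head})  # all eight blocks
```

In this sketch each head touches only `1 / stride` of the blocks, yet the heads together still cover every block whenever there are at least `stride` heads.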

Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

Claims an exact, error-free formulation of linear attention derived from a continuous-time view of transformer dynamics. The authors argue they can match the behavior of standard softmax attention while enjoying linear-time complexity, avoiding the approximation errors that plague many fast-attention variants. If the theory and practice hold up, this could become a key building block for large-context models and resource-constrained deployments.

Jingdi Lei, Di Zhang

Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

Lets models estimate their confidence before writing full answers. That enables routing hard questions to stronger models and skipping easy ones to save money.

Changcheng Li, Jiancan Wu
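
A confidence score available before generation makes simple cost-aware routing possible. The Python sketch below is a hypothetical illustration, not the paper's method: the threshold, the relative costs, and the names `route` and `total_cost` are assumptions, and the confidence values are made up.

```python
from typing import List


def route(confidence: float, threshold: float = 0.7) -> str:
    """Send confident (easy) questions to the small model, the rest to the large one."""
    return "small" if confidence >= threshold else "large"


def total_cost(
    confidences: List[float],
    threshold: float = 0.7,
    small_cost: float = 1.0,
    large_cost: float = 10.0,
) -> float:
    """Sum hypothetical per-question costs under the routing rule above."""
    return sum(
        small_cost if route(c, threshold) == "small" else large_cost
        for c in confidences
    )


confidences = [0.95, 0.40, 0.80, 0.65]   # made-up pre-answer confidence scores
plan = [route(c) for c in confidences]   # ['small', 'large', 'small', 'large']
routed = total_cost(confidences)         # 22.0 with the assumed costs
all_large = 10.0 * len(confidences)      # 40.0 if everything went to the large model
```

The threshold trades cost against risk: raising it sends more questions to the large model, lowering it saves more but relies more heavily on the confidence estimate being well calibrated.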

VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

VideoAR builds a large visual autoregressive model that predicts videos frame by frame across multiple scales. It narrows the quality gap with diffusion models while needing far fewer steps, which makes long video generation cheaper to run.

Longbin Ji, Xiaoxiong Liu

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

This work replaces fixed block sizes in diffusion-style language models with blocks that expand or shrink based on how hard the text is. It also adds a cache that reuses past computation, cutting compute while keeping or improving generation quality.

Lizhuo Luo, Shenggui Li

AdaTooler-V: Adaptive Tool-Use for Images and Videos

AdaTooler-V teaches vision-language models when to call external tools, not just how. That cuts unnecessary tool calls, reducing costs while often boosting accuracy on vision tasks.

Chaoyang Wang, Kaituo Feng

Efficient Training on Multiple Consumer GPUs with RoundPipe

Introduces a new pipeline schedule that avoids tight weight sharing constraints across stages when customizing large models. Targets setups with several consumer GPUs and slow interconnects, squeezing more throughput from cheap hardware. If your lab or startup runs on gamer cards, this is immediately actionable. ([huggingface.co](https://huggingface.co/papers/2604.27085))

Yibin Luo, Shiwei Gao

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Pushes vision-language models toward smaller, cheaper designs built on LLM-based vision encoders. Targets strong image+text performance while keeping running costs low.

Boqiang Zhang, Lei Ke

Image Diffusion Preview with Consistency Solver

From DeepMind, this work uses consistency-based solvers to let users preview diffusion model outputs much more quickly than running a full sampling schedule. The idea is to generate rough-but-faithful previews that can guide prompt iteration and editing, then refine on demand. It’s another example of how inference-side tricks, not just bigger models, are improving the practical usability of image generation.

Fu-Yun Wang, Hao Zhou

ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

Improves one-shot pruning so teams can shrink models aggressively with less quality loss. Directly targets cheaper deployment on GPUs and even consumer hardware.

Mingluo Su, Huan Wang

RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning

Optimizes which branches a graph-of-thoughts system actually runs. Cuts redundant reasoning steps while trying to keep answer quality similar.

Yuhang Liu, Ruijie Wang

KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs

KV-CoRE systematically measures how well key-value caches in different layers and tasks can be compressed with low-rank methods. It helps engineers know where cache compression will save memory without wrecking accuracy.

Jian Chen, Zhuoran Wang
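
KV-CoRE measures compressibility empirically, but the storage arithmetic behind low-rank compression is easy to sketch in Python. The example below is only that arithmetic, not the benchmark: it assumes a cache matrix of shape `num_tokens x head_dim` is replaced by two rank-`rank` factors, and the function name and sizes are hypothetical.

```python
def low_rank_storage_ratio(num_tokens: int, head_dim: int, rank: int) -> float:
    """Ratio of factored storage to full storage for one cache matrix.

    Full matrix: num_tokens * head_dim values.
    Rank-r factors: (num_tokens x r) plus (r x head_dim) values.
    """
    if rank < 1 or rank > min(num_tokens, head_dim):
        raise ValueError("rank must be between 1 and min(num_tokens, head_dim)")
    full = num_tokens * head_dim
    factored = rank * (num_tokens + head_dim)
    return factored / full


ratio = low_rank_storage_ratio(num_tokens=4096, head_dim=128, rank=32)  # 0.2578125
no_gain = low_rank_storage_ratio(num_tokens=4096, head_dim=128, rank=128)  # above 1.0
```

Factoring saves memory only when the rank is well below the head dimension, which is why it matters to know per layer and per task how low the rank can go before accuracy suffers.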

SFTok: Bridging the Performance Gap in Discrete Tokenizers

SFTok narrows the quality gap between discrete and continuous image tokenizers using multi-step reconstruction tricks. That matters if you want autoregressive image or video generators that scale.

Qihang Rao, Borui Zhang

Neuro-Symbolic Activation Discovery: Transferring Mathematical Structures from Physics to Ecology for Parameter-Efficient Neural Networks

Using genetic programming, the author mines custom activation functions from physics data and reuses them in ecology models. These bespoke activations match accuracy with far fewer parameters.

Anas Hajbi