Efficiency
Research papers, repositories, and articles about efficiency
ggml-org/llama.cpp
llama.cpp keeps pushing local LLM performance on CPUs and small GPUs. It’s still the reference for running big models on modest hardware. If you care about running models cheaply or on-device, track the major changes here.
FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
FlashPrefill discovers sparse attention patterns during the prefill phase and drops low-importance connections on the fly. It reports huge speedups on 256K-token contexts while still matching baseline accuracy.
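To make "thresholding" concrete, here is a minimal sketch of a relative-threshold rule for picking which key blocks a query block attends to. It is a toy illustration of the general idea, not FlashPrefill's actual algorithm: the function name `select_blocks`, the `alpha` parameter, and the assumption that each block already has a non-negative importance score are all hypothetical.

```python
from typing import List


def select_blocks(block_scores: List[float], alpha: float = 0.1) -> List[int]:
    """Keep key blocks whose score is at least alpha times the best score.

    Assumes scores are non-negative importance estimates (hypothetical input).
    Unlike a fixed top-k, the number of kept blocks adapts to the score
    distribution: a peaked row keeps few blocks, a flat row keeps many.
    """
    if not block_scores:
        return []
    cutoff = alpha * max(block_scores)
    return [i for i, score in enumerate(block_scores) if score >= cutoff]


# A peaked row keeps two of four blocks; a flat row keeps all of them.
peaked = select_blocks([0.9, 0.02, 0.4, 0.05], alpha=0.1)
flat = select_blocks([0.3, 0.25, 0.28, 0.27], alpha=0.1)
```

The design point is that a threshold needs no sort, so it can in principle be applied in a single pass over the scores.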
DFlash: Block Diffusion for Flash Speculative Decoding
DFlash pairs a small diffusion model with a large LLM: the small model drafts text in multi-token blocks and the large model verifies them. It’s currently one of the highest-upvoted speedup methods on Hugging Face.
DFlash: Block Diffusion for Flash Speculative Decoding
DFlash uses a small diffusion model to draft whole blocks of tokens in parallel, then lets a larger model quickly verify them. It keeps output quality while giving over 6x faster generation than standard decoding on common LLMs.
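The draft-then-verify loop can be sketched in a few lines. This is a toy, greedy version of generic speculative decoding, not DFlash itself: `draft_block` and `target_next` are hypothetical stand-ins for the diffusion drafter and the large model, and a real system would verify the whole block in one batched forward pass rather than token by token.

```python
from typing import Callable, List, Tuple

Token = int


def speculative_step(
    prefix: List[Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    target_next: Callable[[List[Token]], Token],
    block_size: int = 4,
) -> List[Token]:
    """Run one draft-and-verify round and return the accepted tokens."""
    proposal = draft_block(prefix, block_size)
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every drafted token matched, so the target adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix, draft_block, target_next, max_new_tokens, block_size=4
             ) -> Tuple[List[Token], int]:
    """Generate tokens; return the sequence and the number of verify rounds."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft_block, target_next, block_size))
        rounds += 1
    return out[: len(prefix) + max_new_tokens], rounds


def toy_target(seq: List[Token]) -> Token:
    """Toy 'large model': counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token], k: int) -> List[Token]:
    """Toy drafter: same rule as the target, but wrong whenever it should say 7."""
    block, last = [], seq[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        if nxt == 7:
            nxt = 0  # deliberate drafting error
        block.append(nxt)
        last = nxt
    return block


tokens, rounds = generate([0], toy_draft, toy_target, max_new_tokens=12)
```

Because a mismatched draft token is always replaced by the target's choice, the output is identical to decoding with the target alone; the saving comes from needing fewer verify rounds than tokens.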
Over-Searching in Search-Augmented Large Language Models
This work shows that search-augmented models often call tools even when search hurts answers or wastes tokens. It introduces a cost-aware metric and mitigation tricks, so teams can dial back needless retrieval instead of just adding more context.
SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models
SPOT lets a model "pause" and think over selected spans instead of every token. It aims to keep reasoning strong while cutting compute and revealing what the model is thinking about.
Make Your LVLM KV Cache More Lightweight
Targets the memory blow-up from vision tokens in large vision–language models at inference time. Uses a prompt-aware method, LightKV, to merge redundant vision tokens before decoding. If you ship LVLMs, this is a concrete way to cut GPU memory and costs without killing quality. ([arxiv.org](https://arxiv.org/list/cs.CV/pastweek?show=100))
TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning
TIME teaches dialogue models to emit short "thinking" blocks only when time gaps or context shifts actually demand deeper reasoning. Models keep answers compact while still reasoning hard when conversations get tricky or span days instead of seconds.
Video2LoRA: Parametric Video Internalization for Vision-Language Models
Video2LoRA watches a video once, then predicts LoRA weights that let a frozen vision-language model handle that video efficiently. You keep quality while cutting the visual token budget and time-to-first-answer. This is very relevant if video-heavy agents are choking your GPU bill.
RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference
RRAttention keeps only a fraction of attention blocks by rotating which positions each head looks at in a round-robin pattern. It recovers almost full-attention accuracy while skipping about half the computation at 128K-token context lengths.
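One way to picture a per-head round-robin pattern is as a block mask in which each head keeps every `stride`-th key block, offset by the head index. The sketch below is a guess at the general shape based only on the description above, not the paper's algorithm: `round_robin_block_mask`, `stride`, and the purely static selection rule are assumptions made for illustration.

```python
from typing import List

BlockMask = List[List[List[bool]]]  # indexed as [head][query_block][key_block]


def round_robin_block_mask(num_heads: int, num_blocks: int, stride: int = 2) -> BlockMask:
    """Build a causal block mask where head h keeps key blocks k with
    (k + h) % stride == 0, so neighbouring heads cover different blocks."""
    mask: BlockMask = []
    for h in range(num_heads):
        shift = h % stride
        head = [
            [(k <= q) and ((k + shift) % stride == 0) for k in range(num_blocks)]
            for q in range(num_blocks)
        ]
        mask.append(head)
    return mask


def kept_fraction(mask: BlockMask) -> float:
    """Fraction of causal (query, key) block pairs that are still computed."""
    num_blocks = len(mask[0])
    kept = sum(cell for head in mask for row in head for cell in row)
    causal_pairs = len(mask) * num_blocks * (num_blocks + 1) // 2
    return kept / causal_pairs


mask = round_robin_block_mask(num_heads=4, num_blocks=8, stride=2)
fraction = kept_fraction(mask)
```

With `stride=2`, each head computes roughly half of the causal blocks, while every block is still covered by at least one head.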
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Claims an exact, error-free formulation of linear attention derived from a continuous-time view of transformer dynamics. The authors argue they can match the behavior of standard softmax attention while enjoying linear-time complexity, avoiding the approximation errors that plague many fast-attention variants. If the theory and practice hold up, this could become a key building block for large-context models and resource-constrained deployments.
Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation
Lets models estimate their confidence before writing full answers. That enables routing hard questions to stronger models and skipping easy ones to save money.
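The routing idea reduces to a threshold check before any answer is generated. The sketch below is a minimal illustration under stated assumptions: `estimate_confidence`, `small_model`, `large_model`, and the `0.8` threshold are hypothetical placeholders, not part of the paper.

```python
from typing import Callable, Tuple


def route(
    question: str,
    estimate_confidence: Callable[[str], float],
    small_model: Callable[[str], str],
    large_model: Callable[[str], str],
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Answer with the cheap model when its pre-answer confidence is high
    enough, otherwise escalate. Returns (answer, name of the model used)."""
    confidence = estimate_confidence(question)
    if confidence >= threshold:
        return small_model(question), "small"
    return large_model(question), "large"


# Toy stand-ins: short questions count as "easy" for the small model.
def toy_confidence(question: str) -> float:
    return 0.9 if len(question.split()) <= 5 else 0.3


easy = route("What is 2 + 2?", toy_confidence, lambda q: "4", lambda q: "4 (large)")
hard = route(
    "Explain why this distributed lock can deadlock under partition",
    toy_confidence,
    lambda q: "unsure",
    lambda q: "detailed answer",
)
```

The saving depends on the confidence estimate being much cheaper than a full answer, which is the point of estimating it before generation.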
VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction
VideoAR builds a large visual autoregressive model that predicts videos frame by frame across multiple scales. It narrows the quality gap with diffusion models while needing far fewer steps, which makes long video generation cheaper to run.
DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs
This work replaces fixed block sizes in diffusion-style language models with blocks that expand or shrink based on how hard the text is. It also adds a cache that reuses past computation, cutting compute while keeping or improving generation quality.
AdaTooler-V: Adaptive Tool-Use for Images and Videos
AdaTooler-V teaches vision-language models when to call external tools, not just how. That cuts unnecessary tool calls, reducing costs while often boosting accuracy on vision tasks.
Efficient Training on Multiple Consumer GPUs with RoundPipe
Introduces a new pipeline schedule that avoids tight weight sharing constraints across stages when customizing large models. Targets setups with several consumer GPUs and slow interconnects, squeezing more throughput from cheap hardware. If your lab or startup runs on gamer cards, this is immediately actionable. ([huggingface.co](https://huggingface.co/papers/2604.27085))
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Pushes vision-language models toward smaller, cheaper designs built from language-style encoders. Targets strong image+text performance while keeping running costs low.
Image Diffusion Preview with Consistency Solver
From DeepMind, this work uses consistency-based solvers to let users preview diffusion model outputs much more quickly than running a full sampling schedule. The idea is to generate rough-but-faithful previews that can guide prompt iteration and editing, then refine on demand. It’s another example of how inference-side tricks, not just bigger models, are improving the practical usability of image generation.
ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning
Improves one-shot pruning so teams can shrink models aggressively with less quality loss. Directly targets cheaper deployment on GPUs and even consumer hardware.
RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning
Optimizes which branches a graph-of-thoughts system actually runs. Cuts redundant reasoning steps while trying to keep answer quality similar.
KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs
KV-CoRE systematically measures how well key-value caches in different layers and tasks can be compressed with low-rank methods. It helps engineers know where cache compression will save memory without wrecking accuracy.
SFTok: Bridging the Performance Gap in Discrete Tokenizers
SFTok narrows the quality gap between discrete and continuous image tokenizers using multi-step reconstruction tricks. That matters if you want autoregressive image or video generators that scale.
Neuro-Symbolic Activation Discovery: Transferring Mathematical Structures from Physics to Ecology for Parameter-Efficient Neural Networks
Using genetic programming, the author mines custom activation functions from physics data and reuses them in ecology models. These bespoke activations match accuracy with far fewer parameters.