HuggingFace AI Papers
Trending AI papers and research featured on HuggingFace.
JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
JetSpec adds a new draft head so you can propose large token trees in one forward pass while staying consistent with the base model. On Qwen3 models it reaches up to ~9.6x speedups on math without tanking quality, and integrates with vLLM. If you serve heavy workloads, this is a must-read for cutting inference cost. ([huggingface.co](https://huggingface.co/papers/2606.18394))
Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It
Shows that in tool-use RL, models often "forget" how to call tools because specific control tokens spike in probability, breaking the output format while the underlying skill stays intact. Interleaving supervised updates with RL and adding richer hints stabilizes training across formats and tasks. If your agent RL runs keep collapsing, this paper is a playbook. ([huggingface.co](https://huggingface.co/papers/2606.26027))
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
Argues you can reuse the policy and reference from RL post-training to define a "progress advantage" signal instead of training a separate process reward model. This gives dense step-wise scores for agents while avoiding another fragile model in the loop. If you're drowning in reward-model complexity, this suggests a cheaper alignment path. ([huggingface.co](https://huggingface.co/papers/2606.26080))
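The summary above does not give the paper's formula, so the following is only a rough sketch of what a step-wise signal built from a policy and a reference could look like. It swaps in the log-likelihood ratio familiar from DPO-style implicit rewards as a stand-in. The function name `step_scores`, the `beta` scale, and the toy log-probabilities are all assumptions for illustration.

```python
def step_scores(policy_logps, ref_logps, beta=1.0):
    """Score each agent step by how much more likely the policy finds it
    than the reference does (a stand-in signal, not the paper's definition).

    policy_logps[i] and ref_logps[i] hold per-token log-probabilities of
    step i under the post-trained policy and the reference model.
    """
    return [
        beta * (sum(p) - sum(r))
        for p, r in zip(policy_logps, ref_logps)
    ]


# Toy trajectory with two steps of two tokens each (made-up numbers).
policy = [[-0.2, -0.1], [-1.5, -0.9]]
reference = [[-0.5, -0.4], [-1.0, -0.6]]
scores = step_scores(policy, reference)
# Step 0 scores positive (policy prefers it), step 1 scores negative.
```

The point of the sketch is that both inputs already exist after RL post-training, so no extra model has to be trained to produce a dense per-step score.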
GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents
Builds a carefully matched benchmark where GUI agents and command-line agents solve identical desktop tasks under the same checks. Finds GUI agents fail on long, brittle interactions, while CLI agents are limited by missing skills, not raw intelligence. If you design computer-use stacks, this tells you where to invest next. ([huggingface.co](https://huggingface.co/papers/2606.24551))
Information-Aware KV Cache Compression for Long Reasoning
InfoKV mixes attention scores with an information-theory signal that tracks how much a token affects future predictions. This lets the model drop uninformative tokens while keeping rare but important ones, improving long-context reasoning under tight memory. If you fight KV blowup, this suggests a smarter eviction policy. ([huggingface.co](https://huggingface.co/papers/2606.26875))
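As a minimal sketch of the eviction idea described above, assume each cached token already has an attention score and some informativeness score. How InfoKV actually computes or combines them is not stated here, so the min-max normalization, the `alpha` mixing weight, and the function name `select_kv_to_keep` are assumptions.

```python
def select_kv_to_keep(attn, info, budget, alpha=0.5):
    """Return the sorted positions of the `budget` tokens to keep.

    attn[i]: attention mass received by cached token i.
    info[i]: a hypothetical informativeness score for token i.
    alpha:   weight on attention; (1 - alpha) goes to the info signal.
    """
    def normalize(xs):
        lo, hi = min(xs), max(xs)
        span = hi - lo
        return [(x - lo) / span if span else 0.0 for x in xs]

    a, b = normalize(attn), normalize(info)
    scores = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(top[:budget])


attn = [0.50, 0.05, 0.30, 0.02, 0.13]
info = [0.10, 0.90, 0.20, 0.10, 0.30]

mixed = select_kv_to_keep(attn, info, budget=3)                 # [0, 1, 2]
attn_only = select_kv_to_keep(attn, info, budget=3, alpha=1.0)  # [0, 2, 4]
```

Token 1 receives little attention but has a high info score, so the mixed policy keeps it while attention-only eviction drops it.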
When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models
Analyzes 67 models and shows that any system choosing a single model’s answer is capped by how often all models fail together: its accuracy cannot exceed the share of questions that at least one model gets right. Provides practical bounds on how much routing or voting can help. If you're building ensemble/agent stacks, this sets a hard ceiling you should calculate. ([huggingface.co](https://huggingface.co/papers/2606.27288))
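To calculate that ceiling on your own eval results, a small sketch like the one below is enough. It assumes you have a per-model, per-question correctness table; the function names and the toy data are illustrative and not taken from the paper.

```python
def cofailure_ceiling(correct):
    """Best accuracy any pick-one-answer system can reach.

    correct[m][q] is True if model m answered question q correctly.
    A question where every model fails cannot be rescued by routing.
    """
    n_questions = len(correct[0])
    solved_by_any = sum(
        any(model[q] for model in correct) for q in range(n_questions)
    )
    return solved_by_any / n_questions


def best_single_accuracy(correct):
    """Accuracy of the strongest individual model."""
    return max(sum(model) / len(model) for model in correct)


# Three models, five questions (toy data).
results = [
    [True, True, False, False, True],
    [True, False, True, False, False],
    [False, True, True, False, True],
]
ceiling = cofailure_ceiling(results)   # 0.8: every model fails question 3
best = best_single_accuracy(results)   # 0.6
headroom = ceiling - best              # the most a perfect router could add
```

If the headroom is small, better routing or voting will not buy much, and the remaining gains have to come from stronger or more diverse models.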
Hallucination in World Models is Predictable and Preventable
Builds a big benchmark with ground-truth simulators to show where visual world models drift from reality. Identifies three failure modes and three simple signals that reliably flag them. If you deploy action-driven world models, you can use these signals as runtime tripwires. ([huggingface.co](https://huggingface.co/papers/2606.27326))
Discretizing Reward Models
Shows that continuous reward models often assign very different scores to equally good answers, which encourages reward hacking and bad policies. Clustering rewards into a few discrete levels using Monte Carlo dropout reduces this oversensitivity and leads to better RL outcomes. If you're training policies on reward models, this is a strong argument to discretize. ([huggingface.co](https://huggingface.co/papers/2606.21795))
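One way to picture the discretization is sketched below. It is not the paper's algorithm: it assumes you score each response several times with dropout left on, then merge neighbouring responses into the same level whenever their mean scores differ by less than the spread of those samples. The threshold factor `k` and the function name `discretize` are assumptions.

```python
from statistics import mean, stdev


def discretize(samples_per_response, k=1.0):
    """Map noisy continuous rewards to integer levels (0 = lowest).

    samples_per_response[i] holds at least two reward scores for
    response i from repeated stochastic passes (e.g. MC dropout).
    """
    stats = [(mean(s), stdev(s)) for s in samples_per_response]
    order = sorted(range(len(stats)), key=lambda i: stats[i][0])
    levels = [0] * len(stats)
    level = 0
    for prev, cur in zip(order, order[1:]):
        gap = stats[cur][0] - stats[prev][0]
        noise = k * (stats[prev][1] + stats[cur][1])
        if gap > noise:  # only a gap larger than the noise opens a new level
            level += 1
        levels[cur] = level
    return levels


samples = [
    [0.80, 0.84, 0.82],  # good answer
    [0.81, 0.85, 0.83],  # equally good answer, slightly different score
    [0.20, 0.24, 0.22],  # clearly worse answer
]
levels = discretize(samples)  # [1, 1, 0]
```

The two good answers land on the same level, so a policy trained on the levels has no incentive to chase the 0.01 gap between them.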
Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments
Introduces GauntletBench, a web-based testbed with video editors, workflow tools, 3D apps, and more, focused on tough perception and reasoning tasks. Even the best agents hit only ~19% success while non-expert humans clear 80%+. If you think your agent is "human level," try it here. ([huggingface.co](https://huggingface.co/papers/2606.14397))
CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies
Simulates a small coffee supply chain where agents run farms, roasters, and retailers over 90 days. Different models show very different communication styles and profit profiles. If you care about economic alignment and multi-agent markets, CoffeeBench is a ready-made sandbox. ([huggingface.co](https://huggingface.co/papers/2606.16613))
Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Diagnoses a subtle numeric bias in current 4‑bit training formats and proposes a uniform alternative that stays stable on models up to 124B parameters. Hardware and training teams should read closely.
Context-Aware RL for Agentic and Multimodal LLMs
Teaches models to pick the right context out of nearly identical options, improving long-horizon tasks and visual question answering. Use this if your agents cherry-pick the wrong evidence.
Playful Agentic Robot Learning
Robots practice through playful exploration, then reuse those skills for real tasks. If you script every task by hand, this points to a cheaper, more scalable path.
HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining
Finds that filtered first-person human videos can beat costly robot demonstrations for pretraining. If you’re collecting robot data manually, you should test this cheaper pipeline.
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World
Lets coding agents run real robots in a closed loop and continuously improve policies with minimal human babysitting. Robotics groups should treat this as a design template for autonomous labs.
S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence
Gives agents a toolbox for understanding changing 3D scenes across views and time. Use this if your vision agents lose track of objects once the camera moves.
DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis
Curates thousands of “clean” scenes for testing 3D view generation without messy backgrounds. If your models cheat by using clutter, this dataset will expose them.
LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents
Tracks customer data and policy state in a separate ledger so agents stop making forbidden tool calls. If you run support bots, this is directly actionable.
Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
Shows how to test agents so scores actually predict field performance, not just benchmark bragging rights. If you own an eval suite, you should copy this framework.
DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects
Dexterous robot hands learn to move articulated objects by reasoning about contact, not just motion paths. Try this if you’re hitting brittleness in contact-heavy manipulation tasks.
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
HuggingGPT treats a large language model as a conductor that calls out to many specialist models on HuggingFace. It shows how a text model plus a rich model hub can handle complex tasks spanning vision, speech, and language.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
The InternVL 2.5 work pushes an open multimodal model to match or beat top proprietary systems on tough benchmarks. It digs into how model size, data curation, and smart test-time tricks together move the performance frontier.
Back to Bytes: Revisiting Tokenization Through UTF-8
The authors propose UTF8Tokenizer, which maps bytes directly to token IDs and encodes control signals using old-school control bytes. This keeps embedding tables tiny, speeds up tokenization, and can be bolted onto existing models to improve convergence without changing how you run them.
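The core idea of byte-level tokenization fits in a few lines. This is a minimal sketch and not the UTF8Tokenizer API: the function names are made up, and using the ASCII control bytes STX (0x02) and ETX (0x03) as start and end markers is an assumption chosen for illustration.

```python
STX, ETX = 0x02, 0x03  # ASCII "start of text" / "end of text" (assumed markers)


def encode(text):
    """Each UTF-8 byte is its own token ID, so the vocabulary has 256 entries."""
    return list(text.encode("utf-8"))


def decode(ids):
    """Inverse of encode: reassemble the bytes and decode them as UTF-8."""
    return bytes(ids).decode("utf-8")


def wrap(ids):
    """Mark sequence boundaries with control bytes instead of extra special tokens."""
    return [STX] + ids + [ETX]


ids = encode("héllo")   # [104, 195, 169, 108, 108, 111]
framed = wrap(ids)      # [2, 104, 195, 169, 108, 108, 111, 3]
text = decode(ids)      # "héllo"
```

Note that "é" becomes two IDs (195, 169), which is the usual trade-off of byte-level schemes: a tiny embedding table in exchange for longer sequences.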
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
ThaiSafetyBench compiles nearly two thousand Thai prompts, many grounded in local culture, to probe model safety. The authors also release a classifier that matches GPT-4.1’s judgments, giving the community a reusable Thai safety watchdog.
tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation
tasksource standardizes how hundreds of NLP datasets map inputs and labels into a common schema. That makes it much easier to train and test multi-task models without hand-writing fragile preprocessing code for each dataset.
HuggingFace's Transformers: State-of-the-art Natural Language Processing
This 2019 paper launched the Transformers library, giving a clean API around many transformer models and pretrained checkpoints. It turned cutting-edge NLP into a reusable software layer that underpins most open-source LLM work today.
The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines
This paper systematically measures how settings like batch size and max tokens affect throughput for common LLM engines. It shows that smart hyperparameter tuning can beat naive defaults by double-digit percentages, even when hardware stays the same.
World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis
This paper proposes a single model that predicts future world state, plans in language, and outputs robot actions. It uses an autoregressive backbone tied to a "world expert" module for physical dynamics. Think of it as a step toward robots that learn from video and instructions without separate planning stacks.
VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding
VideoKR gives you 315k tough reasoning questions over 145k expert videos. It’s built to push models beyond captioning toward real multi-step explanations. Use it to pressure-test any video model that claims "understanding" rather than just pattern matching.
LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing
LoomVideo packs video generation and editing into a single 5B model that talks to a multimodal language backbone. A clever "scale-and-add" trick lets it edit videos without doubling sequence length, so you get big speedups at similar quality. If you’re exploring small but strong video models, this is a new anchor point.
EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management
EvoDS is a data science agent that learns new tools and manages its own memory over time. It treats both "what skills to learn" and "what to remember" as separate learning problems. If you’re turning analytics workflows into long-lived agents, this is a concrete blueprint.
Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination
The authors break code problems into atomic pieces, then recombine them to generate harder tasks for reinforcement learning with verifiable rewards. This produces richer training data than simple template expansion and boosts code performance across domains. It’s a strong signal that smarter task generation matters as much as bigger models.
Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution
Code2LoRA turns an entire repository into a lightweight adapter instead of more prompt tokens. It supports static snapshots and an "evolving" mode that tracks commits with a GRU. If you run code models at scale, this is a practical way to cut context while staying up to date.
Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring
Hide-and-Seek detects when vision-language-action robots are about to fail, using only trajectory-level labels. It learns which individual actions signal trouble without step-by-step annotation. If you run embodied agents, this is a practical way to catch bad executions before they break hardware.
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
minWM turns heavy video diffusion models into fast, interactive "video world" simulators. It provides a full pipeline from data to few-step generators that run close to real time. If you care about agents in simulated worlds, this is an end-to-end recipe you can actually clone and run.
AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
AgentDoG 1.5 is a small family of guardrail models trained on a detailed agent-risk taxonomy with surprisingly few samples. They can sit in front of powerful agents, flag dangerous actions, and run cheaply. If you build tool-using agents, this is emerging as a standard safety baseline to copy or test against.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen-VLA trains a single system to handle many robot tasks, environments, and bodies, instead of maintaining separate models. It shows strong generalization in manipulation, navigation, and trajectory prediction. If you’re running fleets of robots, this points to one shared brain rather than dozens of bespoke ones.
LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
LongDS-Bench stress-tests data-analysis agents over thousands of turns built from real Kaggle notebooks. Even top models collapse as sessions grow, with huge drops in late-turn accuracy. If you ship analytics agents, you should be benchmarking on LongDS-Bench or something like it, not just short chat tasks.
CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation
CollectionLoRA distills dozens of separate image-style LoRAs into a single adapter without the usual "style bleeding". You get 50 visual effects with one small file. If you host many custom styles, this is a direct way to slash storage and server overhead.
Map2World: Segment Map Conditioned Text to 3D World Generation
Generates full 3D worlds from user-drawn segment maps, then adds fine detail with a separate enhancement network. Uses priors from existing asset generators to generalize across domains with limited training data. If you care about simulation, robotics, or game tools, this is a blueprint for controllable world generation. ([huggingface.co](https://huggingface.co/papers/2605.00781))
Let ViT Speak: Generative Language-Image Pre-training
Trains a Vision Transformer to predict language tokens directly from image tokens using a standard language-model objective. Removes contrastive tricks and extra decoders while staying competitive on many multimodal benchmarks. If you maintain vision backbones for language models, this is a simpler pretraining recipe to test. ([huggingface.co](https://huggingface.co/papers/2605.00809))
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
Trains a tokenizer and autoregressive image model together, letting generation feedback directly improve the tokenization scheme. Hits state-of-the-art ImageNet 256×256 scores without guidance. If you build discrete image generators, this supports fusing tokenizer and generator into one training pipeline. ([huggingface.co](https://huggingface.co/papers/2605.00503))
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Lays out a five-level roadmap for visual generation, from basic image mapping up to interactive world modeling for agents. Argues the next race is about structure, memory, and causality, not prettier pictures. If you work on vision models, benchmark against these levels, not just FID-style metrics. ([huggingface.co](https://huggingface.co/papers/2604.28185))
Heterogeneous Scientific Foundation Model Collaboration
Introduces Eywa, a framework that lets language models coordinate with domain‑specific scientific models across non-text data. Treats those models as tools inside an agent system and studies planning strategies across them. If you’re building AI for science, this shows how to wire specialized models into one reasoning loop. ([huggingface.co](https://huggingface.co/papers/2604.27351))
Co-Evolving Policy Distillation
Unifies two popular post‑training styles and shows why naively merging many expert policies can lose capabilities. Proposes a bidirectional distillation loop where student and experts improve together. If you juggle multiple specialist models, this offers a more stable way to fold them into one. ([huggingface.co](https://huggingface.co/papers/2604.27083))
Efficient Training on Multiple Consumer GPUs with RoundPipe
Introduces a new pipeline schedule that avoids tight weight sharing constraints across stages when customizing large models. Targets setups with several consumer GPUs and slow interconnects, squeezing more throughput from cheap hardware. If your lab or startup runs on gamer cards, this is immediately actionable. ([huggingface.co](https://huggingface.co/papers/2604.27085))
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Builds a benchmark where tasks and environments keep changing, and evaluation checks whether an agent actually executed real workflows. Uses logs and structured assessments, not just final answers. If you are deploying agents into production operations, this is much closer to what you actually care about. ([huggingface.co](https://huggingface.co/papers/2604.28139))
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
Treats methods, not papers, as first-class nodes in a huge evolution graph of AI research. Lets you query how techniques emerged, combined, and replaced each other, then use that to rate or generate new ideas. If you invest in research strategy, this is basically a map of the territory. ([huggingface.co](https://huggingface.co/papers/2604.28158))
InCoder-32B-Thinking: Industrial Code World Model for Thinking
Trains a 32B-parameter code model on synthetic “thinking traces” and hardware execution logs. Targets chip design, GPU tuning, and embedded code with explicit reasoning steps.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
Fuses a contrastive vision encoder and a self-supervised encoder, then feeds the combined tokens into a language model. Yields stronger results on visual understanding and grounding benchmarks.