HuggingFace AI Papers

Trending AI papers and research featured on HuggingFace.

JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

JetSpec adds a new draft head so you can propose large token trees in one forward pass while staying consistent with the base model. On Qwen3 models it reaches up to ~9.6x speedups on math without tanking quality, and integrates with vLLM. If you serve heavy workloads, this is a must-read for cutting inference cost. ([huggingface.co](https://huggingface.co/papers/2606.18394))

Lanxiang Hu, Zhaoxiang Feng

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

Shows that in tool-use RL, models often "forget" how to call tools because specific control tokens spike in probability, breaking the output format while the underlying skill stays intact. Interleaving supervised updates with RL and adding richer hints stabilizes training across formats and tasks. If your agent RL runs keep collapsing, this paper is a playbook. ([huggingface.co](https://huggingface.co/papers/2606.26027))

Yupu Hao, Zhuoran Jin

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

Argues you can reuse the policy and reference from RL post-training to define a "progress advantage" signal instead of training a separate process reward model. This gives dense step-wise scores for agents while avoiding another fragile model in the loop. If you're drowning in reward-model complexity, this suggests a cheaper alignment path. ([huggingface.co](https://huggingface.co/papers/2606.26080))

Changdae Oh, Wendi Li

GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

Builds a carefully matched benchmark where GUI agents and command-line agents solve identical desktop tasks under the same checks. Finds GUI agents fail on long, brittle interactions, while CLI agents are limited by missing skills, not raw intelligence. If you design computer-use stacks, this tells you where to invest next. ([huggingface.co](https://huggingface.co/papers/2606.24551))

Xiao Zhou, Siyue Zhang

Information-Aware KV Cache Compression for Long Reasoning

InfoKV mixes attention scores with an information-theory signal that tracks how much a token affects future predictions. This lets the model drop uninformative tokens while keeping rare but important ones, improving long-context reasoning under tight memory. If you fight KV blowup, this suggests a smarter eviction policy. ([huggingface.co](https://huggingface.co/papers/2606.26875))

Jushi Kai, Zhuiri Xiao
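
To make the eviction idea concrete, here is a minimal sketch of mixing two per-token signals before choosing what to keep. It is not InfoKV's actual scoring rule: the function name `select_tokens_to_keep`, the min-max normalization, the linear `mix` weight, and the example numbers are all assumptions made for illustration.

```python
def select_tokens_to_keep(attention, information, budget, mix=0.5):
    """Return the positions of the `budget` tokens with the highest blended score.

    attention:   per-token attention mass (hypothetical input)
    information: per-token information signal (hypothetical input)
    mix:         weight on the information signal; 0.0 means attention only
    """
    def normalize(values):
        lo, hi = min(values), max(values)
        # Map to [0, 1] so the two signals are comparable before blending.
        return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

    attn_n = normalize(attention)
    info_n = normalize(information)
    scores = [(1 - mix) * a + mix * i for a, i in zip(attn_n, info_n)]

    # Rank positions by blended score, keep the top `budget`, restore order.
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:budget])


attention = [0.50, 0.30, 0.02, 0.18]
information = [0.1, 0.2, 0.9, 0.1]  # token 2 is rarely attended but informative

print(select_tokens_to_keep(attention, information, budget=2, mix=0.0))  # [0, 1]
print(select_tokens_to_keep(attention, information, budget=2, mix=0.5))  # [0, 2]
```

With attention alone, token 2 is evicted; once the information signal is blended in, it survives. That is the "rare but important" behavior the blurb describes, under these toy assumptions.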

When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models

Analyzes 67 models and shows that any system choosing a single model’s answer is capped by how often all models fail together. Provides practical bounds on how much routing or voting can help. If you're building ensemble/agent stacks, this sets a hard ceiling you should calculate. ([huggingface.co](https://huggingface.co/papers/2606.27288))

Josef Chen
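
If you want to calculate that ceiling for your own stack, a minimal sketch follows. It assumes you already have a per-question correctness table for each model; the function names and the toy data are illustrative, not taken from the paper.

```python
def co_failure_ceiling(correct):
    """Upper bound on accuracy for any system that picks one model's answer.

    correct[m][q] is True when model m answers question q correctly.
    """
    num_questions = len(correct[0])
    # A selector can only succeed on questions that at least one model solves.
    solvable = sum(any(row[q] for row in correct) for q in range(num_questions))
    return solvable / num_questions


def best_single_accuracy(correct):
    """Accuracy of the strongest individual model, for comparison."""
    return max(sum(row) / len(row) for row in correct)


# Three models, four questions; every model fails the last question.
correct = [
    [True, True, False, False],
    [True, False, True, False],
    [False, True, True, False],
]

print(co_failure_ceiling(correct))    # 0.75
print(best_single_accuracy(correct))  # 0.5
```

The gap between the two numbers (0.25 here) is the most that routing or answer selection could add on this toy data, because no selector can recover the question that every model gets wrong.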

Hallucination in World Models is Predictable and Preventable

Builds a big benchmark with ground-truth simulators to show where visual world models drift from reality. Identifies three failure modes and three simple signals that reliably flag them. If you deploy action-driven world models, you can use these signals as runtime tripwires. ([huggingface.co](https://huggingface.co/papers/2606.27326))

Nicklas Hansen, Xiaolong Wang

Discretizing Reward Models

Shows that continuous reward models often assign very different scores to equally good answers, which encourages reward hacking and bad policies. Clustering rewards into a few discrete levels using Monte Carlo dropout reduces this oversensitivity and leads to better RL outcomes. If you're training policies on reward models, this is a strong argument to discretize. ([huggingface.co](https://huggingface.co/papers/2606.21795))

Vijay Viswanathan, Shiqi Wang
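
As a rough illustration of what discretizing can look like, the sketch below averages several stochastic reward samples per answer (standing in for Monte Carlo dropout passes) and maps the averages onto a few equal-width levels. The paper's clustering procedure may differ; uniform binning, the function name `discretize_rewards`, and the sample values are assumptions.

```python
from statistics import mean


def discretize_rewards(samples_per_answer, num_levels=3):
    """Map each answer to an integer level in [0, num_levels - 1].

    samples_per_answer[i] holds several noisy reward samples for answer i,
    e.g. from repeated forward passes with dropout left on (assumed setup).
    """
    means = [mean(samples) for samples in samples_per_answer]
    lo, hi = min(means), max(means)
    if hi == lo:
        return [0] * len(means)  # all answers look equally good
    width = (hi - lo) / num_levels
    # Equal-width bins; clamp so the top score lands in the last level.
    return [min(int((m - lo) / width), num_levels - 1) for m in means]


samples = [
    [0.10, 0.12],  # answer A
    [0.11, 0.13],  # answer B: nearly identical to A
    [0.90, 0.92],  # answer C: clearly better
]
print(discretize_rewards(samples))  # [0, 0, 2]
```

Answers A and B differ slightly in raw score but land in the same level, so a policy trained on the levels gets no incentive to chase that difference.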

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

Introduces GauntletBench, a web-based testbed with video editors, workflow tools, 3D apps, and more, focused on tough perception and reasoning tasks. Even the best agents hit only ~19% success while non-expert humans clear 80%+. If you think your agent is "human level," try it here. ([huggingface.co](https://huggingface.co/papers/2606.14397))

Mykola Vysotskyi, Runqi Lin

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

Simulates a small coffee supply chain where agents run farms, roasters, and retailers over 90 days. Different models show very different communication styles and profit profiles. If you care about economic alignment and multi-agent markets, CoffeeBench is a ready-made sandbox. ([huggingface.co](https://huggingface.co/papers/2606.16613))

Issa Sugiura, Daichi Hattori

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Diagnoses a subtle numeric bias in current 4‑bit training formats and proposes a uniform alternative that stays stable on models up to 124B parameters. Hardware and training teams should read closely.

Qian Zhao, Kunlong Chen

Context-Aware RL for Agentic and Multimodal LLMs

Teaches models to pick the right context out of nearly identical options, improving long-horizon tasks and visual question answering. Use this if your agents cherry-pick the wrong evidence.

Peiyang Xu, Bangzheng Li

Playful Agentic Robot Learning

Robots practice through playful exploration, then reuse those skills for real tasks. If you script every task by hand, this points to a cheaper, more scalable path.

Junyi Zhang, Jiaxin Ge

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Finds that filtered first-person human videos can beat costly robot demonstrations for pretraining. If you’re collecting robot data manually, you should test this cheaper pipeline.

Juncheng Ma, Jianxin Bi

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Lets coding agents run real robots in a closed loop and continuously improve policies with minimal human babysitting. Robotics groups should treat this as a design template for autonomous labs.

Wenli Xiao, Jia Xie

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Gives agents a toolbox for understanding changing 3D scenes across views and time. Use this if your vision agents lose track of objects once the camera moves.

Yalun Dai, Hao Li

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Curates thousands of “clean” scenes for testing 3D view generation without messy backgrounds. If your models cheat by using clutter, this dataset will expose them.

Cheng-You Lu, Yi-Shan Hung

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Tracks customer data and policy state in a separate ledger so agents stop making forbidden tool calls. If you run support bots, this is directly actionable.

Md Nayem Uddin, Amir Saeidi

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Shows how to test agents so scores actually predict field performance, not just benchmark bragging rights. If you own an eval suite, you should copy this framework.

Dhaval C. Patel, Kaoutar El Maghraoui

DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects

Dexterous robot hands learn to move articulated objects by reasoning about contact, not just motion paths. Try this if you’re hitting brittleness in contact-heavy manipulation tasks.

Tianshan Zhang, Yijia Duan

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace

HuggingGPT treats a large language model as a conductor that calls out to many specialist models on HuggingFace. It shows how a text model plus a rich model hub can handle complex tasks spanning vision, speech, and language.

Yongliang Shen, Kaitao Song

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

The InternVL 2.5 work pushes an open multimodal model to match or beat top proprietary systems on tough benchmarks. It digs into how model size, data curation, and smart test-time tricks together move the performance frontier.

Zhe Chen, Weiyun Wang

Back to Bytes: Revisiting Tokenization Through UTF-8

The authors propose UTF8Tokenizer, which maps bytes directly to token IDs and encodes control signals using old-school control bytes. This keeps embedding tables tiny, speeds up tokenization, and can be bolted onto existing models to improve convergence without changing how you run them.

Amit Moryossef, Clara Meister
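
The core idea fits in a few lines. The sketch below is not the authors' UTF8Tokenizer: the `encode` and `decode` helpers are hypothetical, and the choice of the ASCII control bytes STX (0x02) and ETX (0x03) as start and end markers is an assumption made for illustration.

```python
STX, ETX = 0x02, 0x03  # assumed start-of-text / end-of-text control bytes


def encode(text, add_markers=True):
    """Token IDs are the raw UTF-8 byte values, so the vocabulary is 256."""
    ids = list(text.encode("utf-8"))
    return [STX] + ids + [ETX] if add_markers else ids


def decode(ids):
    """Drop the marker bytes and decode the remaining bytes as UTF-8."""
    payload = bytes(i for i in ids if i not in (STX, ETX))
    return payload.decode("utf-8")


print(encode("hi"))             # [2, 104, 105, 3]
print(decode(encode("héllo")))  # héllo
```

Because every ID is a byte value, the embedding table never needs more than 256 rows, which is where the "tiny embedding tables" claim comes from.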

ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

ThaiSafetyBench compiles nearly two thousand Thai prompts, many grounded in local culture, to probe model safety. The authors also release a classifier that matches GPT-4.1’s judgments, giving the community a reusable Thai safety watchdog.

Trapoom Ukarapol, Nut Chukamphaeng

tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation

tasksource standardizes how hundreds of NLP datasets map inputs and labels into a common schema. That makes it much easier to train and test multi-task models without hand-writing fragile preprocessing code for each dataset.

Damien Sileo

HuggingFace's Transformers: State-of-the-art Natural Language Processing

This 2019 paper introduced the Transformers library, giving a clean API around many transformer models and pretrained checkpoints. It turned cutting-edge NLP into a reusable software layer that underpins much of today's open-source LLM work.

Thomas Wolf, Lysandre Debut

The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

This paper systematically measures how settings like batch size and max tokens affect throughput for common LLM engines. It shows that smart hyperparameter tuning can beat naive defaults by double-digit percentages, even when hardware stays the same.

Matias Martinez

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

This paper proposes a single model that predicts future world state, plans in language, and outputs robot actions. It uses an autoregressive backbone tied to a "world expert" module for physical dynamics. Think of it as a step toward robots that learn from video and instructions without separate planning stacks.

Yi Yang, Zhihong Liu

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

VideoKR gives you 315k tough reasoning questions over 145k expert videos. It’s built to push models beyond captioning toward real multi-step explanations. Use it to pressure-test any video model that claims "understanding" rather than just pattern matching.

Lin Fu, Zheyuan Yang

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

LoomVideo packs video generation and editing into a single 5B model that talks to a multimodal language backbone. A clever "scale-and-add" trick lets it edit videos without doubling sequence length, so you get big speedups at similar quality. If you’re exploring small but strong video models, this is a new anchor point.

Jianzong Wu, Hao Lian

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

EvoDS is a data science agent that learns new tools and manages its own memory over time. It treats both "what skills to learn" and "what to remember" as separate learning problems. If you’re turning analytics workflows into long-lived agents, this is a concrete blueprint.

Zherui Yang, Fan Liu

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

The authors break code problems into atomic pieces, then recombine them to generate harder tasks for reinforcement learning with verifiable rewards. This produces richer training data than simple template expansion and boosts code performance across domains. It’s a strong signal that smarter task generation matters as much as bigger models.

Jiasheng Zheng, Boxi Cao

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

Code2LoRA turns an entire repository into a lightweight adapter instead of more prompt tokens. It supports static snapshots and an "evolving" mode that tracks commits with a GRU. If you run code models at scale, this is a practical way to cut context while staying up to date.

Liliana Hotsko, Yinxi Li

Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

Hide-and-Seek detects when vision-language-action robots are about to fail, using only trajectory-level labels. It learns which individual actions signal trouble without step-by-step annotation. If you run embodied agents, this is a practical way to catch bad executions before they break hardware.

Seongheon Park, Wendi Li

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

minWM turns heavy video diffusion models into fast, interactive "video world" simulators. It provides a full pipeline from data to few-step generators that run close to real time. If you care about agents in simulated worlds, this is an end-to-end recipe you can actually clone and run.

Min Zhao, Hongzhou Zhu

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

AgentDoG 1.5 is a small family of guardrail models trained on a detailed agent-risk taxonomy with surprisingly few samples. They can sit in front of powerful agents, flag dangerous actions, and run cheaply. If you build tool-using agents, this is emerging as a standard safety baseline to copy or test against.

Dongrui Liu, Yu Li

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qwen-VLA trains a single system to handle many robot tasks, environments, and bodies, instead of maintaining separate models. It shows strong generalization in manipulation, navigation, and trajectory prediction. If you’re running fleets of robots, this points to one shared brain rather than dozens of bespoke ones.

Qiuyue Wang, Mingsheng Li

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

LongDS-Bench stress-tests data-analysis agents over thousands of turns built from real Kaggle notebooks. Even top models collapse as sessions grow, with huge drops in late-turn accuracy. If you ship analytic agents, you should be benchmarking on LongDS-Bench or something like it, not just short chat tasks.

Kewei Xu, Xiaoben Lu

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

CollectionLoRA distills dozens of separate image-style LoRAs into a single adapter without the usual "style bleeding". You get 50 visual effects with one small file. If you host many custom styles, this is a direct way to slash storage and server overhead.

Fangtai Wu, Hailong Guo

Map2World: Segment Map Conditioned Text to 3D World Generation

Generates full 3D worlds from user-drawn segment maps, then adds fine detail with a separate enhancement network. Uses priors from existing asset generators to generalize across domains with limited training data. If you care about simulation, robotics, or game tools, this is a blueprint for controllable world generation. ([huggingface.co](https://huggingface.co/papers/2605.00781))

Jaeyoung Chung, Suyoung Lee

Let ViT Speak: Generative Language-Image Pre-training

Trains a Vision Transformer to predict language tokens directly from image tokens using a standard language-model objective. Removes contrastive tricks and extra decoders while staying competitive on many multimodal benchmarks. If you maintain vision backbones for language models, this is a simpler pretraining recipe to test. ([huggingface.co](https://huggingface.co/papers/2605.00809))

Yan Fang, Mengcheng Lan

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Trains a tokenizer and autoregressive image model together, letting generation feedback directly improve the tokenization scheme. Hits state-of-the-art ImageNet 256×256 scores without guidance. If you build discrete image generators, this supports fusing tokenizer and generator into one training pipeline. ([huggingface.co](https://huggingface.co/papers/2605.00503))

Wenda Chu, Bingliang Zhang

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Lays out a five-level roadmap for visual generation, from basic image mapping up to interactive world modeling for agents. Argues the next race is about structure, memory, and causality, not prettier pictures. If you work on vision models, benchmark against these levels, not just FID-style metrics. ([huggingface.co](https://huggingface.co/papers/2604.28185))

Keming Wu, Zuhao Yang

Heterogeneous Scientific Foundation Model Collaboration

Introduces Eywa, a framework that lets language models coordinate with domain‑specific scientific models across non-text data. Treats those models as tools inside an agent system and studies planning strategies across them. If you’re building AI for science, this shows how to wire specialized models into one reasoning loop. ([huggingface.co](https://huggingface.co/papers/2604.27351))

Zihao Li, Jiaru Zou

Co-Evolving Policy Distillation

Unifies two popular post‑training styles and shows why naively merging many expert policies can lose capabilities. Proposes a bidirectional distillation loop where student and experts improve together. If you juggle multiple specialist models, this offers a more stable way to fold them into one. ([huggingface.co](https://huggingface.co/papers/2604.27083))

Naibin Gu, Chenxu Yang

Efficient Training on Multiple Consumer GPUs with RoundPipe

Introduces a new pipeline schedule that avoids tight weight sharing constraints across stages when customizing large models. Targets setups with several consumer GPUs and slow interconnects, squeezing more throughput from cheap hardware. If your lab or startup runs on gamer cards, this is immediately actionable. ([huggingface.co](https://huggingface.co/papers/2604.27085))

Yibin Luo, Shiwei Gao

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Builds a benchmark where tasks and environments keep changing, and evaluation checks whether an agent actually executed real workflows. Uses logs and structured assessments, not just final answers. If you are deploying agents into production operations, this is much closer to what you actually care about. ([huggingface.co](https://huggingface.co/papers/2604.28139))

Chenxin Li, Zhengyang Tang

Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

Treats methods, not papers, as first-class nodes in a huge evolution graph of AI research. Lets you query how techniques emerged, combined, and replaced each other, then use that to rate or generate new ideas. If you invest in research strategy, this is basically a map of the territory. ([huggingface.co](https://huggingface.co/papers/2604.28158))

Yujun Wu, Dongxu Zhang

InCoder-32B-Thinking: Industrial Code World Model for Thinking

Trains a 32B-parameter code model on synthetic “thinking traces” and hardware execution logs. Targets chip design, GPU tuning, and embedded code with explicit reasoning steps.

Jian Yang, Wei Zhang

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Fuses a contrastive vision encoder and a self-supervised encoder, then feeds the combined tokens into a language model. Yields stronger results on visual understanding and grounding benchmarks.

Ankan Deria, Komal Kumar