HuggingFace AI Papers
Trending AI papers and research featured on HuggingFace.
Image Diffusion Preview with Consistency Solver
From DeepMind, this work uses consistency-based solvers to let users preview diffusion model outputs much more quickly than running a full sampling schedule. The idea is to generate rough-but-faithful previews that can guide prompt iteration and editing, then refine on demand. It’s another example of how inference-side tricks—not just bigger models—are improving practical usability of image generation.
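To make the preview-then-refine idea concrete, here is a toy sketch of why a drastically shortened sampling schedule yields a cheap preview; the denoiser and step rule below are placeholders, not the paper's consistency-based solver.

```python
import numpy as np

def denoiser(x, sigma):
    """Toy stand-in for a trained denoiser: pulls the noisy image toward a clean estimate.
    In the real system this would be the diffusion model's network."""
    return x / (1.0 + sigma)

def sample(x_T, sigmas):
    """Generic iterative sampler: more noise levels -> higher fidelity, more compute."""
    x = x_T
    for s_curr, s_next in zip(sigmas[:-1], sigmas[1:]):
        x0_est = denoiser(x, s_curr)                   # predicted clean image at this level
        x = x0_est + s_next * (x - x0_est) / s_curr    # step down to the next noise level
    return x

rng = np.random.default_rng(0)
x_T = rng.normal(size=(64, 64, 3))

full = sample(x_T, np.geomspace(80.0, 0.01, 50))      # full schedule: slow, high quality
preview = sample(x_T, np.geomspace(80.0, 0.01, 4))    # few-step preview: rough but cheap
```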
VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer
Introduces a safety constraint layer that can be bolted onto vision-language-action (VLA) models to filter unsafe actions before execution. Rather than retraining the whole control stack, VLSA learns a lightweight safety module that reasons jointly over visual context, language goals, and proposed actions. This aligns with the growing push for ‘safety shields’ around otherwise capable but unaligned agents.
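As a rough illustration of what a plug-and-play action filter can look like, here is a minimal sketch; the module name, feature sizes, and thresholding below are assumptions, not VLSA's actual architecture.

```python
import torch
import torch.nn as nn

class SafetyConstraintLayer(nn.Module):
    """Hypothetical plug-and-play safety filter in the spirit of the summary: scores each
    proposed action against visual context and the language goal, and masks out actions
    whose predicted risk exceeds a threshold. Names and sizes are assumptions."""

    def __init__(self, vis_dim=512, txt_dim=512, act_dim=32, hidden=256, threshold=0.5):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(vis_dim + txt_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        self.threshold = threshold

    def forward(self, vis_feat, goal_feat, actions):
        # actions: (num_candidates, act_dim) proposed by the frozen VLA policy
        n = actions.shape[0]
        ctx = torch.cat([vis_feat, goal_feat], dim=-1).expand(n, -1)
        risk = torch.sigmoid(self.scorer(torch.cat([ctx, actions], dim=-1))).squeeze(-1)
        safe_mask = risk < self.threshold
        return actions[safe_mask], risk  # only low-risk actions are passed on for execution

# Usage with dummy tensors standing in for the VLA model's outputs.
layer = SafetyConstraintLayer()
safe_actions, risk = layer(torch.randn(512), torch.randn(512), torch.randn(8, 32))
```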
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Presents DrivePI, a 4D (3D + time) multimodal large model for autonomous driving that unifies perception, prediction, and planning. Instead of separate stacks, DrivePI treats driving as a holistic spatial-temporal understanding problem, ingesting sensor data and outputting both scene interpretations and future trajectories. It’s another sign that end-to-end or semi end-to-end ‘driving MLLMs’ are becoming a serious research direction.
V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
Defines V‑REX, a benchmark where models must answer chains of interdependent questions about images, designed to probe exploratory reasoning instead of one-shot recognition. Each question builds on the previous ones, encouraging models to form and refine internal hypotheses about a scene. It’s a nice stress test for multimodal models that claim to ‘reason’ rather than just match patterns.
WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment
Proposes WebOperator, a web agent framework that uses action-aware tree search to plan sequences of browser actions rather than issuing greedy commands. By modeling the future impact of clicks, form fills, and navigations, the agent can backtrack from bad branches and robustly complete multi-step web tasks. It’s part of a broader shift away from ‘prompt a browser wrapper’ setups toward genuinely search-based web agents.
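A minimal best-first search sketch captures the core idea; `propose_actions`, `transition`, and `value` are assumed hooks for candidate browser actions, their simulated effect, and a learned promise score, not WebOperator's actual interfaces.

```python
import heapq

def tree_search(root_state, propose_actions, transition, value, max_expansions=50):
    """Best-first search over browser actions. Backtracking falls out of the priority
    queue: once a branch's states stop scoring well, the search simply stops expanding
    it and returns to more promising prefixes."""
    counter = 0
    frontier = [(-value(root_state), counter, root_state, [])]  # negate value: max-heap via min-heap
    best_plan, best_score = [], float("-inf")

    for _ in range(max_expansions):
        if not frontier:
            break
        neg_score, _, state, plan = heapq.heappop(frontier)
        if -neg_score > best_score:
            best_score, best_plan = -neg_score, plan
        for action in propose_actions(state):      # e.g. click, fill form, navigate
            counter += 1
            nxt = transition(state, action)        # simulated or sandboxed effect
            heapq.heappush(frontier, (-value(nxt), counter, nxt, plan + [action]))
    return best_plan
```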
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Claims an exact, error-free formulation of linear attention derived from a continuous-time view of transformer dynamics. The authors argue they can match the behavior of standard softmax attention while enjoying linear-time complexity, avoiding the approximation errors that plague many fast-attention variants. If the theory and practice hold up, this could become a key building block for large-context models and resource-constrained deployments.
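For context, this is the standard kernelized linear-attention baseline the paper is positioned against; the feature map `phi` below is the usual source of approximation error and is an assumption, not the paper's exact error-free formulation.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard softmax attention: O(n^2) in sequence length."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Generic kernelized linear attention: replaces exp(q.k) with phi(q).phi(k), so the
    (phi(K)^T V) summary is computed once and reused, giving O(n) cost in length. The
    choice of phi is what typically introduces approximation error."""
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                      # (d, d_v) summary, independent of query count
    normalizer = Qp @ Kp.sum(axis=0)   # per-query normalization
    return (Qp @ kv) / normalizer[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 16, 8))
out = linear_attention(Q, K, V)
```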
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Introduces NL2Repo-Bench, a benchmark where coding agents must generate or modify entire repositories from natural language specifications, rather than solving single-file LeetCode-style tasks. It evaluates long-horizon planning, tool use, and consistency across files and modules. This is a big step toward evaluating code agents in settings that look like real software projects instead of toy problems.
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Describes the QwenLong-L1.5 post-training recipe for extending LLM context windows while keeping reasoning quality intact. The work focuses not just on positional encodings but also on memory management strategies and training curricula that keep long-context performance from collapsing. This is highly relevant for anyone trying to turn a baseline LLM into a stable long-context model without re‑training from scratch.
Memory in the Age of AI Agents
A substantial survey that systematizes the fast-growing literature on ‘agent memory’—how agentic LLM systems store, retrieve, and evolve information over time. It proposes a taxonomy across forms (token, parametric, latent), functions (factual, experiential, working) and dynamics, and catalogs existing benchmarks and frameworks. If you’re building agent systems with nontrivial memory, this is quickly becoming the reference map of the territory.
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
ReFusion is a masked diffusion model for text that decodes in parallel over contiguous ‘slots’ instead of individual tokens. By combining diffusion-based planning with autoregressive infilling, it recovers much of the quality of strong autoregressive LLMs while massively speeding up generation and allowing KV-cache reuse. This is one of the more serious attempts to rethink LLM decoding beyond the usual left-to-right paradigm.
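A heavily simplified sketch of slot-parallel filling conveys the decoding pattern; `model_fill` is an assumed hook, and real masked diffusion decoding refines slots iteratively rather than in a single left-to-right pass.

```python
def parallel_slot_decode(model_fill, seq_len, slot_size=8, mask="<mask>"):
    """Generic sketch of slot-parallel masked decoding (not ReFusion's exact scheme): the
    sequence starts fully masked and is filled one contiguous slot at a time, with all
    tokens in a slot predicted in a single parallel call instead of token-by-token.
    model_fill(tokens, start, end) -> list of tokens is an assumed interface."""
    tokens = [mask] * seq_len
    for start in range(0, seq_len, slot_size):
        end = min(start + slot_size, seq_len)
        tokens[start:end] = model_fill(tokens, start, end)  # one forward pass per slot
    return tokens
```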
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
StereoSpace is a diffusion-based monocular-to-stereo system that learns geometric consistency purely from viewpoint conditioning, without explicitly predicting depth or doing warping. The authors also propose a strictly "geometry-free at test time" evaluation protocol and show their method produces sharper parallax and more comfortable stereo than existing depth- or warp-based pipelines.
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Omni-Attribute is an open-vocabulary attribute encoder that learns to isolate specific visual factors—like style, lighting, or expression—rather than entangling everything into a single holistic embedding. Using curated positive/negative pairs and a dual generative/contrastive objective, it produces attribute-specific embeddings that are better for retrieval, personalization, and compositional image generation.
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
This paper is a systematic exploration of reinforcement learning for text-to-3D generation, dissecting reward design, RL algorithms, data scaling, and hierarchical optimization. The authors introduce a new benchmark (MME-3DR), propose Hi-GRPO for global-to-local 3D refinement, and build AR3D-R1—the first RL-tuned text-to-3D model that improves both global shape quality and fine-grained texture alignment.
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
DuetSVG proposes a unified multimodal model that generates both raster images and SVG code jointly, using the image stream to guide SVG token decoding. By letting the model "see" what it’s drawing during generation, it produces vector graphics that are more visually faithful, semantically correct, and syntactically clean than text-only SVG generators.
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
MoCapAnything defines Category-Agnostic Motion Capture: given a monocular video and any rigged 3D asset, reconstruct motions that directly drive that specific skeleton. Using a reference-guided, factorized pipeline with a unified motion decoder and a curated Truebones Zoo dataset, it delivers high-quality animations and cross-species retargeting, making video-driven motion capture much more flexible for arbitrary 3D assets.
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
MiSI-Bench introduces "Microscopic Spatial Intelligence"—the ability to reason about invisible molecular 3D structures—and builds a massive VLM benchmark spanning 163k QA pairs over 4k molecules. Current VLMs lag well behind humans on many tasks, but a tuned 7B model can exceed human performance on some spatial transformations, highlighting both the promise and the need for domain knowledge in scientific AGI.
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.
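In pseudocode terms, the verification pattern looks roughly like the sketch below; `summarize` and `verifier_score` are assumed hooks onto the summarizer and verifier models, not OPV's actual API.

```python
def verify_chain_of_thought(cot_steps, summarize, verifier_score, threshold=0.5):
    """Outcome-based process verification, sketched: rather than scoring every raw token
    of a long chain-of-thought, each reasoning segment is reduced to its outcome (a short
    summary or intermediate result) and a verifier scores whether that outcome is a valid
    continuation of the outcomes seen so far."""
    context = []
    flagged = []
    for i, step in enumerate(cot_steps):
        outcome = summarize(step)                 # compress the segment to its claim
        score = verifier_score(context, outcome)  # estimated validity of this outcome
        if score < threshold:
            flagged.append((i, outcome, score))   # candidate reasoning error
        context.append(outcome)
    return flagged  # step indices the verifier believes are wrong
```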
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system achieves new SOTA on a held-out benchmark while reducing annotation cost.
Achieving Olympiad-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning
InternGeometry is a geometry-solving LLM agent that reaches medalist-level performance on IMO geometry problems by tightly integrating with a symbolic engine. It proposes auxiliary constructions and propositions, verifies them symbolically, reflects on the feedback, and is trained with a complexity-boosting RL curriculum, solving 44 of 50 problems with a tiny fraction of the data required by AlphaGeometry 2.
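The propose/verify/reflect loop the summary describes can be sketched as follows; all three callables and the verdict format are assumptions, not InternGeometry's actual interfaces.

```python
def propose_verify_reflect(problem, llm_propose, symbolic_verify, llm_reflect, max_iters=20):
    """Sketch of the loop described in the summary: the LLM agent suggests an auxiliary
    construction or proposition, a symbolic engine checks it, and the verdict is fed back
    so the next proposal improves. The verdict is assumed to be a dict with a 'solved'
    flag plus diagnostic information."""
    feedback = []
    for _ in range(max_iters):
        proposal = llm_propose(problem, feedback)
        verdict = symbolic_verify(problem, proposal)      # e.g. deduction-closure check
        if verdict.get("solved"):
            return proposal
        feedback.append(llm_reflect(proposal, verdict))   # turn the failure into guidance
    return None
```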
T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground
T-pro 2.0 is an open-weight Russian large language model focused on hybrid reasoning: it can answer directly or emit explicit reasoning traces, and it’s optimized for low-latency inference via speculative decoding. Alongside the model, the authors release a Russian instruction corpus, a math benchmark, and an EAGLE-based inference stack, making it a practical foundation for Russian-language reasoning applications.
Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task
The authors augment multimodal LLMs with a "Video Toolkit" and a STAR (Spatiotemporal Reasoning) framework that orchestrates calls to temporal and spatial tools for video question answering. Instead of treating the video as a black-box embedding, the model actively localizes key regions over time using tools, yielding sizable gains on VideoMME and LongVideoBench when wrapped around GPT-4o.
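A minimal tool-orchestration loop shows the control flow; `llm_step` and the `tools` dict are assumed interfaces, not the paper's actual Video Toolkit API.

```python
def star_style_vqa(question, video, llm_step, tools, max_turns=6):
    """Tool-orchestration loop in the spirit of the described framework: the multimodal
    LLM repeatedly decides either to call a temporal/spatial tool (e.g. sample frames
    near a timestamp, crop a region) or to answer. llm_step returns either
    {'tool': name, 'args': {...}} or {'answer': text} and is an assumed interface."""
    observations = []
    for _ in range(max_turns):
        decision = llm_step(question, observations)
        if "answer" in decision:
            return decision["answer"]
        result = tools[decision["tool"]](video, **decision["args"])
        observations.append((decision["tool"], decision["args"], result))
    # budget exhausted: ask the model to commit to an answer with what it has gathered
    return llm_step(question, observations, force_answer=True)["answer"]
```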
Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
HF pitches Confucius Code Agent as an industrial-strength open coding agent with hierarchical working memory, persistent notes, and a meta-agent that continuously refines configurations. If you care about reproducible, extensible coding agents rather than opaque SaaS tools, this is a substantial systems paper. ([huggingface.co](https://huggingface.co/papers/2512.10398))
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
FACTS is positioned as a one-stop leaderboard for LLM factuality, aggregating automated-judge scores from multimodal, parametric, search-augmented, and document-grounded tasks. It’s a natural next target for model releases that want to claim they’re less hallucinatory in practice, not just on isolated QA datasets. ([huggingface.co](https://huggingface.co/papers/2512.10791))
Evaluating Gemini Robotics Policies in a Veo World Simulator
Uses a fine-tuned Veo video model as a generative world simulator for robot policy evaluation, covering in-distribution tasks, OOD generalization axes, and physical/semantic safety tests. The key takeaway is that high-fidelity video models can stand in for many expensive real-world trials while still predicting policy rankings and vulnerabilities reliably. ([huggingface.co](https://huggingface.co/papers/2512.10675))
Stronger Normalization-Free Transformers
HF highlights this work for showing that a carefully designed point-wise activation (Derf) can fully replace normalization layers in Transformers and still improve performance across multiple domains. For practitioners, it points toward simpler, potentially faster architectures without layer norm’s synchronization and batch-size headaches. ([huggingface.co](https://huggingface.co/papers/2512.10938))
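A rough sketch of what a normalization-free block can look like; the paper's Derf activation is not reproduced here, so a learnable-gain erf is used purely as a stand-in assumption.

```python
import torch
import torch.nn as nn

class NormFreeBlock(nn.Module):
    """Sketch of a normalization-free Transformer block: the usual pre-LayerNorms are
    dropped and a point-wise activation is applied to the residual stream instead.
    The scaled erf with a learnable gain below is a stand-in, not the paper's Derf."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.alpha = nn.Parameter(torch.ones(2))  # learnable gains for the point-wise activation

    def act(self, x, i):
        return torch.erf(self.alpha[i] * x)  # point-wise: no batch or sequence statistics

    def forward(self, x):
        h = self.act(x, 0)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ff(self.act(x, 1))
        return x

block = NormFreeBlock()
out = block(torch.randn(2, 10, 256))
```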
ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning
ReViSE defines a new Reason-Informed Video Editing task and benchmark, then introduces a unified video model that edits while continuously self-evaluating its own reasoning. A built-in VLM judges whether the edited video logically satisfies the instruction, providing self-reflective feedback that tightens the link between "understanding" and actual visual edits.
H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
HF surfaces H2R-Grounder as a robotics paper that uses unpaired training to translate human interaction videos into realistic robot manipulation videos via video diffusion. It’s notable because it points to scaling robot learning from the vast pool of human internet videos without curating large paired robot datasets. ([huggingface.co](https://huggingface.co/papers/2512.09406))
MOA: Multi-Objective Alignment for Role-Playing Agents
HF highlights MOA as a way to align role-playing agents along many competing dimensions at once, using multi-objective RL and thought-augmented rollouts. It’s especially relevant if you’re trying to get smaller models to behave like premium chatbots in complex, persona-heavy domains. ([huggingface.co](https://huggingface.co/papers/2512.09756))
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
MoRel is a 4D Gaussian Splatting framework designed for long, motion-heavy videos, where naive 4DGS breaks down due to memory blowup and temporal flicker. It introduces anchor relay–based bidirectional blending and feature-variance–guided densification to maintain temporal coherence and handle occlusions over long time spans, and comes with a new long-range motion dataset for evaluation.
Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents
HF frames Fed-SE as a way to let LLM agents "self-evolve" across different clients and environments without sharing raw trajectories. For people deploying agents in regulated or siloed settings, it’s an interesting recipe for federated RL that reduces gradient conflicts across heterogeneous tasks. ([huggingface.co](https://huggingface.co/papers/2512.08870))
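Under standard FedAvg-style assumptions, one federated round might look like the sketch below; `local_update` and the parameter-dict format are assumptions, not Fed-SE's actual recipe.

```python
import numpy as np

def federated_round(global_params, clients, local_update, weights=None):
    """One federated round, sketched: each client improves the agent locally on its own
    environment (raw trajectories never leave the client) and only the resulting
    parameters are averaged into the new global model."""
    weights = weights or [1.0 / len(clients)] * len(clients)
    client_params = [local_update(global_params, c) for c in clients]  # runs client-side
    return {
        name: sum(w * p[name] for w, p in zip(weights, client_params))
        for name in global_params
    }

# Toy usage: two 'clients' nudging a single adapter matrix in different directions.
def toy_update(params, client_delta):
    return {"adapter": params["adapter"] + client_delta}

g = {"adapter": np.zeros((4, 4))}
g = federated_round(g, clients=[0.1 * np.eye(4), -0.05 * np.eye(4)], local_update=toy_update)
```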
Thinking with Images via Self-Calling Agent
Introduces sCoT, where a main language agent delegates visual subtasks to self-calling subagents rather than running a fully interleaved multimodal CoT. This makes high-resolution visual reasoning more data- and compute-efficient while still beating strong baselines on HR-Bench and related multimodal benchmarks. ([huggingface.co](https://huggingface.co/papers/2512.08511))
DragMesh: Interactive 3D Generation Made Easy
DragMesh offers a real-time framework for interactively generating articulated 3D motion by decoupling kinematics from motion generation, using a dual-quaternion VAE and FiLM conditioning. For 3D/graphics folks, it’s a signal that interactive, physically plausible articulation is becoming practical, not just offline. ([huggingface.co](https://huggingface.co/papers/2512.06424))
BEAVER: An Efficient Deterministic LLM Verifier
BEAVER is a deterministic verifier for large language models that computes tight, provably-sound bounds on the probability that a model satisfies a given semantic constraint. Instead of sampling and hoping for the best, it systematically explores the token space with specialized data structures, yielding much sharper risk estimates for correctness, privacy, and security-critical applications.
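The flavor of sound probability bounds via systematic exploration can be sketched as below; the interfaces and the probability-mass cutoff are assumptions, not BEAVER's specialized data structures.

```python
def probability_bounds(next_token_probs, satisfies, max_len, mass_cutoff=1e-4):
    """Sketch of bounding P(generated text satisfies a constraint) by exploring the token
    tree instead of sampling. next_token_probs(prefix) -> {token: prob} and
    satisfies(prefix) -> True/False/None (None = undecided) are assumed interfaces.
    Prefixes left unresolved (too deep or too unlikely) are exactly what separates the
    lower bound from the upper bound."""
    lower, unresolved = 0.0, 0.0
    frontier = [((), 1.0)]
    while frontier:
        prefix, mass = frontier.pop()
        verdict = satisfies(prefix)
        if verdict is True:
            lower += mass                       # constraint already guaranteed on this branch
        elif verdict is False:
            continue                            # constraint violated: contributes nothing
        elif len(prefix) >= max_len or mass < mass_cutoff:
            unresolved += mass                  # not resolved: counts toward the gap
        else:
            for tok, p in next_token_probs(prefix).items():
                frontier.append((prefix + (tok,), mass * p))
    return lower, lower + unresolved            # [lower, upper] bound on the probability
```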
X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
X-Humanoid presents a scalable way to "robotize" human videos, turning ordinary human motion into humanoid-robot video at scale. By adapting a powerful video generative model and building a large synthetic paired dataset in Unreal Engine, it can translate complex third-person human motions into physically plausible humanoid animations, unlocking web-scale data for embodied AI.
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
VQRAE introduces a unified visual tokenizer that can simultaneously support high-level multimodal understanding and discrete-token image generation. Building on a pretrained vision encoder and a high-dimensional semantic VQ codebook, it yields continuous semantic features for reasoning and discrete tokens for reconstruction, showing that quantizing semantic encoders with large codebooks can preserve both meaning and detail.
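The semantic VQ bottleneck at the heart of this kind of tokenizer can be sketched in a few lines; the shapes and codebook size below are illustrative, not VQRAE's actual configuration.

```python
import numpy as np

def vector_quantize(features, codebook):
    """Minimal vector-quantization sketch: each continuous feature vector is snapped to
    its nearest codebook entry, yielding a discrete token id (usable for generation and
    reconstruction) while the continuous feature remains available for understanding."""
    # features: (n, d) continuous encoder outputs; codebook: (K, d)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, K) squared distances
    ids = d2.argmin(axis=1)          # discrete tokens
    quantized = codebook[ids]        # de-quantized vectors fed to the decoder
    return ids, quantized

rng = np.random.default_rng(0)
ids, q = vector_quantize(rng.normal(size=(5, 16)), rng.normal(size=(1024, 16)))
```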