ArXiv AI Papers

Latest artificial intelligence and machine learning research papers from ArXiv.

LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Presents LongVie 2, a world-model-style generator for ultra-long videos with explicit control signals. The model can condition on multimodal inputs and maintain temporal coherence over very long horizons, with a public project page for demos. This sits right at the frontier of ‘video world models’ that might eventually underpin simulation-heavy planning and agent training.

Jianxiong Gao, Zhaoxi Chen

MMhops-R1: Multimodal Multi-hop Reasoning

Proposes MMhops-R1, a benchmark and model for multi-hop reasoning across visual and textual inputs. Tasks require chaining several intermediate inferences—over images and text—to reach a final answer, going beyond simple single-hop VQA. As LLMs get better at basic multimodal QA, these kinds of chain-of-thought, multi-hop setups are where reasoning gaps now show up, so having a dedicated resource here is valuable.

Tao Zhang, Ziqi Zhang

Towards Scalable Pre-training of Visual Tokenizers for Generation

Studies how to pre-train visual tokenizers at scale specifically for generative models, rather than piggybacking on CLIP-like encoders. The paper explores architectures and training setups that produce discrete visual tokens that are more generation-friendly, with released models on GitHub. Visual tokenization is increasingly the bottleneck for efficient, high-fidelity image and video generation, so a focused treatment here is quite timely.

Jingfeng Yao, Yuda Song

Recurrent Video Masked Autoencoders

Extends masked autoencoding to video with a recurrent architecture that can process long clips efficiently. Instead of treating frames independently or relying on heavy 3D convolutions, the model reuses temporal state to reconstruct masked patches over time, improving efficiency and temporal coherence. Strong authorship from the Zisserman/Carreira lineage suggests this could become a go‑to backbone for long-horizon video understanding.

Daniel Zoran, Nikhil Parthasarathy

MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data

Introduces MedInsightBench, a benchmark for ‘analytics agents’ that must reason over multimodal medical data—think tables, images, and reports—to extract multi-step clinical insights rather than just answer single questions. The tasks force agents to chain together retrieval, interpretation, and aggregation across data sources, closer to what real analytics workflows look like in hospitals. This is important if you care about LLM agents that move beyond toy QA into realistic decision support.

Zhenghao Zhu, Chuxue Cao

MAC: A Multi-Agent Framework for Interactive User Clarification in Multi-turn Conversations

Proposes a multi-agent architecture where specialized conversational agents coordinate to decide when and how to ask clarification questions in ambiguous multi-turn tasks. Instead of a monolithic assistant, MAC assigns roles and coordination rules so that the ‘right’ agent takes the lead on resolving uncertainty. This is a nice complement to SpeakRL, summarized in the next entry: SpeakRL focuses on *whether* to clarify, while MAC focuses on *who* clarifies and how to coordinate in complex workflows.

Emre Can Acikgoz, Jinoh Oh

SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning

Argues that current task-oriented agents are over-optimized as passive followers and under-use conversation as an action. SpeakRL introduces a reinforcement-learning setup that rewards models for asking clarifying questions when the user’s intent is ambiguous, balancing ‘asking’ vs ‘acting’. On synthetic task-oriented dialogue scenarios, the trained agents substantially improve task completion rates without bloating the number of turns, suggesting that proactive clarification is a powerful, underused control knob.

Emre Can Acikgoz, Jinoh Oh

Error-Driven Prompt Optimization for Arithmetic Reasoning

Targets the surprisingly hard problem of getting small on‑prem LLMs to do reliable arithmetic over tabular data in regulated environments. The authors propose an error-driven loop that clusters the model’s wrong answers, derives new prompt rules to address those failure modes, and iteratively refines a code-generation agent. On a finance-style deployment with a 4B-parameter model, this strategy reportedly boosts arithmetic accuracy to around 70% while keeping all computation inside the secure environment.
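
The loop described above can be pictured with a toy sketch. Everything here is hypothetical and for illustration only: the error categories, the `categorize` heuristic, and the rule texts are not from the paper, which presumably derives clusters and rules with a model rather than hand-written checks.

```python
from collections import Counter

# Hypothetical mapping from an error category to a prompt rule (illustrative only).
RULE_FOR_CATEGORY = {
    "rounding": "Round only the final result, to two decimal places.",
    "sign": "Treat values in parentheses as negative numbers.",
    "scale": "Convert all amounts to the same unit before computing.",
}

def categorize(expected: float, got: float) -> str:
    """Crude stand-in for clustering: bucket a wrong answer by its likely cause."""
    if abs(expected + got) < 1e-9:
        return "sign"  # same magnitude, opposite sign
    if got != 0 and round(expected / got) in (1000, 1_000_000):
        return "scale"  # off by a factor of a thousand or a million
    return "rounding"

def refine_prompt(prompt_rules, failures, top_k=1):
    """Append rules for the top_k most frequent failure categories, skipping duplicates."""
    counts = Counter(categorize(exp, got) for exp, got in failures)
    new_rules = list(prompt_rules)
    for category, _ in counts.most_common(top_k):
        rule = RULE_FOR_CATEGORY[category]
        if rule not in new_rules:
            new_rules.append(rule)
    return new_rules

# (expected, got) pairs collected from one evaluation round.
failures = [(-120.0, 120.0), (-35.5, 35.5), (10.26, 10.3)]
rules = refine_prompt(["Write Python code for every calculation."], failures)
```

In this toy run the sign errors dominate, so the sign-handling rule is the one appended; a real loop would re-evaluate the agent with the new rules and repeat.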

Árpád Pándy, Róbert Lakatos

Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection

Builds on Direct Preference Optimization but tackles its weak learning signal when both preferred and rejected responses share similar flaws. RPO adds a hint-guided reflection step that encourages the model to produce more contrastive, informative preference pairs before optimizing them. The result is a more stable and data-efficient on-policy alignment pipeline that still avoids full RLHF/RLAIF complexity.
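
For context, here is a minimal sketch of the standard DPO objective that RPO builds on, computed for a single preference pair from sequence log-probabilities. It does not implement RPO's hint-guided reflection step; the function name and the example numbers are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy favors the chosen response than the
    # reference model does, relative to the rejected response.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# A pair the policy cannot tell apart (zero margin) sits at log(2).
undecided = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# A pair where the policy already separates chosen from rejected has lower loss.
separated = dpo_loss(-12.0, -20.0, -15.0, -15.0)
```

The loss depends on the pair only through the margin, so the quality of the contrast between the two responses determines how much the update can teach. That is the lever the reflection step is described as improving.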

Zihui Zhao, Zechang Li

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.

Asa Cooper Stickland, Jan Michelfeit

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

Reframes GUI agent interaction history as a program with variables and control flow, using this structure to decide what to retain or discard in context. Combined with a global belief-state mechanism, AgentProg significantly improves long-horizon task success on AndroidWorld and a new benchmark, avoiding the context bloat and semantic loss that plague prior compression schemes. ([arxiv.org](https://arxiv.org/abs/2512.10371))

Shizuo Tian, Hao Wen

Stronger Normalization-Free Transformers

Introduces Derf, a simple point-wise activation that replaces normalization layers like LayerNorm and RMSNorm while improving generalization across vision, speech, DNA sequence modeling, and GPT-style language models. The authors systematically study properties of point-wise functions, run a large-scale search, and show Derf outperforms prior normalization-free approaches (e.g., Dynamic Tanh) with similar or better stability. ([arxiv.org](https://arxiv.org/abs/2512.10938))
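
To make the idea of a point-wise replacement for normalization concrete, the sketch below compares Dynamic Tanh with an erf-based function applied to a single scalar. The Dynamic Tanh form (a scaled tanh with an affine output) follows its published definition; the erf-based form is an assumption suggested by the name Derf, and the paper should be consulted for the exact parameterization. In a real model the parameters would be learned and applied element-wise to tensors.

```python
import math

def dyt(x: float, alpha: float = 1.0, gamma: float = 1.0, beta: float = 0.0) -> float:
    """Dynamic Tanh: gamma * tanh(alpha * x) + beta."""
    return gamma * math.tanh(alpha * x) + beta

def erf_pointwise(x: float, alpha: float = 1.0, shift: float = 0.0,
                  gamma: float = 1.0, beta: float = 0.0) -> float:
    """Assumed erf-based analogue: gamma * erf(alpha * x + shift) + beta."""
    return gamma * math.erf(alpha * x + shift) + beta

# Both functions squash large activations into a bounded range
# without computing any statistics over the feature dimension.
samples = [-100.0, -1.0, 0.0, 1.0, 100.0]
dyt_out = [dyt(x) for x in samples]
erf_out = [erf_pointwise(x) for x in samples]
```

Unlike LayerNorm or RMSNorm, neither function needs a mean or variance over the hidden dimension, which is what makes them point-wise.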

Mingzhi Chen, Taiming Lu

Reverse Thinking Enhances Missing Information Detection in Large Language Models

Shows that guiding LLMs through a reverse-thinking framework—reasoning backward from required conditions—substantially improves their ability to detect when problem statements lack necessary information. Experiments on modified GSM8K-style datasets demonstrate large gains over standard CoT and ToT prompting, with theoretical bounds on recall and false positives under simple accuracy assumptions. ([arxiv.org](https://arxiv.org/abs/2512.10273))
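
A symbolic toy version of reasoning backward from required conditions is sketched below. It is only an analogy for the prompting framework: the `formulas` dependency table, the quantity names, and the `missing_inputs` helper are all hypothetical, and the paper applies the idea through LLM prompts rather than explicit graphs.

```python
def missing_inputs(target, formulas, given):
    """Walk backward from the target quantity and collect base quantities
    that are neither stated in the problem nor derivable from a formula."""
    missing = set()
    visited = set()

    def need(quantity):
        if quantity in visited:
            return  # guard against cyclic dependencies
        visited.add(quantity)
        if quantity in given:
            return
        if quantity in formulas:
            for dependency in formulas[quantity]:
                need(dependency)
        else:
            missing.add(quantity)

    need(target)
    return sorted(missing)

# What each derived quantity requires (hypothetical word problem).
formulas = {
    "total_cost": ["unit_price", "quantity"],
    "quantity": ["boxes", "items_per_box"],
}
given = {"unit_price", "boxes"}  # facts stated in the problem text

gaps = missing_inputs("total_cost", formulas, given)
```

Here the backward walk reports `items_per_box` as the missing fact, so the problem should be flagged as unanswerable instead of being solved with a guessed value.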

Yuxin Liu, Chaojie Gu

Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Meta describes Confucius Code Agent (CCA), an open-source AI "software engineer" built on the Confucius SDK with hierarchical working memory, persistent cross-session notes, and robust tool orchestration. On SWE-Bench-Pro it reaches 54.3% Resolve@1, substantially outperforming prior coding agents while emphasizing transparency and extensibility for industrial-scale workflows. ([huggingface.co](https://huggingface.co/papers/2512.10398))

Zhaodong Wang, Zhenting Qi

Thinking with Images via Self-Calling Agent

Proposes Self-Calling Chain-of-Thought (sCoT), which reformulates multimodal CoT as a language-only CoT where a main agent spawns parameter-sharing visual subagents to solve atomic subtasks. This architecture simplifies RL for visual reasoning and yields better HR-Bench 4K performance with ~75% fewer GPU hours than prior multimodal CoT approaches. ([arxiv.org](https://arxiv.org/abs/2512.08511))

Wenxi Yang, Yuzhong Zhao

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

FACTS is a multi-part leaderboard that evaluates LLM factuality across image-based QA, closed-book QA, search-augmented QA, and document-grounded long-form responses, using automated judge models. It’s designed as a long-lived suite with public and private splits, giving a single factuality score while still exposing failure modes across modalities and tool-use settings. ([huggingface.co](https://huggingface.co/papers/2512.10791))

Aileen Cheng, Alon Jacovi

Evaluating Gemini Robotics Policies in a Veo World Simulator

Google DeepMind uses a frontier video model (Veo) as a generative world model to evaluate robot manipulation policies across nominal, out-of-distribution, and safety-critical settings. Validating against 1600+ physical trials, they show that Veo-based simulation can predict real-world policy rankings and failure modes, enabling scalable red-teaming and OOD robustness checks without exhaustive hardware experiments. ([huggingface.co](https://huggingface.co/papers/2512.10675))

Gemini Robotics Team, Coline Devin

d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

Targets RL for diffusion LLMs by introducing d-TreeRPO, which uses tree-structured rollouts and bottom-up advantage computation with verifiable outcome rewards for fine-grained credit assignment. The method also adds a time-scheduled self-distillation loss to improve probability estimates, yielding large gains on Sudoku, Countdown, GSM8K, and Math500 over existing RL baselines. ([arxiv.org](https://arxiv.org/abs/2512.09675))
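
One simple way to picture bottom-up advantage computation on a rollout tree is sketched below. It assumes a node's value is the mean of its children's values and a child's advantage is its value minus its parent's value. These rules, the `Node` class, and the example tree are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    reward: Optional[float] = None  # set only on leaves (verifiable outcome)
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0
    advantage: float = 0.0

def backup(node: Node) -> float:
    """Fill in values bottom-up, then score each child against its parent."""
    if not node.children:
        node.value = float(node.reward)
    else:
        node.value = sum(backup(child) for child in node.children) / len(node.children)
        for child in node.children:
            child.advantage = child.value - node.value
    return node.value

# Two partial generations, each completed twice and checked by a verifier.
a = Node("a", children=[Node("a1", reward=1.0), Node("a2", reward=1.0)])
b = Node("b", children=[Node("b1", reward=1.0), Node("b2", reward=0.0)])
root = Node("root", children=[a, b])
backup(root)
# root.value is 0.75; a.advantage is 0.25 and b.advantage is -0.25
```

Because each intermediate step is compared with its siblings, credit is assigned at the step where the rollouts diverge instead of being spread evenly over the whole sequence.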

Leyi Pan, Shuchang Tao

MOA: Multi-Objective Alignment for Role-Playing Agents

MOA is an RL framework that jointly optimizes multiple fine-grained rubrics for role-playing agents—such as persona consistency, domain knowledge, and dialogue quality—using multi-objective alignment and thought-augmented rollouts. An 8B model trained with MOA can match or surpass GPT‑4o and Claude on PersonaGym and RoleMRC, suggesting smaller models can be pushed far with better objective design. ([huggingface.co](https://huggingface.co/papers/2512.09756))

Chonghua Liao, Ke Wang

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Introduces a video-to-video translation framework that turns everyday human-object interaction videos into robot-manipulation videos with realistic, physically grounded motion, trained only on unpaired robot videos. The method uses inpainting and pose cues to bridge the embodiment gap and fine-tunes a Wan 2.2 diffusion model for temporally coherent, robot-conditional video synthesis. ([huggingface.co](https://huggingface.co/papers/2512.09406))

Hai Ci, Xiaokang Liu

GeoDM: Geometry-aware Distribution Matching for Dataset Distillation

Proposes GeoDM, a dataset distillation framework that performs distribution matching in a product space of Euclidean, hyperbolic, and spherical manifolds, with learnable curvature and weights. This geometry-aware approach yields lower generalization error bounds and consistently outperforms prior distillation methods by better aligning synthetic and real-data manifolds. ([arxiv.org](https://arxiv.org/abs/2512.08317))
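
As a rough illustration of what a product space of the three geometries means, the sketch below combines the textbook geodesic distances for each factor into one weighted distance. The curvatures are fixed at 0, -1, and +1, and the weights are plain inputs, whereas the paper describes both as learnable; the combination rule and all function names are assumptions made for illustration.

```python
import math

def euclidean(u, v):
    """Straight-line distance in flat space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def poincare(u, v):
    """Geodesic distance in the Poincare ball, curvature -1 (norms must be < 1)."""
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v)))

def spherical(u, v):
    """Great-circle distance on the unit sphere (inputs must be unit vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error

def product_distance(parts_u, parts_v, weights):
    """Weighted l2 combination of the per-factor geodesic distances."""
    factors = (euclidean, poincare, spherical)
    return math.sqrt(sum(
        w * dist(u, v) ** 2
        for w, dist, u, v in zip(weights, factors, parts_u, parts_v)
    ))

# Each point is a triple: (Euclidean part, hyperbolic part, spherical part).
x = ([0.0, 0.0], [0.0, 0.0], [1.0, 0.0])
y = ([3.0, 4.0], [0.5, 0.0], [0.0, 1.0])
d = product_distance(x, y, weights=(1.0, 1.0, 1.0))
```

Setting a weight to zero removes that factor from the distance, which hints at why learnable weights let the method adapt the geometry to the data.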

Xuhui Li, Zhengquan Luo

Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents

Fed-SE is a federated learning framework for LLM agents that must improve across heterogeneous environments under strict privacy constraints. It combines local parameter-efficient fine-tuning on high-return trajectories with global aggregation in a low-rank subspace, reducing negative transfer and boosting average success rates by ~18% over federated baselines. ([huggingface.co](https://huggingface.co/papers/2512.08870))
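
The two stages described above can be sketched in a few lines. This is a generic FedAvg-style illustration under stated assumptions, not Fed-SE's actual aggregation rule: the trajectory dictionaries, the return threshold, and the flattened adapter vectors are hypothetical stand-ins for real low-rank adapter weights.

```python
def select_high_return(trajectories, threshold):
    """Keep only trajectories whose return meets the threshold (local filtering)."""
    return [t for t in trajectories if t["return"] >= threshold]

def aggregate_adapters(client_updates):
    """Example-weighted average of flattened adapter weights.

    client_updates: list of (num_examples, weights) pairs, one per client.
    Only the small adapter vectors leave each client, never the raw trajectories.
    """
    total = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(n * weights[i] for n, weights in client_updates) / total
        for i in range(dim)
    ]

# Local step on one client: filter its trajectories before fine-tuning.
local_data = [
    {"id": "t1", "return": 0.9},
    {"id": "t2", "return": 0.2},
    {"id": "t3", "return": 0.7},
]
kept = select_high_return(local_data, threshold=0.5)

# Global step: combine adapter updates from two clients.
global_adapter = aggregate_adapters([
    (1, [1.0, 0.0]),
    (3, [0.0, 1.0]),
])
# global_adapter is [0.25, 0.75]
```

Averaging adapter factors directly is a simplification: when an adapter is a product of two low-rank matrices, averaging the factors is not the same as averaging the products, which is one reason aggregation in a low-rank subspace needs more care than this sketch shows.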

Xiang Chen, Yuling Shi