ArXiv AI Papers

Latest artificial intelligence and machine learning research papers from ArXiv.

Make Your LVLM KV Cache More Lightweight

Targets the memory blow-up from vision tokens in large vision–language models at inference time. Uses a prompt-aware method, LightKV, to merge redundant vision tokens before decoding. If you ship LVLMs, this is a concrete way to cut GPU memory and costs without killing quality. ([arxiv.org](https://arxiv.org/list/cs.CV/pastweek?show=100))

Anonymous (ICLR and TMLR drafts; arXiv metadata lists named authors)

ObjectGraph: From Document Injection to Knowledge Traversal — A Native File Format for the Agentic Era

Proposes a new file format that treats documents as typed graphs instead of long strings dumped into context windows. Agents query and traverse nodes, cutting tokens used by up to ~95% while keeping task accuracy. If your agents still paste whole PDFs into prompts, this hints at a cleaner architecture layer. ([arxiv.org](https://arxiv.org/abs/2604.27820))

Mohit Dubey, Open Gigantic

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Argues most terminal-agent benchmarks are written like prompts, so they test instruction clarity, not real capability. Provides a playbook for building adversarial, hard, and readable tasks, plus a catalog of common reward-hacking failure modes. If you rely on agent benchmark scores, sanity-check your tasks against this checklist before bragging. ([papers.cool](https://papers.cool/arxiv/2604.28093?utm_source=openai))

Ivan Bercovich

Characterizing the Consistency of the Emergent Misalignment Persona

Fine-tunes an aligned model on narrow harmful tasks and studies how that misaligned "persona" behaves across many scenarios. Finds patterns in how self-reports, harmful actions, and domain choices line up or diverge. If you care about frontier safety, mine this for concrete tests instead of relying on vibes about "misalignment". ([arxiv.org](https://arxiv.org/abs/2604.28082))

Anietta Weckauff, Yuchen Zhang

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

Lets language models write Answer Set Programs, then uses feedback from a symbolic solver to iteratively fix their code. Shows this combo handles default rules and exceptions better than standard constraint solvers on diverse logic tasks. If you are building reasoning-heavy agents, this is a concrete recipe for bolting on symbolic reliability. ([arxiv.org](https://arxiv.org/abs/2604.27960))

Adam Ishay, Joohyung Lee

In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks

Directly compares workflow graphs managed by external orchestrators to a single prompt that spells out the whole procedure. For travel, tech support, and claims flows, one big prompt beats complex agent tooling on quality and failure rates. If your product is more orchestration code than prompt, this paper says simplify before you scale. ([arxiv.org](https://arxiv.org/abs/2604.27891))

Simon Dennis, Michael Diamond

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

Surveys how teams use reinforcement learning plus GUI interaction to push beyond simple desktop macros into always-on "digital inhabitants". Breaks the space into offline, online, and hybrid strategies, and highlights trends like world-model training and process-level rewards. If you’re automating real GUI workflows, treat this as a roadmap, not just a survey. ([arxiv.org](https://arxiv.org/abs/2604.27955))

Junan Hu, Jian Liu

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Builds thousands of synthetic "computers" with realistic files and calendars to simulate month-long knowledge work for AI agents. Each run spans 8+ hours and ~2,000 steps, yielding dense signals for training long-horizon productivity agents. If you are designing office copilots or agent training curricula, copy this setup to cheaply generate rich experience data. ([arxiv.org](https://arxiv.org/abs/2604.28181))

Tao Ge, Baolin Peng

Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

Builds a massive graph of how AI methods evolve across 1M+ papers, with over 9M typed edges between techniques. Lets agents and humans trace method lineages, score idea novelty, and auto-generate new research directions. If you scout research or design AI research agents, treat this as a new data layer, not just another paper.

Yujun Wu, Dongxu Zhang

Rethinking Agentic Reinforcement Learning In Large Language Models

Synthesizes the fast-growing literature on reinforcement learning for agent-style language models, from environment design to safety and compute limits. Argues the key shift is treating models as long-lived decision-makers, not one-shot text generators. If you’re planning big training runs for agents, use this as a design checklist, not just a citation. ([databubble.co](https://databubble.co/news/rethinking-agentic-reinforcement-learning-in-large-language-models?utm_source=openai))

Fangming Cui, Ruixiao Zhu

NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons

Replaces mixture-of-experts with a finer-grained mixture-of-neurons for reasoning tasks. The goal is more interpretable and steerable thinking steps.

Haonan Dong, Kehan Jiang

Verbalizing LLMs' assumptions to explain and control sycophancy

Gets models to spell out their hidden assumptions before answering. That makes it easier to spot flattery-driven answers and dial them down.

Myra Cheng, Isabel Sieh

LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation

Shows how to poison graph-structured knowledge used by retrieval-augmented systems. Focuses on attacks that subtly flip logical conclusions, not just surface facts.

Yilin Xiao, Jin Chen

R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

Wraps language models in a loop of self-critique and revision for long-form writing. Focuses on deeper reasoning, not just surface polish.

Wanlong Liu, Bo Zhang

PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

Builds a plug-and-play framework for 3D medical imaging that unifies detection and risk prediction. Targets hospital workflows, not just leaderboard benchmarks.

Daniel C. MacRae, Luuk van der Hoek

Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

Proposes a new way for models to draft and prune multiple token paths in parallel. Targets faster text generation without any additional training.

Tao Jin, Phuong Minh Nguyen

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

Studies how mixture-of-experts language models actually route work between experts. Offers tools to inspect which expert fires and why, instead of treating MoE as a black box.

Jeremy Herbst, Jae Hee Lee

BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs

Reuses standard one-direction language models to build bidirectional encoders that can handle text and other signals. Bridges chat models and BERT-style encoders.

Nicolas Boizard, Théo Deschamps-Berger

Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model

Introduces Neuro-RIT, which uses neuron-level signals to guide instruction tuning of language models for retrieval-heavy tasks. The aim is steadier answers when retrieved documents shift or are noisy.

Jaemin Kim, Jae O Lee

Reliable Control-Point Selection for Steering Reasoning in Large Language Models

Looks for the best “control points” inside a model’s thinking steps where interventions actually stick. Goal: steer reasoning without wrecking quality.

Haomin Zhuang, Hojun Yoo

Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

Uses a language model’s own feedback as a training signal for retrieval rerankers in RAG pipelines. Aims to pick more useful documents for question answering.

Yuhang Wu, Xiangqing Shen

SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

SPOT lets a model "pause" and think over selected spans instead of every token. It aims to keep reasoning strong while cutting compute and revealing what the model is thinking about.

Yunlong Chu, Minglai Shao

MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

Presents a framework that treats an agent swarm as a graph you can design, visualize, and debug. Makes multi-agent systems feel more like building workflows than wiring hacks.

Yang Liu, Jinxuan Cai

Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

Lets models estimate their confidence before writing full answers. That enables routing hard questions to stronger models and skipping easy ones to save money.

Changcheng Li, Jiancan Wu
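
The routing idea in this summary can be sketched in a few lines. This is a minimal illustration, not the paper's method: `route`, `toy_confidence`, `toy_cheap`, `toy_strong`, and the `threshold` value are all hypothetical names and settings chosen for the example, and the confidence function is a placeholder heuristic rather than a learned estimator.

```python
from typing import Callable, Tuple


def route(
    question: str,
    estimate_confidence: Callable[[str], float],
    cheap_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    threshold: float = 0.7,  # assumed cutoff, not taken from the paper
) -> Tuple[str, str]:
    """Pick a model from a confidence score computed before any answer is generated."""
    confidence = estimate_confidence(question)
    if confidence >= threshold:
        # Confident enough: the cheaper model handles the question.
        return "cheap", cheap_model(question)
    # Low confidence: escalate to the stronger, more expensive model.
    return "strong", strong_model(question)


def toy_confidence(question: str) -> float:
    # Hypothetical stand-in: treat short questions as easy.
    return 1.0 if len(question.split()) <= 5 else 0.2


def toy_cheap(question: str) -> str:
    return "cheap answer to: " + question


def toy_strong(question: str) -> str:
    return "strong answer to: " + question


if __name__ == "__main__":
    for q in ["What is 2 + 2?", "Explain why this distributed lock can deadlock under retries"]:
        print(route(q, toy_confidence, toy_cheap, toy_strong))
```

The point of the design is that only one model is called per question, so the cost of the strong model is paid only when the pre-answer confidence is low.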

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Introduces an image tokenizer that produces a one-dimensional, causally ordered token sequence. Aims at sharper, more stable video and streaming image generation.

Yitong Chen, Zuxuan Wu

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Builds a single diffusion model for both understanding and generating across images and other formats. Aims to replace many task-specific generators with one backbone.

Lijiang Li, Zuwei Long

MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

Introduces a new optimization rule for training chat agents over long conversations. The goal: steadier learning and more helpful dialogue without exploding token and compute costs.

Naifan Zhang, Ruihan Sun

Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

Probes frozen vision backbones for tasks like measuring lengths and angles. Tests how much real-world geometry is already baked into general-purpose models.

Yakov Pyotr Shkolnikov

ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

Improves one-shot pruning so teams can shrink models aggressively with less quality loss. Directly targets cheaper deployment on GPUs and even consumer hardware.

Mingluo Su, Huan Wang

Diffusion Language Models Are Natively Length-Aware

Argues that diffusion-style language models naturally handle short and long prompts without special tricks. Points to a promising path for huge-context text models.

Vittorio Rossi, Giacomo Cirò

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

Uses self-supervision to train flow-matching generators across data types like images and video. Seeks more stable, high-quality generation with fewer labeled examples.

Hila Chefer, Patrick Esser

RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning

Optimizes which branches a graph-of-thoughts system actually runs. Cuts redundant reasoning steps while trying to keep answer quality similar.

Yuhang Liu, Ruijie Wang

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Pushes vision-language models toward smaller, cheaper designs built from language-style encoders. Targets strong image+text performance while keeping running costs low.

Boqiang Zhang, Lei Ke

LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

Defines a benchmark that scores how models actually write answers when given retrieved documents. Helps teams compare RAG setups on answer quality, not just retrieval hit rates.

Koki Itai, Shunichi Hasegawa

Multimodal Large Language Models as Image Classifiers

Tests multimodal large language models as plain image classifiers. Shows when they beat or trail classic vision networks.

Nikita Kisel, Illia Volkov

NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices

NanoFLUX distills big image generators into much smaller models that still follow prompts well on phones. It uses smart loss functions to keep visual quality while slashing memory and compute.

Ruchika Chavhan, Malcolm Chadwick

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

This work replaces fixed block sizes in diffusion-style language models with blocks that expand or shrink based on how hard the text is. It also adds a cache that reuses past computation, cutting compute while keeping or improving generation quality.

Lizhuo Luo, Shenggui Li

KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs

KV-CoRE systematically measures how well key-value caches in different layers and tasks can be compressed with low-rank methods. It helps engineers know where cache compression will save memory without wrecking accuracy.

Jian Chen, Zhuoran Wang

Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

This paper trains small reasoning models with rewards that check whether each intermediate step actually follows from earlier ones. That reduces reward hacks where the model spews long but logically broken chains of thought.

Shuo Nie, Hexuan Deng

DFlash: Block Diffusion for Flash Speculative Decoding

DFlash uses a small diffusion model to draft whole blocks of tokens in parallel, then lets a larger model quickly verify them. It keeps output quality while giving over 6x faster generation than standard decoding on common LLMs.

Jian Chen, Yesheng Liang
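
The draft-then-verify loop described here can be sketched as generic greedy speculative decoding. This is a toy under stated assumptions, not DFlash itself: the drafter below is an ordinary function called token by token rather than a block diffusion model, verification is written as a sequential loop where a real system would use one parallel forward pass, and `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical names.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(prefix: List[int], target_next: NextToken, max_new_tokens: int) -> List[int]:
    """Baseline: ask the target model for one token at a time."""
    out = list(prefix)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def speculative_decode(
    prefix: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    block_size: int,
    max_new_tokens: int,
) -> List[int]:
    """Draft a block with the small model, keep the prefix the target agrees with."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    out = list(prefix)
    produced = 0
    while produced < max_new_tokens:
        # 1. The drafter proposes a block of candidate tokens.
        draft: List[int] = []
        for _ in range(block_size):
            draft.append(draft_next(out + draft))
        # 2. The target checks each candidate against its own greedy choice.
        accepted: List[int] = []
        for token in draft:
            expected = target_next(out + accepted)
            if token != expected:
                # First mismatch: take the target's token and discard the rest of the block.
                accepted.append(expected)
                break
            accepted.append(token)
        # 3. Commit the accepted tokens, respecting the length budget.
        accepted = accepted[: max_new_tokens - produced]
        out.extend(accepted)
        produced += len(accepted)
    return out


def toy_target(context: List[int]) -> int:
    # Hypothetical "large model": counts upward modulo 10.
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    # Hypothetical "small model": agrees with the target except after multiples of 4.
    return 0 if context[-1] % 4 == 0 else (context[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode([0], toy_draft, toy_target, block_size=4, max_new_tokens=12))
```

Because every committed token equals the target's own greedy choice, the output matches plain greedy decoding with the target; any speedup comes from how many drafted tokens are accepted per verification step.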

Optimization is Not Enough: Why Problem Formulation Deserves Equal Attention

The authors argue that many "AI optimization" wins really come from how humans pose the problem, not from the math alone. They show cases where small tweaks in formulation beat heavy algorithmic tuning, especially in engineering-style tasks.

Iván Olarte Rodríguez, Gokhan Serhat

RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference

RRAttention keeps only a fraction of attention blocks by rotating which positions each head looks at in a round-robin pattern. It recovers almost full-attention accuracy while skipping about half the computation at 128K-token context lengths.

Siran Liu, Guoxia Wang

OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

OdysseyArena tests agents on environments where they must discover hidden rules over hundreds of steps, not just follow given instructions. Even top models struggle, showing that long-run discovery and strategy remain weak points.

Fangzhi Xu, Hang Yan

EuroLLM-22B: Technical Report

EuroLLM-22B is a 22B-parameter open model focused on European languages, with long-context support and a detailed training recipe. It aims to give EU labs and companies a strong regional alternative to US-centric frontier models.

Miguel Moura Ramos, Duarte M. Alves

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

Diamond Maps reframe reward alignment as learning a transport map over model outputs instead of tweaking rewards token by token. This gives smoother, more sample-efficient updates and shows strong results across safety-style alignment tasks.

Peter Holderrieth, Douglas Chen

Shared LoRA Subspaces for almost Strict Continual Learning

The paper shows that you can reuse a shared low-rank adapter space across many tasks instead of adding new adapters forever. That keeps performance high while holding down memory as models pick up new skills over time.

Prakhar Kaushik, Ankit Vaidya

Reinforcement World Model Learning for LLM-based Agents

The authors train a world model that predicts how text-based environments change when an agent acts, using rewards that close the gap between imagined and real next states. Agents using this learned model perform better on ALFWorld and τ² Bench than agents trained only on task success signals.

Xiao Yu, Baolin Peng

Neuro-Symbolic Activation Discovery: Transferring Mathematical Structures from Physics to Ecology for Parameter-Efficient Neural Networks

Using genetic programming, the author mines custom activation functions from physics data and reuses them in ecology models. These bespoke activations match accuracy with far fewer parameters.

Anas Hajbi

ARC Prize 2025: Technical Report

This report reviews the 2025 ARC-AGI-2 competition and the best-performing program-synthesis and agentic approaches. It argues that refinement loops and contamination issues now dominate progress on abstract reasoning benchmarks.

François Chollet, Mike Knoop

What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge

Using a NeurIPS data curation challenge, this paper shows that picking hard, aligned examples beats just adding more or more diverse data. For vision–language reasoning, curation quality matters more than dataset size.

Yosub Shin, Michael Buriek