ArXiv AI Papers
Latest artificial intelligence and machine learning research papers from ArXiv.
Make Your LVLM KV Cache More Lightweight
Targets the memory blow-up from vision tokens in large vision–language models at inference time. Uses a prompt-aware method, LightKV, to merge redundant vision tokens before decoding. If you ship LVLMs, this is a concrete way to cut GPU memory and costs without killing quality. ([arxiv.org](https://arxiv.org/list/cs.CV/pastweek?show=100))
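To make "merging redundant vision tokens" concrete, here is a minimal sketch that averages consecutive token vectors whose cosine similarity exceeds a threshold. It is a generic similarity-based merge, not LightKV itself: it ignores the prompt entirely, and the function names and the `threshold` value are assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_redundant(tokens: list[list[float]], threshold: float = 0.95) -> list[list[float]]:
    """Greedily fold each token into the previous group when they are near-duplicates.

    Each group keeps a running mean, so a merged token represents all members.
    The threshold is a hypothetical value, not one taken from the paper.
    """
    groups: list[tuple[list[float], int]] = []  # (mean vector, member count)
    for vec in tokens:
        if groups and cosine(groups[-1][0], vec) >= threshold:
            mean, n = groups[-1]
            new_mean = [(m * n + v) / (n + 1) for m, v in zip(mean, vec)]
            groups[-1] = (new_mean, n + 1)
        else:
            groups.append((list(vec), 1))
    return [mean for mean, _ in groups]


vision_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
compact = merge_redundant(vision_tokens)
```

Fewer tokens entering the decoder means fewer key-value entries to cache, which is where the memory saving would come from.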
ObjectGraph: From Document Injection to Knowledge Traversal — A Native File Format for the Agentic Era
Proposes a new file format that treats documents as typed graphs instead of long strings dumped into context windows. Agents query and traverse nodes, cutting tokens used by up to ~95% while keeping task accuracy. If your agents still paste whole PDFs into prompts, this hints at a cleaner architecture layer. ([arxiv.org](https://arxiv.org/abs/2604.27820))
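As a rough illustration of the "typed graph instead of long string" idea, the sketch below stores a document as typed nodes with labeled edges and lets an agent pull only the nodes it needs. The `Node` and `DocGraph` classes, the node types, and the edge labels are all hypothetical; this is not the ObjectGraph format itself.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    node_type: str  # hypothetical types, e.g. "section" or "claim"
    text: str
    edges: dict[str, list[str]] = field(default_factory=dict)  # edge label -> target ids


class DocGraph:
    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def by_type(self, node_type: str) -> list[Node]:
        """Return every node of the given type."""
        return [n for n in self.nodes.values() if n.node_type == node_type]

    def follow(self, node_id: str, label: str) -> list[Node]:
        """Return the nodes reached from node_id along edges with this label."""
        return [self.nodes[t] for t in self.nodes[node_id].edges.get(label, [])]


doc = DocGraph()
doc.add(Node("s1", "section", "Refund policy", {"contains": ["c1", "c2"]}))
doc.add(Node("c1", "claim", "Refunds are issued within 14 days."))
doc.add(Node("c2", "claim", "Gift cards are non-refundable."))

# The agent places only the relevant nodes in context, not the whole document.
context = [n.text for n in doc.follow("s1", "contains")]
```

The token saving in a design like this would come from the last line: the prompt carries two short claims rather than the full file.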
What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
Argues most terminal-agent benchmarks are written like prompts, so they test instruction clarity, not real capability. Provides a playbook for building adversarial, hard, and readable tasks, plus a catalog of common reward-hacking failure modes. If you rely on agent benchmark scores, sanity-check your tasks against this checklist before bragging. ([papers.cool](https://papers.cool/arxiv/2604.28093?utm_source=openai))
Characterizing the Consistency of the Emergent Misalignment Persona
Fine-tunes an aligned model on narrow harmful tasks and studies how that misaligned "persona" behaves across many scenarios. Finds patterns in how self-reports, harmful actions, and domain choices line up or diverge. If you care about frontier safety, mine this for concrete tests instead of relying on vibes about "misalignment". ([arxiv.org](https://arxiv.org/abs/2604.28082))
LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning
Lets language models write Answer Set Programs, then uses feedback from a symbolic solver to iteratively fix their code. Shows this combo handles default rules and exceptions better than standard constraint solvers on diverse logic tasks. If you are building reasoning-heavy agents, this is a concrete recipe for bolting on symbolic reliability. ([arxiv.org](https://arxiv.org/abs/2604.27960))
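The generate, check, and repair loop described above can be sketched as follows. Both the model and the solver are replaced by toy stand-ins (`toy_generate`, `toy_check`), and the loop structure and `max_rounds` are assumptions; a real setup would call a language model and an ASP solver instead.

```python
from typing import Callable, Optional


def self_correct(
    generate: Callable[[str, Optional[str]], str],
    check: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Ask for a program, feed solver errors back, stop when the check passes."""
    feedback: Optional[str] = None
    program = ""
    for _ in range(max_rounds):
        program = generate(task, feedback)
        feedback = check(program)  # None means the checker accepted the program
        if feedback is None:
            return program, True
    return program, False


def toy_generate(task: str, feedback: Optional[str]) -> str:
    # Stand-in for a model: the first draft omits the final period,
    # and the revision after feedback adds it.
    rule = "flies(X) :- bird(X), not abnormal(X)"
    return rule + "." if feedback else rule


def toy_check(program: str) -> Optional[str]:
    # Stand-in for a solver: only checks that the rule ends with a period.
    return None if program.endswith(".") else "syntax error: rule must end with '.'"


program, ok = self_correct(toy_generate, toy_check, "birds fly unless abnormal")
```

The example rule is a default with an exception ("birds fly unless abnormal"), the kind of nonmonotonic pattern the summary refers to.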
In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks
Directly compares workflow graphs managed by external orchestrators to a single prompt that spells out the whole procedure. For travel, tech support, and claims flows, one big prompt beats complex agent tooling on quality and failure rates. If your product is more orchestration code than prompt, this paper says simplify before you scale. ([arxiv.org](https://arxiv.org/abs/2604.27891))
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
Surveys how teams use reinforcement learning plus GUI interaction to push beyond simple desktop macros into always-on "digital inhabitants". Breaks the space into offline, online, and hybrid strategies, and highlights trends like world-model training and process-level rewards. If you’re automating real GUI workflows, treat this as a roadmap, not just a survey. ([arxiv.org](https://arxiv.org/abs/2604.27955))
Synthetic Computers at Scale for Long-Horizon Productivity Simulation
Builds thousands of synthetic "computers" with realistic files and calendars to simulate month-long knowledge work for AI agents. Each run spans 8+ hours and ~2,000 steps, yielding dense signals for training long-horizon productivity agents. If you are designing office copilots or agent training curricula, copy this setup to cheaply generate rich experience data. ([arxiv.org](https://arxiv.org/abs/2604.28181))
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
Builds a massive graph of how AI methods evolve across 1M+ papers, with over 9M typed edges between techniques. Lets agents and humans trace method lineages, score idea novelty, and auto-generate new research directions. If you scout research or design AI research agents, treat this as a new data layer, not just another paper.
Rethinking Agentic Reinforcement Learning In Large Language Models
Synthesizes the fast-growing literature on reinforcement learning for agent-style language models, from environment design to safety and compute limits. Argues the key shift is treating models as long-lived decision-makers, not one-shot text generators. If you’re planning big training runs for agents, use this as a design checklist, not just a citation. ([databubble.co](https://databubble.co/news/rethinking-agentic-reinforcement-learning-in-large-language-models?utm_source=openai))
NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons
Replaces mixture-of-experts with a finer-grained mixture-of-neurons for reasoning tasks. The goal is more interpretable and steerable thinking steps.
Verbalizing LLMs' assumptions to explain and control sycophancy
Gets models to spell out their hidden assumptions before answering. That makes it easier to spot flattery-driven answers and dial them down.
LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation
Shows how to poison graph-structured knowledge used by retrieval-augmented systems. Focuses on attacks that subtly flip logical conclusions, not just surface facts.
R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning
Wraps language models in a loop of self-critique and revision for long-form writing. Focuses on deeper reasoning, not just surface polish.
PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction
Builds a plug-and-play framework for 3D medical imaging that unifies detection and risk prediction. Targets hospital workflows, not just leaderboard benchmarks.
Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Proposes a new way for models to draft and prune multiple token paths in parallel. Targets faster text generation without any additional training.
The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Studies how mixture-of-experts language models actually route work between experts. Offers tools to inspect which expert fires and why, instead of treating MoE as a black box.
BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
Reuses standard one-direction language models to build bidirectional encoders that can handle text and other signals. Bridges chat models and BERT-style encoders.
Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model
Introduces Neuro-RIT, which looks at individual neurons while instruction-tuning language models for retrieval-heavy tasks. The aim is steadier answers when retrieved documents shift or are noisy.
Reliable Control-Point Selection for Steering Reasoning in Large Language Models
Looks for the best “control points” inside a model’s thinking steps where interventions actually stick. Goal: steer reasoning without wrecking quality.
Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Uses a language model’s own feedback as a training signal for retrieval rerankers in RAG pipelines. Aims to pick more useful documents for question answering.
SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models
SPOT lets a model "pause" and think over selected spans instead of every token. It aims to keep reasoning strong while cutting compute and revealing what the model is thinking about.
MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing
Presents a framework that treats an agent swarm as a graph you can design, visualize, and debug. Makes multi-agent systems feel more like building workflows than wiring hacks.
Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation
Lets models estimate their confidence before writing full answers. That enables routing hard questions to stronger models and skipping easy ones to save money.
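A minimal sketch of the routing idea, assuming the model already returns a confidence score in [0, 1] before answering. The `route` function, the model labels, and the 0.7 threshold are hypothetical and not taken from the paper.

```python
def route(confidence: float, threshold: float = 0.7) -> str:
    """Send confident (easy) questions to the cheap model, the rest to the strong one."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return "small-model" if confidence >= threshold else "large-model"


# Hypothetical pre-answer confidence scores for two questions.
scores = {"What is 2 + 2?": 0.98, "Prove this lemma.": 0.35}
plan = {question: route(score) for question, score in scores.items()}
```

Because the score arrives before any answer is generated, the expensive model is only invoked for the questions that need it.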
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
Introduces a new image tokenizer that respects pixel order over time or space. Aims at sharper, more stable video and streaming image generation.
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Builds a single diffusion model for both understanding and generating across images and other formats. Aims to replace many task-specific generators with one backbone.
MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
Introduces a new optimization rule for training chat agents over long conversations. The goal: steadier learning and more helpful dialogue without exploding token and compute costs.
Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
Probes frozen vision backbones for tasks like measuring lengths and angles. Tests how much real-world geometry is already baked into general-purpose models.
ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning
Improves one-shot pruning so teams can shrink models aggressively with less quality loss. Directly targets cheaper deployment on GPUs and even consumer hardware.
Diffusion Language Models Are Natively Length-Aware
Argues that diffusion-style language models naturally handle short and long prompts without special tricks. Points to a promising path for huge-context text models.
Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Uses self-supervision to train flow-matching generators across data types like images and video. Seeks more stable, high-quality generation with fewer labeled examples.
RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning
Optimizes which branches a graph-of-thoughts system actually runs. Cuts redundant reasoning steps while trying to keep answer quality similar.
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Pushes vision-language models toward smaller, cheaper designs built from language-style encoders. Targets strong image+text performance while keeping running costs low.
LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Defines a benchmark that scores how models actually write answers when given retrieved documents. Helps teams compare RAG setups on answer quality, not just retrieval hit rates.
Multimodal Large Language Models as Image Classifiers
Tests multimodal large language models as plain image classifiers. Shows when they beat or trail classic vision networks.
NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices
NanoFLUX distills big image generators into much smaller models that still follow prompts well on phones. It uses smart loss functions to keep visual quality while slashing memory and compute.
DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs
This work replaces fixed block sizes in diffusion-style language models with blocks that expand or shrink based on how hard the text is. It also adds a cache that reuses past computation, cutting compute while keeping or improving generation quality.
KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs
KV-CoRE systematically measures how well key-value caches in different layers and tasks can be compressed with low-rank methods. It helps engineers know where cache compression will save memory without wrecking accuracy.
Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models
This paper trains small reasoning models with rewards that check whether each intermediate step actually follows from earlier ones. That reduces reward hacks where the model spews long but logically broken chains of thought.
DFlash: Block Diffusion for Flash Speculative Decoding
DFlash uses a small diffusion model to draft whole blocks of tokens in parallel, then lets a larger model quickly verify them. It keeps output quality while giving over 6x faster generation than standard decoding on common LLMs.
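The draft-then-verify pattern can be sketched with a generic greedy speculative decoding step. This is not DFlash: the drafter here is a toy lookup rather than a diffusion model, and verification is written as sequential calls for readability, whereas real systems check the whole block in one parallel pass. All names are hypothetical.

```python
from typing import Callable

Tokens = list[str]


def speculative_step(
    prefix: Tokens,
    draft_block: Callable[[Tokens, int], Tokens],
    target_next: Callable[[Tokens], str],
    block_size: int,
) -> Tokens:
    """Accept drafted tokens while they match the target; on mismatch, take the target's token."""
    accepted: Tokens = []
    for tok in draft_block(prefix, block_size):
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Guarantee progress even if the drafter proposed nothing.
    return accepted or [target_next(prefix)]


TARGET = ["the", "cat", "sat", "on", "the", "mat"]  # what the large model would emit
GUESS = ["the", "cat", "sat", "in", "a", "hat"]     # what the small drafter proposes


def target_next(prefix: Tokens) -> str:
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else "<eos>"


def draft_block(prefix: Tokens, k: int) -> Tokens:
    return GUESS[len(prefix):len(prefix) + k]


output: Tokens = []
steps = 0
while len(output) < len(TARGET):
    output += speculative_step(output, draft_block, target_next, block_size=3)
    steps += 1
```

The output always equals what the target alone would produce under greedy decoding; the speed-up depends on how often whole drafted blocks are accepted.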
Optimization is Not Enough: Why Problem Formulation Deserves Equal Attention
The authors argue that many "AI optimization" wins really come from how humans pose the problem, not from the math alone. They show cases where small tweaks in formulation beat heavy algorithmic tuning, especially in engineering-style tasks.
RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference
RRAttention keeps only a fraction of attention blocks by rotating which positions each head looks at in a round-robin pattern. It recovers almost full-attention accuracy while skipping about half the computation at 128K-token context lengths.
OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions
OdysseyArena tests agents on environments where they must discover hidden rules over hundreds of steps, not just follow given instructions. Even top models struggle, showing that long-run discovery and strategy remain weak points.
EuroLLM-22B: Technical Report
EuroLLM-22B is a 22B-parameter open model focused on European languages, with long-context support and a detailed training recipe. It aims to give EU labs and companies a strong regional alternative to US-centric frontier models.
Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps
Diamond Maps reframe reward alignment as learning a transport map over model outputs instead of tweaking rewards token by token. This gives smoother, more sample-efficient updates and shows strong results across safety-style alignment tasks.
Shared LoRA Subspaces for almost Strict Continual Learning
The paper shows that you can reuse a shared low-rank adapter space across many tasks instead of adding new adapters forever. That keeps performance high while holding down memory as models pick up new skills over time.
Reinforcement World Model Learning for LLM-based Agents
The authors train a world model that predicts how text-based environments change when an agent acts, using rewards that close the gap between imagined and real next states. Agents using this learned model perform better on ALFWorld and τ² Bench than agents trained only on task success signals.
Neuro-Symbolic Activation Discovery: Transferring Mathematical Structures from Physics to Ecology for Parameter-Efficient Neural Networks
Using genetic programming, the author mines custom activation functions from physics data and reuses them in ecology models. These bespoke activations match accuracy with far fewer parameters.
ARC Prize 2025: Technical Report
This report reviews the 2025 ARC-AGI-2 competition and the best-performing program-synthesis and agentic approaches. It argues that refinement loops and contamination issues now dominate progress on abstract reasoning benchmarks.
What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge
Using a NeurIPS data curation challenge, this paper shows that picking hard, aligned examples beats just adding more or more diverse data. For vision–language reasoning, curation quality matters more than dataset size.