ArXiv AI Papers
Latest artificial intelligence and machine learning research papers from ArXiv.
R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning
Wraps language models in a loop of self-critique and revision for long-form writing. Focuses on deeper reasoning, not just surface polish.
LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation
Shows how to poison graph-structured knowledge used by retrieval-augmented systems. Focuses on attacks that subtly flip logical conclusions, not just surface facts.
PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction
Builds a plug-and-play framework for 3D medical imaging that unifies detection and risk prediction. Targets hospital workflows, not just leaderboard benchmarks.
Verbalizing LLMs' assumptions to explain and control sycophancy
Gets models to spell out their hidden assumptions before answering. That makes it easier to spot flattery-driven answers and dial them down.
NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons
Replaces mixture-of-experts with a finer-grained mixture-of-neurons for reasoning tasks. The goal is more interpretable and steerable thinking steps.
Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Uses a language model’s own feedback as a training signal for retrieval rerankers in RAG pipelines. Aims to pick more useful documents for question answering.
Reliable Control-Point Selection for Steering Reasoning in Large Language Models
Looks for the best “control points” inside a model’s thinking steps where interventions actually stick. Goal: steer reasoning without wrecking quality.
The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Studies how mixture-of-experts language models actually route work between experts. Offers tools to inspect which expert fires and why, instead of treating MoE as a black box.
Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model
Introduces Neuro-RIT, which looks at individual neurons while customizing language models for retrieval-heavy tasks. The aim is steadier answers when retrieved documents shift or are noisy.
BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
Reuses standard one-direction language models to build bidirectional encoders that can handle text and other signals. Bridges chat models and BERT-style encoders.
Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Proposes a new way for models to draft and prune multiple token paths in parallel. Targets faster text generation without any additional training.
MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing
Presents a framework that treats an agent swarm as a graph you can design, visualize, and debug. Makes multi-agent systems feel more like building workflows than wiring hacks.
Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation
Lets models estimate their confidence before writing full answers. That enables routing hard questions to stronger models and skipping easy ones to save money.
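A minimal sketch of the routing idea, assuming a confidence score in [0, 1] is available before any answer is generated. The function names, the model tier labels, the 0.7 threshold, and the toy length-based estimator are hypothetical illustrations, not the paper's method.

```python
from typing import Callable


def route(question: str,
          estimate_confidence: Callable[[str], float],
          threshold: float = 0.7) -> str:
    """Pick a model tier from a confidence score computed before answering."""
    confidence = estimate_confidence(question)
    # Confident questions stay on the cheap model; the rest escalate.
    return "small-model" if confidence >= threshold else "large-model"


def toy_confidence(question: str) -> float:
    """Hypothetical stand-in estimator: pretend shorter questions are easier."""
    return 1.0 / (1.0 + len(question.split()) / 10.0)
```

In practice the estimator would be the model's own pre-answer confidence signal; only the thresholding step is shown here.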
SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models
SPOT lets a model "pause" and think over selected spans instead of every token. It aims to keep reasoning strong while cutting compute and revealing what the model is thinking about.
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
Introduces a new image tokenizer that respects pixel order over time or space. Aims at sharper, more stable video and streaming image generation.
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Builds a single diffusion model for both understanding and generating across images and other formats. Aims to replace many task-specific generators with one backbone.
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Pushes vision-language models toward smaller, cheaper designs built from language-style encoders. Targets strong image+text performance while keeping running costs low.
ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning
Improves one-shot pruning so teams can shrink models aggressively with less quality loss. Directly targets cheaper deployment on GPUs and even consumer hardware.
Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Uses self-supervision to train flow-matching generators across data types like images and video. Seeks more stable, high-quality generation with fewer labeled examples.
Diffusion Language Models Are Natively Length-Aware
Argues that diffusion-style language models naturally handle short and long prompts without special tricks. Points to a promising path for huge-context text models.
RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning
Optimizes which branches a graph-of-thoughts system actually runs. Cuts redundant reasoning steps while trying to keep answer quality similar.
MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
Introduces a new optimization rule for training chat agents over long conversations. The goal: steadier learning and more helpful dialogue without exploding token and compute costs.
Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
Probes frozen vision backbones for tasks like measuring lengths and angles. Tests how much real-world geometry is already baked into general-purpose models.
LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Defines a benchmark that scores how models actually write answers when given retrieved documents. Helps teams compare RAG setups on answer quality, not just retrieval hit rates.
Multimodal Large Language Models as Image Classifiers
Tests multimodal large language models as plain image classifiers. Shows when they beat or trail classic vision networks.
NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices
NanoFLUX distills big image generators into much smaller models that still follow prompts well on phones. It uses smart loss functions to keep visual quality while slashing memory and compute.
DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs
This work replaces fixed block sizes in diffusion-style language models with blocks that expand or shrink based on how hard the text is. It also adds a cache that reuses past computation, cutting compute while keeping or improving generation quality.
KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs
KV-CoRE systematically measures how well key-value caches in different layers and tasks can be compressed with low-rank methods. It helps engineers know where cache compression will save memory without wrecking accuracy.
Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models
This paper trains small reasoning models with rewards that check whether each intermediate step actually follows from earlier ones. That reduces reward hacks where the model spews long but logically broken chains of thought.
DFlash: Block Diffusion for Flash Speculative Decoding
DFlash uses a small diffusion model to draft whole blocks of tokens in parallel, then lets a larger model quickly verify them. It keeps output quality while giving over 6x faster generation than standard decoding on common LLMs.
Optimization is Not Enough: Why Problem Formulation Deserves Equal Attention
The authors argue that many "AI optimization" wins really come from how humans pose the problem, not from the math alone. They show cases where small tweaks in formulation beat heavy algorithmic tuning, especially in engineering-style tasks.
RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference
RRAttention keeps only a fraction of attention blocks by rotating which positions each head looks at in a round-robin pattern. It recovers almost full-attention accuracy while skipping about half the computation at 128K-token context lengths.
OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions
OdysseyArena tests agents on environments where they must discover hidden rules over hundreds of steps, not just follow given instructions. Even top models struggle, showing that long-run discovery and strategy remain weak points.
EuroLLM-22B: Technical Report
EuroLLM-22B is a 22B-parameter open model focused on European languages, with long-context support and a detailed training recipe. It aims to give EU labs and companies a strong regional alternative to US-centric frontier models.
Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps
Diamond Maps reframe reward alignment as learning a transport map over model outputs instead of tweaking rewards token by token. This gives smoother, more sample-efficient updates and shows strong results across safety-style alignment tasks.
Shared LoRA Subspaces for almost Strict Continual Learning
The paper shows that you can reuse a shared low-rank adapter space across many tasks instead of adding new adapters forever. That keeps performance high while holding down memory as models pick up new skills over time.
Reinforcement World Model Learning for LLM-based Agents
The authors train a world model that predicts how text-based environments change when an agent acts, using rewards that close the gap between imagined and real next states. Agents using this learned model perform better on ALFWorld and τ² Bench than agents trained only on task success signals.
Neuro-Symbolic Activation Discovery: Transferring Mathematical Structures from Physics to Ecology for Parameter-Efficient Neural Networks
Using genetic programming, the author mines custom activation functions from physics data and reuses them in ecology models. These bespoke activations match accuracy with far fewer parameters.
ARC Prize 2025: Technical Report
This report reviews the 2025 ARC-AGI-2 competition and the best-performing program-synthesis and agentic approaches. It argues that refinement loops and contamination issues now dominate progress on abstract reasoning benchmarks.
What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge
Using a NeurIPS data curation challenge, this paper shows that picking hard, aligned examples beats just adding more or more diverse data. For vision–language reasoning, curation quality matters more than dataset size.
Future Optical Flow Prediction Improves Robot Control & Video Generation
FOFPred trains a language-conditioned vision model to forecast dense motion fields in video. That single model then boosts both robot control and text-guided video generation in downstream tasks.
LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning
The authors use token-level uncertainty to decide when an LLM should think longer in games like tic-tac-toe. Low entropy means short context and reasoning; high entropy triggers more examples and multiple reasoning paths.
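The entropy gate described here can be sketched roughly as follows, assuming the model exposes a probability distribution over its next token or move. The 0.5 nat threshold and the budget values are hypothetical placeholders, not figures from the paper.

```python
import math
from typing import Dict, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def reasoning_budget(probs: Sequence[float],
                     threshold: float = 0.5) -> Dict[str, int]:
    """Map uncertainty to a budget of in-context examples and reasoning paths."""
    if shannon_entropy(probs) < threshold:
        # Low entropy: the model is confident, so keep context and reasoning short.
        return {"examples": 1, "paths": 1}
    # High entropy: add more examples and sample several reasoning paths.
    return {"examples": 4, "paths": 3}
```

A sharply peaked distribution falls under the threshold and gets the short budget, while a near-uniform one gets the larger budget.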
Digital Metabolism: Decoupling Logic from Facts via Regenerative Unlearning -- Towards a Pure Neural Logic Core
This paper experiments with aggressively "forgetting" facts while preserving reasoning ability in a small Qwen model. The model loses targeted knowledge yet starts to lean harder on explicit reasoning steps.
Mugi: Value Level Parallelism For Efficient LLMs
Mugi generalizes value-level parallelism hardware tricks to full LLM workloads. It speeds up core math operations and softmax, yielding over 2x throughput and big energy savings on custom chips.
ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration
ORBITFLOW dynamically decides which attention-cache layers stay on GPU per request instead of using static offloading. It cuts tail latency and raises throughput for long-context chat and agents without buying more GPUs.
CTHA: Constrained Temporal Hierarchical Architecture for Stable Multi-Agent LLM Systems
CTHA adds formal communication contracts and authority limits between fast and slow agent layers. That stabilizes multi-level agent stacks and sharply reduces cascades of bad decisions.
Reasoning Models Generate Societies of Thought
This work argues that strong reasoning models behave like small societies of internal agents with different "personalities" and expertise. Diversity and internal debate, not just longer chains of thought, drive their gains.
DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference
DialDefer shows LLM judges treat identical claims differently depending on whether a human or "anonymous text" said them. The authors propose metrics and mitigation for this hidden bias toward agreeing with people.
Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration
The authors release LMEE-Bench to test how agents explore and remember in long-horizon 3D tasks. Their MemoryExplorer method trains a vision-language model with reinforcement learning to actively query and use episodic memory.
BYOL: Bring Your Own Language Into LLMs
BYOL lays out a playbook to lift extremely low-resource languages into modern LLMs. It mixes corpus cleaning, synthetic data, extra training, and translation to build strong models for languages with tiny digital footprints.