ArXiv AI Papers

Latest artificial intelligence and machine learning research papers from ArXiv.

R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

Wraps language models in a loop of self-critique and revision for long-form writing. Focuses on deeper reasoning, not just surface polish.

Wanlong Liu, Bo Zhang

LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation

Shows how to poison graph-structured knowledge used by retrieval-augmented systems. Focuses on attacks that subtly flip logical conclusions, not just surface facts.

Yilin Xiao, Jin Chen

PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

Builds a plug-and-play framework for 3D medical imaging that unifies detection and risk prediction. Targets hospital workflows, not just leaderboard benchmarks.

Daniel C. MacRae, Luuk van der Hoek

Verbalizing LLMs' assumptions to explain and control sycophancy

Gets models to spell out their hidden assumptions before answering. That makes it easier to spot flattery-driven answers and dial them down.

Myra Cheng, Isabel Sieh

NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons

Replaces mixture-of-experts with a finer-grained mixture-of-neurons for reasoning tasks. The goal is more interpretable and steerable thinking steps.

Haonan Dong, Kehan Jiang

Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

Uses a language model’s own feedback as a training signal for retrieval rerankers in RAG pipelines. Aims to pick more useful documents for question answering.

Yuhang Wu, Xiangqing Shen

Reliable Control-Point Selection for Steering Reasoning in Large Language Models

Looks for the best “control points” inside a model’s thinking steps where interventions actually stick. Goal: steer reasoning without wrecking quality.

Haomin Zhuang, Hojun Yoo

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

Studies how mixture-of-experts language models actually route work between experts. Offers tools to inspect which expert fires and why, instead of treating MoE as a black box.

Jeremy Herbst, Jae Hee Lee

Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model

Introduces Neuro-RIT, which uses neuron-level signals to guide instruction tuning of language models for retrieval-heavy tasks. The aim is steadier answers when retrieved documents shift or are noisy.

Jaemin Kim, Jae O Lee

BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs

Adapts and composes standard causal (left-to-right) language models into bidirectional encoders that can handle text and other modalities. Bridges chat models and BERT-style encoders.

Nicolas Boizard, Théo Deschamps-Berger

Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

Proposes a new way for models to draft and prune multiple token paths in parallel. Targets faster text generation without any additional training.

Tao Jin, Phuong Minh Nguyen

MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

Presents a framework that treats an agent swarm as a graph you can design, visualize, and debug. Makes multi-agent systems feel more like building workflows than wiring hacks.

Yang Liu, Jinxuan Cai

Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

Lets models estimate their confidence before writing full answers. That enables routing hard questions to stronger models and skipping easy ones to save money.

Changcheng Li, Jiancan Wu
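
The routing idea in the entry above can be illustrated with a minimal sketch. This is not the paper's method: the `estimate_confidence` callable, the `0.7` threshold, and the model names are all hypothetical placeholders, and the sketch only shows how a confidence score obtained before answering could decide which model handles a question.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    model: str         # name of the model chosen to answer (hypothetical names)
    confidence: float  # pre-answer confidence that drove the decision


def route_by_confidence(
    question: str,
    estimate_confidence: Callable[[str], float],
    threshold: float = 0.7,            # assumed cut-off, not from the paper
    small_model: str = "small-model",  # hypothetical cheap model
    large_model: str = "large-model",  # hypothetical stronger model
) -> Route:
    """Pick a model using a confidence score computed before any answer is written."""
    confidence = estimate_confidence(question)
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # High confidence: the cheap model keeps the question.
    # Low confidence: escalate to the stronger model.
    model = small_model if confidence >= threshold else large_model
    return Route(model=model, confidence=confidence)


if __name__ == "__main__":
    easy = route_by_confidence("What is 2 + 2?", lambda q: 0.95)
    hard = route_by_confidence("Prove the claim for all n.", lambda q: 0.20)
    print(easy.model, hard.model)
```

In practice the estimator would be the model's own pre-answer confidence signal; a fixed lambda stands in for it here so the routing logic stays visible.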

SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

SPOT lets a model "pause" and think over selected spans instead of every token. It aims to keep reasoning strong while cutting compute and revealing what the model is thinking about.

Yunlong Chu, Minglai Shao

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Introduces a new image tokenizer that respects pixel order over time or space. Aims at sharper, more stable video and streaming image generation.

Yitong Chen, Zuxuan Wu

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Builds a single diffusion model for both understanding and generating across images and other formats. Aims to replace many task-specific generators with one backbone.

Lijiang Li, Zuwei Long

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Pushes vision-language models toward smaller, cheaper designs built from language-style encoders. Targets strong image+text performance while keeping running costs low.

Boqiang Zhang, Lei Ke

ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

Improves one-shot pruning so teams can shrink models aggressively with less quality loss. Directly targets cheaper deployment on GPUs and even consumer hardware.

Mingluo Su, Huan Wang

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

Uses self-supervision to train flow-matching generators across data types like images and video. Seeks more stable, high-quality generation with fewer labeled examples.

Hila Chefer, Patrick Esser

Diffusion Language Models Are Natively Length-Aware

Argues that diffusion-style language models naturally handle short and long prompts without special tricks. Points to a promising path for huge-context text models.

Vittorio Rossi, Giacomo Cirò

RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning

Optimizes which branches a graph-of-thoughts system actually runs. Cuts redundant reasoning steps while trying to keep answer quality similar.

Yuhang Liu, Ruijie Wang

MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

Introduces a new optimization rule for training chat agents over long conversations. The goal: steadier learning and more helpful dialogue without exploding token and compute costs.

Naifan Zhang, Ruihan Sun

Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

Probes frozen vision backbones for tasks like measuring lengths and angles. Tests how much real-world geometry is already baked into general-purpose models.

Yakov Pyotr Shkolnikov

LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

Defines a benchmark that scores how models actually write answers when given retrieved documents. Helps teams compare RAG setups on answer quality, not just retrieval hit rates.

Koki Itai, Shunichi Hasegawa

Multimodal Large Language Models as Image Classifiers

Tests multimodal large language models as plain image classifiers. Shows when they beat or trail classic vision networks.

Nikita Kisel, Illia Volkov

NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices

NanoFLUX distills big image generators into much smaller models that still follow prompts well on phones. It uses smart loss functions to keep visual quality while slashing memory and compute.

Ruchika Chavhan, Malcolm Chadwick

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

This work replaces fixed block sizes in diffusion-style language models with blocks that expand or shrink based on how hard the text is. It also adds a cache that reuses past computation, cutting compute while keeping or improving generation quality.

Lizhuo Luo, Shenggui Li

KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs

KV-CoRE systematically measures how well key-value caches in different layers and tasks can be compressed with low-rank methods. It helps engineers know where cache compression will save memory without wrecking accuracy.

Jian Chen, Zhuoran Wang

Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

This paper trains small reasoning models with rewards that check whether each intermediate step actually follows from earlier ones. That reduces reward hacks where the model spews long but logically broken chains of thought.

Shuo Nie, Hexuan Deng

DFlash: Block Diffusion for Flash Speculative Decoding

DFlash uses a small diffusion model to draft whole blocks of tokens in parallel, then lets a larger model quickly verify them. It keeps output quality while giving over 6x faster generation than standard decoding on common LLMs.

Jian Chen, Yesheng Liang
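
The draft-then-verify pattern described in the DFlash entry can be sketched with a toy greedy verifier. This is a generic speculative-decoding illustration under stated assumptions, not DFlash itself: the drafter is any source of a token block (no diffusion model is involved), `target_next` is a hypothetical callable returning the larger model's greedy next token, and verification runs position by position for clarity, whereas real systems check the whole block in one parallel forward pass.

```python
from typing import Callable, List


def speculative_step(
    prefix: List[int],
    draft_block: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens accepted from one drafted block.

    Drafted tokens are kept while they match the target model's greedy
    choice. At the first mismatch the target's own token is substituted
    and the rest of the draft is discarded.
    """
    accepted: List[int] = []
    for token in draft_block:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # correction from the target model
            return accepted
        accepted.append(token)
    return accepted


if __name__ == "__main__":
    # Toy target model: the next token always equals the sequence length.
    target = lambda seq: len(seq)
    print(speculative_step([0, 1, 2], [3, 4, 9, 6], target))
```

Every accepted draft token saves one sequential step of the large model, which is where the speed-up comes from when the drafter is usually right.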

Optimization is Not Enough: Why Problem Formulation Deserves Equal Attention

The authors argue that many "AI optimization" wins really come from how humans pose the problem, not from the math alone. They show cases where small tweaks in formulation beat heavy algorithmic tuning, especially in engineering-style tasks.

Iván Olarte Rodríguez, Gokhan Serhat

RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference

RRAttention keeps only a fraction of attention blocks by rotating which positions each head looks at in a round-robin pattern. It recovers almost full-attention accuracy while skipping about half the computation at 128K-token context lengths.

Siran Liu, Guoxia Wang

OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

OdysseyArena tests agents on environments where they must discover hidden rules over hundreds of steps, not just follow given instructions. Even top models struggle, showing that long-run discovery and strategy remain weak points.

Fangzhi Xu, Hang Yan

EuroLLM-22B: Technical Report

EuroLLM-22B is a 22B-parameter open model focused on European languages, with long-context support and a detailed training recipe. It aims to give EU labs and companies a strong regional alternative to US-centric frontier models.

Miguel Moura Ramos, Duarte M. Alves

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

Diamond Maps reframe reward alignment as learning a transport map over model outputs instead of tweaking rewards token by token. This gives smoother, more sample-efficient updates and shows strong results across safety-style alignment tasks.

Peter Holderrieth, Douglas Chen

Shared LoRA Subspaces for almost Strict Continual Learning

The paper shows that you can reuse a shared low-rank adapter space across many tasks instead of adding new adapters forever. That keeps performance high while holding down memory as models pick up new skills over time.

Prakhar Kaushik, Ankit Vaidya

Reinforcement World Model Learning for LLM-based Agents

The authors train a world model that predicts how text-based environments change when an agent acts, using rewards that close the gap between imagined and real next states. Agents using this learned model perform better on ALFWorld and τ² Bench than agents trained only on task success signals.

Xiao Yu, Baolin Peng

Neuro-Symbolic Activation Discovery: Transferring Mathematical Structures from Physics to Ecology for Parameter-Efficient Neural Networks

Using genetic programming, the author mines custom activation functions from physics data and reuses them in ecology models. These bespoke activations match accuracy with far fewer parameters.

Anas Hajbi

ARC Prize 2025: Technical Report

This report reviews the 2025 ARC-AGI-2 competition and the best-performing program-synthesis and agentic approaches. It argues that refinement loops and contamination issues now dominate progress on abstract reasoning benchmarks.

François Chollet, Mike Knoop

What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge

Using a NeurIPS data curation challenge, this paper shows that picking hard, aligned examples beats just adding more or more diverse data. For vision–language reasoning, curation quality matters more than dataset size.

Yosub Shin, Michael Buriek

Future Optical Flow Prediction Improves Robot Control & Video Generation

FOFPred trains a language-conditioned vision model to forecast dense motion fields in video. That single model then boosts both robot control and text-guided video generation in downstream tasks.

Kanchana Ranasinghe, Honglu Zhou

LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning

The authors use token-level uncertainty to decide when an LLM should think longer in games like tic-tac-toe. Low entropy means short context and reasoning; high entropy triggers more examples and multiple reasoning paths.

Tommaso Felice Banfi, Sashenka Gamage
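
The entropy-gated behavior described in the entry above can be sketched as follows. This is only an illustration of the general idea: the 1-bit threshold, the example counts, and the `choose_budget` helper are hypothetical and do not come from the paper.

```python
import math
from typing import Dict, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def choose_budget(probs: Sequence[float], threshold_bits: float = 1.0) -> Dict[str, int]:
    """Map token-level uncertainty to a reasoning budget (assumed values)."""
    if shannon_entropy(probs) < threshold_bits:
        # Confident prediction: no extra examples, a single reasoning path.
        return {"examples": 0, "reasoning_paths": 1}
    # Uncertain prediction: add in-context examples and sample several paths.
    return {"examples": 4, "reasoning_paths": 3}


if __name__ == "__main__":
    print(choose_budget([0.97, 0.01, 0.01, 0.01]))  # peaked distribution
    print(choose_budget([0.25, 0.25, 0.25, 0.25]))  # uniform distribution
```

A peaked distribution falls below the threshold and gets the short budget, while a uniform distribution over four tokens has 2 bits of entropy and triggers the larger one.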

Digital Metabolism: Decoupling Logic from Facts via Regenerative Unlearning -- Towards a Pure Neural Logic Core

This paper experiments with aggressively "forgetting" facts while preserving reasoning ability in a small Qwen model. The model loses targeted knowledge yet starts to lean harder on explicit reasoning steps.

Mengmeng Peng, Zhenyu Fang

Mugi: Value Level Parallelism For Efficient LLMs

Mugi generalizes value-level parallelism hardware tricks to full LLM workloads. It speeds up core math operations and softmax, yielding over 2x throughput and big energy savings on custom chips.

Daniel Price, Prabhu Vellaisamy

ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

ORBITFLOW dynamically decides which attention-cache layers stay on GPU per request instead of using static offloading. It cuts tail latency and raises throughput for long-context chat and agents without buying more GPUs.

Xinyue Ma, Heelim Hong

CTHA: Constrained Temporal Hierarchical Architecture for Stable Multi-Agent LLM Systems

CTHA adds formal communication contracts and authority limits between fast and slow agent layers. That stabilizes multi-level agent stacks and sharply reduces cascades of bad decisions.

Percy Jardine

Reasoning Models Generate Societies of Thought

This work argues that strong reasoning models behave like small societies of internal agents with different "personalities" and expertise. Diversity and internal debate, not just longer chains of thought, drive their gains.

Junsol Kim, Shiyang Lai

DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

DialDefer shows that LLM judges treat identical claims differently depending on whether a human or "anonymous text" said them. The authors propose metrics and mitigations for this hidden bias toward agreeing with people.

Parisa Rabbani, Priyam Sahoo

Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration

The authors release LMEE-Bench to test how agents explore and remember in long-horizon 3D tasks. Their MemoryExplorer method trains a vision-language model with reinforcement learning to actively query and use episodic memory.

Sen Wang, Bangwei Liu

BYOL: Bring Your Own Language Into LLMs

BYOL lays out a playbook to lift extremely low-resource languages into modern LLMs. It mixes corpus cleaning, synthetic data, extra training, and translation to build strong models for languages with tiny digital footprints.

Syed Waqas Zamir, Wassim Hamidouche