
ArXiv AI Papers

Latest artificial intelligence and machine learning research papers from ArXiv.


NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices

NanoFLUX distills big image generators into much smaller models that still follow prompts well on phones. It uses smart loss functions to keep visual quality while slashing memory and compute.

Ruchika Chavhan, Malcolm Chadwick

OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

OdysseyArena tests agents on environments where they must discover hidden rules over hundreds of steps, not just follow given instructions. Even top models struggle, showing that long-run discovery and strategy remain weak points.

Fangzhi Xu, Hang Yan

Optimization is Not Enough: Why Problem Formulation Deserves Equal Attention

The authors argue that many "AI optimization" wins really come from how humans pose the problem, not from the math alone. They show cases where small tweaks in formulation beat heavy algorithmic tuning, especially in engineering-style tasks.

Iván Olarte Rodríguez, Gokhan Serhat

Shared LoRA Subspaces for almost Strict Continual Learning

The paper shows that you can reuse a shared low-rank adapter space across many tasks instead of adding new adapters forever. That keeps performance high while holding down memory as models pick up new skills over time.

Prakhar Kaushik, Ankit Vaidya

DFlash: Block Diffusion for Flash Speculative Decoding

DFlash uses a small diffusion model to draft whole blocks of tokens in parallel, then lets a larger model quickly verify them. It keeps output quality while giving over 6x faster generation than standard decoding on common LLMs.

Jian Chen, Yesheng Liang
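
The draft-then-verify loop described for DFlash can be illustrated with a minimal sketch of generic greedy speculative verification. This is not the paper's implementation: `verify_block` and `toy_target_next` are hypothetical names, the "target model" is a toy rule, and the drafted block is simply given as a list rather than produced by a diffusion model. A real system would score all drafted positions in one parallel forward pass; the loop here is sequential only for readability.

```python
from typing import Callable, List


def verify_block(
    prefix: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy verification of a drafted block of tokens.

    Drafted tokens are accepted while they match the target model's own
    greedy choice. At the first mismatch the target's token is emitted
    instead and the rest of the draft is discarded.
    """
    context = list(prefix)
    accepted: List[int] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            # Mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    # Whole draft accepted: the target still contributes one extra token.
    accepted.append(target_next(context))
    return accepted


def toy_target_next(context: List[int]) -> int:
    """Hypothetical stand-in for a large model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    # The first two drafted tokens match the target; the third does not.
    print(verify_block([1, 2], [3, 4, 9, 9], toy_target_next))  # [3, 4, 5]
```

Under greedy verification the output is the same as decoding with the target alone, so any speedup comes from how many drafted tokens are accepted per verification step.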

KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs

KV-CoRE systematically measures how well key-value caches in different layers and tasks can be compressed with low-rank methods. It helps engineers know where cache compression will save memory without wrecking accuracy.

Jian Chen, Zhuoran Wang

Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

This paper trains small reasoning models with rewards that check whether each intermediate step actually follows from earlier ones. That reduces reward hacks where the model spews long but logically broken chains of thought.

Shuo Nie, Hexuan Deng

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

Diamond Maps reframe reward alignment as learning a transport map over model outputs instead of tweaking rewards token by token. This gives smoother, more sample-efficient updates and shows strong results across safety-style alignment tasks.

Peter Holderrieth, Douglas Chen

EuroLLM-22B: Technical Report

EuroLLM-22B is a 22B-parameter open model focused on European languages, with long-context support and a detailed training recipe. It aims to give EU labs and companies a strong regional alternative to US-centric frontier models.

Miguel Moura Ramos, Duarte M. Alves

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

This work replaces fixed block sizes in diffusion-style language models with blocks that expand or shrink based on how hard the text is. It also adds a cache that reuses past computation, cutting compute while keeping or improving generation quality.

Lizhuo Luo, Shenggui Li

RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference

RRAttention keeps only a fraction of attention blocks by rotating which positions each head looks at in a round-robin pattern. It recovers almost full-attention accuracy while skipping about half the computation at 128K-token context lengths.

Siran Liu, Guoxia Wang

Reinforcement World Model Learning for LLM-based Agents

The authors train a world model that predicts how text-based environments change when an agent acts, using rewards that close the gap between imagined and real next states. Agents using this learned model perform better on ALFWorld and τ² Bench than agents trained only on task success signals.

Xiao Yu, Baolin Peng

Neuro-Symbolic Activation Discovery: Transferring Mathematical Structures from Physics to Ecology for Parameter-Efficient Neural Networks

Using genetic programming, the author mines custom activation functions from physics data and reuses them in ecology models. These bespoke activations match accuracy with far fewer parameters.

Anas Hajbi

ARC Prize 2025: Technical Report

This report reviews the 2025 ARC-AGI-2 competition and the best-performing program-synthesis and agentic approaches. It argues that refinement loops and contamination issues now dominate progress on abstract reasoning benchmarks.

François Chollet, Mike Knoop

Future Optical Flow Prediction Improves Robot Control & Video Generation

FOFPred trains a language-conditioned vision model to forecast dense motion fields in video. That single model then boosts both robot control and text-guided video generation in downstream tasks.

Kanchana Ranasinghe, Honglu Zhou

DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

DialDefer shows LLM judges treat identical claims differently depending on whether a human or "anonymous text" said them. The authors propose metrics and mitigation for this hidden bias toward agreeing with people.

Parisa Rabbani, Priyam Sahoo

What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge

Using a NeurIPS data curation challenge, this paper shows that picking hard, aligned examples beats just adding more or more diverse data. For vision–language reasoning, curation quality matters more than dataset size.

Yosub Shin, Michael Buriek

Digital Metabolism: Decoupling Logic from Facts via Regenerative Unlearning -- Towards a Pure Neural Logic Core

This paper experiments with aggressively "forgetting" facts while preserving reasoning ability in a small Qwen model. The model loses targeted knowledge yet starts to lean harder on explicit reasoning steps.

Mengmeng Peng, Zhenyu Fang

ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

ORBITFLOW dynamically decides which attention-cache layers stay on GPU per request instead of using static offloading. It cuts tail latency and raises throughput for long-context chat and agents without buying more GPUs.

Xinyue Ma, Heelim Hong

Mugi: Value Level Parallelism For Efficient LLMs

Mugi generalizes value-level parallelism hardware tricks to full LLM workloads. It speeds up core math operations and softmax, yielding over 2x throughput and big energy savings on custom chips.

Daniel Price, Prabhu Vellaisamy

CTHA: Constrained Temporal Hierarchical Architecture for Stable Multi-Agent LLM Systems

CTHA adds formal communication contracts and authority limits between fast and slow agent layers. That stabilizes multi-level agent stacks and sharply reduces cascades of bad decisions.

Percy Jardine

Reasoning Models Generate Societies of Thought

This work argues that strong reasoning models behave like small societies of internal agents with different "personalities" and expertise. Diversity and internal debate, not just longer chains of thought, drive their gains.

Junsol Kim, Shiyang Lai

LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning

The authors use token-level uncertainty to decide when an LLM should think longer in games like tic-tac-toe. Low entropy keeps the context and reasoning short; high entropy triggers more examples and multiple reasoning paths.

Tommaso Felice Banfi, Sashenka Gamage
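
The entropy gate in the summary above can be sketched in a few lines. This is an illustration only: the threshold of 1 bit, the budget numbers, and the names `shannon_entropy` and `choose_budget` are assumptions made for the example, not values or functions from the paper.

```python
import math
from typing import Dict, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def choose_budget(probs: Sequence[float], threshold_bits: float = 1.0) -> Dict[str, int]:
    """Pick a (hypothetical) reasoning budget from token-level uncertainty.

    Below the threshold the model is treated as confident and gets a short
    prompt with a single reasoning path; otherwise it gets more in-context
    examples and several reasoning paths.
    """
    if shannon_entropy(probs) < threshold_bits:
        return {"examples": 1, "reasoning_paths": 1}
    return {"examples": 4, "reasoning_paths": 3}


if __name__ == "__main__":
    confident = [0.97, 0.01, 0.01, 0.01]   # one move clearly preferred
    uncertain = [0.25, 0.25, 0.25, 0.25]   # four moves equally likely
    print(choose_budget(confident))
    print(choose_budget(uncertain))
```

A uniform distribution over four moves has exactly 2 bits of entropy, so it lands on the larger budget, while the peaked distribution stays on the small one.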

Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration

The authors release LMEE-Bench to test how agents explore and remember in long-horizon 3D tasks. Their MemoryExplorer method trains a vision-language model with reinforcement learning to actively query and use episodic memory.

Sen Wang, Bangwei Liu

BYOL: Bring Your Own Language Into LLMs

BYOL lays out a playbook to lift extremely low-resource languages into modern LLMs. It mixes corpus cleaning, synthetic data, extra training, and translation to build strong models for languages with tiny digital footprints.

Syed Waqas Zamir, Wassim Hamidouche

Do You Trust Me? Cognitive-Affective Signatures of Trustworthiness in Large Language Models

The authors probe how models like Llama 3.1, Qwen 2.5, and Mistral internally represent human trust signals in text. They show specific attention heads reliably track fairness, certainty, and accountability cues, which you can exploit to design more trustworthy systems.

Gerard Yeo, Svetlana Churina

Bi-Orthogonal Factor Decomposition for Vision Transformers

The authors dissect attention in vision transformers into content and position factors using ANOVA and SVD. They show heads specialize into different interaction types and explain why self-supervised models like DINOv2 use attention differently from supervised ones.

Fenil R. Doshi, Thomas Fel

STResNet & STYOLO: A New Family of Compact Classification and Object Detection Models for MCUs

STResNet and STYOLO target microcontrollers, hitting competitive ImageNet and COCO scores with just a few million parameters. If you care about on-device vision, these architectures offer stronger accuracy–latency tradeoffs than classic MobileNet-style baselines.

Sudhakar Sah, Ravish Kumar

Naiad: Novel Agentic Intelligent Autonomous System for Inland Water Monitoring

Naiad chains an AI agent with weather data, satellite imagery, and domain tools to monitor lakes and rivers end to end. It lets non-experts ask plain-language questions and get tailored environmental reports, showing how agent stacks can tackle real infrastructure problems.

Eirini Baltzi, Tilemachos Moumouris

The Persona Paradox: Medical Personas as Behavioral Priors in Clinical Language Models

This paper shows that giving models medical “doctor” personas helps on some emergency tasks but can hurt performance in primary care settings. Teams using personas for safety or expertise should test them task by task instead of assuming they always help.

Tassallah Abdullahi, Shrestha Ghosh

Over-Searching in Search-Augmented Large Language Models

This work shows that search-augmented models often call tools even when search hurts answers or wastes tokens. It introduces a cost-aware metric and mitigation tricks, so teams can dial back needless retrieval instead of just adding more context.

Roy Xie, Deepak Gopinath

ART: Adaptive Reasoning Trees for Explainable Claim Verification

ART makes models verify claims by building explicit argument trees instead of spitting out one opaque chain of thought. That structure lets a judge model compare supporting and attacking evidence, making fact-checking more transparent and easier to audit.

Sahil Wadhwa, Himanshu Kumar

WildSci: Advancing Scientific Reasoning from In-the-Wild Literature

WildSci builds a large question set from real scientific papers across many fields, then uses reinforcement learning to sharpen models’ scientific reasoning. It moves science QA beyond toy benchmarks and gives labs a more realistic way to stress-test research assistants.

Tengxiao Liu, Deepak Nathani

Crisis-Bench: Benchmarking Strategic Ambiguity and Reputation Management in Large Language Models

Crisis-Bench drops models into simulated multi-day corporate crises and scores them on stock-price outcomes and public sentiment. It exposes when models act like blunt truth-tellers versus savvy spokespeople, giving companies a way to test PR-style agent behavior before deployment.

Cooper Lin, Maohao Ran

MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs

MoEBlaze redesigns mixture-of-experts training to cut activation memory and data movement on GPUs. It claims over 4× speedups and 50% memory savings versus existing frameworks, which directly matters for anyone pushing bigger sparse models.

Jiyuan Zhang, Yining Liu

Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

This paper teaches vision-language agents to actively consult maps while guessing where a photo was taken. With reinforcement learning and parallel search paths, their system beats even strong commercial baselines on real-world geolocation benchmarks.

Yuxiang Ji, Yong Wang

TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning

TIME teaches dialogue models to drop short "thinking" blocks only when time gaps or context shifts actually demand deeper reasoning. Models keep answers compact while still reasoning hard when conversations get tricky or span days instead of seconds.

Susmit Das

Effects of personality steering on cooperative behavior in Large Language Model agents

The authors test how adding human-like personality traits changes how AI agents cooperate in repeated Prisoner’s Dilemma games. They find agreeableness boosts cooperation but can also make agents easier to exploit, warning that persona dials act as soft biases, not hard controls.

Mizuki Sakai, Mizuki Yokoyama

Conformity and Social Impact on AI Agents

Researchers adapt classic social-psychology experiments to AI agents and find they also conform to group pressure. Even strong models can be pushed into wrong answers by coordinated peers, which raises real worries for multi-agent deployments and information ecosystems.

Alessandro Bellina, Giordano De Marzo

MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization

MaxCode treats code optimization as a reinforcement learning search over code edits guided by runtime feedback. It uses natural-language critiques and a reward model to steer generation, beating past systems at speeding up CUDA and C++ kernels.

Jiefu Ou, Sapana Chaudhary

Hi-ZFO: Hierarchical Zeroth- and First-Order LLM Fine-Tuning via Importance-Guided Tensor Selection

Hi-ZFO mixes gradient-based updates on important layers with gradient-free noise on the rest to escape bad minima. It aims to get better customized models with less compute and more stable training than pure gradient methods.

Feihu Jin, Ying Tan
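
The idea of mixing first-order and zeroth-order updates, as summarized for Hi-ZFO above, can be shown on a toy problem. The sketch below is not the paper's algorithm: the quadratic loss, the fixed `important` index set, the learning rate, and the two-point random-sign estimator (in the style of SPSA) are all assumptions chosen to keep the example small. Coordinates marked important get an exact gradient step; the rest are updated from two loss evaluations with no gradient at all.

```python
import random
from typing import List, Set, Tuple


def loss(w: List[float], target: List[float]) -> float:
    """Toy quadratic loss; stands in for a fine-tuning objective."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def hybrid_step(
    w: List[float],
    target: List[float],
    important: Set[int],
    lr: float = 0.1,
    eps: float = 1e-3,
    rng=random,
) -> List[float]:
    """One update: exact gradients on `important`, two-point estimate elsewhere."""
    rest = [i for i in range(len(w)) if i not in important]

    # First-order part: analytic gradient of the quadratic loss.
    grad = {i: 2.0 * (w[i] - target[i]) for i in important}

    # Zeroth-order part: perturb the remaining coordinates with random signs
    # and estimate the directional derivative from two loss evaluations.
    z = {i: rng.choice((-1.0, 1.0)) for i in rest}
    w_plus = [wi + eps * z.get(i, 0.0) for i, wi in enumerate(w)]
    w_minus = [wi - eps * z.get(i, 0.0) for i, wi in enumerate(w)]
    scale = (loss(w_plus, target) - loss(w_minus, target)) / (2.0 * eps)

    new_w = list(w)
    for i in important:
        new_w[i] -= lr * grad[i]
    for i in rest:
        new_w[i] -= lr * scale * z[i]
    return new_w


def run(steps: int = 50, seed: int = 0) -> Tuple[float, float]:
    """Return (initial loss, final loss) on a 4-parameter toy problem."""
    rng = random.Random(seed)
    target = [1.0, -2.0, 3.0, 0.5]
    w = [0.0, 0.0, 0.0, 0.0]
    initial = loss(w, target)
    for _ in range(steps):
        w = hybrid_step(w, target, important={0, 2}, rng=rng)
    return initial, loss(w, target)


if __name__ == "__main__":
    before, after = run()
    print(f"loss before: {before:.3f}, after: {after:.3f}")
```

The zeroth-order branch needs only forward evaluations of the loss, which is why this family of methods is attractive when storing gradients for every tensor is too expensive.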

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Avatar Forcing combines a diffusion-based generator with a preference-optimization trick to drive talking-head avatars in real time. It reacts to a user’s speech and body motion with low latency, producing more expressive, conversational faces without needing labeled interaction data.

Taekyung Ki, Sangwon Jang

Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

HGMem turns the “scratchpad” of a multi-step retrieval system into a hypergraph that connects many related facts at once. This richer memory structure helps language models keep global context straight over long tasks, boosting performance on challenging reasoning and long-document benchmarks.

Chulun Zhou, Chunkang Zhang

Deep Delta Learning

The authors replace standard residual skip connections with a learnable "Delta" operator that can flexibly distort the identity path. This lets deep nets control how much old information to erase versus new information to write, improving how they model complex dynamics while keeping training stable.

Yifan Zhang, Yifeng Liu

Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

A "reasoner" model and a "discriminator" model train together so the discriminator flags wrong steps in math solutions, not just wrong final answers. This joint training gives dense step-level rewards and boosts math benchmark scores for existing open models like DeepSeek-R1 distills without huge extra compute. ([ar5iv.org](https://ar5iv.org/abs/2512.16917))

Qihao Liu, Luoxin Ye

Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

The authors find sparse "circuits" inside language models that drive math reasoning and selectively strengthen only those pieces. They report up to 11.4% accuracy gains while touching about 1.6% of model components, keeping other skills like MMLU almost unchanged. ([ar5iv.org](https://ar5iv.org/abs/2512.16914))

Nikhil Prakash, Donghao Ren

Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

This paper dissects why "learning from verifiable rewards" can improve math reasoning even when rewards look noisy or misleading. It shows how clipping and reward noise reduce the model’s randomness in useful ways and offers principles for designing better reasoning-focused training runs. ([ar5iv.org](https://ar5iv.org/abs/2512.16912))

Peter Chen, Xiaopeng Li

How Good is Post-Hoc Watermarking With Language Model Rephrasing?

The authors study adding watermarks after the fact by having a model rewrite existing text while embedding tracking signals. They map how beam search, sampling tricks, and model size trade off between detection strength and text quality, and show watermarks work better on prose than on verifiable code. ([ar5iv.org](https://ar5iv.org/abs/2512.16904))

Pierre Fernandez, Tom Sander

SFTok: Bridging the Performance Gap in Discrete Tokenizers

SFTok narrows the quality gap between discrete and continuous image tokenizers using multi-step reconstruction tricks. That matters if you want autoregressive image or video generators that scale.

Qihang Rao, Borui Zhang

SceneDiff: A Benchmark and Method for Multiview Object Change Detection

SceneDiff gives a new benchmark and a strong baseline for detecting object changes across views and time. It is useful for robots that must notice what actually moved, not just viewpoint shifts.

Yuqun Wu, Chih-hao Lin