LLM

Research papers, repositories, and articles about LLMs

OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.

Zijian Wu, Lingkai Kong

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system achieves new SOTA on a held-out benchmark while reducing annotation cost.

Songyang Gao, Yuzhe Gu

T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

T-pro 2.0 is an open-weight Russian large language model focused on hybrid reasoning: it can answer directly or emit explicit reasoning traces, and it’s optimized for low-latency inference via speculative decoding. Alongside the model, the authors release a Russian instruction corpus, a math benchmark, and an EAGLE-based inference stack, making it a practical foundation for Russian-language reasoning applications.

Dmitrii Stoianov, Danil Taranets

huggingface/transformers

The standard library for state-of-the-art models in text, vision, audio, and multimodal settings. If you build with open models, you almost certainly depend on this already.

156,240

STEP3-VL-10B Technical Report

STEP3-VL-10B is a 10B-parameter vision–language model that rivals much larger systems by combining unified pretraining with heavy post-training and parallel coordinated reasoning at run time. Use it as a strong open baseline for high-end multimodal tasks without giant hardware.

Ailin Huang, Chengyuan Yao

FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

FlashPrefill discovers sparse attention patterns during the prefill phase and drops low-importance connections on the fly. It reports huge speedups on 256K-token contexts while still matching baseline accuracy.

Qihang Fan, Huaibo Huang

DFlash: Block Diffusion for Flash Speculative Decoding

DFlash pairs a small diffusion model with a large LLM: the small model drafts text in multi-token blocks and the large model verifies them. It’s currently one of the highest-upvoted speedup methods on Hugging Face.

Jian Chen, Yesheng Liang

Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

ML-Master 2.0 introduces a "hierarchical cognitive cache" that separates short-term logs from long-term strategy for AI agents working for days on ML engineering tasks. It hits state-of-the-art on MLE-Bench, hinting at how to run week-long research agents.

Xinyu Zhu, Yuzhu Cai

Memory in the Age of AI Agents

A substantial survey that systematizes the fast-growing literature on ‘agent memory’—how agentic LLM systems store, retrieve, and evolve information over time. It proposes a taxonomy across forms (token, parametric, latent), functions (factual, experiential, working) and dynamics, and catalogs existing benchmarks and frameworks. If you’re building agent systems with nontrivial memory, this is quickly becoming the reference map of the territory.

Yuyang Hu, Shichun Liu

exo-explore/exo

Exo turns a pile of Macs or PCs into one AI cluster so you can run huge models at home. It auto-discovers devices, shards models across them, and uses high-speed links like Thunderbolt to get near data-center performance. ([github.com](https://github.com/trending))

35,600

openai/codex

A lightweight coding agent that runs directly in your terminal, wiring OpenAI models into a loop that edits files, runs tests, and applies patches. Compared to IDE plugins, it’s closer to a shell-native ‘pair programmer’ that can operate on entire repos and workflows. Given its rapid adoption and tight integration with existing CLIs, it’s poised to become a reference design for terminal-first code agents.

54,000

microsoft/VibeVoice

Open-source frontier voice model stack from Microsoft. Aims at natural, low-latency speech AI that builders can inspect and extend.

36,444

DFlash: Block Diffusion for Flash Speculative Decoding

DFlash uses a small diffusion model to draft whole blocks of tokens in parallel, then lets a larger model quickly verify them. It keeps output quality while giving over 6x faster generation than standard decoding on common LLMs.

Jian Chen, Yesheng Liang
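
The draft-then-verify loop described above can be illustrated with a minimal greedy sketch. This is a generic speculative-decoding toy, not DFlash's implementation: `toy_target` and `toy_draft_block` are hypothetical stand-ins for the large model and the block drafter, and the toy drafter fills its block sequentially, whereas a diffusion drafter would fill it in parallel.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]
DraftBlock = Callable[[List[Token], int], List[Token]]


def greedy_decode(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    target: NextToken,
    draft_block: DraftBlock,
    prompt: List[Token],
    n_new: int,
    block_size: int = 4,
) -> List[Token]:
    """Draft a block, keep the longest prefix the target agrees with,
    and replace the first mismatch with the target's own token."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        k = min(block_size, goal - len(seq))
        proposal = draft_block(seq, k)
        accepted: List[Token] = []
        # A real system would score the whole block in one batched target pass;
        # here the target is called per position for clarity.
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # correction token from the target
                break
        seq.extend(accepted)
    return seq


def toy_target(seq: List[Token]) -> Token:
    """Hypothetical 'large model': a fixed deterministic rule."""
    return (seq[-1] * 3 + 1) % 11


def toy_draft_block(seq: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: mostly agrees with the target, with one planted error."""
    block, last = [], seq[-1]
    for _ in range(k):
        nxt = (last * 3 + 1) % 11
        if nxt == 4:
            nxt = 5  # deliberate mistake so verification has something to reject
        block.append(nxt)
        last = nxt
    return block
```

Because every accepted token is checked against the target's greedy choice, `speculative_decode(toy_target, toy_draft_block, [2], 12)` returns the same sequence as `greedy_decode(toy_target, [2], 12)`. The speedup in practice comes from verifying a block in one pass rather than from changing the output.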

Reasoning Models Generate Societies of Thought

This work argues that strong reasoning models behave like small societies of internal agents with different "personalities" and expertise. Diversity and internal debate, not just longer chains of thought, drive their gains.

Junsol Kim, Shiyang Lai

Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

This work changes reinforcement learning for LLMs to reward correct but uncommon solution strategies, not just the first one that works. That raises pass@k without tanking single-answer performance, which matters if you sample multiple candidates.

Zhiyuan Hu, Yucheng Wang
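
For readers unfamiliar with pass@k, the metric mentioned above is often computed with the standard unbiased estimator over n sampled solutions of which c are correct. The sketch below shows that estimator only. Whether the paper uses exactly this form is an assumption, and its uniqueness-aware reward is not reproduced here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples with c correct, is correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 correct samples out of 10, `pass_at_k(10, 3, 1)` is about 0.3, while `pass_at_k(10, 3, 5)` rises to about 0.917, which is why methods that diversify correct solutions matter most when several candidates are sampled.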

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

ReFusion is a masked diffusion model for text that decodes in parallel over contiguous ‘slots’ instead of individual tokens. By combining diffusion-based planning with autoregressive infilling, it recovers much of the quality of strong autoregressive LLMs while massively speeding up generation and allowing KV-cache reuse. This is one of the more serious attempts to rethink LLM decoding beyond the usual left-to-right paradigm.

Jia-Nan Li, Jian Guan

InCoder-32B-Thinking: Industrial Code World Model for Thinking

Trains a 32B-parameter code model on synthetic “thinking traces” and hardware execution logs. Targets chip design, GPU tuning, and embedded code with explicit reasoning steps.

Jian Yang, Wei Zhang

Reinforcement World Model Learning for LLM-based Agents

The authors train a world model that predicts how text-based environments change when an agent acts, using rewards that close the gap between imagined and real next states. Agents using this learned model perform better on ALFWorld and τ² Bench than agents trained only on task success signals.

Xiao Yu, Baolin Peng

Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

A "reasoner" model and a "discriminator" model train together so the discriminator flags wrong steps in math solutions, not just wrong final answers. This joint training gives dense step-level rewards and boosts math benchmark scores for existing open models like DeepSeek-R1 distills without huge extra compute. ([ar5iv.org](https://ar5iv.org/abs/2512.16917))

Qihao Liu, Luoxin Ye

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.

Asa Cooper Stickland, Jan Michelfeit

LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation

Shows how to poison graph-structured knowledge used by retrieval-augmented systems. Focuses on attacks that subtly flip logical conclusions, not just surface facts.

Yilin Xiao, Jin Chen

NousResearch/hermes-agent

General-purpose AI agent framework that grows with user needs. Focuses on composable tools and skills instead of one fixed workflow.

26,246

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

MemSkill turns memory operations into skills that an agent can learn, select, and even redesign over time. It beats hand-written memory pipelines on long conversations, documents, and embodied tasks like ALFWorld.

Haozhen Zhang, Quanyu Long

Heterogeneous Scientific Foundation Model Collaboration

Introduces Eywa, a framework that lets language models coordinate with domain‑specific scientific models across non-text data. Treats those models as tools inside an agent system and studies planning strategies across them. If you’re building AI for science, this shows how to wire specialized models into one reasoning loop. ([huggingface.co](https://huggingface.co/papers/2604.27351))

Zihao Li, Jiaru Zou

Verbalizing LLMs' assumptions to explain and control sycophancy

Gets models to spell out their hidden assumptions before answering. That makes it easier to spot flattery-driven answers and dial them down.

Myra Cheng, Isabel Sieh

badlogic/pi-mono

Agent toolkit with a coding-agent CLI, unified LLM API, UI libraries, and Slack bot. Focuses on wiring agents into real dev environments.

31,908

onyx-dot-app/onyx

Full-stack open source AI chat platform that plugs into many models. Ships with advanced chat features, memory, and multi-user workspaces.

25,005

GoogleCloudPlatform/generative-ai

Large collection of Gemini on Vertex AI notebooks and sample apps. Great starting point if you want to build production-style systems on Google Cloud fast.

14,457

SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

SPOT lets a model "pause" and think over selected spans instead of every token. It aims to keep reasoning strong while cutting compute and revealing what the model is thinking about.

Yunlong Chu, Minglai Shao

ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

ORBITFLOW dynamically decides which attention-cache layers stay on GPU per request instead of using static offloading. It cuts tail latency and raises throughput for long-context chat and agents without buying more GPUs.

Xinyue Ma, Heelim Hong

openai/openai-cookbook

The OpenAI cookbook is a large set of worked examples for building with OpenAI’s API. Treat it as a pattern library for chat apps, agents, RAG systems, and fine-grained evaluations.

70,628

Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

HGMem turns the “scratchpad” of a multi-step retrieval system into a hypergraph that connects many related facts at once. This richer memory structure helps language models keep global context straight over long tasks, boosting performance on challenging reasoning and long-document benchmarks.

Chulun Zhou, Chunkang Zhang
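
To make the idea of a hypergraph memory concrete, here is a minimal sketch in which each stored fact is a hyperedge linking any number of entities at once. The class and method names are hypothetical and chosen for illustration; this is not HGMem's actual data structure or API.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


class HypergraphMemory:
    """Toy memory where one hyperedge ties a fact to all entities it mentions."""

    def __init__(self) -> None:
        self.edges: List[Tuple[FrozenSet[str], str]] = []
        self.index: Dict[str, Set[int]] = defaultdict(set)  # entity -> edge ids

    def add_fact(self, entities: Iterable[str], fact: str) -> int:
        """Store a fact as a single hyperedge over all of its entities."""
        edge_id = len(self.edges)
        members = frozenset(entities)
        self.edges.append((members, fact))
        for entity in members:
            self.index[entity].add(edge_id)
        return edge_id

    def facts_about(self, entity: str) -> List[str]:
        """Return facts mentioning the entity, in insertion order."""
        return [self.edges[i][1] for i in sorted(self.index.get(entity, ()))]

    def related(self, entity: str) -> Set[str]:
        """Return every entity that shares at least one hyperedge with this one."""
        out: Set[str] = set()
        for i in self.index.get(entity, ()):
            out |= self.edges[i][0]
        out.discard(entity)
        return out
```

A pairwise graph would need three separate edges to record that Alice, Bob, and Acme belong to one fact; a hyperedge keeps them together, so a single lookup on any member recovers the whole group.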

letta-ai/letta

Letta is a framework for long-lived agents with memory and tools. Use it to build assistants that remember projects across weeks of work, not just within a single prompt.

19,930

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

AuditDM trains an "auditor" model that hunts for cases where strong vision-language models disagree. Teams can reuse these hard examples to patch weaknesses without manual labeling.

Qihao Liu, Chengzhi Mao

Adaptation of Agentic AI

This large-scale study tracks how agent-like AI systems adapt over time and across tasks. If you're betting on agents, it gives structure and warnings for long-term deployment.

Pengcheng Jiang, Jiacheng Lin

Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection

Builds on Direct Preference Optimization but tackles its weak learning signal when both preferred and rejected responses share similar flaws. RPO adds a hint-guided reflection step that encourages the model to produce more contrastive, informative preference pairs before optimizing them. The result is a more stable and data-efficient on-policy alignment pipeline that still avoids full RLHF/RLAIF complexity.

Zihui Zhao, Zechang Li
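
As background, the per-pair loss of standard Direct Preference Optimization, which RPO builds on, can be sketched as follows. This is plain DPO only: the hint-guided reflection step is not specified in the summary above and is not modeled here. The argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    # How much more the policy favors the chosen response over the rejected one,
    # relative to the frozen reference model.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy treats both responses exactly as the reference does, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the preferred response.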

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

Describes the QwenLong-L1.5 post-training recipe for extending LLM context windows while keeping reasoning quality intact. The work focuses not just on positional encodings but also on memory management strategies and training curricula that keep long-context performance from collapsing. This is highly relevant for anyone trying to turn a baseline LLM into a stable long-context model without re‑training from scratch.

Weizhou Shen, Ziyi Yang

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Introduces NL2Repo-Bench, a benchmark where coding agents must generate or modify entire repositories from natural language specifications, rather than solving single-file LeetCode-style tasks. It evaluates long-horizon planning, tool use, and consistency across files and modules. This is a big step toward evaluating code agents in settings that look like real software projects instead of toy problems.

Jingzhe Ding, Shengda Long

thedotmack/claude-mem

A Claude Code plugin that logs your coding sessions, compresses them with Claude via the agent SDK, and feeds back relevant context into future sessions. In practice it acts like a persistent, AI-managed memory of your projects, making the assistant far more ‘aware’ of the codebase and past conversations. It’s a concrete, production-friendly take on the “long-term memory for coding agents” idea.

7,300

CopilotKit

React UI components plus backend infrastructure for building in-app AI copilots, chatbots, and agentic workflows. It’s becoming a go-to choice if you want "agentic frontends" without wiring everything from scratch. ([github.com](https://github.com/trending?since=daily))

26,435

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

FACTS is a multi-part leaderboard that evaluates LLM factuality across image-based QA, closed-book QA, search-augmented QA, and document-grounded long-form responses, using automated judge models. It’s designed as a long-lived suite with public and private splits, giving a single factuality score while still exposing failure modes across modalities and tool-use settings. ([huggingface.co](https://huggingface.co/papers/2512.10791))

Aileen Cheng, Alon Jacovi

Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Meta describes Confucius Code Agent (CCA), an open-source AI "software engineer" built on the Confucius SDK with hierarchical working memory, persistent cross-session notes, and robust tool orchestration. On SWE-Bench-Pro it reaches 54.3% Resolve@1, substantially outperforming prior coding agents while emphasizing transparency and extensibility for industrial-scale workflows. ([huggingface.co](https://huggingface.co/papers/2512.10398))

Zhaodong Wang, Zhenting Qi

google/langextract

Langextract turns messy text into structured records using LLMs with grounded citations. It targets production use cases where you need both high recall and traceable sources.

22,050

Make Your LVLM KV Cache More Lightweight

Targets the KV-cache memory blow-up caused by vision tokens in large vision–language models at inference time. Uses a prompt-aware method, LightKV, to merge redundant vision tokens before decoding. If you ship LVLMs, this is a concrete way to cut GPU memory and costs without killing quality. ([arxiv.org](https://arxiv.org/list/cs.CV/pastweek?show=100))

Anonymous (ICLR and TMLR drafts; arXiv metadata lists named authors)

NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons

Replaces mixture-of-experts with a finer-grained mixture-of-neurons for reasoning tasks. The goal is more interpretable and steerable thinking steps.

Haonan Dong, Kehan Jiang

EuroLLM-22B: Technical Report

EuroLLM-22B is a 22B-parameter open model focused on European languages, with long-context support and a detailed training recipe. It aims to give EU labs and companies a strong regional alternative to US-centric frontier models.

Miguel Moura Ramos, Duarte M. Alves

DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

DialDefer shows LLM judges treat identical claims differently depending on whether a human or "anonymous text" said them. The authors propose metrics and mitigation for this hidden bias toward agreeing with people.

Parisa Rabbani, Priyam Sahoo

resemble-ai/chatterbox

Chatterbox is a state-of-the-art open source text-to-speech stack. If you need production-quality voices without a SaaS bill, start here.

16,307

TauricResearch/TradingAgents

Multi-agent LLM framework for algorithmic trading. Provides reusable components for data pipelines, strategy simulation, and coordinated agents across markets. If you experiment with AI trading, use this instead of gluing together notebooks. ([github.com](https://github.com/trending?since=daily))

64,981

Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

Proposes a new way for models to draft and prune multiple token paths in parallel. Targets faster text generation without any additional training.

Tao Jin, Phuong Minh Nguyen