
LLM

Research papers, repositories, and articles about LLMs


OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.

Zijian Wu, Lingkai Kong
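To make the summarize-then-verify idea concrete, here is a minimal sketch of the pattern, with `summarize` and `verify` as stand-ins for OPV's trained components; the segment-chunking scheme below is an assumption, not the paper's:

```python
# Conceptual sketch of outcome-style process verification: split a long
# chain-of-thought into segments, summarize each segment's outcome, and
# score the summarized outcomes instead of raw steps.
# summarize() and verify() are placeholders for OPV's trained components.
def verify_cot(problem: str, cot_steps: list[str], summarize, verify,
               chunk: int = 10) -> list[float]:
    scores = []
    for i in range(0, len(cot_steps), chunk):
        segment = "\n".join(cot_steps[i:i + chunk])
        outcome = summarize(segment)             # distill the segment's claim
        scores.append(verify(problem, outcome))  # score the summarized outcome
    return scores
```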

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system achieves new SOTA on a held-out benchmark while reducing annotation cost.

Songyang Gao, Yuzhe Gu

T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

T-pro 2.0 is an open-weight Russian large language model focused on hybrid reasoning: it can answer directly or emit explicit reasoning traces, and it’s optimized for low-latency inference via speculative decoding. Alongside the model, the authors release a Russian instruction corpus, a math benchmark, and an EAGLE-based inference stack, making it a practical foundation for Russian-language reasoning applications.

Dmitrii Stoianov, Danil Taranets
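The speculative-decoding angle is easy to try with off-the-shelf tooling. Below is a minimal sketch using Hugging Face transformers' generic assisted generation rather than T-pro's EAGLE stack; the model ids are assumptions:

```python
# Hypothetical sketch: draft/target speculative decoding via transformers'
# assisted generation. T-pro 2.0 ships an EAGLE-based stack; this is the
# simpler generic variant, and the model names below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "t-tech/T-pro-it-2.0"   # assumed model id
draft_name = "Qwen/Qwen2.5-0.5B"      # placeholder small draft model

tok = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name, torch_dtype=torch.bfloat16)
# Note: classic assisted generation needs a draft model with a compatible tokenizer.
draft = AutoModelForCausalLM.from_pretrained(draft_name, torch_dtype=torch.bfloat16)

# Prompt: "Explain the Pythagorean theorem briefly." (in Russian)
inputs = tok("Объясни теорему Пифагора кратко.", return_tensors="pt")
# The draft model proposes several tokens per step; the target model
# verifies them in one forward pass, keeping the longest accepted prefix.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```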

Memory in the Age of AI Agents

A substantial survey that systematizes the fast-growing literature on ‘agent memory’—how agentic LLM systems store, retrieve, and evolve information over time. It proposes a taxonomy across forms (token, parametric, latent), functions (factual, experiential, working), and dynamics, and catalogs existing benchmarks and frameworks. If you’re building agent systems with nontrivial memory, this is quickly becoming the reference map of the territory.

Yuyang Hu, Shichun Liu

openai/codex

A lightweight coding agent that runs directly in your terminal, wiring OpenAI models into a loop that edits files, runs tests, and applies patches. Compared to IDE plugins, it’s closer to a shell-native ‘pair programmer’ that can operate on entire repos and workflows. Given its rapid adoption and tight integration with existing CLIs, it’s poised to become a reference design for terminal-first code agents.

54,000
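The core loop such a terminal agent runs is simple to sketch. This is an illustration of the pattern, not openai/codex's actual implementation; `ask_model` is a placeholder for a chat-completions call:

```python
# Minimal sketch of the edit/test/patch loop a terminal coding agent runs.
import subprocess

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite and capture output."""
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def apply_patch(patch: str) -> None:
    """Apply a unified diff produced by the model."""
    subprocess.run(["git", "apply", "-"], input=patch, text=True, check=True)

def agent_loop(task: str, ask_model, max_steps: int = 8) -> bool:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        ok, log = run_tests()
        if ok:
            return True                          # done: tests pass
        history.append(f"Test output:\n{log[-4000:]}")
        patch = ask_model("\n\n".join(history))  # model returns a diff
        apply_patch(patch)
        history.append(f"Applied patch:\n{patch}")
    return False
```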

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

ReFusion is a masked diffusion model for text that decodes in parallel over contiguous ‘slots’ instead of individual tokens. By combining diffusion-based planning with autoregressive infilling, it recovers much of the quality of strong autoregressive LLMs while massively speeding up generation and allowing KV-cache reuse. This is one of the more serious attempts to rethink LLM decoding beyond the usual left-to-right paradigm.

Jia-Nan Li, Jian Guan
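A toy sketch of the underlying move, confidence-based parallel unmasking, is below; ReFusion's slot-level autoregressive infilling is more involved, and `model` and `mask_id` here are placeholders for an HF-style model and its mask token:

```python
# Toy confidence-based parallel unmasking, the core move in masked-diffusion
# text decoding. Each step commits the highest-confidence masked positions
# in parallel instead of decoding strictly left to right.
import torch

def diffusion_decode(model, ids: torch.Tensor, mask_id: int, steps: int = 8):
    ids = ids.clone()
    for _ in range(steps):
        masked = (ids == mask_id)
        if not masked.any():
            break
        logits = model(ids).logits              # (1, T, V), placeholder model
        conf, pred = logits.softmax(-1).max(-1)  # per-position confidence
        conf[~masked] = -1.0                     # only fill masked slots
        k = max(1, int(masked.sum()) // steps)   # positions to commit this step
        top = conf[0].topk(k).indices
        ids[0, top] = pred[0, top]               # commit in parallel
    return ids
```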

simstudioai/sim

A full-stack platform for visually building, running, and deploying AI agent workflows. Provides a canvas for wiring together agents, tools, vector stores, and orchestrations, with both cloud-hosted and self-hosted (Docker/Ollama) options and strong Copilot integration. It effectively turns ‘agent graphs’ into a first-class artifact, which is where a lot of production LLM work is heading.

22,700

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.

Asa Cooper Stickland, Jan Michelfeit

thedotmack/claude-mem

A Claude Code plugin that logs your coding sessions, compresses them with Claude via the agent SDK, and feeds back relevant context into future sessions. In practice it acts like a persistent, AI-managed memory of your projects, making the assistant far more ‘aware’ of the codebase and past conversations. It’s a concrete, production-friendly take on the “long-term memory for coding agents” idea.

7,300
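The compress-and-recall idea is straightforward to sketch with the Anthropic SDK. This is not the plugin's actual code; the model id and on-disk format are assumptions:

```python
# Illustrative sketch: summarize a session transcript and store it for
# later injection into future sessions.
import json
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MEMORY_FILE = Path("~/.claude-mem/memories.jsonl").expanduser()  # assumed path

def compress_session(transcript: str, project: str) -> str:
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": "Summarize this coding session into durable facts "
                       "about the codebase and decisions made:\n\n" + transcript,
        }],
    )
    summary = msg.content[0].text
    MEMORY_FILE.parent.mkdir(parents=True, exist_ok=True)
    with MEMORY_FILE.open("a") as f:
        f.write(json.dumps({"project": project, "summary": summary}) + "\n")
    return summary
```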

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Introduces NL2Repo-Bench, a benchmark where coding agents must generate or modify entire repositories from natural language specifications, rather than solving single-file LeetCode-style tasks. It evaluates long-horizon planning, tool use, and consistency across files and modules. This is a big step toward evaluating code agents in settings that look like real software projects instead of toy problems.

Jingzhe Ding, Shengda Long

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

Describes the QwenLong-L1.5 post-training recipe for extending LLM context windows while keeping reasoning quality intact. The work focuses not just on positional encodings but also on memory management strategies and training curricula that keep long-context performance from collapsing. This is highly relevant for anyone trying to turn a baseline LLM into a stable long-context model without re‑training from scratch.

Weizhou Shen, Ziyi Yang

Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection

Builds on Direct Preference Optimization but tackles its weak learning signal when both preferred and rejected responses share similar flaws. RPO adds a hint-guided reflection step that encourages the model to produce more contrastive, informative preference pairs before optimizing them. The result is a more stable and data-efficient on-policy alignment pipeline that still avoids full RLHF/RLAIF complexity.

Zihui Zhao, Zechang Li
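For reference, here is the standard DPO loss that RPO builds on, as a minimal sketch; RPO's contribution sits upstream of this objective, in how the chosen/rejected pairs are regenerated via hint-guided reflection:

```python
# Standard DPO loss over a batch of preference pairs. Log-probs are
# summed over response tokens before being passed in.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Widen the gap between chosen and rejected responses, measured
    # relative to the frozen reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

When the chosen and rejected responses share the same flaws, the two margins nearly cancel and the gradient vanishes, which is exactly the weak-signal regime RPO targets.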

CopilotKit

React UI components plus backend infrastructure for building in-app AI copilots, chatbots, and agentic workflows. It’s becoming a go-to choice if you want "agentic frontends" without wiring everything from scratch. ([github.com](https://github.com/trending?since=daily))

26,435

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

FACTS is a multi-part leaderboard that evaluates LLM factuality across image-based QA, closed-book QA, search-augmented QA, and document-grounded long-form responses, using automated judge models. It’s designed as a long-lived suite with public and private splits, giving a single factuality score while still exposing failure modes across modalities and tool-use settings. ([huggingface.co](https://huggingface.co/papers/2512.10791))

Aileen Cheng, Alon Jacovi

Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Meta describes Confucius Code Agent (CCA), an open-source AI "software engineer" built on the Confucius SDK with hierarchical working memory, persistent cross-session notes, and robust tool orchestration. On SWE-Bench-Pro it reaches 54.3% Resolve@1, substantially outperforming prior coding agents while emphasizing transparency and extensibility for industrial-scale workflows. ([huggingface.co](https://huggingface.co/papers/2512.10398))

Zhaodong Wang, Zhenting Qi

ZJU-LLMs/Foundations-of-LLMs

An open book and course materials on the foundations of large language models, covering theory, architectures, training, and deployment. With >14k stars, it’s quickly becoming a go‑to learning resource for people trying to move from ‘user’ to ‘builder’ of LLMs. If you want a structured, code-linked path into the guts of modern LMs, this is a strong candidate.

14,400

Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

Claims an exact, error-free formulation of linear attention derived from a continuous-time view of transformer dynamics. The authors argue they can match the behavior of standard softmax attention while enjoying linear-time complexity, avoiding the approximation errors that plague many fast-attention variants. If the theory and practice hold up, this could become a key building block for large-context models and resource-constrained deployments.

Jingdi Lei, Di Zhang
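For context, here is the classical kernelized linear-attention formulation the field has used as an approximation; the paper's claim is that an exact correspondence can be derived from continuous-time dynamics, which this reference formulation does not capture:

```latex
% Causal softmax attention is quadratic in sequence length n:
\[
o_i \;=\; \frac{\sum_{j \le i} \exp(q_i^{\top} k_j)\, v_j}
               {\sum_{j \le i} \exp(q_i^{\top} k_j)}
\qquad O(n^2)
\]
% Replacing \exp(q^{\top}k) with a feature map \phi(q)^{\top}\phi(k)
% lets the sums be maintained as running state, giving linear time:
\[
o_i \;=\; \frac{\phi(q_i)^{\top} S_i}{\phi(q_i)^{\top} z_i},
\qquad
S_i = S_{i-1} + \phi(k_i)\, v_i^{\top},
\quad
z_i = z_{i-1} + \phi(k_i)
\qquad O(n)
\]
```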

nanoGPT

Karpathy’s minimalist GPT training repo continues to trend, reflecting ongoing interest in from-scratch pretraining and fine-tuning for medium-sized LLMs. Still one of the best learning references if you want to understand the guts of GPT-style models. ([github.com](https://github.com/trending?since=daily))

51,051
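The heart of such a model fits in a few lines. Below is a condensed causal self-attention block in the spirit of nanoGPT's model.py (simplified, not a verbatim excerpt):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused Q, K, V projection
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        # fused scaled dot-product attention with a causal mask
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```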

Reverse Thinking Enhances Missing Information Detection in Large Language Models

Shows that guiding LLMs through a reverse-thinking framework—reasoning backward from required conditions—substantially improves their ability to detect when problem statements lack necessary information. Experiments on modified GSM8K-style datasets demonstrate large gains over standard CoT and ToT prompting, with theoretical bounds on recall and false positives under simple accuracy assumptions. ([arxiv.org](https://arxiv.org/abs/2512.10273))

Yuxin Liu, Chaojie Gu
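A hedged sketch of what such a reverse-thinking prompt could look like; the template below is an illustration, not the authors' exact wording:

```python
# Hypothetical reverse-thinking prompt: reason backward from what the
# answer would require, then check each requirement against the problem.
REVERSE_THINKING_PROMPT = """\
Problem: {problem}

Do not solve the problem yet. Instead:
1. State what quantity the problem asks for.
2. Working backward, list every fact or value that would be required
   to compute that quantity.
3. For each required fact, check whether the problem statement provides it.
4. If any required fact is missing, answer exactly:
   MISSING INFORMATION: <what is missing>
   Otherwise answer: SOLVABLE.
"""

def build_prompt(problem: str) -> str:
    return REVERSE_THINKING_PROMPT.format(problem=problem)
```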

SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning

Argues that current task-oriented agents are over-optimized as passive followers and under-use conversation as an action. SpeakRL introduces a reinforcement-learning setup that rewards models for asking clarifying questions when the user’s intent is ambiguous, balancing ‘asking’ vs ‘acting’. On synthetic task-oriented dialogue scenarios, the trained agents substantially improve task completion rates without bloating the number of turns, suggesting that proactive clarification is a powerful, underused control knob.

Emre Can Acikgoz, Jinoh Oh
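The ask-vs-act trade-off can be pictured as a shaped reward. The toy function below is illustrative only and does not reproduce the paper's reward:

```python
# Toy reward shaping for the ask-vs-act trade-off: clarifying pays off
# only under genuine ambiguity, and every extra turn costs a little.
def speak_act_reward(task_completed: bool,
                     asked_clarification: bool,
                     intent_was_ambiguous: bool,
                     num_turns: int,
                     turn_penalty: float = 0.05) -> float:
    reward = 1.0 if task_completed else 0.0
    if asked_clarification:
        reward += 0.2 if intent_was_ambiguous else -0.2
    return reward - turn_penalty * num_turns
```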

mindsdb

Markets itself as a "federated query engine for AI" and "the only MCP server you’ll ever need," exposing AI models and tools through a unified interface. Useful if you’re standardizing on MCP and want a batteries-included orchestration backend. ([github.com](https://github.com/trending?since=daily))

37,856

WeKnora

Tencent’s LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answering with a RAG paradigm. Essentially a production-grade answer engine stack rather than a toy demo. ([github.com](https://github.com/trending?since=daily))

8,623

WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment

Proposes WebOperator, a web agent framework that uses action-aware tree search to plan sequences of browser actions rather than issuing greedy commands. By modeling the future impact of clicks, form fills, and navigations, the agent can backtrack from bad branches and robustly complete multi-step web tasks. It’s part of the growing trend from ‘prompt a browser wrapper’ toward genuinely search-based web agents.

Mahir Labib Dihan, Tanzima Hashem
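The general pattern, best-first search over action sequences with backtracking, is easy to sketch; the scoring, simulation, and action-enumeration functions below are placeholders, not WebOperator's components:

```python
# Best-first tree search over browser action sequences. Popping the
# highest-scoring node first gives implicit backtracking: bad branches
# simply stop being expanded.
import heapq
import itertools

def tree_search(start_state, propose_actions, simulate, score, is_goal,
                budget: int = 200):
    counter = itertools.count()               # tie-breaker for the heap
    frontier = [(-score(start_state), next(counter), start_state, [])]
    while frontier and budget > 0:
        _, _, state, plan = heapq.heappop(frontier)   # best node first
        if is_goal(state):
            return plan
        for action in propose_actions(state):         # clicks, fills, navs
            budget -= 1
            nxt = simulate(state, action)             # model the action's impact
            heapq.heappush(
                frontier,
                (-score(nxt), next(counter), nxt, plan + [action]),
            )
    return None   # search exhausted without reaching the goal
```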

Error-Driven Prompt Optimization for Arithmetic Reasoning

Targets the surprisingly hard problem of getting small on‑prem LLMs to do reliable arithmetic over tabular data in regulated environments. The authors propose an error-driven loop that clusters the model’s wrong answers, derives new prompt rules to address those failure modes, and iteratively refines a code-generation agent. On a finance-style deployment with a 4B-parameter model, this strategy reportedly boosts arithmetic accuracy to around 70% while keeping all computation inside the secure environment.

Árpád Pándy, Róbert Lakatos
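A skeleton of that refinement loop, with `cluster_errors` and `derive_rule` standing in for the paper's components:

```python
# Error-driven prompt refinement: evaluate, cluster the failures, turn
# each cluster into a new prompt rule, repeat.
def optimize_prompt(base_prompt: str, eval_set, run_model,
                    cluster_errors, derive_rule, rounds: int = 5) -> str:
    prompt = base_prompt
    for _ in range(rounds):
        failures = [(x, y, pred) for x, y in eval_set
                    if (pred := run_model(prompt, x)) != y]
        if not failures:
            break
        for cluster in cluster_errors(failures):   # group failure modes
            rule = derive_rule(cluster)            # e.g. "round before summing"
            prompt += f"\n- {rule}"                # append as a prompt rule
    return prompt
```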

next-ai-draw-io

A Next.js web app that layers natural-language-driven AI editing on top of draw.io diagrams, letting you create and modify diagrams through prompts. Great if your team lives in diagrams and you want AI to help refactor system designs. ([github.com](https://github.com/trending?since=daily))

9,627

MAC: A Multi-Agent Framework for Interactive User Clarification in Multi-turn Conversations

Proposes a multi‑agent architecture where specialized conversational agents coordinate to decide when and how to ask clarification questions in ambiguous multi‑turn tasks. Instead of a monolithic assistant, MAC assigns roles and coordination rules so that the ‘right’ agent takes the lead on resolving uncertainty. This is a nice complement to SpeakRL: one focuses on *whether* to clarify, the other on *who* clarifies and how to coordinate in complex workflows.

Emre Can Acikgoz, Jinoh Oh

pandas-ai

pandas-ai turns DataFrames and SQL/CSV/Parquet sources into a conversational interface, translating natural-language questions into code or SQL, running them in a (configurable) sandbox, and optionally using RAG and semantic schemas to answer more complex queries. It’s attractive for quickly giving analysts or business users an LLM front-end on top of existing data, though you do need to pay attention to security configurations given its history of prompt-injection/RCE issues that were later mitigated with new settings. ([github.com](https://github.com/sinaptik-ai/pandas-ai?utm_source=openai))

22,805
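Typical v2-style usage looks roughly like the sketch below; the API has shifted across major versions, so treat this as illustrative and check the current docs:

```python
# Illustrative pandas-ai (v2-style) usage; API details may differ in
# newer releases.
import pandas as pd
from pandasai import SmartDataframe
from pandasai.llm import OpenAI

df = pd.DataFrame({
    "country": ["US", "DE", "JP"],
    "revenue": [5000, 3200, 2900],
})

llm = OpenAI(api_token="sk-...")  # redacted; supply your own key
sdf = SmartDataframe(df, config={"llm": llm, "enable_cache": False})

# Natural-language question -> generated pandas code -> sandboxed execution
print(sdf.chat("Which country has the highest revenue?"))
```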

Achieving Olympiad-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning

InternGeometry is a geometry-solving LLM agent that reaches medalist-level performance on IMO geometry problems by tightly integrating with a symbolic engine. It proposes auxiliary constructions and propositions, verifies them symbolically, reflects on the feedback, and is trained with a complexity-boosting RL curriculum—achieving 44/50 problems solved using a tiny fraction of the data required by AlphaGeometry 2.

Haiteng Zhao, Junhao Shen

BEAVER: An Efficient Deterministic LLM Verifier

BEAVER is a deterministic verifier for large language models that computes tight, provably-sound bounds on the probability that a model satisfies a given semantic constraint. Instead of sampling and hoping for the best, it systematically explores the token space with specialized data structures, yielding much sharper risk estimates for correctness, privacy, and security-critical applications.

Tarun Suresh, Nalin Wadhwa
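A toy branch-and-bound over token prefixes illustrates how deterministic exploration can yield sound probability bounds; this is a conceptual sketch, not BEAVER's algorithm or data structures:

```python
# Branch-and-bound over token prefixes: decided subtrees contribute their
# full mass to the bounds; undecided mass at the depth/mass cutoff only
# widens the upper bound, so (lower, upper) stays sound.
def bound_constraint_prob(next_token_dist, check, max_depth: int,
                          eps: float = 1e-4):
    """Bounds on P(completion satisfies `check`).
    next_token_dist(prefix) -> {token: prob};
    check(prefix) -> True / False / None (undecided)."""
    lower = upper = 0.0
    stack = [((), 1.0)]                       # (prefix, prefix probability)
    while stack:
        prefix, p = stack.pop()
        verdict = check(prefix)
        if verdict is True:
            lower += p; upper += p            # whole subtree satisfies
        elif verdict is False:
            continue                          # whole subtree violates
        elif len(prefix) >= max_depth or p < eps:
            upper += p                        # undecided mass: upper only
        else:
            for tok, q in next_token_dist(prefix).items():
                stack.append((prefix + (tok,), p * q))
    return lower, upper
```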

An Introduction to Large Language Models: Prompt Engineering and P-Tuning

This introductory post explains what LLMs are and why they’re powerful, then walks through practical prompt‑engineering patterns (zero‑shot, few‑shot, chain‑of‑thought) and P‑tuning as a lightweight way to specialize models for particular tasks. Developers new to LLMs get concrete examples of how to structure prompts and when to switch from prompting to parameter‑efficient tuning, along with intuition about the trade‑offs in scale and data. ([developer.nvidia.com](https://developer.nvidia.com/blog/an-introduction-to-large-language-models-prompt-engineering-and-p-tuning/))

NVIDIA AI
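The three prompt patterns the post walks through, as plain templates (hypothetical examples, not the post's exact prompts):

```python
# Zero-shot: rely on the model's pretrained knowledge alone.
zero_shot = "Classify the sentiment of this review as positive or negative:\n{review}"

# Few-shot: prepend labeled examples to steer the output format.
few_shot = """\
Review: "Great battery life." -> positive
Review: "Screen cracked in a week." -> negative
Review: "{review}" ->"""

# Chain-of-thought: demonstrate intermediate reasoning before the answer.
chain_of_thought = """\
Q: A store had 23 apples, sold 9, then received 6 more. How many now?
A: Let's think step by step. 23 - 9 = 14, then 14 + 6 = 20. Answer: 20.
Q: {question}
A: Let's think step by step."""
```

P-tuning then picks up where prompting stops paying off: instead of hand-crafting more text, a small set of continuous prompt embeddings is trained while the model weights stay frozen.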

tinker-cookbook

tinker-cookbook provides practical, end‑to‑end examples of post‑training LLMs using Tinker, a managed fine‑tuning API from Thinking Machines Lab that handles distributed training while you control the algorithms and data. The repo includes recipes for instruction tuning, math reasoning, RLHF-style preference learning, tool use, prompt distillation, and multi-agent setups, making it a strong starting point if you want to fine‑tune open-weight models like Llama or Qwen without building your own training stack. ([github.com](https://github.com/thinking-machines-lab/tinker-cookbook?utm_source=openai))

2,434