Agents
Research papers, repositories, and articles about agents
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.
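To make the hybrid idea concrete, here is a minimal sketch of checking a segmented chain-of-thought via summarized outcomes; both model calls below are toy stubs of ours, not OPV's actual interface:

```python
# Sketch: verify a long chain-of-thought by judging the summarized
# outcome of each reasoning segment rather than every raw step.
# Both functions below are toy stubs; in OPV a trained verifier
# model plays these roles.

def summarize_outcome(segment: str) -> str:
    """Stub: an LLM would compress the segment to its claimed outcome."""
    lines = segment.strip().splitlines()
    return lines[-1] if lines else ""

def judge_outcome(context: str, outcome: str) -> bool:
    """Stub: the verifier would score whether the outcome follows from context."""
    return "contradiction" not in outcome.lower()

def verify_long_cot(problem: str, segments: list[str]) -> list[bool]:
    verdicts, context = [], problem
    for seg in segments:
        outcome = summarize_outcome(seg)        # check the summary, not raw steps
        verdicts.append(judge_outcome(context, outcome))
        context += "\n" + outcome               # carry outcomes forward as context
    return verdicts
```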
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system achieves new SOTA on a held-out benchmark while reducing annotation cost.
Memory in the Age of AI Agents
A substantial survey that systematizes the fast-growing literature on ‘agent memory’—how agentic LLM systems store, retrieve, and evolve information over time. It proposes a taxonomy across forms (token, parametric, latent), functions (factual, experiential, working), and dynamics, and catalogs existing benchmarks and frameworks. If you’re building agent systems with nontrivial memory, this is quickly becoming the reference map of the territory.
openai/codex
A lightweight coding agent that runs directly in your terminal, wiring OpenAI models into a loop that edits files, runs tests, and applies patches. Compared to IDE plugins, it’s closer to a shell-native ‘pair programmer’ that can operate on entire repos and workflows. Given its rapid adoption and tight integration with existing CLIs, it’s poised to become a reference design for terminal-first code agents.
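The core cycle is simple enough to sketch. This is our illustration of an edit/test/patch loop, not Codex's actual internals; `propose_patch` stands in for the model call:

```python
# Minimal edit/test/patch loop for a terminal coding agent (our sketch).
import subprocess

def run_tests() -> tuple[bool, str]:
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def agent_loop(propose_patch, max_iters: int = 5) -> bool:
    """`propose_patch(feedback) -> str` is a stand-in for the model call."""
    feedback = ""
    for _ in range(max_iters):
        patch = propose_patch(feedback)             # model drafts a unified diff
        applied = subprocess.run(["git", "apply", "-"], input=patch,
                                 capture_output=True, text=True)
        if applied.returncode != 0:
            feedback = applied.stderr               # diff didn't apply: tell the model
            continue
        ok, feedback = run_tests()                  # run the suite, feed output back
        if ok:
            return True                             # tests pass: stop iterating
    return False
```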
simstudioai/sim
A full-stack platform for visually building, running, and deploying AI agent workflows. Provides a canvas for wiring together agents, tools, vector stores, and orchestrations, with both cloud-hosted and self-hosted (Docker/Ollama) options and strong Copilot integration. It effectively turns ‘agent graphs’ into a first-class artifact, which is where a lot of production LLM work is heading.
Async Control: Stress-testing Asynchronous Control Measures for LLM Agents
Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.
dify
A very popular production-ready platform for building agentic workflows and applications, with UI, orchestration, and deployment all in one. Given its star growth, it’s becoming a de facto choice for many teams moving beyond simple RAG bots. ([github.com](https://github.com/trending?since=daily))
thedotmack/claude-mem
A Claude Code plugin that logs your coding sessions, compresses them with Claude via the agent SDK, and feeds back relevant context into future sessions. In practice it acts like a persistent, AI-managed memory of your projects, making the assistant far more ‘aware’ of the codebase and past conversations. It’s a concrete, production-friendly take on the “long-term memory for coding agents” idea.
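The underlying pattern is worth sketching. The JSONL store and `compress` stub below are our stand-ins, not the plugin's actual storage or SDK calls:

```python
# Session-memory pattern (our sketch): compress each session after the
# fact, then surface past memories for the same project on the next run.
import json
import pathlib

MEM = pathlib.Path("~/.agent-mem.jsonl").expanduser()  # hypothetical store

def compress(transcript: str) -> str:
    """Stub: claude-mem has Claude summarize the session via the agent SDK."""
    return transcript[:200]  # toy truncation in place of an LLM summary

def save_session(project: str, transcript: str) -> None:
    with MEM.open("a") as f:
        f.write(json.dumps({"project": project, "memory": compress(transcript)}) + "\n")

def recall(project: str) -> list[str]:
    if not MEM.exists():
        return []
    rows = [json.loads(line) for line in MEM.read_text().splitlines()]
    return [r["memory"] for r in rows if r["project"] == project]
```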
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Introduces NL2Repo-Bench, a benchmark where coding agents must generate or modify entire repositories from natural language specifications, rather than solving single-file LeetCode-style tasks. It evaluates long-horizon planning, tool use, and consistency across files and modules. This is a big step toward evaluating code agents in settings that look like real software projects instead of toy problems.
CopilotKit
React UI components plus backend infrastructure for building in-app AI copilots, chatbots, and agentic workflows. It’s becoming a go-to choice if you want "agentic frontends" without wiring everything from scratch. ([github.com](https://github.com/trending?since=daily))
Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
Meta describes Confucius Code Agent (CCA), an open-source AI "software engineer" built on the Confucius SDK with hierarchical working memory, persistent cross-session notes, and robust tool orchestration. On SWE-Bench-Pro it reaches 54.3% Resolve@1, substantially outperforming prior coding agents while emphasizing transparency and extensibility for industrial-scale workflows. ([huggingface.co](https://huggingface.co/papers/2512.10398))
Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
HF pitches Confucius Code Agent as an industrial-strength open coding agent with hierarchical working memory, persistent notes, and a meta-agent that continuously refines configurations. If you care about reproducible, extensible coding agents rather than opaque SaaS tools, this is a substantial systems paper. ([huggingface.co](https://huggingface.co/papers/2512.10398))
SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning
Argues that current task-oriented agents are over-optimized as passive followers and under-use conversation as an action. SpeakRL introduces a reinforcement-learning setup that rewards models for asking clarifying questions when the user’s intent is ambiguous, balancing ‘asking’ vs ‘acting’. On synthetic task-oriented dialogue scenarios, the trained agents substantially improve task completion rates without bloating the number of turns, suggesting that proactive clarification is a powerful, underused control knob.
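A toy reward in that spirit; the paper's actual shaping differs, and the constants here are purely illustrative:

```python
# Illustrative per-turn reward balancing 'asking' vs 'acting' (our toy
# version, not SpeakRL's exact formulation).

def turn_reward(action: str, intent_ambiguous: bool, task_done: bool) -> float:
    r = 0.0
    if action == "ask":
        r += 0.5 if intent_ambiguous else -0.3  # useful vs. needless clarification
    if task_done:
        r += 1.0                                # terminal task-completion reward
    return r - 0.05                             # small per-turn cost discourages bloat
```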
virattt/ai-hedge-fund
An ‘AI hedge fund team’ repository wrapping together data pipelines, modeling code, and infra for algorithmic trading driven by AI. While it isn’t a plug-and-play real fund, it’s a surprisingly complete example of how to glue modern ML, backtesting, and orchestration around financial strategies. It’s trending hard, partly because it’s both ambitious and unusually transparent for this domain.
mindsdb
Markets itself as a "federated query engine for AI" and "the only MCP server you’ll ever need," exposing AI models and tools through a unified interface. Useful if you’re standardizing on MCP and want a batteries-included orchestration backend. ([github.com](https://github.com/trending?since=daily))
chrome-devtools-mcp
An MCP server that exposes Chrome DevTools to coding agents, enabling them to inspect and manipulate web pages programmatically. This is a big enabler for realistic browser-based agents that need deep debugging and automation capabilities. ([github.com](https://github.com/trending?since=daily))
daytona
Daytona is a secure, elastic runtime for executing AI-generated code and agent workflows in isolated sandboxes, with Python and TypeScript SDKs to spin up environments in sub‑100ms and run arbitrary code, processes, or dev tools. It’s quickly becoming a go-to “agent runtime” layer for teams that need safe, persistent, and massively parallel sandboxes (including LangChain’s open-source coding agent), instead of gluing together ad‑hoc Docker or VM setups. ([github.com](https://github.com/daytonaio/daytona?utm_source=openai))
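A typical lifecycle looks roughly like the following; method names follow the project's quickstart as best we recall, so verify them against the current docs before relying on them:

```python
# Rough sandbox lifecycle with Daytona's Python SDK (names approximate).
from daytona import Daytona

daytona = Daytona()                    # reads API credentials from the environment
sandbox = daytona.create()             # fresh isolated sandbox, sub-100ms spin-up
try:
    run = sandbox.process.code_run("print(2 + 2)")  # execute AI-generated code safely
    print(run.result)                  # output produced inside the sandbox
finally:
    sandbox.delete()                   # tear the environment back down
```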
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
Reframes GUI agent interaction history as a program with variables and control flow, using this structure to decide what to retain or discard in context. Combined with a global belief-state mechanism, AgentProg significantly improves long-horizon task success on AndroidWorld and a new benchmark, avoiding the context bloat and semantic loss that plague prior compression schemes. ([arxiv.org](https://arxiv.org/abs/2512.10371?utm_source=openai))
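Our sketch of the 'history as a program' idea, with the names and pruning policy invented for illustration:

```python
# Keep named variables plus a compact statement trace instead of a raw
# event log, so old steps can be pruned without losing referents
# (our illustration; AgentProg's actual representation is richer).
from dataclasses import dataclass, field

@dataclass
class ProgramContext:
    variables: dict[str, str] = field(default_factory=dict)  # durable referents
    statements: list[str] = field(default_factory=list)      # recent actions

    def assign(self, name: str, value: str) -> None:
        self.variables[name] = value          # e.g. order_id = 'A-1042'

    def record(self, stmt: str, keep_last: int = 20) -> None:
        self.statements.append(stmt)
        self.statements = self.statements[-keep_last:]  # prune steps, keep variables

    def render(self) -> str:
        head = "\n".join(f"{k} = {v!r}" for k, v in self.variables.items())
        return head + "\n" + "\n".join(self.statements)
```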
Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents
HF frames Fed-SE as a way to let LLM agents "self-evolve" across different clients and environments without sharing raw trajectories. For people deploying agents in regulated or siloed settings, it’s an interesting recipe for federated RL that reduces gradient conflicts across heterogeneous tasks. ([huggingface.co](https://huggingface.co/papers/2512.08870))
Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents
Fed-SE is a federated learning framework for LLM agents that must improve across heterogeneous environments under strict privacy constraints. It combines local parameter-efficient fine-tuning on high-return trajectories with global aggregation in a low-rank subspace, reducing negative transfer and boosting average success rates by ~18% over federated baselines. ([huggingface.co](https://huggingface.co/papers/2512.08870))
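The aggregation step is easy to sketch: clients fine-tune low-rank adapters locally and the server averages them in that shared subspace. This weighted average is our simplification, with invented names:

```python
# Weighted averaging of per-client LoRA matrices in the shared low-rank
# subspace (our simplification of Fed-SE's aggregation).
import numpy as np

def aggregate_adapters(client_adapters: list[dict[str, np.ndarray]],
                       weights: list[float]) -> dict[str, np.ndarray]:
    """Each adapter maps names like 'layer0.A' to low-rank matrices."""
    total = sum(weights)
    return {
        key: sum(w * c[key] for w, c in zip(weights, client_adapters)) / total
        for key in client_adapters[0]
    }
```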
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data
Introduces MedInsightBench, a benchmark for ‘analytics agents’ that must reason over multimodal medical data—think tables, images, and reports—to extract multi-step clinical insights rather than just answer single questions. The tasks force agents to chain together retrieval, interpretation, and aggregation across data sources, closer to what real analytics workflows look like in hospitals. This is important if you care about LLM agents that move beyond toy QA into realistic decision support.
WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment
Proposes WebOperator, a web agent framework that uses action-aware tree search to plan sequences of browser actions rather than issuing greedy commands. By modeling the future impact of clicks, form fills, and navigations, the agent can backtrack from bad branches and robustly complete multi-step web tasks. It’s part of the growing trend from ‘prompt a browser wrapper’ toward genuinely search-based web agents.
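A toy best-first variant of the idea, with scoring, expansion, and termination all passed in as stubs (the paper's search and value estimates are more sophisticated):

```python
# Best-first search over browser actions (our toy version): rank frontier
# states by an estimated value so bad branches get abandoned naturally.
import heapq

def tree_search(start, candidates, score, step, is_done, max_nodes=200):
    frontier = [(-score(start), 0, start, [])]   # (priority, tiebreak, state, path)
    n = 0
    while frontier and n < max_nodes:
        _, _, state, path = heapq.heappop(frontier)
        if is_done(state):
            return path                          # action sequence that finished the task
        for action in candidates(state):         # e.g. clicks, form fills, navigations
            n += 1
            nxt = step(state, action)
            heapq.heappush(frontier, (-score(nxt), n, nxt, path + [action]))
    return None                                  # exhausted the budget without success
```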
agents.md
Defines AGENTS.md, a simple open format for describing coding agents: their tools, capabilities, and expectations. It’s trying to do for agents what README and OpenAPI did for repos and APIs—standardize how we document them. ([github.com](https://github.com/trending?since=daily))
MOA: Multi-Objective Alignment for Role-Playing Agents
MOA is an RL framework that jointly optimizes multiple fine-grained rubrics for role-playing agents—such as persona consistency, domain knowledge, and dialogue quality—using multi-objective alignment and thought-augmented rollouts. An 8B model trained with MOA can match or surpass GPT‑4o and Claude on PersonaGym and RoleMRC, suggesting smaller models can be pushed far with better objective design. ([huggingface.co](https://huggingface.co/papers/2512.09756))
Error-Driven Prompt Optimization for Arithmetic Reasoning
Targets the surprisingly hard problem of getting small on‑prem LLMs to do reliable arithmetic over tabular data in regulated environments. The authors propose an error-driven loop that clusters the model’s wrong answers, derives new prompt rules to address those failure modes, and iteratively refines a code-generation agent. On a finance-style deployment with a 4B-parameter model, this strategy reportedly boosts arithmetic accuracy to around 70% while keeping all computation inside the secure environment.
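The loop itself is compact; here is a sketch with the evaluation, clustering, and rule-writing calls stubbed out (all names are ours, not the paper's):

```python
# Error-driven prompt refinement (our sketch): cluster failures, append a
# rule per failure mode, re-evaluate, repeat.

def refine_prompt(base_prompt: str, evaluate, cluster, derive_rule,
                  rounds: int = 3) -> str:
    prompt = base_prompt
    for _ in range(rounds):
        failures = evaluate(prompt)              # wrong answers on the arithmetic suite
        if not failures:
            break                                # nothing left to fix
        for group in cluster(failures):          # group failures by error mode
            prompt += "\n" + derive_rule(group)  # add a rule targeting that mode
    return prompt
```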
VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer
Introduces a safety constraint layer that can be bolted onto vision-language-action (VLA) models to filter unsafe actions before execution. Rather than retraining the whole control stack, VLSA learns a lightweight safety module that reasons jointly over visual context, language goals, and proposed actions. This aligns with the growing push for ‘safety shields’ around otherwise capable but unaligned agents.
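The wrapper pattern is simple to sketch; VLSA's safety module is learned, but here it is any callable (names are ours):

```python
# Plug-and-play safety shield (our sketch): veto proposed actions that a
# safety module rejects and substitute a safe fallback instead.

def safe_policy(policy, safety_check, fallback_action):
    def act(observation, goal):
        action = policy(observation, goal)           # VLA model proposes an action
        if safety_check(observation, goal, action):  # joint check over vision/goal/action
            return action
        return fallback_action                       # veto: return a safe no-op
    return act
```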
MOA: Multi-Objective Alignment for Role-Playing Agents
MOA is highlighted as a way to align role-playing agents along many competing dimensions simultaneously using multi-objective RL and thought-augmented rollouts. It’s especially relevant if you’re trying to get smaller models to behave like premium chatbots in complex, persona-heavy domains. ([huggingface.co](https://huggingface.co/papers/2512.09756))
MAC: A Multi-Agent Framework for Interactive User Clarification in Multi-turn Conversations
Proposes a multi‑agent architecture where specialized conversational agents coordinate to decide when and how to ask clarification questions in ambiguous multi‑turn tasks. Instead of a monolithic assistant, MAC assigns roles and coordination rules so that the ‘right’ agent takes the lead on resolving uncertainty. This is a nice complement to SpeakRL: one focuses on *whether* to clarify, the other on *who* clarifies and how to coordinate in complex workflows.
hello-agents
A Chinese-language tutorial project titled "从零开始构建智能体" (Building Agents from Scratch), walking through agent principles and practical implementations. Good onboarding material if you want to upskill teammates on modern agentic patterns. ([github.com](https://github.com/trending?since=daily))
Achieving Olympiad-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning
InternGeometry is a geometry-solving LLM agent that reaches medalist-level performance on IMO geometry problems by tightly integrating with a symbolic engine. It proposes auxiliary constructions and propositions, verifies them symbolically, reflects on the feedback, and is trained with a complexity-boosting RL curriculum, solving 44 of 50 problems with a tiny fraction of the data required by AlphaGeometry 2.
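A sketch of that propose/verify/reflect loop, with all three calls as hypothetical stand-ins rather than the paper's API:

```python
# Propose a construction, verify it symbolically, reflect on rejections
# (our sketch of the InternGeometry loop).

def solve_geometry(problem: str, propose, symbolic_verify, max_steps: int = 30):
    state, feedback = problem, ""
    for _ in range(max_steps):
        construction = propose(state, feedback)       # e.g. 'draw midpoint M of BC'
        ok, derived = symbolic_verify(state, construction)
        if not ok:
            feedback = derived                        # reflect on the engine's rejection
            continue
        state, feedback = state + "\n" + derived, ""  # extend the verified fact base
        if "GOAL_CLOSED" in derived:                  # engine has proved the statement
            return state
    return None
```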
Accelerate innovation with AI: Introducing the Product Change Management agent template
Azure introduces a Product Change Management agent template that uses AI to orchestrate changes across equipment, products, and processes in manufacturing. It’s a concrete example of "agent-as-template" thinking, where Microsoft ships prebuilt agent workflows tailored to specific industry problems. ([microsoft.com](https://www.microsoft.com/en-us/ai/blog/?utm_source=openai))
Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task
The authors augment multimodal LLMs with a "Video Toolkit" and a STAR (Spatiotemporal Reasoning) framework that orchestrates calls to temporal and spatial tools for video question answering. Instead of treating the video as a black-box embedding, the model actively localizes key regions over time using tools, yielding sizable gains on VideoMME and LongVideoBench when wrapped around GPT-4o.
ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning
ReViSE defines a new Reason-Informed Video Editing task and benchmark, then introduces a unified video model that edits while continuously self-evaluating its own reasoning. A built-in VLM judges whether the edited video logically satisfies the instruction, providing self-reflective feedback that tightens the link between "understanding" and actual visual edits.