Agents

Research papers, repositories, and articles about agents

OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.

Zijian Wu, Lingkai Kong

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system achieves new SOTA on a held-out benchmark while reducing annotation cost.

Songyang Gao, Yuzhe Gu

anthropics/claude-code

Claude Code runs as a terminal-native coding agent that understands your repo and executes commands. It blurs the line between shell, IDE, and assistant, and it’s quickly becoming a default tool for power users.

55,236

anomalyco/opencode

OpenCode is an open-source coding agent that edits and writes code for you, wired into modern tooling. Use it as a local, hackable alternative to proprietary AI dev environments.

61,642

obra/superpowers

Superpowers is a skills library and workflow for coding agents like Claude Code and OpenCode. It bakes in design, planning, testing, and review loops so agents behave like disciplined junior engineers.

28,500

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Builds thousands of synthetic "computers" with realistic files and calendars to simulate month-long knowledge work for AI agents. Each run spans 8+ hours and ~2,000 steps, yielding dense signals for training long-horizon productivity agents. If you are designing office copilots or agent training curricula, copy this setup to cheaply generate rich experience data. ([arxiv.org](https://arxiv.org/abs/2604.28181))

Tao Ge, Baolin Peng

Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

ML-Master 2.0 introduces a "hierarchical cognitive cache" that separates short-term logs from long-term strategy for AI agents working for days on ML engineering tasks. It hits state-of-the-art on MLE-Bench, hinting at how to run week-long research agents.

Xinyu Zhu, Yuzhu Cai
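
The split between short-term logs and long-term strategy can be pictured with a toy two-tier store. This is a minimal sketch under stated assumptions, not ML-Master 2.0's implementation: the class name `TwoTierCache`, its methods, and the eviction policy are all hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Deque, List


class TwoTierCache:
    """Hypothetical two-tier memory: a bounded recent log plus durable strategy notes."""

    def __init__(self, log_capacity: int = 3) -> None:
        # Short-term tier: only the most recent entries survive.
        self.short_term: Deque[str] = deque(maxlen=log_capacity)
        # Long-term tier: distilled lessons that are never evicted.
        self.long_term: List[str] = []

    def log(self, entry: str) -> None:
        """Record a raw step; the oldest entry is dropped once capacity is reached."""
        self.short_term.append(entry)

    def promote(self, lesson: str) -> None:
        """Store a distilled lesson once, ignoring exact repeats."""
        if lesson not in self.long_term:
            self.long_term.append(lesson)

    def context(self) -> str:
        """Assemble the text an agent would see: strategy first, then recent logs."""
        return "\n".join(["# strategy", *self.long_term, "# recent", *self.short_term])


cache = TwoTierCache(log_capacity=2)
cache.log("run 1: lr=0.1 diverged")
cache.log("run 2: lr=0.01 ok")
cache.log("run 3: lr=0.001 slow")
cache.promote("keep lr near 0.01")
cache.promote("keep lr near 0.01")  # duplicate, ignored
```

Keeping the two tiers separate means a multi-day run can discard raw logs without losing the lessons drawn from them.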

Memory in the Age of AI Agents

A substantial survey that systematizes the fast-growing literature on ‘agent memory’—how agentic LLM systems store, retrieve, and evolve information over time. It proposes a taxonomy across forms (token, parametric, latent), functions (factual, experiential, working) and dynamics, and catalogs existing benchmarks and frameworks. If you’re building agent systems with nontrivial memory, this is quickly becoming the reference map of the territory.

Yuyang Hu, Shichun Liu

SakanaAI/AI-Scientist-v2

Implements AI Scientist v2, which runs agentic tree search over experiments. Pushes toward semi-automated scientific discovery instead of just paper drafting.

4,938

Partnering with Mozilla to improve Firefox’s security

Anthropic used Claude Opus 4.6 to scan Firefox’s code and surfaced 22 new vulnerabilities, 14 rated high severity. The post lays out a playbook for pairing AI bug hunters with human maintainers safely.

Anthropic Newsroom

openai/codex

A lightweight coding agent that runs directly in your terminal, wiring OpenAI models into a loop that edits files, runs tests, and applies patches. Compared to IDE plugins, it’s closer to a shell-native ‘pair programmer’ that can operate on entire repos and workflows. Given its rapid adoption and tight integration with existing CLIs, it’s poised to become a reference design for terminal-first code agents.

54,000

block/goose

Open-source AI agent that installs, edits, executes, and tests code with any language model. Targets real workflows, not just inline suggestions.

36,989

openclaw/openclaw

Cross-platform personal AI assistant that runs anywhere. Targets power users who want a local-first, extensible agent instead of being locked into one vendor.

281,298

Reasoning Models Generate Societies of Thought

This work argues that strong reasoning models behave like small societies of internal agents with different "personalities" and expertise. Diversity and internal debate, not just longer chains of thought, drive their gains.

Junsol Kim, Shiyang Lai

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

The authors design a reward scheme that scores agents on how well they build evidence chains with proper citations, not just final answers. Their new training method reduces shortcut tricks and hallucinated claims, so deep research agents behave more like careful analysts.

Jiajie Zhang, Xin Lv
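
To make "scoring the evidence chain, not just the final answer" concrete, here is a toy reward function. It is an illustrative sketch only: the function name `citation_aware_reward`, the claim format, the substring check, and the multiplicative weighting are all assumptions, not the rubric or training method from the paper.

```python
from typing import Dict, List


def citation_aware_reward(
    claims: List[Dict[str, str]],
    sources: Dict[str, str],
    final_correct: bool,
) -> float:
    """Toy reward: fraction of claims whose quoted evidence appears in the cited source.

    Each claim is {"evidence": <quoted text>, "cite": <source id>}.
    The chain score is multiplied by answer correctness, so a correct answer
    with no supported evidence earns nothing (an assumed design choice).
    """
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        source_text = sources.get(claim["cite"], "")
        if claim["evidence"] and claim["evidence"] in source_text:
            supported += 1
    chain_score = supported / len(claims)
    return chain_score * (1.0 if final_correct else 0.0)


sources = {
    "doc1": "The Eiffel Tower opened in 1889.",
    "doc2": "Paris is the capital of France.",
}
claims = [
    {"evidence": "opened in 1889", "cite": "doc1"},  # supported by doc1
    {"evidence": "built in 1900", "cite": "doc2"},   # not found in doc2
]
reward = citation_aware_reward(claims, sources, final_correct=True)
```

A real system would replace the substring check with a learned or rubric-based judge; the point here is only that unsupported claims lower the reward even when the answer is right.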

eigent-ai/eigent

Eigent is a desktop app for running multi-agent AI workflows locally. It orchestrates specialized workers, tools, and context so agents can execute long, complex jobs for you.

2,455

In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks

Directly compares workflow graphs managed by external orchestrators to a single prompt that spells out the whole procedure. For travel, tech support, and claims flows, one big prompt beats complex agent tooling on quality and failure rates. If your product is more orchestration code than prompt, this paper says simplify before you scale. ([arxiv.org](https://arxiv.org/abs/2604.27891))

Simon Dennis, Michael Diamond

luongnv89/claude-howto

Hands-on guide for Claude Code, from basics to multi-agent setups. Gives copy-paste templates, diagrams, and a learning path for serious use.

20,351

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Spider-Sense bakes a threat detector into the agent itself, so it only runs heavy safety checks when it senses risk. It keeps attack success low and false positives rare while adding little delay.

Zhenxiong Yu, Zhi Yang

Reinforcement World Model Learning for LLM-based Agents

The authors train a world model that predicts how text-based environments change when an agent acts, using rewards that close the gap between imagined and real next states. Agents using this learned model perform better on ALFWorld and τ² Bench than agents trained only on task success signals.

Xiao Yu, Baolin Peng

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.

Asa Cooper Stickland, Jan Michelfeit

dify

A very popular production-ready platform for building agentic workflows and applications, with UI, orchestration, and deployment all in one. Given its star growth, it’s becoming a de facto choice for many teams moving beyond simple RAG bots. ([github.com](https://github.com/trending?since=daily))

121,651

NousResearch/hermes-agent

General-purpose AI agent framework that grows with user needs. Focuses on composable tools and skills instead of one fixed workflow.

26,246

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

MemSkill turns memory operations into skills that an agent can learn, select, and even redesign over time. It beats hand-written memory pipelines on long conversations, documents, and embodied tasks like ALFWorld.

Haozhen Zhang, Quanyu Long

ruvnet/ruflo

Agent orchestration platform tuned for Claude-based systems. Focuses on multi-agent swarms, enterprise deployments, and built-in RAG and code workflows. If you’re standardizing on Claude for serious products, study this before rolling your own orchestrator. ([github.com](https://github.com/trending?since=daily))

38,646

Heterogeneous Scientific Foundation Model Collaboration

Introduces Eywa, a framework that lets language models coordinate with domain-specific scientific models across non-text data. Treats those models as tools inside an agent system and studies planning strategies across them. If you’re building AI for science, this shows how to wire specialized models into one reasoning loop. ([huggingface.co](https://huggingface.co/papers/2604.27351))

Zihao Li, Jiaru Zou

Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

Treats methods, not papers, as first-class nodes in a huge evolution graph of AI research. Lets you query how techniques emerged, combined, and replaced each other, then use that to rate or generate new ideas. If you invest in research strategy, this is basically a map of the territory. ([huggingface.co](https://huggingface.co/papers/2604.28158))

Yujun Wu, Dongxu Zhang

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

Surveys how teams use reinforcement learning plus GUI interaction to push beyond simple desktop macros into always-on "digital inhabitants". Breaks the space into offline, online, and hybrid strategies, and highlights trends like world-model training and process-level rewards. If you’re automating real GUI workflows, treat this as a roadmap, not just a survey. ([arxiv.org](https://arxiv.org/abs/2604.27955))

Junan Hu, Jian Liu

Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

Builds a massive graph of how AI methods evolve across 1M+ papers, with over 9M typed edges between techniques. Lets agents and humans trace method lineages, score idea novelty, and auto-generate new research directions. If you scout research or design AI research agents, treat this as a new data layer, not just another paper.

Yujun Wu, Dongxu Zhang

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Lays out a five-level roadmap for visual generation, from basic image mapping up to interactive world modeling for agents. Argues the next race is about structure, memory, and causality, not prettier pictures. If you work on vision models, benchmark against these levels, not just FID-style metrics. ([huggingface.co](https://huggingface.co/papers/2604.28185))

Keming Wu, Zuhao Yang

Rethinking Agentic Reinforcement Learning In Large Language Models

Synthesizes the fast-growing literature on reinforcement learning for agent-style language models, from environment design to safety and compute limits. Argues the key shift is treating models as long-lived decision-makers, not one-shot text generators. If you’re planning big training runs for agents, use this as a design checklist, not just a citation. ([databubble.co](https://databubble.co/news/rethinking-agentic-reinforcement-learning-in-large-language-models?utm_source=openai))

Fangming Cui, Ruixiao Zhu

onyx-dot-app/onyx

Full-stack open source AI chat platform that plugs into many models. Ships with advanced chat features, memory, and multi-user workspaces.

25,005

badlogic/pi-mono

Agent toolkit with a coding-agent CLI, unified LLM API, UI libraries, and Slack bot. Focuses on wiring agents into real dev environments.

31,908

Yeachan-Heo/oh-my-claudecode

Teams-first orchestration layer around Claude Code. Manages multi-agent workflows for orgs instead of single-user toy projects.

24,466

GTC 2026 Insights: Through the Dell Enterprise Hub Lens

Explains how Dell’s Enterprise Hub plus Hugging Face models turn “deploy a model” into a one-command task. Highlights container versioning, multi-vendor GPUs, and a Python SDK that hides infra pain.

Hugging Face Blog

GoogleCloudPlatform/generative-ai

Large collection of Gemini on Vertex AI notebooks and sample apps. Great starting point if you want to build production-style systems on Google Cloud fast.

14,457

CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

CAR-bench builds an in-car assistant world with messy, ambiguous user requests and many tools. It measures not just if agents finish tasks, but whether they know when they’re out of their depth.

Johannes Kirmayr, Lukas Stappen

Reinforcement World Model Learning for LLM-based Agents

RWML trains agents to imagine next states and then line them up with reality, instead of just predicting the next token. That shift gives stronger gains on text-based environments than reward-on-final-score alone.

Xiao Yu, Baolin Peng

OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

OdysseyArena tests agents on environments where they must discover hidden rules over hundreds of steps, not just follow given instructions. Even top models struggle, showing that long-run discovery and strategy remain weak points.

Fangzhi Xu, Hang Yan

AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems

AstroReason-Bench tests agents on realistic satellite scheduling and space-mission planning rather than toy puzzles. Current agentic LLM systems lag far behind hand-built solvers, giving a sharp reality check for "generalist" planning claims.

Weiyi Wang, Xinchi Chen

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

ToolSafe builds a guardrail model that watches each tool call an agent plans to make and flags dangerous ones before they run. In tool-using agents under prompt-injection attacks, it slashes harmful calls while slightly improving task success.

Yutao Mou, Zhangchi Xue
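
The idea of checking every planned tool call before it runs can be sketched as a thin wrapper around tool execution. This is a hedged illustration, not ToolSafe itself: the paper uses a guardrail model, whereas the `guardrail` function below is a hypothetical keyword filter standing in for it, and `ToolCall`, `BLOCKED_PATTERNS`, and `run_with_guardrail` are invented names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolCall:
    name: str
    argument: str


# Hypothetical rule-based stand-in for a learned guardrail model.
BLOCKED_PATTERNS = ("rm -rf", "DROP TABLE")


def guardrail(call: ToolCall) -> bool:
    """Return True if the planned call looks safe to run."""
    return not any(pattern in call.argument for pattern in BLOCKED_PATTERNS)


def run_with_guardrail(
    plan: List[ToolCall],
    tools: Dict[str, Callable[[str], str]],
) -> List[str]:
    """Check each call before executing it; flagged calls return feedback instead."""
    results: List[str] = []
    for call in plan:
        if not guardrail(call):
            # The tool is never invoked; the agent receives feedback it can act on.
            results.append(f"BLOCKED {call.name}: flagged before execution")
            continue
        results.append(tools[call.name](call.argument))
    return results


tools = {"shell": lambda cmd: f"ran: {cmd}"}
plan = [ToolCall("shell", "ls docs"), ToolCall("shell", "rm -rf /")]
outcome = run_with_guardrail(plan, tools)
```

Because the check sits in front of each individual call, a safe step still executes even when a later step in the same plan is blocked.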

bytedance/UI-TARS-desktop

UI-TARS is a full desktop stack for multimodal AI agents, connecting top models with tools, memory, and UI. If you want to ship serious agent apps, this gives you infrastructure instead of starting from scratch.

22,746

simstudioai/sim

Sim is an open platform for building and deploying AI agent workflows end to end. It focuses on visual orchestration, so teams can compose tools, models, and memory without hand-rolling brittle pipelines.

25,414

openai/openai-cookbook

The OpenAI cookbook is a large set of worked examples for building with OpenAI’s API. Treat it as a pattern library for chat apps, agents, RAG systems, and fine-grained evaluations.

70,628

letta-ai/letta

Letta is a framework for long-lived agents with memory and tools. Use it to build assistants that remember projects across weeks, not just within a single prompt.

19,930

Adaptation of Agentic AI

This large-scale study tracks how agent-like AI systems adapt over time and across tasks. If you're betting on agents, it gives structure and warnings for long-term deployment.

Pengcheng Jiang, Jiacheng Lin

thedotmack/claude-mem

A Claude Code plugin that logs your coding sessions, compresses them with Claude via the agent SDK, and feeds back relevant context into future sessions. In practice it acts like a persistent, AI-managed memory of your projects, making the assistant far more ‘aware’ of the codebase and past conversations. It’s a concrete, production-friendly take on the “long-term memory for coding agents” idea.

7,300

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Introduces NL2Repo-Bench, a benchmark where coding agents must generate or modify entire repositories from natural language specifications, rather than solving single-file LeetCode-style tasks. It evaluates long-horizon planning, tool use, and consistency across files and modules. This is a big step toward evaluating code agents in settings that look like real software projects instead of toy problems.

Jingzhe Ding, Shengda Long

CopilotKit

React UI components plus backend infrastructure for building in-app AI copilots, chatbots, and agentic workflows. It’s becoming a go-to choice if you want "agentic frontends" without wiring everything from scratch. ([github.com](https://github.com/trending?since=daily))

26,435

Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

The paper presents Confucius Code Agent as an industrial-strength open coding agent with hierarchical working memory, persistent notes, and a meta-agent that continuously refines configurations. If you care about reproducible, extensible coding agents rather than opaque SaaS tools, this is a substantial systems paper. ([huggingface.co](https://huggingface.co/papers/2512.10398))

Zhaodong Wang, Zhenting Qi