Agents
Research papers, repositories, and articles about agents
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system sets a new state of the art on a held-out benchmark while reducing annotation cost.
anthropics/claude-code
Claude Code runs as a terminal-native coding agent that understands your repo and executes commands. It blurs the line between shell, IDE, and assistant, and it’s quickly becoming a default tool for power users.
anomalyco/opencode
OpenCode is an open-source coding agent that edits and writes code for you, wired into modern tooling. Use it as a local, hackable alternative to proprietary AI dev environments.
obra/superpowers
Superpowers is a skills library and workflow for coding agents like Claude Code and OpenCode. It bakes in design, planning, testing, and review loops so agents behave like disciplined junior engineers.
Synthetic Computers at Scale for Long-Horizon Productivity Simulation
Builds thousands of synthetic "computers" with realistic files and calendars to simulate month-long knowledge work for AI agents. Each run spans 8+ hours and ~2,000 steps, yielding dense signals for training long-horizon productivity agents. If you are designing office copilots or agent training curricula, copy this setup to cheaply generate rich experience data. ([arxiv.org](https://arxiv.org/abs/2604.28181))
Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
ML-Master 2.0 introduces a "hierarchical cognitive cache" that separates short-term logs from long-term strategy for AI agents working for days on ML engineering tasks. It hits state-of-the-art on MLE-Bench, hinting at how to run week-long research agents.
Memory in the Age of AI Agents
A substantial survey that systematizes the fast-growing literature on ‘agent memory’—how agentic LLM systems store, retrieve, and evolve information over time. It proposes a taxonomy across forms (token, parametric, latent), functions (factual, experiential, working) and dynamics, and catalogs existing benchmarks and frameworks. If you’re building agent systems with nontrivial memory, this is quickly becoming the reference map of the territory.
SakanaAI/AI-Scientist-v2
Implements AI Scientist v2, which runs agentic tree search over experiments. Pushes toward semi-automated scientific discovery instead of just paper drafting.
Partnering with Mozilla to improve Firefox’s security
Anthropic used Claude Opus 4.6 to scan Firefox’s code and surfaced 22 new vulnerabilities, 14 rated high severity. The post lays out a playbook for pairing AI bug hunters with human maintainers safely.
openai/codex
A lightweight coding agent that runs directly in your terminal, wiring OpenAI models into a loop that edits files, runs tests, and applies patches. Compared to IDE plugins, it’s closer to a shell-native ‘pair programmer’ that can operate on entire repos and workflows. Given its rapid adoption and tight integration with existing CLIs, it’s poised to become a reference design for terminal-first code agents.
block/goose
Open-source AI agent that installs, edits, executes, and tests code with any language model. Targets real workflows, not just inline suggestions.
openclaw/openclaw
Cross-platform personal AI assistant that runs anywhere. Targets power users who want a local-first, extensible agent instead of being locked into one vendor.
Reasoning Models Generate Societies of Thought
This work argues that strong reasoning models behave like small societies of internal agents with different "personalities" and expertise. Diversity and internal debate, not just longer chains of thought, drive their gains.
Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
The authors design a reward scheme that scores agents on how well they build evidence chains with proper citations, not just final answers. Their new training method reduces shortcut tricks and hallucinated claims, so deep research agents behave more like careful analysts.
eigent-ai/eigent
Eigent is a desktop app for running multi-agent AI workflows locally. It orchestrates specialized workers, tools, and context so agents can execute long, complex jobs for you.
In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks
Directly compares workflow graphs managed by external orchestrators to a single prompt that spells out the whole procedure. For travel, tech support, and claims flows, one big prompt beats complex agent tooling on quality and failure rates. If your product is more orchestration code than prompt, this paper says simplify before you scale. ([arxiv.org](https://arxiv.org/abs/2604.27891))
luongnv89/claude-howto
Hands-on guide for Claude Code, from basics to multi-agent setups. Gives copy-paste templates, diagrams, and a learning path for serious use.
Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening
Spider-Sense bakes a threat detector into the agent itself, so it only runs heavy safety checks when it senses risk. It keeps attack success low and false positives rare while adding little delay.
Reinforcement World Model Learning for LLM-based Agents
The authors train a world model that predicts how text-based environments change when an agent acts, using rewards that close the gap between imagined and real next states. Agents using this learned model perform better on ALFWorld and τ² Bench than agents trained only on task success signals.
Async Control: Stress-testing Asynchronous Control Measures for LLM Agents
Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.
dify
A very popular production-ready platform for building agentic workflows and applications, with UI, orchestration, and deployment all in one. Given its star growth, it’s becoming a de facto choice for many teams moving beyond simple RAG bots. ([github.com](https://github.com/trending?since=daily))
NousResearch/hermes-agent
General-purpose AI agent framework that grows with user needs. Focuses on composable tools and skills instead of one fixed workflow.
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
MemSkill turns memory operations into skills that an agent can learn, select, and even redesign over time. It beats hand-written memory pipelines on long conversations, documents, and embodied tasks like ALFWorld.
ruvnet/ruflo
Agent orchestration platform tuned for Claude-based systems. Focuses on multi-agent swarms, enterprise deployments, and built-in RAG and code workflows. If you’re standardizing on Claude for serious products, study this before rolling your own orchestrator. ([github.com](https://github.com/trending?since=daily))
Heterogeneous Scientific Foundation Model Collaboration
Introduces Eywa, a framework that lets language models coordinate with domain-specific scientific models across non-text data. Treats those models as tools inside an agent system and studies planning strategies across them. If you’re building AI for science, this shows how to wire specialized models into one reasoning loop. ([huggingface.co](https://huggingface.co/papers/2604.27351))
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
Treats methods, not papers, as first-class nodes in a huge evolution graph of AI research. Lets you query how techniques emerged, combined, and replaced each other, then use that to rate or generate new ideas. If you invest in research strategy, this is basically a map of the territory. ([huggingface.co](https://huggingface.co/papers/2604.28158))
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
Surveys how teams use reinforcement learning plus GUI interaction to push beyond simple desktop macros into always-on "digital inhabitants". Breaks the space into offline, online, and hybrid strategies, and highlights trends like world-model training and process-level rewards. If you’re automating real GUI workflows, treat this as a roadmap, not just a survey. ([arxiv.org](https://arxiv.org/abs/2604.27955))
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
Builds a massive graph of how AI methods evolve across 1M+ papers, with over 9M typed edges between techniques. Lets agents and humans trace method lineages, score idea novelty, and auto-generate new research directions. If you scout research or design AI research agents, treat this as a new data layer, not just another paper.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Lays out a five-level roadmap for visual generation, from basic image mapping up to interactive world modeling for agents. Argues the next race is about structure, memory, and causality, not prettier pictures. If you work on vision models, benchmark against these levels, not just FID-style metrics. ([huggingface.co](https://huggingface.co/papers/2604.28185))
Rethinking Agentic Reinforcement Learning In Large Language Models
Synthesizes the fast-growing literature on reinforcement learning for agent-style language models, from environment design to safety and compute limits. Argues the key shift is treating models as long-lived decision-makers, not one-shot text generators. If you’re planning big training runs for agents, use this as a design checklist, not just a citation. ([databubble.co](https://databubble.co/news/rethinking-agentic-reinforcement-learning-in-large-language-models))
onyx-dot-app/onyx
Full-stack open source AI chat platform that plugs into many models. Ships with advanced chat features, memory, and multi-user workspaces.
badlogic/pi-mono
Agent toolkit with a coding-agent CLI, unified LLM API, UI libraries, and Slack bot. Focuses on wiring agents into real dev environments.
Yeachan-Heo/oh-my-claudecode
Teams-first orchestration layer around Claude Code. Manages multi-agent workflows for orgs instead of single-user toy projects.
GTC 2026 Insights: Through the Dell Enterprise Hub Lens
Explains how Dell’s Enterprise Hub plus Hugging Face models turn “deploy a model” into a one-command task. Highlights container versioning, multi-vendor GPUs, and a Python SDK that hides infra pain.
GoogleCloudPlatform/generative-ai
Large collection of Gemini on Vertex AI notebooks and sample apps. Great starting point if you want to build production-style systems on Google Cloud fast.
CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
CAR-bench builds an in-car assistant world with messy, ambiguous user requests and many tools. It measures not just if agents finish tasks, but whether they know when they’re out of their depth.
Reinforcement World Model Learning for LLM-based Agents
RWML trains agents to imagine next states and then line them up with reality, instead of just predicting the next token. That shift gives stronger gains on text-based environments than reward-on-final-score alone.
OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions
OdysseyArena tests agents on environments where they must discover hidden rules over hundreds of steps, not just follow given instructions. Even top models struggle, showing that long-run discovery and strategy remain weak points.
AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems
AstroReason-Bench tests agents on realistic satellite scheduling and space-mission planning rather than toy puzzles. Current agentic LLM systems lag far behind hand-built solvers, giving a sharp reality check for "generalist" planning claims.
ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback
ToolSafe builds a guardrail model that watches each tool call an agent plans to make and flags dangerous ones before they run. In tool-using agents under prompt-injection attacks, it slashes harmful calls while slightly improving task success.
bytedance/UI-TARS-desktop
UI-TARS is a full desktop stack for multimodal AI agents, connecting top models with tools, memory, and UI. If you want to ship serious agent apps, this gives you infrastructure instead of starting from scratch.
simstudioai/sim
Sim is an open platform for building and deploying AI agent workflows end to end. It focuses on visual orchestration, so teams can compose tools, models, and memory without hand-rolling brittle pipelines.
openai/openai-cookbook
The OpenAI Cookbook is a large set of worked examples for building with OpenAI’s API. Treat it as a pattern library for chat apps, agents, RAG systems, and fine-grained evaluations.
letta-ai/letta
Letta is a framework for long-lived agents with memory and tools. Use it to build assistants that remember projects across weeks instead of forgetting everything once a prompt ends.
Adaptation of Agentic AI
This large-scale study tracks how agent-like AI systems adapt over time and across tasks. If you're betting on agents, it gives structure and warnings for long-term deployment.
thedotmack/claude-mem
A Claude Code plugin that logs your coding sessions, compresses them with Claude via the agent SDK, and feeds back relevant context into future sessions. In practice it acts like a persistent, AI-managed memory of your projects, making the assistant far more ‘aware’ of the codebase and past conversations. It’s a concrete, production-friendly take on the “long-term memory for coding agents” idea.
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Introduces NL2Repo-Bench, a benchmark where coding agents must generate or modify entire repositories from natural language specifications, rather than solving single-file LeetCode-style tasks. It evaluates long-horizon planning, tool use, and consistency across files and modules. This is a big step toward evaluating code agents in settings that look like real software projects instead of toy problems.
CopilotKit
React UI components plus backend infrastructure for building in-app AI copilots, chatbots, and agentic workflows. It’s becoming a go-to choice if you want "agentic frontends" without wiring everything from scratch. ([github.com](https://github.com/trending?since=daily))
Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
The authors pitch Confucius Code Agent as an industrial-strength open coding agent with hierarchical working memory, persistent notes, and a meta-agent that continuously refines configurations. If you care about reproducible, extensible coding agents rather than opaque SaaS tools, this is a substantial systems paper. ([huggingface.co](https://huggingface.co/papers/2512.10398))