Agentic Systems
Autonomous agents, tool use, multi-agent collaboration, and extended autonomous action. AI that can act in the world.
Key Benchmarks
Recent Papers
CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
Johannes Kirmayr, Lukas Stappen, Elisabeth André
Optimization is Not Enough: Why Problem Formulation Deserves Equal Attention
Iván Olarte Rodríguez, Gokhan Serhat, Mariusz Bujny +3 more
Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening
Zhenxiong Yu, Zhi Yang, Zhiheng Jin +19 more
OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions
Fangzhi Xu, Hang Yan, Qiushi Sun +16 more
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Haozhen Zhang, Quanyu Long, Jianzhu Bao +4 more
Reinforcement World Model Learning for LLM-based Agents
Xiao Yu, Baolin Peng, Ruize Xu +6 more
Reinforcement World Model Learning for LLM-based Agents
Xiao Yu, Baolin Peng, Ruize Xu +6 more
ARC Prize 2025: Technical Report
François Chollet, Mike Knoop, Gregory Kamradt +1 more
CTHA: Constrained Temporal Hierarchical Architecture for Stable Multi-Agent LLM Systems
Percy Jardine
Reasoning Models Generate Societies of Thought
Junsol Kim, Shiyang Lai, Nino Scherrer +2 more
Recent Milestones
Claude agents build 100k‑line C compiler in Rust
Techzine reported on February 10 that 16 instances of Anthropic’s Claude Opus 4.6, running as independent AI agents, collaboratively wrote roughly 100,000 lines of Rust code to create a working C compiler. The experiment, originally detailed by Anthropic researcher Nicholas Carlini, took about two weeks, cost around US$20,000 in API usage, and produced a compiler capable of building a Linux 6.9 kernel and major open‑source projects.
Harvey Legal AI Chases $11B Valuation
TechCrunch reports that legal AI startup Harvey is in talks to raise about $200 million at an $11 billion valuation, led by Sequoia and Singapore’s GIC. The company, which sells LLM‑based tools to law firms, has already climbed from a $3 billion valuation in early 2025 to $8 billion after a $160 million round late last year.([techcrunch.com](https://techcrunch.com/2026/02/09/harvey-reportedly-raising-at-11b-valuation-just-months-after-it-hit-8b/))
Claude Cowork Triggers SaaS Panic
On February 8, 2026, Chilean outlet G5noticias reported that the launch of Anthropic’s Claude Cowork automation tools triggered sell‑offs in software, legal and financial services stocks. Analysts fear enterprises could internalize tasks previously outsourced to SaaS and data providers.
Goldman rolls out Claude agents across the bank
Goldman Sachs is rolling out AI agents built with Anthropic’s Claude model to automate trade accounting, compliance checks and client onboarding after six months of joint development. CIO Marco Argenti said on February 8, 2026 that the agents, already tested internally, will launch soon to support back-office staff handling a portion of the bank’s $2.5 trillion in assets under supervision.
Goldman deploys Claude AI agents in back office
On February 7, 2026, Indian outlet The Financial Express reported that Goldman Sachs is rolling out Anthropic’s Claude‑based AI agents to automate accounting, paperwork, and client onboarding tasks after six months of co‑development. MLQ.ai and other tech outlets say the agents already support workflows tied to a portion of Goldman’s $2.5 trillion in assets under supervision and have cut onboarding times by about 30%. ([financialexpress.com](https://www.financialexpress.com/life/technology-goldman-sachs-adopts-anthropic-ai-agents-to-automate-accounting-and-back-office-tasks-4134437/))
SoftBank bets on OpenAI Frontier for Japan Inc
On Feb. 6, 2026, SB OAI Japan and SoftBank announced they will base their “Crystal Intelligence” enterprise AI solution on OpenAI’s new Frontier platform.([softbank.jp](https://www.softbank.jp/corp/news/press/sbkk/2026/20260206_01/)) Frontier is being validated inside SoftBank, with a full rollout to Japanese enterprises targeted for 2026.
OpenAI Frontier launches enterprise agent platform
On February 5, 2026, OpenAI announced Frontier, a new platform for enterprises to build, deploy and govern AI agents across their existing systems, initially available to a limited set of customers. TechCrunch reported the launch at 10:09 a.m. PST, noting early adopters including HP, Oracle, State Farm, Thermo Fisher and Uber.([openai.com](https://openai.com/index/introducing-openai-frontier/))
Claude Opus 4.6 adds 1M context and agent teams
Anthropic launched Claude Opus 4.6 on February 5, 2026, adding “agent teams” that split large tasks across multiple coordinated agents and extending context to 1 million tokens. TechCrunch reported the release at 9:51 a.m. PST, highlighting gains in multi‑step coding, analytics and new PowerPoint integration.([techcrunch.com](https://techcrunch.com/2026/02/05/anthropic-releases-opus-4-6-with-new-agent-teams/))
GPT‑5.3‑Codex pushes state-of-the-art coding
OpenAI released GPT‑5.3‑Codex, a new agentic coding model, on February 5, 2026, claiming state‑of‑the‑art performance on key coding and agent benchmarks and 25% faster execution than GPT‑5.2‑Codex. TechCrunch reported the launch at 12:01 p.m. PST, shortly after Anthropic unveiled its own upgraded Claude Opus 4.6 model for agentic coding.([openai.com](https://openai.com/index/introducing-gpt-5-3-codex/))
ElevenLabs Raises $500M for Voice & Agentic AI
On February 5, 2026, ElevenLabs announced a $500 million Series D round at an $11 billion valuation, led by Sequoia Capital with heavy participation from Andreessen Horowitz and Iconiq. The company says it will use the capital to expand its ElevenAgents conversational AI platform, invest in “audio AGI” research and grow teams in hubs including Tokyo, Bengaluru, London and New York. ([prtimes.jp](https://prtimes.jp/main/html/rd/p/000000023.000160611.html))
Roblox 4D AI Lets Players Spawn Live Objects
Roblox opened an “open beta” for its new 4D creation feature on February 4, 2026, letting users generate fully interactive objects rather than static 3D models. The system builds on Roblox’s Cube 3D generative model, which has already produced over 1.8 million user-created assets since launch.([techcrunch.com](https://techcrunch.com/2026/02/04/robloxs-4d-creation-feature-is-now-available-in-open-beta/))
Claude Cowork Triggers $300B 'SaaSpocalypse
Between February 3–4, 2026, global software and data stocks lost roughly $300 billion in market value after Anthropic launched new Claude Cowork plugins, including a legal automation tool, that investors fear could undercut traditional software and IT services. Indian IT and BPO stocks fell up to 8–10%, while major U.S. and European software names also sold off sharply.([economictimes.indiatimes.com](https://economictimes.indiatimes.com/markets/stocks/news/rs-1-75-lakh-crore-saaspocalypse-for-it-stocks-explained-what-it-means-for-investors/articleshow/127900151.cms?from=mdr))
Gemini Tests Full Android App Control
Chinese tech outlet IT之家 reports that strings in the Google app 17.4 beta reveal an experimental “Complete tasks with Gemini” feature, codenamed “bonobo,” enabling Gemini to control Android apps via screen automation. The prototype would let Gemini place orders or book rides inside specified apps, with Google warning users to closely supervise its actions.([finance.sina.com.cn](https://finance.sina.com.cn/tech/digi/2026-02-04/doc-inhkrtwh2936393.shtml))
Alexa+ Becomes Free Prime-Scale AI Agent
On February 4, 2026, Amazon made Alexa+, its generative AI-powered assistant, available to all U.S. customers, bundling unlimited access into Prime memberships and offering a $19.99/month standalone tier. Alexa+ runs on large language models including Amazon Nova and Anthropic and adds agentic capabilities like booking services, managing calendars, and handling complex requests across devices.([techcrunch.com](https://techcrunch.com/2026/02/04/alexa-amazons-ai-assistant-is-now-available-to-everyone-in-the-u-s/))
ElevenLabs raises $500M for voice & agent platform
ElevenLabs announced on February 4, 2026 that it raised a $500 million Series D round valuing the company at $11 billion, led by Sequoia Capital with major follow‑on investments from Andreessen Horowitz and ICONIQ. On February 5, 2026, outlets including eWeek and Music Business Worldwide reported the funding as one of the largest private generative‑AI deals to date.([elevenlabs.io](https://elevenlabs.io/blog/series-d))
Zoom’s federated AI tops Humanity’s Last Exam
On January 21, 2026, UC Today reported that Zoom’s federated AI system scored 53.0 on the Humanity’s Last Exam benchmark, outperforming OpenAI’s GPT‑5.2 Pro and Google’s Gemini Deep Research. Zoom attributes the result to its architecture that routes queries across its own small models and partner models from OpenAI, Anthropic, Meta and others.
Baidu tests multi-agent AI group chats
On January 21, 2026, Baidu’s Wenxin (Ernie) app confirmed internal testing of a new “multi-person, multi-agent” group chat feature. Company insiders said the tool is meant for task-focused AI collaboration, not to compete directly with WeChat.
AI Slows Hiring at India’s IT Giants
The Register reports that India’s top four IT outsourcers — HCL, Infosys, TCS and Wipro — added just 3,910 staff over the past year, an unusually slow pace given historic quarterly hiring of 10,000+ workers. Management at all four firms told investors they are leaning heavily on AI to automate client work and streamline their own operations. ([theregister.com](https://www.theregister.com/2026/01/19/hcl_infosys_tcs_wipro_results/))
Claude Cowork Shows Agents Hitting SaaS
Semafor reports that Anthropic’s new Claude Cowork agent has contributed to a selloff in SaaS software stocks, with investors worried about automation of white-collar workflows. The tool, launched January 12 and now going viral, lets Claude autonomously build apps, edit and organize local files, and handle complex office tasks with minimal input. ([semafor.com](https://www.semafor.com/article/01/18/2026/anthropics-new-ai-agent-claude-cowork-pushes-down-software-stocks?utm_source=openai))
Claude agents power large cyber‑espionage op
On January 17, 2026, Cyber Magazine reported on Anthropic’s November 2025 disclosure that a Chinese state‑backed group used its Claude Code tool to automate 80–90% of a cyber‑espionage campaign against about 30 organizations worldwide. Anthropic says it detected the activity in mid‑September 2025, shut down the attackers’ access within 10 days and published a 13‑page technical report describing the AI‑orchestrated operation.([cybermagazine.com](https://cybermagazine.com/news/ai-agents-drive-first-large-scale-autonomous-cyberattack))