How Anthropic Built Multi-Agent Deep Research

15x the tokens, 90.2% better answers, and the three decisions that make it work.

May 23, 2026

🧭 Part 13 of the 🤖 Agents course

You hand Claude a question: “Find every board member of every Information Technology company in the S&P 500.” A single agent will work through it the way you would: search company one, scrape the result, search company two, repeat. By the time it finishes, you’ve spent an hour and burned through the context window twice. There are 65 companies on that list. That’s at least 65 sequential web searches plus a few hundred follow-ups to chase ambiguous names. The query is breadth-first by nature, and a single agent can only walk it sequentially, one company at a time.

Anthropic shipped a different answer in April 2025: a Research feature[1] that spawns parallel subagents, each with its own context window, each chasing one independent thread. On their internal eval, this setup beat single-agent Claude Opus 4 by 90.2%. The cost: roughly 15x the tokens of a normal chat[2].

This issue is about what the architecture is, when the 15x cost is worth paying, and the production decisions that separate “neat demo” from “shipped to millions.”

TL;DR

What it is: Claude’s Research is an orchestrator-worker system. A lead agent plans, spins up 3-5 specialized subagents in parallel, and synthesizes their findings, with a separate citation pass at the end.
What it solves: Breadth-first research questions where the answer requires exploring many independent paths at once and the total information exceeds a single context window.
Performance: 90.2% improvement over single-agent Claude Opus 4 on Anthropic’s internal research eval. Up to 90% reduction in research time on complex queries.
Cost: ~15x the tokens of a chat interaction. Token usage alone explains 80% of performance variance.
Lesson: Architecture follows task structure. Multi-agent only wins when the task decomposes into independent parallel threads.

The Orchestrator-Worker Pattern

Orchestrator-worker predates LLMs by decades. Query planners in distributed databases fan work out to shard workers. Kernels schedule threads. Every engineering team has a tech lead who decomposes the sprint and hands tickets to engineers. One coordinator breaks the problem apart, workers handle the pieces in isolation, the coordinator stitches the result back together. The novelty in 2025 is that every node in the system is an LLM making routing decisions on the fly.

The other major pattern is the swarm or peer-to-peer model, where agents talk directly to each other and share state, often through a common message bus or scratchpad. Swarm models are flexible but hard to reason about. Orchestrator-worker constrains the topology. Workers never talk to each other. Every decision about what comes next lives in the orchestrator.

The decision that matters in any multi-agent design is the isolation boundary: what does each subagent need to know about what the others are doing? Anthropic’s bet is that for research, the answer is “almost nothing.” Each subagent gets a self-contained task description, an output format, and a fresh context window. It doesn’t know the other subagents exist. It cannot coordinate with them mid-task. That isolation is what lets the subagents run in true parallel and what keeps the lead agent’s context window from drowning in cross-talk.

Cognition’s “Don’t Build Multi-Agents”[3] argues the opposite: parallel subagents make independent decisions, and independent decisions on the same problem produce conflicting outputs. Their canonical example: a Flappy Bird clone request decomposed into subtasks. Subagent A builds a Super Mario background. Subagent B builds a bird with no consistent art style. Both technically completed their assigned tasks. Neither saw the original “Flappy Bird” framing, so the implicit “match the source game’s aesthetic” decision was lost in delegation. Both teams are right inside their own domain. Cognition is solving for shared-state tasks where isolation breaks things. Anthropic is solving for independent-thread tasks where isolation is the whole point. The task picks the architecture.

Why Anthropic Built This

Anthropic launched the Research feature in April 2025 as a research extension to Claude. The structural problem they hit is the one every research workflow hits: you can’t predict the path in advance, because each new fact reshapes the next question. A static pipeline (“retrieve top-k, summarize, return”) fails the moment the user asks anything that requires following a thread.

Their first attempt was a single agent with bigger context and more tool calls. It hit two limits. The first one was sequential time: a query that needs 50 web searches takes 50 sequential round-trips. The second limit was the 200K-token context limit on Claude Opus 4. Past that limit, the context gets truncated, and the agent loses the plan it made in turn one.

Anthropic’s architecture answers the structural problem: one Claude plans the search, several Claudes do the search in parallel, and a separate Claude with its own context window verifies every citation before anything reaches the user.

LeadResearcher (Claude Opus 4) receives the query.
LeadResearcher plans. It uses extended thinking mode[4] to draft a strategy, decide breadth vs depth, and write the plan to external memory before context fills.
LeadResearcher spawns 3-5 Subagents (Claude Sonnet 4) in parallel. Each gets a self-contained task: an objective, an output format, a tool list, and a clear boundary on when it’s done.
Each Subagent searches independently. Each one calls 3+ tools in parallel inside its own context window, evaluates results with interleaved thinking[5], and returns a condensed summary to the lead.
LeadResearcher synthesizes. It reads the summaries, decides whether more research is needed, and either spawns another wave of subagents or moves to the next step.
CitationAgent (a separate pass) attributes claims. It walks through the final report and the source documents, attaching each claim to a specific URL. Single-agent systems can’t separate confident from correct. This can.

🔍 Deeper Look: The deeper engineering lesson is separation of concerns for high-stakes tasks, and it generalizes beyond citations. A single agent doing both “decide what to write” and “verify every citation” produces what Anthropic calls “the game of telephone”: by the time the report is drafted, the source URLs have been condensed and re-summarized through several subagent returns, and the lead agent is reconstructing citations from memory. A separate CitationAgent reads the raw documents AND the final report, so it checks claims against ground truth instead of the lead agent’s recollection of it.
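To make the “check against ground truth” idea concrete, here is a minimal sketch of a verification pass that compares each claim’s quoted evidence with the raw source text. It is a deterministic stand-in, not Anthropic’s CitationAgent (which is an LLM pass); the `Claim` shape, the `sources` mapping, and the substring check are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str      # sentence as it appears in the final report
    evidence: str  # short quote the writer says came from the source
    url: str       # source the claim is attributed to


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return " ".join(s.lower().split())


def verify_citations(claims: list[Claim], sources: dict[str, str]) -> list[tuple[Claim, str]]:
    """Return (claim, reason) for every claim that fails the check.

    `sources` maps URL -> raw document text, i.e. the ground truth,
    not anyone's summary of it.
    """
    failures = []
    for claim in claims:
        raw = sources.get(claim.url)
        if raw is None:
            failures.append((claim, "unknown source"))
        elif normalize(claim.evidence) not in normalize(raw):
            failures.append((claim, "evidence not found in source"))
    return failures
```

The point of the design is the input, not the matching logic: the verifier receives the raw documents, so a claim that only survived as a paraphrase in a subagent summary gets flagged instead of trusted.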

[Figure: how the components interact across a single session, with the LeadResearcher blocking on the slowest subagent in each wave.]
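The blocking behavior can be sketched with `asyncio.gather`, which launches every coroutine at once and returns only when all of them have finished. This is an illustration under stated assumptions: `run_subagent` is a hypothetical stand-in that sleeps instead of searching, and the names and durations are made up.

```python
import asyncio
import time


async def run_subagent(name: str, seconds: float) -> str:
    """Hypothetical subagent: sleeps in place of searching, then returns a summary."""
    await asyncio.sleep(seconds)
    return f"{name}: condensed summary"


async def run_wave(tasks: dict[str, float]) -> list[str]:
    """Launch every subagent in the wave at once and block until all have returned."""
    return await asyncio.gather(*(run_subagent(n, s) for n, s in tasks.items()))


async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    # The wave takes about as long as its slowest member (0.4 s), not the sum (0.7 s).
    summaries = await run_wave({"A": 0.1, "B": 0.2, "C": 0.4})
    return summaries, time.perf_counter() - start


if __name__ == "__main__":
    summaries, elapsed = asyncio.run(main())
    print(summaries, f"{elapsed:.2f}s")
```

Results come back in the order the tasks were submitted, so the lead can match summaries to briefs without extra bookkeeping. The cost of this simplicity is the stall: one slow subagent holds up the whole wave.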

Know someone debating multi-agent for their next build? Send them this before they spend 15x on tokens for a task that doesn’t need it.

Decision 1: Externalize state to memory

The decision. When the LeadResearcher’s context approaches the 200K-token limit, it writes the plan to external memory, hands the next phase to subagents with fresh context windows, and reads the plan back when needed.

The context. Research tasks can run for hundreds of turns. At Claude Opus 4’s 200K context limit, a long-running research agent that holds every tool result in conversation history will hit truncation, and the plan written in turn 1 is gone by turn 40.

The tradeoff. Coordination complexity. State lives in two places now: in-context for short-lived decisions, in external memory for long-lived plans. The agent has to decide every turn whether to use what’s in context or read from memory.

The limitations. External memory works for state that’s structured and infrequently mutated (the plan, the user’s original question, the high-level strategy). It does not work for fast-changing in-flight state across many parallel agents. There’s no shared transactional store; if two subagents need to coordinate on a finding mid-search, this architecture can’t help them. Cognition’s critique of multi-agent systems lands hardest exactly here.

The result. Research sessions that would otherwise hit context truncation at turn 40 can now run to turn 200+. The LeadResearcher’s plan survives the entire session because it lives outside the context window.
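A minimal sketch of the pattern, assuming the “external memory” is nothing more than a JSON file on disk. The post does not say how Anthropic stores plans, so `PlanStore` and `truncate_context` are hypothetical names, and truncation is simulated by keeping only the most recent messages.

```python
import json
from pathlib import Path


class PlanStore:
    """Hypothetical external memory: one JSON file that outlives any context window."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def save(self, plan: dict) -> None:
        # Write to a temp file, then rename, so a crash never leaves a half-written plan.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(plan, indent=2))
        tmp.replace(self.path)

    def load(self) -> dict:
        return json.loads(self.path.read_text())


def truncate_context(messages: list[str], max_messages: int) -> list[str]:
    """Stand-in for context truncation: keep only the most recent messages."""
    return messages[-max_messages:]
```

In use, the lead saves the plan in turn one, lets the conversation history roll over as tool results pile up, and calls `load()` whenever it needs to re-anchor on the original strategy. This suits the slow-changing state described above; it does nothing for coordination between subagents mid-search.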

My take. This decision is what lets the whole architecture exist, and it’s worth borrowing even in single-agent systems. We covered the broader pattern in What is Agent Memory?

Decision 2: Run subagents in parallel

The decision. The LeadResearcher spawns 3-5 subagents simultaneously. Each runs in its own context window and never sees what the others are doing.

The context. Early versions of the system ran searches sequentially. The sequential approach hit two limits. The first was latency: 50 sequential web searches at ~2 seconds each is ~100 seconds before the agent even starts synthesizing. The second was context bloat: tool results crowded out the original query within 20 searches. Parallelism solves both.

The tradeoff. Coordination. A subagent that doesn’t know what its peers are doing will sometimes duplicate work. One of Anthropic’s early systems had a subagent investigating the 2021 automotive chip crisis while two others duplicated work on current 2025 supply chains. The fix was better task descriptions in the orchestrator’s delegation prompt: explicit objectives, explicit boundaries, explicit “don’t research X, that’s another subagent’s job.”
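One way to picture such a delegation prompt is a small brief object that carries the four fields of a self-contained task plus an explicit out-of-scope list. Anthropic has not published its prompts, so the `Brief` fields, the wording of `render`, and the rule “every sibling objective is out of scope” are assumptions of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Brief:
    objective: str
    output_format: str
    tools: list[str]
    done_when: str
    out_of_scope: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Turn the brief into the text a subagent would receive."""
        lines = [
            f"Objective: {self.objective}",
            f"Output format: {self.output_format}",
            f"Tools: {', '.join(self.tools)}",
            f"Stop when: {self.done_when}",
        ]
        lines += [
            f"Do not research: {topic} (another subagent owns it)"
            for topic in self.out_of_scope
        ]
        return "\n".join(lines)


def make_briefs(objectives: list[str], output_format: str,
                tools: list[str], done_when: str) -> list[Brief]:
    """One brief per objective; each lists its siblings' objectives as out of scope."""
    return [
        Brief(obj, output_format, tools, done_when,
              [other for other in objectives if other != obj])
        for obj in objectives
    ]
```

The subagents still never talk to each other. The orchestrator, which sees the whole plan, is the only place where the boundaries are drawn.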

The limitations. Parallel subagents only help if the subtasks are truly independent. If subagent B needs subagent A’s findings to do its job, parallelism degenerates into expensive serial execution with extra overhead. Anthropic’s blog is direct: “domains that require all agents to share the same context or involve many dependencies between agents are not a good fit for multi-agent systems today.” Coding, debugging, and most agentic workflows fail this test. Research passes it.

The result. Up to 90% reduction in research time on complex queries that previously ran serially. Wall-clock per wave drops from the sum of subagent times to the max of subagent times.
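The sum-versus-max claim can be worked through with the latency numbers from the context paragraph above. As a rough illustration, assume the 50 searches at about 2 seconds each are split evenly across 5 subagents, each running its 10 searches serially. The even split and the count of 5 are assumptions, and orchestration overhead is ignored.

```latex
\begin{aligned}
T_{\text{serial}}   &= \sum_{i=1}^{5} t_i = 5 \times (10 \times 2\,\text{s}) = 100\,\text{s} \\
T_{\text{parallel}} &= \max_{1 \le i \le 5} t_i = 10 \times 2\,\text{s} = 20\,\text{s} \\
\text{reduction}    &= 1 - \frac{T_{\text{parallel}}}{T_{\text{serial}}} = 1 - \frac{20}{100} = 80\%
\end{aligned}
```

A subagent that also issues its own tool calls in parallel shrinks its $t_i$ further, while an uneven split pushes the max back up, since the wave waits for whichever subagent drew the longest share.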

My take. Parallel subagents work because each one gets its own context window. Total tokens spent on the query goes up, even though no single context window gets bigger. Anthropic found that token spend alone, more than any other variable, predicts how good the final answer is.

Decision 3: Evaluate outcomes with LLM judges

The decision. Anthropic evaluates the system using LLM-as-judge against a rubric (factual accuracy, citation accuracy, completeness, source quality, tool efficiency). They do not check whether the agents followed a “correct” sequence of tool calls.

The context. Multi-agent systems are non-deterministic by design. Given the same query, two runs may use different subagent counts, different tools, different search orders, all reaching the same valid answer. Traditional evals that score on the path (“did the agent call tool X then tool Y?”) fail because there is no single correct path. We covered this pattern in detail in Why AI Agents Keep Failing in Production.

The tradeoff. LLM judges are themselves probabilistic. A judge that scores too leniently masks regressions; a judge that scores too strictly punishes valid alternative paths. Anthropic experimented with multiple judges per output and found a single judge with a clear rubric was both cheaper and more aligned with human grading than a panel approach.

The limitations. LLM judges work well when the rubric is clear (factual claims have verifiable answers; citations have URLs that exist or don’t). They work poorly when the task is creative, subjective, or has no ground truth. Research with citations passes this test. Open-ended summarization fails it. Human testers also caught failure modes the LLM judge missed, including a subtle bias where early agents preferred SEO-optimized content over higher-quality but lower-ranked sources like academic PDFs.

The result. Hundreds of evals run cheaply per change. Anthropic recommends starting with about 20 test cases. Early in development, any change you make moves the score so much that you don’t need a big test set to see whether it helped.

My take. LLM-as-judge is one of the most-misused patterns in agent eval. Anthropic’s rubric has five explicit criteria: factual accuracy, citation accuracy, completeness, source quality, tool efficiency. With those criteria spelled out, a single LLM call returns scores that match human graders. Teams who skip the rubric and just ask “is this good?” get noise and blame the LLM.
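A rubric-driven judge can be sketched as one prompt plus strict parsing of the reply. Only the five criteria come from the text above. The prompt wording, the 0.0 to 1.0 scale, the JSON reply shape, and the injected `call_model` function (whatever sends a prompt to your model and returns its text) are assumptions of this sketch, not Anthropic’s implementation.

```python
import json
from typing import Callable

CRITERIA = [
    "factual accuracy",
    "citation accuracy",
    "completeness",
    "source quality",
    "tool efficiency",
]


def build_prompt(question: str, report: str) -> str:
    """Spell out every criterion so the judge never has to guess what 'good' means."""
    rubric = "\n".join(f"- {c}: score from 0.0 to 1.0" for c in CRITERIA)
    return (
        "Grade the report against each criterion.\n"
        f"{rubric}\n"
        'Reply with JSON only: {"scores": {"<criterion>": <number>}, "pass": <bool>}\n\n'
        f"Question: {question}\n\nReport:\n{report}"
    )


def grade(question: str, report: str, call_model: Callable[[str], str]) -> dict:
    """Run one judge call and reject replies that skip or mis-score a criterion."""
    reply = json.loads(call_model(build_prompt(question, report)))
    scores = reply["scores"]
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"judge skipped criteria: {missing}")
    for c in CRITERIA:
        if not 0.0 <= scores[c] <= 1.0:
            raise ValueError(f"score out of range for {c}: {scores[c]}")
    return {"scores": {c: float(scores[c]) for c in CRITERIA}, "pass": bool(reply["pass"])}
```

Failing loudly on a malformed reply matters more than it looks: a judge that silently drops a criterion produces exactly the kind of noise that gets blamed on the model.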

The Tradeoffs

The architecture only pays off for tasks in the upper-right quadrant of this map: high-value queries that decompose into independent parallel threads.

The LeadResearcher executes subagent waves synchronously, so a single slow subagent stalls the entire wave; asynchronous execution is on the roadmap but the coordination problem (result ordering, state consistency, partial failures) is unsolved. The topology has no migration path for tasks with shared state across agents, which means coding, dialogue, and most agentic workflows fall outside the sweet spot.

A single Research session can use millions of tokens, which means dollars per query at current Claude pricing. The economics only work for high-value research: legal due diligence, competitive intelligence, biomedical literature review. Consumer-grade Q&A cannot absorb the multiplier. The 15x baseline also compounds when something misbehaves: a subagent that recursively spawns more subagents, or a tool that returns oversized results, can multiply a single query’s cost by another 10x or more. The published architecture has no circuit breakers or per-run caps.
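If you build your own version, a per-run guard is cheap to add in the orchestration layer. The sketch below is not part of the published architecture; `RunBudget`, its limits, and the depth convention (0 for the lead agent, 1 for its subagents) are hypothetical, and it assumes your model client reports token usage per call so you have a number to charge.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run tries to spend or spawn past its limits."""


class RunBudget:
    """Hypothetical per-run guard, enforced in code rather than requested in a prompt."""

    def __init__(self, max_tokens: int, max_depth: int = 1) -> None:
        self.max_tokens = max_tokens
        self.max_depth = max_depth  # 0 = lead agent, 1 = its subagents
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record token usage; refuse the charge that would cross the cap."""
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap {self.max_tokens} reached at {self.used}")
        self.used += tokens

    def check_spawn(self, parent_depth: int) -> int:
        """Return the child's depth, or raise if the spawn would nest too deep."""
        child_depth = parent_depth + 1
        if child_depth > self.max_depth:
            raise BudgetExceeded(f"spawn at depth {child_depth} exceeds {self.max_depth}")
        return child_depth
```

With the default `max_depth=1`, the lead can spawn subagents but a subagent cannot spawn its own, which removes the recursive multiplier. The token cap handles the oversized-tool-result case by stopping the run instead of letting it compound.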

Can a smaller team reproduce this? The orchestrator-worker pattern, the CitationAgent pattern, and the external-memory pattern are all reproducible. What isn’t is the prompt engineering. Anthropic spent weeks watching agents fail in simulations and rewriting delegation prompts to fix specific failure modes. They published the principles (“think like your agents,” “scale effort to query complexity,” “teach the orchestrator how to delegate”) but not the prompts. Expect 2-3 months of iteration before your version stops spawning 50 subagents for a one-line question. Their Claude cookbook prompts are the closest public reference for the basic agent workflow patterns.

🏗️ Engineering Lesson: You can use three patterns from this architecture without building a multi-agent system at all. (1) Externalize state to memory before context fills. (2) Isolate workers with self-contained task descriptions. (3) Verify high-stakes outputs (citations, code review, factual claims) with a separate pass. The full orchestrator-worker topology is expensive and only justified when the task is provably breadth-first. Most production systems can apply these three patterns inside a single agent and capture the reliability gains without the 15x token cost.

The Bottom Line

Agent architecture is a token-spending strategy. You spend dollars to buy parallelism, and the spend only cashes out when the task decomposes into independent threads. Anthropic’s own line is the cleanest version: “Multi-agent systems work mainly because they help spend enough tokens to solve the problem.” Before building one of these systems, the question to answer is whether your task has enough independent threads to make the parallelism pay for itself.

If you’ve built anything multi-agent in production, what broke first?

Where to Next?

📖 Go Deeper: The AI Agents Stack (2026 Edition). The full layered view of how agents get assembled in production. Orchestrator-worker is one pattern at the top layer; the bottom seven matter just as much.

🔗 Go Simpler: What is an AI Agent? Start here if any of the loop, tool, or context-window references felt too fast.

🔀 Go Adjacent: Why AI Agents Keep Failing in Production. The three production patterns that break first, and how Anthropic’s architecture sidesteps each one.

If you’re transitioning into AI engineering, How to Break Into AI Engineering in 2026 is a free 34-page PDF roadmap that covers everything from market data to monthly transition milestones, organized by your starting role.

🔜 Tuesday: The Open-Source Agent Toolkit in 2026. What you’d actually use to build a multi-agent system in 2026 without the Anthropic API bill.

1. Anthropic Research feature launch (Apr. 2025)

2. How we built our multi-agent research system (Jun. 2025)

3. Don’t Build Multi-Agents (Jun. 2025)

4. Extended thinking (Anthropic docs)

5. Interleaved thinking (Anthropic docs)

Butilca Bogdan

May 24 (edited)

This really landed for me — "multi-agent works mainly because it spends enough tokens to solve the problem." I built a little personal deep-research setup on basically that premise, and the caveats you list are the exact walls I hit.

The runaway-cost one especially. I ended up just not letting subagents spawn their own subagents — enforced in the orchestration layer, not begged for in a prompt. Doesn't give me a real per-run cost cap (still missing that one too), but it kills the scariest multiplier by default.

Same instinct on verification — a separate verifier that grep-checks citations against the source, instead of an LLM grading its own homework.

The thing I keep chewing on: did Anthropic hard-ban recursion internally, or let agents spawn and rein it in some other way? Curious how they landed on it.


Pradeep

May 23

The parallelization insight here is what separates shallow agentic implementations from ones that actually scale. The 15x token cost sounds alarming but the 90.2% answer quality improvement tells the real story - for knowledge-intensive enterprise use cases, the ROI math flips quickly when you factor in analyst time saved. The orchestrator-subagent pattern Anthropic settled on is also the one I've seen work best in production deployments. Really useful breakdown of the design decisions.
