Building Reliable Agentic AI Systems

A case study in building production-ready agentic AI systems

This paper presents the Preclinical Information Center (PRINCE), a cloud-hosted platform developed by Bayer AG with Thoughtworks to address pharmaceutical industry challenges in drug development. PRINCE leverages Agentic Retrieval-Augmented Generation and Text-to-SQL to integrate decades of safety study reports. We describe PRINCE's evolution from keyword-based search to an intelligent research assistant capable of answering complex questions and drafting regulatory documents. We reflect on key engineering decisions through two lenses: context engineering (how information was shaped and routed between specialized agents) and harness engineering (how orchestration, recovery, and observability were built around the models to maintain control and reliability). The system prioritizes trust through transparency, explainability, and human-in-the-loop integration. PRINCE demonstrates AI's transformative potential in pharmaceuticals, significantly improving data accessibility and research efficiency while ensuring governance and compliance.

16 June 2026

Sarang Sanjay Kulkarni

Sarang Kulkarni is a Principal Consultant at Thoughtworks, working at the intersection of software engineering, data platforms, and applied AI. He focuses on building production-grade GenAI systems, particularly Retrieval-Augmented Generation (RAG) and multi-agent workflows, and helps teams take these systems from early ideas to real-world use. Sarang also contributes to Thoughtworks’ Global AI Service Development team and teaches an O’Reilly course on building production-ready RAG applications.

Contents

The Challenge: Navigating the Preclinical Data Maze
The Solution: PRINCE - An Evolutionary Platform
System Architecture: Engineering a Reliable Agentic RAG System
The Agentic RAG System
Building Trust in a Production LLM System
Engineering for Resilience: Error Handling and Recovery
Enhancing Data Quality: Named Entity Recognition and Annotation
The Journey Continues: Iterative Development
Conclusion

Preclinical drug discovery is inherently complex and data-intensive. Researchers face the significant challenge of efficiently accessing and analyzing vast volumes of information generated during this critical phase. Traditional keyword-based search methods, often reliant on rigid Boolean logic, frequently fall short when confronted with the nuanced and intricate nature of preclinical research questions.

The advent of Large Language Models (LLMs) has presented a transformative opportunity. By combining the generative power of LLMs with the precision of information retrieval systems, Retrieval-Augmented Generation (RAG) has emerged as a promising technique. This approach holds the potential to revolutionize preclinical data access, enabling researchers to pose complex questions in natural language and receive accurate, context-rich answers grounded in proprietary data.

Recognizing this potential early, Bayer committed to exploring how these technologies could address longstanding challenges in preclinical research.

In this post, we share that journey: how Bayer's early investment in generative AI has resulted in PRINCE, an agentic AI system built on Agentic RAG. This case study explores the technical architecture, engineering decisions, and lessons learned in transforming preclinical data retrieval from a challenging maze into an intuitive conversational experience.

Many of the engineering decisions behind PRINCE can now be understood through the lens of context engineering and harness engineering, although when the system was first designed we did not use these terms. Context engineering shaped what information each model received, what it did not receive, and how context moved between specialized steps such as research, reflection, and writing. Harness engineering shaped the scaffolding around the models: orchestration, tool boundaries, state persistence, retries, fallbacks, validation, reflection loops, observability, and human review.

While this post focuses on the technical architecture and engineering challenges, our paper published in Frontiers in Artificial Intelligence covers the product evolution and business impact in more detail.

The Challenge: Navigating the Preclinical Data Maze

The preclinical research landscape at Bayer, like many large pharmaceutical organizations, is characterized by a diverse and extensive array of data. This includes highly structured datasets from various studies, alongside vast amounts of unstructured information embedded within text documents such as study reports, publications, and regulatory submissions. Researchers frequently encountered significant hurdles in accessing and analyzing this information effectively:

Data Silos: information was fragmented and scattered across numerous disparate systems and repositories, making it exceedingly difficult to gain a comprehensive, holistic view of preclinical data related to a specific compound or study.
Limited Search Capabilities: traditional keyword-based search engines struggled with the complexity and variability of preclinical terminology and research questions, often yielding irrelevant, incomplete, or overwhelming results.
Time-Consuming Manual Analysis: extracting specific insights or compiling information across multiple documents required considerable manual effort, diverting valuable researcher time away from core scientific activities.

These inherent challenges highlighted a clear need for a more efficient, intelligent, and integrated approach to preclinical data retrieval and analysis.

The Solution: PRINCE - An Evolutionary Platform

To address these challenges, Bayer developed the Preclinical Information Center (PRINCE) platform. PRINCE was conceived as a unified gateway to preclinical data, initially focusing on consolidating previously siloed structured study metadata and exposing them in a “Searchable” manner. This initial phase allowed users to apply advanced filters and retrieve information primarily from structured study metadata.

However, a significant portion of Bayer's valuable preclinical knowledge resides within unstructured PDF study reports accumulated over decades. Due to numerous system migrations over the years, the structured metadata associated with these reports could be incomplete, missing, or even contain incorrect annotations. Crucially, the authoritative “gold standard” information was consistently present within the approved PDF study reports.

The emergence of Generative AI, particularly RAG, provided the key to unlocking this wealth of unstructured data. By integrating RAG capabilities, PRINCE began to shift the paradigm from a filter-based 'search' tool to a natural language 'ask' system, enabling researchers to query the content of these study reports directly.

This evolution reflects PRINCE's progression through three distinct phases:

Search: the initial phase focused on creating a unified gateway to thousands of nonclinical study reports, consolidating multiple in-house data silos from various preclinical domains into a searchable format, primarily leveraging structured metadata.
Ask: this phase introduced an AI-powered question-answering system built on Retrieval-Augmented Generation (RAG). This enabled researchers to derive insights directly from unstructured data, including scanned PDFs from historical reports, by posing questions in natural language.
Do: the current phase positions PRINCE as an active research assistant capable of executing complex tasks. This is achieved through the integration of multi-agent systems, allowing the platform to handle intricate queries, orchestrate workflows, and support activities like drafting regulatory documents.

This deliberate evolution from Search to Ask to Do is a strategic response to the industry's need for greater efficiency and innovation in preclinical development. By providing researchers with increasingly powerful tools to access, analyze, and act upon preclinical data, PRINCE aims to enable faster data-driven decision-making, reduce unnecessary experiments, and ultimately accelerate the development of safer, more effective therapies.

System Architecture: Engineering a Reliable Agentic RAG System

The system functions as an interactive conversational UI, powered by a robust backend infrastructure. Its architecture, designed for handling complex queries and delivering accurate, context-rich answers, is orchestrated using LangGraph and served via a FastAPI application.

Figure 1 provides the system context (UI, backend, data stores, LLM fallbacks, and observability), while Figure 2 zooms into how the system coordinates its specialized agents.

Figure 1: System context and supporting platforms.

User Request: the process begins when a user submits a request through the Conversational UI, which is built with React.
Orchestration: the user's request is routed to a LangGraph-based orchestration layer in the backend. This workflow engine coordinates a multi-stage process that progresses through clarifying user intent, thinking and planning, conducting research (using RAG and Text-to-SQL), validating data completeness, and finally generating a response through the Writer agent. The workflow includes deliberate pause points and feedback loops to ensure data completeness before proceeding. (We explore the details of this agentic workflow in a dedicated section later.)
Data Retrieval and State Management: the Researcher agents interact with a comprehensive and distributed data ecosystem:

Vector representations of all study reports are stored in OpenSearch, forming the core knowledge base for information retrieval.
Curated structured data, resulting from various ETL and harmonization processes, is accessed via Athena.
The state of the agent's execution is tracked at every step. After each logical step (a LangGraph node execution), the corresponding state is persisted in PostgreSQL using a LangGraph checkpointer.
Broader application-level state is managed in DynamoDB.

The system leverages internal GenAI platforms that host models from OpenAI, Anthropic, Google, and open-source providers. These platforms expose all models via a unified OpenAI-compatible endpoint, making it easy to swap models and choose the best tool for each task. They also manage the control plane, enforcing rate limits and other safeguards to prevent abuse.
Resilience and Error Handling: robustness is a critical design principle, with multiple fallback mechanisms in place:

If a specific LLM fails, the system automatically retries the request several times before falling back to an alternative model or platform to ensure service continuity.
To recover quickly from transient failures, retries are implemented at both the individual LLM call level and the logical node level (i.e., an entire step in the agent's plan).
Agents are also given the context of the errors so that they can chart a different trajectory or an alternative plan of action in response.

Observability and Evaluation: the entire system is monitored for performance and reliability:

General system health and metrics are tracked using CloudWatch.
Langfuse serves as the primary observability tool, providing detailed traces of all production traffic, which allows in-depth debugging of issues. Evaluation datasets are also stored and managed within Langfuse, making it easier to analyze performance scores and diagnose specific failures. Evaluation uses the RAGAS framework. Live traffic is evaluated daily, while dataset evaluation runs whenever significant changes are made to the core workflow, prompts, or underlying models.

Final Response: once the agents have processed the request and generated a satisfactory response, it is sent back to the Conversational UI to be presented to the user.
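The retry-then-fallback behaviour described under Resilience and Error Handling can be illustrated with a minimal sketch. It is not PRINCE's implementation: `call_with_fallback`, `AllModelsFailed`, and the two stand-in model functions are hypothetical names, and a real harness would catch narrower error types and apply the same pattern again at the node level.

```python
import time
from typing import Callable, Sequence


class AllModelsFailed(RuntimeError):
    """Raised when every configured model has exhausted its retries."""


def call_with_fallback(
    prompt: str,
    models: Sequence[Callable[[str], str]],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> str:
    """Try each model in order, retrying failures before moving to the next one."""
    errors: list[str] = []
    for model in models:
        name = getattr(model, "__name__", "model")
        for attempt in range(1, max_attempts + 1):
            try:
                return model(prompt)
            except Exception as exc:  # assumption: every error is treated as transient
                errors.append(f"{name} attempt {attempt}: {exc}")
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    # The collected error context can be handed to an agent so it can replan.
    raise AllModelsFailed("; ".join(errors))


def flaky_primary(prompt: str) -> str:
    """Hypothetical primary model that always times out."""
    raise TimeoutError("upstream timeout")


def stable_secondary(prompt: str) -> str:
    """Hypothetical fallback model."""
    return f"answer to: {prompt}"


result = call_with_fallback("Summarise study findings", [flaky_primary, stable_secondary])
```

Keeping the accumulated error messages in the final exception is one way to give downstream agents the error context they need to choose a different plan.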

A design principle running through this architecture is context discipline. Larger context windows did not remove the need to be selective about what each agent sees. In early iterations, putting too much information into the context made the system harder to steer and harder to evaluate. PRINCE therefore avoids treating the prompt as one large container for all available information. Instead, different stages receive different context: planning context for Think & Plan, retrieval context for the Researcher Agent, evidence context for the Reflection Agent, and synthesis context for the Writer Agent. This reduces context pollution and makes the system easier to debug, evaluate, and improve.

These steps ensure that the system can provide reliable and contextually relevant answers to a wide range of complex queries by leveraging a sophisticated, multi-agent architecture and a diverse set of powerful tools and data sources.
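One way to picture context discipline is an explicit allow-list of state fields per stage. The sketch below is illustrative only: the `WorkflowState` fields and the `STAGE_CONTEXT` mapping are assumptions made for the example, not the actual PRINCE state schema.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowState:
    """Everything the workflow knows; no single stage should see all of it."""
    question: str
    tool_descriptions: list[str] = field(default_factory=list)
    plan: list[str] = field(default_factory=list)
    retrieved_chunks: list[str] = field(default_factory=list)
    validated_evidence: list[str] = field(default_factory=list)


# Hypothetical mapping from stage name to the state fields that stage may read.
STAGE_CONTEXT = {
    "think_and_plan": ("question", "tool_descriptions"),
    "researcher": ("question", "plan"),
    "reflection": ("question", "retrieved_chunks"),
    "writer": ("question", "validated_evidence"),
}


def build_context(stage: str, state: WorkflowState) -> dict:
    """Return only the slice of state that the given stage is allowed to see."""
    return {name: getattr(state, name) for name in STAGE_CONTEXT[stage]}


state = WorkflowState(
    question="Was ataxia observed in study T123456-2?",
    tool_descriptions=["rag_reports", "sql_metadata"],
    retrieved_chunks=["chunk 1", "chunk 2"],
    validated_evidence=["chunk 2"],
)
writer_context = build_context("writer", state)
```

Because each stage's inputs are declared in one place, a trace of what a model actually saw can be compared against what it was supposed to see.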

The Agentic RAG System

PRINCE incorporates an agentic RAG system (Figure 2) to handle complex user requests that require multiple steps, reasoning, and interaction with different tools or data sources. This setup, implemented using LangGraph, orchestrates the overall workflow and delegates specific tasks to a Researcher Agent, a Writer Agent, and a Reflection Agent. Multiple fallback mechanisms allow the system to keep functioning even if some components fail.

Figure 2: The research workflow.

Clarify User Intent

The Clarify User Intent step serves as the first line of defense against ambiguity. As the system scaled to include diverse domains like toxicology and pharmacology, simple user queries often became ambiguous, making it difficult to automatically select the right tools. Rather than relying on expensive trial-and-error across all data sources, the system proactively asks clarifying questions to pinpoint the specific domain or data type.

This ensures the system enhances the query with the necessary constraints to target the correct tools. We are also optimizing this by developing domain-level selection in the UI, which will allow users to pre-filter valid tools upfront. To further reduce friction, the system also provides AI-assisted source recommendations: when a user has not selected any data source, or has selected several without a clear focus, the model analyzes the intent behind the user's query and suggests the most relevant sources. The user retains full control and can accept, adjust, or override the recommendation, ensuring domain expertise always has the final say. This “fail-fast” mechanism prevents wasted execution on vague queries, while careful tuning ensures the system remains unobtrusive when the intent is already clear.

From a context engineering perspective, this step is the first assembly decision in the workflow: it constrains which tools, domains, and data sources will be in scope before any retrieval begins, ensuring subsequent agents receive a focused rather than open-ended problem.
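The gating logic of this step can be sketched roughly as follows. This is a simplified illustration: the source names and hint terms in `SOURCE_HINTS` are invented, `recommend_sources` is a keyword-matching stand-in for the model call that infers intent, and treating any multi-source selection as "without a clear focus" is an assumption.

```python
# Hypothetical source names and trigger terms; in PRINCE a model infers intent.
SOURCE_HINTS = {
    "toxicology_reports": {"toxicity", "clinical findings", "repeat-dose"},
    "pharmacology_assays": {"assay", "ic50", "receptor"},
}


def recommend_sources(query: str) -> list[str]:
    """Stand-in for the model call: rank sources by how many hint terms match."""
    text = query.lower()
    scored = [
        (sum(term in text for term in terms), name)
        for name, terms in SOURCE_HINTS.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


def clarify_intent(query: str, selected_sources: list[str]) -> dict:
    """Decide whether to proceed, recommend sources, or ask a clarifying question."""
    if len(selected_sources) == 1:
        # Intent is already clear, so the step stays out of the way.
        return {"action": "proceed", "sources": selected_sources}
    suggestions = recommend_sources(query)
    if suggestions:
        # The user can still accept, adjust, or override the suggestion.
        return {"action": "recommend", "sources": suggestions}
    # Fail fast: nothing to go on, so ask before spending any retrieval effort.
    return {"action": "ask_user", "question": "Which domain or data type should this search cover?"}


decision = clarify_intent("Were any clinical findings observed in study T123456-2?", [])
```

The important property is that retrieval never starts on a vague query: the function always returns either a scoped set of sources or a question for the user.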

Think & Plan: Process Reflection

The Think & Plan step is responsible for devising a strategy to fulfill the user's request. This critical component gives the system a dedicated space to reason about the next steps before taking action, a technique inspired by Anthropic's Think tool. Importantly, this step performs process reflection: evaluating whether the agent is making the right progress toward its end goal and is on the right trajectory, rather than evaluating the data itself.

In multi-step agentic workflows, particularly those involving many sequential actions, process reflection is essential. Consider a scenario where the system needs to execute 50 steps to complete a complex task. At each juncture, the system must ask: Am I taking these steps in the right manner? Am I making the progress I'm supposed to make? Is the current trajectory leading toward the user's goal? The Think & Plan step provides this metacognitive capability, allowing the system to reflect on its own workflow and adjust its strategy accordingly.

This “thinking space” has proven particularly valuable in scenarios involving multiple tool calls. When PRINCE was initially developed, it had only a couple of tools: one for RAG-based retrieval and another for Text-to-SQL queries. However, as we integrated more data sources to expand the system's capabilities, the number of available tools grew significantly. With this explosion of tools came an inherent challenge: overlapping concerns and domain boundaries across different tools.

For example, multiple tools might serve similar but subtly different purposes: querying structured metadata versus unstructured reports, or retrieving study summaries versus detailed experimental data. When presented with tools that belong to similar domains but handle slightly different data, the LLM would sometimes struggle to select the most appropriate tool for a given query. By introducing a dedicated thinking step, the system can explicitly reason about which tool best matches the user's intent, evaluate the characteristics of each available tool, and make a more informed decision. This approach led to a dramatic improvement in the accuracy of tool selection.

Beyond tool selection, the Think & Plan step is essential for orchestrating multi-step processes. Many complex queries in PRINCE require a series of tool calls where the output of one tool must be analyzed before determining the next action. For instance, the system might first query structured metadata to identify relevant studies, then use those study IDs to retrieve detailed information from unstructured reports, and finally synthesize the findings. Without a dedicated space for process reflection, the system would attempt to execute these steps linearly without evaluating whether each step is bringing it closer to the goal. With the thinking step in place, the system can pause, assess its progress in the workflow, and intelligently plan the subsequent tool calls needed to complete the user's request.

The Researcher Agent

The Researcher Agent serves as the system's primary information gatherer. As we onboard new scientific domains onto PRINCE, we consistently observe that data falls into two primary categories: structured and unstructured. While specific implementation techniques may vary across domains (for instance, using Snowflake Cortex Analyst for Text-to-SQL on pharmacology queries versus more custom methods for toxicology), the fundamentals behind these retrieval strategies remain consistent.

As PRINCE expands across multiple preclinical domains, a single Researcher agent with a flat tool list becomes increasingly hard to manage. Many tools operate on similar concepts (“studies”, “findings”, “assays”) but point to different underlying datasets, schemas, and regulatory interpretations depending on the domain. For example, when a user refers to “the study”, the relevant context might be a repeat-dose toxicology study, a cardiovascular safety pharmacology package, or a particular assay in aggregated mass-data tables, each with its own preferred sources of truth.

To avoid one monolithic agent juggling overlapping tools and subtly different data contracts, we are actively evolving the Researcher capability into a hierarchy of domain-specific sub-agents. In this proposed architecture, each domain agent will own its own toolset (for example, toxicology RAG + tox metadata SQL, or pharmacology RAG + assay-level SQL) along with tailored prompt instructions that encode how that domain’s data model works, which tables or indices are authoritative, and how to interpret key concepts. We anticipate this will keep responsibilities coherent, reduce accidental cross-domain leakage, and make it easier to reason about and test retrieval behaviour per domain.

To effectively harvest insights from this diverse landscape, the Researcher Agent employs a hybrid retriever approach focused on two distinct patterns:

Retrieval-Augmented Generation (RAG): for processing unstructured data, primarily PDF reports.
Text-to-SQL: for querying structured data housed in Amazon Athena.

This dual-strategy allows the system to bridge the gap between narrative scientific reports and quantitative experimental data.

In this updated vision, the top-level Researcher Agent is designed to act as a coordinator rather than a single all-knowing component. Given the clarified user intent and any explicit domain selection from the UI, it will route the query to the appropriate domain sub-agent, which can then decide how to combine RAG and Text-to-SQL within its own boundary. This pattern aims to preserve the simplicity of “one researcher” from the user’s perspective, while internally allowing each domain to evolve its own tools, schemas, and retrieval recipes without destabilizing the rest of the system.
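The coordinator pattern can be sketched as a thin routing function over a registry of domain sub-agents. The sketch is hypothetical: the agent functions are placeholders for sub-agents that would own their own RAG and Text-to-SQL tools, and letting an explicit UI selection take precedence over the clarified intent is an assumption made for the example.

```python
from typing import Callable, Optional

DomainAgent = Callable[[str], str]


def toxicology_agent(query: str) -> str:
    """Placeholder: would combine toxicology RAG and tox metadata SQL internally."""
    return f"[toxicology] {query}"


def pharmacology_agent(query: str) -> str:
    """Placeholder: would combine pharmacology RAG and assay-level SQL internally."""
    return f"[pharmacology] {query}"


# Registry of domain sub-agents; adding a domain does not touch the coordinator.
DOMAIN_AGENTS: dict[str, DomainAgent] = {
    "toxicology": toxicology_agent,
    "pharmacology": pharmacology_agent,
}


def route(query: str, ui_domain: Optional[str], clarified_domain: Optional[str]) -> str:
    """Coordinator: pick exactly one domain sub-agent for the query."""
    domain = ui_domain or clarified_domain  # assumption: explicit UI choice wins
    if domain not in DOMAIN_AGENTS:
        raise ValueError(f"no domain agent for {domain!r}; clarify intent first")
    return DOMAIN_AGENTS[domain](query)


answer = route("findings in the study", ui_domain=None, clarified_domain="toxicology")
```

From the caller's side there is still a single entry point, while each sub-agent can change its tools and prompts independently.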

Retrieval-Augmented Generation (RAG) for Unstructured Data

Given the vast repository of thousands of preclinical study reports and other unstructured documents, RAG is essential for extracting relevant insights by grounding LLM responses in this specific knowledge base. The RAG pipeline comprises a comprehensive ingestion process and a sophisticated query-time architecture.

Ingestion Process: Preclinical study reports, mostly PDFs spanning decades and often including scanned documents with complex tables, are first centralized into an S3 data lake and passed through an extraction pipeline tuned for this corpus. The extracted text is normalized into structured JSON and then chunked using a strategy that preserves enough scientific context while keeping chunks efficient for retrieval.

Each chunk is enriched with study- and section-level metadata from Amazon Athena (for example study ID, compound, species, route, page, and parent section), which later enables precise metadata filtering in the RAG layer. Finally, these annotated chunks are embedded and indexed in Amazon OpenSearch Service, forming the vector store that backs semantic and metadata-aware retrieval over both the historical corpus and the daily deltas as new or updated reports arrive.

Query-Time RAG Pipeline: When a user submits a query, the system initiates a multi-stage retrieval process. This pipeline is engineered to effectively retrieve the most relevant and trustworthy information from the vector database to ground the LLM's response.

To illustrate this pipeline, consider the example query: “Were any of the following clinical findings observed in study T123456-2: piloerection, ataxia, eyes partially closed, and loose faeces?”. The system processes this query through the following steps:
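The chunk-and-enrich step can be sketched as below. The fixed-size character split is only a stand-in, since the actual chunking strategy is not described here, and the record layout (`text` plus a `metadata` dictionary) is an assumption; embedding and indexing are omitted.

```python
def enrich_chunks(study_metadata: dict, sections: list[dict], max_chars: int = 500) -> list[dict]:
    """Split each extracted section into chunks and attach study- and section-level metadata."""
    chunks = []
    for section in sections:
        text = section["text"]
        # Naive fixed-size split; a real strategy would respect sentence and table boundaries.
        for start in range(0, len(text), max_chars):
            chunks.append({
                "text": text[start:start + max_chars],
                "metadata": {
                    **study_metadata,                    # study-level fields
                    "page": section["page"],             # section-level fields
                    "parent_section": section["title"],
                },
            })
    return chunks


# Hypothetical study-level metadata, as it might be looked up from the structured store.
study = {"study_id": "T123456-2", "compound": "example-compound", "species": "rat", "route": "oral"}
sections = [{"title": "Clinical observations", "page": 12, "text": "a" * 1200}]
chunks = enrich_chunks(study, sections)
```

Because every chunk carries `study_id` and the other fields, a query-time filter such as `eq(study_id, T123456-2)` can be applied before any similarity search runs.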

Keyword Extraction: the user's natural language query is first analyzed by an LLM. Through careful prompt engineering, the model is instructed to extract keywords highly relevant for keyword search within our document corpus (e.g., “piloerection”, “ataxia”, “eyes partially closed”, “loose faeces”).
Metadata Filter Generation: concurrently, the LLM generates a metadata filter based on the query. For example, a filter eq(study_id, T123456-2) is extracted to narrow the search space. This filter is dynamically generated using few-shot prompting with various permutation and combination examples provided to the model, ensuring it can handle diverse filtering requests.
Query Expansion: to ensure comprehensive retrieval and account for variations in phrasing and terminology, query expansion (multi-query or query rewriting) is performed by a smaller, faster model. This generates n=5 semantically similar queries based on the original question. For the example query, this might include variations like:

“Clinical symptoms reported in research T123456-2, including goosebumps, lack of coordination, semi-closed eyelids, or diarrhea.”
“Recorded observations in experiment T123456-2 regarding hair standing on end, unsteady movement, eyes not fully open, or watery stools.”
“What were the clinical observations noted in trial T123456-2, particularly regarding the presence of hair bristling, impaired balance, partially shut eyes, or soft bowel movements?”

Hybrid Retriever: information retrieval from the vector database (Amazon OpenSearch Service) uses a Hybrid Search approach that combines metadata filtering, semantic vector similarity search (kNN), and keyword-based retrieval. This process is executed as follows:

Metadata Filtering: the metadata filter generated in the previous step (e.g., eq(study_id, T123456-2)) is applied directly to the vector database query. This pre-filters the search space based on the structured metadata attached to the chunks during the ingestion process from Amazon Athena, ensuring that only chunks associated with the specified study ID (or other relevant metadata) are considered. This reduces the search space from millions of vectors to tens or hundreds, improving efficiency and relevance.
Parallel Hybrid Search Execution: for each of the n=5 expanded queries, one hybrid search query is executed against the filtered Amazon OpenSearch Service vector database, and the five searches run in parallel. Each query combines semantic vector similarity search (kNN) and keyword-based search, leveraging OpenSearch's capabilities for efficient multi-vector and text search.
Weighted Result Scoring: within each individual hybrid search, a weighted approach is applied to the results. A weight of 0.7 is given to the semantic search results and 0.3 to the keyword search results to balance contextual understanding and precise term matching. This weighting was determined through experimentation to optimize retrieval effectiveness for our data.
Result Aggregation and Initial Ranking: the results (sets of relevant chunks with their weighted scores) from all 5 parallel hybrid search executions are aggregated. Unique chunks from all search results are pooled, and each chunk's highest weighted score across the parallel searches determines its initial ranking. This step retrieves a larger set of candidate context chunks (k≈20) based on these aggregated, weighted scores.
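The scoring and aggregation steps above can be sketched in plain Python. The 0.7/0.3 weights, the max-score merge, and k≈20 come from the text. The data shapes are assumptions: each search result is modelled as a dict from a hypothetical chunk id to a (semantic, keyword) score pair already normalised to [0, 1], and no OpenSearch API is used.

```python
from typing import Dict, List, Tuple

SEMANTIC_WEIGHT = 0.7  # weights quoted in the text
KEYWORD_WEIGHT = 0.3


def weighted_score(semantic: float, keyword: float) -> float:
    """Combine the two normalised scores into one hybrid score."""
    return SEMANTIC_WEIGHT * semantic + KEYWORD_WEIGHT * keyword


def aggregate(results_per_query: List[Dict[str, Tuple[float, float]]],
              k: int = 20) -> List[Tuple[str, float]]:
    """Merge per-query results, keeping each chunk's best weighted score."""
    best: Dict[str, float] = {}
    for results in results_per_query:
        for chunk_id, (semantic, keyword) in results.items():
            score = weighted_score(semantic, keyword)
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    # Highest score first; keep the top k candidates for reranking.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

Taking the maximum rather than the sum means a chunk is not rewarded merely for matching several paraphrases of the same question.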

Reranking: the initial set of retrieved chunks (k≈20) is then refined using a rerank step. A cross-encoder model (bge-reranker-large) evaluates the relevance of each retrieved chunk against the original question and selects the top k=7 chunks to be used as context for the LLM. This step ensures that the most pertinent information is prioritized for final response generation, even if it did not score highest on initial semantic similarity or keyword match.
Final LLM Prompt Generation: the refined context (k=7 chunks) is then combined with the original question to form the final LLM prompt. This prompt is carefully constructed to guide the LLM in generating a focused and accurate response based on the provided context, reducing the risk of hallucination.
Response Generation with Citation: a state-of-the-art reasoning model then processes the final prompt and the provided context to generate a response with citations. The LLM synthesizes the information from the context to formulate a coherent and accurate answer. Crucially, the response automatically includes citations linking back to the specific chunks in the original document(s) that support the generated answer.
Monitoring: the entire Query-Time RAG process, from initial query to final response generation, is continuously monitored using Langfuse for observability, performance, and quality analysis.
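A minimal sketch of the rerank and prompt-assembly steps follows. The `score` callable is a hypothetical stand-in for the cross-encoder; `overlap_score` is a toy word-overlap scorer used only to make the example runnable, and the prompt wording and `[chunk_id]` citation format are assumptions rather than the prompts used in PRINCE.

```python
from typing import Callable, List, Tuple

Chunk = Tuple[str, str]  # (chunk_id, text); the shape is an assumption


def rerank(question: str, chunks: List[Chunk],
           score: Callable[[str, str], float], top_k: int = 7) -> List[Chunk]:
    """Order chunks by relevance to the original question and keep top_k."""
    ordered = sorted(chunks, key=lambda c: score(question, c[1]), reverse=True)
    return ordered[:top_k]


def build_prompt(question: str, chunks: List[Chunk]) -> str:
    """Label every chunk with its id so the model can cite it."""
    context = "\n".join(f"[{cid}] {text}" for cid, text in chunks)
    return (
        "Answer using only the context below. "
        "Cite the chunk id in square brackets after each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def overlap_score(question: str, text: str) -> float:
    # Toy scorer: fraction of question words found in the chunk.
    # A real system would call a cross-encoder here instead.
    q = set(question.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0
```

Scoring against the original question, not the expanded variants, keeps the final selection anchored to what the user actually asked.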

Text-to-SQL for Structured Data

While RAG excels at unstructured data, queries requiring precise filtering, aggregation, or comparison of structured data points are better suited for Text-to-SQL. Examples include “Give me 50 example studies done on RAT” or retrieving specific numerical assay results including dosage groups. As shown in Figure 2, the Researcher Agent can intelligently decide to hand over such queries to the Text-to-SQL tool.

Figure 3: Text-to-SQL tool

The process for converting a natural language question into an executable SQL query and retrieving results involves several key steps:

Query Analysis and Intent Recognition: the user's natural language query is analyzed to understand the user's intent and identify the specific data points and filters being requested from the structured metadata.
Schema Understanding and Relevant Schema Selection: to accurately generate a SQL query, the LLM requires an understanding of the relevant database schema. For large and complex schemas, only the schema components relevant to the user's query are dynamically injected into the LLM's context. This reduces the complexity for the model and improves the accuracy of the generated SQL.
Dynamic Few-Shot Prompting for SQL Generation: converting complex natural language queries into a precise SQL dialect (in our case, Athena) can be challenging for LLMs. To address this, we employ dynamic few-shot prompting. A collection of carefully hand-picked examples, representing various complex query patterns and their corresponding correct SQL translations in the Athena dialect, is stored in a separate collection within our vector database. Based on the user's query, relevant examples are retrieved from this “semantic layer” using vector similarity search and included in the prompt to the LLM. This provides the LLM with in-context learning examples, guiding it to generate accurate SQL queries in the correct dialect. Continuous addition of new examples based on encountered challenges further improves the system's performance over time.
SQL Query Generation and Validation: a model with strong code generation capabilities, conditioned on the relevant schema information and dynamic few-shot examples, generates the corresponding SQL query. To ensure the LLM can accurately process the results and identify the correct rows for subsequent synthesis, certain essential columns, such as study ID and study title, are always included in the generated SELECT query. The generated query is then validated to ensure it adheres to allowed operations (e.g., only SELECT queries are permitted; DELETE, INSERT, or UPDATE queries are explicitly blocked for data integrity and security). Notably, an earlier iteration of this process included an LLM review step for generated SQL queries; however, this step was later removed because the reviewing LLM sometimes incorrectly flagged valid queries as erroneous, hindering efficiency without a commensurate gain in accuracy.
Query Execution and Result Limiting: the validated SQL query is executed against the structured metadata database in Amazon Athena. To prevent data flooding and manage response size, the system enforces a limit, fetching no more than 50 records at a time.
Error Handling and Iteration: if the SQL query execution is successful, the retrieved results (up to the specified limit) are returned and integrated into the overall response generation process. If the query fails due to syntax errors, schema issues, or other execution errors, the error message from the database, along with the generated query and the original context, is passed back to the same model. The LLM analyzes the error and the context to generate a corrected SQL query. This generate-and-execute loop is attempted up to 3 times before the tool gives up and reports a failure, potentially indicating an unresolvable query or a limitation in the model's ability to handle the specific request.
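The validation, row limit, and repair loop described above can be sketched as follows. The limits (50 rows, 3 attempts) and the SELECT-only rule come from the text. Everything else is an assumption for illustration: `generate` and `execute` are hypothetical callables standing in for the model and the Athena client, the keyword check is a deliberately simple regular expression, and feeding validation errors back to the model in the same way as database errors is a design choice of this sketch.

```python
import re
from typing import Callable, List, Optional

MAX_ROWS = 50       # row cap quoted in the text
MAX_ATTEMPTS = 3    # retry budget quoted in the text
_BLOCKED = re.compile(r"\b(insert|update|delete)\b", re.IGNORECASE)


def validate_select(sql: str) -> str:
    """Accept a single SELECT statement; reject anything else."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"select\b", statement, re.IGNORECASE):
        raise ValueError("only SELECT queries are allowed")
    if _BLOCKED.search(statement):
        # Simple keyword check; it can also reject harmless string literals.
        raise ValueError("write keyword found in query")
    return statement


def run_with_repair(question: str,
                    generate: Callable[[str, Optional[str], Optional[str]], str],
                    execute: Callable[[str], List[dict]]) -> List[dict]:
    """Generate, validate and run SQL, feeding any error back to the model."""
    previous_sql: Optional[str] = None
    error: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        sql = generate(question, previous_sql, error)
        try:
            rows = execute(validate_select(sql))
            return rows[:MAX_ROWS]  # never hand more than 50 rows downstream
        except Exception as exc:  # validation or database error
            previous_sql, error = sql, str(exc)
    raise RuntimeError(f"giving up after {MAX_ATTEMPTS} attempts: {error}")
```

A deterministic validator of this kind is cheap and cannot wrongly reject a query on stylistic grounds, which is consistent with the decision to drop the LLM review step.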

The Reflection Agent: Data Validation and Sufficiency

While the Think & Plan step provides process reflection, the Reflection Agent performs a complementary but distinct type of reflection: data reflection. This component evaluates whether the data retrieved from the various tools is sufficient and relevant to answer the user's question, a fundamentally different concern from whether the workflow itself is progressing correctly.

In multi-step agentic workflows, these two types of reflection serve different but equally important purposes. Process reflection (Think & Plan) ensures the agent is taking the right steps and making appropriate progress toward the goal. Data reflection (Reflection Agent) ensures that the information gathered through those steps is adequate to fulfill the user's request. Both are essential: an agent might execute a perfectly valid workflow (good process) but still retrieve insufficient data to answer the question, or conversely, might have access to sufficient data but fail to progress effectively through the workflow.

As illustrated in the research workflow diagram (Figure 2), after initial information retrieval and Think & Plan loops, the Reflection Agent is invoked when the Think & Plan step judges that the process has progressed far enough for the data to be evaluated. The Reflection Agent evaluates the sufficiency and relevance of the collected data by comparing the retrieved context against the user's original query and identifying potential gaps or missing information. If the gathered information is deemed insufficient to provide a complete response, the Reflection Agent generates specific follow-up questions designed to acquire the missing information. These follow-up questions are then handed back to the Think & Plan step, which initiates further retrieval steps to obtain more comprehensive results. This iterative cycle of data validation and further retrieval, driven by the Reflection Agent's questions, lets the system refine its search strategy based on the initial results. If the information is sufficient, the workflow proceeds to the next step.
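The data-reflection loop can be sketched as below. This is a simplified illustration: `retrieve` and `reflect` are hypothetical callables standing in for the retrieval tools and the Reflection Agent, the Think & Plan step is collapsed into the loop itself, and the `max_rounds` bound is an assumption (the text does not state a limit).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Verdict:
    sufficient: bool
    follow_ups: List[str] = field(default_factory=list)


def research_until_sufficient(question: str,
                              retrieve: Callable[[str], List[str]],
                              reflect: Callable[[str, List[str]], Verdict],
                              max_rounds: int = 3) -> List[str]:
    """Retrieve, check sufficiency, and retrieve again for any gaps."""
    evidence: List[str] = []
    pending = [question]
    for _ in range(max_rounds):
        for query in pending:
            evidence.extend(retrieve(query))
        # The reflection step sees only the original question and evidence.
        verdict = reflect(question, evidence)
        if verdict.sufficient or not verdict.follow_ups:
            break
        pending = verdict.follow_ups  # targeted follow-up questions
    return evidence
```

Bounding the number of rounds keeps the loop from running indefinitely when the underlying data simply does not contain the answer.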

The Writer Agent: Answer Synthesis and Formatting

Once the Researcher Agent has collected the relevant evidence from RAG and Text-to-SQL, the Writer Agent is responsible for turning that raw material into the final answer shown to the user. Its job is not to “discover” new information, but to synthesize the retrieved context, respect user instructions, and enforce PRINCE's quality constraints during generation.

The Writer Agent operates with a few non-negotiable rules. It must ground every claim in the supplied context and attach accurate citations back to the underlying chunks and study IDs, since verifiability is critical in a regulated environment. It is also responsible for honoring user-level formatting requirements (for example, tables, bullet points, or specific section structures) and for aligning with domain-specific answer standards used by the preclinical scientists.
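One way to make the grounding rule checkable is a post-generation pass over the draft. The sketch below is an assumption, not a description of PRINCE's checks: it supposes citations appear as `[chunk_id]` before the sentence-ending punctuation, and it flags sentences that carry no citation or cite an id that was never supplied to the Writer.

```python
import re
from typing import List, Set

CITATION = re.compile(r"\[([^\[\]]+)\]")  # assumed citation format: [chunk_id]


def uncited_or_unknown(answer: str, supplied_ids: Set[str]) -> List[str]:
    """Return sentences lacking a citation or citing an id not in the context."""
    problems: List[str] = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = set(CITATION.findall(sentence))
        if not cited or not cited <= supplied_ids:
            problems.append(sentence)
    return problems
```

A check like this confirms only that citations exist and point at supplied chunks; whether the cited chunk actually supports the claim still needs a faithfulness evaluation or human review.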

For more complex responses, such as multi-section summaries or partially filled regulatory templates, the architecture supports extending the Writer Agent with a short internal review loop. In this pattern, the Writer would first draft an answer, then a reviewing step would check for missing sections, inconsistent tables, or gaps relative to the original question, and may send targeted instructions back to the Writer to revise specific parts. This design enables a lightweight form of reflection focused on answer completeness and presentation, complementing the Reflection Agent's focus on data sufficiency earlier in the workflow. Importantly, all outputs from these regulatory drafting workflows are intended for expert review; final submissions are authored and approved by qualified personnel.

This gives PRINCE three complementary reflection loops. Process reflection checks whether the workflow is on the right path and helps catch bad trajectory, wrong tool choice, or poor sequencing. Data reflection checks whether the gathered evidence is sufficient and helps catch thin evidence, missing context, or gaps in coverage. Draft reflection checks whether the generated output is complete and helps catch missing sections, incomplete tables, or synthesis gaps.

Together, these agents form a practical context engineering pattern. The system does not simply keep adding more information to the prompt. It routes the right context to the right capability at the right time: planning context for Think & Plan, retrieval context for the Researcher, evidence context for the Reflection Agent, and synthesis context for the Writer. This plays out in concrete decisions throughout the system: the Text-to-SQL step injects only the schema components relevant to the current query rather than the full database schema; the Reflection Agent receives the original question alongside collected evidence to assess gaps, not the full workflow history; and the Writer Agent receives curated chunks with citation constraints, not raw retrieval output. Moving from a monolithic agent to this structured workflow meant each agent could be evaluated, debugged, and improved in isolation.
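The routing idea can be illustrated with a shared state object and one small projection function per consumer. All names here (`WorkflowState`, the field names, the three functions) are hypothetical and chosen for the example; the point is only that each component receives a deliberately narrow view of the state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class WorkflowState:
    question: str
    plan_history: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    schema: Dict[str, List[str]] = field(default_factory=dict)  # table -> columns


def sql_context(state: WorkflowState, relevant_tables: Set[str]) -> dict:
    """Text-to-SQL sees only the schema components relevant to the query."""
    subset = {t: cols for t, cols in state.schema.items() if t in relevant_tables}
    return {"question": state.question, "schema": subset}


def reflection_context(state: WorkflowState) -> dict:
    """Reflection sees the question and evidence, not the workflow history."""
    return {"question": state.question, "evidence": list(state.evidence)}


def writer_context(state: WorkflowState, curated: List[str]) -> dict:
    """The Writer sees curated chunks plus an explicit citation constraint."""
    return {
        "question": state.question,
        "chunks": list(curated),
        "rules": ["cite a chunk id for every claim"],
    }
```

Because each projection is an ordinary function, it can be unit-tested on its own, which is one concrete way the components become easier to evaluate in isolation.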

Building Trust in a Production LLM System

Building and maintaining user trust is paramount for the successful adoption of any AI system, particularly in a critical environment like preclinical drug discovery where decisions have significant implications. For a production LLM application, trust is not just about accuracy; it's also about reliability, transparency, and the ability for users to verify the information provided. Several mechanisms are integrated into PRINCE to achieve this:

Transparency and Explainability

Ensuring transparency and explainability is a critical aspect of PRINCE's design, fostering user trust and enabling verification of the generated responses. The system incorporates several mechanisms to achieve this:

Intermediate Steps and Transparency: given the iterative nature of the workflow and the potential time required to generate a final answer, maintaining transparency is crucial. The intermediate steps executed by the system during query processing, information retrieval, and reflection, including the queries formulated and the tools utilized, are displayed to the user. This provides visibility into the system's reasoning process and allows users to follow the steps taken to arrive at the final answer. Additionally, when relevant context (chunks) is identified, links to these source materials are presented on the screen, allowing users to see precisely which information was shortlisted and used to formulate the final response.
Factuality Verification through Citation: the system facilitates user verification of factuality through a robust citation mechanism. The generated answer is consistently accompanied by citations referencing the original source documents and structured metadata. These citations are directly linked to the context displayed to the user, enabling them to easily verify the accuracy of the claims made in the response and trace the information back to its origin. Users can hover over any sentence in the generated response to see the corresponding citation, which provides a link to PRINCE and to the source document, including the page number and the exact quote from the report used to support that part of the answer. This granular level of citation significantly enhances the credibility and trustworthiness of the system's output and simplifies the human review process.

Evaluation

Rigorous evaluation is fundamental to building and maintaining a reliable LLM application. PRINCE's performance and reliability are assessed through a combination of two types of evaluations: Dataset Evaluations and Live Traffic Evaluations.

Dataset Evaluations: conducted whenever significant changes are made to the core workflow, prompts, or underlying models, these evaluations utilize curated datasets with pre-defined reference answers, meticulously prepared by subject matter experts and stored in Langfuse. A custom evaluation script processes each question and compares the generated response against the reference answer, yielding quantitative metrics such as Faithfulness (degree to which the answer is supported by context), Answer Relevancy (how well the answer addresses the query), Context Relevancy (relevance of retrieved chunks), Answer Accuracy (comparison to ground truth), and Semantic Similarity with Reference (semantic similarity to reference answer). Given the agentic nature of the system, applying appropriate evaluation metrics at different workflow stages, analogous to a testing pyramid, is crucial in addition to evaluating overall end-to-end performance.
Live Traffic Evaluations: performed daily as a batch job on real user queries from the live environment (without pre-defined reference answers), these evaluations provide valuable insights into real-world performance. Metrics such as Faithfulness and Answer Relevancy can still be assessed. Live traffic evaluations are essential for monitoring system behavior, identifying potential issues like hallucinations in production, and understanding performance on diverse live queries.
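The shape of a dataset evaluation run can be sketched as below. This is not the custom script described above and does not use the Langfuse API: the dataset is a plain list of dicts, `answer_fn` is a hypothetical callable wrapping the system under test, and `token_f1` is a toy lexical metric standing in for the model-based metrics named in the text (Faithfulness, Answer Relevancy, and so on).

```python
from statistics import mean
from typing import Callable, Dict, List

Metric = Callable[[str, str], float]  # (generated, reference) -> score in [0, 1]


def token_f1(answer: str, reference: str) -> float:
    """Toy lexical overlap score; a stand-in for a semantic metric."""
    a, r = answer.lower().split(), reference.lower().split()
    common = sum(min(a.count(tok), r.count(tok)) for tok in set(a))
    if common == 0:
        return 0.0
    precision, recall = common / len(a), common / len(r)
    return 2 * precision * recall / (precision + recall)


def evaluate(dataset: List[Dict[str, str]],
             answer_fn: Callable[[str], str],
             metrics: Dict[str, Metric]) -> Dict[str, float]:
    """Run every question through the system and average each metric."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    scores: Dict[str, List[float]] = {name: [] for name in metrics}
    for item in dataset:
        generated = answer_fn(item["question"])
        for name, metric in metrics.items():
            scores[name].append(metric(generated, item["reference"]))
    return {name: mean(values) for name, values in scores.items()}
```

Keeping metrics as pluggable callables makes it straightforward to apply different ones at different workflow stages, in the spirit of the testing-pyramid analogy.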

Monitoring

Continuous monitoring of the system's performance and outputs is essential for proactive identification and resolution of issues in a production environment. Using platforms like Langfuse, we continuously monitor PRINCE to identify potential biases, errors, or areas for improvement, ensuring the reliability and safety of the system's responses.

Engineering for Resilience: Error Handling and Recovery

Given the complexity of the multi-step workflow inherent in PRINCE, robust error handling and recovery mechanisms are critical to ensure the system's reliability and provide a seamless user experience. The system is engineered to recover gracefully from failures at various stages without requiring a complete restart of the entire workflow.

Key aspects of our error handling and recovery approach include:

State Persistence: the state of the entire workflow graph is persistently stored, enabling the system to resume execution directly from the failed node. This is achieved by storing the Agent State, representing the progress of the agents through the workflow, in Postgres. Other aspects of the application state, such as logs, intermediate steps, and citations, are stored in DynamoDB. This separation and persistence of state are crucial for achieving robustness in a stateful agentic system.
Built-in Retries: the system is configured with built-in retries at various steps in the workflow. If a particular step encounters a transient failure, the system will automatically attempt to re-execute it a predefined number of times before signaling a more permanent error.
User-Initiated Retries: in addition to automated retries, users have the option to manually retry a failed query through the interface. When a user initiates a retry, the system leverages the persisted state to continue the workflow directly from the point of failure, intelligently skipping the steps that were successfully completed in the previous attempt. This significantly improves user experience and saves computational resources.
Framework-Level Support: the error recovery mechanisms are significantly supported by the underlying framework, LangGraph, which offers solid built-in capabilities for managing workflow state and handling errors within the graph structure. This provides a robust foundation for building resilient agentic workflows.
LLM Fallbacks: to enhance reliability and mitigate issues related to model availability or performance, the system incorporates custom LLM fallback handling. If a call to a primary LLM provider or a specific model fails after a few retries, the system automatically falls back to an alternative LLM from a different provider. This mechanism is crucial for maintaining system availability and responsiveness, especially as platform downtimes for external services are outside of our direct control.
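The retry-then-fallback behaviour can be sketched with plain callables. This is an illustrative pattern rather than PRINCE's custom handler: each entry in `providers` is a hypothetical function wrapping one LLM provider, and the retry count and exponential backoff values are assumptions.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_fallback(prompt: str,
                       providers: Sequence[Callable[[str], str]],
                       retries_per_provider: int = 2,
                       backoff_seconds: float = 0.5) -> str:
    """Try each provider in order, retrying a few times before moving on."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return provider(prompt)
            except Exception as exc:  # treated as transient in this sketch
                last_error = exc
                time.sleep(backoff_seconds * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error
```

Ordering the list with the preferred model first keeps normal traffic on the primary provider while still bounding the total number of attempts.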

This comprehensive approach to error handling and recovery minimizes the impact of transient failures, reduces the need for users to restart complex queries from scratch, and contributes to cost and latency savings by avoiding redundant execution of successful steps and LLM calls, all of which are essential for a production-ready system.

These mechanisms are harness engineering in practice. The LangGraph workflow acts as the control layer around the agents: it defines which component can act, which tools it can use, where the workflow can pause, how failures are retried, how state is persisted, and when the system should move from research to reflection to writing. This harness makes the system less opaque and more reliable than an unconstrained autonomous agent. It gives the application clear control points for recovery, inspection, evaluation, and human intervention.
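The resume-from-failure behaviour that the harness provides can be illustrated without any framework. The sketch below deliberately does not use the LangGraph API: the `checkpoints` dict is an in-memory stand-in for the Postgres-backed agent state, and the node list, `run_id`, and state shape are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Node = Tuple[str, Callable[[dict], dict]]  # (node name, step function)


def run_workflow(nodes: List[Node], state: dict,
                 checkpoints: Dict[str, dict], run_id: str) -> dict:
    """Run nodes in order, skipping those already completed for run_id."""
    saved = checkpoints.get(run_id)
    if saved:
        # Resume: restore the last good state and the list of finished nodes.
        state = dict(saved["state"])
        done = list(saved["done"])
    else:
        done = []
    for name, step in nodes:
        if name in done:
            continue
        state = step(state)  # may raise; the previous checkpoint stays intact
        done.append(name)
        checkpoints[run_id] = {"state": dict(state), "done": list(done)}
    return state
```

Calling `run_workflow` again with the same `run_id` after a failure re-enters at the failed node, so completed retrieval and LLM calls are not repeated.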

Enhancing Data Quality: Named Entity Recognition and Annotation

The accuracy and completeness of the structured metadata in Amazon Athena are critical for the performance of the Text-to-SQL component and overall data discoverability within PRINCE. Due to historical data migrations and varied annotation practices across different laboratories and systems over Bayer's extensive operational history, the metadata can sometimes be incomplete, missing, or incorrect.

To address this challenge and continuously enhance the quality of the structured metadata, we have developed a utility system that employs Named Entity Recognition (NER) to extract and create accurate annotations directly from the study PDFs. This system is designed to read the textual content of the preclinical reports and identify key entities and associated information that should be represented in the structured metadata.

The process involves:

Processing study PDFs to extract text and identify relevant entities (e.g., study IDs, compound names, species, routes of administration, dosage information, clinical findings, etc.).
Generating structured annotations based on the identified entities and their relationships within the text.

We are actively working on integrating this utility system into our data pipelines to automatically correct and enrich the data within the Amazon Athena database. The system's performance in generating accurate annotations has been evaluated against curated datasets, demonstrating promising results. To manage the integration of these annotations into the production database, we are developing an evaluation system that provides a confidence score for each extracted field. Fields with a high confidence score will be automatically used to update the corresponding entries in Amazon Athena. Fields with lower confidence scores will be quarantined and flagged for human review and intervention, ensuring data accuracy while leveraging automation. This approach aims to continuously improve the quality of the structured metadata, making it a more reliable source of information for PRINCE and other downstream applications.
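The confidence-based routing described above can be sketched in a few lines. The threshold value and the data shape are assumptions made for illustration; the text does not state what cut-off the evaluation system will use.

```python
from typing import Dict, Tuple

AUTO_APPLY_THRESHOLD = 0.9  # hypothetical cut-off, not taken from the text


def route_annotations(extracted: Dict[str, Tuple[str, float]],
                      threshold: float = AUTO_APPLY_THRESHOLD
                      ) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split extracted fields into auto-apply and human-review sets.

    `extracted` maps a field name to (value, confidence score in [0, 1]).
    """
    auto_apply: Dict[str, str] = {}
    needs_review: Dict[str, str] = {}
    for field_name, (value, confidence) in extracted.items():
        if confidence >= threshold:
            auto_apply[field_name] = value
        else:
            needs_review[field_name] = value  # quarantined for an expert
    return auto_apply, needs_review
```

In practice the threshold would likely be tuned per field against the curated datasets, since a wrong species is costlier than a wrong free-text note.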

The Journey Continues: Iterative Development

PRINCE has been available to end-users since early 2024, with the agentic integration introduced later that year. This has been crucial for gathering real-world feedback and driving iterative development. A key principle guiding our development has been the understanding that building a production-ready LLM application is an iterative process; we don't wait for features to be absolutely perfect before seeking user feedback. Instead, we prioritize delivering value early and continuously refining the system based on real-world usage.

In the initial stages, our focus was squarely on achieving the desired accuracy and performance for core functionalities, even if it meant incurring higher costs. We recognized that optimizing for cost prematurely could compromise the system's effectiveness and hinder user adoption. Only after achieving the desired level of accuracy and performance did we begin to focus on cost optimization, ensuring that efficiency gains did not negatively impact the user experience or the quality of the results.

The development of PRINCE follows a continuous, iterative process. User feedback, ongoing monitoring data, and insights from expert scientists are continuously fed back into the development cycle, leading to refinements in the architecture, retrieval techniques, agent behaviors, and user interface to enhance performance, usability, and ultimately, scientific impact.

Conclusion

Building a production-ready LLM application in a complex enterprise environment like preclinical drug discovery is a journey marked by significant technical and engineering challenges. The PRINCE case study demonstrates that by combining robust data infrastructure, sophisticated information retrieval techniques like RAG and Text-to-SQL, and an intelligent multi-agent orchestration system, it is possible to unlock valuable insights from vast, previously inaccessible data repositories.

Our experience highlights the critical importance of focusing on engineering for reliability, including robust error handling, state persistence, and LLM fallbacks. Furthermore, building user trust is paramount, achieved through transparency in the workflow, clear explainability via granular citations, and continuous evaluation and monitoring of the system's performance.

PRINCE has already shown promising results in enhancing data accessibility and research efficiency at Bayer, transforming how scientists interact with preclinical information. This is not the end of the journey, but rather a significant step towards creating truly intelligent research assistants.

The broader lesson from PRINCE is that production-ready agentic AI is not only about better models or better prompts. Reliability comes from engineering both the context the model sees and the harness within which the model acts. Context engineering helped ensure that each model had the right information, and only the right information, at the right stage of the workflow. Harness engineering helped ensure that the workflow remained bounded, observable, recoverable, and suitable for a regulated research environment.

As model capabilities improve, some parts of today's harness may become thinner or move into native model capabilities. But in enterprise research systems, especially where trust, traceability, and reviewability matter, explicit control over context, workflow state, recovery, reflection, and verification remains essential.

We hope this overview provides valuable insights into the practical considerations and technical depth required to build and productionise LLM applications in a regulated and data-rich domain.

Acknowledgments

The author gratefully acknowledges the invaluable contributions of Adam Zalewski, Annika Kreuchwig, Carlos Henrique Vieira-Vieira, Jobst Löffler, and Jonas Münch from the Bayer team.

The author also thanks Bala Hari, Balu Saravanan, Bernice Mercy Sharon M, Deril X, Jigar Jani, Manibalan Baskaran, Nafis Aslam, Priyalakshmi R, Rohit Bansal, Sai Prabhanj Turaga, Saksham Srivastava, Shivam Sehgal, Sowmya Adimoulame, and Subhashini Rajamani from the Thoughtworks team for their contributions to this work.

The author used AI assistance during the writing of this article. AI tools were used for brainstorming ideas, creating outlines, and reviewing drafts to polish language and improve clarity.

Disclaimer

All activities described conform to Bayer's information classification, data governance, and external communication policies, and do not constitute claims regarding regulatory decision-making or product performance.

Significant Revisions

16 June 2026: published