DataAgents: How we turned 9 months of analysis into 10 days
An engineering deep dive into the pattern that changed how we approach large-scale classification problems.
Every engineering team has that project sitting in the backlog. The one where someone says, “We really should analyze all of these,” and the room goes quiet. Everyone knows what “all of these” means: hundreds of entities, complex rules, no clear starting point.
For us, it was cloud resource dormancy detection.
We had around 350 distinct cloud resource types spread across AWS, Azure and Google Cloud Platform (GCP). Each type has different behavior patterns. An idle EC2 instance looks nothing like a dormant Amazon S3 bucket or an unattached Elastic IP. Detecting dormancy required understanding what “active” means for each specific resource, then writing detection logic that wouldn’t flood operations teams with false positives.
Traditional estimate: 6-9 months of expert analysis.
Actual time: 10 days.
Here’s how we did it and, more importantly, the reusable pattern behind it.
The problem with large-scale analysis
Before we get to the solution, it’s worth naming the pattern that makes these projects so painful. It shows up everywhere:
- Cloud resources - Which of our 350 resource types are dormant?
- Data governance - Which of our 800 tables have quality issues we should monitor?
- Security - Which of our access entitlements violate least-privilege principles?
- Compliance - Which of our 500 policy controls need remediation?
In every case, the structure is similar: a large catalog of heterogeneous entities, entity-specific rules that don’t generalize, unknown priorities and a high cost for getting it wrong.
The traditional approach is not just slow. It’s structurally limited. You get coverage of the “obvious” cases, inconsistent logic across analysts, and tribal knowledge that evaporates when people leave. What you need is something that can assess each entity, apply consistent criteria, prioritize by confidence and document its reasoning. The DataAgents pattern gets you there.
The data foundation: Why quality data comes first
The quality of your analysis is bounded by the quality of your data.
An AI agent can reason only over what it’s given. If your data is incomplete, inconsistently structured or untrustworthy, the agent’s outputs will be too, just faster and at greater scale.
For our cloud analysis, we had a genuine advantage: an authoritative data product called cloud-asset-data-product. It contains a daily snapshot of every cloud resource across all providers, with a standardized schema. What made this data product valuable was data quality and rich metadata: formal field definitions, primary-key/foreign-key (PK-FK) relationships for cross-entity reasoning and lineage tracking for provenance and freshness. Quality data isn’t optional. It’s what makes AI analysis trustworthy. The schema includes fields such as:
- resource_id - Unique composite identifier
- resource_type - EC2 instance, S3 bucket, EIP, etc.
- service_id - Service grouping
- business_application_name - Ownership
- data_structured_tag - JSON: configuration, tags, state flags
- resource_updated_utc_timestamp - Last change timestamp
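To make the schema concrete, here is a minimal Python sketch of one snapshot row. The field names come from the list above. The Python types, the keys inside the JSON payload and the example values are assumptions made for illustration, not the actual definition of the data product.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class CloudAssetRecord:
    """One row of the daily snapshot, using the field names listed above."""
    resource_id: str
    resource_type: str
    service_id: str
    business_application_name: str
    data_structured_tag: str  # JSON string: configuration, tags, state flags
    resource_updated_utc_timestamp: datetime

    def tag(self) -> dict[str, Any]:
        """Parse the JSON payload; the keys inside it are assumed here."""
        return json.loads(self.data_structured_tag)

    def days_since_update(self, now: datetime) -> int:
        """Whole days since the last recorded change."""
        return (now - self.resource_updated_utc_timestamp).days


# Hypothetical example row; every value is invented for illustration.
example = CloudAssetRecord(
    resource_id="aws:123456789012:us-east-1:i-0abc",
    resource_type="EC2 instance",
    service_id="compute",
    business_application_name="billing-api",
    data_structured_tag='{"state": "stopped", "tags": {"env": "dev"}}',
    resource_updated_utc_timestamp=datetime(2025, 1, 1, tzinfo=timezone.utc),
)

print(example.tag()["state"])
print(example.days_since_update(datetime(2025, 4, 11, tzinfo=timezone.utc)))
```

Keeping the state flags in one JSON field and the last-change time in a typed timestamp means both kinds of signal discussed later (explicit state and age) can be read from a single row.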
The principle: the richer and more standardized your data product, the more an AI agent can do with it. Without quality data, AI analysis is unreliable. This is not a caveat. It’s the prerequisite.
The DataAgents pattern
Once you have the data foundation, the pattern is straightforward:
Authoritative data product + AI agent = DataAgent
A DataAgent is not just “AI doing analysis.” It’s a structured combination of:
- An authoritative data source: the single source of truth for your domain
- An AI agent that understands domain behavior, applies entity-specific rules and generates confidence-rated outputs with documented reasoning
- A human-AI validation loop that catches errors and guides refinement
The output is not a spreadsheet of results. It’s a self-documenting artifact: detection logic, confidence classifications, false-positive risk assessments and plain-English reasoning for every entity.
The three-phase process
Phase 1: Broad assessment
Input the full entity catalog. Ask the agent to categorize each resource type by analysis feasibility:
- Config-detectable - Dormancy is detectable from configuration data alone
- Needs telemetry - Reliable detection requires additional usage signals
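The outcome of this phase can be pictured as a simple partition of the catalog. The sketch below assumes the agent has already returned one feasibility label per resource type; the `Feasibility` enum, the `group_by_feasibility` helper and the three sample assignments are hypothetical names and values, used only to show the shape of the result.

```python
from collections import defaultdict
from enum import Enum


class Feasibility(Enum):
    CONFIG_DETECTABLE = "config-detectable"
    NEEDS_TELEMETRY = "needs-telemetry"


def group_by_feasibility(
    assessments: dict[str, Feasibility],
) -> dict[Feasibility, list[str]]:
    """Partition resource types by the category the agent assigned."""
    groups: dict[Feasibility, list[str]] = defaultdict(list)
    # Sort so the output order is stable from run to run.
    for resource_type, feasibility in sorted(assessments.items()):
        groups[feasibility].append(resource_type)
    return dict(groups)


# Hypothetical agent output for three resource types.
assessments = {
    "EC2 instance": Feasibility.CONFIG_DETECTABLE,
    "Elastic IP": Feasibility.CONFIG_DETECTABLE,
    "S3 bucket": Feasibility.NEEDS_TELEMETRY,
}
groups = group_by_feasibility(assessments)

for feasibility, resource_types in groups.items():
    print(feasibility.value, resource_types)
```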
Phase 2: Classification and logic generation
For each entity, the agent answers three questions:
- What indicates the target state (dormancy)? → Detection logic
- How reliable is this signal? → Confidence level
- What’s the false-positive risk? → Risk assessment
Output: Spark SQL detection queries and documented reasoning for every resource type.
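One way to picture a single Phase 2 output is a record holding the three answers plus the query text. The sketch below is an illustration under stated assumptions, not the queries we actually generated: the table name `cloud_asset_data_product`, the `$.state` JSON path, the 90-day default and the `DetectionRule` and `stopped_state_rule` names are all hypothetical. `get_json_object`, `datediff` and `current_date` are standard Spark SQL functions; the query is only built as text here, not executed.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


@dataclass(frozen=True)
class DetectionRule:
    """Phase 2 output for one resource type."""
    resource_type: str
    detection_sql: str        # Spark SQL text, not executed in this sketch
    confidence: Confidence
    false_positive_risk: str
    reasoning: str


def stopped_state_rule(resource_type: str, min_age_days: int = 90) -> DetectionRule:
    """Build a state-based rule for a type that exposes a 'stopped' state."""
    # resource_type comes from the fixed catalog, not from user input, so
    # plain string formatting is acceptable for this sketch.
    sql = (
        "SELECT resource_id, business_application_name\n"
        "FROM cloud_asset_data_product\n"  # assumed table name
        f"WHERE resource_type = '{resource_type}'\n"
        "  AND get_json_object(data_structured_tag, '$.state') = 'stopped'\n"
        "  AND datediff(current_date(), resource_updated_utc_timestamp) "
        f">= {min_age_days}"
    )
    return DetectionRule(
        resource_type=resource_type,
        detection_sql=sql,
        confidence=Confidence.HIGH,
        false_positive_risk="<5%",
        reasoning=(
            "Resource is in an explicit dormant state (stopped) and has "
            f"not been updated for {min_age_days}+ days."
        ),
    )


rule = stopped_state_rule("EC2 instance")
print(rule.detection_sql)
print(rule.confidence.value, "-", rule.reasoning)
```

Storing the reasoning next to the query keeps the explanation attached to the logic it justifies.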
Phase 3: Deep validation (the game-changer)
This is where the approach separates itself from “run the AI and ship the output.”
Human: “Double-check all MEDIUM confidence entities one by one.”
The agent reviewed every MEDIUM classification individually, not as a batch. Some were upgraded to HIGH. Some were moved to Phase 2 (here meaning the needs-telemetry group, not the classification step above) because they require telemetry. Then it did the same for LOW confidence. Every entity was reviewed and every decision documented.
Without Phase 3, you’re stuck at, “Let’s try some and see what happens.” With it, you get systematic validation of every entity in your catalog.
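The one-by-one review can be sketched as a loop over a fixed queue, MEDIUM first and then LOW, with every decision appended to a per-entity history. Everything here is hypothetical: the `Entry` record, the `deep_validate` function, the `PHASE_2` label and the stub reviewer stand in for the real agent and human reviewer, whose interface this article does not describe.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Entry:
    resource_type: str
    confidence: str  # "HIGH", "MEDIUM", "LOW" or "PHASE_2"
    history: list[str] = field(default_factory=list)


# A reviewer takes one entry and returns (new_label, note).
Reviewer = Callable[[Entry], tuple[str, str]]


def deep_validate(
    entries: list[Entry],
    review: Reviewer,
    order: tuple[str, ...] = ("MEDIUM", "LOW"),
) -> list[str]:
    """Review MEDIUM entries one by one, then LOW, logging every decision."""
    # Build the queue up front so each queued entry is reviewed exactly
    # once, even if its label changes during the pass.
    queue = [e for level in order for e in entries if e.confidence == level]
    for entry in queue:
        new_label, note = review(entry)
        entry.history.append(f"{entry.confidence} -> {new_label}: {note}")
        entry.confidence = new_label
    return [e.resource_type for e in queue]


def stub_review(entry: Entry) -> tuple[str, str]:
    """Stand-in reviewer with canned decisions for two placeholder types."""
    decisions = {
        "type-b": ("HIGH", "explicit state flag found"),
        "type-d": ("PHASE_2", "needs usage telemetry"),
    }
    return decisions.get(entry.resource_type, (entry.confidence, "confirmed"))


entries = [
    Entry("type-a", "HIGH"),
    Entry("type-b", "MEDIUM"),
    Entry("type-c", "LOW"),
    Entry("type-d", "MEDIUM"),
]
reviewed = deep_validate(entries, stub_review)

for entry in entries:
    print(entry.resource_type, entry.confidence, entry.history)
```

Entries that start at HIGH are skipped by this loop, which matches the prompt quoted above; a stricter process could add "HIGH" to `order`.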
What the agent discovered that humans would miss
The most valuable outputs weren’t the easy HIGH confidence cases; those are obvious. The value was in what systematic analysis uncovered across 350 types.
The false-positive trap. Some resource types look dormant by age but are actively used without any detectable configuration changes. An S3 bucket with no config updates for 90 days might be accessed millions of times per day, because access patterns don’t touch the config. Age-based detection on these types produces false-positive rates of 40-60%. The agent identified these and moved them to Phase 2.
State-based vs. age-based detection. The agent surfaced a clean framework from the analysis: high-confidence detection uses explicit state indicators, a binary state (stopped, unattached, disabled) that definitively signals dormancy. Low-confidence detection relies only on timestamps.
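The state-versus-age framework reduces to a small decision rule. This is a sketch under assumptions: the three dormant states are the ones named above, the 90-day threshold is borrowed from the examples in this article, and the function name and return labels are hypothetical. The text does not say how MEDIUM confidence is derived, so the sketch leaves it out.

```python
from typing import Optional

# Explicit dormant states named in the text.
DORMANT_STATES = frozenset({"stopped", "unattached", "disabled"})


def detection_confidence(
    state: Optional[str],
    days_since_update: int,
    min_age_days: int = 90,
) -> Optional[str]:
    """Return "HIGH", "LOW", or None when nothing suggests dormancy.

    state is the explicit state flag, or None if the resource type
    exposes no state indicator at all.
    """
    if days_since_update < min_age_days:
        return None  # changed too recently to flag
    if state is None:
        return "LOW"  # age is the only signal available
    if state in DORMANT_STATES:
        return "HIGH"  # explicit dormant state plus age
    return None  # explicit state says the resource is in use


print(detection_confidence("stopped", 120))
print(detection_confidence(None, 120))
print(detection_confidence("running", 400))
```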
Disproportionate value concentration. Just 12 HIGH confidence resource types account for 30-40% of total dormancy savings, from 3.6% of resource types. Without systematic analysis of all 350, you’d find some of these, but you’d miss others.
The audit trail: Every decision documented
One of the underappreciated outputs of this approach is the reasoning column. Every entity gets a plain-English explanation of why it received its classification:
▎ HIGH Confidence: “Resource is in an explicit dormant state (stopped) and has not been updated for 90+ days. State-based detection with <5% false-positive rate. Automation-ready.”
▎ Phase 2 Reclassification: “This resource type can be actively used without configuration changes, resulting in 40-60% false-positive rate with age-based detection. Requires usage telemetry for reliable detection.”
▎ LOW Confidence: “No state indicators available. Age-based detection only; active entities may qualify. Manual owner review required before action.”
This is not a nice-to-have. Months later, when someone asks, “Why does this logic work this way,” the answer is in the output. When a developer is assigned to maintain the detection queries, the context is self-documenting. When stakeholders ask why a resource type is flagged, the reasoning is already written.
Without AI-generated documentation, these explanations either don’t exist or live in someone’s head.
Of course, a human has to prompt the agent to write this reasoning into a dedicated output field.
Human-AI partnership: The part that actually makes it work
AI made errors. We should be clear about this.
- Field name casing was wrong on several generated queries.
- Column names in some detection logic referenced fields that didn’t exist in the schema.
- A handful of confidence levels were over-optimistic before the deep-dive review.
Every one of these errors was caught by human validation before production.
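The first two error types lend themselves to a mechanical check that supports the human reviewer. The sketch below is one possible check, not the validation we ran: it assumes the column names referenced by a generated query have already been extracted into a list (parsing SQL is out of scope here), and the `check_columns` helper and the sample inputs are hypothetical. The schema set uses the fields listed earlier in this article.

```python
SCHEMA_FIELDS = frozenset({
    "resource_id",
    "resource_type",
    "service_id",
    "business_application_name",
    "data_structured_tag",
    "resource_updated_utc_timestamp",
})


def check_columns(
    referenced: list[str],
    schema: frozenset[str] = SCHEMA_FIELDS,
) -> dict[str, list[str]]:
    """Report columns with wrong casing and columns missing from the schema."""
    by_lower = {name.lower(): name for name in schema}
    report: dict[str, list[str]] = {"wrong_case": [], "unknown": []}
    for column in referenced:
        if column in schema:
            continue  # exact match, nothing to report
        canonical = by_lower.get(column.lower())
        if canonical is not None:
            report["wrong_case"].append(f"{column} -> {canonical}")
        else:
            report["unknown"].append(column)
    return report


# Hypothetical columns pulled from one generated query.
report = check_columns(["resource_id", "Resource_Type", "last_access_time"])
print(report)
```

A check like this catches naming mistakes cheaply, but over-optimistic confidence levels still need a reviewer who knows the domain.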
AI contributes speed, pattern recognition and consistency at scale. Humans contribute strategic direction, quality control and domain validation. The 18-27x speed improvement (6-9 months, or roughly 180-270 days, against 10 days) is the net result with human validation included, not a figure that ignores it.
Where this pattern applies
The DataAgents pattern works for any domain with these properties:
- A large catalog of heterogeneous entities
- Entity-specific rules and thresholds (one-size-fits-all doesn’t work)
- An authoritative data product/source with a standardized schema
- A need for confidence-based prioritization and documented reasoning
Beyond cloud resources, the same pattern applies to:
- Data and governance - Generate data quality rules across hundreds of tables. Classify PII and sensitive data at scale. Detect schema drift with documented rationale.
- Risk and compliance - Detect policy violations across entity catalogs. Review access entitlements. Map regulatory controls to technical implementations.
- Security - Assess security posture across resource configurations. Identify misconfigured services with ranked confidence.
The pattern is reusable. The data product changes. The agent changes. The output structure adapts. But the three-phase process (broad assessment, classification with logic generation, deep validation) applies every time.
What you need to try this
Before you start:
- Identify your highest-value analysis problem
- Confirm you have an authoritative data product/source with standardized schema for that domain
- Define what “target state” looks like for your entities (what does dormant/at-risk/non-compliant mean?)
The minimum viable setup:
- A data product/source you trust
- An AI agent with enough context about your domain (schema, sample data, domain documentation)
- A human validator who knows the domain and can challenge the outputs
The investment: 10 days of focused work for something that would otherwise take up to 9 months. Most of that time is in Phase 3: the deep validation loop. Don’t skip it.
The bottom line
AI doesn’t replace human expertise. It amplifies it.
What changed with DataAgents is that humans can now spend their time on judgment calls and validation instead of manually analyzing entity 47 through entity 350. The AI capability is available. What’s often missing is the structured data foundation that makes it reliable.
If you have that foundation, you have more analytical capability available to you right now than you probably realize.
Originally published at https://www.capitalone.com.
This blog was authored by Ram Manohar Bheemana, Senior Lead Data Engineer, Cloud Radar. Ram Bheemana is a Senior Data Engineer at Capital One, where he builds cloud-scale data products that give engineers and business teams real-time visibility into cloud resources across AWS, Azure and GCP. His recent work includes developing the DataAgents pattern, an approach that pairs authoritative data products with AI agents to automate large-scale analysis tasks that once took months, and pioneering Spark Streaming architectures that have delivered over a million dollars in annual cost savings. Ram is passionate about the intersection of data engineering and AI: not just using AI as a tool, but rethinking how data pipelines and intelligent agents can work together to solve problems at enterprise scale. Outside of work, he enjoys mentoring engineers, contributing to community initiatives and staying curious about what’s next in the data and AI space.
DISCLOSURE STATEMENT: © 2026 Capital One. Opinions are those of the individual author and are not necessarily those of Capital One. Unless noted otherwise, Capital One is not affiliated with, nor endorsed by, any third parties mentioned and is not responsible for the content or privacy policies of any linked third-party sites. Any trademarks and other intellectual property used or displayed are property of their respective owners.

