Building AI Investigation Agents: Lessons From Automating Root Cause Analysis at Scale

I’ve spent countless hours doing what engineers euphemistically call “investigations”: the tedious process of tracing through logs, correlating data across systems, and trying to figure out why something broke. It’s detective work, and it’s expensive.

Recently, I’ve been exploring how to build an AI investigation agent that can automate this process. Not a chatbot that summarizes data, but a genuine investigative system that follows evidence, tests hypotheses, and produces actionable root cause analysis at scale.

What I’ve learned has challenged many assumptions I had about how to apply LLMs to complex analytical problems. This post shares the architecture patterns, technical decisions, and hard-won lessons from building investigation systems that actually work in production.

The Problem With Operational Investigations

Every organization has investigation workflows buried somewhere in their operations. They look something like this:

  • Logistics: Why was this shipment late? What caused the delay?
  • E-commerce: Why is this product showing as unavailable when we have inventory?
  • Payments: Why did this invoice fail to process?
  • Infrastructure: What’s causing this service degradation?

These investigations share two characteristics that make them deceptively difficult:

First, they require deep domain knowledge. You can’t investigate a delivery failure without understanding how routing works, what labor constraints exist, how weather affects operations, and dozens of other domain-specific factors. This knowledge often lives in tribal memory, scattered wiki pages, and the heads of senior operators.

Second, they require navigating large amounts of data. A single investigation might touch 10+ data sources, hundreds of parameters, and thousands of rows. The investigator needs to know which data matters, how to query it, and how to interpret the results.

Today, these investigations are done by humans. Expert analysts spend 30-60 minutes per case, manually querying databases, cross-referencing documentation, and writing up findings. At the scale of thousands of investigations per day, this becomes enormously expensive.

The question I became obsessed with: can we build an AI investigation agent that handles this automatically?

Why Standard RAG Approaches Fail

My first instinct was to throw RAG at the problem. Index all the documentation, give the LLM access to the databases, and let it figure things out.

This doesn’t work. Here’s why:

Documentation is a mess. In any real organization, wiki pages are outdated, conflicting, and incomplete. SOPs assume context that isn’t written down. Critical knowledge lives only in experts’ heads. When you dump all of this into a vector database, you get an incoherent mess that confuses the model more than it helps.

Data is overwhelming. Even a focused investigation might involve 200+ parameters across multiple tables. You can’t just stuff all of this into context and expect coherent analysis. The model needs a strategy for navigating the data, not just access to it.

Investigations require iteration. Root cause analysis isn’t a single query; it’s a process. You form hypotheses, gather evidence, refine your understanding, and converge on an answer. Standard RAG pipelines are designed for question-answering, not iterative reasoning.

Consistency matters more than creativity. When you’re running thousands of investigations, you need reliable, reproducible results. An investigation system that gives different answers to the same problem on different runs is useless, no matter how clever any individual answer might be.

These realizations led me toward a very different architecture.

The Core Architecture: Pseudo-Reinforcement Learning

The key insight that unlocked everything: frame investigation as an optimization problem, not a generation problem.

Instead of asking the model to generate an answer, we ask it to maximize a confidence score through iterative exploration. Each step in the investigation should bring us closer to a well-supported conclusion.

Here’s how a single investigation step works:

  1. Thinking and reflection: The agent reviews its current understanding, identifies gaps, and plans next actions
  2. Journaling: The agent records its reasoning, hypotheses, and evidence in a structured format
  3. Tool use: The agent executes queries, retrieves data, or consults sub-agents
  4. Evidence integration: New data is incorporated into the investigation
  5. Confidence scoring: A composite score is calculated based on multiple factors

The confidence score has sub-metrics covering things like:

  • Data quality and completeness
  • Hypothesis support strength
  • Cross-validation status
  • Assumption count
  • Knowledge gaps

This scoring mechanism creates emergent behavior. If data quality is dragging down the score, the agent focuses on cross-validation. If hypotheses aren’t well-supported, it gathers more evidence. The agent isn’t following a rigid script; it’s responding to the actual state of the investigation.

The process continues until the score converges (typically 5-8 iterations) or hits a maximum step limit. Crucially, if the score can’t reach a high threshold, the agent explicitly reports why: what data was missing, what assumptions it had to make, and what uncertainties remain.
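
To make this concrete, here’s a simplified sketch of the control loop in Python. The names, weights, and the journal interface (record, integrate) are illustrative placeholders rather than the production implementation, but the shape is the same: iterate, score, and stop when the composite confidence converges or the step budget runs out.

```python
from dataclasses import dataclass

@dataclass
class ConfidenceScore:
    """Composite score built from the sub-metrics listed above (illustrative weighting)."""
    data_quality: float = 0.0
    hypothesis_support: float = 0.0
    cross_validation: float = 0.0
    assumption_penalty: float = 0.0     # grows with each unverified assumption
    knowledge_gap_penalty: float = 0.0  # grows with each open question

    def composite(self) -> float:
        positive = (self.data_quality + self.hypothesis_support + self.cross_validation) / 3
        return max(0.0, positive - self.assumption_penalty - self.knowledge_gap_penalty)


def investigate(journal, reflect, execute, score_step,
                max_steps: int = 12, threshold: float = 0.85):
    """Run investigation steps until confidence converges or the step budget is exhausted.

    `reflect`, `execute`, and `score_step` are injected callables standing in for the
    investigator agent, the operations agent, and the scoring logic respectively.
    """
    score = ConfidenceScore()
    for _ in range(max_steps):
        plan = reflect(journal)         # 1. thinking and reflection
        journal.record(plan)            # 2. journaling
        results = execute(plan)         # 3. tool use (delegated to the operations agent)
        journal.integrate(results)      # 4. evidence integration
        score = score_step(journal)     # 5. confidence scoring
        if score.composite() >= threshold:
            return journal, score, "converged"
    # Couldn't converge: report what's missing instead of forcing an answer.
    return journal, score, "incomplete"
```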

Journal-Mediated Context Management

One of the most important technical innovations is using a structured journal for context management.

The journal isn’t just a log; it’s the agent’s working memory. It has defined sections:

  • Reasoning: Current analytical thinking
  • Hypotheses: Active theories with confidence levels and evidence links
  • Evidence: Discovered facts with source attribution
  • Patterns: Identified correlations and anomalies
  • Questions: Outstanding unknowns to investigate

This structure solves several problems:

Compression without loss. Instead of carrying raw conversation history, the agent maintains a curated summary of what matters. This keeps context windows manageable even for complex investigations.

Bias mitigation. One challenge with LLMs is premature convergence: they latch onto an early hypothesis and interpret all evidence through that lens. The journal structure forces explicit tracking of multiple hypotheses with independent evidence, reducing confirmation bias.

Orchestration. The journal also serves as a communication channel. When the investigator needs something from a sub-agent, it writes the request to the journal. An orchestrator picks up these requests and routes them appropriately.

Transparency. Everything the agent considers is recorded. When an investigation completes, you have a full audit trail of the reasoning process, not just a final answer.
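
For illustration, here’s roughly how such a journal could be modeled as a data structure. The schema below is a simplified sketch with placeholder field names, not the exact one we use, but it captures the idea: curated sections instead of raw conversation history.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    statement: str
    confidence: float = 0.0                  # updated as evidence arrives
    evidence_ids: list[str] = field(default_factory=list)

@dataclass
class Evidence:
    id: str
    fact: str
    source: str                              # query, document, or sub-agent it came from

@dataclass
class Journal:
    reasoning: list[str] = field(default_factory=list)       # current analytical thinking
    hypotheses: list[Hypothesis] = field(default_factory=list)
    evidence: list[Evidence] = field(default_factory=list)
    patterns: list[str] = field(default_factory=list)         # correlations and anomalies
    questions: list[str] = field(default_factory=list)        # outstanding unknowns
    requests: list[dict] = field(default_factory=list)        # picked up by the orchestrator

    def record(self, thought: str) -> None:
        self.reasoning.append(thought)

    def integrate(self, new_evidence: list[Evidence]) -> None:
        self.evidence.extend(new_evidence)

    def compress(self, keep_last: int = 5) -> None:
        """Curate rather than accumulate: old raw reasoning is dropped, evidence persists."""
        self.reasoning = self.reasoning[-keep_last:]
```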

Tiered Knowledge Architecture

Remember how I said dumping all documentation into RAG doesn’t work? Here’s what does.

We organize knowledge into three tiers based on trust and accessibility:

Tier 1: Core Reference Documents

This is the foundation: a carefully curated document (typically 20-40 pages) containing:

  • Domain fundamentals (how the system works)
  • Key metrics and their definitions
  • Common root causes and their signatures
  • Investigation templates and examples
  • Tribal knowledge made explicit

This document is created specifically for the AI agent, not repurposed from existing documentation. Yes, this requires upfront effort. But it’s a one-time investment that pays off across thousands of investigations.

The agent treats Tier 1 as ground truth. If there’s a conflict between Tier 1 and other sources, Tier 1 wins.

Tier 2: Extended Documentation

Wiki pages, SOPs, detailed procedures, the broader knowledge base. The agent can query this through a sub-agent when it needs specific details not covered in the core document.

This information is trusted but acknowledged as potentially outdated or incomplete.

Tier 3: General Search

Web search, broader knowledge retrieval. Used sparingly when the agent encounters something genuinely novel.

This tiered approach means the agent has reliable core knowledge always available, extended knowledge on demand, and fallback options for edge cases without drowning in contradictory information.
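
In code, the routing logic is almost trivial; what matters is the trust ordering. Here’s a minimal sketch, with hypothetical interfaces for the three sources:

```python
from enum import Enum

class Tier(Enum):
    CORE = 1        # curated reference document, treated as ground truth
    EXTENDED = 2    # wikis and SOPs: trusted but possibly stale
    SEARCH = 3      # general search, used sparingly for genuinely novel questions

def lookup(question: str, core_doc, wiki_agent, search_agent):
    """Route a knowledge request through the tiers, preferring higher-trust sources.

    `core_doc`, `wiki_agent`, and `search_agent` are stand-ins for whatever
    retrieval components you actually have; the point is the precedence.
    """
    answer = core_doc.find(question)
    if answer is not None:
        return answer, Tier.CORE        # Tier 1 always wins on conflict

    answer = wiki_agent.ask(question)
    if answer is not None:
        return answer, Tier.EXTENDED    # caller should treat as potentially outdated

    return search_agent.ask(question), Tier.SEARCH
```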

Hypothesis Evolution: Breadth-First With Pruning

Root cause analysis rarely yields a single clean answer. Real-world problems usually have multiple contributing factors with varying degrees of responsibility.

The agent uses a breadth-first search approach to hypothesis evolution:

  1. Initial hypothesis generation: Based on the problem description and early data exploration, generate candidate root causes
  2. Evidence gathering: For each hypothesis, identify what evidence would support or refute it
  3. Parallel evaluation: Gather evidence across hypotheses simultaneously
  4. Contribution scoring: Assign percentage contributions to supported hypotheses
  5. Tree pruning: Eliminate hypotheses that lack evidence support
  6. Leveling: For supported hypotheses, investigate deeper causes

This approach handles a subtle but important problem: when two root causes have similar contribution levels (say, 55% and 45%), the agent reports both rather than arbitrarily picking one. Different runs might surface them in different orders, but both are captured.

The leveling concept is crucial for meaningful analysis. Surface-level answers (“it was late because of weather”) aren’t actionable. You need to understand the chain of causation (“severe weather triggered protocol X, which reduced labor availability by Y%, which created backlog Z”).
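
Here’s a simplified sketch of one breadth-first pass. The evidence-gathering and deepening steps are injected as callables, since in practice they’re LLM-driven; the node structure and threshold are illustrative, not the real implementation.

```python
from dataclasses import dataclass, field

@dataclass
class CauseNode:
    statement: str
    evidence: list = field(default_factory=list)          # (fact, weight) pairs
    contribution: float = 0.0
    children: list["CauseNode"] = field(default_factory=list)

    def support(self) -> float:
        return sum(weight for _, weight in self.evidence)

def evolve_level(nodes, gather_evidence, propose_deeper, prune_below: float = 0.15):
    """One breadth-first pass: evaluate all hypotheses, prune weak ones, then go deeper."""
    for node in nodes:                                     # parallel evaluation across hypotheses
        node.evidence = gather_evidence(node)

    total = sum(n.support() for n in nodes) or 1.0         # contribution scoring
    for node in nodes:
        node.contribution = node.support() / total

    survivors = [n for n in nodes if n.contribution >= prune_below]   # tree pruning

    for node in survivors:                                 # leveling: investigate deeper causes
        node.children = propose_deeper(node)
    return survivors
```

Because contributions are normalized rather than forced into a single winner, two causes at 55% and 45% both survive the pruning step and both appear in the report.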

Separating Investigation From Tool Use

Here’s an architectural decision that might seem counterintuitive: the core investigator agent doesn’t know how to use tools.

The investigator is purely analytical. It thinks, reasons, forms hypotheses, and requests information. When it needs data, it writes a request to the journal. When it needs to query external knowledge, it formulates a question.

A separate companion agent (I call it the “operations agent”) handles all tool interactions:

  • Executing SQL queries
  • Retrieving documents
  • Calling APIs
  • Managing MCP integrations

Why this separation?

Cognitive budget. Every capability you give an LLM consumes some of its reasoning capacity. Tool-use instructions, function schemas, error handling logic – it all competes for attention with the actual analytical task. By offloading operations to a companion, the investigator can focus purely on investigation.

Specialization. The investigator can be optimized for analytical reasoning. The operations agent can be optimized for reliable tool execution. Different prompts, different validation, different error handling.

Flexibility. You can swap out the operations layer without touching the investigation logic. Need to add a new data source? Update the operations agent. The investigator doesn’t need to know or care.

This pattern, a “boring” but reliable specialist agent paired with a capable but generic operations agent, has proven remarkably effective.
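
A stripped-down sketch of the orchestration, reusing the journal idea from earlier: the investigator only thinks and writes requests; the operations agent (a hypothetical `execute` interface here) fulfills them and feeds the results back as evidence.

```python
def orchestrate_step(journal, investigator, operations_agent):
    """One orchestration cycle: analysis and tool execution never live in the same agent."""
    # Investigator: pure analysis, no tool schemas or function-calling logic in its prompt.
    # It returns requests like {"type": "sql", "body": "..."} or {"type": "doc", "query": "..."}.
    requests = investigator.think(journal)
    journal.requests.extend(requests)

    # Operations agent: executes SQL, retrieves documents, calls APIs and MCP integrations.
    for request in journal.requests:
        result = operations_agent.execute(request)
        journal.evidence.append(result)
    journal.requests.clear()
```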

Handling Data at Scale

A typical investigation might involve:

  • 7-10 data tables
  • 200-300 parameters
  • 100,000+ rows

You can’t put all of this in context. But you also can’t know in advance exactly which data will be relevant.

The solution is what I call “high-recall data retrieval”: fetch everything that might be relevant, then let the agent explore.

For a delivery investigation, this means pulling:

  • All metrics for the target facility and date
  • Data from adjacent facilities (for comparison)
  • Historical baselines
  • Related operational events

This creates a “data of interest” pool, maybe 500-800 parameters, far more than will actually be used. The agent then queries this pool iteratively, pulling what it needs as the investigation unfolds.

The key insight: it’s okay to over-fetch data, but it’s not okay to miss critical data. Better to give the agent access to information it doesn’t use than to have it reach wrong conclusions because relevant data wasn’t available.

For truly large-scale needs, a dedicated data sub-agent can handle dynamic data retrieval by responding to agent requests for additional context that wasn’t in the initial pool.
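
In rough Python, the high-recall retrieval step looks something like this. The `warehouse` interface and its methods are invented for illustration; the real version is integration code against whatever data infrastructure you already have.

```python
def build_data_of_interest(facility_id: str, date: str, warehouse):
    """Over-fetch deliberately: pull everything plausibly relevant so nothing critical is missed."""
    pool = {
        "target": warehouse.metrics(facility_id, date),              # all metrics for the case
        "neighbors": [warehouse.metrics(f, date)                     # adjacent facilities for comparison
                      for f in warehouse.adjacent_facilities(facility_id)],
        "baseline": warehouse.historical_baseline(facility_id, weeks=8),
        "events": warehouse.operational_events(facility_id, date),   # related operational events
    }
    # The agent queries this pool iteratively; unused data is cheap, missing data isn't.
    return pool
```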

Validation and Error Handling

LLMs make mistakes. In high-stakes investigations, you need systematic error catching.

We use a validator component that checks for obvious errors:

  • SQL queries that reference non-existent columns
  • Calculations that don’t match the underlying data
  • Claims that contradict explicit evidence
  • Logical inconsistencies

Importantly, the validator doesn’t interfere with reasoning. Early experiments with validators that questioned analytical judgments backfired because they broke the coherence of the investigation process. The validator’s job is catching mechanical errors, not second-guessing conclusions.
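
A minimal sketch of the kind of mechanical checks involved; the interfaces are simplified for illustration and the real component covers more cases, but the spirit is the same: flag objective errors, feed them back, and leave the reasoning alone.

```python
def validate(query_columns: set[str], schema_columns: set[str],
             claimed_total: float, row_values: list[float],
             tolerance: float = 1e-6) -> list[str]:
    """Mechanical checks only: never second-guess the analytical conclusions."""
    errors = []

    missing = query_columns - schema_columns
    if missing:                                    # SQL referencing non-existent columns
        errors.append(f"unknown columns: {sorted(missing)}")

    if abs(sum(row_values) - claimed_total) > tolerance:
        errors.append("claimed total does not match underlying rows")   # bad arithmetic

    return errors                                  # fed back to the agent for correction
```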

Common error patterns we’ve observed:

  • Using wrong column names (especially when names are similar)
  • Off-by-one errors in time ranges
  • Dialect mismatches in SQL
  • Forgetting constraints when queries get complex

Most of these self-correct through iteration: a failed query returns an error, and the agent adjusts. The validator catches the subtler issues that might not cause explicit failures but would corrupt results.

Results and Economics

For a well-configured investigation domain, here’s what we’ve achieved:

Speed: 10-15 minutes per investigation vs. 35-45 minutes for human analysts

Cost: $0.20-0.40 per investigation (primarily LLM inference costs)

Accuracy: 90-95% agreement with expert human analysis

Consistency: Dramatically reduced variance in how similar cases are analyzed

The accuracy metric deserves elaboration. We measure against historical human investigations, checking:

  • Root cause identification (did it find the same causes?)
  • Contribution ordering (did it rank them correctly?)
  • Supporting metrics (did it cite the right evidence?)
  • Actionable recommendations (are the suggested fixes appropriate?)

An important caveat: in some cases, the AI investigation found errors in the human ground truth. Expert review confirmed the AI was actually correct. This cuts both ways: you need to evaluate fairly and not assume the human investigations are always right.

Unconventional Learnings

Building these systems taught me several things that contradict conventional wisdom about LLM applications:

Complex prompts outperformed simple ones.

The standard advice is to keep prompts simple and focused. For investigation agents, the opposite proved true. Rich, detailed prompts with extensive context about the domain, the investigation process, and the expected reasoning patterns dramatically improved results.

This makes sense in retrospect. Investigation is inherently complex. Trying to simplify the prompt just pushed the complexity elsewhere: more iterations, more errors, more inconsistency.

Prompt optimization techniques didn’t help.

Automated prompt optimization (DSPy, similar frameworks) is designed for task-execution agents. For deep analytical work, the optimization targets don’t capture what matters. We got better results from manual prompt engineering informed by failure analysis.

Standard agent frameworks were limiting.

Most agent frameworks assume a task-execution paradigm: receive instruction, use tools, return result. Investigation requires something different: sustained reasoning, hypothesis management, evidence tracking. We ended up building custom orchestration rather than fighting framework assumptions.

The journal pattern was transformative.

If I had to pick one architectural decision that mattered most, it’s the structured journal. It solved context management, bias mitigation, orchestration, and transparency in one elegant pattern. I’d recommend exploring this approach for any complex analytical agent.

When This Approach Fits

Not every problem needs an investigation agent. This architecture shines when:

  • Investigations are repetitive: The same question structure applied to different objects (different shipments, different products, different accounts)
  • Scale matters: Thousands or millions of investigations per year
  • Domain is bounded: Well-defined problem space with knowable root causes
  • Data is structured: Queryable databases, not unstructured documents
  • Accuracy is critical: Results drive real decisions or actions
  • Consistency matters: Different cases should be analyzed the same way

If you’re doing ad-hoc research, exploratory analysis, or one-off investigations, simpler approaches (including just using Claude or ChatGPT directly) are probably better fits.

Getting Started

If you’re considering building an AI investigation agent, here’s my recommended sequence:

  1. Define your investigation clearly. What’s the question? What are the possible answers? What data is involved?
  2. Create your Tier 1 document. This is the hardest part. Extract and codify the domain knowledge that experts carry in their heads. Budget real time for this; it’s not a weekend project.
  3. Set up your data pipeline. Figure out how to assemble the “data of interest” for a given investigation object. This is usually integration work with existing data infrastructure.
  4. Build the journal structure. Define what sections you need, what gets tracked, how compression works.
  5. Implement the confidence scoring. Start simple. You can refine based on observed failure modes.
  6. Add validation. Identify common error patterns and catch them explicitly.
  7. Iterate on real cases. Run historical investigations, compare to human results, refine based on gaps.

The upfront investment is significant: probably 4-8 weeks to get a domain working well. But once configured, the system runs autonomously at scale, which is the whole point.

The Bigger Picture

What excites me about investigation agents isn’t just the efficiency gains (though 90% cost reduction is nothing to sneeze at). It’s what becomes possible when investigation is cheap.

Today, organizations investigate only the most important cases: the ones with the biggest dollar impact or the angriest customers. Thousands of smaller issues go uninvestigated, patterns go unnoticed, and systemic problems persist.

When you can investigate everything, you see everything. Patterns emerge from aggregate analysis. Root causes that span thousands of cases become visible. The organization develops genuine operational intelligence, not just sampled snapshots.

We’re still early in understanding what this enables. But I’m convinced that AI investigation agents represent a genuinely new capability, not just automation of existing work, but a qualitative expansion of what’s possible to know about complex operations.

The future belongs to organizations that can investigate at scale. The technology to build that future is available today.