Pipeline

The 4-stage diagnosis pipeline in detail.

Risicare's diagnosis pipeline processes errors through four stages.

Pipeline Overview

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Stage 1         │     │ Stage 2         │     │ Stage 3         │     │ Stage 4         │
│ Context         │ ──▶ │ Taxonomy        │ ──▶ │ Root Cause      │ ──▶ │ Fix             │
│ Extraction      │     │ Classification  │     │ Analysis        │     │ Suggestion      │
│ (~100ms)        │     │ (~1s)           │     │ (~2s)           │     │ (~500ms)        │
└─────────────────┘     └─────────────────┘     └─────────────────┘     └─────────────────┘

Stage 1: Context Extraction

Purpose

Gather all relevant information from the error trace.

Process

Identify error span
- Find span with status: error
- Extract error message and stack trace
Collect parent chain
- Walk up the span tree
- Include agent contexts
- Include phase contexts (think/decide/act)
Gather sibling spans
- Preceding spans (what happened before)
- Following spans (if any)
Extract content
- LLM prompts and completions
- Tool inputs and outputs
- Agent messages

Context Window Management

MAX_SPANS = 50
MAX_TOKENS = 100_000
 
# Priority order for inclusion:
# 1. Error span (always)
# 2. Direct parents (always)
# 3. LLM spans (high priority)
# 4. Tool spans with errors (high priority)
# 5. Recent sibling spans (medium priority)
# 6. Older context (low priority, summarized)

Output

{
  "error_span": { ... },
  "parent_spans": [ ... ],
  "context_spans": [ ... ],
  "llm_content": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "tool_io": [
    {"tool": "search", "input": "...", "output": "..."}
  ],
  "metadata": {
    "agent": "researcher",
    "phase": "act",
    "iteration": 2
  }
}

Stage 2: Taxonomy Classification

Purpose

Map the error to a specific code in the error taxonomy (154 codes across 10 modules and 35+ categories).

Heuristic-First Classification

Before invoking an LLM, Risicare runs a pattern matcher with 381 regex and keyword rules against the error message, stack trace, and span attributes. Each rule maps to a specific error code and produces a confidence score.

If the pattern matcher produces a confidence >= 0.6, the classification is accepted immediately (typically under 10ms).
If confidence is < 0.6 or no pattern matches, the error is forwarded to the LLM classifier.

This heuristic-first approach handles the majority of common errors without an LLM call, significantly reducing latency and cost.

LLM Classifier (Fallback)

Model: gpt-4o-mini - Fast, cost-effective classification for errors that the pattern matcher cannot confidently classify.

Prompt Structure

System: You are an AI agent error classifier. Classify errors into this taxonomy:

[TAXONOMY DEFINITION - 10 modules, 35+ categories]

Rules:
- Choose the most specific code that fits
- Confidence must be 0.0-1.0
- Explain your reasoning briefly

User: [EXTRACTED CONTEXT]

Error: {error_message}
Stack trace: {stack_trace}
Agent: {agent_name}
Phase: {phase}

Output

{
  "module": "TOOL",
  "category": "EXECUTION",
  "subcategory": "TIMEOUT",
  "error_code": "TOOL.EXECUTION.TIMEOUT",
  "confidence": 0.92,
  "reasoning": "The error occurred during tool execution with a 30s timeout. The API call exceeded the configured timeout limit."
}

Confidence Thresholds

Confidence	Action
>= 0.8	High confidence, proceed directly to Stage 3
0.6 - 0.8	Medium confidence, proceed but flag for review
< 0.6	Low confidence, escalate to Stage 3 for deeper analysis with `gpt-4o`

Fix suggestion threshold

In Stage 4, suggested fixes require a minimum confidence of 0.5 to be included in the diagnosis output. Fixes below this threshold are discarded.

Stage 3: Root Cause Analysis

Purpose

Determine why the error actually occurred, not just what happened.

Model

gpt-4o - Deep reasoning capabilities

Analysis Framework

Proximate cause: What directly caused the error?
Contributing factors: What made it more likely?
Underlying issues: What systemic problems exist?
Prevention: How could this be avoided?

Prompt Structure

System: You are an expert debugger for AI agents. Analyze root causes deeply.

Consider:
- Was this a one-time issue or systemic?
- What assumptions failed?
- What dependencies failed?
- What could prevent recurrence?

User:
Classification: {error_code}
Context: {extracted_context}

Output

{
  "root_cause": {
    "summary": "External API timeout due to unexpectedly large payload",
    "proximate_cause": "API call exceeded 30s timeout",
    "contributing_factors": [
      "Payload size (2.5MB) was 25x larger than typical",
      "No retry logic configured",
      "No payload size validation before API call"
    ],
    "underlying_issues": [
      "Missing input validation for tool arguments",
      "No circuit breaker for external dependencies"
    ],
    "severity": "medium",
    "frequency_estimate": "likely_recurring"
  }
}

Stage 4: Fix Suggestion

Purpose

Recommend actionable fixes ranked by confidence.

Process

Knowledge base lookup
- Search for similar error patterns
- Find proven fixes for this error code
- Match by embedding similarity (threshold: 0.85)
Fix generation (if no match)
- Generate fix based on root cause
- Use fix templates for the error type
- Validate fix configuration
Ranking
- Score by historical success rate
- Score by root cause alignment
- Score by implementation complexity

Fix Types

Type	Description
`prompt`	Modify system prompt
`parameter`	Adjust LLM parameters
`tool`	Fix tool configuration
`retry`	Add retry logic
`fallback`	Add fallback strategy
`guard`	Add validation
`routing`	Change agent routing

Output

{
  "suggested_fixes": [
    {
      "fix_type": "retry",
      "confidence": 0.85,
      "description": "Add exponential backoff retry for timeout errors",
      "config": {
        "max_retries": 3,
        "initial_delay_ms": 1000,
        "exponential_base": 2.0,
        "max_delay_ms": 30000,
        "jitter": true,
        "retry_on": ["TOOL.EXECUTION.TIMEOUT"]
      },
      "evidence": "This fix resolved 78% of similar timeout errors"
    }
  ]
}

Pipeline Monitoring

Metrics

Metric	Description
`diagnosis_latency_ms`	Total pipeline time
`stage_1_latency_ms`	Context extraction time
`stage_2_latency_ms`	Classification time
`stage_3_latency_ms`	Root cause analysis time
`stage_4_latency_ms`	Fix suggestion time
`classification_confidence`	Stage 2 confidence
`knowledge_base_hit_rate`	Stage 4 cache hits

Error Handling

If any stage fails:

Log error with context
Return partial diagnosis
Mark diagnosis as incomplete
Queue for retry (max 3 attempts)

Next Steps

Error Taxonomy

All 154 error codes

Learn more

Self-Healing

Automatic fix generation

Learn more

PreviousRESOURCES NextHeal