Diagnose

Automatic error analysis using LLM-powered diagnosis.

Risicare automatically diagnoses errors in your AI agents using a 4-stage LLM-powered pipeline.

Beyond observability

Most AI observability platforms stop at showing you errors. Risicare goes further: it automatically classifies failures using a 154-code taxonomy and performs root cause analysis, so no manual investigation is required.

Overview

When an error occurs, the diagnosis engine:

  1. Extracts context from the trace (spans, messages, tool I/O)
  2. Classifies the error using the 10-module taxonomy
  3. Analyzes root cause with deep LLM reasoning
  4. Suggests fixes based on patterns and templates

The 4-Stage Pipeline

Error Detected (span with has_error=true)
           ↓
┌─────────────────────────────┐
│ Stage 1: Context Extraction │  Extract relevant spans,
│         (~100ms)            │  messages, and state
└─────────────────────────────┘
           ↓
┌─────────────────────────────┐
│ Stage 2: Classification     │  Classify using gpt-4o-mini
│         (~500ms)            │  Fast, cheap categorization
└─────────────────────────────┘
           ↓
┌─────────────────────────────┐
│ Stage 3: Root Cause         │  Deep analysis using gpt-4o
│         (~2-3s)             │  Thorough reasoning
└─────────────────────────────┘
           ↓
┌─────────────────────────────┐
│ Stage 4: Fix Suggestion     │  Generate fix configs
│         (~500ms)            │  Pattern match first
└─────────────────────────────┘
           ↓
    DiagnosisResult
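
As a rough latency budget, the per-stage estimates in the diagram can be added up. The following is a minimal sketch, not part of the Risicare SDK: the `Stage` dataclass, its field names, and `latency_range_ms` are hypothetical, and it assumes the stages run strictly one after another. The numbers are the approximate figures from the diagram.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    """One pipeline stage with its approximate latency range (hypothetical type)."""
    name: str
    min_ms: int
    max_ms: int
    model: Optional[str] = None  # LLM named in the diagram, if any


# Figures copied from the diagram; all are approximate.
PIPELINE = [
    Stage("Context Extraction", 100, 100),
    Stage("Classification", 500, 500, model="gpt-4o-mini"),
    Stage("Root Cause", 2000, 3000, model="gpt-4o"),
    Stage("Fix Suggestion", 500, 500),
]


def latency_range_ms(stages: List[Stage]) -> Tuple[int, int]:
    """Sum the lower and upper latency estimates across sequential stages."""
    return sum(s.min_ms for s in stages), sum(s.max_ms for s in stages)


low, high = latency_range_ms(PIPELINE)
print(f"Estimated end-to-end latency: ~{low / 1000:.1f}-{high / 1000:.1f}s")
# prints "Estimated end-to-end latency: ~3.1-4.1s"
```

Under these estimates an uncached diagnosis takes roughly 3 to 4 seconds, most of it spent in the root cause stage.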

Error Taxonomy

Risicare classifies errors into a hierarchical taxonomy:

Module          What It Covers                 Example Code
PERCEPTION      Input parsing, validation      PERCEPTION.PARSING.FORMAT_ERROR
REASONING       Logic errors, hallucinations   REASONING.LOGIC.CONTRADICTION
TOOL            Tool execution failures        TOOL.EXECUTION.TIMEOUT
MEMORY          State management issues        MEMORY.RETRIEVAL.NOT_FOUND
OUTPUT          Response formatting            OUTPUT.FORMAT.INVALID_JSON
COORDINATION    Workflow problems              COORDINATION.FLOW.DEADLOCK
COMMUNICATION   Inter-agent messages           COMMUNICATION.ROUTING.INVALID_TARGET
ORCHESTRATION   Agent lifecycle                ORCHESTRATION.LIFECYCLE.TIMEOUT
CONSENSUS       Multi-agent agreement          CONSENSUS.VOTING.NO_QUORUM
RESOURCES       Resource contention            RESOURCES.CONTENTION.LOCK_TIMEOUT

Error Code Format

Error codes follow the pattern: MODULE.CATEGORY.SPECIFIC_ERROR

Example: TOOL.EXECUTION.TIMEOUT means:

  • Module: TOOL (action errors)
  • Category: EXECUTION (runtime execution)
  • Specific: TIMEOUT (operation timed out)
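
Because the format is a fixed three-part dotted string, client code can split it directly. The helper below is a minimal sketch: `parse_error_code` and `ErrorCode` are hypothetical names, not Risicare SDK functions, and the module set is copied from the taxonomy table.

```python
from typing import NamedTuple

# The ten modules listed in the taxonomy table.
MODULES = frozenset({
    "PERCEPTION", "REASONING", "TOOL", "MEMORY", "OUTPUT",
    "COORDINATION", "COMMUNICATION", "ORCHESTRATION", "CONSENSUS", "RESOURCES",
})


class ErrorCode(NamedTuple):
    module: str
    category: str
    specific: str


def parse_error_code(code: str) -> ErrorCode:
    """Split MODULE.CATEGORY.SPECIFIC_ERROR into its three parts."""
    parts = code.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected MODULE.CATEGORY.SPECIFIC_ERROR, got {code!r}")
    if parts[0] not in MODULES:
        raise ValueError(f"unknown module {parts[0]!r}")
    return ErrorCode(*parts)


print(parse_error_code("TOOL.EXECUTION.TIMEOUT"))
# prints "ErrorCode(module='TOOL', category='EXECUTION', specific='TIMEOUT')"
```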

Diagnosis Result

Each diagnosis produces a structured result:

{
  "diagnosis_id": "diag-abc123",
  "trace_id": "trace-xyz789",
  "error_code": "TOOL.EXECUTION.TIMEOUT",
  "module": "TOOL",
  "category": "EXECUTION",
  "confidence": 0.92,
  "root_cause": "The weather API call timed out after 30 seconds due to network latency",
  "evidence": [
    "Span 'weather_api_call' duration: 30.2s",
    "Error message: 'Request timed out'",
    "Previous successful calls averaged 2.1s"
  ],
  "suggested_fixes": [
    {
      "type": "retry",
      "config": {"max_retries": 3, "initial_delay_ms": 1000, "exponential_base": 2.0},
      "confidence": 0.85
    },
    {
      "type": "timeout",
      "config": {"timeout_ms": 60000},
      "confidence": 0.72
    }
  ]
}
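
One way to consume this result is to pick the suggested fix with the highest confidence and, for a retry fix, expand its config into a delay schedule. The sketch below works on a trimmed copy of the JSON above. `best_fix` and `retry_delays_ms` are hypothetical helpers, and reading the config as `initial_delay_ms * exponential_base ** attempt` is an assumption about what those fields mean, not documented behavior.

```python
import json

# Trimmed copy of the example result above.
raw = """
{
  "error_code": "TOOL.EXECUTION.TIMEOUT",
  "confidence": 0.92,
  "suggested_fixes": [
    {"type": "retry",
     "config": {"max_retries": 3, "initial_delay_ms": 1000, "exponential_base": 2.0},
     "confidence": 0.85},
    {"type": "timeout", "config": {"timeout_ms": 60000}, "confidence": 0.72}
  ]
}
"""


def best_fix(result: dict) -> dict:
    """Return the suggested fix with the highest confidence score."""
    return max(result["suggested_fixes"], key=lambda fix: fix["confidence"])


def retry_delays_ms(config: dict) -> list:
    """Delay before each retry, assuming delay = initial * base ** attempt.

    This reading of the config fields is an assumption.
    """
    return [
        config["initial_delay_ms"] * config["exponential_base"] ** attempt
        for attempt in range(config["max_retries"])
    ]


fix = best_fix(json.loads(raw))
print(fix["type"])  # prints "retry"
if fix["type"] == "retry":
    print(retry_delays_ms(fix["config"]))  # prints "[1000.0, 2000.0, 4000.0]"
```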

Automatic vs Manual Diagnosis

Automatic

By default, Risicare automatically diagnoses errors when:

  • A span has has_error=true
  • The error rate exceeds a threshold
  • The error is new (not seen before)
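
This page does not say how the three conditions combine, so the sketch below only reports which of them hold for a given span rather than making the final decision. Everything in it is hypothetical: the function name, the `fingerprint` used to decide whether an error is new, and the 5% default threshold are illustrative assumptions, not Risicare settings.

```python
from typing import List, Set


def fired_triggers(
    has_error: bool,
    error_rate: float,
    fingerprint: str,
    seen_fingerprints: Set[str],
    rate_threshold: float = 0.05,  # hypothetical default, not a Risicare value
) -> List[str]:
    """List which of the three documented conditions hold for one span."""
    reasons = []
    if has_error:
        reasons.append("span_error")
    if error_rate > rate_threshold:
        reasons.append("error_rate")
    if has_error and fingerprint not in seen_fingerprints:
        reasons.append("new_error")
    return reasons


print(fired_triggers(True, 0.02, "TimeoutError@weather_api_call", set()))
# prints "['span_error', 'new_error']"
```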

Manual

Trigger diagnosis manually via the API:

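import asyncio

from risicare import diagnose


async def main() -> None:
    # diagnose() is awaited, so it must run inside an event loop
    result = await diagnose(
        trace_id="trace-xyz789",
        span_id="span-abc123",  # Optional: specific span
    )

    print(result.error_code)
    print(result.root_cause)


asyncio.run(main())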

Or via the dashboard: Traces → Select Trace → Diagnose.

Caching

Diagnoses for similar errors are cached for 24 hours:

  • Same error_code + similar stack trace = cache hit
  • Avoids redundant LLM calls
  • Cache can be invalidated manually
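
The caching rules above can be sketched as a keyed store with a time-to-live. This is an illustration under stated assumptions, not Risicare's implementation: `cache_key` and `DiagnosisCache` are hypothetical names, and treating two stack traces as "similar" when they match after masking line numbers and hex addresses is one possible interpretation.

```python
import hashlib
import re
import time

TTL_SECONDS = 24 * 60 * 60  # 24-hour cache lifetime from the text above


def cache_key(error_code: str, stack_trace: str) -> str:
    """Build a key that ignores line numbers and hex addresses (assumed notion of 'similar')."""
    normalized = re.sub(r"0x[0-9a-fA-F]+", "<addr>", stack_trace)
    normalized = re.sub(r"line \d+", "line <n>", normalized)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{error_code}:{digest[:16]}"


class DiagnosisCache:
    """In-memory TTL cache; a stand-in for whatever store the service uses."""

    def __init__(self, ttl=TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)

    def invalidate(self, key):
        """Manual invalidation, as mentioned in the list above."""
        self._entries.pop(key, None)
```

On a hit, the stored diagnosis is returned without another LLM call. On a miss or after expiry, the caller would run the diagnosis and `put` the new result.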

Next Steps