Diagnose

Automatic error analysis using LLM-powered diagnosis.

Risicare automatically diagnoses errors in your AI agents using a 4-stage LLM-powered pipeline.

Beyond observability

Most AI observability platforms stop at showing you errors. Risicare goes further: it automatically classifies failures using a 154-code taxonomy and performs root cause analysis, with no manual investigation required.

Overview

When an error occurs, the diagnosis engine:

Extracts context from the trace (spans, messages, tool I/O)
Classifies the error using the 10-module taxonomy
Analyzes root cause with deep LLM reasoning
Suggests fixes based on patterns and templates

Error Taxonomy

10 modules, 31 categories, 154 error codes

The 4-Stage Pipeline

Error Detected (span with has_error=true)
           ↓
┌─────────────────────────────┐
│ Stage 1: Context Extraction │  Extract relevant spans,
│         (~100ms)            │  messages, and state
└─────────────────────────────┘
           ↓
┌─────────────────────────────┐
│ Stage 2: Classification     │  Classify using gpt-4o-mini
│         (~500ms)            │  Fast, cheap categorization
└─────────────────────────────┘
           ↓
┌─────────────────────────────┐
│ Stage 3: Root Cause         │  Deep analysis using gpt-4o
│         (~2-3s)             │  Thorough reasoning
└─────────────────────────────┘
           ↓
┌─────────────────────────────┐
│ Stage 4: Fix Suggestion     │  Generate fix configs
│         (~500ms)            │  Pattern match first
└─────────────────────────────┘
           ↓
    DiagnosisResult
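
The per-stage timings in the diagram can be added up to get a rough end-to-end estimate. The sketch below assumes the four stages run strictly one after another and that the approximate figures in the diagram are representative; the dictionary keys are illustrative names, not identifiers from Risicare.

```python
# Approximate per-stage latencies in milliseconds, taken from the diagram above.
# Stage 3 is given as a range (~2-3s), so every stage is stored as (low, high).
# The key names are illustrative only.
STAGE_LATENCY_MS = {
    "context_extraction": (100, 100),
    "classification": (500, 500),
    "root_cause": (2000, 3000),
    "fix_suggestion": (500, 500),
}


def total_latency_ms(stages):
    """Sum the low and high bounds, assuming the stages run sequentially."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = total_latency_ms(STAGE_LATENCY_MS)
print(f"Estimated end-to-end latency: {low / 1000:.1f}s to {high / 1000:.1f}s")
```

Under these assumptions a full diagnosis takes roughly 3 to 4 seconds, most of it in the root cause stage.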

Error Taxonomy

Risicare classifies errors into a hierarchical taxonomy:

Module	What It Covers	Example Codes
PERCEPTION	Input parsing, validation	`PERCEPTION.PARSING.FORMAT_ERROR`
REASONING	Logic errors, hallucinations	`REASONING.LOGIC.CONTRADICTION`
TOOL	Tool execution failures	`TOOL.EXECUTION.TIMEOUT`
MEMORY	State management issues	`MEMORY.RETRIEVAL.NOT_FOUND`
OUTPUT	Response formatting	`OUTPUT.FORMAT.INVALID_JSON`
COORDINATION	Workflow problems	`COORDINATION.FLOW.DEADLOCK`
COMMUNICATION	Inter-agent messages	`COMMUNICATION.ROUTING.INVALID_TARGET`
ORCHESTRATION	Agent lifecycle	`ORCHESTRATION.LIFECYCLE.TIMEOUT`
CONSENSUS	Multi-agent agreement	`CONSENSUS.VOTING.NO_QUORUM`
RESOURCES	Resource contention	`RESOURCES.CONTENTION.LOCK_TIMEOUT`

Error Code Format

Error codes follow the pattern: MODULE.CATEGORY.SPECIFIC_ERROR

Example: TOOL.EXECUTION.TIMEOUT means:

Module: TOOL (action errors)
Category: EXECUTION (runtime execution)
Specific: TIMEOUT (operation timed out)
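
Because the code is a three-part dotted string, it is easy to split on the client side. The helper below is a minimal sketch; `ErrorCode` and `parse_error_code` are hypothetical names used for illustration and are not part of any Risicare SDK.

```python
from typing import NamedTuple


class ErrorCode(NamedTuple):
    """The three parts of a MODULE.CATEGORY.SPECIFIC_ERROR code."""
    module: str
    category: str
    specific: str


def parse_error_code(code: str) -> ErrorCode:
    """Split a dotted error code and reject anything that is not three non-empty parts."""
    parts = code.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected MODULE.CATEGORY.SPECIFIC_ERROR, got {code!r}")
    return ErrorCode(*parts)


parsed = parse_error_code("TOOL.EXECUTION.TIMEOUT")
print(parsed.module, parsed.category, parsed.specific)
```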

Diagnosis Result

Each diagnosis produces a structured result:

{
  "diagnosis_id": "diag-abc123",
  "trace_id": "trace-xyz789",
  "error_code": "TOOL.EXECUTION.TIMEOUT",
  "module": "TOOL",
  "category": "EXECUTION",
  "confidence": 0.92,
  "root_cause": "The weather API call timed out after 30 seconds due to network latency",
  "evidence": [
    "Span 'weather_api_call' duration: 30.2s",
    "Error message: 'Request timed out'",
    "Previous successful calls averaged 2.1s"
  ],
  "suggested_fixes": [
    {
      "type": "retry",
      "config": {"max_retries": 3, "initial_delay_ms": 1000, "exponential_base": 2.0},
      "confidence": 0.85
    },
    {
      "type": "timeout",
      "config": {"timeout_ms": 60000},
      "confidence": 0.72
    }
  ]
}
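
A client might consume this result by picking the suggested fix with the highest confidence. The sketch below works on a trimmed copy of the JSON above. The helper names are hypothetical, and the way the retry config is turned into delays (`initial_delay_ms * exponential_base ** attempt`) is an assumed interpretation of those fields, not documented behavior.

```python
import json

# Trimmed copy of the example result above.
RESULT_JSON = """
{
  "error_code": "TOOL.EXECUTION.TIMEOUT",
  "confidence": 0.92,
  "suggested_fixes": [
    {"type": "retry",
     "config": {"max_retries": 3, "initial_delay_ms": 1000, "exponential_base": 2.0},
     "confidence": 0.85},
    {"type": "timeout", "config": {"timeout_ms": 60000}, "confidence": 0.72}
  ]
}
"""


def best_fix(result):
    """Return the suggested fix with the highest confidence, or None if there are none."""
    fixes = result.get("suggested_fixes", [])
    return max(fixes, key=lambda fix: fix["confidence"]) if fixes else None


def backoff_delays_ms(config):
    """Assumed interpretation: the delay before retry n is initial_delay_ms * exponential_base ** n."""
    return [
        config["initial_delay_ms"] * config["exponential_base"] ** attempt
        for attempt in range(config["max_retries"])
    ]


result = json.loads(RESULT_JSON)
fix = best_fix(result)
print(fix["type"])
if fix["type"] == "retry":
    print(backoff_delays_ms(fix["config"]))
```

With the example values, the retry fix is selected and the assumed schedule is 1, 2, and 4 seconds.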

Automatic vs Manual Diagnosis

Automatic

By default, Risicare automatically diagnoses errors when:

A span has has_error=true
The error rate exceeds a threshold
The error is new (not seen before)

Manual

Trigger diagnosis via the REST API:

curl -X POST "https://app.risicare.ai/api/v1/diagnoses" \
  -H "Authorization: Bearer rsk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"trace_id": "trace-xyz789", "span_id": "span-abc123"}'

Or from the dashboard: Traces → select a trace → click "Diagnose".
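
The REST call in this section can also be made from Python. The sketch below uses only the standard library and mirrors the curl example: same URL, headers, and body. It builds the request without sending it; the response format is not described on this page, so no response handling is shown. `build_diagnosis_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_URL = "https://app.risicare.ai/api/v1/diagnoses"


def build_diagnosis_request(api_key, trace_id, span_id):
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"trace_id": trace_id, "span_id": span_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_diagnosis_request("rsk-your-api-key", "trace-xyz789", "span-abc123")
print(request.get_method(), request.full_url)

# To actually send it (requires a valid API key and network access):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```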

Caching

Diagnoses for similar errors are cached for 24 hours:

Same error_code + similar stack trace = cache hit
Avoids redundant LLM calls
Cache can be invalidated manually
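
The caching rules above can be illustrated with a small in-memory sketch. How Risicare decides that two stack traces are "similar" is not documented here, so the normalization below (masking numbers such as line numbers and addresses before hashing) is purely an assumption, and `cache_key` and `DiagnosisCache` are hypothetical names.

```python
import hashlib
import re
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as stated above


def cache_key(error_code, stack_trace):
    """Combine the error code with a hash of a normalized stack trace.

    Assumption: traces count as "similar" if they match once hex addresses
    and decimal numbers (for example line numbers) are masked.
    """
    normalized = re.sub(r"0x[0-9a-fA-F]+|\d+", "N", stack_trace)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{error_code}:{digest}"


class DiagnosisCache:
    """In-memory cache with a fixed time-to-live and manual invalidation."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)

    def invalidate(self, key):
        self._entries.pop(key, None)


cache = DiagnosisCache()
k1 = cache_key("TOOL.EXECUTION.TIMEOUT", 'File "agent.py", line 42, in call_tool')
k2 = cache_key("TOOL.EXECUTION.TIMEOUT", 'File "agent.py", line 57, in call_tool')
cache.put(k1, {"diagnosis_id": "diag-abc123"})
print(cache.get(k2))  # same key after normalization, so this is a cache hit
```

A hit means the stored diagnosis is reused and the LLM stages are skipped; `invalidate` corresponds to the manual invalidation mentioned above.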

Next Steps

Full Taxonomy Reference

All 154 error codes explained

Self-Healing

Automatic fix generation
