Overview

How Risicare's diagnosis engine works.

Risicare automatically diagnoses agent failures using a 4-stage LLM-powered pipeline.

How It Works

When an error occurs in your agent, Risicare:

1. Detects the error from trace data
2. Extracts relevant context (spans, messages, tool I/O)
3. Classifies the error using the error taxonomy
4. Suggests potential fixes

Error Detected → Context Extraction → Classification → Fix Suggestion
    (auto)           (100ms)            (1-2s)          (500ms)

Diagnosis Pipeline

Stage 1: Context Extraction

Extract relevant information from the error trace:

Error span and parent spans
Recent LLM prompts and completions
Tool inputs and outputs
Agent state and messages
Surrounding context (before/after)

Max context: 50 spans, 100K tokens
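As a rough illustration of how the two limits could interact, the sketch below keeps the spans closest to the error until either the span budget or the token budget is used up. The `Span` fields and the "closest first" ordering are assumptions made for this example; the source only states the limits themselves.

```python
from dataclasses import dataclass

MAX_SPANS = 50         # limit stated above
MAX_TOKENS = 100_000   # limit stated above


@dataclass
class Span:
    span_id: str
    token_count: int   # hypothetical: pre-computed token estimate for the span
    distance: int      # hypothetical: hops from the error span (0 = error span)


def select_context(spans: list[Span]) -> list[Span]:
    """Keep the spans nearest the error while staying inside both budgets."""
    selected: list[Span] = []
    tokens_used = 0
    for span in sorted(spans, key=lambda s: s.distance):
        if len(selected) >= MAX_SPANS:
            break  # span budget exhausted
        if tokens_used + span.token_count > MAX_TOKENS:
            continue  # this span would overflow the token budget; try smaller ones
        selected.append(span)
        tokens_used += span.token_count
    return selected
```

Skipping an oversized span rather than stopping outright lets smaller, more distant spans still fill the remaining token budget; a real implementation might prefer to truncate instead.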

Stage 2: Taxonomy Classification

Classify the error using the 10-module taxonomy (154 error codes across 35+ categories):

Module	Focus Area
PERCEPTION	Input processing
REASONING	Logic and inference
TOOL	Tool execution
MEMORY	State management
OUTPUT	Response generation
COORDINATION	Workflow control
COMMUNICATION	Inter-agent messages
ORCHESTRATION	Lifecycle management
CONSENSUS	Agreement protocols
RESOURCES	Shared resource access

Classification uses a heuristic-first approach: a pattern matcher with 381 rules attempts to classify the error before any LLM call. If the pattern matcher produces a classification with confidence >= 0.6, the LLM step is skipped. This keeps typical classification under 100ms and reduces LLM costs.

LLM fallback: gpt-4o-mini (used only when pattern matching confidence is below threshold)
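The heuristic-first flow can be sketched as follows. This is a minimal illustration, not Risicare's implementation: the two rules, the `EXAMPLE.LOW.CONFIDENCE` code, and the `llm_fallback` callable are hypothetical stand-ins, while the 0.6 threshold comes from the description above.

```python
import re
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.6  # threshold stated above

# Hypothetical rules: (pattern, error code, confidence).
# The real pattern matcher is described as having 381 rules.
RULES = [
    (re.compile(r"timed? ?out", re.IGNORECASE), "TOOL.EXECUTION.TIMEOUT", 0.9),
    (re.compile(r"error", re.IGNORECASE), "EXAMPLE.LOW.CONFIDENCE", 0.3),
]


def match_rules(message: str) -> tuple[Optional[str], float]:
    """Return the highest-confidence rule that matches, or (None, 0.0)."""
    best_code, best_conf = None, 0.0
    for pattern, code, conf in RULES:
        if conf > best_conf and pattern.search(message):
            best_code, best_conf = code, conf
    return best_code, best_conf


def classify(
    message: str,
    llm_fallback: Callable[[str], tuple[str, float]],
) -> tuple[str, float, str]:
    """Try the pattern matcher first; call the LLM only below the threshold."""
    code, conf = match_rules(message)
    if code is not None and conf >= CONFIDENCE_THRESHOLD:
        return code, conf, "pattern"  # LLM step skipped
    llm_code, llm_conf = llm_fallback(message)
    return llm_code, llm_conf, "llm"
```

Passing the fallback in as a callable keeps the sketch independent of any particular LLM client and makes the "skipped" path easy to verify with a stub.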

Stage 3: Root Cause Analysis

Deep analysis of why the error occurred:

Identify contributing factors
Trace causal chain
Distinguish symptoms from causes
Assess severity and impact

Model: gpt-4o (detailed reasoning)

Stage 4: Fix Suggestion

Recommend fixes based on the diagnosis:

Pattern match against knowledge base (cosine similarity threshold: 0.85)
Generate new fix if no match found
Rank fixes by confidence (minimum 0.5 to be included)
Estimate improvement probability
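The matching and ranking steps above could look roughly like this. The knowledge-base entry shape (`embedding`, `type`, `confidence`) is an assumption for illustration; only the 0.85 similarity threshold and the 0.5 minimum confidence come from the text.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # stated above
MIN_CONFIDENCE = 0.5         # stated above


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_fixes(query_embedding: list[float], knowledge_base: list[dict]) -> list[dict]:
    """Return known fixes similar enough to the diagnosis, best confidence first.

    An empty result signals that no known fix matched and a new one
    would need to be generated.
    """
    matches = [
        fix
        for fix in knowledge_base
        if cosine_similarity(query_embedding, fix["embedding"]) >= SIMILARITY_THRESHOLD
        and fix["confidence"] >= MIN_CONFIDENCE
    ]
    return sorted(matches, key=lambda fix: fix["confidence"], reverse=True)
```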

Diagnosis Output

{
  "diagnosis_id": "diag-abc123",
  "trace_id": "trace-xyz789",
  "error_code": "TOOL.EXECUTION.TIMEOUT",
  "module": "TOOL",
  "category": "EXECUTION",
  "subcategory": "TIMEOUT",
  "confidence": 0.92,
  "root_cause": {
    "summary": "External API timeout due to large payload",
    "factors": [
      "Payload size: 2.5MB exceeds typical 100KB",
      "No timeout configured on API call",
      "Single retry with no backoff"
    ],
    "evidence": ["span-123 shows 30s duration", "tool input was 2.5MB JSON"]
  },
  "suggested_fixes": [
    {
      "type": "retry",
      "confidence": 0.85,
      "description": "Add exponential backoff retry",
      "config": {
        "max_retries": 3,
        "initial_delay_ms": 1000,
        "exponential_base": 2.0,
        "max_delay_ms": 30000,
        "jitter": true,
        "retry_on": []
      }
    },
    {
      "type": "parameter",
      "confidence": 0.70,
      "description": "Increase timeout to 60s",
      "config": {
        "timeout_ms": 60000
      }
    }
  ]
}
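To make the retry `config` above concrete, the sketch below computes the delay schedule such a config implies. How Risicare actually applies `jitter` is not documented here; the "full jitter" strategy (a uniform draw between 0 and the capped delay) is an assumption, as is the `backoff_delays_ms` helper itself.

```python
import random
from typing import Optional


def backoff_delays_ms(config: dict, rng: Optional[random.Random] = None) -> list[float]:
    """Delay before each retry, in milliseconds, for a retry fix config."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(config["max_retries"]):
        # Exponential growth: initial_delay * base^attempt, capped at max_delay.
        delay = config["initial_delay_ms"] * config["exponential_base"] ** attempt
        delay = min(delay, config["max_delay_ms"])
        if config.get("jitter"):
            # Assumed strategy: "full jitter" picks a random delay up to the cap.
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays
```

With jitter disabled, the example config (3 retries, 1000 ms initial delay, base 2.0) yields delays of 1000, 2000, and 4000 ms; the 30000 ms cap only takes effect from the sixth retry onward.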

Triggering Diagnosis

Automatic

Diagnosis runs automatically when:

Span has status: error
Error rate exceeds threshold
Latency exceeds P99 baseline
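Read as a predicate, the trigger conditions might be expressed like this. The sketch assumes that any single condition is enough to start a diagnosis, and the `TraceStats` fields and both threshold parameters are hypothetical, since the source does not give their values or names.

```python
from dataclasses import dataclass


@dataclass
class TraceStats:
    span_status: str    # e.g. "ok" or "error"
    error_rate: float   # hypothetical: fraction of failing requests, 0.0 to 1.0
    latency_ms: float   # hypothetical: observed latency for the trace


def should_diagnose(
    stats: TraceStats,
    error_rate_threshold: float,
    p99_baseline_ms: float,
) -> bool:
    """True if at least one automatic trigger condition holds (assumed OR semantics)."""
    return (
        stats.span_status == "error"
        or stats.error_rate > error_rate_threshold
        or stats.latency_ms > p99_baseline_ms
    )
```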

Manual

Trigger diagnosis via API:

curl -X POST "https://app.risicare.ai/v1/diagnoses" \
  -H "Authorization: Bearer rsk-..." \
  -H "Content-Type: application/json" \
  -d '{"trace_id": "trace-xyz789"}'
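The same request can be built from Python with only the standard library. This is a sketch under assumptions: the endpoint and bearer-token header mirror the curl call, while the `Content-Type` header and the idea that the response body is the diagnosis JSON are assumptions, and `build_diagnosis_request` is a hypothetical helper rather than part of any Risicare SDK.

```python
import json
import urllib.request

API_URL = "https://app.risicare.ai/v1/diagnoses"  # endpoint from the curl example


def build_diagnosis_request(trace_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks for a diagnosis."""
    body = json.dumps({"trace_id": trace_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumption: the API expects JSON
        },
    )


# Sending the request performs a real network call, so it is left commented out.
# The response is assumed to be the diagnosis JSON.
# with urllib.request.urlopen(build_diagnosis_request("trace-xyz789", "rsk-...")) as resp:
#     diagnosis = json.load(resp)
```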

Or from the dashboard:

1. View trace detail
2. Click "Diagnose"
3. View diagnosis results

Diagnosis Caching

Errors that share a cache key reuse a previously computed diagnosis:

Cache key: error_code + stack_trace_hash
Cache TTL: 24 hours
Cache hit rate: ~60% typical
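A minimal in-memory sketch of such a cache is shown below. The key layout follows the description above (error code plus a stack trace hash), but the choice of SHA-256, the `:` separator, and the `DiagnosisCache` class are assumptions for illustration; any normalization of the stack trace before hashing is not covered.

```python
import hashlib
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL stated above


def cache_key(error_code: str, stack_trace: str) -> str:
    """Combine the error code with a hash of the stack trace (SHA-256 assumed)."""
    trace_hash = hashlib.sha256(stack_trace.encode("utf-8")).hexdigest()
    return f"{error_code}:{trace_hash}"


class DiagnosisCache:
    """In-memory cache whose entries expire after CACHE_TTL_SECONDS."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable clock makes expiry testable
        self._entries = {}   # key -> (stored_at, diagnosis)

    def put(self, key: str, diagnosis: dict) -> None:
        self._entries[key] = (self._clock(), diagnosis)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None  # cache miss
        stored_at, diagnosis = entry
        if self._clock() - stored_at >= CACHE_TTL_SECONDS:
            del self._entries[key]  # expired: evict and report a miss
            return None
        return diagnosis
```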

Performance

Metric	Target
Detection to diagnosis	< 5s P50
Classification accuracy	> 90%
Fix suggestion relevance	> 80%
Cache hit rate	> 50%

Next Steps

Error Taxonomy: 154 error codes explained
Pipeline Details: Deep dive into each stage
