Overview

How Risicare's diagnosis engine works.

Risicare automatically diagnoses agent failures using a four-stage pipeline that combines rule-based pattern matching with LLM analysis.

How It Works

When an error occurs in your agent, Risicare:

  1. Detects the error from trace data
  2. Extracts relevant context (spans, messages, tool I/O)
  3. Classifies using the error taxonomy
  4. Suggests potential fixes

Error Detected → Context Extraction → Classification → Fix Suggestion
    (auto)           (100ms)            (1-2s)          (500ms)

Diagnosis Pipeline

Stage 1: Context Extraction

Extract relevant information from the error trace:

  • Error span and parent spans
  • Recent LLM prompts and completions
  • Tool inputs and outputs
  • Agent state and messages
  • Surrounding context (before/after)

Max context: 50 spans, 100K tokens
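
The limits above can be read as a budget applied while collecting spans. The following is a minimal sketch of that idea, not Risicare's implementation: the `Span` type, its `token_count` field, and the assumption that spans arrive ordered from most to least relevant are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

MAX_SPANS = 50        # span limit stated above
MAX_TOKENS = 100_000  # token limit stated above


@dataclass
class Span:
    span_id: str
    token_count: int  # hypothetical: tokens this span would add to the context


def select_context(spans: List[Span]) -> List[Span]:
    """Keep spans in order until the span or token budget is exhausted.

    Assumes `spans` is already ordered from most to least relevant
    (error span first, then parents, then surrounding spans).
    """
    selected: List[Span] = []
    tokens_used = 0
    for span in spans:
        if len(selected) >= MAX_SPANS:
            break
        if tokens_used + span.token_count > MAX_TOKENS:
            continue  # this span would overflow; a smaller later span may still fit
        selected.append(span)
        tokens_used += span.token_count
    return selected
```

Skipping an oversized span rather than stopping is a design choice in this sketch: it lets smaller, less relevant spans fill the remaining budget.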

Stage 2: Taxonomy Classification

Classify the error using the 10-module taxonomy (154 error codes across 35+ categories):

Module          Focus Area
PERCEPTION      Input processing
REASONING       Logic and inference
TOOL            Tool execution
MEMORY          State management
OUTPUT          Response generation
COORDINATION    Workflow control
COMMUNICATION   Inter-agent messages
ORCHESTRATION   Lifecycle management
CONSENSUS       Agreement protocols
RESOURCES       Shared resource access

Classification uses a heuristic-first approach: a pattern matcher with 381 rules attempts to classify the error before any LLM call. If the pattern matcher produces a classification with confidence >= 0.6, the LLM step is skipped. This keeps typical classification under 100ms and reduces LLM costs.

LLM fallback: gpt-4o-mini (used only when pattern matching confidence is below threshold)
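
The heuristic-first flow described above can be sketched as follows. This is an illustration under stated assumptions: the two rules, their confidence values, and the `llm_classify` callback are hypothetical stand-ins, and only the 0.6 threshold comes from the text.

```python
import re
from typing import Callable, NamedTuple, Optional

CONFIDENCE_THRESHOLD = 0.6  # threshold stated above


class Classification(NamedTuple):
    error_code: str
    confidence: float
    source: str  # "pattern" or "llm"


# Hypothetical rules: (regex, error code, confidence). The real rule set is not shown here.
RULES = [
    (re.compile(r"timed out|timeout", re.IGNORECASE), "TOOL.EXECUTION.TIMEOUT", 0.9),
    (re.compile(r"\bslow\b", re.IGNORECASE), "TOOL.EXECUTION.TIMEOUT", 0.4),
]


def match_patterns(message: str) -> Optional[Classification]:
    """Return the highest-confidence rule match, or None if no rule matches."""
    best: Optional[Classification] = None
    for pattern, code, confidence in RULES:
        if pattern.search(message) and (best is None or confidence > best.confidence):
            best = Classification(code, confidence, "pattern")
    return best


def classify(message: str, llm_classify: Callable[[str], Classification]) -> Classification:
    """Use the pattern result when it is confident enough; otherwise call the LLM."""
    result = match_patterns(message)
    if result is not None and result.confidence >= CONFIDENCE_THRESHOLD:
        return result  # LLM call skipped
    return llm_classify(message)
```

Passing the LLM step in as a callback keeps the sketch independent of any particular model client.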

Stage 3: Root Cause Analysis

Deep analysis of why the error occurred:

  • Identify contributing factors
  • Trace causal chain
  • Distinguish symptoms from causes
  • Assess severity and impact

Model: gpt-4o (detailed reasoning)

Stage 4: Fix Suggestion

Recommend fixes based on the diagnosis:

  • Pattern match against knowledge base (cosine similarity threshold: 0.85)
  • Generate new fix if no match found
  • Rank fixes by confidence (minimum 0.5 to be included)
  • Estimate improvement probability
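
A rough sketch of the matching and ranking steps above, assuming fixes and knowledge-base entries are plain dictionaries with `embedding`, `fix`, and `confidence` keys (hypothetical shapes) and that `generate_fix` stands in for the generation step. Only the 0.85 and 0.5 thresholds come from the text.

```python
import math
from typing import Callable, Dict, List, Sequence

SIMILARITY_THRESHOLD = 0.85  # knowledge-base match threshold stated above
MIN_CONFIDENCE = 0.5         # minimum confidence for a fix to be included


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_fixes(
    query_embedding: Sequence[float],
    knowledge_base: List[Dict],
    generate_fix: Callable[[], Dict],
) -> List[Dict]:
    """Return known fixes that match, or a generated one, ranked by confidence."""
    matches = [
        entry["fix"]
        for entry in knowledge_base
        if cosine_similarity(query_embedding, entry["embedding"]) >= SIMILARITY_THRESHOLD
    ]
    candidates = matches if matches else [generate_fix()]  # generate only when nothing matched
    kept = [fix for fix in candidates if fix["confidence"] >= MIN_CONFIDENCE]
    return sorted(kept, key=lambda fix: fix["confidence"], reverse=True)
```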

Diagnosis Output

{
  "diagnosis_id": "diag-abc123",
  "trace_id": "trace-xyz789",
  "error_code": "TOOL.EXECUTION.TIMEOUT",
  "module": "TOOL",
  "category": "EXECUTION",
  "subcategory": "TIMEOUT",
  "confidence": 0.92,
  "root_cause": {
    "summary": "External API timeout due to large payload",
    "factors": [
      "Payload size: 2.5MB exceeds typical 100KB",
      "No timeout configured on API call",
      "Single retry with no backoff"
    ],
    "evidence": ["span-123 shows 30s duration", "tool input was 2.5MB JSON"]
  },
  "suggested_fixes": [
    {
      "type": "retry",
      "confidence": 0.85,
      "description": "Add exponential backoff retry",
      "config": {
        "max_retries": 3,
        "initial_delay_ms": 1000,
        "exponential_base": 2.0,
        "max_delay_ms": 30000,
        "jitter": true,
        "retry_on": []
      }
    },
    {
      "type": "parameter",
      "confidence": 0.70,
      "description": "Increase timeout to 60s",
      "config": {
        "timeout_ms": 60000
      }
    }
  ]
}
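
To make the retry `config` in the sample output concrete, the sketch below computes the delay before each retry from those fields. The field names come from the sample; how Risicare applies them is not documented here, so the formula (exponential growth capped at `max_delay_ms`) and the jitter strategy (uniform between 0 and the capped delay) are assumptions.

```python
import random
from typing import List, Optional


def backoff_delays_ms(config: dict, rng: Optional[random.Random] = None) -> List[float]:
    """Delay in milliseconds before each retry, for a retry config like the one above."""
    delays = []
    for attempt in range(config["max_retries"]):
        delay = config["initial_delay_ms"] * config["exponential_base"] ** attempt
        delay = min(delay, config["max_delay_ms"])  # cap growth at the configured maximum
        if config.get("jitter"):
            rng = rng or random.Random()
            delay = rng.uniform(0, delay)  # assumed jitter: random point up to the capped delay
        delays.append(delay)
    return delays


config = {
    "max_retries": 3,
    "initial_delay_ms": 1000,
    "exponential_base": 2.0,
    "max_delay_ms": 30000,
    "jitter": False,  # disabled here so the schedule is deterministic
}
print(backoff_delays_ms(config))  # [1000.0, 2000.0, 4000.0]
```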

Triggering Diagnosis

Automatic

Diagnosis runs automatically when:

  • Span has status: error
  • Error rate exceeds threshold
  • Latency exceeds P99 baseline
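
The conditions above can be expressed as a simple check. This sketch assumes that any single condition is enough to trigger a diagnosis; the field names and the way thresholds and baselines are supplied are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TriggerInputs:
    span_status: str             # e.g. "ok" or "error"
    error_rate: float            # observed error rate, 0.0 to 1.0
    error_rate_threshold: float  # hypothetical configured threshold
    latency_ms: float            # observed latency
    p99_baseline_ms: float       # hypothetical P99 latency baseline


def diagnosis_triggers(inputs: TriggerInputs) -> List[str]:
    """Return the reasons a diagnosis should run; an empty list means no trigger."""
    reasons = []
    if inputs.span_status == "error":
        reasons.append("span_error")
    if inputs.error_rate > inputs.error_rate_threshold:
        reasons.append("error_rate")
    if inputs.latency_ms > inputs.p99_baseline_ms:
        reasons.append("latency")
    return reasons
```

Returning the list of reasons rather than a boolean makes it easy to record why a given diagnosis was started.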

Manual

Trigger diagnosis via API:

curl -X POST "https://app.risicare.ai/v1/diagnoses" \
  -H "Authorization: Bearer rsk-..." \
  -H "Content-Type: application/json" \
  -d '{"trace_id": "trace-xyz789"}'
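
The same request can be built from Python with only the standard library. This is a sketch based on the curl example: the endpoint, bearer token format, and request body come from that example, while the `Content-Type` header is an assumption that the API expects JSON, and the response format is not covered here.

```python
import json
import urllib.request

API_URL = "https://app.risicare.ai/v1/diagnoses"  # endpoint from the curl example


def build_diagnosis_request(trace_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks for a diagnosis of one trace."""
    body = json.dumps({"trace_id": trace_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumption: the API accepts a JSON body
        },
    )
```

To send it, pass the result to `urllib.request.urlopen(request)` and read the response body; error handling is left out of this sketch.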

Or from the dashboard:

  1. View trace detail
  2. Click "Diagnose"
  3. View diagnosis results

Diagnosis Caching

Errors that share a cache key reuse a cached diagnosis instead of running the pipeline again:

  • Cache key: error_code + stack_trace_hash
  • Cache TTL: 24 hours
  • Cache hit rate: ~60% typical
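
A minimal in-memory sketch of the caching scheme above. The key layout (`error_code` joined to a SHA-256 digest of the raw stack trace) is an assumption about how `stack_trace_hash` might be computed; a production system might normalize the stack trace first and would likely use a shared store rather than a dictionary.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL stated above


def cache_key(error_code: str, stack_trace: str) -> str:
    """Combine the error code with a hash of the stack trace (assumed: SHA-256)."""
    digest = hashlib.sha256(stack_trace.encode("utf-8")).hexdigest()
    return f"{error_code}:{digest}"


class DiagnosisCache:
    def __init__(self, ttl_seconds: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def put(self, key: str, diagnosis: dict) -> None:
        self._entries[key] = (self._clock(), diagnosis)

    def get(self, key: str) -> Optional[dict]:
        """Return the cached diagnosis, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, diagnosis = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop it so the next lookup misses quickly
            return None
        return diagnosis
```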

Performance

Metric                     Target
Detection to diagnosis     < 5s P50
Classification accuracy    > 90%
Fix suggestion relevance   > 80%
Cache hit rate             > 50%
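
The latency target is expressed as a P50, meaning the median detection-to-diagnosis time should stay under 5 seconds. A small sketch of that check, with a hypothetical helper name and made-up sample latencies used only for illustration:

```python
import statistics
from typing import Sequence

P50_TARGET_SECONDS = 5.0  # "Detection to diagnosis" target from the table above


def meets_p50_target(latencies_seconds: Sequence[float]) -> bool:
    """True when the median (P50) latency is below the target."""
    return statistics.median(latencies_seconds) < P50_TARGET_SECONDS


# Illustrative values only: one slow outlier does not break a P50 target.
print(meets_p50_target([1.2, 3.0, 4.1, 9.5, 2.2]))  # True (median is 3.0)
```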
