Overview
How Risicare's diagnosis engine works.
Risicare automatically diagnoses agent failures using a 4-stage LLM-powered pipeline.
How It Works
When an error occurs in your agent, Risicare:
- Detects the error from trace data
- Extracts relevant context (spans, messages, tool I/O)
- Classifies using the error taxonomy
- Suggests potential fixes
Error Detected (auto) → Context Extraction (100ms) → Classification (1-2s) → Fix Suggestion (500ms)
Diagnosis Pipeline
Stage 1: Context Extraction
Extract relevant information from the error trace:
- Error span and parent spans
- Recent LLM prompts and completions
- Tool inputs and outputs
- Agent state and messages
- Surrounding context (before/after)
Max context: 50 spans, 100K tokens
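The 50-span and 100K-token caps can be pictured as a simple budgeted selection. The sketch below is illustrative only: the `Span` fields (`distance`, `tokens`) and the closest-first ordering are assumptions, not Risicare's documented behavior.

```python
from dataclasses import dataclass

MAX_SPANS = 50        # span cap quoted in the text
MAX_TOKENS = 100_000  # token cap quoted in the text


@dataclass
class Span:
    span_id: str
    distance: int  # hypothetical: hops from the error span (0 = the error span)
    tokens: int    # hypothetical: pre-computed token count for this span


def select_context(spans, max_spans=MAX_SPANS, max_tokens=MAX_TOKENS):
    """Keep the spans closest to the error until either budget is exhausted."""
    selected, used = [], 0
    for span in sorted(spans, key=lambda s: s.distance):
        if len(selected) >= max_spans:
            break
        if used + span.tokens > max_tokens:
            continue  # skip a span that would overflow the token budget
        selected.append(span)
        used += span.tokens
    return selected
```

Ordering by distance means the error span and its nearest neighbors are kept first when a trace is too large to include in full.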
Stage 2: Taxonomy Classification
Classify the error using the 10-module taxonomy (154 error codes across 35+ categories):
| Module | Focus Area |
|---|---|
| PERCEPTION | Input processing |
| REASONING | Logic and inference |
| TOOL | Tool execution |
| MEMORY | State management |
| OUTPUT | Response generation |
| COORDINATION | Workflow control |
| COMMUNICATION | Inter-agent messages |
| ORCHESTRATION | Lifecycle management |
| CONSENSUS | Agreement protocols |
| RESOURCES | Shared resource access |
Classification uses a heuristic-first approach: a pattern matcher with 381 rules attempts to classify the error before any LLM call. If the pattern matcher produces a classification with confidence >= 0.6, the LLM step is skipped. This keeps typical classification under 100ms and reduces LLM costs.
LLM fallback: gpt-4o-mini (used only when pattern matching confidence is below threshold)
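The heuristic-first flow described above can be sketched as follows. The two rules, their confidence values, and the `llm_classify` callback are hypothetical stand-ins; only the 0.6 threshold and the `TOOL.EXECUTION.TIMEOUT` code come from this page.

```python
import re

CONFIDENCE_THRESHOLD = 0.6  # threshold quoted in the text

# Hypothetical rules: (pattern, error code, confidence).
RULES = [
    (re.compile(r"timed? ?out", re.IGNORECASE), "TOOL.EXECUTION.TIMEOUT", 0.9),
    (re.compile(r"slow", re.IGNORECASE), "TOOL.EXECUTION.TIMEOUT", 0.4),
]


def match_rules(message):
    """Return the highest-confidence rule hit as (code, confidence), or None."""
    hits = [(code, conf) for pattern, code, conf in RULES if pattern.search(message)]
    return max(hits, key=lambda hit: hit[1]) if hits else None


def classify(message, llm_classify):
    """Try the pattern matcher first; call the LLM only below the threshold."""
    hit = match_rules(message)
    if hit is not None and hit[1] >= CONFIDENCE_THRESHOLD:
        return hit[0], hit[1], "heuristic"
    # llm_classify is a caller-supplied function returning (code, confidence).
    code, conf = llm_classify(message)
    return code, conf, "llm"
```

A message such as "Request timed out after 30s" is resolved by the first rule without touching the LLM, while a weak match like "tool was slow" falls through to the fallback.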
Stage 3: Root Cause Analysis
Deep analysis of why the error occurred:
- Identify contributing factors
- Trace causal chain
- Distinguish symptoms from causes
- Assess severity and impact
Model: gpt-4o (detailed reasoning)
Stage 4: Fix Suggestion
Recommend fixes based on the diagnosis:
- Pattern match against knowledge base (cosine similarity threshold: 0.85)
- Generate new fix if no match found
- Rank fixes by confidence (minimum 0.5 to be included)
- Estimate improvement probability
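A minimal sketch of the match-then-rank logic, assuming the knowledge base is a list of `(embedding, fix)` pairs and that `generate_fix` is a caller-supplied function. Both are hypothetical; the 0.85 and 0.5 thresholds are the ones listed above.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # knowledge-base match threshold from the text
MIN_CONFIDENCE = 0.5         # minimum confidence for a fix to be included


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_fixes(query_vec, knowledge_base, generate_fix):
    """Return fixes ranked by confidence, generating one if nothing matches."""
    matches = [
        fix
        for vec, fix in knowledge_base
        if cosine_similarity(query_vec, vec) >= SIMILARITY_THRESHOLD
    ]
    candidates = matches or [generate_fix()]  # fall back to a generated fix
    kept = [fix for fix in candidates if fix["confidence"] >= MIN_CONFIDENCE]
    return sorted(kept, key=lambda fix: fix["confidence"], reverse=True)
```

Filtering on confidence after matching means a close knowledge-base hit with low confidence is still dropped from the results.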
Diagnosis Output
{
  "diagnosis_id": "diag-abc123",
  "trace_id": "trace-xyz789",
  "error_code": "TOOL.EXECUTION.TIMEOUT",
  "module": "TOOL",
  "category": "EXECUTION",
  "subcategory": "TIMEOUT",
  "confidence": 0.92,
  "root_cause": {
    "summary": "External API timeout due to large payload",
    "factors": [
      "Payload size: 2.5MB exceeds typical 100KB",
      "No timeout configured on API call",
      "Single retry with no backoff"
    ],
    "evidence": ["span-123 shows 30s duration", "tool input was 2.5MB JSON"]
  },
  "suggested_fixes": [
    {
      "type": "retry",
      "confidence": 0.85,
      "description": "Add exponential backoff retry",
      "config": {
        "max_retries": 3,
        "initial_delay_ms": 1000,
        "exponential_base": 2.0,
        "max_delay_ms": 30000,
        "jitter": true,
        "retry_on": []
      }
    },
    {
      "type": "parameter",
      "confidence": 0.70,
      "description": "Increase timeout to 60s",
      "config": {
        "timeout_ms": 60000
      }
    }
  ]
}
Triggering Diagnosis
Automatic
Diagnosis runs automatically when:
- Span has `status: error`
- Error rate exceeds threshold
- Latency exceeds P99 baseline
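The three conditions above can be read as a single predicate. This sketch assumes any one condition is sufficient to trigger a diagnosis and that the P99 baseline is computed from recent latency samples. Neither detail is stated on this page, and the function and parameter names are illustrative.

```python
import statistics


def p99(latencies_ms):
    """99th percentile of a sample (needs at least two data points)."""
    return statistics.quantiles(latencies_ms, n=100)[98]


def should_diagnose(span_status, error_rate, error_rate_threshold,
                    latency_ms, p99_baseline_ms):
    """Return True if any automatic trigger condition holds (assumed OR)."""
    return (
        span_status == "error"
        or error_rate > error_rate_threshold
        or latency_ms > p99_baseline_ms
    )
```

For example, a span with status `ok` and a normal error rate would still trigger a diagnosis if its latency exceeded the baseline returned by `p99`.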
Manual
Trigger diagnosis via API:
curl -X POST "https://app.risicare.ai/v1/diagnoses" \
  -H "Authorization: Bearer rsk-..." \
  -H "Content-Type: application/json" \
  -d '{"trace_id": "trace-xyz789"}'
Or from the dashboard:
- View trace detail
- Click "Diagnose"
- View diagnosis results
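For scripted use, the same API call can be made from Python with the standard library. This is a sketch under assumptions: the `Content-Type` header and the JSON response body are assumed rather than documented here, and the function names are illustrative.

```python
import json
import urllib.request

API_URL = "https://app.risicare.ai/v1/diagnoses"  # endpoint from the curl example


def build_diagnosis_request(trace_id, api_key):
    """Build the POST request without sending it."""
    body = json.dumps({"trace_id": trace_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumption: the API expects JSON
        },
    )


def trigger_diagnosis(trace_id, api_key, timeout=10):
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_diagnosis_request(trace_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload and headers easy to inspect before any network call is made.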
Diagnosis Caching
Errors that share a cache key reuse the cached diagnosis:
- Cache key: `error_code + stack_trace_hash`
- Cache TTL: 24 hours
- Cache hit rate: ~60% typical
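One way to picture the cache key and TTL is the in-memory sketch below. The SHA-256 hash, the `:` separator, and the `DiagnosisCache` class are assumptions made for illustration; the page specifies only the key components and the 24-hour TTL.

```python
import hashlib
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL from the text


def cache_key(error_code, stack_trace):
    """Combine the error code with a hash of the stack trace (hash choice assumed)."""
    trace_hash = hashlib.sha256(stack_trace.encode("utf-8")).hexdigest()
    return f"{error_code}:{trace_hash}"


class DiagnosisCache:
    """Tiny in-memory cache with per-entry expiry."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock  # injectable clock, useful for testing
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, diagnosis = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: evict and report a miss
            return None
        return diagnosis

    def put(self, key, diagnosis):
        self._entries[key] = (self._clock(), diagnosis)
```

Two errors with the same code and an identical stack trace map to the same key, so the second one is served from the cache until the entry expires.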
Performance
| Metric | Target |
|---|---|
| Detection to diagnosis | < 5s P50 |
| Classification accuracy | > 90% |
| Fix suggestion relevance | > 80% |
| Cache hit rate | > 50% |