Diagnose
Automatic error analysis using LLM-powered diagnosis.
Risicare automatically diagnoses errors in your AI agents using a 4-stage LLM-powered pipeline.
Beyond observability
Most AI observability platforms stop at showing you errors. Risicare goes further -- it automatically classifies failures using a 154-code taxonomy and performs root cause analysis. No manual investigation required.
Overview
When an error occurs, the diagnosis engine:
- Extracts context from the trace (spans, messages, tool I/O)
- Classifies the error using the 10-module taxonomy
- Analyzes root cause with deep LLM reasoning
- Suggests fixes based on patterns and templates
Error Taxonomy
10 modules, 35 categories, 154 error codes
Diagnosis Pipeline
How the 4-stage pipeline works
The 4-Stage Pipeline
Error Detected (span with has_error=true)
↓
┌─────────────────────────────┐
│ Stage 1: Context Extraction │ Extract relevant spans,
│ (~100ms)                    │ messages, and state
└─────────────────────────────┘
↓
┌─────────────────────────────┐
│ Stage 2: Classification     │ Classify using gpt-4o-mini
│ (~500ms)                    │ Fast, cheap categorization
└─────────────────────────────┘
↓
┌─────────────────────────────┐
│ Stage 3: Root Cause         │ Deep analysis using gpt-4o
│ (~2-3s)                     │ Thorough reasoning
└─────────────────────────────┘
↓
┌─────────────────────────────┐
│ Stage 4: Fix Suggestion     │ Generate fix configs
│ (~500ms)                    │ Pattern match first
└─────────────────────────────┘
↓
DiagnosisResult
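As a rough sketch of how the four stages chain together, the following uses stub functions in place of the real stages. The function names, the `PipelineState` container, and the stub return values are illustrative assumptions, not Risicare's internals; only the stage order and the approximate latencies come from the diagram above.

```python
from dataclasses import dataclass, field

# Approximate per-stage latency in milliseconds, taken from the diagram.
# Stage 3 is listed as ~2-3s, so every entry is stored as a (low, high) range.
STAGE_LATENCY_MS = {
    "context_extraction": (100, 100),
    "classification": (500, 500),
    "root_cause": (2000, 3000),
    "fix_suggestion": (500, 500),
}


@dataclass
class PipelineState:
    """Hypothetical container handed from one stage to the next."""
    trace_id: str
    context: dict = field(default_factory=dict)
    error_code: str = ""
    root_cause: str = ""
    suggested_fixes: list = field(default_factory=list)


def extract_context(state):
    # Stage 1 stub: a real implementation would read spans, messages, tool I/O.
    state.context = {"spans": [], "messages": [], "tool_io": []}
    return state


def classify(state):
    # Stage 2 stub: stands in for the fast classification model call.
    state.error_code = "TOOL.EXECUTION.TIMEOUT"
    return state


def analyze_root_cause(state):
    # Stage 3 stub: stands in for the deeper reasoning model call.
    state.root_cause = "placeholder root cause"
    return state


def suggest_fixes(state):
    # Stage 4 stub: stands in for pattern matching and fix generation.
    state.suggested_fixes = [{"type": "retry", "confidence": 0.85}]
    return state


def run_pipeline(trace_id):
    """Run the four stages in the order shown in the diagram."""
    state = PipelineState(trace_id=trace_id)
    for stage in (extract_context, classify, analyze_root_cause, suggest_fixes):
        state = stage(state)
    return state


def estimated_latency_ms():
    """Sum the per-stage estimates into an end-to-end (low, high) range."""
    low = sum(lo for lo, _ in STAGE_LATENCY_MS.values())
    high = sum(hi for _, hi in STAGE_LATENCY_MS.values())
    return low, high
```

Summing the figures from the diagram, a full diagnosis would take roughly 3.1 to 4.1 seconds end to end, with the root cause stage accounting for most of it.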
Error Taxonomy
Risicare classifies errors into a hierarchical taxonomy:
| Module | What It Covers | Example Codes |
|---|---|---|
| PERCEPTION | Input parsing, validation | PERCEPTION.PARSING.FORMAT_ERROR |
| REASONING | Logic errors, hallucinations | REASONING.LOGIC.CONTRADICTION |
| TOOL | Tool execution failures | TOOL.EXECUTION.TIMEOUT |
| MEMORY | State management issues | MEMORY.RETRIEVAL.NOT_FOUND |
| OUTPUT | Response formatting | OUTPUT.FORMAT.INVALID_JSON |
| COORDINATION | Workflow problems | COORDINATION.FLOW.DEADLOCK |
| COMMUNICATION | Inter-agent messages | COMMUNICATION.ROUTING.INVALID_TARGET |
| ORCHESTRATION | Agent lifecycle | ORCHESTRATION.LIFECYCLE.TIMEOUT |
| CONSENSUS | Multi-agent agreement | CONSENSUS.VOTING.NO_QUORUM |
| RESOURCES | Resource contention | RESOURCES.CONTENTION.LOCK_TIMEOUT |
Error Code Format
Error codes follow the pattern: MODULE.CATEGORY.SPECIFIC_ERROR
Example: TOOL.EXECUTION.TIMEOUT means:
- Module: TOOL (action errors)
- Category: EXECUTION (runtime execution)
- Specific: TIMEOUT (operation timed out)
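Because the format is a fixed three-part dotted string, client code can split a code into its parts without any SDK support. The helper below is a minimal sketch; `ErrorCode` and `parse_error_code` are hypothetical names and are not part of the Risicare SDK.

```python
from typing import NamedTuple


class ErrorCode(NamedTuple):
    module: str
    category: str
    specific: str


def parse_error_code(code: str) -> ErrorCode:
    """Split MODULE.CATEGORY.SPECIFIC_ERROR into its three parts."""
    parts = code.split(".")
    # Reject anything that is not exactly three non-empty segments.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected MODULE.CATEGORY.SPECIFIC_ERROR, got {code!r}")
    return ErrorCode(*parts)


parsed = parse_error_code("TOOL.EXECUTION.TIMEOUT")
print(parsed.module, parsed.category, parsed.specific)  # TOOL EXECUTION TIMEOUT
```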
Diagnosis Result
Each diagnosis produces a structured result:
{
  "diagnosis_id": "diag-abc123",
  "trace_id": "trace-xyz789",
  "error_code": "TOOL.EXECUTION.TIMEOUT",
  "module": "TOOL",
  "category": "EXECUTION",
  "confidence": 0.92,
  "root_cause": "The weather API call timed out after 30 seconds due to network latency",
  "evidence": [
    "Span 'weather_api_call' duration: 30.2s",
    "Error message: 'Request timed out'",
    "Previous successful calls averaged 2.1s"
  ],
  "suggested_fixes": [
    {
      "type": "retry",
      "config": {"max_retries": 3, "initial_delay_ms": 1000, "exponential_base": 2.0},
      "confidence": 0.85
    },
    {
      "type": "timeout",
      "config": {"timeout_ms": 60000},
      "confidence": 0.72
    }
  ]
}
Automatic vs Manual Diagnosis
Automatic
By default, Risicare automatically diagnoses errors when:
- A span has has_error=true
- The error rate exceeds a threshold
- The error is new (not seen before)
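The three triggers can be pictured as independent checks on an incoming error event. The sketch below is only an illustration: the `ErrorEvent` fields, the fingerprint idea, and the 5% threshold are assumptions, and the text does not say whether one condition is sufficient or whether they combine, so the function simply reports which conditions matched.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical threshold; the actual value is not documented here.
ERROR_RATE_THRESHOLD = 0.05


@dataclass
class ErrorEvent:
    """Illustrative shape of the data a trigger check might look at."""
    has_error: bool
    error_rate: float            # fraction of recent spans that failed
    fingerprint: Optional[str]   # identifier for this kind of error, if any


def trigger_reasons(
    event: ErrorEvent,
    seen_fingerprints: Set[str],
    threshold: float = ERROR_RATE_THRESHOLD,
) -> List[str]:
    """Return the names of the documented conditions that this event meets."""
    reasons = []
    if event.has_error:
        reasons.append("span_has_error")
    if event.error_rate > threshold:
        reasons.append("error_rate_exceeded")
    if event.fingerprint is not None and event.fingerprint not in seen_fingerprints:
        reasons.append("new_error")
    return reasons


event = ErrorEvent(has_error=True, error_rate=0.12, fingerprint="tool-timeout")
print(trigger_reasons(event, seen_fingerprints={"tool-timeout"}))
# ['span_has_error', 'error_rate_exceeded']
```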
Manual
Trigger diagnosis manually via the API:
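import asyncio
from risicare import diagnose

async def main():
    # diagnose() is awaited, so it must be called from inside a coroutine
    result = await diagnose(
        trace_id="trace-xyz789",
        span_id="span-abc123",  # Optional: specific span
    )
    print(result.error_code)
    print(result.root_cause)

asyncio.run(main())
Or via the dashboard: Traces → Select Trace → Diagnose.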
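Once a diagnosis is available, a caller will usually want to decide which suggested fix to act on. The sketch below works on the JSON form of a result shown under Diagnosis Result. The `best_fix` and `retry_delays_ms` helpers, the 0.8 confidence cutoff, and the reading of the retry config as `initial_delay_ms * exponential_base ** n` are assumptions made for illustration, not documented Risicare behavior.

```python
import json

# Trimmed copy of the example result from the Diagnosis Result section.
RESULT_JSON = """
{
  "error_code": "TOOL.EXECUTION.TIMEOUT",
  "confidence": 0.92,
  "suggested_fixes": [
    {"type": "retry",
     "config": {"max_retries": 3, "initial_delay_ms": 1000, "exponential_base": 2.0},
     "confidence": 0.85},
    {"type": "timeout",
     "config": {"timeout_ms": 60000},
     "confidence": 0.72}
  ]
}
"""


def best_fix(result, min_confidence=0.8):
    """Return the highest-confidence fix at or above min_confidence, else None."""
    candidates = [
        fix for fix in result.get("suggested_fixes", [])
        if fix["confidence"] >= min_confidence
    ]
    return max(candidates, key=lambda fix: fix["confidence"], default=None)


def retry_delays_ms(config):
    """Assumed schedule: the delay before retry n is initial * base ** n."""
    return [
        config["initial_delay_ms"] * config["exponential_base"] ** n
        for n in range(config["max_retries"])
    ]


result = json.loads(RESULT_JSON)
fix = best_fix(result)
print(fix["type"])                     # retry
print(retry_delays_ms(fix["config"]))  # [1000.0, 2000.0, 4000.0]
```

Under that reading, the example retry config would wait 1, 2, and then 4 seconds between attempts, about 7 seconds of added delay in the worst case.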
Caching
Diagnosis results for similar errors are cached for 24 hours:
- Same error_code + similar stack trace = cache hit
- Avoids redundant LLM calls
- Cache can be invalidated manually
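One way to picture this behavior is a keyed cache with a 24-hour time-to-live. The sketch below is illustrative only: how Risicare decides that two stack traces are "similar" is not described here, so the normalization step (masking line numbers and hex addresses) is an assumption, as are the `cache_key` and `DiagnosisCache` names.

```python
import hashlib
import re
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as stated above


def normalize_stack(stack):
    """Mask details that vary between otherwise similar traces (assumed rule)."""
    stack = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", stack)
    stack = re.sub(r"line \d+", "line N", stack)
    return stack.strip()


def cache_key(error_code, stack):
    """Combine the error code with a short hash of the normalized stack trace."""
    digest = hashlib.sha256(normalize_stack(stack).encode("utf-8")).hexdigest()
    return f"{error_code}:{digest[:16]}"


class DiagnosisCache:
    """Minimal in-memory TTL cache with manual invalidation."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            # Entry has expired; drop it and report a miss.
            del self._entries[key]
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)

    def invalidate(self, key):
        self._entries.pop(key, None)
```

A caller would check `get(cache_key(...))` before running the LLM stages and call `put` with the fresh result on a miss, which is what saves the redundant model calls.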