Heal
Automatic fix generation and deployment for AI agent failures.
Risicare's self-healing pipeline automatically detects errors, diagnoses root causes, and generates fixes. Fix deployment via A/B testing is available as an opt-in feature.
Beyond observability
No other platform offers automated error diagnosis with a 154-code taxonomy, fix generation across 7 fix types, and statistical A/B deployment. While competitors stop at showing you the error, Risicare diagnoses why it happened and generates a fix.
Overview
The healing pipeline follows the DoVer methodology (Diagnosis via Observation of Verification):
- Generate Hypotheses - Create testable hypotheses about fixes
- Validate Statistically - Test fixes with A/B testing
- Deploy Safely - Canary release with automatic rollback
Fix Types
Risicare can generate 7 types of fixes:
| Type | What It Does | Example |
|---|---|---|
| Prompt | Modify system prompt or add few-shot examples | Add clarifying instructions |
| Parameter | Adjust LLM parameters | Lower temperature, increase max_tokens |
| Tool | Fix tool configuration | Add timeout, fix validation |
| Retry | Add retry logic | Exponential backoff on transient errors |
| Fallback | Use alternative model/strategy | Fall back to gpt-4o-mini on timeout |
| Guard | Add input/output validation | JSON schema validation |
| Routing | Change agent delegation | Route to different specialist agent |
Fix Configuration
Fixes are JSON configurations, not code:
{
  "fix_id": "fix-abc123",
  "fix_type": "retry",
  "config": {
    "max_retries": 3,
    "initial_delay_ms": 1000,
    "exponential_base": 2.0,
    "max_delay_ms": 30000,
    "jitter": true,
    "retry_on": ["TimeoutError"]
  },
  "rollback_strategy": {
    "type": "immediate",
    "trigger": "error_rate > 0.1"
  }
}
No Code Injection
Fixes are declarative configurations applied by the SDK at runtime. Risicare never injects code into your system.
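To illustrate how a declarative retry fix such as the one above might be interpreted at runtime, the sketch below derives the delay before each retry from `initial_delay_ms`, `exponential_base`, and `max_delay_ms`. The helper name `backoff_delays_ms` and the jitter strategy (uniform between 0 and the capped delay) are assumptions made for this example; the SDK's actual behavior may differ.

```python
import random

def backoff_delays_ms(config, rng=random):
    """Return the delay (in ms) before each retry for a retry-fix config.

    Hypothetical helper: the field names come from the example config,
    but the jitter strategy used here is an assumption.
    """
    delays = []
    for attempt in range(config["max_retries"]):
        # Exponential growth, capped at max_delay_ms
        delay = config["initial_delay_ms"] * config["exponential_base"] ** attempt
        delay = min(delay, config["max_delay_ms"])
        if config.get("jitter"):
            # Assumed "full jitter": pick a random delay up to the cap
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays

config = {
    "max_retries": 3,
    "initial_delay_ms": 1000,
    "exponential_base": 2.0,
    "max_delay_ms": 30000,
    "jitter": False,  # disabled here so the schedule is deterministic
}
print(backoff_delays_ms(config))  # [1000.0, 2000.0, 4000.0]
```

With jitter enabled, each delay would fall somewhere between 0 and the values shown, which spreads retries from many clients over time.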
Hypothesis Testing
Before deployment, fixes are validated through hypothesis testing:
Generate Hypotheses
Diagnosis: TOOL.EXECUTION.TIMEOUT on weather_api
Hypothesis 1: Adding retry with backoff will reduce timeout errors
Prior probability: 0.75 (based on similar patterns)
Hypothesis 2: Increasing timeout to 60s will reduce errors
Prior probability: 0.60
Hypothesis 3: Adding fallback to cached data will maintain uptime
Prior probability: 0.55
Statistical Validation
Each hypothesis is tested with:
- Sample size calculation for statistical power (0.8)
- Two-proportion z-test for significance (p < 0.05)
- Bayesian updates to posterior probability
- O'Brien-Fleming boundaries for early stopping
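The sample size step can be sketched with the standard normal-approximation formula for comparing two proportions. This is a minimal illustration, not the service's documented method: the function name `samples_per_arm` and the example error rates are hypothetical, while the power (0.8) and significance level (0.05) match the values listed above.

```python
import math
from statistics import NormalDist

def samples_per_arm(p_baseline, p_treatment, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sided two-proportion z-test.

    Uses the textbook normal-approximation formula (an assumption here).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p_baseline + p_treatment) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(
            p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
        )
    ) ** 2
    return math.ceil(numerator / (p_baseline - p_treatment) ** 2)

# Illustrative rates: 12% baseline error rate, 2% expected with the fix
print(samples_per_arm(0.12, 0.02))
```

Smaller expected improvements require far more samples per arm, which is one reason early stopping boundaries are useful.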
Test Results:
Baseline error rate: 12.3%
Treatment error rate: 2.1%
Effect size (Cohen's h): 0.43
P-value: 0.0023 ✓
Decision: Hypothesis VALIDATED
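For reference, the two statistics reported above can be computed as in the sketch below. It assumes a pooled standard error for the z-test and the usual arcsine definition of Cohen's h; the function names and the request counts are hypothetical, since the sample sizes behind the example results are not shown.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(errors_a, n_a, errors_b, n_b):
    """Two-sided two-proportion z-test using the pooled standard error."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def cohens_h(p_a, p_b):
    """Cohen's h effect size: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b))

# Hypothetical counts: 1,000 requests per arm
z, p_value = two_proportion_z_test(errors_a=120, n_a=1000, errors_b=30, n_b=1000)
print(f"z={z:.2f}, p={p_value:.4g}, h={cohens_h(0.12, 0.03):.2f}")
```

A hypothesis would count as validated under the criteria above when `p_value` falls below 0.05.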
Deployment Pipeline
Fix Created
↓
┌─────────────────┐
│ Canary (5%) │ Minimum 100 samples
│ │ Monitor error rate
└─────────────────┘
↓ (if passing)
┌─────────────────┐
│ Ramp (25%) │ Statistical A/B test
│ │ O'Brien-Fleming boundaries
└─────────────────┘
↓ (if winning)
┌─────────────────┐
│ Ramp (50%) │ Continue testing
│ │
└─────────────────┘
↓ (if winning)
┌─────────────────┐
│ Graduate (100%) │ Hold for 24 hours
│ │ Mark as graduated
└─────────────────┘
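The stage progression in the diagram can be modeled as a small state machine. The sketch below is only an illustration: the `advance` helper is hypothetical, and treating a failed stage as an immediate return to 0% traffic is an assumption about how rollback interacts with the ramp.

```python
# Traffic percentages taken from the pipeline diagram
STAGE_PERCENTS = [5, 25, 50, 100]

def advance(current_percent, stage_passed):
    """Return the traffic percentage for the next stage.

    Hypothetical helper: a failed stage returns 0 (rolled back), and the
    final stage stays at 100 once reached.
    """
    if not stage_passed:
        return 0
    index = STAGE_PERCENTS.index(current_percent)
    return STAGE_PERCENTS[min(index + 1, len(STAGE_PERCENTS) - 1)]

# Walk a fix through the pipeline when every stage passes
percent = STAGE_PERCENTS[0]
history = [percent]
for passed in [True, True, True]:
    percent = advance(percent, passed)
    history.append(percent)
print(history)  # [5, 25, 50, 100]
```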
Automatic Rollback
Fixes are rolled back when any of the following occurs:
- Error rate increases >10% vs baseline
- P99 latency exceeds 2x baseline
- A manual rollback is triggered
Rollback latency target: under 500ms (Redis routing update)
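The rollback conditions can be expressed as a single predicate. The sketch below is an assumption-laden illustration: `should_rollback` and the metric field names are hypothetical, and the error-rate condition is interpreted as a relative increase of more than 10% over the baseline, which may not match the service's definition.

```python
def should_rollback(baseline, current, manual=False):
    """Return True if any rollback condition holds.

    Hypothetical check. Assumes the error-rate condition means a relative
    increase of more than 10% over the baseline error rate.
    """
    error_regressed = current["error_rate"] > baseline["error_rate"] * 1.10
    latency_regressed = current["p99_latency_ms"] > 2 * baseline["p99_latency_ms"]
    return manual or error_regressed or latency_regressed

baseline = {"error_rate": 0.05, "p99_latency_ms": 800}
healthy = {"error_rate": 0.05, "p99_latency_ms": 900}
slow = {"error_rate": 0.04, "p99_latency_ms": 1700}

print(should_rollback(baseline, healthy))  # False
print(should_rollback(baseline, slow))     # True (P99 above 2x baseline)
```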
Fix Runtime
The SDK includes a fix runtime that:
- Loads fixes from the API on startup
- Caches locally with periodic refresh
- Routes requests based on A/B assignment
- Applies fixes at LLM call time
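The routing step can be illustrated with deterministic bucketing. The sketch below hashes a fix ID and a session ID into one of 100 buckets so that the same session always receives the same variant. The `assign_variant` function and the choice of session ID as the assignment key are assumptions for this example, not the SDK's documented scheme.

```python
import hashlib

def assign_variant(fix_id, session_id, traffic_percent):
    """Deterministically assign a session to "treatment" or "control".

    Hypothetical: hashing into 100 buckets is one common way to keep
    A/B assignment stable across requests.
    """
    digest = hashlib.sha256(f"{fix_id}:{session_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < traffic_percent else "control"

# The same session always lands in the same arm at a given traffic level
first = assign_variant("fix-abc123", "session-42", traffic_percent=25)
second = assign_variant("fix-abc123", "session-42", traffic_percent=25)
print(first == second)  # True
```

Because buckets are fixed per session, raising `traffic_percent` from 25 to 50 only moves sessions from control to treatment, never the reverse.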
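# The fix runtime is enabled automatically once the SDK is initialized
import risicare
from openai import OpenAI  # assumption: an OpenAI-compatible client

risicare.init()
client = OpenAI()

# Fixes are applied automatically to LLM calls
response = client.chat.completions.create(...)
Knowledge Base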
Successful fixes are stored in a knowledge base:
- Error patterns as embeddings (pgvector)
- Fix templates with parameters
- Cross-customer learning (federated, no raw data)
- Similarity threshold: 0.85
When a new error occurs, the knowledge base is checked before a new fix is generated.
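A lookup of this kind can be sketched as a nearest-match search over stored embeddings. The example below assumes cosine similarity as the metric and uses toy three-dimensional vectors; `find_known_fix`, the in-memory list, and the fix template names are hypothetical stand-ins for the pgvector-backed store.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # threshold stated in the list above

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def find_known_fix(error_embedding, knowledge_base):
    """Return the fix template of the most similar stored pattern, or None."""
    best_fix, best_score = None, SIMILARITY_THRESHOLD
    for pattern_embedding, fix_template in knowledge_base:
        score = cosine_similarity(error_embedding, pattern_embedding)
        if score >= best_score:
            best_fix, best_score = fix_template, score
    return best_fix

# Toy knowledge base: (error pattern embedding, fix template name)
knowledge_base = [
    ([1.0, 0.0, 0.0], "retry-with-backoff"),
    ([0.0, 1.0, 0.0], "json-schema-guard"),
]
print(find_known_fix([0.9, 0.1, 0.0], knowledge_base))  # retry-with-backoff
print(find_known_fix([0.5, 0.5, 0.7], knowledge_base))  # None
```

A `None` result corresponds to the case where no stored pattern is similar enough and a new fix has to be generated.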