Heal
Automatic fix generation and deployment for AI agent failures.
Risicare's self-healing pipeline automatically generates, validates, and deploys fixes for diagnosed errors.
Automated healing
No other platform offers automated fix generation and deployment. While competitors require manual debugging, Risicare generates fixes, tests them via hypothesis validation, and deploys them through statistical A/B testing, all automatically.
Overview
The healing pipeline follows the DoVer methodology (Diagnosis via Observation of Verification):
- Generate Hypotheses - Create testable hypotheses about fixes
- Validate Statistically - Test fixes with A/B testing
- Deploy Safely - Canary release with automatic rollback
Fix Types
Risicare can generate 7 types of fixes:
| Type | What It Does | Example |
|---|---|---|
| Prompt | Modify system prompt or add few-shot examples | Add clarifying instructions |
| Parameter | Adjust LLM parameters | Lower temperature, increase max_tokens |
| Tool | Fix tool configuration | Add timeout, fix validation |
| Retry | Add retry logic | Exponential backoff on transient errors |
| Fallback | Use alternative model/strategy | Fall back to gpt-4o-mini on timeout |
| Guard | Add input/output validation | JSON schema validation |
| Routing | Change agent delegation | Route to different specialist agent |
Fix Configuration
Fixes are JSON configurations, not code:
    {
      "fix_id": "fix-abc123",
      "fix_type": "retry",
      "config": {
        "max_retries": 3,
        "initial_delay_ms": 1000,
        "exponential_base": 2.0,
        "max_delay_ms": 30000,
        "jitter": true,
        "retry_on": ["TimeoutError"]
      },
      "rollback_strategy": {
        "type": "immediate",
        "trigger": "error_rate > 0.1"
      }
    }

No Code Injection
Fixes are declarative configurations applied by the SDK at runtime. Risicare never injects code into your system.
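As a rough illustration of how a runtime could interpret the retry configuration above, the following sketch turns the `config` object into a backoff schedule. The function name `backoff_delays_ms` and the "full jitter" strategy are assumptions made for this example; the source only states that jitter is enabled and does not describe the SDK's internals.

```python
import random

def backoff_delays_ms(config, rng=None):
    """Return the wait before each retry, in milliseconds (hypothetical helper)."""
    delays = []
    for attempt in range(config["max_retries"]):
        # Exponential growth: initial_delay * base^attempt, capped at max_delay.
        delay = config["initial_delay_ms"] * config["exponential_base"] ** attempt
        delay = min(delay, config["max_delay_ms"])
        if config.get("jitter") and rng is not None:
            # Assumption: "full jitter" (uniform between 0 and the capped delay).
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays

retry_config = {
    "max_retries": 3,
    "initial_delay_ms": 1000,
    "exponential_base": 2.0,
    "max_delay_ms": 30000,
    "jitter": True,
    "retry_on": ["TimeoutError"],
}

# No random source is passed here, so jitter is skipped and the schedule is deterministic.
print(backoff_delays_ms(retry_config))  # [1000.0, 2000.0, 4000.0]

# Passing a random source applies jitter to each delay.
print(backoff_delays_ms(retry_config, rng=random.Random()))
```

Because the fix is plain data, the same schedule can be recomputed, inspected, or rolled back without touching application code.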
Hypothesis Testing
Before deployment, fixes are validated through hypothesis testing:
Generate Hypotheses
Diagnosis: TOOL.EXECUTION.TIMEOUT on weather_api
Hypothesis 1: Adding retry with backoff will reduce timeout errors
Prior probability: 0.75 (based on similar patterns)
Hypothesis 2: Increasing timeout to 60s will reduce errors
Prior probability: 0.60
Hypothesis 3: Adding fallback to cached data will maintain uptime
Prior probability: 0.55
Statistical Validation
Each hypothesis is tested with:
- Sample size calculation for statistical power (0.8)
- Two-proportion z-test for significance (p < 0.05)
- Bayesian updates to posterior probability
- O'Brien-Fleming boundaries for early stopping
Test Results:
Baseline error rate: 12.3%
Treatment error rate: 2.1%
Effect size (Cohen's h): 0.43
P-value: 0.0023 ✓
Decision: Hypothesis VALIDATED
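The statistical checks listed above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not Risicare's implementation: the function names are hypothetical, the sample-size formula is the usual normal approximation for two proportions, and the Bayesian step is plain Bayes' rule with likelihoods supplied by the caller. O'Brien-Fleming boundaries are not modeled here.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_baseline, p_treatment, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for detecting p_baseline vs p_treatment."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_baseline - p_treatment) ** 2)

def two_proportion_z_test(errors_a, n_a, errors_b, n_b):
    """Return (z, two-sided p-value) using the pooled standard error."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided tail area.
    return z, math.erfc(abs(z) / math.sqrt(2))

def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Posterior probability of a hypothesis after observing the evidence."""
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

# Hypothetical counts: 25 errors in 200 baseline calls, 4 errors in 200 treated calls.
z, p_value = two_proportion_z_test(25, 200, 4, 200)
print(f"z = {z:.2f}, significant at 0.05: {p_value < 0.05}")
print("per-arm sample size:", sample_size_per_arm(0.123, 0.021))
print("posterior:", bayes_update(0.75, likelihood_if_true=0.9, likelihood_if_false=0.3))
```

The likelihood values passed to `bayes_update` are placeholders; in practice they would have to come from whatever model links test outcomes to each hypothesis.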
Deployment Pipeline
Fix Created
        ↓
┌─────────────────┐
│ Canary (5%)     │  Minimum 100 samples
│                 │  Monitor error rate
└─────────────────┘
        ↓ (if passing)
┌─────────────────┐
│ Ramp (25%)      │  Statistical A/B test
│                 │  O'Brien-Fleming boundaries
└─────────────────┘
        ↓ (if winning)
┌─────────────────┐
│ Ramp (50%)      │  Continue testing
│                 │
└─────────────────┘
        ↓ (if winning)
┌─────────────────┐
│ Graduate (100%) │  Hold for 24 hours
│                 │  Mark as graduated
└─────────────────┘
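The stage progression in the diagram can be read as a small state machine. The sketch below is one possible encoding; `next_traffic_percent` and its arguments are hypothetical names, the `passing` flag stands in for the statistical checks at each stage, and the 24-hour hold at graduation is not modeled.

```python
# Traffic percentages and the canary minimum are taken from the pipeline diagram.
STAGE_PERCENTS = [5, 25, 50, 100]
CANARY_MIN_SAMPLES = 100

def next_traffic_percent(current_percent, samples_seen, passing):
    """Return the next traffic share for a fix, or 0 to signal a rollback."""
    if not passing:
        return 0
    if current_percent == STAGE_PERCENTS[0] and samples_seen < CANARY_MIN_SAMPLES:
        return current_percent  # the canary needs more data before ramping
    index = STAGE_PERCENTS.index(current_percent)
    # Advance one stage; a fix already at 100% stays there.
    return STAGE_PERCENTS[min(index + 1, len(STAGE_PERCENTS) - 1)]

print(next_traffic_percent(5, 40, passing=True))     # 5
print(next_traffic_percent(5, 150, passing=True))    # 25
print(next_traffic_percent(50, 900, passing=False))  # 0
```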
Automatic Rollback
Fixes are automatically rolled back if any of the following occurs:
- The error rate increases by more than 10% versus baseline
- P99 latency exceeds 2x baseline
- A manual rollback is triggered
Rollback latency target: under 500ms (Redis routing update)
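The rollback conditions can be expressed as a single predicate. In this sketch the function and parameter names are hypothetical, and the 10% threshold is interpreted as a relative increase over the baseline error rate. That interpretation is an assumption: the example fix configuration uses an absolute trigger (`error_rate > 0.1`), so the actual semantics may differ.

```python
def should_rollback(baseline_error_rate, current_error_rate,
                    baseline_p99_ms, current_p99_ms, manual=False):
    """Return True if any rollback condition holds (hypothetical helper)."""
    # Assumption: ">10%" means more than 10% above the baseline rate (relative).
    error_regressed = current_error_rate > baseline_error_rate * 1.10
    latency_regressed = current_p99_ms > 2 * baseline_p99_ms
    return manual or error_regressed or latency_regressed

print(should_rollback(0.123, 0.021, 800, 900))               # False
print(should_rollback(0.123, 0.140, 800, 900))               # True (error rate)
print(should_rollback(0.123, 0.021, 800, 1700))              # True (P99 latency)
print(should_rollback(0.123, 0.021, 800, 900, manual=True))  # True (manual)
```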
Fix Runtime
The SDK includes a fix runtime that:
- Loads fixes from the API on startup
- Caches locally with periodic refresh
- Routes requests based on A/B assignment
- Applies fixes at LLM call time
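The source does not say how A/B assignment is computed. One common approach, shown here purely as an assumption, is deterministic hash bucketing: the same session always lands in the same arm, and the share of treated sessions tracks the current traffic percentage. The names `assigned_to_treatment`, `fix_id`, and `session_id` are illustrative.

```python
import hashlib

def assigned_to_treatment(fix_id, session_id, traffic_percent):
    """Deterministically decide whether a session receives the fix."""
    digest = hashlib.sha256(f"{fix_id}:{session_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < traffic_percent * 100

# The assignment is stable across calls for the same fix and session.
first = assigned_to_treatment("fix-abc123", "session-42", 25)
second = assigned_to_treatment("fix-abc123", "session-42", 25)
print(first == second)  # True

# Across many sessions, roughly traffic_percent of them are treated.
treated = sum(assigned_to_treatment("fix-abc123", f"session-{i}", 25) for i in range(10_000))
print(f"treated share: {treated / 10_000:.2%}")
```

Hashing on the fix ID as well as the session ID keeps assignments for different fixes independent of each other.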
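    # The fix runtime is enabled automatically once the SDK is initialized
    import risicare
    from openai import OpenAI  # assumption: an OpenAI-style client; the original snippet leaves `client` undefined

    risicare.init()
    client = OpenAI()

    # Fixes are applied automatically to LLM calls
    response = client.chat.completions.create(...)

Knowledge Base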
Successful fixes are stored in a knowledge base:
- Error patterns as embeddings (pgvector)
- Fix templates with parameters
- Cross-customer learning (federated, no raw data)
- Similarity threshold: 0.85
When a new error occurs, the knowledge base is checked before a new fix is generated.
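A lookup of this kind can be sketched in memory as a nearest-neighbor search with a cutoff. The example assumes cosine similarity as the metric (the source names only the 0.85 threshold) and uses tiny hand-written vectors in place of real embeddings; in production the search would presumably run inside pgvector rather than in Python. All names are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # threshold stated in the source

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_known_fix(error_embedding, knowledge_base, threshold=SIMILARITY_THRESHOLD):
    """Return the ID of the most similar stored fix, or None if none clears the threshold."""
    best_id, best_score = None, threshold
    for fix_id, embedding in knowledge_base.items():
        score = cosine_similarity(error_embedding, embedding)
        if score >= best_score:
            best_id, best_score = fix_id, score
    return best_id

# Toy two-dimensional "embeddings" for illustration only.
knowledge_base = {"fix-retry": [1.0, 0.0], "fix-guard": [0.0, 1.0]}

print(find_known_fix([0.9, 0.1], knowledge_base))  # fix-retry
print(find_known_fix([1.0, 1.0], knowledge_base))  # None, so a new fix would be generated
```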