
Overview

How Risicare's self-healing pipeline works.

Risicare automatically detects errors, diagnoses root causes, and generates fixes. Fix deployment is available as an opt-in feature.

Self-Healing Pipeline

Self-healing pipeline: Error → Diagnosis → Fix Generation → Canary Deploy → A/B Testing → Graduate

Diagnosis → Hypothesis Generation → Validation → Deployment → Learning
             ↓                        ↓            ↓           ↓
         Generate fix ideas     Test each one   Canary → A/B  Store pattern

How It Works

1. Receive Diagnosis

When an error is diagnosed, the healing pipeline receives:

  • Error code (e.g., TOOL.EXECUTION.TIMEOUT)
  • Root cause analysis
  • Context from the error trace
  • Similar past errors (if any)

2. Generate Hypotheses

The pipeline generates testable hypotheses about what might fix the issue, each with a prior score:

Diagnosis: TOOL.EXECUTION.TIMEOUT on weather_api

Hypothesis 1: Retry with backoff (0.75 prior)
Hypothesis 2: Increase timeout (0.60 prior)
Hypothesis 3: Add fallback (0.55 prior)
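As a rough illustration of the example above, the sketch below models each hypothesis as a small record and orders the candidates by prior. The `Hypothesis` class and `rank_hypotheses` function are hypothetical names, not part of the Risicare SDK. Trying higher-prior hypotheses first is an assumption; the text does not state the order in which they are tested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A candidate fix and its prior score (hypothetical shape, not an SDK type)."""
    description: str
    prior: float


def rank_hypotheses(hypotheses):
    """Return the hypotheses ordered from highest to lowest prior."""
    return sorted(hypotheses, key=lambda h: h.prior, reverse=True)


# Candidates for the TOOL.EXECUTION.TIMEOUT diagnosis, using the priors from the example.
candidates = [
    Hypothesis("Increase timeout", 0.60),
    Hypothesis("Add fallback", 0.55),
    Hypothesis("Retry with backoff", 0.75),
]

ranked = rank_hypotheses(candidates)
for h in ranked:
    print(f"{h.prior:.2f}  {h.description}")
```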

3. Validate Statistically

Each hypothesis is tested:

  1. A/B Test: Split traffic between baseline and fix
  2. Measure: Error rate, latency, cost
  3. Analyze: Statistical significance (p < 0.05)
  4. Decide: Accept, reject, or continue testing
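The steps above can be sketched for the error-rate metric with a pooled two-proportion z-test. This is one common way to obtain a p-value for an A/B comparison, and it is an assumption here: the text does not say which statistical test Risicare uses. The function names and the `min_samples` threshold are hypothetical.

```python
import math


def two_proportion_p_value(errors_a, total_a, errors_b, total_b):
    """Two-sided p-value for the difference between two error rates (pooled z-test)."""
    p_a = errors_a / total_a
    p_b = errors_b / total_b
    pooled = (errors_a + errors_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 1.0  # both arms all-success or all-failure: no evidence of a difference
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def decide(errors_base, total_base, errors_fix, total_fix, alpha=0.05, min_samples=1000):
    """Return 'accept', 'reject', or 'continue' for one baseline-vs-fix comparison."""
    # Assumption: keep testing until both arms have a minimum number of requests.
    if min(total_base, total_fix) < min_samples:
        return "continue"
    p = two_proportion_p_value(errors_base, total_base, errors_fix, total_fix)
    if p >= alpha:
        return "continue"  # not significant yet
    fix_is_better = errors_fix / total_fix < errors_base / total_base
    return "accept" if fix_is_better else "reject"


print(decide(100, 1000, 50, 1000))  # fix halves the error rate
print(decide(100, 1000, 95, 1000))  # difference too small to call
```

A production test would also account for latency and cost, and for repeated looks at the data, which this sketch ignores.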

4. Deploy Safely

Deployment is opt-in

Fix deployment requires auto_fix_enabled to be turned on in your project settings. By default, the pipeline generates and validates fixes but does not deploy them automatically. You can review generated fixes in the dashboard before enabling auto-deployment.

Validated fixes are deployed progressively:

Canary (5%) → Ramp (25%) → Ramp (50%) → Graduate (100%)

A deployed fix is rolled back automatically if:

  • Error rate increases by more than 10%
  • Latency exceeds 2x baseline

A fix can also be rolled back manually.
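The staged rollout and the two metric-based rollback triggers can be sketched as follows. The function names are hypothetical, and the sketch assumes that ">10%" means a relative increase over the baseline error rate, which the text does not specify.

```python
# Traffic percentages for each stage, taken from the rollout sequence above.
ROLLOUT_STAGES = [5, 25, 50, 100]


def should_roll_back(baseline_error_rate, current_error_rate,
                     baseline_latency_ms, current_latency_ms):
    """Return True if either metric-based rollback trigger fires."""
    # Assumption: the 10% threshold is relative to the baseline error rate.
    error_regressed = current_error_rate > baseline_error_rate * 1.10
    latency_regressed = current_latency_ms > baseline_latency_ms * 2
    return error_regressed or latency_regressed


def next_stage(current_percent, rollback):
    """Return the next traffic percentage: 0 on rollback, otherwise the next stage."""
    if rollback:
        return 0
    idx = ROLLOUT_STAGES.index(current_percent)
    # Stay at 100% once the fix has graduated.
    return ROLLOUT_STAGES[min(idx + 1, len(ROLLOUT_STAGES) - 1)]


rollback = should_roll_back(0.10, 0.12, 200, 250)  # error rate up 20% relative
print(next_stage(5, rollback))
```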

5. Learn

Successful fixes become knowledge:

  • Store error pattern as embedding
  • Create fix template
  • Share across customers (federated)
  • Improve future suggestions

Fix Types

Risicare generates 7 types of fixes:

Type        What It Does
Prompt      Modify system prompt
Parameter   Adjust LLM settings
Tool        Fix tool configuration
Retry       Add retry with backoff
Fallback    Use alternative strategy
Guard       Add validation
Routing     Change agent delegation

No Code Injection

Declarative Fixes

Fixes are JSON configurations, not code. The Python SDK's Fix Runtime interprets them at runtime, so Risicare never injects code into your system. The JavaScript SDK does not yet include the Fix Runtime; with it, fixes must be applied manually.

Example fix:

{
  "fix_id": "fix-abc123",
  "fix_type": "retry",
  "config": {
    "max_retries": 3,
    "initial_delay_ms": 1000,
    "exponential_base": 2.0,
    "max_delay_ms": 30000,
    "jitter": true,
    "retry_on": []
  }
}
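To make the declarative idea concrete, here is a minimal sketch of how a runtime could interpret the retry configuration above. It is not the Fix Runtime's actual implementation. The helper names are hypothetical, the jitter strategy (a uniform draw up to the capped delay) is an assumption, and an empty `retry_on` list is assumed to mean "retry on any exception".

```python
import json
import random
import time


def backoff_delays_ms(config, rng=random.random):
    """Yield the delay before each retry, in milliseconds."""
    for attempt in range(config["max_retries"]):
        delay = config["initial_delay_ms"] * config["exponential_base"] ** attempt
        delay = min(delay, config["max_delay_ms"])
        if config.get("jitter"):
            # Assumption: "full jitter", a uniform draw between 0 and the capped delay.
            delay = rng() * delay
        yield delay


def call_with_retry(func, config, sleep=time.sleep):
    """Call func(); on an exception, wait and retry up to max_retries times."""
    delays = backoff_delays_ms(config)
    while True:
        try:
            return func()
        except Exception:  # assumption: empty retry_on means any exception is retried
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: re-raise the last error
            sleep(delay / 1000)


fix = json.loads("""
{
  "fix_id": "fix-abc123",
  "fix_type": "retry",
  "config": {
    "max_retries": 3, "initial_delay_ms": 1000, "exponential_base": 2.0,
    "max_delay_ms": 30000, "jitter": true, "retry_on": []
  }
}
""")

# Delays without jitter, to show the exponential schedule.
no_jitter = dict(fix["config"], jitter=False)
print(list(backoff_delays_ms(no_jitter)))
```

Because the fix is plain data, the application decides how to interpret it, and nothing from the JSON is ever executed as code.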

Confidence Levels

Fixes have confidence scores:

Confidence   Meaning
> 0.8        High - auto-deploy to canary (requires auto_fix_enabled)
0.6 - 0.8    Medium - require approval
< 0.6        Low - suggest only
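The thresholds above map to a small lookup. In this sketch, `classify_confidence` is a hypothetical helper, and scores of exactly 0.6 and 0.8 are treated as Medium, following the table's "0.6 - 0.8" range.

```python
def classify_confidence(confidence):
    """Map a confidence score in [0, 1] to a (level, action) pair."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > 0.8:
        return ("high", "auto-deploy to canary")
    if confidence >= 0.6:
        return ("medium", "require approval")
    return ("low", "suggest only")


for score in (0.92, 0.8, 0.6, 0.41):
    level, action = classify_confidence(score)
    print(f"{score:.2f} -> {level}: {action}")
```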

Dashboard

View healing activity:

  • Active Fixes: Currently deployed fixes
  • Testing: Fixes in A/B testing
  • Candidates: Suggested but not deployed
  • Graduated: Successfully deployed
  • Rolled Back: Failed fixes

Metrics

Metric            Description
Fix Rate          % of errors with deployed fixes
Success Rate      % of fixes that graduate
MTTR              Mean time to remediation
Error Reduction   % error reduction from fixes
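As an illustration of how these metrics relate to raw counts, the sketch below computes them from hypothetical inputs. The function, its parameters, and the choice of denominators (for example, measuring Success Rate against all generated fixes) are assumptions; the dashboard's exact definitions may differ.

```python
from statistics import mean


def percent(part, whole):
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def healing_metrics(errors_total, errors_with_deployed_fix,
                    fixes_total, fixes_graduated,
                    remediation_minutes,
                    error_rate_before, error_rate_after):
    """Compute the four dashboard metrics from raw counts (illustrative only)."""
    return {
        "fix_rate": percent(errors_with_deployed_fix, errors_total),
        "success_rate": percent(fixes_graduated, fixes_total),
        "mttr_minutes": mean(remediation_minutes) if remediation_minutes else 0.0,
        "error_reduction": percent(error_rate_before - error_rate_after,
                                   error_rate_before),
    }


metrics = healing_metrics(
    errors_total=200, errors_with_deployed_fix=50,
    fixes_total=40, fixes_graduated=30,
    remediation_minutes=[10, 20, 30],
    error_rate_before=0.10, error_rate_after=0.07,
)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}")
```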

Next Steps