Overview

How Risicare's self-healing pipeline works.

Risicare automatically generates, validates, and deploys fixes for diagnosed errors.

Self-Healing Pipeline

Diagnosis → Hypothesis Generation → Validation → Deployment → Learning
             ↓                        ↓            ↓           ↓
         Generate fix ideas     Test each one   Canary → A/B  Store pattern

How It Works

1. Receive Diagnosis

When an error is diagnosed, the healing pipeline receives:

  • Error code (e.g., TOOL.EXECUTION.TIMEOUT)
  • Root cause analysis
  • Context from the error trace
  • Similar past errors (if any)

2. Generate Hypotheses

The pipeline creates testable hypotheses about what might fix the issue, each with a prior probability of success:

Diagnosis: TOOL.EXECUTION.TIMEOUT on weather_api

Hypothesis 1: Retry with backoff (0.75 prior)
Hypothesis 2: Increase timeout (0.60 prior)
Hypothesis 3: Add fallback (0.55 prior)
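
As a rough illustration of how hypotheses like these could be ordered before testing, the sketch below sorts them by prior. The `Hypothesis` class and `rank_hypotheses` function are hypothetical names for this example, not part of the Risicare SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for one candidate fix (not an SDK type)."""
    description: str
    prior: float  # prior probability that this fix resolves the error


def rank_hypotheses(hypotheses):
    """Return hypotheses ordered from highest to lowest prior."""
    return sorted(hypotheses, key=lambda h: h.prior, reverse=True)


# The three hypotheses from the example above, in arbitrary order.
candidates = [
    Hypothesis("Increase timeout", 0.60),
    Hypothesis("Add fallback", 0.55),
    Hypothesis("Retry with backoff", 0.75),
]
ranked = rank_hypotheses(candidates)
```

Ordering by prior means the most promising candidate is validated first.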

3. Validate Statistically

Each hypothesis is tested:

  1. A/B Test: Split traffic between baseline and fix
  2. Measure: Error rate, latency, cost
  3. Analyze: Statistical significance (p < 0.05)
  4. Decide: Accept, reject, or continue testing
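
One common way to run the "Analyze" and "Decide" steps on error rates is a two-proportion z-test. The sketch below assumes that test and the p < 0.05 threshold from the list above; the statistical method Risicare actually uses is not specified here, and the function names are hypothetical.

```python
import math


def two_proportion_p_value(errors_a, total_a, errors_b, total_b):
    """Two-sided p-value for the difference between two error rates."""
    rate_a = errors_a / total_a
    rate_b = errors_b / total_b
    pooled = (errors_a + errors_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 1.0  # both arms identical (all errors or none): no evidence
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def decide(base_errors, base_total, fix_errors, fix_total, alpha=0.05):
    """Return 'accept', 'reject', or 'continue' for one hypothesis."""
    p_value = two_proportion_p_value(base_errors, base_total, fix_errors, fix_total)
    if p_value >= alpha:
        return "continue"  # not significant yet: keep collecting traffic
    fix_is_better = fix_errors / fix_total < base_errors / base_total
    return "accept" if fix_is_better else "reject"
```

For example, `decide(100, 1000, 50, 1000)` compares a 10% baseline error rate against 5% with the fix applied. Latency and cost would need their own tests; this sketch covers error rate only.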

4. Deploy Safely

Validated fixes are deployed progressively:

Canary (5%) → Ramp (25%) → Ramp (50%) → Graduate (100%)

A fix is rolled back if any of the following occurs:

  • Error rate increases by more than 10%
  • Latency exceeds 2x baseline
  • An operator intervenes manually
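
The rollout stages and rollback conditions can be sketched as follows. This is a minimal illustration, not the deployment engine: it assumes the 10% threshold is a relative increase over the baseline error rate, and all function names are hypothetical.

```python
# Traffic percentages from the rollout sequence above.
STAGES = [5, 25, 50, 100]


def should_roll_back(base_error_rate, error_rate,
                     base_latency_ms, latency_ms,
                     manual_rollback=False):
    """Return True if any rollback condition is met."""
    # Assumption: ">10%" means more than 10% above the baseline rate.
    error_regressed = error_rate > base_error_rate * 1.10
    latency_regressed = latency_ms > 2 * base_latency_ms
    return manual_rollback or error_regressed or latency_regressed


def next_traffic_percent(current_percent, rollback):
    """Advance one stage, stay at 100%, or drop to 0% on rollback."""
    if rollback:
        return 0
    index = STAGES.index(current_percent)
    return STAGES[min(index + 1, len(STAGES) - 1)]
```

A canary at 5% with healthy metrics would move to 25%; any stage with a triggered condition drops to 0%.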

5. Learn

Successful fixes become knowledge:

  • Store error pattern as embedding
  • Create fix template
  • Share across customers (federated)
  • Improve future suggestions
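
To illustrate how a stored embedding could be matched against a new error, the sketch below uses cosine similarity over toy three-dimensional vectors. The `PatternStore` class, the 0.9 similarity threshold, and the vectors are all assumptions made for this example; real embeddings would be much larger and the storage layer is not described here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class PatternStore:
    """Hypothetical in-memory store of (embedding, fix template) pairs."""

    def __init__(self):
        self._patterns = []

    def add(self, embedding, fix_template):
        self._patterns.append((embedding, fix_template))

    def best_match(self, embedding, min_similarity=0.9):
        """Return the closest fix template, or None if nothing is similar enough."""
        best_score, best_template = 0.0, None
        for stored, template in self._patterns:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_template = score, template
        return best_template if best_score >= min_similarity else None


store = PatternStore()
store.add([1.0, 0.0, 0.0], {"fix_type": "retry"})
store.add([0.0, 1.0, 0.0], {"fix_type": "fallback"})
```

A new error whose embedding is close to a stored pattern would reuse that pattern's fix template as a starting hypothesis.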

Fix Types

Risicare generates 7 types of fixes:

Type       What It Does
Prompt     Modify system prompt
Parameter  Adjust LLM settings
Tool       Fix tool configuration
Retry      Add retry with backoff
Fallback   Use alternative strategy
Guard      Add validation
Routing    Change agent delegation

No Code Injection

Declarative Fixes

Fixes are JSON configurations, not code. The SDK interprets these at runtime. Risicare never injects code into your system.

Example fix:

{
  "fix_id": "fix-abc123",
  "fix_type": "retry",
  "config": {
    "max_retries": 3,
    "initial_delay_ms": 1000,
    "exponential_base": 2.0,
    "max_delay_ms": 30000,
    "jitter": true,
    "retry_on": []
  }
}
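
To show what interpreting such a configuration might look like, the sketch below computes the delay before each retry from the example fix. How the SDK actually applies `jitter` is not documented here; this example assumes each delay is scaled by a uniform random fraction, and `backoff_delays_ms` is a hypothetical helper.

```python
import json
import random

FIX_JSON = """
{
  "fix_id": "fix-abc123",
  "fix_type": "retry",
  "config": {
    "max_retries": 3,
    "initial_delay_ms": 1000,
    "exponential_base": 2.0,
    "max_delay_ms": 30000,
    "jitter": true,
    "retry_on": []
  }
}
"""


def backoff_delays_ms(config, rng=random.random):
    """Return the delay in milliseconds before each retry attempt."""
    delays = []
    for attempt in range(config["max_retries"]):
        delay = config["initial_delay_ms"] * config["exponential_base"] ** attempt
        delay = min(delay, config["max_delay_ms"])  # cap the exponential growth
        if config.get("jitter"):
            # Assumption: jitter scales the delay by a random fraction in [0, 1).
            delay *= rng()
        delays.append(delay)
    return delays


config = json.loads(FIX_JSON)["config"]
```

With jitter disabled, this configuration yields delays of 1000, 2000, and 4000 ms. The fix stays plain data throughout: the interpreter reads values and never executes anything from the JSON.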

Confidence Levels

Each fix carries a confidence score that determines how it is handled:

Confidence  Meaning
> 0.8       High - auto-deploy to canary
0.6 - 0.8   Medium - require approval
< 0.6       Low - suggest only
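
The thresholds above map to actions as in this sketch. The function name and return strings are hypothetical; the boundaries follow the table, so scores of exactly 0.6 and 0.8 both fall in the medium band.

```python
def action_for_confidence(confidence):
    """Map a confidence score in [0, 1] to the action from the table."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > 0.8:
        return "auto_deploy_canary"  # High
    if confidence >= 0.6:
        return "require_approval"    # Medium
    return "suggest_only"            # Low
```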

Dashboard

View healing activity:

  • Active Fixes: Currently deployed fixes
  • Testing: Fixes in A/B testing
  • Candidates: Suggested but not deployed
  • Graduated: Fixes that completed rollout to 100% of traffic
  • Rolled Back: Failed fixes

Metrics

Metric           Description
Fix Rate         % of errors with deployed fixes
Success Rate     % of fixes that graduate
MTTR             Mean time to remediation
Error Reduction  % error reduction from fixes
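
As a rough guide to reading these numbers, the sketch below computes each metric from raw counts. The exact denominators are assumptions (for instance, Success Rate is taken here as graduated fixes over all deployed fixes), and `healing_metrics` is a hypothetical helper, not a dashboard API.

```python
def percent(part, whole):
    """Percentage of whole represented by part; 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def healing_metrics(errors_total, errors_with_fix,
                    fixes_deployed, fixes_graduated,
                    remediation_minutes,
                    error_rate_before, error_rate_after):
    """Compute the four dashboard metrics under the assumptions stated above."""
    mttr = (sum(remediation_minutes) / len(remediation_minutes)
            if remediation_minutes else 0.0)
    return {
        "fix_rate": percent(errors_with_fix, errors_total),
        "success_rate": percent(fixes_graduated, fixes_deployed),
        "mttr_minutes": mttr,
        "error_reduction": percent(error_rate_before - error_rate_after,
                                   error_rate_before),
    }
```

For example, 50 of 200 errors with a deployed fix gives a Fix Rate of 25%, and 15 graduated out of 20 deployed fixes gives a Success Rate of 75%.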
