
Heal

Automatic fix generation and deployment for AI agent failures.

Risicare's self-healing pipeline automatically generates, validates, and deploys fixes for diagnosed errors.

Automated healing

No other platform offers automated fix generation and deployment. While competitors require manual debugging, Risicare generates fixes, tests them via hypothesis validation, and deploys them through statistical A/B testing, all without manual intervention.

Overview

The healing pipeline follows the DoVer methodology (Diagnosis via Observation of Verification):

  1. Generate Hypotheses - Create testable hypotheses about fixes
  2. Validate Statistically - Test fixes with A/B testing
  3. Deploy Safely - Canary release with automatic rollback

Fix Types

Risicare can generate 7 types of fixes:

Type        What It Does                                   Example
Prompt      Modify system prompt or add few-shot examples  Add clarifying instructions
Parameter   Adjust LLM parameters                          Lower temperature, increase max_tokens
Tool        Fix tool configuration                         Add timeout, fix validation
Retry       Add retry logic                                Exponential backoff on transient errors
Fallback    Use alternative model/strategy                 Fall back to gpt-4o-mini on timeout
Guard       Add input/output validation                    JSON schema validation
Routing     Change agent delegation                        Route to different specialist agent

Fix Configuration

Fixes are JSON configurations, not code:

{
  "fix_id": "fix-abc123",
  "fix_type": "retry",
  "config": {
    "max_retries": 3,
    "initial_delay_ms": 1000,
    "exponential_base": 2.0,
    "max_delay_ms": 30000,
    "jitter": true,
    "retry_on": ["TimeoutError"]
  },
  "rollback_strategy": {
    "type": "immediate",
    "trigger": "error_rate > 0.1"
  }
}

No Code Injection

Fixes are declarative configurations applied by the SDK at runtime. Risicare never injects code into your system.
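
To illustrate how a declarative retry configuration could be interpreted at runtime, here is a minimal sketch. The helper `retry_delays` is hypothetical and not part of the Risicare SDK, and the "full jitter" strategy is an assumption; the field names are taken from the example configuration above.

```python
import random


def retry_delays(config, rng=random.random):
    """Return the delay in ms before each retry described by `config`.

    Hypothetical helper for illustration; not a Risicare SDK function.
    """
    delays = []
    for attempt in range(config["max_retries"]):
        # Grow exponentially from the initial delay, capped at max_delay_ms.
        delay = config["initial_delay_ms"] * config["exponential_base"] ** attempt
        delay = min(delay, config["max_delay_ms"])
        if config.get("jitter"):
            # Assumption: "full jitter", i.e. a uniform draw in [0, delay].
            delay = rng() * delay
        delays.append(delay)
    return delays


retry_config = {
    "max_retries": 3,
    "initial_delay_ms": 1000,
    "exponential_base": 2.0,
    "max_delay_ms": 30000,
    "jitter": False,  # disabled here so the result is deterministic
    "retry_on": ["TimeoutError"],
}

print(retry_delays(retry_config))  # [1000.0, 2000.0, 4000.0]
```

Because the configuration is plain data, the same interpreter can serve any retry fix without shipping new code to the application.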

Hypothesis Testing

Before deployment, fixes are validated through hypothesis testing:

Generate Hypotheses

Diagnosis: TOOL.EXECUTION.TIMEOUT on weather_api

Hypothesis 1: Adding retry with backoff will reduce timeout errors
  Prior probability: 0.75 (based on similar patterns)

Hypothesis 2: Increasing timeout to 60s will reduce errors
  Prior probability: 0.60

Hypothesis 3: Adding fallback to cached data will maintain uptime
  Prior probability: 0.55

Statistical Validation

Each hypothesis is tested with:

  • Sample size calculation for statistical power (0.8)
  • Two-proportion z-test for significance (p < 0.05)
  • Bayesian updates to posterior probability
  • O'Brien-Fleming boundaries for early stopping

Test Results:
  Baseline error rate: 12.3%
  Treatment error rate: 2.1%
  Effect size (Cohen's h): 0.38
  P-value: 0.0023 ✓

  Decision: Hypothesis VALIDATED
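
The core calculations behind such a report can be sketched with the Python standard library. This is an illustration under stated assumptions, not Risicare's implementation: the function names and the request counts are hypothetical, the sample-size formula uses the arcsine (Cohen's h) approximation, and the Bayesian update and O'Brien-Fleming early-stopping steps are omitted.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def two_proportion_z_test(errors_a, n_a, errors_b, n_b):
    """Two-sided two-proportion z-test with a pooled variance; returns (z, p)."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - _STD_NORMAL.cdf(abs(z)))
    return z, p_value


def cohens_h(p_a, p_b):
    """Effect size for the difference between two proportions."""
    return 2 * asin(sqrt(p_a)) - 2 * asin(sqrt(p_b))


def sample_size_per_arm(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate samples needed per arm for a two-sided test."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / cohens_h(p_a, p_b)) ** 2)


# Hypothetical counts: 60 errors in 500 baseline requests,
# 12 errors in 500 treatment requests.
z, p = two_proportion_z_test(60, 500, 12, 500)
print(f"z = {z:.2f}, p = {p:.2g}, significant = {p < 0.05}")
print(f"Cohen's h = {cohens_h(60 / 500, 12 / 500):.2f}")
print(f"samples per arm = {sample_size_per_arm(60 / 500, 12 / 500)}")
```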

Deployment Pipeline

Fix Created
     ↓
┌─────────────────┐
│ Canary (5%)     │  Minimum 100 samples
│                 │  Monitor error rate
└─────────────────┘
     ↓ (if passing)
┌─────────────────┐
│ Ramp (25%)      │  Statistical A/B test
│                 │  O'Brien-Fleming boundaries
└─────────────────┘
     ↓ (if winning)
┌─────────────────┐
│ Ramp (50%)      │  Continue testing
│                 │
└─────────────────┘
     ↓ (if winning)
┌─────────────────┐
│ Graduate (100%) │  Hold for 24 hours
│                 │  Mark as graduated
└─────────────────┘
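
The stage progression in the diagram can be sketched as a small state machine. The stage names, traffic shares, and the 100-sample canary minimum come from the diagram; the `advance` function and its arguments are hypothetical, and the 24-hour hold at graduation is not modeled.

```python
# (stage name, percent of traffic) in rollout order, per the diagram above.
STAGES = [("canary", 5), ("ramp_25", 25), ("ramp_50", 50), ("graduate", 100)]
MIN_CANARY_SAMPLES = 100


def advance(stage_index, samples, healthy):
    """Return the next stage index, or None to signal a rollback.

    Hypothetical sketch: `healthy` stands in for "passing" at the canary
    stage and "winning" at the ramp stages.
    """
    if not healthy:
        return None  # roll back
    if stage_index == 0 and samples < MIN_CANARY_SAMPLES:
        return stage_index  # keep collecting canary samples
    # Move forward one stage, staying at the final stage once reached.
    return min(stage_index + 1, len(STAGES) - 1)


stage = 0
for samples in (40, 120, 800, 2000):
    stage = advance(stage, samples, healthy=True)
    print(STAGES[stage])
```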

Automatic Rollback

Fixes are automatically rolled back if:

  • Error rate increases >10% vs baseline
  • P99 latency exceeds 2x baseline
  • Manual rollback triggered

Rollback latency target: under 500ms (Redis routing update)
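
A minimal sketch of the rollback conditions listed above. The function and parameter names are hypothetical, and the error-rate rule is read here as a relative increase of more than 10% over baseline, which is an assumption about the intended meaning.

```python
def should_rollback(baseline_error_rate, current_error_rate,
                    baseline_p99_ms, current_p99_ms, manual=False):
    """Return True if any rollback condition is met (hypothetical sketch)."""
    if manual:
        return True
    # Assumption: ">10%" means a relative increase over the baseline rate.
    if current_error_rate > baseline_error_rate * 1.10:
        return True
    # P99 latency more than twice the baseline.
    if current_p99_ms > 2 * baseline_p99_ms:
        return True
    return False


print(should_rollback(0.10, 0.12, 800, 900))   # True: error rate up 20%
print(should_rollback(0.10, 0.05, 800, 1700))  # True: P99 above 2x baseline
print(should_rollback(0.10, 0.09, 800, 900))   # False
```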

Fix Runtime

The SDK includes a fix runtime that:

  1. Loads fixes from the API on startup
  2. Caches locally with periodic refresh
  3. Routes requests based on A/B assignment
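  4. Applies fixes at LLM call time

# The fix runtime is enabled automatically once the SDK is initialized
import risicare
from openai import OpenAI  # assumption: the OpenAI Python client is in use

risicare.init()

client = OpenAI()

# Fixes are applied automatically to LLM calls
response = client.chat.completions.create(...)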
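
Routing based on A/B assignment is commonly done with deterministic hashing, so the same request key always lands in the same arm. The sketch below shows that general technique; it is not taken from the Risicare SDK, and `in_treatment` and its parameters are hypothetical.

```python
import hashlib


def in_treatment(fix_id, request_key, traffic_percent):
    """Deterministically assign a request key to the treatment arm.

    Hypothetical helper: hashes the fix and key into one of 100 buckets
    and compares the bucket against the current traffic share.
    """
    digest = hashlib.sha256(f"{fix_id}:{request_key}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < traffic_percent


# The same key always receives the same assignment at a given traffic share.
print(in_treatment("fix-abc123", "session-42", 5))
print(in_treatment("fix-abc123", "session-42", 100))  # True
```

Including the fix ID in the hash keeps assignments independent across fixes, so a session placed in one fix's canary is not systematically placed in every other canary.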

Knowledge Base

Successful fixes are stored in a knowledge base:

  • Error patterns as embeddings (pgvector)
  • Fix templates with parameters
  • Cross-customer learning (federated, no raw data)
  • Similarity threshold: 0.85

When a new error occurs, the knowledge base is checked before a new fix is generated.
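
The lookup can be pictured as a nearest-match search with a cutoff. The sketch below runs in memory and assumes the 0.85 threshold refers to cosine similarity; in production the search would presumably run inside pgvector. The function names and the toy two-dimensional embeddings are hypothetical.

```python
from math import sqrt

SIMILARITY_THRESHOLD = 0.85  # from the list above; cosine metric is assumed


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_known_fix(error_embedding, knowledge_base):
    """Return the best-matching fix template, or None if below threshold."""
    best_fix, best_score = None, SIMILARITY_THRESHOLD
    for stored_embedding, fix_template in knowledge_base:
        score = cosine_similarity(error_embedding, stored_embedding)
        if score >= best_score:
            best_fix, best_score = fix_template, score
    return best_fix


# Toy knowledge base: (embedding, fix template name).
kb = [([1.0, 0.0], "retry"), ([0.0, 1.0], "fallback")]
print(find_known_fix([0.9, 0.1], kb))  # retry
print(find_known_fix([1.0, 1.0], kb))  # None: no match above 0.85
```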
