
Hypothesis Testing

DoVer methodology for fix validation.

Risicare validates fixes using the DoVer (Diagnosis via Observation of Verification) methodology.

DoVer Methodology

DoVer treats debugging as hypothesis testing:

  1. Observe - Collect data about the error
  2. Hypothesize - Generate possible fixes
  3. Verify - Test each hypothesis statistically
  4. Diagnose - Confirm which fix works

Hypothesis Generation

From Diagnosis

When a diagnosis completes, generate hypotheses:

diagnosis = {
    "error_code": "TOOL.EXECUTION.TIMEOUT",
    "root_cause": "API timeout due to large payload",
    "factors": ["payload_size", "no_retry", "short_timeout"]
}
 
hypotheses = [
    {
        "type": "retry",
        "prior": 0.75,  # Based on similar cases
        "rationale": "Transient timeouts often succeed on retry"
    },
    {
        "type": "parameter",
        "prior": 0.60,
        "rationale": "Longer timeout may allow completion"
    },
    {
        "type": "fallback",
        "prior": 0.55,
        "rationale": "Fallback maintains availability"
    }
]

Prior Probability

Priors are calculated from:

  • Historical success rate for this error code
  • Similarity to past fixes
  • Root cause alignment
  • Implementation complexity
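
The source does not give the prior formula, but a simple weighted combination of the four signals above can be sketched as follows (weights, feature names, and example values are hypothetical):

```python
# Hypothetical prior-scoring sketch: combine the four signals above
# into a single prior. The weights and feature names are assumptions,
# not Risicare's actual formula.
def prior_probability(features: dict) -> float:
    weights = {
        "historical_success_rate": 0.40,  # past success for this error code
        "similarity_to_past_fixes": 0.30,
        "root_cause_alignment": 0.20,
        "simplicity": 0.10,  # 1 - implementation complexity
    }
    score = sum(weights[k] * features[k] for k in weights)
    return max(0.0, min(1.0, score))  # clamp to [0, 1]

# Illustrative feature values for the retry hypothesis
retry_features = {
    "historical_success_rate": 0.85,
    "similarity_to_past_fixes": 0.80,
    "root_cause_alignment": 0.60,
    "simplicity": 0.50,
}
print(round(prior_probability(retry_features), 2))  # 0.75
```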

Statistical Validation

Sample Size

Calculate required samples for statistical power:

Power = 0.80 (80% chance of detecting real effect)
α = 0.05 (5% false positive rate)
Expected effect = 50% error reduction
Baseline error rate = 10%

Required samples ≈ 400 per group
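
The figure above can be reproduced with the standard two-proportion sample-size formula (normal approximation). A pooled-variance variant, sketched below, gives roughly 435 per group, in the same ballpark as the ≈400 quoted; the exact n depends on which variant is used:

```python
from math import ceil, sqrt
from statistics import NormalDist

def samples_per_group(p1: float, p2: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n for a two-sided two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# Baseline 10% error rate; a 50% reduction means 5% in treatment.
print(samples_per_group(0.10, 0.05))  # 435
```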

A/B Test

Split traffic between control and treatment:

Control (50%): No fix applied
Treatment (50%): Fix applied

Measure:
- Error rate
- Latency (P50, P95, P99)
- Cost per request
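
A common way to implement the split is deterministic hash-based bucketing, so the same request is always assigned to the same group across retries. This is a sketch of that idea, not Risicare's actual assignment logic:

```python
import hashlib

def assign_group(request_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a request to control or treatment.
    Hashing keeps assignment stable if the same request is seen again."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

groups = [assign_group(f"req-{i}") for i in range(1000)]
print(groups.count("treatment"))  # close to 500 for a 50/50 split
```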

Analysis

Two-proportion z-test:

Baseline error rate: 12.3% (n=500)
Treatment error rate: 2.1% (n=500)

z = 5.89
p-value = 0.0000003
Effect size (Cohen's h) = 0.38

Result: Statistically significant improvement
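
The figures above can be checked with a pooled two-proportion z-test. A pooled-variance recomputation, sketched below, lands near z ≈ 6.2; the exact value depends on whether pooled or unpooled variance is used, so small differences from the quoted statistics are expected:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1: float, n1: int, p2: float, n2: int):
    """Two-sided z-test for a difference in proportions (pooled variance)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 12.3% baseline vs 2.1% treatment error rate, 500 samples each
z, p = two_proportion_z(0.123, 500, 0.021, 500)
print(round(z, 2), p)
```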

Bayesian Update

Update hypothesis probability:

Prior: P(H) = 0.75
Likelihood: P(data|H) = 0.95
Evidence: P(data) = 0.80

Posterior: P(H|data) = 0.75 * 0.95 / 0.80 ≈ 0.89
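
The same update can be written with the evidence term expanded via the law of total probability. Here P(data|¬H) = 0.35 is an assumed value chosen for illustration:

```python
prior = 0.75        # P(H): prior probability the fix works
likelihood = 0.95   # P(data | H): probability of seeing this data if it works
lik_not_h = 0.35    # P(data | not H): assumed for illustration

# Law of total probability gives the evidence term
evidence = likelihood * prior + lik_not_h * (1 - prior)
posterior = likelihood * prior / evidence  # Bayes' rule

print(round(evidence, 2), round(posterior, 2))  # 0.8 0.89
```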

Early Stopping

Use O'Brien-Fleming boundaries for sequential testing:

Check 1 (25% samples): Need z > 4.56 to stop
Check 2 (50% samples): Need z > 2.94 to stop
Check 3 (75% samples): Need z > 2.36 to stop
Check 4 (100% samples): Need z > 2.02 to stop

Benefits:

  • Stop early if fix clearly works
  • Stop early if fix clearly fails
  • Reduce exposure to bad fixes
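
A minimal sequential-check sketch, using boundary values like those in the table above as input (in practice the boundaries would come from an alpha-spending computation, not be hardcoded):

```python
# Group-sequential check: compare the interim z-statistic to the
# boundary for the current look; stop only if the boundary is crossed.
# Boundary values mirror the table above, for illustration.
BOUNDARIES = [4.56, 2.94, 2.36, 2.02]  # looks at 25/50/75/100% of samples

def sequential_decision(look: int, z: float) -> str:
    bound = BOUNDARIES[look - 1]
    if abs(z) > bound:
        return "stop: significant"
    if look == len(BOUNDARIES):
        return "stop: not significant"
    return "continue"

print(sequential_decision(1, 3.1))  # strong, but below 4.56 -> continue
print(sequential_decision(2, 3.1))  # crosses 2.94 -> stop early
```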

Experiment Lifecycle

Pending → Testing → Validated
                  → Rejected
                  → Inconclusive
                  → Failed

Hypothesis States

State         Description
pending       Hypothesis created, awaiting test
testing       Traffic being split, collecting samples
validated     Fix proven statistically effective
rejected      Fix proven not effective
inconclusive  Not enough data to decide
failed        Test encountered an error

Experiment Output

{
  "experiment_id": "exp-abc123",
  "hypothesis": "retry_with_backoff",
  "status": "validated",
  "results": {
    "control": {
      "samples": 512,
      "error_rate": 0.123,
      "latency_p50_ms": 234,
      "latency_p99_ms": 1890
    },
    "treatment": {
      "samples": 498,
      "error_rate": 0.021,
      "latency_p50_ms": 267,
      "latency_p99_ms": 2100
    },
    "statistics": {
      "z_score": 5.89,
      "p_value": 0.0000003,
      "effect_size": 0.38,
      "confidence_interval": [0.072, 0.132]
    }
  },
  "decision": "DEPLOY",
  "confidence": 0.89
}

Multiple Hypotheses

Test hypotheses in parallel or sequence:

Parallel Testing

  • Split traffic N ways
  • Requires more total traffic
  • Faster time to answer

Sequential Testing

  • Test one at a time
  • Stop when one succeeds
  • Lower traffic requirement

Recommendation

Risicare uses an adaptive approach:

  • High-prior hypotheses in parallel
  • Lower-prior hypotheses sequential
  • Stop when validated fix found
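
The adaptive strategy above can be sketched as a small scheduler. `run_experiment` is a stand-in for the real A/B test, and the 0.7 threshold is an assumed cutoff, not a documented Risicare parameter:

```python
# Hypothetical adaptive scheduler: high-prior hypotheses are tested as a
# batch, the rest one at a time, stopping at the first validated fix.
def adaptive_test(hypotheses, run_experiment, parallel_threshold=0.7):
    ranked = sorted(hypotheses, key=lambda h: h["prior"], reverse=True)
    batch = [h for h in ranked if h["prior"] >= parallel_threshold]
    rest = [h for h in ranked if h["prior"] < parallel_threshold]

    for h in batch:  # conceptually in parallel; sequential here for simplicity
        if run_experiment(h) == "validated":
            return h
    for h in rest:   # lower-prior hypotheses tested one at a time
        if run_experiment(h) == "validated":
            return h
    return None

hyps = [{"type": "retry", "prior": 0.75},
        {"type": "parameter", "prior": 0.60},
        {"type": "fallback", "prior": 0.55}]
winner = adaptive_test(
    hyps, lambda h: "validated" if h["type"] == "parameter" else "rejected")
print(winner["type"])  # parameter
```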
