
Hypothesis Testing

DoVer methodology for fix validation.

Risicare validates fixes using the DoVer (Diagnosis via Observation of Verification) methodology.

DoVer Methodology

DoVer treats debugging as hypothesis testing:

  1. Observe - Collect data about the error
  2. Hypothesize - Generate possible fixes
  3. Verify - Test each hypothesis statistically
  4. Diagnose - Confirm which fix works

Hypothesis Generation

From Diagnosis

When a diagnosis completes, generate hypotheses:

diagnosis = {
    "error_code": "TOOL.EXECUTION.TIMEOUT",
    "root_cause": "API timeout due to large payload",
    "factors": ["payload_size", "no_retry", "short_timeout"]
}
 
hypotheses = [
    {
        "type": "retry",
        "prior": 0.75,  # Based on similar cases
        "rationale": "Transient timeouts often succeed on retry"
    },
    {
        "type": "parameter",
        "prior": 0.60,
        "rationale": "Longer timeout may allow completion"
    },
    {
        "type": "fallback",
        "prior": 0.55,
        "rationale": "Fallback maintains availability"
    }
]

Prior Probability

Priors are calculated from:

  • Historical success rate for this error code
  • Similarity to past fixes
  • Root cause alignment
  • Implementation complexity
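
The source does not give the prior formula, but a simple weighted combination of the four signals above can be sketched as follows (weights, feature names, and example values are hypothetical):

```python
# Hypothetical prior-scoring sketch: combine the four signals above
# into a single prior. The weights and feature names are assumptions,
# not Risicare's actual formula.
def prior_probability(features: dict) -> float:
    weights = {
        "historical_success_rate": 0.40,  # past success for this error code
        "similarity_to_past_fixes": 0.30,
        "root_cause_alignment": 0.20,
        "simplicity": 0.10,  # 1 - implementation complexity
    }
    score = sum(weights[k] * features[k] for k in weights)
    return max(0.0, min(1.0, score))  # clamp to [0, 1]

# Illustrative feature values for the retry hypothesis
retry_features = {
    "historical_success_rate": 0.85,
    "similarity_to_past_fixes": 0.80,
    "root_cause_alignment": 0.60,
    "simplicity": 0.50,
}
print(round(prior_probability(retry_features), 2))  # 0.75
```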

Statistical Validation

Sample Size

Calculate required samples for statistical power:

Power = 0.80 (80% chance of detecting real effect)
α = 0.05 (5% false positive rate)
Expected effect = 50% error reduction
Baseline error rate = 10%

Required samples ≈ 400 per group
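
The figure above can be reproduced with the standard two-proportion sample-size formula (normal approximation). A pooled-variance variant, sketched below, gives roughly 435 per group, in the same ballpark as the ≈400 quoted; the exact n depends on which variant is used:

```python
from math import ceil, sqrt
from statistics import NormalDist

def samples_per_group(p1: float, p2: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n for a two-sided two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# Baseline 10% error rate; a 50% reduction means 5% in treatment.
print(samples_per_group(0.10, 0.05))  # 435
```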

A/B Test

Split traffic between control and treatment:

Control (50%): No fix applied
Treatment (50%): Fix applied

Measure:
- Error rate
- Latency (P50, P95, P99)
- Cost per request
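
A common way to implement the split is deterministic hash-based bucketing, so the same request is always assigned to the same group across retries. This is a sketch of that idea, not Risicare's actual assignment logic:

```python
import hashlib

def assign_group(request_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a request to control or treatment.
    Hashing keeps assignment stable if the same request is seen again."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

groups = [assign_group(f"req-{i}") for i in range(1000)]
print(groups.count("treatment"))  # close to 500 for a 50/50 split
```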

Analysis

Two-proportion z-test:

Baseline error rate: 12.3% (n=500)
Treatment error rate: 2.1% (n=500)

z = 5.89
p-value = 0.0000003
Effect size (Cohen's h) = 0.38

Result: Statistically significant improvement
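
The figures above can be checked with a pooled two-proportion z-test. A pooled-variance recomputation, sketched below, lands near z ≈ 6.2; the exact value depends on whether pooled or unpooled variance is used, so small differences from the quoted statistics are expected:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1: float, n1: int, p2: float, n2: int):
    """Two-sided z-test for a difference in proportions (pooled variance)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 12.3% baseline vs 2.1% treatment error rate, 500 samples each
z, p = two_proportion_z(0.123, 500, 0.021, 500)
print(round(z, 2), p)
```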

Bayesian Update

Update hypothesis probability:

Prior: P(H) = 0.75
Likelihood: P(data|H) = 0.95
Evidence: P(data) = 0.80

Posterior: P(H|data) = 0.75 * 0.95 / 0.80 ≈ 0.89
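
The same update can be written with the evidence term expanded via the law of total probability. Here P(data|¬H) = 0.35 is an assumed value chosen for illustration:

```python
prior = 0.75        # P(H): prior probability the fix works
likelihood = 0.95   # P(data | H): probability of seeing this data if it works
lik_not_h = 0.35    # P(data | not H): assumed for illustration

# Law of total probability gives the evidence term
evidence = likelihood * prior + lik_not_h * (1 - prior)
posterior = likelihood * prior / evidence  # Bayes' rule

print(round(evidence, 2), round(posterior, 2))  # 0.8 0.89
```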

Early Stopping

Use O'Brien-Fleming boundaries for sequential testing:

Check 1 (25% samples): Need z > 4.56 to stop
Check 2 (50% samples): Need z > 2.94 to stop
Check 3 (75% samples): Need z > 2.36 to stop
Check 4 (100% samples): Need z > 2.02 to stop

Benefits:

  • Stop early if fix clearly works
  • Stop early if fix clearly fails
  • Reduce exposure to bad fixes
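
A minimal sequential-check sketch, using boundary values like those in the table above as input (in practice the boundaries would come from an alpha-spending computation, not be hardcoded):

```python
# Group-sequential check: compare the interim z-statistic to the
# boundary for the current look; stop only if the boundary is crossed.
# Boundary values mirror the table above, for illustration.
BOUNDARIES = [4.56, 2.94, 2.36, 2.02]  # looks at 25/50/75/100% of samples

def sequential_decision(look: int, z: float) -> str:
    bound = BOUNDARIES[look - 1]
    if abs(z) > bound:
        return "stop: significant"
    if look == len(BOUNDARIES):
        return "stop: not significant"
    return "continue"

print(sequential_decision(1, 3.1))  # strong, but below 4.56 -> continue
print(sequential_decision(2, 3.1))  # crosses 2.94 -> stop early
```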

Experiment Lifecycle

Pending → Testing → Validated
                  → Rejected
                  → Inconclusive
                  → Failed

Hypothesis States

State         Description
pending       Hypothesis created, awaiting test
testing       Traffic being split, collecting samples
validated     Fix proven statistically effective
rejected      Fix proven not effective
inconclusive  Not enough data to decide
failed        Test encountered an error

Experiment Output

{
  "experiment_id": "exp-abc123",
  "hypothesis": "retry_with_backoff",
  "status": "validated",
  "results": {
    "control": {
      "samples": 512,
      "error_rate": 0.123,
      "latency_p50_ms": 234,
      "latency_p99_ms": 1890
    },
    "treatment": {
      "samples": 498,
      "error_rate": 0.021,
      "latency_p50_ms": 267,
      "latency_p99_ms": 2100
    },
    "statistics": {
      "z_score": 5.89,
      "p_value": 0.0000003,
      "effect_size": 0.38,
      "confidence_interval": [0.072, 0.132]
    }
  },
  "decision": "DEPLOY",
  "confidence": 0.89
}

Multiple Hypotheses

Test hypotheses in parallel or sequence:

Parallel Testing

  • Split traffic N ways
  • Requires more total traffic
  • Faster time to answer

Sequential Testing

  • Test one at a time
  • Stop when one succeeds
  • Lower traffic requirement

Recommendation

Risicare uses an adaptive approach:

  • High-prior hypotheses in parallel
  • Lower-prior hypotheses sequential
  • Stop when validated fix found
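
The adaptive strategy above can be sketched as a small scheduler. `run_experiment` is a stand-in for the real A/B test, and the 0.7 threshold is an assumed cutoff, not a documented Risicare parameter:

```python
# Hypothetical adaptive scheduler: high-prior hypotheses are tested as a
# batch, the rest one at a time, stopping at the first validated fix.
def adaptive_test(hypotheses, run_experiment, parallel_threshold=0.7):
    ranked = sorted(hypotheses, key=lambda h: h["prior"], reverse=True)
    batch = [h for h in ranked if h["prior"] >= parallel_threshold]
    rest = [h for h in ranked if h["prior"] < parallel_threshold]

    for h in batch:  # conceptually in parallel; sequential here for simplicity
        if run_experiment(h) == "validated":
            return h
    for h in rest:   # lower-prior hypotheses tested one at a time
        if run_experiment(h) == "validated":
            return h
    return None

hyps = [{"type": "retry", "prior": 0.75},
        {"type": "parameter", "prior": 0.60},
        {"type": "fallback", "prior": 0.55}]
winner = adaptive_test(
    hyps, lambda h: "validated" if h["type"] == "parameter" else "rejected")
print(winner["type"])  # parameter
```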
