Hypothesis Testing
DoVer methodology for fix validation.
Risicare validates fixes using the DoVer (Diagnosis via Observation of Verification) methodology.
DoVer Methodology
DoVer treats debugging as hypothesis testing:
- Observe - Collect data about the error
- Hypothesize - Generate possible fixes
- Verify - Test each hypothesis statistically
- Diagnose - Confirm which fix works
Hypothesis Generation
From Diagnosis
When a diagnosis completes, generate hypotheses:
diagnosis = {
    "error_code": "TOOL.EXECUTION.TIMEOUT",
    "root_cause": "API timeout due to large payload",
    "factors": ["payload_size", "no_retry", "short_timeout"],
}
hypotheses = [
    {
        "type": "retry",
        "prior": 0.75,  # Based on similar cases
        "rationale": "Transient timeouts often succeed on retry",
    },
    {
        "type": "parameter",
        "prior": 0.60,
        "rationale": "Longer timeout may allow completion",
    },
    {
        "type": "fallback",
        "prior": 0.55,
        "rationale": "Fallback maintains availability",
    },
]
Prior Probability
Priors are calculated from:
- Historical success rate for this error code
- Similarity to past fixes
- Root cause alignment
- Implementation complexity
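One simple way to combine these four signals is a weighted average clipped to [0, 1]. The sketch below is illustrative only: the `PriorInputs` fields, the weights, and the `estimate_prior` function are assumptions made for this example, not Risicare's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class PriorInputs:
    """Hypothetical per-hypothesis signals, each normalized to [0, 1]."""
    historical_success_rate: float    # past success rate for this error code
    similarity_to_past_fixes: float   # 1.0 = identical to a known-good fix
    root_cause_alignment: float       # how directly the fix targets the root cause
    implementation_complexity: float  # 0.0 = trivial, 1.0 = very complex


# Assumed weights; they sum to 1 so the score stays in [0, 1].
WEIGHTS = {
    "history": 0.40,
    "similarity": 0.25,
    "alignment": 0.25,
    "simplicity": 0.10,
}


def estimate_prior(x: PriorInputs) -> float:
    """Weighted average of the signals; complexity counts against the fix."""
    score = (
        WEIGHTS["history"] * x.historical_success_rate
        + WEIGHTS["similarity"] * x.similarity_to_past_fixes
        + WEIGHTS["alignment"] * x.root_cause_alignment
        + WEIGHTS["simplicity"] * (1.0 - x.implementation_complexity)
    )
    return min(1.0, max(0.0, score))


# Example inputs (made up) for a retry hypothesis.
retry_prior = estimate_prior(PriorInputs(0.8, 0.7, 0.7, 0.2))
```

Complexity enters as `1 - complexity`, so simpler fixes receive a slightly higher prior when all else is equal.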
Statistical Validation
Sample Size
Calculate required samples for statistical power:
Power = 0.80 (80% chance of detecting a real effect)
α = 0.05 (5% false positive rate, two-sided)
Expected effect = 50% relative error reduction (10% → 5%)
Baseline error rate = 10%
Required samples ≈ 430 per group
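The calculation can be reproduced with the standard normal-approximation formula for comparing two proportions. This is a minimal sketch, assuming a two-sided test and unpooled variance; the function name `samples_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(baseline: float, relative_reduction: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n for a two-sided, two-proportion z-test (unpooled variance)."""
    p1 = baseline
    p2 = baseline * (1.0 - relative_reduction)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)               # quantile for the target power
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return ceil(n)


n = samples_per_group(baseline=0.10, relative_reduction=0.50)
```

With the inputs listed above, this returns a value a little above 400 per group. A one-sided test or a different variance convention shifts the result somewhat.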
A/B Test
Split traffic between control and treatment:
Control (50%): No fix applied
Treatment (50%): Fix applied
Measure:
- Error rate
- Latency (P50, P95, P99)
- Cost per request
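A common way to implement a stable 50/50 split is to hash a request identifier into a bucket, so the same request always lands in the same arm. The sketch below assumes each request carries a string ID; `assign_arm` and its parameters are hypothetical names, not part of a Risicare API.

```python
import hashlib


def assign_arm(request_id: str, experiment_id: str,
               treatment_share: float = 0.5) -> str:
    """Deterministically map a request to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment_id}:{request_id}".encode()).digest()
    # First 4 bytes as an integer, scaled to [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "treatment" if bucket < treatment_share else "control"
```

Including the experiment ID in the hash keeps assignments independent across experiments, so a request that falls in the treatment arm of one test is not systematically in the treatment arm of the next.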
Analysis
Two-proportion z-test (pooled standard error):
Baseline error rate: 12.3% (n=500)
Treatment error rate: 2.1% (n=500)
z = 6.24
p-value ≈ 4 × 10⁻¹⁰ (two-sided)
Effect size (Cohen's h) = 0.43
Result: Statistically significant improvement
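The test statistic and effect size can be computed from the two error rates and sample sizes using only the standard library. This is a sketch assuming a pooled standard error and a two-sided p-value; unpooled or continuity-corrected variants give slightly different numbers. The function name is hypothetical.

```python
from math import asin, sqrt
from statistics import NormalDist


def two_proportion_z_test(p_control: float, n_control: int,
                          p_treatment: float, n_treatment: int):
    """Return (z, two-sided p-value, Cohen's h) for two observed error rates."""
    # Pooled error rate under the null hypothesis of no difference.
    pooled = (p_control * n_control + p_treatment * n_treatment) / (n_control + n_treatment)
    se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (p_control - p_treatment) / se
    # Use the lower tail directly to avoid precision loss for large |z|.
    p_value = 2 * NormalDist().cdf(-abs(z))
    # Cohen's h: difference of arcsine-transformed proportions.
    h = 2 * asin(sqrt(p_control)) - 2 * asin(sqrt(p_treatment))
    return z, p_value, h


z, p_value, h = two_proportion_z_test(0.123, 500, 0.021, 500)
```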
Bayesian Update
Update hypothesis probability:
Prior: P(H) = 0.75
Likelihood: P(data|H) = 0.95
Evidence: P(data) = 0.80
Posterior: P(H|data) = 0.75 * 0.95 / 0.80 = 0.89
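In practice the evidence term is obtained from the law of total probability, P(data) = P(data|H)·P(H) + P(data|¬H)·P(¬H), which also guarantees that the posterior stays between 0 and 1. The sketch below assumes P(data|¬H) = 0.35, a value chosen for illustration so that the posterior lands near 0.89; the function name is hypothetical.

```python
def posterior(prior: float, p_data_given_h: float,
              p_data_given_not_h: float) -> float:
    """Bayes' rule with the evidence expanded by the law of total probability."""
    evidence = p_data_given_h * prior + p_data_given_not_h * (1.0 - prior)
    return p_data_given_h * prior / evidence


# Prior and likelihood from the example; P(data | not H) = 0.35 is assumed.
updated = posterior(prior=0.75, p_data_given_h=0.95, p_data_given_not_h=0.35)
```

Because the evidence always includes the term P(data|H)·P(H), it can never be smaller than the numerator.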
Early Stopping
Use O'Brien-Fleming boundaries for sequential testing (two-sided α = 0.05, four equally spaced checks):
Check 1 (25% samples): Need z > 4.05 to stop
Check 2 (50% samples): Need z > 2.86 to stop
Check 3 (75% samples): Need z > 2.34 to stop
Check 4 (100% samples): Need z > 2.02 to stop
Benefits:
- Stop early if fix clearly works
- Stop early if fix clearly fails
- Reduce exposure to bad fixes
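Classical O'Brien-Fleming boundaries have the form z_k = C · sqrt(K / k) for look k of K, so early looks need much stronger evidence than the final one. The sketch below assumes four equally spaced looks and a two-sided α of 0.05, for which the commonly tabulated constant is C ≈ 2.024. Spending-function approximations yield slightly different thresholds, so treat the numbers as illustrative. The function names are hypothetical.

```python
from math import sqrt

# Tabulated O'Brien-Fleming constant for 4 equally spaced looks,
# two-sided alpha = 0.05 (an assumption about the design).
OBF_CONSTANT_K4 = 2.024


def obf_boundaries(num_looks: int = 4, constant: float = OBF_CONSTANT_K4):
    """Critical |z| at look k is constant * sqrt(num_looks / k)."""
    return [constant * sqrt(num_looks / k) for k in range(1, num_looks + 1)]


def should_stop(z: float, look: int, boundaries) -> bool:
    """True when |z| crosses the boundary at a 1-indexed look."""
    return abs(z) >= boundaries[look - 1]


boundaries = obf_boundaries()
```

For example, `should_stop(3.1, look=2, boundaries=boundaries)` is true, while the same z at the first look is not enough to stop.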
Experiment Lifecycle
Pending → Testing → Validated
                  → Rejected
                  → Inconclusive
                  → Failed
Hypothesis States
| State | Description |
|---|---|
| pending | Hypothesis created, awaiting test |
| testing | Traffic being split, collecting samples |
| validated | Fix proven statistically effective |
| rejected | Fix proven not effective |
| inconclusive | Not enough data to decide |
| failed | Test encountered an error |
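The states and the lifecycle can be modeled as a small state machine. This is a minimal sketch: the class and function names are hypothetical, and the transition table assumes that `pending` leads only to `testing` and that the four outcomes are terminal.

```python
from enum import Enum


class HypothesisState(str, Enum):
    PENDING = "pending"
    TESTING = "testing"
    VALIDATED = "validated"
    REJECTED = "rejected"
    INCONCLUSIVE = "inconclusive"
    FAILED = "failed"


# Assumed transitions: pending -> testing -> one of four terminal states.
ALLOWED_TRANSITIONS = {
    HypothesisState.PENDING: {HypothesisState.TESTING},
    HypothesisState.TESTING: {
        HypothesisState.VALIDATED,
        HypothesisState.REJECTED,
        HypothesisState.INCONCLUSIVE,
        HypothesisState.FAILED,
    },
}


def transition(current: HypothesisState,
               target: HypothesisState) -> HypothesisState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Terminal states have no entry in the table, so any attempt to leave them raises `ValueError`.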
Experiment Output
{
  "experiment_id": "exp-abc123",
  "hypothesis": "retry_with_backoff",
  "status": "validated",
  "results": {
    "control": {
      "samples": 512,
      "error_rate": 0.123,
      "latency_p50_ms": 234,
      "latency_p99_ms": 1890
    },
    "treatment": {
      "samples": 498,
      "error_rate": 0.021,
      "latency_p50_ms": 267,
      "latency_p99_ms": 2100
    },
    "statistics": {
      "z_score": 6.24,
      "p_value": 4.3e-10,
      "effect_size": 0.43,
      "confidence_interval": [0.071, 0.133]
    }
  },
  "decision": "DEPLOY",
  "confidence": 0.89
}
Multiple Hypotheses
Test hypotheses in parallel or sequence:
Parallel Testing
- Split traffic N ways
- Requires more total traffic
- Faster time to answer
Sequential Testing
- Test one at a time
- Stop when one succeeds
- Lower traffic requirement
Recommendation
Risicare uses an adaptive approach:
- Test high-prior hypotheses in parallel
- Test lower-prior hypotheses sequentially
- Stop as soon as a fix is validated
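The adaptive strategy can be sketched as a partition by prior followed by a stop-at-first-success loop. The 0.6 threshold, the function names, and the `run_experiment` callable are assumptions made for illustration. The source does not state where the high/low cutoff lies.

```python
def plan_tests(hypotheses, parallel_threshold=0.6):
    """Split hypotheses into a parallel batch and a sequential queue by prior."""
    ranked = sorted(hypotheses, key=lambda h: h["prior"], reverse=True)
    parallel = [h for h in ranked if h["prior"] >= parallel_threshold]
    sequential = [h for h in ranked if h["prior"] < parallel_threshold]
    return parallel, sequential


def first_validated(queue, run_experiment):
    """Test hypotheses in order and stop at the first one that validates.

    run_experiment is a hypothetical callable that returns a status string
    such as "validated" or "rejected" for one hypothesis.
    """
    for hypothesis in queue:
        if run_experiment(hypothesis) == "validated":
            return hypothesis
    return None
```

With the three example hypotheses (priors 0.75, 0.60, 0.55), `plan_tests` places retry and parameter in the parallel batch and leaves fallback for sequential testing.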