# Hypothesis Testing

DoVer methodology for fix validation.

> **Coming Soon**
>
> Hypothesis testing is under active development. 55 hypotheses have been generated, but the experiment runner is not yet connected.

Risicare validates fixes using the DoVer (Diagnosis via Observation of Verification) methodology.
## DoVer Methodology

DoVer treats debugging as hypothesis testing:

- **Observe** - Collect data about the error
- **Hypothesize** - Generate possible fixes
- **Verify** - Test each hypothesis statistically
- **Diagnose** - Confirm which fix works
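The four steps can be sketched as a simple loop. This is a minimal sketch; the function names and hypothesis shape are illustrative placeholders, not Risicare's actual API:

```python
# Sketch of the DoVer loop. All names here are illustrative
# placeholders, not Risicare's actual API.

def dover_loop(error, generate_hypotheses, run_experiment):
    """Observe -> Hypothesize -> Verify -> Diagnose."""
    observation = {"error": error}                  # Observe
    hypotheses = generate_hypotheses(observation)   # Hypothesize
    for h in sorted(hypotheses, key=lambda h: -h["prior"]):
        result = run_experiment(h)                  # Verify
        if result["validated"]:
            return h                                # Diagnose: this fix works
    return None

# Toy usage with stubbed-in generator and experiment runner:
fix = dover_loop(
    "TOOL.EXECUTION.TIMEOUT",
    lambda obs: [{"type": "retry", "prior": 0.75}],
    lambda h: {"validated": h["type"] == "retry"},
)
```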
## Hypothesis Generation

### From Diagnosis

When a diagnosis completes, generate hypotheses:

```python
diagnosis = {
    "error_code": "TOOL.EXECUTION.TIMEOUT",
    "root_cause": "API timeout due to large payload",
    "factors": ["payload_size", "no_retry", "short_timeout"]
}

hypotheses = [
    {
        "type": "retry",
        "prior": 0.75,  # Based on similar cases
        "rationale": "Transient timeouts often succeed on retry"
    },
    {
        "type": "parameter",
        "prior": 0.60,
        "rationale": "Longer timeout may allow completion"
    },
    {
        "type": "fallback",
        "prior": 0.55,
        "rationale": "Fallback maintains availability"
    }
]
```

### Prior Probability
Priors are calculated from:
- Historical success rate for this error code
- Similarity to past fixes
- Root cause alignment
- Implementation complexity
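One plausible way to combine these four signals is a weighted score. The weights and the clamping below are made-up assumptions for illustration; Risicare's actual prior model is not specified here:

```python
# Sketch: combine prior signals into a single probability.
# The weights (0.4/0.3/0.2/0.1) and the clamp are illustrative assumptions.

def estimate_prior(history_rate, similarity, cause_alignment, complexity):
    """All inputs in [0, 1]; higher complexity lowers the prior."""
    score = (0.4 * history_rate          # historical success for this error code
             + 0.3 * similarity          # similarity to past fixes
             + 0.2 * cause_alignment     # root cause alignment
             + 0.1 * (1 - complexity))   # simpler fixes get a boost
    return min(max(score, 0.05), 0.95)   # keep priors away from 0 and 1

print(estimate_prior(0.8, 0.7, 0.9, 0.3))  # 0.78
```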
## Statistical Validation

### Sample Size

Calculate required samples for statistical power:

```
Power = 0.80 (80% chance of detecting real effect)
α = 0.05 (5% false positive rate)
Expected effect = 50% error reduction
Baseline error rate = 10%

Required samples ≈ 435 per group
```
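This figure comes from the standard normal-approximation formula for comparing two proportions (the exact number varies slightly with the approximation used):

```python
# Two-proportion sample-size calculation (normal approximation):
# 10% baseline error rate, 50% reduction (to 5%), power 0.80, alpha 0.05.
from math import ceil, sqrt
from statistics import NormalDist

def samples_per_group(p1, p2, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

print(samples_per_group(0.10, 0.05))  # ~435 per group
```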
### A/B Test

Split traffic between control and treatment:

```
Control (50%):   No fix applied
Treatment (50%): Fix applied
```

Measure:

- Error rate
- Latency (P50, P95, P99)
- Cost per request
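A deterministic way to implement the split is to hash a stable request or caller ID into a bucket, so the same caller stays in the same arm across retries. The bucketing scheme below is an illustrative assumption, not Risicare's implementation:

```python
# Sketch of a deterministic 50/50 traffic split via hashing.
# Hashing a stable ID keeps each caller in the same arm across retries.
import hashlib

def assign_arm(request_id: str, treatment_share: float = 0.5) -> str:
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# Same ID always lands in the same arm:
assert assign_arm("req-42") == assign_arm("req-42")
```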
### Analysis

Two-proportion z-test:

```
Baseline error rate:  12.3% (n=500)
Treatment error rate:  2.1% (n=500)

z = 6.24
p-value ≈ 4e-10
Effect size (Cohen's h) = 0.43

Result: Statistically significant improvement
```
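These figures can be reproduced with the pooled two-proportion z-test and the arcsine-based Cohen's h:

```python
# Pooled two-proportion z-test and Cohen's h for the example above
# (control 12.3% errors, treatment 2.1%, n=500 each).
from math import asin, sqrt
from statistics import NormalDist

def two_proportion_z(p1, n1, p2, n2):
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

def cohens_h(p1, p2):
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

z, p = two_proportion_z(0.123, 500, 0.021, 500)
print(round(z, 2), round(cohens_h(0.123, 0.021), 2))  # 6.24 0.43
```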
### Bayesian Update

Update hypothesis probability:

```
Prior:      P(H) = 0.75
Likelihood: P(data|H) = 0.95
Evidence:   P(data) = 0.80

Posterior:  P(H|data) = 0.75 × 0.95 / 0.80 ≈ 0.89
```
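As a sanity check, note that P(data) can never be smaller than P(data|H) · P(H), or the posterior would exceed 1:

```python
# Bayes' rule for the update above, with a guard on the evidence term:
# P(data) must be at least P(data|H) * P(H).

def posterior(prior, likelihood, evidence):
    if evidence < likelihood * prior:
        raise ValueError("P(data) cannot be smaller than P(data|H) * P(H)")
    return likelihood * prior / evidence

print(round(posterior(0.75, 0.95, 0.80), 2))  # 0.89
```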
### Early Stopping

Use O'Brien-Fleming boundaries for sequential testing:

```
Check 1 (25% samples):  Need z > 4.049 to stop
Check 2 (50% samples):  Need z > 2.863 to stop
Check 3 (75% samples):  Need z > 2.337 to stop
Check 4 (100% samples): Need z > 2.024 to stop
```
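The classic O'Brien-Fleming boundaries follow z_k = z_K · √(K/k) for K looks; the checkpoints above use a final boundary of 2.024 with K = 4 (small last-digit differences come from rounding that final boundary):

```python
# Classic O'Brien-Fleming boundaries: z_k = z_K * sqrt(K / k),
# reproducing the four checkpoints above (final boundary 2.024).
from math import sqrt

def obf_boundaries(z_final, looks):
    return [z_final * sqrt(looks / k) for k in range(1, looks + 1)]

print([round(z, 3) for z in obf_boundaries(2.024, 4)])
# [4.048, 2.862, 2.337, 2.024]
```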
Benefits:
- Stop early if fix clearly works
- Stop early if fix clearly fails
- Reduce exposure to bad fixes
## Experiment Lifecycle

```
Pending → Testing → Validated
                  → Rejected
                  → Inconclusive
                  → Failed
```
### Hypothesis States

| State | Description |
|---|---|
| `pending` | Hypothesis created, awaiting test |
| `testing` | Traffic being split, collecting samples |
| `validated` | Fix proven statistically effective |
| `rejected` | Fix proven not effective |
| `inconclusive` | Not enough data to decide |
| `failed` | Test encountered an error |
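The lifecycle can be encoded as an explicit transition table. The states come from the table above; the enforcement code itself is an illustrative sketch:

```python
# Sketch: hypothesis lifecycle as an explicit transition table.
# States are from the table above; the guard code is illustrative.

TRANSITIONS = {
    "pending": {"testing"},
    "testing": {"validated", "rejected", "inconclusive", "failed"},
    "validated": set(),     # terminal
    "rejected": set(),      # terminal
    "inconclusive": set(),  # terminal
    "failed": set(),        # terminal
}

def advance(state, new_state):
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

state = advance("pending", "testing")
state = advance(state, "validated")
```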
## Experiment Output

```json
{
  "experiment_id": "exp-abc123",
  "hypothesis": "retry_with_backoff",
  "status": "validated",
  "results": {
    "control": {
      "samples": 512,
      "error_rate": 0.123,
      "latency_p50_ms": 234,
      "latency_p99_ms": 1890
    },
    "treatment": {
      "samples": 498,
      "error_rate": 0.021,
      "latency_p50_ms": 267,
      "latency_p99_ms": 2100
    },
    "statistics": {
      "z_score": 6.24,
      "p_value": 4.4e-10,
      "effect_size": 0.43,
      "confidence_interval": [0.072, 0.132]
    }
  },
  "decision": "DEPLOY",
  "confidence": 0.89
}
```

## Multiple Hypotheses
Test hypotheses in parallel or sequence:
### Parallel Testing

- Split traffic N ways
- Requires more total traffic
- Faster time to answer

### Sequential Testing

- Test one at a time
- Stop when one succeeds
- Lower traffic requirement
### Recommendation

Risicare uses an adaptive approach:

- Test high-prior hypotheses in parallel
- Test lower-prior hypotheses sequentially
- Stop when a validated fix is found
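The adaptive split can be sketched as a simple scheduler: hypotheses above a prior threshold run in parallel, the rest queue sequentially, most promising first. The 0.7 threshold is an illustrative assumption:

```python
# Sketch of the adaptive schedule: high-prior hypotheses in parallel,
# the rest queued sequentially. The 0.7 threshold is an assumption.

def schedule(hypotheses, threshold=0.7):
    parallel = [h for h in hypotheses if h["prior"] >= threshold]
    sequential = sorted(
        (h for h in hypotheses if h["prior"] < threshold),
        key=lambda h: -h["prior"],  # try the most promising first
    )
    return parallel, sequential

parallel, sequential = schedule([
    {"type": "retry", "prior": 0.75},
    {"type": "parameter", "prior": 0.60},
    {"type": "fallback", "prior": 0.55},
])
# parallel: retry; sequential: parameter, then fallback
```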