A/B Testing
Statistical validation of fixes.
After the canary stage, fixes undergo rigorous A/B testing to prove their effectiveness.
How It Works
- Split traffic between control and treatment
- Measure key metrics
- Analyze statistical significance
- Decide based on evidence
Deployment States
A/B testing operates within the deployment lifecycle:
| State | Description |
|---|---|
| pending | Deployment created, not yet started |
| active | Live and serving traffic |
| ramping | Traffic percentage increasing through stages |
| graduated | Fix reached 100% and held for 24 hours |
| rolled_back | Deployment reverted |
| failed | Unrecoverable error during deployment |
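The lifecycle above can be sketched as a small state machine. A minimal Python sketch; the transition set is inferred from the state descriptions and is an assumption, since the document does not enumerate allowed transitions:

```python
from enum import Enum

class DeploymentState(str, Enum):
    PENDING = "pending"
    ACTIVE = "active"
    RAMPING = "ramping"
    GRADUATED = "graduated"
    ROLLED_BACK = "rolled_back"
    FAILED = "failed"

# Assumed transitions: pending -> active -> ramping -> graduated,
# with rollback/failure possible from any live state.
TRANSITIONS = {
    DeploymentState.PENDING: {DeploymentState.ACTIVE, DeploymentState.FAILED},
    DeploymentState.ACTIVE: {DeploymentState.RAMPING,
                             DeploymentState.ROLLED_BACK, DeploymentState.FAILED},
    DeploymentState.RAMPING: {DeploymentState.GRADUATED,
                              DeploymentState.ROLLED_BACK, DeploymentState.FAILED},
    DeploymentState.GRADUATED: set(),   # terminal
    DeploymentState.ROLLED_BACK: set(), # terminal
    DeploymentState.FAILED: set(),      # terminal
}

def can_transition(src: DeploymentState, dst: DeploymentState) -> bool:
    """True if dst is a legal next state for src."""
    return dst in TRANSITIONS[src]
```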
Test Configuration
```json
{
  "ab_test": {
    "control_percentage": 50,
    "treatment_percentage": 50,
    "primary_metric": "error_rate",
    "secondary_metrics": ["latency_p99", "cost_per_request"],
    "significance_level": 0.05,
    "power": 0.80,
    "minimum_effect_size": 0.20
  }
}
```
Statistical Method
Two-Proportion Z-Test
For error rate comparison:
H0: p_treatment >= p_control (fix doesn't help)
H1: p_treatment < p_control (fix helps)
z = (p_control - p_treatment) / sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
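The test above can be computed directly. A minimal sketch using only the standard library (the function name is illustrative); the one-sided p-value uses the normal-CDF identity `P(Z > z) = 0.5 * erfc(z / sqrt(2))`:

```python
import math

def two_proportion_z(err_control, n_control, err_treatment, n_treatment):
    """One-sided two-proportion z-test; H1: treatment error rate < control."""
    # Pooled error rate across both groups.
    pooled = (err_control * n_control + err_treatment * n_treatment) \
             / (n_control + n_treatment)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (err_control - err_treatment) / se
    # One-sided p-value from the standard normal survival function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value
```

A positive z with a small p-value rejects H0 in favor of "fix helps".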
Sample Size Calculation
Required samples for power 0.80:
- Baseline error rate: 10%
- Expected improvement: 50% relative reduction (to 5%)
- Significance level: 0.05
- Power: 0.80

Required per group: ~400 samples
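The calculation above can be reproduced with the standard normal-approximation formula for a two-proportion test (a sketch; function name is illustrative). Note the exact figure depends on convention: a one-sided test at these parameters gives roughly 340 per group and a two-sided test roughly 430, bracketing the ~400 quoted above:

```python
import math
from statistics import NormalDist

def required_per_group(p1, p2, alpha=0.05, power=0.80, one_sided=True):
    """Normal-approximation sample size per group for comparing two proportions."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha) if one_sided else nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    # Sum of per-group Bernoulli variances at the hypothesized rates.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)
```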
Sequential Testing
Use O'Brien-Fleming boundaries for early stopping:
| Analysis | Samples | Z threshold |
|---|---|---|
| 1st | 25% | 4.56 |
| 2nd | 50% | 2.94 |
| 3rd | 75% | 2.36 |
| Final | 100% | 2.02 |
Benefits:
- Stop early if fix clearly works
- Stop early if fix clearly fails
- Maintain overall Type I error rate
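The boundary table above translates into a simple stopping check. A minimal sketch (helper name is illustrative; it assumes stopping for either clear success or clear failure by comparing |z| against the boundary at the current look):

```python
# O'Brien-Fleming z thresholds from the table above (4 looks at 25/50/75/100%).
OBF_BOUNDARIES = [(0.25, 4.56), (0.50, 2.94), (0.75, 2.36), (1.00, 2.02)]

def sequential_decision(z_score, fraction_complete):
    """Return 'stop' if |z| crosses the boundary at this interim look,
    else 'continue'."""
    for frac, threshold in OBF_BOUNDARIES:
        if fraction_complete <= frac:
            return "stop" if abs(z_score) >= threshold else "continue"
    return "continue"
```

Early looks demand very strong evidence (z >= 4.56), so a marginal effect only stops the test near full sample size.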
A/B Results
```json
{
  "ab_test_id": "ab-abc123",
  "status": "completed",
  "results": {
    "control": {
      "samples": 512,
      "error_rate": 0.102,
      "latency_p50_ms": 234,
      "latency_p99_ms": 890
    },
    "treatment": {
      "samples": 498,
      "error_rate": 0.051,
      "latency_p50_ms": 245,
      "latency_p99_ms": 920
    },
    "analysis": {
      "error_rate_reduction": 0.50,
      "z_score": 3.21,
      "p_value": 0.0007,
      "confidence_interval": [0.031, 0.071],
      "effect_size_cohens_h": 0.35
    },
    "decision": "WINNER",
    "recommendation": "Promote to 50% traffic"
  }
}
```
Dashboard View
Test Overview
| Metric | Control | Treatment | Diff | P-value |
|---|---|---|---|---|
| Error Rate | 10.2% | 5.1% | -50% | 0.0007 |
| P50 Latency | 234ms | 245ms | +5% | 0.23 |
| P99 Latency | 890ms | 920ms | +3% | 0.45 |
| Cost/Request | $0.012 | $0.013 | +8% | 0.12 |
Time Series
Error rate over time showing control vs treatment with confidence bands.
Cumulative Results
Running totals showing convergence to final result.
Decision Rules
| Outcome | Condition | Action |
|---|---|---|
| Winner | p < 0.05, treatment better | Ramp to next stage |
| Loser | p < 0.05, control better | Rollback |
| Inconclusive | p >= 0.05 | Extend test |
| Guardrail Fail | Secondary metric degraded | Rollback |
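The decision table above maps onto a small function. A minimal sketch (names are illustrative, not part of the system); guardrails are checked first because a degraded secondary metric forces rollback even when the primary metric wins:

```python
def decide(p_value, treatment_better, guardrails_ok, alpha=0.05):
    """Map A/B results onto the decision table: WINNER / LOSER /
    INCONCLUSIVE / GUARDRAIL_FAIL."""
    if not guardrails_ok:
        return "GUARDRAIL_FAIL"   # rollback regardless of primary metric
    if p_value < alpha:
        return "WINNER" if treatment_better else "LOSER"
    return "INCONCLUSIVE"         # extend the test
```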
Guardrail Metrics
Secondary metrics that must not degrade:
```json
{
  "guardrails": {
    "latency_p99": {
      "max_increase_factor": 1.5
    },
    "cost_per_request": {
      "max_increase_factor": 1.25
    }
  }
}
```
Ramp Stages
Fixes progress through these stages, each requiring a passing statistical test:
5% -> 25% -> 50% -> 100%
Auto-rollback at any stage if:
- Error rate increases >10% relative to baseline
- P99 latency exceeds 2x baseline
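The auto-rollback conditions above reduce to two threshold checks. A minimal sketch (function name is illustrative; thresholds are taken directly from the list above):

```python
def should_rollback(baseline_error, current_error, baseline_p99, current_p99):
    """True if either auto-rollback condition is met."""
    # Error rate up more than 10% relative to baseline.
    if baseline_error > 0 and (current_error - baseline_error) / baseline_error > 0.10:
        return True
    # P99 latency exceeds 2x baseline.
    if current_p99 > 2 * baseline_p99:
        return True
    return False
```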