
A/B Testing

Statistical validation of fixes.

After the canary stage, fixes undergo rigorous A/B testing to prove their effectiveness before full rollout.

How It Works

  1. Split traffic between control and treatment
  2. Measure key metrics
  3. Analyze statistical significance
  4. Decide based on evidence

Deployment States

A/B testing operates within the deployment lifecycle:

| State | Description |
| --- | --- |
| pending | Deployment created, not yet started |
| active | Live and serving traffic |
| ramping | Traffic percentage increasing through stages |
| graduated | Fix reached 100% and held for 24 hours |
| rolled_back | Deployment reverted |
| failed | Unrecoverable error during deployment |
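The lifecycle above can be sketched as a transition map. The allowed edges below are an assumption inferred from the state descriptions; the real deployment engine may permit different transitions.

```python
# Hypothetical transition map inferred from the state descriptions;
# the actual deployment engine may allow different edges.
ALLOWED_TRANSITIONS = {
    "pending": {"active", "failed"},
    "active": {"ramping", "rolled_back", "failed"},
    "ramping": {"graduated", "rolled_back", "failed"},
    "graduated": set(),    # terminal: reached 100% and held for 24 hours
    "rolled_back": set(),  # terminal
    "failed": set(),       # terminal
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```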

Test Configuration

{
  "ab_test": {
    "control_percentage": 50,
    "treatment_percentage": 50,
    "primary_metric": "error_rate",
    "secondary_metrics": ["latency_p99", "cost_per_request"],
    "significance_level": 0.05,
    "power": 0.80,
    "minimum_effect_size": 0.20
  }
}
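A minimal sanity check for this configuration might look like the following. The field names match the example above, but the validation rules themselves (percentages summing to 100, probabilities in range) are assumptions, not a documented schema.

```python
import json

def validate_ab_config(raw: str) -> dict:
    """Parse and sanity-check an ab_test config block (illustrative only)."""
    cfg = json.loads(raw)["ab_test"]
    # Traffic split must cover 100% of traffic.
    assert cfg["control_percentage"] + cfg["treatment_percentage"] == 100
    # Significance level and power are probabilities.
    assert 0 < cfg["significance_level"] < 1
    assert 0 < cfg["power"] < 1
    return cfg
```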

Statistical Method

Two-Proportion Z-Test

For error rate comparison:

H0: p_treatment >= p_control (fix doesn't help)
H1: p_treatment < p_control (fix helps)

z = (p_control - p_treatment) / sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
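The one-sided test above can be computed with the standard library alone. This is a sketch; the error counts used in the usage example are reconstructed from the example rates later in this page and are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(errors_c: int, n_c: int, errors_t: int, n_t: int):
    """One-sided two-proportion z-test.
    H1: treatment error rate < control error rate."""
    p_c, p_t = errors_c / n_c, errors_t / n_t
    p_pooled = (errors_c + errors_t) / (n_c + n_t)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_c + 1 / n_t))
    z = (p_c - p_t) / se
    p_value = 1 - NormalDist().cdf(z)  # one-sided upper tail
    return z, p_value
```

For example, `two_proportion_z(52, 512, 25, 498)` (roughly the counts behind the 10.2% vs 5.1% results shown below) yields z near 3 with p well under 0.01.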

Sample Size Calculation

Required samples for power 0.80:

Baseline error rate: 10%
Expected improvement: 50% reduction (to 5%)
Significance level: 0.05
Power: 0.80

Required per group: ~400 samples
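The calculation above can be reproduced with the textbook two-proportion sample-size formula. Note the choice of one-sided vs two-sided test matters: a one-sided test at these parameters gives roughly 340 per group, a two-sided test roughly 430; the ~400 figure sits between, so treat this as a sketch of the method rather than the exact formula used.

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(p1: float, p2: float, alpha: float = 0.05,
                      power: float = 0.80, two_sided: bool = False) -> int:
    """Per-group sample size for a two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)
```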

Sequential Testing

Use O'Brien-Fleming boundaries for early stopping:

| Analysis | Samples | Z threshold |
| --- | --- | --- |
| 1st | 25% | 4.56 |
| 2nd | 50% | 2.94 |
| 3rd | 75% | 2.36 |
| Final | 100% | 2.02 |

Benefits:

  • Stop early if fix clearly works
  • Stop early if fix clearly fails
  • Maintain overall Type I error rate
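An interim-analysis check using the boundary table above might be sketched as follows; the function name and stop/continue interface are hypothetical.

```python
# Boundaries from the table above: (fraction of samples, |z| threshold).
# Early stopping is allowed only when |z| crosses the threshold for the
# current analysis point, which preserves the overall Type I error rate.
BOUNDARIES = [(0.25, 4.56), (0.50, 2.94), (0.75, 2.36), (1.00, 2.02)]

def interim_decision(z: float, fraction_complete: float) -> str:
    """Return 'stop' or 'continue' given the current z and sample fraction."""
    for fraction, threshold in BOUNDARIES:
        if fraction_complete <= fraction:
            return "stop" if abs(z) >= threshold else "continue"
    return "stop" if abs(z) >= BOUNDARIES[-1][1] else "continue"
```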

A/B Results

{
  "ab_test_id": "ab-abc123",
  "status": "completed",
  "results": {
    "control": {
      "samples": 512,
      "error_rate": 0.102,
      "latency_p50_ms": 234,
      "latency_p99_ms": 890
    },
    "treatment": {
      "samples": 498,
      "error_rate": 0.051,
      "latency_p50_ms": 245,
      "latency_p99_ms": 920
    },
    "analysis": {
      "error_rate_reduction": 0.50,
      "z_score": 3.21,
      "p_value": 0.0007,
      "confidence_interval": [0.031, 0.071],
      "effect_size_cohens_h": 0.35
    },
    "decision": "WINNER",
    "recommendation": "Promote to 50% traffic"
  }
}

Dashboard View

Test Overview

| Metric | Control | Treatment | Diff | P-value |
| --- | --- | --- | --- | --- |
| Error Rate | 10.2% | 5.1% | -50% | 0.0007 |
| P50 Latency | 234ms | 245ms | +5% | 0.23 |
| P99 Latency | 890ms | 920ms | +3% | 0.45 |
| Cost/Request | $0.012 | $0.013 | +8% | 0.12 |

Time Series

Error rate over time showing control vs treatment with confidence bands.

Cumulative Results

Running totals showing convergence to final result.

Decision Rules

| Outcome | Condition | Action |
| --- | --- | --- |
| Winner | p < 0.05, treatment better | Ramp to next stage |
| Loser | p < 0.05, control better | Rollback |
| Inconclusive | p >= 0.05 | Extend test |
| Guardrail Fail | Secondary metric degraded | Rollback |
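The decision rules reduce to a small function. This is a sketch of the table's logic; the return labels and argument names are assumptions.

```python
def decide(p_value: float, treatment_better: bool,
           guardrails_ok: bool, alpha: float = 0.05) -> str:
    """Apply the decision rules: guardrail failures override everything."""
    if not guardrails_ok:
        return "ROLLBACK"                            # guardrail fail
    if p_value < alpha:
        return "RAMP" if treatment_better else "ROLLBACK"
    return "EXTEND"                                  # inconclusive
```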

Guardrail Metrics

Secondary metrics that must not degrade:

{
  "guardrails": {
    "latency_p99": {
      "max_increase_factor": 1.5
    },
    "cost_per_request": {
      "max_increase_factor": 1.25
    }
  }
}
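Evaluating these guardrails amounts to comparing each treatment metric against the control value scaled by its `max_increase_factor`. A minimal sketch, assuming metrics arrive as plain per-arm dictionaries:

```python
# Factors from the guardrail config above.
GUARDRAILS = {
    "latency_p99": 1.5,        # max_increase_factor
    "cost_per_request": 1.25,  # max_increase_factor
}

def guardrails_pass(control: dict, treatment: dict) -> bool:
    """True if no guarded metric grew beyond its max_increase_factor."""
    return all(
        treatment[metric] <= control[metric] * factor
        for metric, factor in GUARDRAILS.items()
    )
```

With the example results above (P99 890ms vs 920ms, cost $0.012 vs $0.013), both guardrails pass.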

Ramp Stages

Fixes progress through these stages, each requiring a passing statistical test:

5% -> 25% -> 50% -> 100%

Auto-rollback at any stage if:

  • Error rate increases >10% relative to baseline
  • P99 latency exceeds 2x baseline
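The two auto-rollback conditions above can be checked directly; the function name and signature are illustrative.

```python
def should_auto_rollback(baseline_error: float, current_error: float,
                         baseline_p99_ms: float, current_p99_ms: float) -> bool:
    """Rollback if error rate rises more than 10% relative to baseline,
    or P99 latency exceeds 2x baseline."""
    return (current_error > baseline_error * 1.10
            or current_p99_ms > baseline_p99_ms * 2.0)
```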

Next Steps