
A/B Testing

Statistical validation of fixes.

After the canary stage, fixes undergo rigorous A/B testing to prove their effectiveness before full rollout.

How It Works

  1. Split traffic between control and treatment
  2. Measure key metrics
  3. Analyze statistical significance
  4. Decide based on evidence

Deployment States

A/B testing operates within the deployment lifecycle:

| State | Description |
| --- | --- |
| pending | Deployment created, not yet started |
| active | Live and serving traffic |
| ramping | Traffic percentage increasing through stages |
| graduated | Fix reached 100% and held for 24 hours |
| rolled_back | Deployment reverted |
| failed | Unrecoverable error during deployment |
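The lifecycle above can be sketched as a transition map. The allowed edges below are an assumption inferred from the state descriptions; the real deployment engine may permit different transitions.

```python
# Hypothetical transition map inferred from the state descriptions;
# the actual deployment engine may allow different edges.
ALLOWED_TRANSITIONS = {
    "pending": {"active", "failed"},
    "active": {"ramping", "rolled_back", "failed"},
    "ramping": {"graduated", "rolled_back", "failed"},
    "graduated": set(),    # terminal: reached 100% and held for 24 hours
    "rolled_back": set(),  # terminal
    "failed": set(),       # terminal
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```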

Test Configuration

{
  "ab_test": {
    "control_percentage": 50,
    "treatment_percentage": 50,
    "primary_metric": "error_rate",
    "secondary_metrics": ["latency_p99", "cost_per_request"],
    "significance_level": 0.05,
    "power": 0.80,
    "minimum_effect_size": 0.20
  }
}
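A minimal sanity check for this configuration might look like the following. The field names match the example above, but the validation rules themselves (percentages summing to 100, probabilities in range) are assumptions, not a documented schema.

```python
import json

def validate_ab_config(raw: str) -> dict:
    """Parse and sanity-check an ab_test config block (illustrative only)."""
    cfg = json.loads(raw)["ab_test"]
    # Traffic split must cover 100% of traffic.
    assert cfg["control_percentage"] + cfg["treatment_percentage"] == 100
    # Significance level and power are probabilities.
    assert 0 < cfg["significance_level"] < 1
    assert 0 < cfg["power"] < 1
    return cfg
```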

Statistical Method

Two-Proportion Z-Test

For error rate comparison:

H0: p_treatment >= p_control (fix doesn't help)
H1: p_treatment < p_control (fix helps)

z = (p_control - p_treatment) / sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
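The one-sided test above can be computed with the standard library alone. This is a sketch; the error counts used in the usage example are reconstructed from the example rates later in this page and are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(errors_c: int, n_c: int, errors_t: int, n_t: int):
    """One-sided two-proportion z-test.
    H1: treatment error rate < control error rate."""
    p_c, p_t = errors_c / n_c, errors_t / n_t
    p_pooled = (errors_c + errors_t) / (n_c + n_t)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_c + 1 / n_t))
    z = (p_c - p_t) / se
    p_value = 1 - NormalDist().cdf(z)  # one-sided upper tail
    return z, p_value
```

For example, `two_proportion_z(52, 512, 25, 498)` (roughly the counts behind the 10.2% vs 5.1% results shown below) yields z near 3 with p well under 0.01.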

Sample Size Calculation

Required samples for power 0.80:

Baseline error rate: 10%
Expected improvement: 50% reduction (to 5%)
Significance level: 0.05
Power: 0.80

Required per group: ~400 samples
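The calculation above can be reproduced with the textbook two-proportion sample-size formula. Note the choice of one-sided vs two-sided test matters: a one-sided test at these parameters gives roughly 340 per group, a two-sided test roughly 430; the ~400 figure sits between, so treat this as a sketch of the method rather than the exact formula used.

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(p1: float, p2: float, alpha: float = 0.05,
                      power: float = 0.80, two_sided: bool = False) -> int:
    """Per-group sample size for a two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)
```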

Sequential Testing

Use O'Brien-Fleming boundaries for early stopping:

| Analysis | Samples | Z threshold |
| --- | --- | --- |
| 1st | 25% | 4.56 |
| 2nd | 50% | 2.94 |
| 3rd | 75% | 2.36 |
| Final | 100% | 2.02 |

Benefits:

  • Stop early if fix clearly works
  • Stop early if fix clearly fails
  • Maintain overall Type I error rate
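An interim-analysis check using the boundary table above might be sketched as follows; the function name and stop/continue interface are hypothetical.

```python
# Boundaries from the table above: (fraction of samples, |z| threshold).
# Early stopping is allowed only when |z| crosses the threshold for the
# current analysis point, which preserves the overall Type I error rate.
BOUNDARIES = [(0.25, 4.56), (0.50, 2.94), (0.75, 2.36), (1.00, 2.02)]

def interim_decision(z: float, fraction_complete: float) -> str:
    """Return 'stop' or 'continue' given the current z and sample fraction."""
    for fraction, threshold in BOUNDARIES:
        if fraction_complete <= fraction:
            return "stop" if abs(z) >= threshold else "continue"
    return "stop" if abs(z) >= BOUNDARIES[-1][1] else "continue"
```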

A/B Results

{
  "ab_test_id": "ab-abc123",
  "status": "completed",
  "results": {
    "control": {
      "samples": 512,
      "error_rate": 0.102,
      "latency_p50_ms": 234,
      "latency_p99_ms": 890
    },
    "treatment": {
      "samples": 498,
      "error_rate": 0.051,
      "latency_p50_ms": 245,
      "latency_p99_ms": 920
    },
    "analysis": {
      "error_rate_reduction": 0.50,
      "z_score": 3.21,
      "p_value": 0.0007,
      "confidence_interval": [0.031, 0.071],
      "effect_size_cohens_h": 0.35
    },
    "decision": "WINNER",
    "recommendation": "Promote to 50% traffic"
  }
}

Dashboard View

Test Overview

| Metric | Control | Treatment | Diff | P-value |
| --- | --- | --- | --- | --- |
| Error Rate | 10.2% | 5.1% | -50% | 0.0007 |
| P50 Latency | 234ms | 245ms | +5% | 0.23 |
| P99 Latency | 890ms | 920ms | +3% | 0.45 |
| Cost/Request | $0.012 | $0.013 | +8% | 0.12 |

Time Series

Error rate over time showing control vs treatment with confidence bands.

Cumulative Results

Running totals showing convergence to final result.

Decision Rules

| Outcome | Condition | Action |
| --- | --- | --- |
| Winner | p < 0.05, treatment better | Ramp to next stage |
| Loser | p < 0.05, control better | Rollback |
| Inconclusive | p >= 0.05 | Extend test |
| Guardrail Fail | Secondary metric degraded | Rollback |
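The decision rules reduce to a small function. This is a sketch of the table's logic; the return labels and argument names are assumptions.

```python
def decide(p_value: float, treatment_better: bool,
           guardrails_ok: bool, alpha: float = 0.05) -> str:
    """Apply the decision rules: guardrail failures override everything."""
    if not guardrails_ok:
        return "ROLLBACK"                            # guardrail fail
    if p_value < alpha:
        return "RAMP" if treatment_better else "ROLLBACK"
    return "EXTEND"                                  # inconclusive
```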

Guardrail Metrics

Secondary metrics that must not degrade:

{
  "guardrails": {
    "latency_p99": {
      "max_increase_factor": 1.5
    },
    "cost_per_request": {
      "max_increase_factor": 1.25
    }
  }
}
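Evaluating these guardrails amounts to comparing each treatment metric against the control value scaled by its `max_increase_factor`. A minimal sketch, assuming metrics arrive as plain per-arm dictionaries:

```python
# Factors from the guardrail config above.
GUARDRAILS = {
    "latency_p99": 1.5,        # max_increase_factor
    "cost_per_request": 1.25,  # max_increase_factor
}

def guardrails_pass(control: dict, treatment: dict) -> bool:
    """True if no guarded metric grew beyond its max_increase_factor."""
    return all(
        treatment[metric] <= control[metric] * factor
        for metric, factor in GUARDRAILS.items()
    )
```

With the example results above (P99 890ms vs 920ms, cost $0.012 vs $0.013), both guardrails pass.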

Ramp Stages

Fixes progress through these stages, each requiring a passing statistical test:

5% -> 25% -> 50% -> 100%

Auto-rollback at any stage if:

  • Error rate increases >10% relative to baseline
  • P99 latency exceeds 2x baseline
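The two auto-rollback conditions above can be checked directly; the function name and signature are illustrative.

```python
def should_auto_rollback(baseline_error: float, current_error: float,
                         baseline_p99_ms: float, current_p99_ms: float) -> bool:
    """Rollback if error rate rises more than 10% relative to baseline,
    or P99 latency exceeds 2x baseline."""
    return (current_error > baseline_error * 1.10
            or current_p99_ms > baseline_p99_ms * 2.0)
```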

Next Steps