Canary

Initial 5% rollout for new fixes.

A canary deployment tests a fix on a small percentage of traffic before a wider rollout.

How It Works

  1. Deploy to 5% of matching traffic
  2. Collect minimum 100 samples
  3. Compare error rate to baseline
  4. Decide to proceed or rollback
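The four steps above can be sketched as a single decision function. This is a minimal illustration, not the product's implementation: `evaluate_canary` and its parameters are hypothetical names, and the defaults assume the 100-sample minimum and the 10% relative error-rate threshold described elsewhere on this page.

```python
def evaluate_canary(
    canary_errors: int,
    canary_total: int,
    baseline_error_rate: float,
    min_samples: int = 100,
    max_relative_increase: float = 0.10,
) -> str:
    """Return 'collecting', 'proceed', or 'rollback' (hypothetical helper)."""
    # Step 2: do not judge the canary until enough samples have arrived.
    if canary_total < min_samples:
        return "collecting"

    # Step 3: compare the canary error rate to the baseline.
    canary_error_rate = canary_errors / canary_total
    allowed = baseline_error_rate * (1 + max_relative_increase)

    # Step 4: proceed if within the allowed increase, otherwise roll back.
    return "proceed" if canary_error_rate <= allowed else "rollback"
```

For example, with a 10% baseline, `evaluate_canary(15, 100, 0.10)` returns `"rollback"`, while 85 samples of any quality returns `"collecting"`.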

Ramp Stages

Each fix goes through 4 stages, and each stage must meet its requirement before the fix advances:

5% -> 25% -> 50% -> 100%

| Stage | Traffic | Requirement |
| --- | --- | --- |
| Canary | 5% | Minimum 100 samples, error rate not worse |
| Ramp 1 | 25% | Statistical A/B test passes (p < 0.05) |
| Ramp 2 | 50% | Statistical A/B test continues passing |
| Graduate | 100% | Hold for 24 hours, then mark as graduated |
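
This page does not say which statistical test the ramp stages use. As one plausible reading, the sketch below runs a one-sided two-proportion z-test that asks whether treated traffic has a lower error rate than control traffic, and treats p < 0.05 as a pass. The function names, the choice of test, and the one-sided direction are all assumptions.

```python
import math


def ab_test_p_value(control_errors: int, control_total: int,
                    treated_errors: int, treated_total: int) -> float:
    """One-sided two-proportion z-test: is the treated error rate lower?"""
    p_control = control_errors / control_total
    p_treated = treated_errors / treated_total

    # Pooled error rate under the null hypothesis that both groups are equal.
    pooled = (control_errors + treated_errors) / (control_total + treated_total)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / control_total + 1 / treated_total)
    )
    if std_err == 0:
        return 1.0  # No variation at all, so no evidence of a difference.

    z = (p_control - p_treated) / std_err
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def stage_passes(p_value: float, alpha: float = 0.05) -> bool:
    """A stage passes when the p-value is below the significance level."""
    return p_value < alpha
```

With 100 errors in 1000 control requests and 50 errors in 1000 treated requests, the p-value is far below 0.05, so `stage_passes` returns `True`. Identical error rates give a p-value of 0.5.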

Canary Configuration

{
  "deployment_id": "deploy-abc123",
  "fix_id": "fix-xyz789",
  "canary": {
    "traffic_percentage": 5,
    "minimum_samples": 100,
    "duration_hours": 1,
    "success_criteria": {
      "max_error_rate_increase": 0.1,
      "max_latency_increase_factor": 1.5
    }
  }
}
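
A configuration like the one above could be loaded and sanity-checked before use. The sketch below is illustrative only: `CanaryConfig` and `parse_canary_config` are hypothetical names, and the validation ranges are assumptions rather than documented limits.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryConfig:
    traffic_percentage: int
    minimum_samples: int
    duration_hours: float
    max_error_rate_increase: float
    max_latency_increase_factor: float


def parse_canary_config(raw: str) -> CanaryConfig:
    """Parse the "canary" object from a deployment document."""
    canary = json.loads(raw)["canary"]
    criteria = canary["success_criteria"]
    config = CanaryConfig(
        traffic_percentage=canary["traffic_percentage"],
        minimum_samples=canary["minimum_samples"],
        duration_hours=canary["duration_hours"],
        max_error_rate_increase=criteria["max_error_rate_increase"],
        max_latency_increase_factor=criteria["max_latency_increase_factor"],
    )
    # Basic sanity checks; these ranges are assumptions, not documented limits.
    if not 0 < config.traffic_percentage <= 100:
        raise ValueError("traffic_percentage must be in (0, 100]")
    if config.minimum_samples <= 0:
        raise ValueError("minimum_samples must be positive")
    return config
```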

Traffic Splitting

Traffic is split deterministically by session hash:

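import hashlib

def should_apply_fix(session_id: str, percentage: int) -> bool:
    # hash() is salted per process; a SHA-256 digest gives a stable bucket.
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return bucket < percentage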

Benefits:

  • Same session always gets same treatment
  • No session inconsistency
  • Reproducible debugging
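
One consequence of the `bucket < percentage` rule is that ramp stages nest: a session that receives the fix at 5% keeps receiving it at 25%, 50%, and 100%. The sketch below checks this property. It derives the bucket from a SHA-256 digest because Python's built-in `hash()` is salted per process for strings; `stable_bucket` and `stages_including` are hypothetical helper names.

```python
import hashlib

RAMP_PERCENTAGES = [5, 25, 50, 100]


def stable_bucket(session_id: str) -> int:
    """Map a session ID to a bucket in 0..99 that is stable across processes."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def stages_including(session_id: str) -> list[int]:
    """Return the ramp percentages at which this session receives the fix."""
    bucket = stable_bucket(session_id)
    return [pct for pct in RAMP_PERCENTAGES if bucket < pct]


# Once a session is included at some stage, it is included at every later one,
# so the result is always a suffix of RAMP_PERCENTAGES.
for i in range(1000):
    included = stages_including(f"session-{i}")
    assert included == RAMP_PERCENTAGES[len(RAMP_PERCENTAGES) - len(included):]
```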

Success Criteria

Error Rate

Baseline error rate: 10%
Canary error rate: 11%
Increase: 1 percentage point (10% relative)

Pass: Yes (does not exceed the 10% relative increase threshold)
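
The example above sits exactly on the 10% boundary, where floating-point rounding could tip the comparison either way. A minimal sketch, assuming error rates are derived from integer request counts, is to compute the relative increase exactly with `fractions.Fraction`. The function names and the sample counts are illustrative.

```python
from fractions import Fraction


def relative_increase(baseline_errors: int, baseline_total: int,
                      canary_errors: int, canary_total: int) -> Fraction:
    """Relative change in error rate, computed exactly from integer counts."""
    baseline = Fraction(baseline_errors, baseline_total)
    canary = Fraction(canary_errors, canary_total)
    return (canary - baseline) / baseline


def error_rate_passes(increase: Fraction,
                      max_increase: Fraction = Fraction(1, 10)) -> bool:
    # Fail only when the increase is strictly greater than the threshold.
    return increase <= max_increase


# Baseline 10% (100 of 1000), canary 11% (11 of 100).
increase = relative_increase(100, 1000, 11, 100)
print(increase)                     # 1/10
print(error_rate_passes(increase))  # True
```

Note that a baseline with zero errors makes the relative increase undefined, so this sketch would raise `ZeroDivisionError` in that case.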

Latency

Baseline P99: 500ms
Canary P99: 600ms
Increase factor: 1.2x

Pass: Yes (under 2x baseline)
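
To illustrate the latency check, the sketch below computes P99 with the nearest-rank method and compares the two values. The percentile method is an assumption, since this page does not specify one, and the function names are hypothetical. The default factor of 2.0 follows the threshold quoted here; the sample configuration earlier uses 1.5 instead.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_passes(baseline_ms: list[float], canary_ms: list[float],
                   max_factor: float = 2.0) -> bool:
    """Pass when canary P99 is at most max_factor times baseline P99."""
    baseline_p99 = percentile_nearest_rank(baseline_ms, 99)
    canary_p99 = percentile_nearest_rank(canary_ms, 99)
    return canary_p99 / baseline_p99 <= max_factor
```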

Minimum Samples

Wait for statistical confidence:

Current samples: 85
Minimum required: 100

Status: Collecting (85%)
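
The progress figure above is simply the current sample count as a percentage of the minimum. A small sketch follows; `sample_status` is a hypothetical helper, and the "Ready" label is an assumption, since this page only shows the collecting state.

```python
def sample_status(current: int, minimum: int = 100) -> str:
    """Format collection progress in the style shown above."""
    if current >= minimum:
        return "Ready"
    percent = current * 100 // minimum  # Integer percentage, rounded down.
    return f"Collecting ({percent}%)"


print(sample_status(85))   # Collecting (85%)
print(sample_status(100))  # Ready
```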

Canary Dashboard

View canary deployments:

| Metric | Baseline | Canary | Status |
| --- | --- | --- | --- |
| Error Rate | 10.2% | 9.8% | Good |
| P50 Latency | 234ms | 245ms | Good |
| P99 Latency | 890ms | 920ms | Good |
| Samples | 1000 | 52 | Collecting |

Canary Events

Timeline of events:

10:00 - Canary started (5% traffic)
10:15 - 50 samples collected
10:30 - 100 samples collected
10:31 - Criteria checked: PASS
10:31 - Promoted to ramp stage (25%)

Failure Handling

If canary fails:

  1. Automatic rollback to 0% traffic
  2. Alert sent to team
  3. Analysis of failure metrics
  4. Log failure reason

Example timeline of a failed canary:

10:00 - Canary started (5% traffic)
10:15 - 50 samples collected
10:30 - 100 samples collected
10:31 - Criteria checked: FAIL
        Error rate: 15% (50% increase)
10:31 - Rolled back to 0%
10:31 - Alert sent
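
The automated parts of the failure path (rollback, alert, and logging) can be sketched as a small handler. Everything here is hypothetical: `CanaryState`, `handle_failure`, and the `notify` callback are illustrative names, and the analysis step is left to whoever receives the alert.

```python
from dataclasses import dataclass, field


@dataclass
class CanaryState:
    traffic_percentage: int
    events: list[str] = field(default_factory=list)


def handle_failure(state: CanaryState, reason: str, notify=print) -> CanaryState:
    """Roll back, alert, and log the reason for a failed canary."""
    state.traffic_percentage = 0                      # Roll back to 0% traffic.
    state.events.append("Rolled back to 0%")
    notify(f"Canary failed: {reason}")                # Alert the team.
    state.events.append("Alert sent")
    state.events.append(f"Failure reason: {reason}")  # Log the failure reason.
    return state
```

Passing a custom `notify` callable (for example, a list's `append` method in tests) keeps the alert channel separate from the rollback logic.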

Auto-Rollback Thresholds

| Condition | Threshold |
| --- | --- |
| Error rate increase | >10% relative to baseline |
| P99 latency increase | >2x baseline |

Manual Canary

Start a deployment via the API:

curl -X POST "https://app.risicare.ai/v1/deployments" \
  -H "Authorization: Bearer rsk-..." \
  -H "Content-Type: application/json" \
  -d '{"fix_id": "fix-xyz789"}'
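
The same request can be built from Python with the standard library. This sketch only constructs the request and does not send it. The `Content-Type` header is an assumption about what the API expects, `build_deployment_request` is a hypothetical helper, and the response format is not described on this page.

```python
import json
import urllib.request

API_URL = "https://app.risicare.ai/v1/deployments"


def build_deployment_request(fix_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"fix_id": fix_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
    )
```

To send it, pass the returned object to `urllib.request.urlopen`, supplying your own API key in place of the elided `rsk-...` value.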
