Canary
Initial 5% rollout for new fixes.
Canary deployment tests fixes on a small percentage of traffic before wider rollout.
How It Works
- Deploy to 5% of matching traffic
- Collect minimum 100 samples
- Compare error rate to baseline
- Decide to proceed or rollback
Ramp Stages
Each fix goes through 4 stages, and each stage requires passing a statistical test before advancing:
5% -> 25% -> 50% -> 100%
| Stage | Traffic | Requirement |
|---|---|---|
| Canary | 5% | Minimum 100 samples, error rate not worse |
| Ramp 1 | 25% | Statistical A/B test passes (p < 0.05) |
| Ramp 2 | 50% | Statistical A/B test continues passing |
| Graduate | 100% | Hold for 24 hours, then mark as graduated |
Canary Configuration
{
"deployment_id": "deploy-abc123",
"fix_id": "fix-xyz789",
"canary": {
"traffic_percentage": 5,
"minimum_samples": 100,
"duration_hours": 1,
"success_criteria": {
"max_error_rate_increase": 0.1,
"max_latency_increase_factor": 1.5
}
}
}Traffic Splitting
Traffic is split deterministically by session hash:
def should_apply_fix(session_id: str, percentage: int) -> bool:
bucket = hash(session_id) % 100
return bucket < percentageBenefits:
- Same session always gets same treatment
- No session inconsistency
- Reproducible debugging
Success Criteria
Error Rate
Baseline error rate: 10%
Canary error rate: 11%
Increase: 1% (10% relative)
Pass: Yes (under 10% relative increase)
Latency
Baseline P99: 500ms
Canary P99: 600ms
Increase factor: 1.2x
Pass: Yes (under 2x baseline)
Minimum Samples
Wait for statistical confidence:
Current samples: 85
Minimum required: 100
Status: Collecting (85%)
Canary Dashboard
View canary deployments:
| Metric | Baseline | Canary | Status |
|---|---|---|---|
| Error Rate | 10.2% | 9.8% | Good |
| P50 Latency | 234ms | 245ms | Good |
| P99 Latency | 890ms | 920ms | Good |
| Samples | 1000 | 52 | Collecting |
Canary Events
Timeline of events:
10:00 - Canary started (5% traffic)
10:15 - 50 samples collected
10:30 - 100 samples collected
10:31 - Criteria checked: PASS
10:31 - Promoted to ramp stage (25%)
Failure Handling
If canary fails:
- Automatic rollback to 0% traffic
- Alert sent to team
- Analysis of failure metrics
- Log failure reason
10:00 - Canary started (5% traffic)
10:15 - 50 samples collected
10:30 - 100 samples collected
10:31 - Criteria checked: FAIL
Error rate: 15% (50% increase)
10:31 - Rolled back to 0%
10:31 - Alert sent
Auto-Rollback Thresholds
| Condition | Threshold |
|---|---|
| Error rate increase | >10% relative to baseline |
| P99 latency increase | >2x baseline |
Manual Canary
Start a deployment via the API:
curl -X POST "https://app.risicare.ai/v1/deployments" \
-H "Authorization: Bearer rsk-..." \
-d '{"fix_id": "fix-xyz789"}'