
Rollback

Instant rollback when fixes fail.

Risicare provides instant rollback to protect your system from bad fixes.

Automatic Rollback

Fixes are automatically rolled back when:

| Trigger | Threshold | Speed |
| --- | --- | --- |
| Error rate increase | >10% relative | Instant |
| P99 latency increase | >2x baseline | Instant |
| A/B test fails | p < 0.05 (treatment worse) | Instant |
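As a rough illustration of how the three triggers in the table could be evaluated, here is a minimal Python sketch. It is not Risicare's implementation: the `Metrics` type, the `rollback_reason` function, and the reason strings `latency_exceeded` and `ab_test_failed` are hypothetical names chosen for this example (only `error_rate_exceeded` appears in the documented rollback event), and the handling of an error-free baseline is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests, e.g. 0.02
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def rollback_reason(baseline: Metrics, treatment: Metrics,
                    p_value: Optional[float] = None,
                    treatment_worse: bool = False) -> Optional[str]:
    """Return the first trigger that fires, or None if the fix may stay."""
    # Trigger 1: error rate more than 10% higher, relative to baseline.
    if baseline.error_rate > 0:
        relative = (treatment.error_rate - baseline.error_rate) / baseline.error_rate
        if relative > 0.10:
            return "error_rate_exceeded"
    elif treatment.error_rate > 0:
        # Assumption: any errors against an error-free baseline count as an increase.
        return "error_rate_exceeded"

    # Trigger 2: P99 latency more than 2x baseline.
    if treatment.p99_latency_ms > 2 * baseline.p99_latency_ms:
        return "latency_exceeded"  # hypothetical reason string

    # Trigger 3: A/B test shows the treatment is significantly worse.
    if p_value is not None and p_value < 0.05 and treatment_worse:
        return "ab_test_failed"  # hypothetical reason string

    return None


if __name__ == "__main__":
    print(rollback_reason(Metrics(0.10, 100.0), Metrics(0.15, 100.0)))
```

The thresholds are hard-coded to the defaults from the table; in practice they would come from the rollback configuration.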

Rollback Speed

Target: under 500ms

How it works:

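  1. Redis update (10ms): The routing config is updated
  2. SDK notification (optional): A push invalidation is sent to SDKs
  3. SDK poll (60s max): SDKs that did not receive a push pick up the change at their regular refresh interval
  4. Effective: Once an SDK has the new config, its next request uses the baseline

For critical rollbacks, push invalidation makes the change take effect immediately instead of waiting up to 60s for the next poll.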
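The interaction between polling and push invalidation can be sketched as a small client-side cache. This is a simplified illustration, not the Risicare SDK: `RoutingCache`, its `fetch` callback, and the `variant` key are hypothetical, and the cache refreshes lazily on the next request rather than on a background timer.

```python
import time


class RoutingCache:
    """Client-side cache of routing config, refreshed by polling or push."""

    def __init__(self, fetch, poll_interval_s=60.0, clock=time.monotonic):
        self._fetch = fetch                  # callable returning the current routing config
        self._poll_interval_s = poll_interval_s
        self._clock = clock                  # injectable so the behavior can be tested
        self._config = None
        self._fetched_at = None              # None means "no valid copy cached"

    def invalidate(self):
        """Handle a push invalidation: force a refetch on the next request."""
        self._fetched_at = None

    def get(self):
        """Return the routing config, refetching if the cached copy is stale."""
        now = self._clock()
        stale = (self._fetched_at is None
                 or now - self._fetched_at >= self._poll_interval_s)
        if stale:
            self._config = self._fetch()
            self._fetched_at = now
        return self._config


if __name__ == "__main__":
    server_state = {"variant": "treatment"}  # stand-in for the config held in Redis
    cache = RoutingCache(lambda: dict(server_state))
    print(cache.get())                       # served from a fresh fetch
    server_state["variant"] = "baseline"     # a rollback updates the routing config
    print(cache.get())                       # still cached, up to 60s old
    cache.invalidate()                       # push invalidation arrives
    print(cache.get())                       # next request uses the baseline
```

Without the `invalidate()` call, the cached treatment config would keep being served until the poll interval elapsed, which is the 60s worst case described above.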

Manual Rollback

Via Dashboard

  1. Navigate to Healing -> Deployments
  2. Find the deployment
  3. Click "Rollback"
  4. Confirm

Via API

Rollback a deployment by sending a DELETE request:

curl -X DELETE "https://app.risicare.ai/v1/deployments/{id}" \
  -H "Authorization: Bearer rsk-..."
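The same request can be issued from Python using only the standard library. This is a minimal sketch: the helper names `build_rollback_request` and `rollback` are hypothetical, and because the response body format is not documented here, the function returns only the HTTP status code.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://app.risicare.ai/v1"


def build_rollback_request(deployment_id: str, api_key: str) -> urllib.request.Request:
    """Build the DELETE request that rolls back one deployment."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    safe_id = urllib.parse.quote(deployment_id, safe="")
    return urllib.request.Request(
        url=f"{BASE_URL}/deployments/{safe_id}",
        method="DELETE",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def rollback(deployment_id: str, api_key: str, timeout_s: float = 10.0) -> int:
    """Send the rollback request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = build_rollback_request(deployment_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

Usage would look like `rollback("deploy-abc123", api_key)`, with the API key read from an environment variable or secret store rather than hard-coded.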

Deployment API

Four endpoints manage the full deployment lifecycle:

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /v1/deployments | List all deployments |
| GET | /v1/deployments/{id} | Get deployment detail |
| POST | /v1/deployments | Create a new deployment |
| DELETE | /v1/deployments/{id} | Rollback a deployment |

Deployment Management

All deployment state transitions (ramping, graduating) are handled automatically by the system based on statistical tests. There are no separate pause, resume, or graduate endpoints.

Deployment States

| State | Description |
| --- | --- |
| pending | Deployment created, not yet started |
| active | Live and serving traffic |
| ramping | Traffic percentage increasing through stages |
| graduated | Fix reached 100% and held for 24 hours |
| rolled_back | Deployment reverted |
| failed | Unrecoverable error during deployment |

Rollback Events

{
  "event": "rollback",
  "deployment_id": "deploy-abc123",
  "fix_id": "fix-xyz789",
  "timestamp": "2024-01-15T10:30:00Z",
  "trigger": "automatic",
  "reason": "error_rate_exceeded",
  "metrics": {
    "baseline_error_rate": 0.10,
    "treatment_error_rate": 0.15,
    "increase_percentage": 50
  },
  "duration_ms": 234
}
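A consumer of this event might parse it and cross-check the reported increase against the raw rates. The sketch below assumes the payload arrives as a JSON string with exactly the fields shown above and that the metrics are error rates; `relative_increase_pct` and `summarize` are hypothetical helper names.

```python
import json

# Sample payload copied from the event above.
EVENT_JSON = """
{
  "event": "rollback",
  "deployment_id": "deploy-abc123",
  "fix_id": "fix-xyz789",
  "timestamp": "2024-01-15T10:30:00Z",
  "trigger": "automatic",
  "reason": "error_rate_exceeded",
  "metrics": {
    "baseline_error_rate": 0.10,
    "treatment_error_rate": 0.15,
    "increase_percentage": 50
  },
  "duration_ms": 234
}
"""


def relative_increase_pct(metrics: dict) -> float:
    """Relative error-rate increase in percent, rounded to absorb float noise."""
    baseline = metrics["baseline_error_rate"]
    treatment = metrics["treatment_error_rate"]
    return round((treatment - baseline) / baseline * 100, 6)


def summarize(event: dict) -> str:
    """One-line summary suitable for a log line or alert."""
    pct = relative_increase_pct(event["metrics"])
    return (f'{event["deployment_id"]}: {event["trigger"]} rollback '
            f'({event["reason"]}, error rate +{pct:.0f}%) in {event["duration_ms"]}ms')


event = json.loads(EVENT_JSON)
print(summarize(event))
```

Note that `increase_percentage` is relative: going from 0.10 to 0.15 is a 50% increase, not 5 percentage points.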

Rollback History

View rollback history:

| Time | Fix | Trigger | Reason |
| --- | --- | --- | --- |
| 10:30 | fix-abc | Automatic | Error rate +50% |
| 09:15 | fix-xyz | Manual | Customer report |
| Yesterday | fix-123 | Automatic | Latency 2.5x |

Post-Rollback Analysis

After rollback:

  1. Alert sent to team
  2. Diagnosis triggered on new errors
  3. Fix marked as failed
  4. Learning recorded for future fixes

Preventing Bad Deployments

Canary First

All fixes go through a canary stage (5% of traffic) before wider rollout.

Gradual Ramp

5% -> 25% -> 50% -> 100%

Each stage requires passing a statistical A/B test.
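The page does not say which statistical test is used, so the following is only one plausible sketch of a stage gate: a one-sided two-proportion z-test on error counts, with a rollback signalled when the treatment is significantly worse (p < 0.05). The function names, the choice of test, and the rule that anything not significantly worse advances are all assumptions made for illustration.

```python
import math

STAGES = [5, 25, 50, 100]  # traffic percentages from the ramp above


def treatment_worse_p_value(base_errors: int, base_total: int,
                            treat_errors: int, treat_total: int) -> float:
    """One-sided p-value for H1: treatment error rate > baseline error rate."""
    p_base = base_errors / base_total
    p_treat = treat_errors / treat_total
    pooled = (base_errors + treat_errors) / (base_total + treat_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / treat_total))
    if se == 0:
        return 1.0  # identical all-success or all-failure samples: no evidence
    z = (p_treat - p_base) / se
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def next_stage(current_pct: int, p_value: float, alpha: float = 0.05):
    """Return the next traffic percentage, or None to signal a rollback."""
    if p_value < alpha:
        return None  # treatment significantly worse: roll back
    index = STAGES.index(current_pct)
    return STAGES[min(index + 1, len(STAGES) - 1)]


if __name__ == "__main__":
    p = treatment_worse_p_value(100, 1000, 150, 1000)
    print(next_stage(5, p))
```

A production gate would likely also require a minimum sample size before deciding, and might hold the current stage rather than advance when the evidence is inconclusive.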

Guardrails

Secondary metrics must not degrade even if primary improves.

Rollback Configuration

Customize rollback thresholds:

{
  "deployment_config": {
    "rollback_thresholds": {
      "error_rate_increase": 0.05,
      "latency_increase_factor": 1.25
    },
    "rollback_delay_seconds": 0,
    "require_manual_for_graduated": true
  }
}
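A client that builds this configuration may want to validate it before sending. The sketch below loads the two thresholds into a typed object; the `RollbackThresholds` class and `load_thresholds` function are hypothetical, the defaults are taken from the automatic rollback triggers (10% relative error-rate increase, 2x latency), and reading `error_rate_increase` as a relative fraction is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackThresholds:
    error_rate_increase: float = 0.10     # assumed to be a relative fraction (0.10 = +10%)
    latency_increase_factor: float = 2.0  # multiple of baseline P99 latency


def load_thresholds(raw_json: str) -> RollbackThresholds:
    """Parse a deployment config and return validated rollback thresholds."""
    config = json.loads(raw_json).get("deployment_config", {})
    overrides = config.get("rollback_thresholds", {})
    known = ("error_rate_increase", "latency_increase_factor")
    # Unknown keys are ignored; missing keys fall back to the defaults.
    thresholds = RollbackThresholds(
        **{key: float(overrides[key]) for key in known if key in overrides}
    )
    if thresholds.error_rate_increase <= 0:
        raise ValueError("error_rate_increase must be positive")
    if thresholds.latency_increase_factor <= 1:
        raise ValueError("latency_increase_factor must be greater than 1")
    return thresholds


if __name__ == "__main__":
    raw = """
    {"deployment_config": {"rollback_thresholds":
        {"error_rate_increase": 0.05, "latency_increase_factor": 1.25}}}
    """
    print(load_thresholds(raw))
```

With the values shown in the configuration above, rollback would trigger on a smaller error-rate increase and a smaller latency regression than the defaults.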

Recovery After Rollback

To retry a rolled-back fix:

  1. Analyze failure reason
  2. Modify fix configuration
  3. Create new fix version
  4. Deploy from canary

Rolled-back fixes cannot be directly re-deployed.
