Outlier Detection

Outlier detection automatically ejects endpoints that return consecutive errors, without requiring active health probes. It watches real traffic and reacts to actual failures, removing bad pods from the pool before they cause further user-visible impact.

Configuration

{
  "options": {
    "outlierDetection": {
      "consecutive5xx": 5,
      "consecutiveGatewayErrors": 3,
      "interval": "10s",
      "baseEjectionTime": "30s",
      "maxEjectionPercent": 10
    }
  }
}

All fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| consecutive5xx | number | 5 | Consecutive 5xx responses to trigger ejection |
| consecutiveGatewayErrors | number | 0 | Consecutive 502/503/504 responses to trigger ejection (0 = disabled) |
| interval | string | 10s | How often ejection conditions are evaluated |
| baseEjectionTime | string | 30s | How long an endpoint stays ejected (first ejection) |
| maxEjectionPercent | number | 10 | Max percentage of endpoints that can be ejected simultaneously |
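
As a rough illustration of how the defaults listed above combine with a partial config, here is a minimal Python sketch. The helper name `resolve_outlier_detection` is hypothetical and not part of any real API. It only shows the merge of user-supplied fields over the documented defaults.

```python
# Defaults copied from the field list above.
DEFAULTS = {
    "consecutive5xx": 5,
    "consecutiveGatewayErrors": 0,
    "interval": "10s",
    "baseEjectionTime": "30s",
    "maxEjectionPercent": 10,
}


def resolve_outlier_detection(config: dict) -> dict:
    """Hypothetical helper: fill omitted fields with the documented defaults."""
    user = config.get("options", {}).get("outlierDetection", {})
    unknown = set(user) - set(DEFAULTS)
    if unknown:
        # Reject misspelled field names instead of silently ignoring them.
        raise ValueError(f"unknown outlierDetection fields: {sorted(unknown)}")
    return {**DEFAULTS, **user}
```

For example, a config that sets only consecutive5xx to 3 would resolve with interval 10s, baseEjectionTime 30s, and maxEjectionPercent 10 under this sketch.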

Examples

Basic outlier detection

{
  "options": {
    "outlierDetection": {
      "consecutive5xx": 5,
      "baseEjectionTime": "30s"
    }
  }
}

Ejects an endpoint after 5 consecutive 5xx errors. The endpoint stays ejected for 30s on its first ejection, then is restored to the pool.

Aggressive detection (critical path)

{
  "options": {
    "outlierDetection": {
      "consecutive5xx": 3,
      "consecutiveGatewayErrors": 2,
      "interval": "5s",
      "baseEjectionTime": "60s",
      "maxEjectionPercent": 30
    }
  }
}

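Detects failures faster: conditions are checked every 5s, and an endpoint trips after 3 consecutive 5xx errors or 2 consecutive gateway errors. Up to 30% of endpoints can be ejected at once, and the base ejection time is longer (60s). Use this for critical payment or auth paths where a bad pod is worse than reduced capacity.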

Conservative detection (large pool)

{
  "options": {
    "outlierDetection": {
      "consecutive5xx": 10,
      "interval": "30s",
      "baseEjectionTime": "15s",
      "maxEjectionPercent": 5
    }
  }
}

Tolerates more errors (10 in a row) before ejecting. The short 15s base ejection time returns recovered pods to the pool quickly. The low maximum ejection percentage (5%) protects large pools from cascading ejections.

Gateway errors only

{
  "options": {
    "outlierDetection": {
      "consecutive5xx": 0,
      "consecutiveGatewayErrors": 3,
      "baseEjectionTime": "30s"
    }
  }
}

Ejects only on 502/503/504 and ignores application-level 500 errors; consecutive5xx is set to 0 to turn that check off. Useful when your backend returns 500 for business logic errors that shouldn’t trigger ejection.

How ejection works

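An endpoint is ejected when it returns consecutive5xx 5xx responses in a row, or consecutiveGatewayErrors gateway errors in a row if that check is enabled. These conditions are evaluated every interval. The ejection duration grows with the number of times the endpoint has been ejected. With baseEjectionTime set to 30s: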

First ejection:  baseEjectionTime × 1 = 30s
Second ejection: baseEjectionTime × 2 = 60s
Third ejection:  baseEjectionTime × 3 = 90s

After the ejection expires, the endpoint is restored to the pool. If it fails again, it’s ejected for longer. A single successful response resets the consecutive error counter.
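
The escalation and counter reset described above can be sketched in Python as follows. This is an illustrative model, not the actual implementation. The class `EndpointState` and function `parse_seconds` are hypothetical names. The sketch also assumes that any non-5xx response counts as a success and that the error streak restarts after an ejection, neither of which is stated in this section.

```python
from dataclasses import dataclass


def parse_seconds(duration: str) -> int:
    """Parse a duration such as "30s"; only a seconds suffix is handled in this sketch."""
    if not duration.endswith("s") or not duration[:-1].isdigit():
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(duration[:-1])


@dataclass
class EndpointState:
    consecutive_errors: int = 0  # 5xx responses seen in a row
    ejection_count: int = 0      # how many times this endpoint has been ejected

    def record(self, status: int) -> None:
        """Count a 5xx response; any other response resets the streak (assumption)."""
        if 500 <= status <= 599:
            self.consecutive_errors += 1
        else:
            self.consecutive_errors = 0

    def should_eject(self, consecutive_5xx: int) -> bool:
        """True once the streak reaches the threshold; 0 is treated as disabled."""
        return consecutive_5xx > 0 and self.consecutive_errors >= consecutive_5xx

    def eject(self, base_ejection_time: str) -> int:
        """Return the ejection duration in seconds: baseEjectionTime x ejection count."""
        self.ejection_count += 1
        self.consecutive_errors = 0  # assumption: the streak restarts after ejection
        return parse_seconds(base_ejection_time) * self.ejection_count
```

Calling `eject("30s")` three times on the same `EndpointState` yields 30, 60, and 90 seconds, matching the progression above. The sketch never lowers `ejection_count`; whether the real system decays it after a period of healthy behavior is not covered here.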

The maxEjectionPercent is a safety valve — at least 100 - maxEjectionPercent percent of the pool always stays active, even if all endpoints are failing.
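
To make the safety valve concrete, the following Python sketch computes how many endpoints could be ejected at once for a given pool size. The function name `max_ejectable` is hypothetical, and rounding down is an assumption; this section does not say how fractional endpoints are handled.

```python
def max_ejectable(pool_size: int, max_ejection_percent: int) -> int:
    """Upper bound on simultaneously ejected endpoints, rounding down (assumed)."""
    if pool_size < 0 or not 0 <= max_ejection_percent <= 100:
        raise ValueError("pool_size must be >= 0 and percent must be in 0..100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return pool_size * max_ejection_percent // 100
```

Under this assumption, a pool of 40 endpoints with the default of 10 allows at most 4 ejections, while a pool of 5 allows none. Whether the real system always permits at least one ejection in small pools is worth verifying before relying on outlier detection there.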

vs Health Checks

| | Health Checks | Outlier Detection |
| --- | --- | --- |
| How | Active probes (extra HTTP requests) | Passive (watches real traffic responses) |
| Detects | Total failures (crashed, unreachable) | Degraded performance (errors under load) |
| Cost | Extra HTTP requests per endpoint per interval | Zero (piggybacks on real traffic) |
| Best for | Backend may be silently broken | Backend returns errors under load |

Use both for defense in depth.

Monitoring

With collect.endpoint: true: