Outlier Detection
Outlier detection automatically ejects endpoints that return consecutive errors — without requiring active health probes. It watches real traffic and reacts to actual failures, removing bad pods from the pool before they cause user-visible impact.
Configuration
{
"options": {
"outlierDetection": {
"consecutive5xx": 5,
"consecutiveGatewayErrors": 3,
"interval": "10s",
"baseEjectionTime": "30s",
"maxEjectionPercent": 10
}
}
}
All fields
| Field | Type | Default | Description |
|---|---|---|---|
consecutive5xx | number | 5 | Consecutive 5xx responses to trigger ejection |
consecutiveGatewayErrors | number | 0 | Consecutive 502/503/504 to trigger ejection (0 = disabled) |
interval | string | 10s | How often ejection conditions are evaluated |
baseEjectionTime | string | 30s | How long an endpoint stays ejected (first ejection) |
maxEjectionPercent | number | 10 | Max percentage of endpoints that can be ejected simultaneously |
Examples
Basic outlier detection
{
"options": {
"outlierDetection": {
"consecutive5xx": 5,
"baseEjectionTime": "30s"
}
}
}
Ejects endpoints after 5 consecutive 5xx errors. Ejected for 30s, then restored.
Aggressive detection (critical path)
{
"options": {
"outlierDetection": {
"consecutive5xx": 3,
"consecutiveGatewayErrors": 2,
"interval": "5s",
"baseEjectionTime": "60s",
"maxEjectionPercent": 30
}
}
}
Faster detection (check every 5s, trip at 3 errors). Allows ejecting up to 30% of endpoints. Longer ejection time. Use for critical payment or auth paths where a bad pod is worse than reduced capacity.
Conservative detection (large pool)
{
"options": {
"outlierDetection": {
"consecutive5xx": 10,
"interval": "30s",
"baseEjectionTime": "15s",
"maxEjectionPercent": 5
}
}
}
Tolerates more errors before ejecting. Short ejection time gives pods a chance to recover quickly. Low max ejection percentage protects large pools from cascading ejections.
Gateway errors only
{
"options": {
"outlierDetection": {
"consecutive5xx": 0,
"consecutiveGatewayErrors": 3,
"baseEjectionTime": "30s"
}
}
}
Only ejects on 502/503/504 — ignores application-level 500 errors. Useful when your backend returns 500 for business logic errors that shouldn’t trigger ejection.
How ejection works
When an endpoint accumulates consecutive5xx errors in a row (checked every interval), it’s ejected. The ejection duration increases with each consecutive ejection:
First ejection: baseEjectionTime × 1 = 30s
Second ejection: baseEjectionTime × 2 = 60s
Third ejection: baseEjectionTime × 3 = 90s
After the ejection expires, the endpoint is restored to the pool. If it fails again, it’s ejected for longer. A single successful response resets the consecutive error counter.
The maxEjectionPercent is a safety valve — at least 100 - maxEjectionPercent percent of the pool always stays active, even if all endpoints are failing.
vs Health Checks
| Health Checks | Outlier Detection | |
|---|---|---|
| How | Active probes (extra HTTP requests) | Passive (watches real traffic responses) |
| Detects | Total failures (crashed, unreachable) | Degraded performance (errors under load) |
| Cost | Extra HTTP requests per endpoint per interval | Zero — piggybacks on real traffic |
| Best for | Backend may be silently broken | Backend returns errors under load |
Use both for defense in depth.
Monitoring
With collect.endpoint: true:
vrata_endpoint_healthy—0when ejected,1when healthyvrata_endpoint_consecutive_5xx— current error streak length per endpoint