Health Checks

Active health checks send periodic HTTP probes to each endpoint. Unhealthy endpoints are removed from the load balancing pool until they recover. This prevents traffic from being sent to pods that are crashed, starting up, or otherwise unable to serve requests.

Configuration

{
  "options": {
    "healthCheck": {
      "path": "/health",
      "interval": "10s",
      "timeout": "5s",
      "unhealthyThreshold": 3,
      "healthyThreshold": 2
    }
  }
}

All fields

FieldTypeDefaultDescription
pathstringrequiredHTTP path for the probe (GET request)
intervalstring10sHow often each endpoint is probed
timeoutstring5sMax time to wait for a probe response
unhealthyThresholdnumber3Consecutive failures before marking unhealthy
healthyThresholdnumber2Consecutive successes before marking healthy again

A response with status 200-399 is a success. Anything else (including timeouts and connection failures) is a failure.

Examples

Basic health check

{
  "options": {
    "healthCheck": {
      "path": "/health"
    }
  }
}

Probes every 10s, marks unhealthy after 3 failures, marks healthy after 2 successes. Good defaults for most services.

Fast failure detection

{
  "options": {
    "healthCheck": {
      "path": "/health",
      "interval": "3s",
      "timeout": "2s",
      "unhealthyThreshold": 2,
      "healthyThreshold": 1
    }
  }
}

Detects failures within 6 seconds (2 failed probes × 3s interval). Good for critical services where you need fast failover.

Slow warm-up services

{
  "options": {
    "healthCheck": {
      "path": "/ready",
      "interval": "5s",
      "timeout": "10s",
      "unhealthyThreshold": 6,
      "healthyThreshold": 3
    }
  }
}

Tolerates longer startup times. Uses /ready instead of /health to check actual readiness (loaded caches, connected to DB, etc.).

External service (longer intervals)

{
  "options": {
    "healthCheck": {
      "path": "/ping",
      "interval": "30s",
      "timeout": "10s",
      "unhealthyThreshold": 3,
      "healthyThreshold": 2
    }
  }
}

For third-party APIs where you don’t control the backend and don’t want to send too many probes.

How it works

Vrata sends GET <scheme>://<endpoint-host>:<endpoint-port><path> at the configured interval. If the destination uses TLS to upstream, the probe also uses TLS.

The health check runs independently per endpoint. An endpoint is removed from the pool after unhealthyThreshold consecutive failures and restored after healthyThreshold consecutive successes. The hysteresis prevents flapping — a single failed probe doesn’t remove an endpoint.

vs Outlier Detection

Health ChecksOutlier Detection
HowActive probes (extra HTTP requests)Passive (watches real traffic responses)
DetectsTotal failures (crashed, unreachable)Degraded performance (slow responses, 5xx errors)
CostExtra HTTP requests per endpoint per intervalZero overhead
Best forBackend may be silently brokenBackend returns errors under load

Use both for defense in depth: health checks catch silent failures (backend accepting connections but not serving), outlier detection catches degraded performance in real traffic.

Monitoring

The vrata_endpoint_healthy gauge (requires collect.endpoint: true) shows the current health state:

ValueMeaning
1Healthy (in the pool)
0Unhealthy (ejected)