Health Checks

Active health checks send periodic HTTP probes to each endpoint. Unhealthy endpoints are removed from the load balancing pool until they recover. This keeps traffic away from pods that have crashed, are still starting up, or are otherwise unable to serve requests.

Configuration

{
  "options": {
    "healthCheck": {
      "path": "/health",
      "interval": "10s",
      "timeout": "5s",
      "unhealthyThreshold": 3,
      "healthyThreshold": 2
    }
  }
}

All fields

Field	Type	Default	Description
`path`	string	required	HTTP path for the probe (`GET` request)
`interval`	string	`10s`	How often each endpoint is probed
`timeout`	string	`5s`	Max time to wait for a probe response
`unhealthyThreshold`	number	`3`	Consecutive failures before marking unhealthy
`healthyThreshold`	number	`2`	Consecutive successes before marking healthy again

A response with status 200-399 is a success. Anything else (including timeouts and connection failures) is a failure.
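
The success rule can be written as a small classifier. This is an illustrative Python sketch, not Vrata's implementation: `probe_succeeded` is a hypothetical name, and `None` stands in for a probe that got no response (timeout or connection failure).

```python
from typing import Optional


def probe_succeeded(status: Optional[int]) -> bool:
    """Return True when a probe counts as a success.

    `status` is the HTTP status code of the probe response, or None when
    no response arrived (timeout or connection failure).
    """
    if status is None:
        return False
    return 200 <= status <= 399


for status in (200, 301, 399, 404, 503, None):
    print(status, probe_succeeded(status))
```

A redirect (3xx) counts as a success under this rule, while a 404 from a mistyped path counts as a failure.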

Examples

Basic health check

{
  "options": {
    "healthCheck": {
      "path": "/health"
    }
  }
}

Only path is set, so the defaults apply: a probe every 10s with a 5s timeout, unhealthy after 3 consecutive failures, healthy again after 2 consecutive successes. These defaults suit most services.
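
To see which values a partial configuration ends up with, the defaults from the field table can be merged in by hand. The sketch below is only an illustration in Python; `effective_health_check` is a hypothetical helper, not part of Vrata.

```python
# Defaults copied from the "All fields" table; `path` has no default.
DEFAULTS = {
    "interval": "10s",
    "timeout": "5s",
    "unhealthyThreshold": 3,
    "healthyThreshold": 2,
}


def effective_health_check(user_config: dict) -> dict:
    """Fill in documented defaults for any field the user left out."""
    if "path" not in user_config:
        raise ValueError("healthCheck.path is required")
    # Values supplied by the user override the defaults.
    return {**DEFAULTS, **user_config}


print(effective_health_check({"path": "/health"}))
```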

Fast failure detection

{
  "options": {
    "healthCheck": {
      "path": "/health",
      "interval": "3s",
      "timeout": "2s",
      "unhealthyThreshold": 2,
      "healthyThreshold": 1
    }
  }
}

Detects a failure in about 6 seconds (2 consecutive failed probes × 3s interval). Suited to critical services that need fast failover.
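
The detection-time arithmetic can be reproduced for any configuration. In this Python sketch, `parse_seconds` and `detection_estimate` are hypothetical helpers. The first figure is the threshold × interval estimate used on this page. The second adds the probe timeout, which assumes probes start on a fixed schedule and a hanging probe only counts as failed once its timeout expires; that timing model is an assumption, not documented behavior.

```python
import re
from typing import Tuple


def parse_seconds(value: str) -> float:
    """Parse durations written like "3s" (the only form shown on this page)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return float(match.group(1))


def detection_estimate(cfg: dict) -> Tuple[float, float]:
    """Return (nominal, with_timeout) detection times in seconds."""
    interval = parse_seconds(cfg["interval"])
    timeout = parse_seconds(cfg["timeout"])
    nominal = cfg["unhealthyThreshold"] * interval
    # Assumed upper bound: the last failing probe may hang until it times out.
    return nominal, nominal + timeout


fast = {"interval": "3s", "timeout": "2s", "unhealthyThreshold": 2}
print(detection_estimate(fast))  # (6.0, 8.0)
```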

Slow warm-up services

{
  "options": {
    "healthCheck": {
      "path": "/ready",
      "interval": "5s",
      "timeout": "10s",
      "unhealthyThreshold": 6,
      "healthyThreshold": 3
    }
  }
}

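Tolerates longer startup times: an endpoint is ejected only after 6 consecutive failures (about 30 seconds at a 5s interval). Probing /ready instead of /health checks actual readiness (caches loaded, database connected, and so on).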

External service (longer intervals)

{
  "options": {
    "healthCheck": {
      "path": "/ping",
      "interval": "30s",
      "timeout": "10s",
      "unhealthyThreshold": 3,
      "healthyThreshold": 2
    }
  }
}

For third-party APIs where you don’t control the backend and don’t want to send too many probes.

How it works

Vrata sends GET <scheme>://<endpoint-host>:<endpoint-port><path> to each endpoint at the configured interval. If the destination is configured to use TLS toward the upstream, the probe uses TLS as well.
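
As an illustration of the probe target, the URL template can be filled in like this. The Python sketch is hypothetical: `probe_url` is not a Vrata function, and mapping "uses TLS" to the `https` scheme is an assumption drawn from the template.

```python
def probe_url(host: str, port: int, path: str, tls: bool = False) -> str:
    """Build <scheme>://<endpoint-host>:<endpoint-port><path> for one endpoint."""
    # Assumption: a TLS upstream is probed over https, anything else over http.
    scheme = "https" if tls else "http"
    # IPv6 literals must be wrapped in brackets inside a URL.
    if ":" in host and not host.startswith("["):
        host = f"[{host}]"
    return f"{scheme}://{host}:{port}{path}"


print(probe_url("10.0.0.12", 8080, "/health"))
print(probe_url("10.0.0.12", 8443, "/ready", tls=True))
```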

The health check runs independently per endpoint. An endpoint is removed from the pool after unhealthyThreshold consecutive failures and restored after healthyThreshold consecutive successes. This hysteresis prevents flapping: with thresholds above 1, a single failed probe doesn’t remove an endpoint, and a single successful probe doesn’t restore it.
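
The threshold logic amounts to a small per-endpoint state machine. The following Python sketch shows one way to model it and is not Vrata's code: `EndpointHealth` is a hypothetical class, and starting every endpoint as healthy is an assumption.

```python
from dataclasses import dataclass


@dataclass
class EndpointHealth:
    unhealthy_threshold: int = 3
    healthy_threshold: int = 2
    healthy: bool = True  # assumption: endpoints start in the pool
    failures: int = 0     # consecutive failed probes
    successes: int = 0    # consecutive successful probes

    def record(self, success: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if success:
            self.successes += 1
            self.failures = 0  # a success breaks the failure streak
            if not self.healthy and self.successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0  # a failure breaks the success streak
            if self.healthy and self.failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


ep = EndpointHealth()
results = (False, True, False, False, False, True, True)
print([ep.record(ok) for ok in results])
# [True, True, True, True, False, False, True]
```

The isolated failure at the start does not eject the endpoint because the next success resets the streak; only the run of three failures does, and two consecutive successes bring it back.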

vs Outlier Detection

	Health Checks	Outlier Detection
How	Active probes (extra HTTP requests)	Passive (watches real traffic responses)
Detects	Total failures (crashed, unreachable)	Degraded performance (slow responses, 5xx errors)
Cost	Extra HTTP requests per endpoint per interval	Zero overhead
Best for	Backend may be silently broken	Backend returns errors under load

Use both for defense in depth: health checks catch silent failures (a backend that accepts connections but does not serve requests), while outlier detection catches degraded performance in real traffic.

Monitoring

The vrata_endpoint_healthy gauge (requires collect.endpoint: true) shows the current health state:

Value	Meaning
`1`	Healthy (in the pool)
`0`	Unhealthy (ejected)
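
For a quick check outside a dashboard, ejected endpoints can be listed from a metrics scrape. The Python sketch below assumes the gauge is exposed in the Prometheus text format without trailing timestamps. The `endpoint` label and the sample values are hypothetical, made up for this example; check the real label names in your own scrape output.

```python
def unhealthy_series(metrics_text: str, metric: str = "vrata_endpoint_healthy") -> list:
    """Return the series of `metric` whose current value is 0 (ejected)."""
    ejected = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumption: each sample is "<series> <value>" with no timestamp.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name == metric and float(value) == 0:
            ejected.append(series)
    return ejected


# Hypothetical scrape output; label names are illustrative only.
SAMPLE = """\
# TYPE vrata_endpoint_healthy gauge
vrata_endpoint_healthy{endpoint="10.0.0.1:8080"} 1
vrata_endpoint_healthy{endpoint="10.0.0.2:8080"} 0
"""

print(unhealthy_series(SAMPLE))
```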