Monitoring

Stratos exposes Prometheus metrics at :8080/metrics. This guide covers metrics, alerts, and dashboards.
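
If you run your own Prometheus, a scrape job along the lines of the sketch below picks up this endpoint. The job name and the pod label used for selection are assumptions and should match how the Stratos controller is actually labelled in your cluster.

# Sketch of a Prometheus scrape job for the Stratos controller.
# The selector label (app.kubernetes.io/name=stratos) is an assumption.
scrape_configs:
  - job_name: stratos
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods carrying the assumed controller label.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: stratos
        action: keep
      # Rewrite the scrape address to the metrics port (8080, per this guide).
      - source_labels: [__address__]
        regex: (.+?)(:\d+)?
        replacement: ${1}:8080
        target_label: __address__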

Metrics Reference

Node Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| stratos_nodepool_nodes_total | Gauge | pool, state | Nodes by pool and state |
| stratos_nodepool_starting_nodes | Gauge | pool | Nodes currently starting |
| stratos_nodepool_desired_standby | Gauge | pool | Desired standby count (minStandby) |
| stratos_nodepool_pool_size | Gauge | pool | Maximum pool size |

Operation Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| stratos_nodepool_scaleup_total | Counter | pool | Total scale-up operations |
| stratos_nodepool_scaledown_total | Counter | pool | Total scale-down operations |
| stratos_nodepool_scaleup_duration_seconds | Histogram | pool | Scale-up latency |
| stratos_nodepool_warmup_duration_seconds | Histogram | pool, mode | Warmup latency by completion mode |
| stratos_nodepool_drain_duration_seconds | Histogram | pool | Drain latency |
| stratos_nodepool_warmup_failures_total | Counter | pool, reason | Warmup failures |

Warmup Duration Mode Labels

The mode label on stratos_nodepool_warmup_duration_seconds indicates how warmup completed:

| Mode | Description |
| --- | --- |
| self_stop | Instance self-stopped via user data script (SelfStop mode) |
| controller_stop | Stratos stopped the instance when Ready (ControllerStop mode) |
| timeout | Warmup timed out and instance was force-stopped |

Startup Taint Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| stratos_nodepool_startup_taint_removal_total | Counter | pool, trigger, result | Startup taint removals |
| stratos_nodepool_startup_taint_duration_seconds | Histogram | pool | Time to remove startup taints |

Controller Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| stratos_nodepool_reconciliation_duration_seconds | Histogram | pool | Reconciliation latency |
| stratos_nodepool_reconciliation_errors_total | Counter | pool, type | Reconciliation errors |

Cloud Provider Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| stratos_cloud_provider_calls_total | Counter | provider, operation, status | Cloud API calls |
| stratos_cloud_provider_latency_seconds | Histogram | provider, operation | Cloud API latency |

Prometheus Queries

Pool Health

# Nodes by state per pool
stratos_nodepool_nodes_total{pool="workers"}

# Available standby nodes
stratos_nodepool_nodes_total{pool="workers", state="standby"}

# Nodes currently starting
stratos_nodepool_starting_nodes{pool="workers"}

# Pool utilization (running / poolSize); ignoring(state) is needed because
# stratos_nodepool_pool_size carries no state label
stratos_nodepool_nodes_total{state="running"} / ignoring(state) stratos_nodepool_pool_size

# Standby shortage (desired minus actual)
stratos_nodepool_desired_standby - ignoring(state) stratos_nodepool_nodes_total{state="standby"}
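
If you chart or alert on the utilization and shortage expressions regularly, precomputing them with recording rules keeps queries cheap. This is a sketch; the stratos:* rule names are arbitrary conventions, not something Stratos ships.

groups:
  - name: stratos-pool-health
    rules:
      # Precomputed utilization per pool.
      - record: stratos:pool_utilization:ratio
        expr: |
          stratos_nodepool_nodes_total{state="running"}
            / ignoring(state) stratos_nodepool_pool_size
      # Precomputed standby shortage per pool (positive = under-provisioned).
      - record: stratos:standby_shortage:count
        expr: |
          stratos_nodepool_desired_standby
            - ignoring(state) stratos_nodepool_nodes_total{state="standby"}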

Scale Operations

# Scale-up rate (per 5 minutes)
rate(stratos_nodepool_scaleup_total[5m])

# Scale-down rate
rate(stratos_nodepool_scaledown_total[5m])

# P95 scale-up latency
histogram_quantile(0.95, rate(stratos_nodepool_scaleup_duration_seconds_bucket[5m]))

# P99 scale-up latency
histogram_quantile(0.99, rate(stratos_nodepool_scaleup_duration_seconds_bucket[5m]))

# Warmup success rate (aggregated per pool, since the failure and duration
# metrics carry different label sets)
1 - (
  sum by (pool) (rate(stratos_nodepool_warmup_failures_total[1h]))
  /
  sum by (pool) (rate(stratos_nodepool_warmup_duration_seconds_count[1h]))
)

Controller Health

# Reconciliation latency (P95)
histogram_quantile(0.95, rate(stratos_nodepool_reconciliation_duration_seconds_bucket[5m]))

# Reconciliation error rate
rate(stratos_nodepool_reconciliation_errors_total[5m])

# Cloud API latency (P95)
histogram_quantile(0.95, rate(stratos_cloud_provider_latency_seconds_bucket[5m]))

# Cloud API error rate
rate(stratos_cloud_provider_calls_total{status="error"}[5m])

Alerting Rules

The following example rule file covers the most common failure modes:

stratos-alerts.yaml
groups:
  - name: stratos
    rules:
      # Insufficient standby nodes
      - alert: StratosInsufficientStandby
        expr: |
          stratos_nodepool_nodes_total{state="standby"}
            < ignoring(state) stratos_nodepool_desired_standby
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NodePool {{ $labels.pool }} has insufficient standby nodes"
          description: "Pool has {{ $value }} standby nodes, below the desired standby count"

      # No standby nodes
      - alert: StratosNoStandby
        expr: |
          stratos_nodepool_nodes_total{state="standby"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "NodePool {{ $labels.pool }} has no standby nodes"
          description: "Scale-up will be delayed until warmup completes"

      # High warmup failure rate
      - alert: StratosWarmupFailures
        expr: |
          rate(stratos_nodepool_warmup_failures_total[1h]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High warmup failure rate for pool {{ $labels.pool }}"
          description: "Failure rate is {{ $value | printf \"%.2f\" }} per second"

      # Scale-up taking too long
      - alert: StratosSlowScaleUp
        expr: |
          histogram_quantile(0.95, rate(stratos_nodepool_scaleup_duration_seconds_bucket[5m])) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow scale-up for pool {{ $labels.pool }}"
          description: "P95 scale-up latency is {{ $value | printf \"%.0f\" }} seconds"

      # Controller reconciliation errors
      - alert: StratosReconciliationErrors
        expr: |
          rate(stratos_nodepool_reconciliation_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Reconciliation errors for pool {{ $labels.pool }}"
          description: "Error type: {{ $labels.type }}"

      # Pool at capacity
      - alert: StratosPoolAtCapacity
        expr: |
          stratos_nodepool_nodes_total{state="running"}
            >= ignoring(state) stratos_nodepool_pool_size
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NodePool {{ $labels.pool }} is at capacity"
          description: "Running {{ $value }} nodes, which is at the configured pool size"

      # Cloud API errors
      - alert: StratosCloudAPIErrors
        expr: |
          rate(stratos_cloud_provider_calls_total{status="error"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cloud API errors for {{ $labels.provider }}"
          description: "Operation {{ $labels.operation }} failing at {{ $value | printf \"%.2f\" }} per second"

Grafana Dashboard

Panel 1: Pool Overview

Query: Node distribution by state

stratos_nodepool_nodes_total

Visualization: Stacked bar chart by state

Panel 2: Standby Health

Query: Standby vs desired

# Actual standby
stratos_nodepool_nodes_total{state="standby"}

# Desired standby
stratos_nodepool_desired_standby

Visualization: Gauge with thresholds

Panel 3: Scale Operations

Query: Scale events over time

increase(stratos_nodepool_scaleup_total[5m])
increase(stratos_nodepool_scaledown_total[5m])

Visualization: Time series

Panel 4: Scale-Up Latency

Query: Latency percentiles

histogram_quantile(0.5, rate(stratos_nodepool_scaleup_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(stratos_nodepool_scaleup_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(stratos_nodepool_scaleup_duration_seconds_bucket[5m]))

Visualization: Time series with legend (P50, P95, P99)

Panel 5: Controller Health

Query: Reconciliation metrics

rate(stratos_nodepool_reconciliation_errors_total[5m])
histogram_quantile(0.95, rate(stratos_nodepool_reconciliation_duration_seconds_bucket[5m]))

Visualization: Time series

Panel 6: Cloud API

Query: API call rates and latency

rate(stratos_cloud_provider_calls_total[5m])
histogram_quantile(0.95, rate(stratos_cloud_provider_latency_seconds_bucket[5m]))

Visualization: Time series
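
If you assemble these panels into a dashboard JSON kept in version control, Grafana's file-based provisioning can load it automatically. The provider name and filesystem path below are assumptions; adjust them to your Grafana deployment.

# grafana/provisioning/dashboards/stratos.yaml (path is an assumption)
apiVersion: 1
providers:
  - name: stratos
    folder: Stratos
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards/stratos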

Kubernetes Events

Stratos emits Kubernetes events for significant operations:

kubectl get events --field-selector involvedObject.kind=NodePool

Event types:

| Event | Description |
| --- | --- |
| Created | NodePool created |
| ScaleUp | Nodes started for scale-up |
| ScaleDown | Nodes stopped for scale-down |
| Replenishing | Launching new nodes for minStandby |
| MaxRuntimeExceeded | Node recycled due to max runtime |
| CleanupFailed | Error during NodePool deletion |
| Deleted | NodePool deleted |

Health Endpoints

| Endpoint | Port | Description |
| --- | --- | --- |
| /healthz | 8081 | Liveness probe |
| /readyz | 8081 | Readiness probe |
| /metrics | 8080 | Prometheus metrics |
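
A minimal probe configuration for the controller container, assuming the default ports above; the timing values are illustrative, not Stratos defaults.

# Snippet for the Stratos controller container spec (sketch).
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10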

Troubleshooting with Metrics

Diagnosing Scale-Up Issues

  1. Check standby availability:

    stratos_nodepool_nodes_total{state="standby"}
  2. Check in-flight scale-ups:

    stratos_nodepool_starting_nodes
  3. Check scale-up latency:

    histogram_quantile(0.95, rate(stratos_nodepool_scaleup_duration_seconds_bucket[5m]))

Diagnosing Warmup Issues

  1. Check warmup duration:

    histogram_quantile(0.95, rate(stratos_nodepool_warmup_duration_seconds_bucket[5m]))
  2. Check warmup duration by completion mode:

    # Compare SelfStop vs ControllerStop warmup times
    histogram_quantile(0.95,
      sum by (le, pool, mode) (rate(stratos_nodepool_warmup_duration_seconds_bucket[5m]))
    )
  3. Check failure rate:

    rate(stratos_nodepool_warmup_failures_total[1h])

Diagnosing Cloud API Issues

  1. Check API error rate:

    rate(stratos_cloud_provider_calls_total{status="error"}[5m])
  2. Check API latency:

    histogram_quantile(0.95, rate(stratos_cloud_provider_latency_seconds_bucket[5m]))

Next Steps