Quickstart
This guide walks you through creating your first NodePool and testing scale-up.
Prerequisites
Before creating a NodePool, ensure you have:
- Stratos installed via Helm (see Installation)
- AWS credentials configured (see AWS Setup)
- EKS-optimized AMI ID for your region (see the lookup example after this list)
- Subnet and security group IDs
- IAM instance profile for worker nodes
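If you don't already have an AMI ID, one common way to look it up is AWS's public SSM parameter for EKS-optimized AMIs. This is a general AWS CLI example rather than anything Stratos-specific; the Kubernetes version, AMI family, and region below are assumptions, so substitute your own:
# Latest EKS-optimized Amazon Linux 2 AMI for Kubernetes 1.29 in us-east-1
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2/recommended/image_id \
  --region us-east-1 \
  --query 'Parameter.Value' \
  --output text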
Understanding the NodeClass Pattern
Stratos separates AWS-specific configuration from pool management:
- AWSNodeClass: Defines EC2 instance configuration (instance type, AMI, networking, IAM, user data)
- NodePool: Defines pool sizing, scaling behavior, and node template (labels, taints)
You create an AWSNodeClass first, then create a NodePool that references it.
+--------------------+   references    +--------------------+
|      NodePool      | --------------> |    AWSNodeClass    |
+--------------------+                 +--------------------+
| - poolSize: 10     |                 | - instanceType     |
| - minStandby: 3    |                 | - ami              |
| - labels, taints   |                 | - subnetIds        |
+--------------------+                 | - userData         |
                                       +--------------------+
Step 1: Prepare the User Data Script
The user data script runs during the warmup phase. It must:
- Join the Kubernetes cluster
- Register with startup taints
- Wait for the node to be healthy
- Self-stop (poweroff) when ready
The startup taint in --register-with-taints must match the startupTaints field in the NodePool spec. If they don't match, Stratos won't remove the taint that is actually on the node, and pods will never be scheduled onto it.
Example for EKS:
#!/bin/bash
set -e
# Join the EKS cluster with a startup taint
# The taint prevents pod scheduling until CNI is ready
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--register-with-taints=node.eks.amazonaws.com/not-ready=true:NoSchedule'
# Wait for kubelet to be healthy
until curl -sf http://localhost:10248/healthz >/dev/null 2>&1; do
  sleep 5
done
# Give time for node to fully register
sleep 30
# Note: Stratos automatically pre-pulls DaemonSet images during warmup.
# You can configure additional images via preWarm.imagesToPull in the NodePool spec.
# Signal warmup complete by stopping the instance
poweroff
Bottlerocket uses TOML configuration and doesn't support shell scripts. Use ControllerStop completion mode instead:
preWarm:
  completionMode: ControllerStop
In this mode, Stratos stops the instance when the node becomes Ready, eliminating the need for a poweroff script. See Bottlerocket Setup for details.
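If you use ControllerStop, the user data still needs to join the cluster and register the startup taint; only the final poweroff becomes unnecessary. A sketch of how the mode combines with the preWarm timeout fields used in Step 3 (check the NodePool reference for the authoritative schema):
preWarm:
  completionMode: ControllerStop  # Stratos stops the instance once the node is Ready
  timeout: 15m
  timeoutAction: terminate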
Step 2: Create the AWSNodeClass
Create a file named awsnodeclass-workers.yaml:
apiVersion: stratos.sh/v1alpha1
kind: AWSNodeClass
metadata:
  name: workers
spec:
  region: us-east-1
  instanceType: m5.large
  ami: ami-0123456789abcdef0  # Your EKS-optimized AMI
  subnetIds:
    - subnet-12345678
    - subnet-87654321
  securityGroupIds:
    - sg-12345678
  iamInstanceProfile: arn:aws:iam::123456789012:instance-profile/eks-node-role
  # User data script that joins cluster and self-stops
  userData: |
    #!/bin/bash
    set -e
    /etc/eks/bootstrap.sh my-cluster \
      --kubelet-extra-args '--register-with-taints=node.eks.amazonaws.com/not-ready=true:NoSchedule'
    until curl -sf http://localhost:10248/healthz >/dev/null 2>&1; do sleep 5; done
    sleep 30
    poweroff
  blockDeviceMappings:
    - deviceName: /dev/xvda
      volumeSize: 50
      volumeType: gp3
      encrypted: true
  tags:
    Environment: production
    ManagedBy: stratos
Apply the AWSNodeClass:
kubectl apply -f awsnodeclass-workers.yaml
Verify it was created:
kubectl get awsnodeclasses
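If Stratos sets a Valid condition on the AWSNodeClass (see Troubleshooting below), you can also wait for validation to complete before moving on; this assumes the condition type is exactly Valid:
kubectl wait --for=condition=Valid awsnodeclass/workers --timeout=60s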
Step 3: Create the NodePool
Create a file named nodepool-workers.yaml:
apiVersion: stratos.sh/v1alpha1
kind: NodePool
metadata:
  name: workers
spec:
  # Maximum nodes in the pool (standby + running)
  poolSize: 10
  # Minimum standby nodes to maintain
  minStandby: 3
  # Reconciliation interval
  reconciliationInterval: 30s
  template:
    # Reference to AWSNodeClass for EC2 configuration
    nodeClassRef:
      kind: AWSNodeClass
      name: workers
    # Labels applied to managed nodes
    labels:
      stratos.sh/pool: workers
      node-role.kubernetes.io/worker: ""
    # Startup taints - MUST match --register-with-taints in userData
    # Stratos removes these when the CNI is ready
    startupTaints:
      - key: node.eks.amazonaws.com/not-ready
        value: "true"
        effect: NoSchedule
    # How startup taints are removed
    startupTaintRemoval: WhenNetworkReady
  # Pre-warm configuration
  preWarm:
    timeout: 15m
    timeoutAction: terminate
  # Scale-down configuration
  scaleDown:
    enabled: true
    emptyNodeTTL: 5m
    drainTimeout: 5m
Apply the NodePool:
kubectl apply -f nodepool-workers.yaml
The AWSNodeClass must exist before creating the NodePool. If the NodePool references a non-existent AWSNodeClass, it will be marked as Degraded with reason NodeClassNotFound.
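Because the startup taint must match between the user data and the NodePool spec, it's worth comparing the two now that both resources exist. A quick check, assuming the field paths used in the manifests above:
# Taint registered by the user data script
kubectl get awsnodeclass workers -o jsonpath='{.spec.userData}' | grep register-with-taints
# Startup taints Stratos expects to remove
kubectl get nodepool workers -o jsonpath='{.spec.template.startupTaints}'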
Step 4: Verify the NodePool
Check the NodePool status:
kubectl get nodepools
Expected output:
NAME      POOLSIZE   MINSTANDBY   STANDBY   RUNNING   READY   AGE
workers   10         3            0         0         False   10s
Watch nodes being created:
kubectl get nodes -l stratos.sh/pool=workers -w
View detailed status:
kubectl describe nodepool workers
Check the AWSNodeClass status:
kubectl describe awsnodeclass workers
Initial warmup takes several minutes. Nodes will transition through warmup to standby state.
After warmup completes (3-5 minutes), you should see:
NAME      POOLSIZE   MINSTANDBY   STANDBY   RUNNING   READY   AGE
workers   10         3            3         0         True    5m
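To watch each node's state directly, you can surface the state label as an extra column; this assumes the stratos.sh/state label referenced in the Troubleshooting section:
kubectl get nodes -l stratos.sh/pool=workers -L stratos.sh/state -w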
Step 5: Test Scale-Up
Create a file named test-workload.yaml containing a deployment that requests resources:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-workload
spec:
  replicas: 5
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
kubectl apply -f test-workload.yaml
When pods become pending due to insufficient capacity:
- Stratos detects the unschedulable pods
- Standby nodes are started (pre-warmed instances resume in ~15-20 seconds)
- Nodes become Ready in ~20-25 seconds total
- Pods are scheduled on the newly available nodes
This is roughly half the time of Karpenter (~40-50 seconds) because Stratos nodes have already completed boot, cluster join, CNI initialization, and DaemonSet image pulls during the warmup phase.
Monitor the process:
# Watch pods
kubectl get pods -w
# Watch nodes
kubectl get nodes -l stratos.sh/pool=workers -w
# Check NodePool events
kubectl describe nodepool workers
Step 6: Test Scale-Down
Delete the test workload:
kubectl delete deployment test-workload
After emptyNodeTTL (5 minutes, as configured above), Stratos will:
- Detect empty nodes
- Cordon and drain the nodes
- Stop the instances
- Return them to standby state
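You can watch the pool return to its standby counts using the same output columns as in Step 4:
kubectl get nodepool workers -w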
Troubleshooting
NodePool Shows Degraded with NodeClassNotFound
The AWSNodeClass doesn't exist or has a different name:
# Check if AWSNodeClass exists
kubectl get awsnodeclasses
# Check the NodePool's nodeClassRef
kubectl get nodepool workers -o jsonpath='{.spec.template.nodeClassRef}'
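If the reference is simply pointing at the wrong name, you can patch it in place; this sketch assumes the field path shown in Step 3:
kubectl patch nodepool workers --type=merge \
  -p '{"spec":{"template":{"nodeClassRef":{"kind":"AWSNodeClass","name":"workers"}}}}'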
Nodes Stuck in Warmup
Check the user data script output:
aws ec2 get-console-output --instance-id <instance-id>
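If you don't have the instance ID handy and the node did register with the cluster, the ID is embedded in the node's provider ID (replace <node-name> with the stuck node):
kubectl get node <node-name> -o jsonpath='{.spec.providerID}'
# AWS provider IDs typically look like aws:///<availability-zone>/<instance-id>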
Common issues:
- User data script failing
- Missing poweroff command
- Network issues preventing cluster join
Scale-Up Not Triggering
Verify standby nodes are available:
kubectl get nodes -l stratos.sh/pool=workers,stratos.sh/state=standby
Check controller logs:
kubectl -n stratos-system logs deployment/stratos
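To narrow the output to this pool, filtering on the pool name is usually enough as a starting point; the exact log format is controller-specific:
kubectl -n stratos-system logs deployment/stratos --since=15m | grep -i workers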
AWSNodeClass Status Shows Issues
Check the AWSNodeClass conditions:
kubectl describe awsnodeclass workers
Look for:
- Valid condition: Is the spec valid?
- InUse condition: Is it referenced by NodePools?
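To print just those conditions without the full describe output, assuming the condition types are exactly Valid and InUse:
kubectl get awsnodeclass workers \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'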
Next Steps
- Node Lifecycle - Understand node states
- AWSNodeClass Reference - Complete AWSNodeClass API
- Scaling Policies - Configure scaling behavior
- Monitoring - Set up monitoring and alerts