
Cloud Providers

Stratos uses a cloud provider abstraction to support multiple cloud platforms. This document explains the architecture, the NodeClass pattern for cloud-specific configuration, and supported providers.

Architecture Overview

Stratos separates concerns between pool management and cloud-specific configuration:

+------------------+       references       +------------------+
|     NodePool     | ---------------------> |   AWSNodeClass   |
+------------------+                        +------------------+
| - poolSize       |                        | - instanceType   |
| - minStandby     |                        | - ami            |
| - labels, taints |                        | - subnetIds      |
| - preWarm config |                        | - securityGroups |
| - scaleDown      |                        | - userData       |
+------------------+                        +------------------+
         |
         | uses
         v
+------------------+
|  CloudProvider   |
|    Interface     |
+------------------+
| - StartInstance  |
| - StopInstance   |
| - Terminate...   |
| - GetInstance    |
| - ListInstances  |
+------------------+

Why Separate NodeClass?

  1. Reusability: Multiple NodePools can share the same EC2 configuration
  2. Separation of concerns: Pool sizing vs. instance configuration
  3. Multi-cloud support: Each cloud gets its own NodeClass type (AWSNodeClass, GCPNodeClass, etc.)
  4. Independent evolution: Cloud-specific schemas can evolve without affecting NodePool

This pattern follows Karpenter's NodeClass design.

Provider Interface

The CloudProvider interface defines instance lifecycle operations that work with instance IDs:

type CloudProvider interface {
    // StartInstance starts a stopped instance.
    StartInstance(ctx context.Context, instanceID string) error

    // StopInstance stops a running instance.
    StopInstance(ctx context.Context, instanceID string, force bool) error

    // TerminateInstance terminates an instance permanently.
    TerminateInstance(ctx context.Context, instanceID string) error

    // GetInstanceState returns the current state of an instance.
    GetInstanceState(ctx context.Context, instanceID string) (InstanceState, error)

    // GetInstance returns full details of an instance.
    GetInstance(ctx context.Context, instanceID string) (*Instance, error)

    // ListInstances returns instances matching the given tags.
    ListInstances(ctx context.Context, tags map[string]string) ([]*Instance, error)

    // UpdateInstanceTags updates tags on an instance.
    UpdateInstanceTags(ctx context.Context, instanceID string, tags map[string]string) error
}

Note: LaunchInstance is not part of the generic interface because launching requires cloud-specific configuration (e.g., AWSNodeClass). Each provider implements its own LaunchInstance method that takes its specific NodeClass type directly.

NodeClass Resources

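NodeClass resources define cloud-specific instance configuration. AWSNodeClass is the only type currently implemented; the others are planned:

| NodeClass      | Cloud | Description                                                      |
|----------------|-------|------------------------------------------------------------------|
| AWSNodeClass   | AWS   | EC2 instance configuration (instance type, AMI, networking, IAM) |
| GCPNodeClass   | GCP   | Planned - Compute Engine configuration                           |
| AzureNodeClass | Azure | Planned - Virtual Machine configuration                          |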

AWSNodeClass

AWSNodeClass defines AWS EC2 configuration:

apiVersion: stratos.sh/v1alpha1
kind: AWSNodeClass
metadata:
  name: production-nodes
spec:
  region: us-east-1
  instanceType: m5.large
  ami: ami-0123456789abcdef0
  subnetIds:
    - subnet-12345678
    - subnet-87654321
  securityGroupIds:
    - sg-12345678
  iamInstanceProfile: arn:aws:iam::123456789012:instance-profile/node-role
  userData: |
    #!/bin/bash
    /etc/eks/bootstrap.sh my-cluster
    # ... warmup script
  blockDeviceMappings:
    - deviceName: /dev/xvda
      volumeSize: 50
      volumeType: gp3
      encrypted: true
  tags:
    Environment: production

See AWSNodeClass API Reference for complete documentation.

NodePool Reference

NodePools reference NodeClass resources via spec.template.nodeClassRef:

apiVersion: stratos.sh/v1alpha1
kind: NodePool
metadata:
  name: workers
spec:
  poolSize: 10
  minStandby: 3
  template:
    nodeClassRef:
      kind: AWSNodeClass        # NodeClass type
      name: production-nodes    # NodeClass name
    labels:
      stratos.sh/pool: workers
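
To make the reference concrete, here is a minimal sketch of how a controller might resolve a nodeClassRef by dispatching on kind and looking up name. The Go types NodeClassRef and AWSNodeClassSpec, the function resolveAWSNodeClass, and the in-memory map are assumptions for illustration; a real controller would read the object from the Kubernetes API.

```go
package main

import "fmt"

// NodeClassRef mirrors the kind/name pair under spec.template.nodeClassRef.
// The Go type name and fields are hypothetical.
type NodeClassRef struct {
	Kind string
	Name string
}

// AWSNodeClassSpec holds a small subset of the AWSNodeClass fields shown in the YAML.
type AWSNodeClassSpec struct {
	Region       string
	InstanceType string
}

// resolveAWSNodeClass looks up the referenced class in an in-memory map keyed by name.
// It rejects any kind other than AWSNodeClass.
func resolveAWSNodeClass(ref NodeClassRef, classes map[string]AWSNodeClassSpec) (AWSNodeClassSpec, error) {
	if ref.Kind != "AWSNodeClass" {
		return AWSNodeClassSpec{}, fmt.Errorf("unsupported NodeClass kind %q", ref.Kind)
	}
	spec, ok := classes[ref.Name]
	if !ok {
		return AWSNodeClassSpec{}, fmt.Errorf("AWSNodeClass %q not found", ref.Name)
	}
	return spec, nil
}
```

Because the lookup is keyed only by name, several NodePools that reference production-nodes resolve to the same configuration, which is the reusability benefit described earlier.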

Supported Providers

AWS (Production)

The AWS provider (--cloud-provider=aws) manages EC2 instances.

Features:

  • Full EC2 lifecycle management (launch, start, stop, terminate)
  • Built-in rate limiting for API throttling protection
  • Instance type to capacity mapping for scale-up calculations
  • Subnet round-robin for AZ distribution
  • AWSNodeClass for configuration

NodeClass: AWSNodeClass

Configuration Example:

apiVersion: stratos.sh/v1alpha1
kind: AWSNodeClass
metadata:
  name: standard-nodes
spec:
  region: us-east-1
  instanceType: m5.large
  ami: ami-0123456789abcdef0
  subnetIds:
    - subnet-12345678
    - subnet-87654321
  securityGroupIds:
    - sg-12345678
  iamInstanceProfile: arn:aws:iam::123456789012:instance-profile/node-role
  userData: |
    #!/bin/bash
    /etc/eks/bootstrap.sh my-cluster \
      --kubelet-extra-args '--register-with-taints=node.eks.amazonaws.com/not-ready=true:NoSchedule'
    until curl -sf http://localhost:10248/healthz; do sleep 5; done
    sleep 30
    poweroff
  blockDeviceMappings:
    - deviceName: /dev/xvda
      volumeSize: 50
      volumeType: gp3
      encrypted: true
  tags:
    Environment: production
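
The subnet round-robin listed among the AWS provider features can be sketched as a small picker that cycles through the configured subnetIds. This is an assumption about how such a feature could work, not the Stratos implementation; subnetPicker and its methods are hypothetical names.

```go
package main

import "sync"

// subnetPicker cycles through subnet IDs so consecutive launches land in different subnets.
type subnetPicker struct {
	mu      sync.Mutex
	subnets []string
	next    int
}

// newSubnetPicker copies the slice so later changes by the caller do not affect the picker.
func newSubnetPicker(subnets []string) *subnetPicker {
	return &subnetPicker{subnets: append([]string(nil), subnets...)}
}

// Next returns the next subnet ID in rotation, or "" when none are configured.
func (p *subnetPicker) Next() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.subnets) == 0 {
		return ""
	}
	s := p.subnets[p.next]
	p.next = (p.next + 1) % len(p.subnets)
	return s
}
```

Rotation spreads instances across availability zones only when the listed subnets live in different zones, so the order and choice of subnetIds matters.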

Fake (Testing)

The fake provider (--cloud-provider=fake) is a mock implementation for testing and local development.

Features:

  • In-memory instance tracking
  • Configurable hooks for testing
  • No cloud costs

Usage:

go run ./cmd/stratos/main.go --cluster-name=test --cloud-provider=fake

Tip: Use the fake provider for local development and testing. It allows rapid iteration without cloud costs or API rate limits.

Instance States

The provider interface uses a common set of instance states:

| State      | Description                |
|------------|----------------------------|
| pending    | Instance is launching      |
| running    | Instance is running        |
| stopping   | Instance is stopping       |
| stopped    | Instance is stopped        |
| terminated | Instance is terminated     |
| unknown    | State cannot be determined |
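
A provider has to translate its cloud's native states into this common set. The sketch below shows one way that could look for EC2, whose instance state names are pending, running, shutting-down, terminated, stopping, and stopped. The constant names, the function names, and the choice to map shutting-down to terminated are assumptions, not Stratos code.

```go
package main

// InstanceState is assumed to be a string type holding the values from the table.
type InstanceState string

const (
	StatePending    InstanceState = "pending"
	StateRunning    InstanceState = "running"
	StateStopping   InstanceState = "stopping"
	StateStopped    InstanceState = "stopped"
	StateTerminated InstanceState = "terminated"
	StateUnknown    InstanceState = "unknown"
)

// fromEC2State maps an EC2 instance state name to a common state.
// Treating "shutting-down" as terminated is an assumption made for this sketch.
func fromEC2State(name string) InstanceState {
	switch name {
	case "pending":
		return StatePending
	case "running":
		return StateRunning
	case "stopping":
		return StateStopping
	case "stopped":
		return StateStopped
	case "shutting-down", "terminated":
		return StateTerminated
	default:
		return StateUnknown
	}
}

// canStart reports whether calling StartInstance makes sense for the given state.
func canStart(s InstanceState) bool {
	return s == StateStopped
}
```

Falling back to unknown for unrecognized names keeps the controller from acting on a state it does not understand.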

Instance Tags

Stratos uses tags to track instance ownership and state:

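| Tag                | Description                         | Example    |
|--------------------|-------------------------------------|------------|
| managed-by         | Identifies Stratos-managed instances | stratos    |
| stratos.sh/pool    | NodePool name                       | workers    |
| stratos.sh/cluster | Kubernetes cluster name             | production |
| stratos.sh/state   | Current Stratos state               | standby    |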

These tags are used for:

  • Discovering managed instances on startup
  • Filtering instances by pool
  • Auditing and cost allocation
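
Discovery and per-pool filtering both reduce to building a tag map and passing it to ListInstances. The sketch below builds such a filter and checks an instance's tags against it; the tag keys come from the table above, while poolFilter and matches are hypothetical helper names.

```go
package main

// Tag keys taken from the table above.
const (
	tagManagedBy = "managed-by"
	tagPool      = "stratos.sh/pool"
	tagCluster   = "stratos.sh/cluster"
	tagState     = "stratos.sh/state"
)

// poolFilter builds the tag map that selects the Stratos-managed instances
// of one pool in one cluster, in the shape ListInstances expects.
func poolFilter(cluster, pool string) map[string]string {
	return map[string]string{
		tagManagedBy: "stratos",
		tagCluster:   cluster,
		tagPool:      pool,
	}
}

// matches reports whether instanceTags contains every key/value pair in filter.
func matches(instanceTags, filter map[string]string) bool {
	for k, want := range filter {
		got, ok := instanceTags[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}
```

Leaving stratos.sh/state out of the filter returns instances in every state; adding it narrows the result, for example to standby instances only.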

Rate Limiting

The AWS provider includes built-in rate limiting to avoid EC2 API throttling:

| Operation          | Rate Limit |
|--------------------|------------|
| DescribeInstances  | 20 req/s   |
| RunInstances       | 5 req/s    |
| StartInstances     | 5 req/s    |
| StopInstances      | 5 req/s    |
| TerminateInstances | 5 req/s    |
| CreateTags         | 10 req/s   |

Note: Rate limits are applied per controller instance. If you see RateLimitError in logs, consider reducing the reconciliation frequency.

Error Handling

The provider interface defines common error types:

| Error                     | Description                        |
|---------------------------|------------------------------------|
| InstanceNotFoundError     | Instance does not exist            |
| RateLimitError            | API rate limit exceeded            |
| InsufficientCapacityError | Insufficient capacity in region/AZ |
| InvalidConfigError        | Invalid launch configuration       |

The controller responds to these errors as follows:

  • InstanceNotFoundError: Cleans up orphaned Kubernetes node
  • RateLimitError: Retries with exponential backoff
  • InsufficientCapacityError: Tries different subnets/AZs

Future Providers

The cloud provider architecture is designed to support additional providers:

  • GCP - Google Compute Engine instances (GCPNodeClass)
  • Azure - Azure Virtual Machines (AzureNodeClass)

Adding a new provider requires:

  1. Implement the CloudProvider interface
  2. Create a new NodeClass CRD (e.g., GCPNodeClass)
  3. Implement the provider-specific LaunchInstance method
  4. Register the provider in the factory

Note: GCP and Azure support is planned but not yet implemented. Contributions are welcome.
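
Step 4, registering the provider in the factory, might look like the name-to-constructor registry sketched below. The actual factory is not shown in this document, so register, newProvider, available, and the reduced Provider interface are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sort"
)

// Provider stands in for the CloudProvider interface; it is reduced to one method for brevity.
type Provider interface {
	Name() string
}

// constructor builds a provider instance.
type constructor func() (Provider, error)

// registry maps the value of --cloud-provider to a constructor.
var registry = map[string]constructor{}

// register adds a provider constructor under the given name.
func register(name string, c constructor) {
	registry[name] = c
}

// available returns the registered provider names in sorted order.
func available() []string {
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// newProvider constructs the provider registered under name.
func newProvider(name string) (Provider, error) {
	c, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown cloud provider %q (available: %v)", name, available())
	}
	return c()
}

// fakeProvider is a placeholder implementation used to demonstrate registration.
type fakeProvider struct{}

func (fakeProvider) Name() string { return "fake" }
```

With this shape, adding a provider amounts to one register call plus the implementation, and an unrecognized --cloud-provider value produces an error that lists the valid names.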

Next Steps