Cloud Providers
Stratos uses a cloud provider abstraction to support multiple cloud platforms. This document explains the architecture, the NodeClass pattern for cloud-specific configuration, and supported providers.
Architecture Overview
Stratos separates concerns between pool management and cloud-specific configuration:
+------------------+       references        +------------------+
| NodePool         | ----------------------> | AWSNodeClass     |
+------------------+                         +------------------+
| - poolSize       |                         | - instanceType   |
| - minStandby     |                         | - ami            |
| - labels, taints |                         | - subnetIds      |
| - preWarm config |                         | - securityGroups |
| - scaleDown      |                         | - userData       |
+------------------+                         +------------------+
         |
         | uses
         v
+------------------+
| CloudProvider    |
| Interface        |
+------------------+
| - StartInstance  |
| - StopInstance   |
| - Terminate...   |
| - GetInstance    |
| - ListInstances  |
+------------------+
Why Separate NodeClass?
- Reusability: Multiple NodePools can share the same instance configuration (for example, one AWSNodeClass)
- Separation of concerns: Pool sizing vs. instance configuration
- Multi-cloud support: Each cloud gets its own NodeClass type (AWSNodeClass, GCPNodeClass, etc.)
- Independent evolution: Cloud-specific schemas can evolve without affecting NodePool
This pattern follows Karpenter's NodeClass design.
Provider Interface
The CloudProvider interface defines instance lifecycle operations that work with instance IDs:
type CloudProvider interface {
	// StartInstance starts a stopped instance.
	StartInstance(ctx context.Context, instanceID string) error

	// StopInstance stops a running instance.
	StopInstance(ctx context.Context, instanceID string, force bool) error

	// TerminateInstance terminates an instance permanently.
	TerminateInstance(ctx context.Context, instanceID string) error

	// GetInstanceState returns the current state of an instance.
	GetInstanceState(ctx context.Context, instanceID string) (InstanceState, error)

	// GetInstance returns full details of an instance.
	GetInstance(ctx context.Context, instanceID string) (*Instance, error)

	// ListInstances returns instances matching the given tags.
	ListInstances(ctx context.Context, tags map[string]string) ([]*Instance, error)

	// UpdateInstanceTags updates tags on an instance.
	UpdateInstanceTags(ctx context.Context, instanceID string, tags map[string]string) error
}
LaunchInstance is not part of the generic interface because launch requires cloud-specific configuration (e.g., AWSNodeClass). Each provider implements its own LaunchInstance method that takes its specific NodeClass type directly.
NodeClass Resources
NodeClass resources define cloud-specific instance configuration. AWSNodeClass is the only NodeClass implemented today; the GCP and Azure types are planned:
| NodeClass | Cloud | Description |
|---|---|---|
| AWSNodeClass | AWS | EC2 instance configuration (instance type, AMI, networking, IAM) |
| GCPNodeClass | GCP | Planned - Compute Engine configuration |
| AzureNodeClass | Azure | Planned - Virtual Machine configuration |
AWSNodeClass
AWSNodeClass defines AWS EC2 configuration:
apiVersion: stratos.sh/v1alpha1
kind: AWSNodeClass
metadata:
  name: production-nodes
spec:
  region: us-east-1
  instanceType: m5.large
  ami: ami-0123456789abcdef0
  subnetIds:
    - subnet-12345678
    - subnet-87654321
  securityGroupIds:
    - sg-12345678
  iamInstanceProfile: arn:aws:iam::123456789012:instance-profile/node-role
  userData: |
    #!/bin/bash
    /etc/eks/bootstrap.sh my-cluster
    # ... warmup script
  blockDeviceMappings:
    - deviceName: /dev/xvda
      volumeSize: 50
      volumeType: gp3
      encrypted: true
  tags:
    Environment: production
See AWSNodeClass API Reference for complete documentation.
NodePool Reference
NodePools reference NodeClass resources via spec.template.nodeClassRef:
apiVersion: stratos.sh/v1alpha1
kind: NodePool
metadata:
  name: workers
spec:
  poolSize: 10
  minStandby: 3
  template:
    nodeClassRef:
      kind: AWSNodeClass       # NodeClass type
      name: production-nodes   # NodeClass name
    labels:
      stratos.sh/pool: workers
Supported Providers
AWS (Production)
The AWS provider (--cloud-provider=aws) manages EC2 instances.
Features:
- Full EC2 lifecycle management (launch, start, stop, terminate)
- Built-in rate limiting for API throttling protection
- Instance type to capacity mapping for scale-up calculations
- Subnet round-robin for AZ distribution
- AWSNodeClass for configuration
NodeClass: AWSNodeClass
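The subnet round-robin feature listed above can be pictured with a short sketch. The `subnetPicker` type and its `Pick` method are hypothetical names chosen for illustration, and the real provider may select subnets differently (for example, by skipping subnets that recently reported insufficient capacity).

```go
package main

import "fmt"

// subnetPicker cycles through the configured subnets so that consecutive
// launches land in different subnets. Hypothetical type, not thread-safe.
type subnetPicker struct {
	subnets []string
	next    int
}

// Pick returns the next subnet in rotation, or "" if none are configured.
func (p *subnetPicker) Pick() string {
	if len(p.subnets) == 0 {
		return ""
	}
	s := p.subnets[p.next%len(p.subnets)]
	p.next++
	return s
}

func main() {
	p := &subnetPicker{subnets: []string{"subnet-12345678", "subnet-87654321"}}
	for i := 0; i < 3; i++ {
		fmt.Println(p.Pick()) // subnet-12345678, subnet-87654321, subnet-12345678
	}
}
```

Rotation only spreads instances across availability zones when the listed subnets themselves sit in different zones.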
Configuration Example:
apiVersion: stratos.sh/v1alpha1
kind: AWSNodeClass
metadata:
  name: standard-nodes
spec:
  region: us-east-1
  instanceType: m5.large
  ami: ami-0123456789abcdef0
  subnetIds:
    - subnet-12345678
    - subnet-87654321
  securityGroupIds:
    - sg-12345678
  iamInstanceProfile: arn:aws:iam::123456789012:instance-profile/node-role
  userData: |
    #!/bin/bash
    /etc/eks/bootstrap.sh my-cluster \
      --kubelet-extra-args '--register-with-taints=node.eks.amazonaws.com/not-ready=true:NoSchedule'
    until curl -sf http://localhost:10248/healthz; do sleep 5; done
    sleep 30
    poweroff
  blockDeviceMappings:
    - deviceName: /dev/xvda
      volumeSize: 50
      volumeType: gp3
      encrypted: true
  tags:
    Environment: production
Fake (Testing)
The fake provider (--cloud-provider=fake) is a mock implementation for testing and local development.
Features:
- In-memory instance tracking
- Configurable hooks for testing
- No cloud costs
Usage:
go run ./cmd/stratos/main.go --cluster-name=test --cloud-provider=fake
Use the fake provider for local development and testing. It allows rapid iteration without cloud costs or API rate limits.
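As a rough picture of what in-memory tracking with a test hook might look like, the sketch below implements three of the interface methods against a map. Everything here (`fakeProvider`, `AddInstance`, the `OnStart` hook, `stopThenStart`) is a hypothetical illustration rather than the fake provider that ships with Stratos, and it jumps straight between `running` and `stopped` without passing through the transitional states.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type InstanceState string

const (
	StateRunning InstanceState = "running"
	StateStopped InstanceState = "stopped"
)

// fakeProvider keeps instance state in memory (hypothetical sketch).
type fakeProvider struct {
	mu        sync.Mutex
	instances map[string]InstanceState
	// OnStart is an optional hook that lets a test inject failures.
	OnStart func(instanceID string) error
}

func newFakeProvider() *fakeProvider {
	return &fakeProvider{instances: make(map[string]InstanceState)}
}

// AddInstance seeds the fake with an instance in the given state.
func (f *fakeProvider) AddInstance(id string, s InstanceState) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.instances[id] = s
}

// setState updates a known instance or reports that it does not exist.
func (f *fakeProvider) setState(id string, s InstanceState) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.instances[id]; !ok {
		return fmt.Errorf("instance %q not found", id)
	}
	f.instances[id] = s
	return nil
}

func (f *fakeProvider) StartInstance(ctx context.Context, id string) error {
	if f.OnStart != nil {
		if err := f.OnStart(id); err != nil {
			return err
		}
	}
	return f.setState(id, StateRunning)
}

func (f *fakeProvider) StopInstance(ctx context.Context, id string, force bool) error {
	return f.setState(id, StateStopped)
}

func (f *fakeProvider) GetInstanceState(ctx context.Context, id string) (InstanceState, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	s, ok := f.instances[id]
	if !ok {
		return "", fmt.Errorf("instance %q not found", id)
	}
	return s, nil
}

// stopThenStart stops and restarts an instance, returning the state seen after each step.
func stopThenStart(f *fakeProvider, id string) (afterStop, afterStart InstanceState, err error) {
	ctx := context.Background()
	if err = f.StopInstance(ctx, id, false); err != nil {
		return
	}
	if afterStop, err = f.GetInstanceState(ctx, id); err != nil {
		return
	}
	if err = f.StartInstance(ctx, id); err != nil {
		return
	}
	afterStart, err = f.GetInstanceState(ctx, id)
	return
}

func main() {
	f := newFakeProvider()
	f.AddInstance("i-1", StateRunning)
	afterStop, afterStart, err := stopThenStart(f, "i-1")
	fmt.Println(afterStop, afterStart, err)
}
```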
Instance States
The provider interface uses a common set of instance states:
| State | Description |
|---|---|
| pending | Instance is launching |
| running | Instance is running |
| stopping | Instance is stopping |
| stopped | Instance is stopped |
| terminated | Instance is terminated |
| unknown | State cannot be determined |
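One way to model these states in Go is a string type with constants, plus a parser that falls back to `unknown` for anything unrecognized. The constant names and the helpers `ParseInstanceState` and `CanStart` are assumptions for this sketch; only the state values come from the table.

```go
package main

import "fmt"

// InstanceState holds one of the common states from the table above.
type InstanceState string

const (
	StatePending    InstanceState = "pending"
	StateRunning    InstanceState = "running"
	StateStopping   InstanceState = "stopping"
	StateStopped    InstanceState = "stopped"
	StateTerminated InstanceState = "terminated"
	StateUnknown    InstanceState = "unknown"
)

// ParseInstanceState maps a provider-reported string onto the common set.
// Anything outside the table is reported as unknown (an assumption of this sketch).
func ParseInstanceState(raw string) InstanceState {
	switch s := InstanceState(raw); s {
	case StatePending, StateRunning, StateStopping, StateStopped, StateTerminated:
		return s
	default:
		return StateUnknown
	}
}

// CanStart reports whether calling StartInstance makes sense for the state.
func CanStart(s InstanceState) bool {
	return s == StateStopped
}

func main() {
	for _, raw := range []string{"running", "stopped", "shutting-down"} {
		s := ParseInstanceState(raw)
		fmt.Printf("%s -> %s (can start: %v)\n", raw, s, CanStart(s))
	}
}
```

EC2 reports a `shutting-down` state that has no row in the table; this sketch treats it as `unknown`, though a real provider might map it elsewhere.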
Instance Tags
Stratos uses tags to track instance ownership and state:
| Tag | Description | Example |
|---|---|---|
| managed-by | Identifies Stratos-managed instances | stratos |
| stratos.sh/pool | NodePool name | workers |
| stratos.sh/cluster | Kubernetes cluster name | production |
| stratos.sh/state | Current Stratos state | standby |
These tags are used for:
- Discovering managed instances on startup
- Filtering instances by pool
- Auditing and cost allocation
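The discovery and filtering uses above amount to building a tag selector and handing it to `ListInstances`. The sketch below shows one plausible shape for that selector and the matching rule; the helper names `poolSelector` and `matches` are hypothetical, and only the tag keys and example values come from the table.

```go
package main

import "fmt"

// Tag keys from the table above.
const (
	TagManagedBy = "managed-by"
	TagPool      = "stratos.sh/pool"
	TagCluster   = "stratos.sh/cluster"
	TagState     = "stratos.sh/state"
)

// poolSelector builds the tag filter for all managed instances of one pool.
func poolSelector(cluster, pool string) map[string]string {
	return map[string]string{
		TagManagedBy: "stratos",
		TagCluster:   cluster,
		TagPool:      pool,
	}
}

// matches reports whether an instance's tags satisfy every selector entry.
func matches(instanceTags, selector map[string]string) bool {
	for k, want := range selector {
		if got, ok := instanceTags[k]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	sel := poolSelector("production", "workers")
	inst := map[string]string{
		TagManagedBy: "stratos",
		TagCluster:   "production",
		TagPool:      "workers",
		TagState:     "standby",
	}
	fmt.Println(matches(inst, sel)) // true
}
```

Including the cluster tag in the selector keeps two clusters that share an AWS account from adopting each other's instances.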
Rate Limiting
The AWS provider includes built-in rate limiting to avoid EC2 API throttling:
| Operation | Rate Limit |
|---|---|
| DescribeInstances | 20 req/s |
| RunInstances | 5 req/s |
| StartInstances | 5 req/s |
| StopInstances | 5 req/s |
| TerminateInstances | 5 req/s |
| CreateTags | 10 req/s |
Rate limits are applied per controller instance. If you see RateLimitError in logs, consider reducing the reconciliation frequency.
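A client-side limit of this kind is commonly implemented as a token bucket. The sketch below is a minimal, single-goroutine version driven by a simulated clock; it is not the limiter Stratos uses, and the burst size (one second of budget) is an assumption, since the table specifies only sustained rates.

```go
package main

import (
	"fmt"
	"time"
)

// Per-operation limits from the table above, in requests per second.
var ec2Limits = map[string]float64{
	"DescribeInstances":  20,
	"RunInstances":       5,
	"StartInstances":     5,
	"StopInstances":      5,
	"TerminateInstances": 5,
	"CreateTags":         10,
}

// tokenBucket is a minimal token bucket; it is not safe for concurrent use.
type tokenBucket struct {
	rate   float64 // tokens added per second
	burst  float64 // maximum stored tokens
	tokens float64
	last   time.Time
}

func newTokenBucket(ratePerSec, burst float64, now time.Time) *tokenBucket {
	return &tokenBucket{rate: ratePerSec, burst: burst, tokens: burst, last: now}
}

// Allow refills the bucket for the elapsed time, then consumes one token if available.
func (b *tokenBucket) Allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// simulate fires n back-to-back calls, advances a simulated clock by wait,
// fires n more, and returns how many calls were allowed in each batch.
func simulate(op string, n int, wait time.Duration) (first, second int) {
	now := time.Unix(0, 0)
	limit := ec2Limits[op]
	b := newTokenBucket(limit, limit, now) // burst = one second of budget (assumption)
	for i := 0; i < n; i++ {
		if b.Allow(now) {
			first++
		}
	}
	now = now.Add(wait)
	for i := 0; i < n; i++ {
		if b.Allow(now) {
			second++
		}
	}
	return first, second
}

func main() {
	first, second := simulate("RunInstances", 8, time.Second)
	fmt.Println(first, second) // 5 5
}
```

A production limiter would typically block until a token is available instead of rejecting the call, and would need locking because reconcilers run concurrently.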
Error Handling
The provider interface defines common error types:
| Error | Description |
|---|---|
| InstanceNotFoundError | Instance does not exist |
| RateLimitError | API rate limit exceeded |
| InsufficientCapacityError | Insufficient capacity in region/AZ |
| InvalidConfigError | Invalid launch configuration |
The controller handles these errors appropriately:
- InstanceNotFoundError: Cleans up orphaned Kubernetes node
- RateLimitError: Retries with exponential backoff
- InsufficientCapacityError: Tries different subnets/AZs
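The dispatch described above maps naturally onto `errors.As`. In the sketch below, the error struct fields, the `actionFor` and `backoff` helpers, and the backoff parameters are all assumptions; only the error type names and the three reactions come from this document. InvalidConfigError is left out because its handling is not described here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical definitions named after the error table; real fields may differ.
type InstanceNotFoundError struct{ InstanceID string }

func (e *InstanceNotFoundError) Error() string { return "instance not found: " + e.InstanceID }

type RateLimitError struct{ Operation string }

func (e *RateLimitError) Error() string { return "rate limit exceeded: " + e.Operation }

type InsufficientCapacityError struct{ Subnet string }

func (e *InsufficientCapacityError) Error() string { return "insufficient capacity in " + e.Subnet }

// actionFor picks a reaction for an error, looking through wrapped errors.
func actionFor(err error) string {
	var notFound *InstanceNotFoundError
	var rateLimited *RateLimitError
	var noCapacity *InsufficientCapacityError
	switch {
	case err == nil:
		return "none"
	case errors.As(err, &notFound):
		return "clean up orphaned node"
	case errors.As(err, &rateLimited):
		return "retry with backoff"
	case errors.As(err, &noCapacity):
		return "try another subnet"
	default:
		return "unhandled"
	}
}

// backoff doubles base once per attempt and caps the result at max.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	wrapped := fmt.Errorf("start i-123: %w", &RateLimitError{Operation: "StartInstances"})
	fmt.Println(actionFor(wrapped)) // retry with backoff
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(backoff(attempt, time.Second, 5*time.Second)) // 1s, 2s, 4s, 5s
	}
}
```

Matching with `errors.As` keeps the dispatch working when a provider wraps the typed error with extra context.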
Future Providers
The cloud provider architecture is designed to support additional providers:
- GCP - Google Compute Engine instances (GCPNodeClass)
- Azure - Azure Virtual Machines (AzureNodeClass)
Adding a new provider involves the following steps:
- Implement the CloudProvider interface
- Create a new NodeClass CRD (e.g., GCPNodeClass)
- Implement the provider-specific LaunchInstance method
- Register the provider in the factory
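The registration step can be sketched as a name-to-constructor map. The `Registry` type, the `Factory` signature, and the single-method `CloudProvider` interface below are simplifications invented for this example; the actual Stratos factory may be organized differently.

```go
package main

import (
	"fmt"
	"sort"
)

// CloudProvider is reduced to one method to keep the sketch short.
type CloudProvider interface {
	Name() string
}

// Factory constructs a provider (hypothetical signature).
type Factory func() (CloudProvider, error)

// Registry maps --cloud-provider values to constructors.
type Registry struct {
	factories map[string]Factory
}

func NewRegistry() *Registry {
	return &Registry{factories: make(map[string]Factory)}
}

// Register adds a provider constructor and rejects duplicate names.
func (r *Registry) Register(name string, f Factory) error {
	if _, exists := r.factories[name]; exists {
		return fmt.Errorf("provider %q already registered", name)
	}
	r.factories[name] = f
	return nil
}

// New builds the named provider or lists the registered names in the error.
func (r *Registry) New(name string) (CloudProvider, error) {
	f, ok := r.factories[name]
	if !ok {
		known := make([]string, 0, len(r.factories))
		for n := range r.factories {
			known = append(known, n)
		}
		sort.Strings(known)
		return nil, fmt.Errorf("unknown cloud provider %q (registered: %v)", name, known)
	}
	return f()
}

type fakeProvider struct{}

func (fakeProvider) Name() string { return "fake" }

func newFake() (CloudProvider, error) { return fakeProvider{}, nil }

func main() {
	r := NewRegistry()
	if err := r.Register("fake", newFake); err != nil {
		panic(err)
	}
	p, err := r.New("fake")
	fmt.Println(p.Name(), err)
	_, err = r.New("gcp")
	fmt.Println(err)
}
```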
GCP and Azure support is planned but not yet implemented. Contributions are welcome.
Next Steps
- AWSNodeClass Reference - Complete AWSNodeClass API
- AWS Setup - Configure AWS prerequisites
- NodePool API - Complete NodePool API reference