Karpenter vs Cast.ai on EKS: I Ran Both in Production and Here’s What I Found
A technical evaluation of two node autoscalers across provisioning speed, driver protection, operational burden, and real cost — with a live 90-day benchmark. The answer is not what most people expect.
Bottom line upfront: neither tool wins outright. For the 24/7 streaming benchmark below, Karpenter worked out $30.73/day cheaper once the Cast.ai licence is included; for batch workloads, Cast.ai's bin-packing and rightsizing earn their keep. The practical recommendation is to run both.
Why This Comparison Needed to Exist
Karpenter and Cast.ai look like they do the same job. Both watch for pods that can’t be scheduled, spin up EC2 nodes to fit them, and clean up nodes when they’re no longer needed.
The reason for this comparison was curiosity about Cast.ai’s cost optimisation claims — and whether its broader feature set (live pricing intelligence, VPA rightsizing, continuous bin-packing) justified switching away from a tool that was already working well.
There is no single document that puts these two tools side by side the way an engineer actually needs. The official documentation for each tool is good within its own world, but crossing between them required extensive cross-referencing and hands-on testing.
These tools are solving the same problem from completely opposite architectural positions. Those positions have measurable consequences for how fast your jobs start, how much your compute bill is, and whether your Spark driver survives a consolidation cycle.
The Fundamental Difference Nobody Explains Clearly
The one question that unlocks everything else: who makes the provisioning decision, and where?
Karpenter lives entirely inside your cluster. When a pod goes Pending, its controller reacts within milliseconds: NodePool constraints and pricing data are already cached in memory, so it fires an ec2:CreateFleet call to AWS using credentials held via IRSA. The entire path from “pod is stuck” to “EC2 API called” happens inside your cluster in roughly two seconds.
Cast.ai’s intelligence lives on Cast.ai’s servers. A lightweight agent called castai-agent runs inside your cluster and ships all cluster state changes to Cast.ai’s cloud platform every 15 seconds. The actual decision is made on Cast.ai’s backend using ML models trained across their entire customer fleet and live AWS pricing APIs. Once the decision is made, Cast.ai’s platform calls ec2:RunInstances directly in your AWS account via a cross-account IAM role.
That 15-second sync interval is the trade-off Cast.ai makes to access better intelligence. A pending pod might wait up to 15 seconds before Cast.ai even starts thinking about provisioning. In exchange, every provisioning decision has access to live pricing data and a bin-packing engine that a local rule-based system simply cannot replicate.
| Dimension | Karpenter | Cast.ai |
|---|---|---|
| Decision location | Inside your cluster | Cast.ai’s cloud platform |
| Intelligence | Rule-based NodePool constraints | ML models + live AWS pricing |
| Pricing data | Cached, refreshed periodically | Queried live at decision time |
| Internet required? | No — fully in-cluster | Yes — outbound HTTPS (PrivateLink available) |
| Multi-cluster learning | No — per cluster only | Yes — learns across all customers |
| Deployment model | Open source, self-managed | Commercial SaaS |
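To make the “rule-based NodePool constraints” row concrete, here is a minimal NodePool sketch. It is illustrative only, assuming the Karpenter v1 API; the pool name, EC2NodeClass name, and limits are examples, not the benchmark cluster's configuration.

```yaml
# Illustrative only: a minimal Karpenter v1 NodePool. Every value here is an
# assumption for the example, not the configuration used in the benchmark.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose          # hypothetical pool name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # hypothetical EC2NodeClass
      requirements:
        # Static, rule-based constraints evaluated entirely in-cluster
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
  limits:
    cpu: "200"                   # hard ceiling on provisioned vCPUs for this pool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

Everything the controller needs to satisfy a Pending pod is in this object and its cached pricing data, which is why the decision path stays in the low single-digit seconds.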
Provisioning Speed: Karpenter’s Structural Advantage
Speed matters most for interactive workloads and batch jobs with tight SLAs. For a Spark executor that needs to start within 30 seconds, the difference between two seconds and 17 seconds is the difference between meeting your SLA and missing it.
Karpenter: when a pod goes Pending, the controller is notified within milliseconds and fires ec2:CreateFleet. The time from pod Pending to EC2 API call is typically 1–3 seconds. The practical latency from “pod is stuck” to “pod is running” is dominated by EC2 boot time (~90s), not Karpenter’s decision time. Karpenter adds essentially zero overhead.
Cast.ai: the castai-agent ships cluster state every 15 seconds. A pod that goes Pending one second after a sync cycle waits up to 14 seconds before Cast.ai’s backend even sees it. The practical overhead before EC2 receives the request is 15–20 seconds.
| Metric | Karpenter | Cast.ai |
|---|---|---|
| Time to EC2 API call | ~2 seconds | 15–20 seconds |
| Decision location | In-cluster, in-memory | SaaS backend (network round-trip) |
| Practical pod start (cold) | EC2 boot time (~90s) | EC2 boot time + 15–20s overhead |
Spark and EMR on EKS: The Specific Considerations
Most content about Karpenter and Cast.ai is written for microservice workloads. Spark and EMR on EKS have different characteristics that change the calculus significantly.
The Driver Pod Problem
In a Spark job, the driver pod is the job. If the driver is evicted, the entire job fails. Both tools have protection mechanisms, but they work differently.
Karpenter uses the standard Kubernetes annotation:
```yaml
karpenter.sh/do-not-disrupt: "true"
```
This prevents Karpenter’s consolidation from evicting the annotated pod. It does not protect against VPA rightsizing, because Karpenter doesn’t do VPA.
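In practice the annotation has to end up on the driver pod itself. One way to do that, shown as a sketch (the file name is illustrative; it assumes you pass a driver pod template via spark.kubernetes.driver.podTemplateFile, which is how both open-source Spark on EKS and EMR on EKS job configurations typically inject pod metadata):

```yaml
# driver-pod-template.yaml -- illustrative sketch only.
# Point Spark at it with spark.kubernetes.driver.podTemplateFile; Spark merges
# this metadata into the driver pod it creates.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Tells Karpenter's consolidation to leave the node this pod runs on alone
    karpenter.sh/do-not-disrupt: "true"
```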
Cast.ai requires two separate annotations — one for eviction protection and one for VPA protection:
```yaml
annotations:
  # Prevents Cast.ai Evictor from consolidating the node the driver is on
  autoscaling.cast.ai/removal-disabled: "true"
  # Disables Cast.ai vertical rightsizing for the driver only
  workloads.cast.ai/configuration: |
    vertical:
      optimization: off
```
24/7 Streaming Workloads
Karpenter NodePools carry an expireAfter TTL that forces node replacement after a configurable period. For a streaming job, this means the node will eventually be drained and replaced — potentially disrupting the driver.
Cast.ai has no built-in node TTL. A node running a protected streaming job will not be evicted by the Evictor. For 24/7 streaming workloads, this is a meaningful operational difference.
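If streaming jobs do stay on Karpenter capacity, the expiry behaviour is configurable per pool. A minimal sketch, assuming the Karpenter v1 NodePool API (pool and EC2NodeClass names are illustrative), that stretches the TTL and limits consolidation to empty nodes:

```yaml
# Illustrative sketch for a pool dedicated to 24/7 streaming jobs.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: streaming                # hypothetical pool name
spec:
  template:
    spec:
      # v1 accepts "Never" as well as long durations such as 720h
      expireAfter: Never
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # hypothetical EC2NodeClass
  disruption:
    # Only reclaim nodes that are completely empty; never repack live drivers
    consolidationPolicy: WhenEmpty
    consolidateAfter: 10m
```

Combined with the do-not-disrupt annotation on the driver, this narrows the gap with Cast.ai's no-TTL behaviour, at the cost of nodes that can drift from the latest AMI.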
Cost Analysis: A Live 90-Day Node Benchmark
I ran both autoscalers simultaneously against the same workload: a 24/7 location streaming job running continuously. Same job, same data throughput, same load profile.
Cast.ai’s node selection for the job:

| Instance | $/day | vCPU |
|---|---|---|
| c6g.16xlarge Spot | $11.76 | 64 |
| m6g.12xlarge On-Demand | $37.15 | 48 |
| inf1.2xlarge Spot | $2.62 | 8 |
| m6g.2xlarge On-Demand | $6.19 | 8 |
| c6g.large Spot | $0.62 | 2 |
| Total | $58.34 | 130 |
Karpenter’s node selection for the same job:

| Instance | $/day | vCPU |
|---|---|---|
| c6g.8xlarge Spot | $9.41 | 32 |
| c6g.8xlarge Spot | $9.41 | 32 |
| c6g.8xlarge On-Demand ⚑ | $21.02 | 32 |
| c6g.8xlarge On-Demand ⚑ | $21.02 | 32 |
| i4g.2xlarge Spot | $3.38 | 8 |
| r6gd.2xlarge On-Demand | $9.22 | 8 |
| Total | $73.46 | 144 |
Cast.ai Licence Economics
Cast.ai charges €8 per billable vCPU per month. The critical distinction: billable vCPUs = average CPU actually utilised by workloads, not total provisioned vCPUs.
For this 24/7 streaming job, CPU utilisation was measured at a consistent 5 vCPU/day.
Total Cost Comparison
| Line item | Cast.ai / day | Karpenter / day |
|---|---|---|
| AWS compute | $58.34 | $73.46 |
| Cast.ai licence | $45.85 | $0 |
| Total cost | $104.19 | $73.46 |
| Difference | Karpenter $30.73/day cheaper → $921/month, $11,052/year | |
The Complete Head-to-Head Scorecard
| Dimension | Karpenter | Cast.ai |
|---|---|---|
| Provisioning speed | ★★★★★ Fastest | ★★★☆☆ |
| Compute savings | ★★★☆☆ | ★★★★☆ 20–35%* |
| Optimisation intelligence | Rule-based NodePool | ML + live AWS pricing |
| VPA (vertical autoscaling) | ✗ | ✓ Built-in, per pod |
| Spot interruption response | ★★★★★ SQS + EventBridge | ★★★☆☆ |
| Spot interruption prevention | ★★☆☆☆ No predictive capability | ★★★★★ ML-based rebalancing |
| Driver pod protection | Eviction only | Eviction + VPA |
| 24/7 streaming safety | Until expireAfter TTL | Indefinite (no node TTL) |
| Maintenance burden | High — team owns everything | Low — mostly automated |
| Licence cost | Free | €8/billable vCPU/month |
Recommendation: Run Both — Cast.ai for Batch, Karpenter for Streaming
Run Cast.ai as the provisioner for batch workloads — ETL pipelines, nightly aggregation, ML training, any Spark job that starts, processes, and terminates. Run Karpenter for 24/7 streaming workloads where sustained CPU utilisation makes the Cast.ai licence uneconomical.
One critical operational detail in a dual-provisioner setup: Cast.ai’s castai-workload-autoscaler operates cluster-wide. It watches CPU and memory usage across all pods, regardless of which provisioner created the underlying node. Apply all three protections to every Spark driver:
```yaml
annotations:
  # Prevents Karpenter from disrupting this pod during node consolidation
  # Source: https://karpenter.sh/docs/concepts/disruption/
  karpenter.sh/do-not-disrupt: "true"
  # Prevents Cast.ai Evictor from removing this pod for consolidation
  # Source: https://docs.cast.ai/docs/evictor
  autoscaling.cast.ai/removal-disabled: "true"
  # Disables Cast.ai vertical rightsizing for CPU and memory on the driver
  # Source: https://docs.cast.ai/docs/workload-autoscaler-annotations-reference
  workloads.cast.ai/configuration: |
    vertical:
      optimization: off
```
Leave executor pods unannotated so Cast.ai can rightsize them freely.
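Routing also matters in the dual setup: the streaming job has to land on Karpenter capacity while batch jobs fall through to Cast.ai. One way to arrange that, shown as a sketch (it assumes the streaming NodePool from earlier carries a dedicated taint and label; the workload-class key is hypothetical, not a Karpenter or Cast.ai convention), is to have only the streaming pods tolerate and select that pool, so everything else stays Pending until Cast.ai provisions for it:

```yaml
# Illustrative sketch. Taint/label key "workload-class" is hypothetical.
# 1) The Karpenter streaming NodePool advertises a taint and a label.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: streaming
spec:
  template:
    metadata:
      labels:
        workload-class: streaming           # hypothetical label
    spec:
      taints:
        - key: workload-class
          value: streaming
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
---
# 2) Only streaming pods tolerate the taint and select the label, so batch
# pods never fit on these nodes and are provisioned for by Cast.ai instead.
# In practice this spec would come from a Spark driver pod template rather
# than being applied directly.
apiVersion: v1
kind: Pod
metadata:
  name: streaming-driver                    # illustrative; normally created by Spark
spec:
  nodeSelector:
    workload-class: streaming
  tolerations:
    - key: workload-class
      operator: Equal
      value: streaming
      effect: NoSchedule
  containers:
    - name: spark-kubernetes-driver
      image: apache/spark                   # placeholder image for illustration
```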