
Karpenter vs Cast.ai on EKS: I Ran Both in Production and Here’s What I Found

A technical evaluation of two node autoscalers across provisioning speed, driver protection, operational burden, and real cost — with a live 90-day benchmark. The answer is not what most people expect.

Chinmaya Kumar Mishra
Principal Platform Engineer · EKS Architect · CKA · AWS SAA · March 2026

Bottom line upfront

**Use Cast.ai for** batch & low-utilisation workloads — ETL, nightly aggregation, ML training runs, any Spark job that starts, processes, and terminates.

**Use Karpenter for** 24/7 streaming workloads — location streaming, event pipelines, real-time aggregation, where sustained CPU makes Cast.ai’s licence uneconomical.

Why This Comparison Needed to Exist

Karpenter and Cast.ai look like they do the same job. Both watch for pods that can’t be scheduled, spin up EC2 nodes to fit them, and clean up nodes when they’re no longer needed.

This comparison started as curiosity about Cast.ai’s cost-optimisation claims — and whether its broader feature set (live pricing intelligence, VPA rightsizing, continuous bin-packing) justified switching away from a tool that was already working well.

There is no single document that puts these two tools side by side the way an engineer actually needs. The official documentation for each tool is good within its own world, but crossing between them required extensive cross-referencing and hands-on testing.

These tools are solving the same problem from completely opposite architectural positions. Those positions have measurable consequences for how fast your jobs start, how much your compute bill is, and whether your Spark driver survives a consolidation cycle.


The Fundamental Difference Nobody Explains Clearly

The one question that unlocks everything else: who makes the provisioning decision, and where?

Karpenter lives entirely inside your cluster. When a pod goes Pending, its controller reacts within milliseconds. It already has NodePool constraints loaded in memory, cached pricing data, and fires an ec2:CreateFleet call to AWS using credentials held via IRSA. The entire path from “pod is stuck” to “EC2 API called” happens inside your cluster in roughly two seconds.
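To make that concrete, here is the shape of the constraints the controller holds in memory: a minimal NodePool sketch using the Karpenter v1 API, with illustrative values rather than my production config.

```yaml
# Minimal NodePool sketch (Karpenter v1 API). All values are illustrative.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spark-general
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        # Allow both Spot and on-demand; Karpenter prefers Spot when both fit
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Graviton (arm64), matching the instance families in the benchmark below
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
  limits:
    cpu: "512"   # hard ceiling on total vCPU this pool may provision
```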

Cast.ai’s intelligence lives on Cast.ai’s servers. A lightweight agent called castai-agent runs inside your cluster and ships all cluster state changes to Cast.ai’s cloud platform every 15 seconds. The actual decision is made on Cast.ai’s backend using ML models trained across their entire customer fleet and live AWS pricing APIs. Once the decision is made, Cast.ai’s platform calls ec2:RunInstances directly in your AWS account via a cross-account IAM role.

That 15-second sync interval is the trade-off Cast.ai makes to access better intelligence. A pending pod might wait up to 15 seconds before Cast.ai even starts thinking about provisioning. In exchange, every provisioning decision has access to live pricing data and a bin-packing engine that a local rule-based system simply cannot replicate.

| Dimension | Karpenter | Cast.ai |
|---|---|---|
| Decision location | Inside your cluster | Cast.ai’s cloud platform |
| Intelligence | Rule-based NodePool constraints | ML models + live AWS pricing |
| Pricing data | Cached, refreshed periodically | Queried live at decision time |
| Internet required? | No — fully in-cluster | Yes — outbound HTTPS (PrivateLink available) |
| Multi-cluster learning | No — per cluster only | Yes — learns across all customers |
| Deployment model | Open source, self-managed | Commercial SaaS |

Provisioning Speed: Karpenter’s Structural Advantage

Speed matters most for interactive workloads and batch jobs with tight SLAs. For a Spark executor that needs to start within 30 seconds, the difference between two seconds and 17 seconds is the difference between meeting your SLA and missing it.

Karpenter: when a pod goes Pending, the controller is notified within milliseconds and fires ec2:CreateFleet. The time from pod Pending to EC2 API call is typically 1–3 seconds. The practical latency from “pod is stuck” to “pod is running” is dominated by EC2 boot time (~90s), not Karpenter’s decision time. Karpenter adds essentially zero overhead.

Cast.ai: the castai-agent ships cluster state every 15 seconds. A pod that goes Pending one second after a sync cycle waits up to 14 seconds before Cast.ai’s backend even sees it. The practical overhead before EC2 receives the request is 15–20 seconds.

| Metric | Karpenter | Cast.ai |
|---|---|---|
| Time to EC2 API call | ~2 seconds | 15–20 seconds |
| Decision location | In-cluster, in-memory | SaaS backend (network round-trip) |
| Practical pod start (cold) | EC2 boot time (~90s) | EC2 boot time + 15–20s overhead |

Spark and EMR on EKS: The Specific Considerations

Most content about Karpenter and Cast.ai is written for microservice workloads. Spark and EMR on EKS have different characteristics that change the calculus significantly.

The Driver Pod Problem

In a Spark job, the driver pod is the job. If the driver is evicted, the entire job fails. Both tools have protection mechanisms, but they work differently.

Karpenter uses a single annotation, applied to the pod:

```yaml
annotations:
  karpenter.sh/do-not-disrupt: "true"
```

This prevents Karpenter’s consolidation from evicting the annotated pod. It does not protect against VPA rightsizing, because Karpenter doesn’t do VPA.
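In practice you attach this to the driver through a pod template. A sketch for Spark on Kubernetes follows (the file name is hypothetical); EMR on EKS supports the same property, typically pointing at a template stored in S3.

```yaml
# driver-pod-template.yaml (hypothetical name), wired in with:
#   --conf spark.kubernetes.driver.podTemplateFile=driver-pod-template.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Karpenter consolidation will not voluntarily evict this pod's node
    karpenter.sh/do-not-disrupt: "true"
```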

Cast.ai requires two separate annotations — one for eviction protection and one for VPA protection:

**Spark driver protection — both annotations required:**

```yaml
annotations:
  # Prevents Cast.ai Evictor from consolidating the node the driver is on
  autoscaling.cast.ai/removal-disabled: "true"

  # Disables Cast.ai vertical rightsizing for the driver only
  workloads.cast.ai/configuration: |
    vertical:
      optimization: off
```
**Critical detail:** Missing either annotation exposes the driver to that specific risk. Executor pods do not need these annotations — a VPA-driven eviction of an executor is just a task retry, which Spark handles natively.

24/7 Streaming Workloads

Karpenter has an expireAfter TTL on NodePool that forces node replacement after a configurable period. For a streaming job, this means the node will eventually be drained and replaced — potentially disrupting the driver.

Cast.ai has no built-in node TTL. A node running a protected streaming job will not be evicted by the Evictor. For 24/7 streaming workloads, this is a meaningful operational difference.
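If a streaming job has to stay on Karpenter, the TTL can be relaxed per NodePool. A sketch, assuming the v1 API where expireAfter sits on the NodeClaim template; the pool name is illustrative:

```yaml
# Sketch: disabling forced node expiry for a streaming pool (Karpenter v1).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: streaming
spec:
  template:
    spec:
      expireAfter: Never   # default 720h; Never disables age-based replacement
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```

The trade-off is that these nodes are never recycled onto fresh AMIs, so patching becomes a manual concern.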


Cost Analysis: A Live 90-Day Node Benchmark

I ran both autoscalers simultaneously against the same workload: a 24/7 location streaming job. Same job, same data throughput, same load profile.

**Cast.ai fleet — 5 nodes**

| Instance | $/day | vCPU |
|---|---|---|
| c6g.16xlarge Spot | $11.76 | 64 |
| m6g.12xlarge OD | $37.15 | 48 |
| inf1.2xlarge Spot | $2.62 | 8 |
| m6g.2xlarge OD | $6.19 | 8 |
| c6g.large Spot | $0.62 | 2 |
| **Total** | **$58.34** | **130** |
**Karpenter fleet — 6 nodes**

| Instance | $/day | vCPU |
|---|---|---|
| c6g.8xlarge Spot | $9.41 | 32 |
| c6g.8xlarge Spot | $9.41 | 32 |
| c6g.8xlarge OD ⚑ | $21.02 | 32 |
| c6g.8xlarge OD ⚑ | $21.02 | 32 |
| i4g.2xlarge Spot | $3.38 | 8 |
| r6gd.2xlarge OD | $9.22 | 8 |
| **Total** | **$73.46** | **144** |

**⚑ Memory saturation risk:** Two Karpenter nodes are running at 99% memory utilisation — an active OOM risk that Cast.ai’s continuous rightsizing engine would detect and rebalance automatically.

Cast.ai Licence Economics

Cast.ai charges €8 per billable vCPU per month. The critical distinction: billable vCPUs are the average CPU actually utilised by workloads, not the total provisioned vCPUs.

For this 24/7 streaming job, CPU utilisation was measured at a consistent 5 vCPU.

**Common misread:** Multiplying €8 by the 130 provisioned vCPUs treats the licence as a function of cluster size. Cast.ai bills on utilisation, not provisioned capacity. For this job, the metered licence came to $45.85/day — roughly $1,375/month.

Total Cost Comparison

| Line item | Cast.ai / day | Karpenter / day |
|---|---|---|
| AWS compute | $58.34 | $73.46 |
| Cast.ai licence | $45.85 | $0 |
| **Total cost** | **$104.19** | **$73.46** |

**Difference:** Karpenter is $30.73/day cheaper → $921/month, $11,052/year.
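Spelled out, assuming a 30-day month:

$$\$104.19 - \$73.46 = \$30.73/\text{day} \;\Rightarrow\; 30.73 \times 30 \approx \$921/\text{month} \;\Rightarrow\; 921 \times 12 = \$11{,}052/\text{year}$$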
**Scope of these figures:** They reflect a 24/7 continuously running streaming job at a sustained 5 vCPU utilisation — the worst-case scenario for Cast.ai’s licence economics. For batch jobs running a few hours per day, billable vCPU averages down to a fraction of this figure, the licence cost drops accordingly, and Cast.ai’s compute saving easily absorbs it.

The Complete Head-to-Head Scorecard

| Dimension | Karpenter | Cast.ai |
|---|---|---|
| Provisioning speed | ★★★★★ Fastest | ★★★☆☆ |
| Compute savings | ★★★☆☆ | ★★★★☆ 20–35% |
| Optimisation intelligence | Rule-based NodePool | ML + live AWS pricing |
| VPA (vertical autoscaling) | ✗ None | ✓ Built-in, per pod |
| Spot interruption response | ★★★★★ SQS + EventBridge | ★★★☆☆ |
| Spot interruption prevention | ★★☆☆☆ No predictive capability | ★★★★★ ML-based rebalancing |
| Driver pod protection | Eviction only | Eviction + VPA |
| 24/7 streaming safety | Until expireAfter TTL | Indefinite — no node TTL |
| Maintenance burden | High — team owns everything | Low — mostly automated |
| Licence cost | Free | €8 / billable vCPU / month |

Recommendation: Run Both — Cast.ai for Batch, Karpenter for Streaming

Run Cast.ai as the provisioner for batch workloads — ETL pipelines, nightly aggregation, ML training, any Spark job that starts, processes, and terminates. Run Karpenter for 24/7 streaming workloads where sustained CPU utilisation makes the Cast.ai licence uneconomical.
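The two fleets also need to be kept apart, so batch pods land on Cast.ai nodes and streaming pods on Karpenter nodes. One way to do it (a sketch with illustrative names, not my production manifests) is a taint plus label on the Karpenter pool, with a matching toleration and nodeSelector on streaming pods only:

```yaml
# Sketch: reserving the Karpenter pool for streaming pods. Names illustrative.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: streaming
spec:
  template:
    metadata:
      labels:
        workload-class: streaming     # lets pods target this pool explicitly
    spec:
      taints:
        - key: workload-class
          value: streaming
          effect: NoSchedule          # keeps batch pods off these nodes
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
---
# Streaming pods opt in with a toleration plus nodeSelector. Batch pods lack
# the toleration, so this pool ignores them and Cast.ai provisions for them.
apiVersion: v1
kind: Pod
metadata:
  name: streaming-driver              # illustrative
spec:
  nodeSelector:
    workload-class: streaming
  tolerations:
    - key: workload-class
      operator: Equal
      value: streaming
      effect: NoSchedule
  containers:
    - name: driver
      image: apache/spark:3.5.1       # illustrative
```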

One critical operational detail in a dual-provisioner setup: Cast.ai’s castai-workload-autoscaler operates cluster-wide. It watches CPU and memory usage across all pods regardless of which provisioner created the underlying node. Apply all three protections to every Spark driver, regardless of which provisioner placed its node:

**Complete driver pod annotations for a dual-provisioner cluster:**

```yaml
annotations:
  # Prevents Karpenter from disrupting this pod during node consolidation
  # Source: https://karpenter.sh/docs/concepts/disruption/
  karpenter.sh/do-not-disrupt: "true"

  # Prevents Cast.ai Evictor from removing this pod for consolidation
  # Source: https://docs.cast.ai/docs/evictor
  autoscaling.cast.ai/removal-disabled: "true"

  # Disables Cast.ai vertical rightsizing for CPU and memory on the driver
  # Source: https://docs.cast.ai/docs/workload-autoscaler-annotations-reference
  workloads.cast.ai/configuration: |
    vertical:
      optimization: off
```

Leave executor pods unannotated so Cast.ai can rightsize them freely.

**The bottom line:** Two sets of configurations, two upgrade paths, two operational models. That overhead is real, but the trade-off is worth making. Batch jobs on Cast.ai keep the licence cost low and let the automation handle rightsizing and Spot selection. Streaming jobs on Karpenter carry no licence overhead and keep long-running workloads stable.
Chinmaya Kumar Mishra is a Principal Platform Engineer and EKS Architect at Siemens Smart Infrastructure, Pune. 18 years of engineering experience, 10+ of them cloud-native. CKA (Jan 2028) · AWS Solutions Architect Associate (Dec 2028). Creator of the Awskube CLI (25 IRSA/IAM validation rules, used org-wide) and the HORIZON federated observability architecture.

LinkedIn  ·  chinmaya.mishra0105@gmail.com  ·  Prometheus article