A European SaaS client came to us with a familiar problem: their AWS bill had grown from $15K to $38K per month over 18 months, while their user base had only doubled. Kubernetes made it easy to scale — and easy to waste money. Here is exactly how we brought the bill down to $20K.
The Starting State
The client ran 3 EKS clusters: production, staging, and development. All three ran 24/7 on m5.2xlarge on-demand instances. Key findings from our initial audit:
- Average CPU utilization across all nodes: 18%
- Average memory utilization: 34%
- Staging and dev clusters running 24/7 despite being used only during business hours (9 AM - 7 PM CET)
- No autoscaling configured — fixed node counts set months ago
- Over-provisioned resource requests — most pods requesting 2-4x their actual consumption
- 23 orphaned EBS volumes from old PVCs — $420/month for nothing
Phase 1: Right-Sizing (Weeks 1-2)
We deployed the Vertical Pod Autoscaler in recommendation mode across all namespaces and collected data for 14 days. The results were striking:
A Java API service was requesting 2 CPU and 4Gi memory but consistently using 400m CPU and 800Mi memory — 80% waste on a single deployment with 6 replicas. Multiply that across 47 services.
We adjusted resource requests and limits based on P95 usage with a 30% buffer. This alone allowed us to reduce the number of worker nodes from 24 to 14 in production.
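A VPA object in recommendation mode is a small manifest; the sketch below assumes a deployment named `java-api` (the name is illustrative):

```yaml
# VPA in recommendation mode: observes usage and publishes target
# requests without evicting or mutating any pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: java-api-vpa          # hypothetical name
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: java-api            # the over-provisioned service described above
  updatePolicy:
    updateMode: "Off"         # recommendation mode: no automatic changes
```

The recommendations appear under `status.recommendation.containerRecommendations`, which is where figures like the 400m CPU / 800Mi memory above come from.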
Savings: $5,200/month
Phase 2: Spot Instances for Stateless Workloads (Week 3)
We analyzed which workloads were spot-compatible:
- Stateless API services (12 services, 34 pods) — fully spot-compatible
- Web frontend servers — spot-compatible
- Background workers and queue consumers — spot-compatible
- Databases and stateful services — must stay on-demand
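Because every spot-compatible service runs multiple replicas, a PodDisruptionBudget keeps a spot interruption from taking down too many pods at once. A minimal sketch for one stateless API service (name and labels are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb       # hypothetical name
  namespace: production
spec:
  minAvailable: 2             # never voluntarily evict below 2 ready pods
  selector:
    matchLabels:
      app: api-service        # assumed label on the deployment's pods
```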
We configured mixed instance pools using Karpenter (replacing the default Cluster Autoscaler). Karpenter selects from 15+ instance types across 3 AZs, maximizing spot availability and minimizing interruptions.
Node affinity rules ensure stateful workloads (PostgreSQL, Redis, Elasticsearch) only schedule on on-demand nodes, while stateless services prefer spot nodes.
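A rough sketch of the Karpenter NodePool for the stateless tier, assuming a Karpenter v1 API and eu-west-1 (the region, instance categories, and node class name are assumptions, not the client's exact config):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateless-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # spot preferred, on-demand fallback
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]         # wide pool across instance families
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]  # assumed AZs
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                     # assumed EC2NodeClass
```

Stateful workloads can then pin to on-demand capacity with a `nodeSelector` of `karpenter.sh/capacity-type: on-demand`, a label Karpenter applies to every node it provisions.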
Across our instance mix, spot pricing averaged 68% below on-demand.
Savings: $7,400/month
Phase 3: Schedule Non-Production Environments (Week 3)
Staging and development clusters do not need to run at 3 AM. We implemented:
- Development cluster: Scales to zero nodes outside 8 AM - 8 PM CET, Monday-Friday
- Staging cluster: Scales to minimum (2 nodes) outside business hours, full capacity during CI/CD runs
Using a combination of Karpenter node scaling and CronJobs that cordon/drain nodes, we reduced non-production compute by 65%.
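The evening shutdown can be sketched as a CronJob that cordons and drains the worker nodes so Karpenter removes the now-empty capacity (the image, service account, and RBAC are assumptions; a matching morning job is unnecessary since Karpenter provisions nodes on demand when pods reappear):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-scale-down
  namespace: kube-system
spec:
  schedule: "0 20 * * 1-5"              # 8 PM, Monday-Friday
  timeZone: "Europe/Berlin"             # CET/CEST (Kubernetes 1.27+)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: node-scaler     # assumed SA with node drain RBAC
          restartPolicy: Never
          containers:
            - name: drain
              image: bitnami/kubectl:latest   # assumed kubectl image
              command:
                - /bin/sh
                - -c
                - |
                  # Cordon then drain every node; once pods are gone,
                  # Karpenter deprovisions the empty capacity.
                  for n in $(kubectl get nodes -o name); do
                    kubectl cordon "$n"
                    kubectl drain "$n" --ignore-daemonsets --delete-emptydir-data
                  done
```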
Savings: $3,800/month
Phase 4: Storage Cleanup and Optimization (Week 4)
- Deleted 23 orphaned EBS volumes ($420/month)
- Migrated log storage from gp3 EBS to S3 with lifecycle policies ($280/month saved)
- Switched development databases from io2 to gp3 storage class ($190/month saved)
- Implemented PVC auto-cleanup on namespace deletion
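The gp3 class used for the development databases is a standard EBS CSI StorageClass; a sketch of roughly what we applied:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com      # AWS EBS CSI driver
parameters:
  type: gp3                       # 3,000 IOPS and 125 MiB/s included in base price
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete             # prevents new orphaned volumes when PVCs are deleted
```

The `Delete` reclaim policy is what backs the PVC auto-cleanup above: volumes are removed with their claims instead of lingering as orphans.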
Savings: $890/month
Phase 5: Commitment-Based Pricing (Week 5)
After 4 weeks of optimized usage data, we had clear visibility into the baseline compute that would always be running. We purchased:
- 1-year Compute Savings Plan covering 70% of the remaining on-demand baseline
- This covered the production stateful workloads and minimum staging capacity
Savings: $1,100/month
Results Summary
| Category | Monthly Savings |
|----------|----------------|
| Right-sizing | $5,200 |
| Spot instances | $7,400 |
| Non-prod scheduling | $3,800 |
| Storage cleanup | $890 |
| Savings Plans | $1,100 |
| Total | $18,390 |
Final monthly bill: $19,610 (down from $38,000)

Reduction: 48%
What We Did NOT Do
Importantly, we did not:
- Change any application code
- Reduce the number of service replicas in production
- Compromise on availability or disaster recovery
- Introduce any new single points of failure
- Experience any downtime during the optimization
The application continued serving the same traffic at the same latency. Users noticed nothing. The CFO noticed a lot.
Lessons Learned
1. Start with visibility — You cannot optimize what you do not measure. Kubecost or OpenCost should be your first deployment.
2. Right-size first — It is the safest, highest-ROI optimization. Do it before anything else.
3. Spot is production-ready — With proper pod disruption budgets, multiple replicas, and mixed instance pools, spot interruptions are invisible to users.
4. Non-production waste is massive — Dev and staging environments often represent 30-40% of cloud spend but run unused 70% of the time.
5. Commit last — Only purchase reservations after you have optimized. Otherwise you are locking in waste.
Conclusion
Cloud cost optimization is not about cutting corners — it is about eliminating waste. Every dollar spent on idle compute is a dollar not spent on product development, hiring, or customer acquisition. The tools and techniques in this case study are applicable to any Kubernetes environment running on public cloud, and the 30-50% savings range is consistently achievable.