A European SaaS client came to us with a familiar problem: their AWS bill had grown from $15K to $38K per month over 18 months, while their user base had only doubled. Kubernetes made it easy to scale — and easy to waste money. Here is exactly how we brought the bill down to $20K.
The Starting State
The client ran 3 EKS clusters: production, staging, and development. All three ran 24/7 on m5.2xlarge on-demand instances. Key findings from our initial audit:
- Average CPU utilization across all nodes: 18%
- Average memory utilization: 34%
- Staging and dev clusters running 24/7 despite being used only during business hours (9 AM - 7 PM CET)
- No autoscaling configured — fixed node counts set months ago
- Over-provisioned resource requests — most pods requesting 2-4x their actual consumption
- 23 orphaned EBS volumes from old PVCs — $420/month for nothing
Phase 1: Right-Sizing (Weeks 1-2)
We deployed the Vertical Pod Autoscaler in recommendation mode across all namespaces and collected data for 14 days. The results were striking:
A Java API service was requesting 2 CPU and 4Gi memory but consistently using 400m CPU and 800Mi memory — 80% waste on a single deployment with 6 replicas. Multiply that across 47 services.
We adjusted resource requests and limits based on P95 usage with a 30% buffer. This alone allowed us to reduce the number of worker nodes from 24 to 14 in production.
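A VPA object in recommendation mode is a small manifest; the sketch below assumes a deployment named `java-api` (the name is illustrative):

```yaml
# VPA in recommendation mode: observes usage and publishes target
# requests without evicting or mutating any pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: java-api-vpa          # hypothetical name
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: java-api            # the over-provisioned service described above
  updatePolicy:
    updateMode: "Off"         # recommendation mode: no automatic changes
```

The recommendations appear under `status.recommendation.containerRecommendations`, which is where figures like the 400m CPU / 800Mi memory above come from.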
Savings: $5,200/month
Phase 2: Spot Instances for Stateless Workloads (Week 3)
We analyzed which workloads were spot-compatible:
- Stateless API services (12 services, 34 pods) — fully spot-compatible
- Web frontend servers — spot-compatible
- Background workers and queue consumers — spot-compatible
- Databases and stateful services — must stay on-demand
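Because every spot-compatible service runs multiple replicas, a PodDisruptionBudget keeps a spot interruption from taking down too many pods at once. A minimal sketch for one stateless API service (name and labels are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb       # hypothetical name
  namespace: production
spec:
  minAvailable: 2             # never voluntarily evict below 2 ready pods
  selector:
    matchLabels:
      app: api-service        # assumed label on the deployment's pods
```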
We configured mixed instance pools using Karpenter (replacing the default Cluster Autoscaler). Karpenter selects from 15+ instance types across 3 AZs, maximizing spot availability and minimizing interruptions.
Node affinity rules ensure stateful workloads (PostgreSQL, Redis, Elasticsearch) only schedule on on-demand nodes, while stateless services prefer spot nodes.
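A rough sketch of the Karpenter NodePool for the stateless tier, assuming a Karpenter v1 API and eu-west-1 (the region, instance categories, and node class name are assumptions, not the client's exact config):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateless-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # spot preferred, on-demand fallback
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]         # wide pool across instance families
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]  # assumed AZs
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                     # assumed EC2NodeClass
```

Stateful workloads can then pin to on-demand capacity with a `nodeSelector` of `karpenter.sh/capacity-type: on-demand`, a label Karpenter applies to every node it provisions.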
Across our instance mix, spot pricing averaged 68% below on-demand.
Savings: $7,400/month
Phase 3: Schedule Non-Production Environments (Week 3)
Staging and development clusters do not need to run at 3 AM. We implemented:
- Development cluster: Scales to zero nodes outside 8 AM - 8 PM CET, Monday-Friday
- Staging cluster: Scales to minimum (2 nodes) outside business hours, full capacity during CI/CD runs
Using a combination of Karpenter node scaling and CronJobs that cordon/drain nodes, we reduced non-production compute by 65%.
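The evening shutdown can be sketched as a CronJob that cordons and drains the worker nodes so Karpenter removes the now-empty capacity (the image, service account, and RBAC are assumptions; a matching morning job is unnecessary since Karpenter provisions nodes on demand when pods reappear):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-scale-down
  namespace: kube-system
spec:
  schedule: "0 20 * * 1-5"              # 8 PM, Monday-Friday
  timeZone: "Europe/Berlin"             # CET/CEST (Kubernetes 1.27+)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: node-scaler     # assumed SA with node drain RBAC
          restartPolicy: Never
          containers:
            - name: drain
              image: bitnami/kubectl:latest   # assumed kubectl image
              command:
                - /bin/sh
                - -c
                - |
                  # Cordon then drain every node; once pods are gone,
                  # Karpenter deprovisions the empty capacity.
                  for n in $(kubectl get nodes -o name); do
                    kubectl cordon "$n"
                    kubectl drain "$n" --ignore-daemonsets --delete-emptydir-data
                  done
```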
Savings: $3,800/month
Phase 4: Storage Cleanup and Optimization (Week 4)
- Deleted 23 orphaned EBS volumes ($420/month)
- Migrated log storage from gp3 EBS to S3 with lifecycle policies ($280/month saved)
- Switched development databases from io2 to gp3 storage class ($190/month saved)
- Implemented PVC auto-cleanup on namespace deletion
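The gp3 class used for the development databases is a standard EBS CSI StorageClass; a sketch of roughly what we applied:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com      # AWS EBS CSI driver
parameters:
  type: gp3                       # 3,000 IOPS and 125 MiB/s included in base price
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete             # prevents new orphaned volumes when PVCs are deleted
```

The `Delete` reclaim policy is what backs the PVC auto-cleanup above: volumes are removed with their claims instead of lingering as orphans.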
Savings: $890/month
Phase 5: Commitment-Based Pricing (Week 5)
After 4 weeks of optimized usage data, we had clear visibility into the baseline compute that would always be running. We purchased:
- 1-year Compute Savings Plan covering 70% of the remaining on-demand baseline
- This covered the production stateful workloads and minimum staging capacity
Savings: $1,100/month
Results Summary
| Category | Monthly Savings |
|----------|----------------|
| Right-sizing | $5,200 |
| Spot instances | $7,400 |
| Non-prod scheduling | $3,800 |
| Storage cleanup | $890 |
| Savings Plans | $1,100 |
| Total | $18,390 |
Final monthly bill: $19,610 (down from $38,000)

Reduction: 48%
What We Did NOT Do
Importantly, we did not:
- Change any application code
- Reduce the number of service replicas in production
- Compromise on availability or disaster recovery
- Introduce any new single points of failure
- Experience any downtime during the optimization
The application continued serving the same traffic at the same latency. Users noticed nothing. The CFO noticed a lot.
Lessons Learned
1. Start with visibility — You cannot optimize what you do not measure. Kubecost or OpenCost should be your first deployment.
2. Right-size first — It is the safest, highest-ROI optimization. Do it before anything else.
3. Spot is production-ready — With proper pod disruption budgets, multiple replicas, and mixed instance pools, spot interruptions are invisible to users.
4. Non-production waste is massive — Dev and staging environments often represent 30-40% of cloud spend but run unused 70% of the time.
5. Commit last — Only purchase reservations after you have optimized. Otherwise you are locking in waste.
Conclusion
Cloud cost optimization is not about cutting corners — it is about eliminating waste. Every dollar spent on idle compute is a dollar not spent on product development, hiring, or customer acquisition. The tools and techniques in this case study are applicable to any Kubernetes environment running on public cloud, and the 30-50% savings range is consistently achievable.