A Lesson Learned for Cost Optimization with Kubernetes

Blog Team

Efficiency, in both cost and operations, is crucial. It is a guiding principle for us, and it drove us to undertake a large-scale infrastructure transformation that saved money, improved development speed, and strengthened our governance practices. Spoiler: it involved moving from VMware to Kubernetes, and the journey was worth every step. If you want to boost your career or get a behind-the-scenes look at what our tech team accomplished, stick around. There's a lot you can learn from our experience.

Why We Made the Shift

Our client, a major US retailer, had to solve some big challenges: skyrocketing infrastructure costs, difficulties scaling during peak seasons, limited visibility into resource use, and operational inefficiencies. Imagine your yearly IT expenses growing faster than your revenue! That's what they were up against, and the existing VMware setup wasn't cutting it anymore.

The solution? A migration to Kubernetes that promised better cost visibility, more efficient scaling, and tighter governance. The outcome was a 35% drop in infrastructure costs, faster releases, and a significant reduction in security incidents. This blog post will unpack how we did it and what you can take away from our approach.

Key Areas of Focus

During this transformation, we zeroed in on four core areas:

Cost Observability: Understanding where our money was going.
Cost Optimization: Cutting down expenses without compromising performance.
Compute Fleet Management: Efficiently managing and scaling our computing power.
Governance: Keeping everything compliant, secure, and well-managed.

How We Did It

1. Planning and Assessment

The first step was understanding what we had and where we wanted to go. This meant auditing the existing VMware infrastructure and identifying pain points across different teams—like security, finance, operations, and application development. We gathered input from stakeholders to make sure everyone's needs were addressed.

Infrastructure Audit: We cataloged all hardware, virtual machines, and application dependencies. We used tools like VMware vRealize Operations Manager to analyze resource usage and identify inefficiencies.
Cost Analysis: Using AWS Cost Explorer and other cloud cost management tools, we broke down expenses for servers, storage, licensing, and even hidden costs like over-provisioning of resources.
Risk Assessment: We developed strategies to mitigate potential risks, such as running Kubernetes-native security tools like kube-bench to find security gaps and performing load tests to predict performance during migration.
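
To make the over-provisioning idea concrete, here is a minimal Python sketch of how an audit export could be screened for oversized VMs. It is not the tooling used in the audit: the VmUsage type, both thresholds, and the sample numbers are assumptions for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class VmUsage:
    """One row of a hypothetical audit export (not a vRealize data format)."""
    name: str
    allocated_vcpus: int
    peak_cpu_percent: float  # highest observed CPU utilization, 0-100


def find_overprovisioned(vms, peak_threshold_percent=40.0):
    """Return VMs whose peak CPU utilization stayed below the threshold."""
    return [vm for vm in vms if vm.peak_cpu_percent < peak_threshold_percent]


def reclaimable_vcpus(vm, target_peak_percent=70.0):
    """Estimate vCPUs that could be removed while keeping peak load at or under the target."""
    needed = math.ceil(vm.allocated_vcpus * vm.peak_cpu_percent / target_peak_percent)
    needed = max(1, needed)  # a VM always keeps at least one vCPU
    return max(0, vm.allocated_vcpus - needed)


# Made-up sample data, purely illustrative.
sample_vms = [
    VmUsage("web-01", 16, 20.0),
    VmUsage("db-01", 8, 85.0),
    VmUsage("batch-01", 4, 35.0),
]

for vm in find_overprovisioned(sample_vms):
    print(vm.name, "could give back", reclaimable_vcpus(vm), "vCPUs")
```

A real assessment would also look at memory, storage, and how long the peak lasted, but the same pattern applies: compare what is allocated with what is actually used.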

2. Choosing the Right Tools

We picked Kubernetes as the core platform, supported by tools like OpenCost for cost visibility, Prometheus and Grafana for monitoring, and Istio for advanced traffic management. Each tool had a specific role to play:

Kubernetes: Managed containerized workloads, automated deployments, and provided self-healing capabilities. Kubernetes also helped in decoupling applications from VMware VMs, making them cloud-ready.
OpenCost: Provided real-time insights into resource usage and costs, tracking spending at the namespace and workload levels. This was crucial for understanding the financial impact of different services.
Prometheus & Grafana: These tools enabled us to create custom metrics and visual dashboards for stakeholders. Prometheus scraped metrics from Kubernetes nodes, and Grafana provided clear, actionable visualizations.
Istio: Helped manage traffic between microservices, providing security through mutual TLS, and allowed for advanced traffic routing to optimize performance during peak loads.
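
To show what tracking spend at the namespace level means in practice, here is a simplified Python sketch of request-based cost allocation. It does not call OpenCost or reproduce its actual model; the function name, the unit prices, and the workload records are hypothetical.

```python
from collections import defaultdict


def cost_by_namespace(workloads, cpu_price_per_core_hour, mem_price_per_gib_hour):
    """Sum an estimated cost per namespace from requested CPU and memory.

    Each workload is a dict with: namespace, cpu_cores, memory_gib, hours.
    """
    totals = defaultdict(float)
    for w in workloads:
        hourly = (
            w["cpu_cores"] * cpu_price_per_core_hour
            + w["memory_gib"] * mem_price_per_gib_hour
        )
        totals[w["namespace"]] += hourly * w["hours"]
    return dict(totals)


# Made-up workloads and prices, for illustration only.
sample_workloads = [
    {"namespace": "checkout", "cpu_cores": 2, "memory_gib": 4, "hours": 10},
    {"namespace": "checkout", "cpu_cores": 1, "memory_gib": 2, "hours": 10},
    {"namespace": "search", "cpu_cores": 4, "memory_gib": 8, "hours": 5},
]

costs = cost_by_namespace(
    sample_workloads, cpu_price_per_core_hour=0.04, mem_price_per_gib_hour=0.005
)
for namespace, cost in sorted(costs.items()):
    print(f"{namespace}: ${cost:.2f}")
```

Grouping by namespace is what lets a finance or product team see the bill for their own services instead of one shared infrastructure total.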

3. Design and Architecture

With the tools in place, we designed an architecture that could provide cost observability and then optimize those costs effectively. This included deploying OpenCost for cost tracking, integrating Prometheus for metrics collection, and using Grafana for visual dashboards.

Automated Right-Sizing: We used historical resource data from Prometheus to right-size Kubernetes resources. The Vertical Pod Autoscaler (VPA) was set up to adjust container resource requests and limits automatically.
Spot Instances: We integrated AWS Spot Instances for non-critical workloads. For instance, batch processing jobs ran on Spot Instances, saving up to 70% of compute costs compared to On-Demand Instances.
Cluster Autoscaling: We deployed the Kubernetes Cluster Autoscaler to manage compute resources efficiently, scaling nodes up or down based on demand. It was integrated with AWS Auto Scaling groups to ensure cost-effective use of cloud infrastructure.
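
Right-sizing from historical data usually comes down to picking a high percentile of observed usage and adding some headroom. The Python sketch below illustrates that idea under stated assumptions: the 95th percentile, the 15% headroom, and the function names are ours for illustration, and this is not how the Vertical Pod Autoscaler computes its recommendations internally.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def recommend_cpu_request(samples_millicores, pct=95, headroom_percent=15):
    """Suggest a CPU request in millicores: a usage percentile plus headroom."""
    base = percentile(samples_millicores, pct)
    return math.ceil(base * (100 + headroom_percent) / 100)


# Made-up CPU usage samples in millicores, e.g. exported from a metrics store.
usage = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print("suggested CPU request (millicores):", recommend_cpu_request(usage))
```

Using a percentile rather than the maximum keeps one short spike from inflating the request, while the headroom leaves room for normal variation.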

4. Implementation and Migration

The actual migration started with a pilot phase. We began with non-critical applications to validate our strategies before rolling out the migration across all workloads.

Pilot Phase: We chose a set of low-impact applications and containerized them using Docker. Helm charts were then used to manage the deployment of these containers into Kubernetes. This allowed us to validate cost observability with OpenCost and adjust our compute fleet management approach.
Phased Rollout: We migrated applications in groups, starting with stateless services and then moving to stateful services that required persistent storage. For storage, we used Amazon EBS and configured Kubernetes Persistent Volumes (PVs) to handle data persistence.
Cutover Strategy: We planned application cutovers during off-peak hours. The cutover was supported by keeping parallel environments running, with the ability to roll back using Kubernetes deployments and traffic management through Istio in case of issues.
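
A cutover with parallel environments and a rollback path can be thought of as a stepwise traffic shift guarded by a health check. The Python sketch below only computes the next traffic percentage; the step schedule, the 1% error threshold, and the function name are assumptions. In an Istio setup, a number like this would typically end up as a route weight in a VirtualService, which is not shown here.

```python
def next_traffic_weight(current_weight, error_rate,
                        steps=(5, 25, 50, 100), max_error_rate=0.01):
    """Return the next percentage of traffic to send to the new environment.

    Returns 0 (send everything back to the old environment) when the
    observed error rate exceeds the allowed maximum.
    """
    if error_rate > max_error_rate:
        return 0
    for step in steps:
        if step > current_weight:
            return step
    return current_weight  # already at the final step


# Walk through a healthy cutover with a made-up, constant error rate.
weight = 0
while weight < 100:
    weight = next_traffic_weight(weight, error_rate=0.001)
    print("shift traffic to", weight, "percent")
```

Keeping the decision in one small function makes the rollback rule explicit: any step that sees too many errors sends traffic back to the old environment.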

Results Worth Celebrating

This migration delivered some impressive outcomes:

35% Cost Reduction: By leveraging Kubernetes autoscaling, optimizing workloads, and utilizing spot instances, we saved millions annually.
2.2x Faster Releases: Using CI/CD pipelines with Jenkins integrated into Kubernetes, we were able to streamline the deployment process, cutting down the average release time significantly.
70% Fewer Security Incidents: Implementing role-based access control (RBAC) with Kubernetes and using tools like Aqua Security reduced vulnerabilities and improved overall cluster security.

On top of this, we achieved better compliance across internal policies and reduced the time spent on audit preparation by 60%. Automated compliance checks with tools like OPA (Open Policy Agent) helped streamline this process.
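
An automated compliance check is, at its core, a rule evaluated against every workload definition. Real OPA policies are written in Rego; the Python sketch below only illustrates the kind of rule involved. The two rules and the function name are hypothetical examples, not the policies used in this project.

```python
def policy_violations(pod_spec):
    """Return a list of human-readable violations for a pod spec given as a dict."""
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        limits = container.get("resources", {}).get("limits", {})
        # Example rule 1: every container must declare CPU and memory limits.
        for resource in ("cpu", "memory"):
            if resource not in limits:
                problems.append(f"{name}: missing {resource} limit")
        # Example rule 2: privileged containers are rejected.
        if container.get("securityContext", {}).get("privileged", False):
            problems.append(f"{name}: privileged containers are not allowed")
    return problems


# Made-up pod spec for illustration.
sample_pod = {
    "containers": [
        {"name": "api", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
        {"name": "sidecar", "securityContext": {"privileged": True}},
    ]
}

for problem in policy_violations(sample_pod):
    print(problem)
```

Running checks like these on every deployment, rather than before each audit, is what turns audit preparation into a by-product of normal work.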

What You Can Learn

If you're looking to grow in your career or join a company like ours, here’s what you can take away from our experience:

Embrace Containerization and Orchestration: Moving to Kubernetes might seem daunting, but the benefits in scalability, resilience, and cost efficiency are huge. Tools like Helm and Docker make the process manageable.
Focus on Observability: Knowing where your costs are going is the first step in cutting them down. Tools like OpenCost and Prometheus are key for gaining insights into your infrastructure.
Iterate, Don’t Jump: A phased approach to migration reduces risk and helps you adapt as you go. Starting with non-critical workloads and building experience is a great strategy.
Automate Governance: Security and compliance aren’t just checkboxes; they’re key to sustainable growth. Kubernetes gives you built-in controls like RBAC to enforce governance, and policy engines like OPA take it to the next level.

Ready to Be Part of This Journey?

At UpTeam, we're doing meaningful work that shapes the future of IT. If this sounds like the kind of challenge you’d like to take on, we’re always on the lookout for talented people who want to make a difference. Join us, and let's build something amazing together.
