
Kubernetes in 2025: What’s Changed Since I Wrote About AKS


Hey folks, I was looking through some of my older posts the other day and found my hands-on AKS walkthrough from back in 2019. Reading it now is wild. I was deploying a basic cluster, manually scaling node pools, running kubectl commands, and feeling pretty good about it. That was the Kubernetes 1.13/1.14 era.

We’re now on Kubernetes 1.29/1.30, and AKS in particular has changed so much that my old post reads like a historical document. I figured it was time to write about where things stand now, especially since we run a ton of containerized workloads at Spektra for CloudLabs.

Back Then: Everything Was Manual

When I wrote that original AKS post, here’s what the experience looked like. You created a cluster through the portal or CLI. You picked a node size. You deployed your app with kubectl apply. You managed upgrades yourself, hoping nothing broke. Monitoring meant bolting on third-party tools. And scaling? You either over-provisioned (expensive) or scrambled to add nodes when traffic spiked.

I remember spending a full weekend upgrading an AKS cluster from 1.13 to 1.14. Nerve-wracking, no easy rollback, and we had downtime. Not great.

Networking was another headache. Getting ingress right, configuring network policies, figuring out which CNI plugin to use… I spent more time on YAML for networking than on the actual application.

What’s Genuinely Better Now

I don’t want to just list features. Instead, let me walk through the stuff that actually changed how I work with Kubernetes day to day.

Upgrades Don’t Keep Me Up at Night Anymore

AKS now has automatic upgrades. You pick a channel (patch, stable, rapid, or node-image) and it handles the rest. I set our non-production clusters to the stable channel and honestly forgot about them for months. They just quietly stayed current.

For production, I still prefer planned maintenance windows where I control the timing. But even that is way smoother than the manual upgrade dance from 2019. The process is much more reliable about draining nodes properly now.
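For reference, here's roughly what both setups look like with the Azure CLI. The resource group and cluster names are placeholders, not our actual environment:

```shell
# Put a cluster on the stable auto-upgrade channel
az aks update \
  --resource-group my-rg \
  --name my-cluster \
  --auto-upgrade-channel stable

# For production, pin upgrade work to a maintenance window you control
az aks maintenanceconfiguration add \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name default \
  --weekday Saturday \
  --start-hour 2
```

Check `az aks update --help` for the current channel names before copying this; the set has grown over time.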

GitOps Changed Everything About Deployments

This is probably the single biggest shift in how we operate. Back in 2019, deployments meant someone running kubectl apply from their laptop, or a CI/CD pipeline that did the same thing. It worked, but tracking what was deployed where was a mess.

Now we use Flux (the CNCF GitOps tool) with AKS. Our Kubernetes manifests live in a Git repo. Flux watches that repo and automatically syncs the cluster state to match. Want to deploy? Push a commit. Want to rollback? Revert the commit. Want to know what’s running in production? Look at the Git repo.

AKS has Flux built in as a cluster extension, so you don’t even need to manage it yourself. ArgoCD is the other popular option (we use it on a couple of projects too), but I’ve found Flux’s integration with AKS to be cleaner since it’s a first-party extension.
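If you want to try the Flux route, bootstrapping it on an existing cluster looks roughly like this. The repo URL, configuration names, and paths below are made-up placeholders for illustration:

```shell
# Requires the CLI extension: az extension add --name k8s-configuration
az k8s-configuration flux create \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --cluster-type managedClusters \
  --name cluster-config \
  --namespace flux-system \
  --url https://github.com/my-org/k8s-manifests \
  --branch main \
  --kustomization name=apps path=./apps prune=true
```

From then on, Flux reconciles the cluster against whatever is on `main`, which is why a revert commit doubles as your rollback mechanism.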

Going from "SSH into a bastion host to run kubectl" to "merge a PR to deploy" feels like going from dial-up to broadband.

KEDA Made Autoscaling Actually Useful

The old Horizontal Pod Autoscaler (HPA) was fine if you wanted to scale based on CPU or memory. But what if your workload needs to scale based on the length of an Azure Service Bus queue? Or the number of messages in an Event Hub? That’s where KEDA comes in.

KEDA (Kubernetes Event-driven Autoscaling) is an AKS add-on now. We use it heavily in CloudLabs for provisioning workers. When a batch of lab environments needs to be created, messages land in a queue, KEDA scales up worker pods to handle the load, then scales back down to zero when the queue is empty.

Scale to zero. That was impossible with the old HPA. For workloads that are bursty (which most of ours are), this saved us real money. I don’t have exact numbers handy, but our Azure bill for those worker nodes dropped noticeably after we switched from always-on pods to KEDA-managed scaling.
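A minimal ScaledObject for that queue-driven pattern looks something like this. The deployment, queue, and auth names are invented for the example, and the exact trigger metadata is worth double-checking against the KEDA azure-servicebus scaler docs:

```shell
# Enable the managed KEDA add-on once per cluster
az aks update --resource-group my-rg --name my-cluster --enable-keda

# Then describe how a worker deployment scales off a Service Bus queue
kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: provisioning-worker
spec:
  scaleTargetRef:
    name: provisioning-worker   # your worker Deployment
  minReplicaCount: 0            # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: lab-provisioning
        messageCount: "5"       # target messages per replica
      authenticationRef:
        name: servicebus-auth   # a TriggerAuthentication you define separately
EOF
```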

Service Mesh Without the Pain

I tried setting up Istio manually on AKS in 2020. Gave up after two days. Too many moving parts, too much YAML, too easy to misconfigure and break traffic routing.

AKS now has an Istio-based service mesh as a managed add-on. Microsoft handles the control plane. You just enable it and start using it: mTLS between services, traffic splitting for canary deployments, observability. The stuff that used to take days now takes minutes.

We mainly care about mTLS and traffic observability, so we don’t use all of Istio’s features. But having it available without managing Istio ourselves? That alone is worth it.
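Enabling the managed mesh really is a couple of commands. The revision label below is an example; the add-on reports which `asm` revision it actually installed on your cluster:

```shell
# Turn on the Istio-based service mesh add-on
az aks mesh enable --resource-group my-rg --name my-cluster

# Opt a namespace into sidecar injection using the add-on's revision label
kubectl label namespace my-app istio.io/rev=asm-1-22
```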

The GPU Node Pool Situation

Okay, I have to talk about this because it’s on everyone’s mind right now.

DeepSeek R1 just came out last month and the whole industry is buzzing about running AI models closer to where they’re needed. Kubernetes is increasingly where people want to run inference workloads, and AKS supports GPU node pools with NVIDIA GPU Operator.

We’ve been experimenting with GPU node pools for some internal AI features in CloudLabs. Getting a GPU node pool running is straightforward. Getting it running cost-effectively? That’s harder. GPU VMs in Azure are expensive. Like, really expensive.
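The "straightforward" part is genuinely one command. The VM size and counts here are placeholders, and GPU SKU availability (and quota) varies a lot by region:

```shell
# Add a GPU node pool (NC-series shown as an example SKU)
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name gpupool \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3
```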

Spot node pools help. AKS lets you create node pools backed by Azure Spot VMs, which can be 60-90% cheaper. The catch is they can be evicted when Azure needs the capacity back. For batch inference, that trade-off is totally worth it. For real-time serving, you need a strategy for handling evictions gracefully.
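A spot-backed pool is essentially the same command with a priority flag, sketched here with placeholder names:

```shell
# Spot-backed node pool that the autoscaler can take down to zero
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name spotgpu \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10
```

AKS taints spot nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, so only pods that explicitly tolerate that taint land there; that's handy for keeping batch inference off your regular nodes.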

I think Kubernetes is going to become the default platform for AI/ML workloads in 2025.

What’s Still Hard

Not everything is sunshine. A few things still frustrate me.

Cost visibility is better than it was but still not great. AKS cost analysis can tell you spending per namespace, which helps. But correlating costs back to specific applications or teams requires tagging discipline that most teams don’t have. We use the AKS cost analysis blade plus OpenCost, and it’s still not as straightforward as I’d like.

The learning curve hasn’t gotten smaller. If anything, there’s MORE to learn now. In 2019, you needed to know Kubernetes basics. Now add GitOps, KEDA, service mesh, network policies, workload identity… you get the idea. The individual pieces are easier, but there are a lot more of them.

And multi-cluster management is still a work in progress. AKS Fleet Manager is getting there, but if you’re running 5+ clusters across regions (like we do for CloudLabs), coordinating everything still requires real effort.

So, Was It Worth Revisiting?

Absolutely. Kubernetes went from "cool but painful" to "I wouldn’t build infrastructure any other way." The rough edges got filed down. AKS in particular absorbed a huge amount of the operational work that used to fall on us.

If you tried Kubernetes a few years ago and bounced off because it was too much ops work, it’s worth another look. The experience is completely different now.

I’m planning a more hands-on follow-up with actual commands and configs for the GitOps + KEDA + Istio setup we run. For now, I just wanted to capture how much things shifted in five or six years.

Happy building, folks!

Amit

Assisted by AI during writing
