The 3 AM Wake-Up Call That Changed How I See Kubernetes Forever
I’ll never forget the first time I truly understood the difference between Kubernetes in development and Kubernetes in production. It was 3 AM, and my phone was buzzing with alerts I’d never seen before. My cluster was down, my services were unreachable, and I was staring at a wall of cryptic error messages. I’d been using Kubernetes for months, but this was the moment everything changed.
The problem wasn’t that I didn’t know Kubernetes—it was that I didn’t know production Kubernetes. I’d been treating it like a development tool, running everything on my local machine, deploying with kubectl apply, and hoping for the best. But production is a different beast entirely. It’s where theory meets reality, where elegant abstractions break down, and where your true understanding is tested.
Why Production Kubernetes Is Different
Let me be clear: if you’re running Kubernetes in production, you’re not just running containers. You’re running a distributed system that spans multiple machines, handles traffic spikes, manages state, and needs to be monitored, secured, and backed up. It’s a living, breathing entity that requires constant attention and care.
The first thing I learned the hard way is that production Kubernetes isn’t about the cool features—it’s about the boring, unsexy stuff that keeps your system running. Things like:
- Resource limits and requests: Without proper resource management, your pods will fight each other for CPU and memory, causing unpredictable behavior and crashes.
- Health checks and readiness probes: Your services need to know when they’re actually ready to serve traffic, not just when they’ve started.
- Persistent storage: Data doesn’t live in containers—it needs somewhere to persist, and that somewhere needs to be reliable and backed up.
- Network policies: In production, you can’t just let everything talk to everything. You need to control traffic flow for security and performance.
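To make the persistent-storage point concrete, here is a minimal sketch of requesting durable storage with a PersistentVolumeClaim. The claim name, storage class, and size are illustrative assumptions; use whatever your cluster actually provides:

# Hypothetical PVC for my-app's data. "standard" is an assumed
# StorageClass name; check what your cluster offers with
# `kubectl get storageclass`.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
    - ReadWriteOnce   # mounted read-write by a single node
  storageClassName: standard
  resources:
    requests:
      storage: 10Gi

A pod then mounts this claim as a volume, and the data survives pod restarts and rescheduling.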
The Three Pillars of Production-Ready Kubernetes
After that 3 AM wake-up call, I started rebuilding my understanding of Kubernetes around three core pillars: reliability, observability, and security.
Reliability: The Foundation
Reliability starts with understanding that your cluster will fail. Not if, but when. The question is how gracefully it fails.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:1.0.0  # replace with your actual image
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
This configuration ensures that even if one pod fails, your service remains available. The resource limits prevent any single pod from consuming all available resources, and the health checks ensure traffic only goes to healthy instances.
Observability: Seeing What’s Really Happening
You can’t fix what you can’t see. In production, you need to know what’s happening inside your cluster at all times.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
This Prometheus configuration automatically discovers pods with Prometheus annotations and scrapes their metrics. Combined with a centralized logging solution like the ELK stack or Grafana Loki, you can trace issues from application logs all the way down to infrastructure metrics.
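For a pod to be picked up by this discovery config, its metadata needs the matching annotations. A sketch of the pod-template metadata (the path and port are assumptions specific to your app):

metadata:
  labels:
    app: my-app
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule above
    prometheus.io/path: "/metrics" # rewritten into __metrics_path__
    prometheus.io/port: "8080"     # rewritten into the scrape __address__

Without the scrape annotation set to "true", the keep rule drops the pod and Prometheus never scrapes it.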
Security: The Often-Forgotten Pillar
Security in Kubernetes isn’t just about network policies (though those are important). It’s about the entire security posture of your cluster.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-app-network-policy
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: my-app
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to: []
      ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
This network policy ensures that my-app can only communicate with itself and DNS servers. No other traffic is allowed in or out. Combine this with proper RBAC (Role-Based Access Control), secrets management, and image scanning, and you have a much more secure cluster.
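As one example of what tighter RBAC can look like, here is a namespaced Role that only allows reading pods, bound to a service account. All names here are illustrative, not a prescription:

# Hypothetical read-only Role and binding; adjust names and
# namespace to your environment.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: my-app
    namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

The principle is the same as with network policies: grant the minimum access a workload needs, nothing more.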
The Deployment Pipeline: From Code to Production
One of the biggest shifts in my thinking was around deployments. In development, I’d just run kubectl apply -f deployment.yaml. In production, that’s a recipe for disaster.
Here’s what a production deployment pipeline looks like:
# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up kubectl
        uses: azure/setup-kubectl@v1
      - name: Deploy to staging
        run: |
          kubectl apply -f k8s/staging/
          kubectl rollout status deployment/my-app -n staging
      - name: Run tests
        run: |
          curl -f http://staging.my-app/health
          # Run integration tests
      - name: Deploy to production
        run: |
          kubectl apply -f k8s/production/
          kubectl rollout status deployment/my-app -n production
This pipeline ensures that every change goes through staging first, gets tested, and only then moves to production. It also includes rollout status checks to ensure deployments complete successfully before proceeding.
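One refinement worth considering is automatic rollback when the production rollout fails. A hedged sketch of an extra workflow step (not something the pipeline above includes, just one way to do it in GitHub Actions):

      # Hypothetical final step: undo the production rollout if any
      # earlier step in this job failed.
      - name: Roll back on failure
        if: failure()
        run: |
          kubectl rollout undo deployment/my-app -n production

Since `kubectl rollout status` exits non-zero when a rollout stalls, a failed production deploy would trigger this step and restore the previous ReplicaSet.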
Common Pitfalls and How to Avoid Them
Through my journey, I’ve learned that there are some common mistakes that almost everyone makes when moving to production Kubernetes:
Pitfall 1: Not setting resource limits
Without resource limits, one misbehaving pod can consume all available resources, causing a cascade of failures across your cluster.
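One way to enforce this at the namespace level is a LimitRange, which applies default requests and limits to any container that doesn't declare its own. The values below are illustrative, not a recommendation:

# Hypothetical namespace-wide defaults; tune to your workloads.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:   # applied when a container omits requests
        cpu: "100m"
        memory: "128Mi"
      default:          # applied when a container omits limits
        cpu: "500m"
        memory: "512Mi"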
Pitfall 2: Ignoring pod disruption budgets
During node maintenance or cluster upgrades, Kubernetes might evict your pods. Without a pod disruption budget, you could end up with zero available instances.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
Pitfall 3: Treating Kubernetes like a traditional VM
Kubernetes is a platform for running distributed systems, not just individual applications. You need to design for failure, scale horizontally, and embrace the cloud-native mindset.
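Horizontal scaling, for instance, is something the platform can handle for you. A minimal HorizontalPodAutoscaler sketch; the replica bounds and the 70% CPU target are assumptions you would tune per workload:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale out above 70% average CPU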
The Human Side of Production Kubernetes
Here’s something I’ve learned that doesn’t get talked about enough: production Kubernetes is as much about people as it is about technology. The 3 AM wake-up call taught me that I needed better monitoring, but it also taught me that I needed better processes.
- On-call rotations: No one person should be responsible for 24/7 availability. Build a team rotation.
- Runbooks: Document common issues and their solutions. When you’re stressed at 3 AM, you won’t remember what you learned yesterday.
- Post-mortems: When things go wrong (and they will), analyze what happened without blame. Focus on improving the system.
Where Do We Go From Here?
Looking back at that 3 AM moment, I realize it was the best thing that could have happened to me. It forced me to confront the reality that production Kubernetes is hard, and that’s okay. The goal isn’t to eliminate all failures—it’s to build systems that can handle them gracefully.
So, what’s your experience with production Kubernetes? Have you had your own 3 AM wake-up call? What lessons have you learned about running Kubernetes in the real world? I’d love to hear your stories in the comments.
And if you’re just starting your production Kubernetes journey, remember: start small, think big, and always be learning. The path from development to production is challenging, but it’s also incredibly rewarding. You’re not just learning a technology—you’re learning how to build resilient, scalable systems that can handle whatever the real world throws at them.
What’s the biggest challenge you’re facing with production Kubernetes right now? Let’s learn from each other’s experiences.