Kubernetes is the industry standard for container orchestration, but running workloads effectively requires more than just deploying pods. Here are the best practices every developer and DevOps engineer should follow.
## Pod Design

### One Process Per Container

Each container should run a single process. Don't bundle your app, database, and cache into one container.
```yaml
# Bad - multiple processes in one container
apiVersion: v1
kind: Pod
metadata:
  name: monolith
spec:
  containers:
  - name: everything
    image: my-app-with-db-and-cache
```

```yaml
# Good - separate concerns
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: my-api:1.2.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database
spec:
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
      - name: postgres
        image: postgres:16-alpine
```

### Use Labels and Annotations
Labels are for selection and grouping; annotations are for non-identifying metadata that tools and humans read but selectors never query.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  labels:
    app: api-server
    env: production
    team: backend
    version: v1.2.0
  annotations:
    description: "Main API server"
    owner: "backend-team@company.com"
spec:
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
        env: production
```

### Always Set Resource Requests and Limits
Without resource constraints, a single pod can starve the entire node.
```yaml
containers:
- name: api
  image: my-api:1.2.0
  resources:
    requests:
      cpu: "250m"       # Guaranteed minimum
      memory: "256Mi"
    limits:
      cpu: "500m"       # Maximum allowed
      memory: "512Mi"
```

Guidelines:
- Set requests to what your app typically uses
- Set limits to what your app uses under peak load
- Memory limits should be close to requests (OOMKill is harsh)
- CPU limits can be higher — CPU is compressible
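One way to ground these numbers is observed usage. If the cluster runs the Vertical Pod Autoscaler addon (an assumption — it is not part of core Kubernetes), it can publish recommendations without resizing anything:

```yaml
# Sketch: VPA in recommendation-only mode (requires the VPA addon installed).
# The target name "api" is illustrative.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # only recommend; never evict or resize pods
```

`kubectl describe vpa api-vpa` then shows recommended CPU and memory requests per container.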
### Use Liveness and Readiness Probes
```yaml
containers:
- name: api
  image: my-api:1.2.0
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30
    periodSeconds: 10
```

- Liveness — Is the container alive? If not, restart it
- Readiness — Is the container ready to accept traffic? If not, remove from service
- Startup — Is the container still starting up? Prevents liveness probe from killing slow-starting apps
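The endpoints these probes hit are simple to implement. A minimal, framework-agnostic sketch (the paths match the manifest above; the function and flag names are illustrative):

```python
def probe_status(path: str, ready: bool) -> int:
    """Map a probe path to an HTTP status code.

    /healthz answers liveness and startup probes: 200 as long as the
    process is running. /ready answers the readiness probe: 503 until
    dependencies (DB connections, caches) are warm, so the pod is kept
    out of Service endpoints without being restarted.
    """
    if path == "/healthz":
        return 200
    if path == "/ready":
        return 200 if ready else 503
    return 404
```

Wire this into whatever HTTP framework the app already uses; the key design point is that readiness and liveness must have different failure semantics.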
### Handle Graceful Shutdown
```yaml
spec:
  terminationGracePeriodSeconds: 30   # pod-level, not per-container
  containers:
  - name: api
    image: my-api:1.2.0
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]   # let endpoint removal propagate
```

In your application:
```javascript
process.on('SIGTERM', async () => {
  console.log('SIGTERM received, draining connections...');
  server.close(() => {
    console.log('Server closed');
    process.exit(0);
  });
});
```

## Deployments
### Use Deployments, Not Bare Pods
```yaml
# Bad - bare pod, no self-healing
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:1.0
```

```yaml
# Good - managed by a Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:1.0
```

### Pin Image Tags, Never Use latest
```yaml
# Bad - unpredictable
containers:
- name: api
  image: my-api:latest
```

```yaml
# Good - pinned version
containers:
- name: api
  image: my-api:1.2.0
```

```yaml
# Better - pinned with digest
containers:
- name: api
  image: my-api@sha256:abc123...
```

### Use Rolling Update Strategy
```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Max pods above desired count during update
      maxUnavailable: 0    # Zero downtime
  minReadySeconds: 10      # Wait before counting a new pod as available
```

### Set Pod Disruption Budgets
Protect your app during node maintenance and cluster upgrades:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2   # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: api-server
```

## Configuration
### Use ConfigMaps for Non-Sensitive Config
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  DATABASE_HOST: "postgres.default.svc.cluster.local"
  LOG_LEVEL: "info"
  MAX_CONNECTIONS: "100"
```

```yaml
containers:
- name: api
  envFrom:
  - configMapRef:
      name: app-config
```

### Use Secrets for Sensitive Data
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
stringData:
  DATABASE_PASSWORD: "super-secret"
  API_KEY: "sk-abc123"
```

```yaml
containers:
- name: api
  envFrom:
  - secretRef:
      name: app-secrets
```

Important: Kubernetes Secrets are only base64-encoded by default, not encrypted. For real security:
- Enable encryption at rest in etcd
- Use external secret managers (AWS Secrets Manager, HashiCorp Vault)
- Use tools like External Secrets Operator or Sealed Secrets
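To see why base64 alone is not protection, note that it reverses with one call (the value below matches the example Secret; it is illustrative):

```python
import base64

# How a value appears in a Secret's `data:` field once stored:
encoded = base64.b64encode(b"super-secret").decode()
print(encoded)   # c3VwZXItc2VjcmV0

# Anyone who can read the Secret object can recover the plaintext:
decoded = base64.b64decode(encoded).decode()
print(decoded)   # super-secret
```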
### Use Immutable ConfigMaps and Secrets
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-v2
immutable: true
data:
  LOG_LEVEL: "debug"
```

Immutable configs improve cluster performance (the kubelet can stop watching them for changes) and prevent accidental edits; to update one, create a new object and point the workload at it.
## Security

### Don't Run as Root
```yaml
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
  - name: api
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
```

### Use Network Policies
Restrict pod-to-pod communication:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - port: 5432
```

### Use RBAC
Apply least-privilege access:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
subjects:
- kind: ServiceAccount
  name: my-app
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

### Scan Container Images
```bash
# Use Trivy to scan images
trivy image my-api:1.2.0

# In a CI/CD pipeline, fail the build on serious findings
trivy image --exit-code 1 --severity HIGH,CRITICAL my-api:1.2.0
```

## Networking
### Use Services for Internal Communication
```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api-server
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
```

Access via DNS: `api-service.default.svc.cluster.local`
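That name follows the fixed pattern `<service>.<namespace>.svc.<cluster-domain>`. A small helper to illustrate (the `cluster.local` domain is the default, though clusters can override it):

```python
def service_dns(service: str, namespace: str = "default",
                cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name that a ClusterIP Service resolves to."""
    return f"{service}.{namespace}.svc.{cluster_domain}"

print(service_dns("api-service"))
# api-service.default.svc.cluster.local
print(service_dns("postgres", namespace="production"))
# postgres.production.svc.cluster.local
```

Within the same namespace, the short name `api-service` resolves too, but the fully qualified form is unambiguous across namespaces.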
### Use Ingress for External Traffic
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
```

## Observability
### Centralized Logging
```yaml
# App should log to stdout/stderr
containers:
- name: api
  image: my-api:1.2.0
# Kubernetes collects stdout/stderr automatically.
# Use a log aggregator like Loki, ELK, or Fluentd.
```

### Expose Metrics with Prometheus
```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-service
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: api-server
  ports:
  - name: http
    port: 80
    targetPort: 8080
  - name: metrics
    port: 9090
    targetPort: 9090
```

### Set Up Alerts
```yaml
# PrometheusRule (requires the Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
spec:
  groups:
  - name: api
    rules:
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High 5xx error rate on API"
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 10m
      labels:
        severity: warning
```

## Storage
### Use PersistentVolumeClaims
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 20Gi
```

```yaml
containers:
- name: postgres
  image: postgres:16-alpine
  volumeMounts:
  - name: data
    mountPath: /var/lib/postgresql/data
volumes:
- name: data
  persistentVolumeClaim:
    claimName: postgres-data
```

### Use StatefulSets for Stateful Workloads
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16-alpine
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi
```

## Namespace Organization
### Separate by Environment or Team
```bash
# By environment
kubectl create namespace development
kubectl create namespace staging
kubectl create namespace production

# By team
kubectl create namespace team-backend
kubectl create namespace team-frontend
```

### Set Resource Quotas per Namespace
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
```

## Helm and GitOps
### Use Helm for Packaging
```bash
# Create a chart
helm create my-app

# Install
helm install my-app ./my-app -f values-production.yaml

# Upgrade
helm upgrade my-app ./my-app -f values-production.yaml
```

### Use GitOps with ArgoCD or Flux
```yaml
# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-manifests
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

## Quick Reference
| Practice | Why |
|---|---|
| Resource requests and limits | Prevent resource starvation |
| Liveness and readiness probes | Auto-recovery and safe traffic routing |
| Pin image tags | Reproducible deployments |
| Rolling update strategy | Zero-downtime deploys |
| Pod Disruption Budgets | Safe node maintenance |
| Run as non-root | Principle of least privilege |
| Network Policies | Limit blast radius of compromises |
| Secrets management | Protect sensitive data |
| Namespaces + quotas | Resource isolation |
| Centralized logging + metrics | Observability |
| Helm + GitOps | Repeatable, auditable deployments |
| Graceful shutdown | No dropped requests |
## Summary
Kubernetes best practices boil down to:
- Design pods well — one process per container, resource limits, health probes
- Deploy safely — rolling updates, PDBs, pinned image tags
- Secure everything — non-root, RBAC, network policies, image scanning
- Observe everything — centralized logs, Prometheus metrics, alerts
- Automate delivery — Helm charts, GitOps with ArgoCD or Flux