# Kubernetes Best Practices

## 🎯 **Overview**

This guide covers industry best practices for deploying, managing, and operating Kubernetes clusters in production environments.

***

## 🏗️ **Cluster Architecture Best Practices**

### Cluster Design Principles

#### ✅ **Recommended Cluster Sizing**

```yaml
# Production cluster sizing guidelines
clusters:
  small:
    nodes: 3-5
    cpu_per_node: 2-4 vCPU
    memory_per_node: 8-16 GiB
    use_case: Development, testing, small apps

  medium:
    nodes: 5-10
    cpu_per_node: 4-8 vCPU
    memory_per_node: 16-32 GiB
    use_case: Production workloads, microservices

  large:
    nodes: 10-50+
    cpu_per_node: 8-16 vCPU
    memory_per_node: 32-64 GiB
    use_case: Enterprise applications, heavy workloads
```
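When sizing, remember that not all node capacity is schedulable: the kubelet and system daemons reserve a slice of every node. A rough back-of-the-envelope sketch (the 10% reservation is an assumption; real `--system-reserved`/`--kube-reserved` values vary by distribution, so check each node's `Allocatable` figures):

```python
def schedulable_capacity(nodes: int, cpu_per_node: float, mem_gib_per_node: float,
                         reserved_fraction: float = 0.1):
    """Estimate schedulable cluster capacity after system/kubelet reservations.

    reserved_fraction is an assumed 10% overhead; compare against the
    Allocatable values reported by `kubectl describe node`.
    """
    usable = 1.0 - reserved_fraction
    return nodes * cpu_per_node * usable, nodes * mem_gib_per_node * usable

# A "medium" cluster of 5 nodes with 4 vCPU / 16 GiB each:
cpu, mem = schedulable_capacity(5, 4, 16)
print(cpu, mem)  # ≈ 18.0 vCPU, 72.0 GiB of schedulable capacity
```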

#### ✅ **Multi-Zone and Multi-Region Setup**

```yaml
# Multi-zone cluster configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-topology
  namespace: kube-system
data:
  topology.yaml: |
    cluster:
      regions:
        primary: us-west-2
        secondary: us-east-1
      zones:
        primary:
          - us-west-2a
          - us-west-2b
          - us-west-2c
        secondary:
          - us-east-1a
          - us-east-1b
          - us-east-1c

    availability_zones:
      - zone_type: "primary"
        min_nodes: 2
        max_nodes: 5
      - zone_type: "secondary"
        min_nodes: 1
        max_nodes: 3
```

#### ✅ **High Availability Components**

```yaml
# Critical component HA configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-api
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: critical-api
  template:
    metadata:
      labels:
        app: critical-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - critical-api
              topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "topology.kubernetes.io/zone"
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: critical-api
      containers:
      - name: api
        image: critical-api:v1.0.0
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]
      terminationGracePeriodSeconds: 30
```
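Rolling-update settings protect availability during deploys, but voluntary disruptions such as node drains and cluster upgrades also need a budget. A PodDisruptionBudget for the same app might look like this (a sketch; tune `minAvailable` to your replica count):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-api-pdb
  namespace: production
spec:
  minAvailable: 2        # with 3 replicas, at most one pod may be evicted at a time
  selector:
    matchLabels:
      app: critical-api
```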

***

## 🔐 **Security Best Practices**

### RBAC Implementation

#### ✅ **Principle of Least Privilege**

```yaml
# Service accounts with minimal permissions
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-service-account
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-role
  namespace: production
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-rolebinding
  namespace: production
subjects:
- kind: ServiceAccount
  name: app-service-account
  namespace: production
roleRef:
  kind: Role
  name: app-role
  apiGroup: rbac.authorization.k8s.io
```

#### ✅ **Network Security**

```yaml
# Default deny all traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-traffic
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: production
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: database
    ports:
    - protocol: TCP
      port: 5432
  - to: []
    ports:
    - protocol: TCP
      port: 53
    - protocol: UDP
      port: 53
```

#### ✅ **Pod Security Standards**

```yaml
# Pod security context
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:v1.0.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
      runAsNonRoot: true
      runAsUser: 1000
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    - name: config
      mountPath: /app/config
      readOnly: true
  volumes:
  - name: tmp
    emptyDir: {}
  - name: config
    configMap:
      name: app-config
```
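The security contexts above can be enforced cluster-wide through Pod Security Admission: labelling a namespace opts it into a Pod Security Standard. A sketch for the `production` namespace (`restricted` is the strictest built-in profile):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```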

***

## 📊 **Resource Management Best Practices**

### Resource Requests and Limits

#### ✅ **Proper Resource Configuration**

```yaml
# Resource allocation best practices
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
  namespace: production
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: nginx:1.21
        resources:
          requests:
            cpu: 100m          # Start with small amount
            memory: 128Mi       # Start with small amount
          limits:
            cpu: 500m          # Allow burst capacity
            memory: 512Mi       # Prevent OOM kills
        env:
        - name: JAVA_OPTS
          value: "-Xmx256m -Xms128m"  # Match memory limits
        - name: GOMAXPROCS
          value: "1"                   # Match CPU limits
```
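Requests also feed autoscaling: the HorizontalPodAutoscaler's CPU utilization target is a percentage of the *request*, not the limit. A sketch pairing the deployment above with an HPA (the 70% target and replica bounds are assumptions to tune):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # 70% of the 100m request, i.e. ~70m per pod
```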

#### ✅ **Resource Quotas and Limits**

```yaml
# Namespace resource quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    persistentvolumeclaims: "10"
    services: "10"
    services.loadbalancers: "2"
    count/pods: "50"
    count/secrets: "20"
    count/configmaps: "20"
---
# Pod limits
apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits
  namespace: production
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    type: Container
  - max:
      cpu: "2"
      memory: "4Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
    type: Container
```

***

## 🔧 **Application Deployment Best Practices**

### Container Image Management

#### ✅ **Optimized Docker Images**

```dockerfile
# Multi-stage build for optimized images
FROM golang:1.21-alpine AS builder

# Build application
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o main .

# Final stage
FROM scratch
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
COPY --from=builder /app/main /main

# Non-root user
COPY --from=builder /etc/passwd /etc/passwd
USER 1000:1000

# Health check
HEALTHCHECK --interval=30s --timeout=3s \
  CMD ["/main", "health"]

EXPOSE 8080
ENTRYPOINT ["/main"]
```
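A small build context also matters: everything not excluded is sent to the daemon and can leak into layers via `COPY . .`. A minimal `.dockerignore` to pair with the Dockerfile above (entries are illustrative):

```
.git
*.md
Dockerfile
bin/
**/*_test.go
```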

#### ✅ **Configuration Management**

```yaml
# Configuration best practices
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
data:
  config.yaml: |
    server:
      port: 8080
      host: "0.0.0.0"
    database:
      host: "postgres.production.svc.cluster.local"
      port: 5432
      max_connections: 20
    logging:
      level: "info"
      format: "json"
---
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
  namespace: production
type: Opaque
data:
  database-password: cGFzc3dvcmQxMjM0
  api-key: YXBpa2V5c2VjcmV0MTIzNDU2Nzg5MA==
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
  namespace: production
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
      - name: app
        image: myapp:v1.0.0
        env:
        - name: CONFIG_FILE
          value: "/etc/config/config.yaml"
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: database-password
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: api-key
        volumeMounts:
        - name: config
          mountPath: /etc/config
          readOnly: true
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 3
      volumes:
      - name: config
        configMap:
          name: app-config
```
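Keep in mind that Secret `data` values are base64-encoded, not encrypted — anyone who can read the Secret can recover the plaintext. A quick check against the value above:

```python
import base64

# The value stored under `database-password` in the Secret above
encoded = "cGFzc3dvcmQxMjM0"
plaintext = base64.b64decode(encoded).decode()
print(plaintext)  # password1234 — readable by anyone with `get secrets` access

# Encoding goes the other way when writing manifests by hand
assert base64.b64encode(plaintext.encode()).decode() == encoded
```

For anything sensitive, restrict Secret access via RBAC and consider encryption at rest or an external secret manager.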

### Health Checks and Probes

#### ✅ **Comprehensive Probe Configuration**

```yaml
# Advanced probe configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: healthy-app
  namespace: production
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:v1.0.0
        ports:
        - containerPort: 8080
          name: http
        livenessProbe:
          httpGet:
            path: /health/live
            port: http
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /health/startup
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 30
        lifecycle:
          preStop:
            # Only one handler type (exec, httpGet, or tcpSocket) is allowed per hook
            exec:
              command: ["/bin/sh", "-c", "curl -s -X POST http://localhost:8080/shutdown; sleep 10"]
      terminationGracePeriodSeconds: 60
```
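The probe numbers translate directly into time budgets: the kubelet tolerates up to `failureThreshold × periodSeconds` of consecutive failures before acting, so the startup probe above gives the app roughly 300 s to boot, while the liveness probe triggers a restart after about 30 s of failures. The arithmetic:

```python
def max_failure_window(failure_threshold: int, period_seconds: int) -> int:
    """Worst-case seconds of consecutive probe failures before the kubelet acts."""
    return failure_threshold * period_seconds

print(max_failure_window(30, 10))  # startup probe: 300 s budget to finish booting
print(max_failure_window(3, 10))   # liveness probe: ~30 s before restart
```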

***

## 📈 **Monitoring and Observability Best Practices**

### Monitoring Setup

#### ✅ **Comprehensive Monitoring Stack**

```yaml
# Monitoring namespace
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    name: monitoring
    workload: monitoring
---
# Prometheus configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    rule_files:
      - "/etc/prometheus/rules/*.yml"

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager:9093

    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
```

#### ✅ **Alerting Rules**

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-alerts
  namespace: monitoring
spec:
  groups:
  - name: kubernetes-apps
    rules:
    - alert: PodCrashing
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crashing"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been restarting {{ $value }} times in the last 15 minutes."

    - alert: HighCPUUsage
      expr: sum by (pod, namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on pod {{ $labels.namespace }}/{{ $labels.pod }}"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} CPU usage is above 80% for more than 5 minutes."

    - alert: HighMemoryUsage
      expr: sum by (pod, namespace) (container_memory_working_set_bytes{container!=""}) / sum by (pod, namespace) (kube_pod_container_resource_limits{resource="memory"}) > 0.9
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High memory usage on pod {{ $labels.namespace }}/{{ $labels.pod }}"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} memory usage is above 90% for more than 5 minutes."

    - alert: NodeDown
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} is down"
        description: "Node {{ $labels.node }} has been down for more than 5 minutes."
```
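To sanity-check the `HighMemoryUsage` expression, here is the ratio it computes with hypothetical numbers: a pod whose working set is 480 MiB against a 512 MiB limit sits at about 94% and would fire the alert:

```python
MIB = 1024 ** 2

working_set_bytes = 480 * MIB   # hypothetical container_memory_working_set_bytes
limit_bytes = 512 * MIB         # hypothetical memory limit from the pod spec

ratio = working_set_bytes / limit_bytes
print(round(ratio, 4))  # 0.9375 — above the 0.9 alert threshold
assert ratio > 0.9
```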

***

## 🔧 **Troubleshooting Best Practices**

### Log Management

#### ✅ **Structured Logging**

```yaml
# Logging configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: logging-config
  namespace: production
data:
  logback.xml: |
    <configuration>
        <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
            <encoder>
                <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
            </encoder>
        </appender>

        <!-- JSON output via logstash-logback-encoder (requires that dependency on the classpath) -->
        <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
            <encoder class="net.logstash.logback.encoder.LogstashEncoder">
                <customFields>{"service":"myapp"}</customFields>
            </encoder>
        </appender>

        <root level="INFO">
            <appender-ref ref="STDOUT"/>
            <appender-ref ref="JSON"/>
        </root>
    </configuration>
```

### Debugging Tools

#### ✅ **Debug Container Template**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-container
  namespace: production
spec:
  containers:
  - name: debug-tools
    image: nicolaka/netshoot
    command:
    - /bin/bash
    - -c
    - |
      tail -f /dev/null
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi
    volumeMounts:
    - name: host-root
      mountPath: /host
    securityContext:
      privileged: true
  volumes:
  - name: host-root
    hostPath:
      path: /
  - name: kube-config
    hostPath:
      path: /etc/kubernetes
      type: DirectoryOrCreate
  restartPolicy: Never
  hostNetwork: true
  hostPID: true
  hostIPC: true
```

***

## 📋 **Checklist for Production Deployment**

### Pre-Deployment Checklist

#### ✅ **Cluster Preparation**

* [ ] Multi-zone cluster setup
* [ ] Resource quotas configured
* [ ] Network policies implemented
* [ ] RBAC configured with least privilege
* [ ] Monitoring stack deployed
* [ ] Backup configured
* [ ] Security scans completed
* [ ] Load testing performed
* [ ] Documentation updated

#### ✅ **Application Preparation**

* [ ] Resource requests and limits defined
* [ ] Health checks implemented
* [ ] Configuration externalized
* [ ] Secrets properly managed
* [ ] Image scanning completed
* [ ] Liveness and readiness probes configured
* [ ] Graceful shutdown implemented
* [ ] Logging configured
* [ ] Error handling implemented

### Operational Best Practices

#### ✅ **Daily Operations**

```bash
# Daily health checks
kubectl get nodes --no-headers | grep -v Ready
kubectl get pods --all-namespaces --field-selector=status.phase!=Running
kubectl top nodes
kubectl top pods --all-namespaces

# Check cluster events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -10

# Backup verification
kubectl get backups.velero.io
kubectl get restores.velero.io

# Resource utilization
kubectl describe nodes | grep -A 5 "Allocated resources"
```

#### ✅ **Weekly Maintenance**

* [ ] Update security patches
* [ ] Review resource utilization
* [ ] Check backup status
* [ ] Review alerting rules
* [ ] Update documentation
* [ ] Performance testing
* [ ] Security audit

***

## 🚨 **Common Pitfalls to Avoid**

### ❌ **Common Mistakes**

1. **No Resource Requests/Limits**

   ```yaml
   # ❌ Bad: No resource limits
   resources: {}

   # ✅ Good: Proper resource management
   resources:
     requests:
       cpu: 100m
       memory: 128Mi
     limits:
       cpu: 500m
       memory: 512Mi
   ```
2. **Running as Root**

   ```yaml
   # ❌ Bad: Running as root
   securityContext:
     runAsUser: 0

   # ✅ Good: Non-root user
   securityContext:
     runAsNonRoot: true
     runAsUser: 1000
   ```
3. **Hardcoded Configuration**

   ```yaml
   # ❌ Bad: Hardcoded values
   env:
   - name: DATABASE_URL
     value: "postgres://user:pass@localhost:5432/db"

   # ✅ Good: Use ConfigMaps and Secrets
   env:
   - name: DATABASE_HOST
     valueFrom:
       configMapKeyRef:
         name: db-config
         key: host
   ```
4. **Missing Health Checks**

   ```yaml
   # ❌ Bad: No health checks
   containers:
   - name: app
     image: myapp:v1.0.0

   # ✅ Good: Comprehensive health checks
   containers:
   - name: app
     image: myapp:v1.0.0
     livenessProbe:
       httpGet:
         path: /health
         port: 8080
     readinessProbe:
       httpGet:
         path: /ready
         port: 8080
   ```
5. **Ignoring Security**

   ```yaml
   # ❌ Bad: No security context
   spec: {}

   # ✅ Good: Pod- and container-level security contexts
   spec:
     securityContext:
       runAsNonRoot: true
     containers:
     - name: app
       image: myapp:v1.0.0
       securityContext:
         allowPrivilegeEscalation: false
         readOnlyRootFilesystem: true
         capabilities:
           drop:
           - ALL
   ```

***

## 📚 **Additional Resources**

### Official Documentation

* [Kubernetes Best Practices](https://kubernetes.io/docs/concepts/cluster-administration/cluster-management/)
* [Security Best Practices](https://kubernetes.io/docs/concepts/security/overview/)
* [Production Best Practices](https://kubernetes.io/docs/setup/production-environment/)

### Tools and Guides

* [kube-score](https://github.com/zegl/kube-score) - Kubernetes static analysis
* [Popeye](https://github.com/derailed/popeye) - Kubernetes resource analysis
* [Fairwinds Polaris](https://polaris.fairwinds.com/) - Security validation

### Learning Resources

* [Kubernetes Best Practices Guide](https://learnk8s.io/kubernetes-best-practices/)
* [Kubernetes Security Checklist](https://github.com/aquasecurity/kube-bench)

***

📌 *Follow these best practices to ensure secure, scalable, and maintainable Kubernetes deployments.*
