# Advanced Troubleshooting

## Troubleshooting Methodology

### Systematic Debugging Approach

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Problem ID    │───▶│  Data Collection │───▶│  Hypothesis     │
│                 │    │                  │    │                 │
│ • Symptoms      │    │ • Logs           │    │ • Root Cause    │
│ • Impact        │    │ • Metrics        │    │ • Test Cases    │
│ • Scope         │    │ • Events         │    │ • Validation    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

### Troubleshooting Framework

#### 1. Problem Identification

```bash
# Check overall cluster health (componentstatuses is deprecated since v1.19; prefer the health endpoints)
kubectl get --raw='/readyz?verbose'
kubectl get nodes --show-labels
kubectl get namespaces

# Identify problematic resources
kubectl get pods --all-namespaces -o wide
kubectl get services --all-namespaces
kubectl get deployments --all-namespaces

# Check resource utilization
kubectl top nodes
kubectl top pods --all-namespaces
```
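
On a busy cluster the full pod listing is noisy. A small filter that keeps only pods in a non-Running, non-Completed state speeds up the identification step. The sketch below runs against an illustrative captured sample of `kubectl get pods --all-namespaces --no-headers` output (the pod names are made up); in live use you would pipe real output into the function.

```shell
# filter_unhealthy: print NAMESPACE/NAME -> STATUS for pods that are not healthy.
# Live usage: kubectl get pods --all-namespaces --no-headers | filter_unhealthy
filter_unhealthy() {
  # Columns: NAMESPACE NAME READY STATUS RESTARTS AGE -> STATUS is field 4
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " -> " $4 }'
}

# Illustrative sample output (not from a real cluster):
filter_unhealthy <<'EOF'
default       web-7f9c5       1/1   Running            0    2d
default       worker-5b2d     0/1   CrashLoopBackOff   12   3h
payments      api-66d8f       0/1   ImagePullBackOff   0    10m
kube-system   coredns-558bd   1/1   Running            0    9d
EOF
```

On this sample it prints only the two unhealthy pods (`default/worker-5b2d -> CrashLoopBackOff` and `payments/api-66d8f -> ImagePullBackOff`).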

#### 2. Data Collection

```bash
# Collect events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

# Get detailed resource information
kubectl describe pod <pod-name> -n <namespace>
kubectl describe service <service-name> -n <namespace>
kubectl describe deployment <deployment-name> -n <namespace>

# Check logs
kubectl logs <pod-name> -n <namespace> --previous
kubectl logs <pod-name> -n <namespace> -c <container-name>

# Network diagnostics
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>
kubectl exec -it <pod-name> -n <namespace> -- curl http://<service-name>
```

## Pod Troubleshooting

### Common Pod Issues

#### Pod Pending Issues

```bash
# Check pod status and events
kubectl get pod <pod-name> -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>

# Common pending causes and solutions

# 1. Resource constraints
kubectl describe pod <pod-name> | grep -A 10 "Events"
kubectl get nodes --show-labels
kubectl top nodes

# Check available resources
kubectl describe nodes | grep -A 10 "Allocated resources"
kubectl describe nodes | grep -A 5 "Capacity"

# Solution: Adjust resource requests or add nodes
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"requests":{"cpu":"100m","memory":"128Mi"}}}]}}}}'

# 2. Taints and Tolerations
kubectl describe nodes | grep -A 5 "Taints"
kubectl get pods -o wide | grep Pending

# Check if pod has tolerations for node taints
kubectl get pod <pod-name> -o yaml | grep -A 10 "tolerations"

# Solution: Add tolerations or remove taints
kubectl taint nodes <node-name> <taint-key>-
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"<taint-key>","operator":"Exists","effect":"NoSchedule"}]}}}}'

# 3. Image Pull Issues
kubectl describe pod <pod-name> | grep -A 5 "Events"
kubectl get pods <pod-name> -o yaml | grep -A 10 "imagePullSecrets"

# Test image pull manually
docker pull <image-name>
kubectl run test-pod --image=<image-name> --restart=Never --rm -it

# Solution: Fix image name, credentials, or network
kubectl create secret docker-registry <secret-name> --docker-server=<registry> --docker-username=<username> --docker-password=<password>
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"<secret-name>"}]}}}}'

# 4. Persistent Volume Claims
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get pv

# Check storage class availability
kubectl get storageclass
kubectl describe storageclass <storage-class-name>

# Solution: Create required PV or fix storage class
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: <storage-class-name>
  hostPath:
    path: /tmp/data
EOF
```

#### Pod CrashLoopBackOff

```bash
# Get pod details and logs
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous

# Debug with debug container
kubectl debug <pod-name> -n <namespace> --image=busybox --copy-to=<debug-pod-name>

# Check if it's configuration issue
kubectl get pod <pod-name> -o yaml | grep -A 20 "containers"

# Common causes and solutions:

# 1. Configuration errors
kubectl exec -it <pod-name> -n <namespace> -- env
kubectl exec -it <pod-name> -n <namespace> -- cat /etc/config/file

# Solution: Fix config maps, secrets, or environment variables
kubectl edit configmap <configmap-name> -n <namespace>
kubectl edit secret <secret-name> -n <namespace>

# 2. Database connection issues
kubectl exec -it <pod-name> -n <namespace> -- nc -zv <database-host> <port>
kubectl exec -it <pod-name> -n <namespace> -- telnet <database-host> <port>

# Solution: Check database connectivity, credentials, network policies
kubectl get networkpolicy -n <namespace>
kubectl get service <database-service> -n <namespace>

# 3. Port conflicts
kubectl get pod <pod-name> -o yaml | grep -A 5 "ports"
kubectl get endpoints <service-name> -n <namespace>

# Solution: Change container port or fix service configuration
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","ports":[{"containerPort":<new-port>}]}]}}}}'

# 4. Resource limits too low
kubectl describe pod <pod-name> | grep -A 10 "Limits"
kubectl top pod <pod-name> -n <namespace>

# Solution: Increase resource limits
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"cpu":"500m","memory":"512Mi"}}}]}}}}'
```
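
The `Last State: Terminated` exit code shown by `kubectl describe pod` narrows CrashLoopBackOff causes quickly: containers follow the shell convention of 128 + signal number, so 137 means SIGKILL (often the OOM killer) and 139 means SIGSEGV. A small lookup helper, as a sketch:

```shell
# explain_exit_code: translate a container exit code into a likely cause
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit (check restartPolicy; the process may simply have finished)" ;;
    1)   echo "generic application error (check application logs)" ;;
    137) echo "SIGKILL (128+9): often OOMKilled or forced deletion" ;;
    139) echo "SIGSEGV (128+11): segmentation fault in the application" ;;
    143) echo "SIGTERM (128+15): graceful shutdown requested by kubelet" ;;
    *)   echo "exit code $1: consult the application's documentation" ;;
  esac
}

explain_exit_code 137
```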

#### Pod Not Ready

```bash
# Check readiness probes
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Readiness"
kubectl get pod <pod-name> -o yaml | grep -A 15 "readinessProbe"

# Test readiness manually
kubectl exec -it <pod-name> -n <namespace> -- curl -f http://localhost:<readiness-port>/<readiness-path>
kubectl exec -it <pod-name> -n <namespace> -- wget -qO- http://localhost:<readiness-port>/<readiness-path>

# Common issues and solutions:

# 1. Readiness probe failing
# Check application logs for startup issues
kubectl logs <pod-name> -n <namespace>

# Solution: Fix readiness probe configuration or application startup
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","readinessProbe":{"httpGet":{"path":"/health","port":8080},"initialDelaySeconds":30,"periodSeconds":10}}]}}}}'

# 2. Service endpoint issues
kubectl get endpoints <service-name> -n <namespace>
kubectl get service <service-name> -n <namespace> -o yaml

# Solution: Fix service selector or pod labels
kubectl get service <service-name> -n <namespace> -o yaml | grep -A 5 "selector"
kubectl get pods -l <selector-key>=<selector-value> -n <namespace>

# 3. Network policy blocking
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>

# Solution: Update network policy to allow traffic
kubectl patch networkpolicy <policy-name> -n <namespace> -p '{"spec":{"podSelector":{},"policyTypes":["Ingress"],"ingress":[{"from":[{"podSelector":{"matchLabels":{"app":"<source-app>"}}}],"ports":[{"protocol":"TCP","port":<target-port>}]}]}}'
```
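
The readiness probe applied via `kubectl patch` above is easier to review as a manifest fragment. The `/health` path and port 8080 are placeholders for your application's actual health endpoint:

```yaml
# Deployment container fragment: readiness probe (path/port are placeholders)
readinessProbe:
  httpGet:
    path: /health          # must return 2xx/3xx when the app can serve traffic
    port: 8080
  initialDelaySeconds: 30  # give the app time to start before the first probe
  periodSeconds: 10        # probe interval
  failureThreshold: 3      # consecutive failures before marking the pod NotReady
```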

## Network Troubleshooting

### Service Connectivity Issues

#### Service Not Accessible

```bash
# Check service status
kubectl get service <service-name> -n <namespace> -o wide
kubectl describe service <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints <service-name> -n <namespace>
kubectl describe endpoints <service-name> -n <namespace>

# Test service connectivity
kubectl run test-pod --image=busybox --rm -it --restart=Never -- nslookup <service-name>.<namespace>.svc.cluster.local
kubectl run test-pod --image=busybox --rm -it --restart=Never -- wget -qO- http://<service-name>.<namespace>.svc.cluster.local

# Common issues and solutions:

# 1. No endpoints
kubectl get pods -l <selector-key>=<selector-value> -n <namespace>
kubectl describe pods -l <selector-key>=<selector-value> -n <namespace>

# Solution: Fix pod labels or service selector
kubectl label pod <pod-name> <selector-key>=<selector-value> -n <namespace>

# 2. Port configuration mismatch
kubectl get service <service-name> -n <namespace> -o yaml | grep -A 10 "ports"
kubectl get pods -l <selector-key>=<selector-value> -n <namespace> -o yaml | grep -A 10 "containerPort"

# Solution: Fix service port or container port
kubectl patch service <service-name> -p '{"spec":{"ports":[{"port":<service-port>,"targetPort":<container-port>,"protocol":"TCP"}]}}'

# 3. External IP not assigned
kubectl get service <service-name> -n <namespace> -o wide
kubectl describe service <service-name> -n <namespace>

# For LoadBalancer services
kubectl get service <service-name> -n <namespace> --watch

# For NodePort services
kubectl get nodes -o wide
curl http://<node-ip>:<node-port>

# Solution: Check cloud provider integration or use appropriate service type
```

#### Ingress Issues

```bash
# Check ingress status
kubectl get ingress <ingress-name> -n <namespace>
kubectl describe ingress <ingress-name> -n <namespace>

# Check ingress controller
kubectl get pods -n <ingress-namespace>
kubectl logs -n <ingress-namespace> -l <ingress-controller-label>

# Test ingress from external
curl -H "Host: <hostname>" http://<ingress-ip>/<path>
curl -k https://<hostname>/<path>

# Common issues and solutions:

# 1. Ingress controller not running
kubectl get pods -n <ingress-namespace> | grep ingress
kubectl get svc -n <ingress-namespace>

# Solution: Deploy or fix ingress controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.1/deploy/static/provider/cloud/deploy.yaml

# 2. TLS certificate issues
kubectl describe ingress <ingress-name> -n <namespace> | grep -A 10 "TLS"
kubectl get secret <tls-secret-name> -n <namespace>

# Check certificate
kubectl get secret <tls-secret-name> -n <namespace> -o yaml | grep -A 10 "tls.crt"
openssl x509 -in <cert-file> -text -noout

# Solution: Fix certificate or secret configuration
kubectl create secret tls <secret-name> --key=<key-file> --cert=<cert-file> -n <namespace>

# 3. Backend service not found
kubectl get ingress <ingress-name> -n <namespace> -o yaml | grep -A 10 "backend"
kubectl get service <backend-service> -n <namespace>

# Solution: Fix backend service name or namespace
kubectl patch ingress <ingress-name> -n <namespace> -p '{"spec":{"rules":[{"host":"<hostname>","http":{"paths":[{"path":"<path>","pathType":"Prefix","backend":{"service":{"name":"<service-name>","port":{"number":<service-port>}}}}]}}]}}'

# 4. Path routing issues
curl -H "Host: <hostname>" http://<ingress-ip>/<path> -v
kubectl logs -n <ingress-namespace> -l <ingress-controller-label> | grep <hostname>

# Solution: Fix path configuration or rewrite rules
kubectl patch ingress <ingress-name> -p '{"metadata":{"annotations":{"nginx.ingress.kubernetes.io/rewrite-target":"/<target-path>"}}}'
```
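
Because the inline backend patch above is easy to get wrong, it helps to see the full `networking.k8s.io/v1` backend structure as a minimal manifest. The hostname, class, and service names are placeholders:

```yaml
# Minimal Ingress showing the v1 backend structure (names are placeholders)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  namespace: default
spec:
  ingressClassName: nginx       # must match a deployed ingress controller
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service   # must exist in the same namespace as the Ingress
            port:
              number: 80
```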

### Network Policy Troubleshooting

#### Policy Blocking Traffic

```bash
# Check network policies
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>

# Test connectivity between pods
kubectl run source-pod --image=busybox --rm -it --restart=Never -- /bin/sh
# Inside pod:
# nslookup <target-service>.<namespace>.svc.cluster.local
# wget -qO- http://<target-service>.<namespace>.svc.cluster.local

# Check pod labels and selectors
kubectl get pod <pod-name> -n <namespace> --show-labels
kubectl get networkpolicy <policy-name> -n <namespace> -o yaml | grep -A 10 "podSelector"

# Common issues and solutions:

# 1. Policy too restrictive
kubectl describe networkpolicy <policy-name> -n <namespace> | grep -A 10 "policyTypes"

# Solution: Add allowed sources/destinations
kubectl patch networkpolicy <policy-name> -n <namespace> -p '{"spec":{"ingress":[{"from":[{"podSelector":{"matchLabels":{"app":"<allowed-app>"}}}],"ports":[{"protocol":"TCP","port":<port>}]}]}}'

# 2. Policy not matching due to labels
kubectl get pod -l <label-key>=<label-value> -n <namespace>
kubectl get networkpolicy -n <namespace> -o yaml | grep -A 5 "matchLabels"

# Solution: Fix pod labels or policy selectors
kubectl label pod <pod-name> <label-key>=<label-value> -n <namespace>

# 3. Egress traffic blocked
kubectl exec -it <pod-name> -n <namespace> -- curl -I http://external-api.com
kubectl describe networkpolicy <policy-name> -n <namespace> | grep -A 10 "egress"

# Solution: Add egress rules for DNS and external access
kubectl patch networkpolicy <policy-name> -n <namespace> -p '{"spec":{"egress":[{"ports":[{"protocol":"TCP","port":53},{"protocol":"UDP","port":53}]}]}}'
```
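
Blocked DNS is the most common egress failure mode, since a deny-all egress policy silently breaks service name resolution. The DNS allowance shown in the patch above reads more clearly as a manifest (namespace is a placeholder):

```yaml
# NetworkPolicy allowing DNS egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: default
spec:
  podSelector: {}        # applies to all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - ports:               # no "to" clause: DNS is allowed to any destination
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```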

## Storage Troubleshooting

### Volume Mount Issues

#### Volume Mount Failed

```bash
# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# Check PV status
kubectl get pv
kubectl describe pv <pv-name>

# Check pod volume mounts
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Mounts"
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 15 "volumeMounts"

# Common issues and solutions:

# 1. PVC stuck in Pending
kubectl describe pvc <pvc-name> -n <namespace> | grep -A 10 "Events"
kubectl get storageclass
kubectl describe storageclass <storage-class-name>

# Solution: Fix storage class or create matching PV
kubectl get storageclass <storage-class-name> -o yaml | grep provisioner
kubectl get pods -n <storage-namespace> | grep <provisioner>

# 2. Volume already attached
kubectl get pv | grep <volume-name>
kubectl get pods --all-namespaces -o wide | grep <node-name>

# Solution: Wait for pod to terminate or use different volume
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# 3. Permission issues
kubectl exec -it <pod-name> -n <namespace> -- ls -la /mount/path
kubectl exec -it <pod-name> -n <namespace> -- touch /mount/path/test

# Solution: Fix security context or volume permissions
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"securityContext":{"fsGroup":2000}}}}}'
```

#### Database Storage Issues

```bash
# Check database pod storage
kubectl exec -it <database-pod> -n <namespace> -- df -h
kubectl exec -it <database-pod> -n <namespace> -- ls -la /data

# Check storage usage
kubectl exec -it <database-pod> -n <namespace> -- du -sh /data/*
kubectl top pod <database-pod> -n <namespace>

# Monitor disk I/O
kubectl exec -it <database-pod> -n <namespace> -- iostat -x 1

# Common issues and solutions:

# 1. Disk space full
kubectl exec -it <database-pod> -n <namespace> -- df -h
kubectl describe pv <pv-name> | grep -A 5 "Capacity"

# Solution: Clean up old data or expand volume
kubectl exec -it <database-pod> -n <namespace> -- find /data -name "*.log" -mtime +7 -delete

# 2. Volume mounting with wrong permissions
kubectl exec -it <database-pod> -n <namespace> -- ls -la /data
kubectl get pvc <pvc-name> -n <namespace> -o yaml | grep -A 10 "accessModes"

# Solution: Fix permissions via securityContext; note that accessModes are immutable on an existing
# PVC, so switching to ReadWriteMany means creating a new claim (on a backend that supports RWX)
kubectl get pvc <pvc-name> -n <namespace> -o yaml > pvc.yaml   # adjust accessModes, then recreate the claim

# 3. Database corruption due to storage issues
kubectl exec -it <database-pod> -n <namespace> -- cat /var/log/postgresql/postgresql.log
kubectl logs <database-pod> -n <namespace> | grep -i error

# Solution: Restore from backup. pg_resetwal is a destructive last resort that can discard
# committed transactions; run it as the postgres user against the correct data directory
kubectl exec -it <database-pod> -n <namespace> -- pg_resetwal /data
```
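
Expanding a full volume (rather than deleting data) only works when the StorageClass permits it. A sketch of such a class; the provisioner shown is a placeholder for whatever CSI driver your cluster runs:

```yaml
# StorageClass permitting online volume expansion (provisioner is a placeholder)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable-ssd
provisioner: ebs.csi.aws.com   # replace with your cluster's CSI driver
allowVolumeExpansion: true     # required before a PVC's requested storage can be increased
reclaimPolicy: Retain
```

With expansion enabled, growing the claim is a patch to `spec.resources.requests.storage` on the PVC; the filesystem resize happens automatically for most CSI drivers.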

## Performance Troubleshooting

### Resource Bottlenecks

#### High CPU Usage

```bash
# Identify high CPU pods
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top nodes --sort-by=cpu

# Get detailed metrics
kubectl exec -it <high-cpu-pod> -n <namespace> -- top
kubectl exec -it <high-cpu-pod> -n <namespace> -- ps aux

# Check CPU throttling (path depends on the cgroup version in use)
kubectl describe pod <high-cpu-pod> -n <namespace> | grep -A 10 "Limits"
kubectl exec -it <high-cpu-pod> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat   # cgroup v2; v1: /sys/fs/cgroup/cpu/cpu.stat

# Common causes and solutions:

# 1. CPU limits too low
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits"
kubectl top pod <pod-name> -n <namespace>

# Solution: Increase CPU limits
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"cpu":"1000m"}}}]}}}}'

# 2. Application performance issues (for Go apps exposing net/http/pprof; requires Go tooling in the image)
kubectl exec -it <pod-name> -n <namespace> -- go tool pprof <application-url>/debug/pprof/profile
kubectl logs <pod-name> -n <namespace> | grep -i "slow\|timeout\|error"

# Solution: Profile application and optimize code
kubectl exec -it <pod-name> -n <namespace> -- curl http://localhost:8080/debug/pprof/profile?seconds=30 -o cpu.prof

# 3. Missing resource requests
kubectl get pods -n <namespace> -o yaml | grep -A 10 "resources"
kubectl describe nodes | grep -A 10 "Allocated resources"

# Solution: Set appropriate resource requests
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"requests":{"cpu":"200m"}}}]}}}}'
```

#### High Memory Usage

```bash
# Identify high memory pods
kubectl top pods --all-namespaces --sort-by=memory
kubectl top nodes --sort-by=memory

# Check memory usage details
kubectl exec -it <high-memory-pod> -n <namespace> -- free -h
kubectl exec -it <high-memory-pod> -n <namespace> -- cat /sys/fs/cgroup/memory.current   # cgroup v2; v1: /sys/fs/cgroup/memory/memory.usage_in_bytes

# Check for memory leaks
kubectl logs <pod-name> -n <namespace> | grep -i "out of memory\|oom-killer"
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "State"

# Common causes and solutions:

# 1. Memory leaks
kubectl exec -it <pod-name> -n <namespace> -- cat /proc/<pid>/status | grep -i vm
kubectl logs <pod-name> -n <namespace> | tail -100

# Solution: Restart pod or fix memory leak
kubectl delete pod <pod-name> -n <namespace>

# 2. Memory limits too low
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits"
kubectl top pod <pod-name> -n <namespace>

# Solution: Increase memory limits
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"2Gi"}}}]}}}}'

# 3. Garbage collection issues
kubectl exec -it <pod-name> -n <namespace> -- jstat -gc <java-pid>
kubectl logs <pod-name> -n <namespace> | grep -i "gc\|heap"

# Solution: Tune garbage collection or increase heap size
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","env":[{"name":"JAVA_OPTS","value":"-Xmx1g -XX:+UseG1GC"}]}]}}}}'
```
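
OOM-killed containers record `reason: OOMKilled` (exit code 137) in their last container state, which is faster to check than trawling logs. A sketch that counts such entries in `kubectl get pods -o json` output; the heredoc below is an illustrative fragment, and live usage would pipe real kubectl output in:

```shell
# count_oomkilled: count OOMKilled container states in pod JSON read from stdin.
# Live usage: kubectl get pods -n <namespace> -o json | count_oomkilled
count_oomkilled() {
  grep -c '"reason": "OOMKilled"'
}

# Illustrative fragment of containerStatuses (not real cluster output):
count=$(count_oomkilled <<'EOF'
"lastState": { "terminated": { "exitCode": 137, "reason": "OOMKilled" } }
"lastState": {}
EOF
)
echo "OOMKilled container states: $count"
```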

## Cluster-Level Issues

### API Server Problems

#### API Server Unreachable

```bash
# Check API server connectivity (componentstatuses is deprecated since v1.19)
kubectl get --raw='/readyz?verbose'
kubectl cluster-info
kubectl get nodes

# Check API server logs
kubectl logs -n kube-system -l component=kube-apiserver
kubectl logs -n kube-system -l component=kube-controller-manager
kubectl logs -n kube-system -l component=kube-scheduler

# Check etcd health
kubectl get pods -n kube-system | grep etcd
kubectl logs -n kube-system -l component=etcd

# Common issues and solutions:

# 1. API server overloaded
kubectl top pods -n kube-system
kubectl describe node <master-node> | grep -A 10 "Conditions"

# Solution: Reduce API server load or scale the control plane
kubectl get pods -n kube-system -l component=kube-apiserver
# On kubeadm clusters the API server is a static pod, not a Deployment; edit its manifest on the control-plane node:
# sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml

# 2. etcd cluster issues
kubectl get pods -n kube-system | grep etcd
kubectl logs -n kube-system -l component=etcd | grep -i "leader\|health"

# Solution: Fix etcd cluster (etcdctl usually needs client certificates; kubeadm default paths shown)
kubectl exec -it <etcd-pod> -n kube-system -- etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint health

# 3. Network connectivity to the API server from inside the cluster
kubectl run test-pod --image=nicolaka/netshoot --rm -it --restart=Never -- curl -k https://kubernetes.default.svc.cluster.local/healthz
kubectl get endpoints kubernetes -n default

# Solution: Fix network configuration
kubectl get svc kubernetes -n default -o yaml
```

### Node Issues

#### Node NotReady

```bash
# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check kubelet status (kubelet runs as a node service, not a pod)
ssh <node-user>@<node-ip> "systemctl status kubelet"
ssh <node-user>@<node-ip> "journalctl -u kubelet --since '1 hour ago' | tail -50"

# Check system resources on node (use crictl instead of docker on containerd-based nodes)
ssh <node-user>@<node-ip> "free -h && df -h && sudo crictl ps"

# Common issues and solutions:

# 1. Kubelet not running
ssh <node-user>@<node-ip> "systemctl status kubelet"
ssh <node-user>@<node-ip> "sudo systemctl restart kubelet"

# 2. Disk pressure
ssh <node-user>@<node-ip> "df -h"
ssh <node-user>@<node-ip> "sudo crictl rmi --prune"   # on Docker nodes: sudo docker system prune -f

# 3. Network issues (kube-proxy normally runs as a DaemonSet, not a systemd service)
ssh <node-user>@<node-ip> "curl -k https://<api-server-ip>:6443/healthz"
kubectl -n kube-system delete pod -l k8s-app=kube-proxy --field-selector spec.nodeName=<node-name>

# 4. Resource exhaustion
ssh <node-user>@<node-ip> "top -b -n 1 | head -20 && iostat -x 1 3"
ssh <node-user>@<node-ip> "sudo systemctl restart containerd"   # or docker, depending on the runtime
```

## Advanced Debugging Tools

### kubectl Debug Commands

#### Debug Container

```bash
# Create debug container
kubectl debug <pod-name> -n <namespace> --image=nicolaka/netshoot --share-processes --copy-to=<debug-pod>

# Debug with elevated privileges (there is no --as-root flag; use a debugging profile, kubectl v1.27+)
kubectl debug <pod-name> -n <namespace> --image=busybox --profile=sysadmin

# Debug a specific node (the host filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=nicolaka/netshoot

# Debug with custom command
kubectl debug <pod-name> -n <namespace> --image=ubuntu -- sh -c "apt-get update && apt-get install -y curl vim"
```

#### Port Forwarding

```bash
# Forward local port to pod
kubectl port-forward <pod-name> 8080:8080 -n <namespace>

# Forward to service
kubectl port-forward service/<service-name> 8080:80 -n <namespace>

# Forward database port for debugging
kubectl port-forward <database-pod> 5432:5432 -n <namespace>
psql -h localhost -p 5432 -U <username> -d <database>

# Forward the API server (use an unprivileged local port; binding 443 locally requires root)
kubectl port-forward svc/kubernetes 8443:443 -n default
```

### Advanced Network Debugging

#### Network Policy Testing

```bash
# Create test pod for network debugging
kubectl run network-test --image=nicolaka/netshoot --rm -it --restart=Never -- bash

# Test DNS resolution
nslookup kubernetes.default.svc.cluster.local
dig kubernetes.default.svc.cluster.local

# Test connectivity to services
curl -v http://<service-name>.<namespace>.svc.cluster.local
telnet <service-name>.<namespace>.svc.cluster.local <port>

# Test egress connectivity
curl -v https://google.com
ping 8.8.8.8

# Test pod-to-pod connectivity
kubectl exec -it <pod-1> -- ping <pod-2-ip>
kubectl exec -it <pod-1> -- wget -qO- http://<pod-2-ip>:<port>
```

## Automation and Scripts

### Troubleshooting Scripts

#### Health Check Script

```bash
#!/bin/bash
# k8s-health-check.sh

NAMESPACE=${1:-"all"}
SEVERITY=${2:-"warning"}

echo "🔍 Kubernetes Cluster Health Check"
echo "=================================="

# Check API server
echo "📡 API Server Status:"
kubectl get --raw='/readyz' >/dev/null 2>&1 && echo "✅ API server reachable" || echo "❌ API server not accessible"

# Check nodes
echo -e "\n🖥️  Node Status:"
kubectl get nodes -o wide

# Check pods in namespace
if [ "$NAMESPACE" == "all" ]; then
    echo -e "\n📦 Pod Status (All Namespaces):"
    kubectl get pods --all-namespaces | grep -E "(Error|CrashLoopBackOff|Pending|ImagePullBackOff|ErrImagePull)"
else
    echo -e "\n📦 Pod Status in $NAMESPACE:"
    kubectl get pods -n "$NAMESPACE" | grep -E "(Error|CrashLoopBackOff|Pending|ImagePullBackOff|ErrImagePull)"
fi

# Check resource usage
echo -e "\n📊 Resource Usage:"
kubectl top nodes
kubectl top pods --all-namespaces | head -10

# Check events
echo -e "\n⚠️  Recent Events:"
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -10

# Check storage
echo -e "\n💾 Storage Status:"
kubectl get pvc --all-namespaces | grep -E "(Pending|Failed)"
kubectl get pv | grep -E "(Failed|Released)"

echo -e "\n✅ Health check completed!"
```

#### Pod Debug Script

```bash
#!/bin/bash
# debug-pod.sh

POD_NAME=$1
NAMESPACE=${2:-"default"}

if [ -z "$POD_NAME" ]; then
    echo "Usage: $0 <pod-name> [namespace]"
    exit 1
fi

echo "🔍 Debugging Pod: $POD_NAME in namespace: $NAMESPACE"
echo "=============================================="

# Get pod details
echo "📋 Pod Details:"
kubectl get pod $POD_NAME -n $NAMESPACE -o wide

echo -e "\n📝 Pod Description:"
kubectl describe pod $POD_NAME -n $NAMESPACE

# Get logs
echo -e "\n📜 Pod Logs:"
kubectl logs $POD_NAME -n $NAMESPACE --tail=50

# Get previous logs if exists
echo -e "\n📜 Previous Pod Logs:"
kubectl logs $POD_NAME -n $NAMESPACE --previous --tail=50 2>/dev/null || echo "No previous logs found"

# Check events
echo -e "\n⚠️  Pod Events:"
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$POD_NAME --sort-by='.lastTimestamp'

# Check resource usage
echo -e "\n📊 Resource Usage:"
kubectl top pod $POD_NAME -n $NAMESPACE 2>/dev/null || echo "Metrics not available"

# Create debug container (name it explicitly so exec targets the right container)
echo -e "\n🐛 Creating Debug Container:"
kubectl debug "$POD_NAME" -n "$NAMESPACE" --image=nicolaka/netshoot --share-processes --copy-to="$POD_NAME-debug" --container=debugger
kubectl exec -it "$POD_NAME-debug" -n "$NAMESPACE" -c debugger -- bash

echo -e "\n✅ Debug session completed!"
kubectl delete pod $POD_NAME-debug -n $NAMESPACE --force --grace-period=0 2>/dev/null
```

***

## 🚀 **Production Troubleshooting Setup**

### Troubleshooting Tools Deployment

```yaml
# Troubleshooting namespace
apiVersion: v1
kind: Namespace
metadata:
  name: troubleshooting
  labels:
    name: troubleshooting

---
# Debug tools DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: debug-tools
  namespace: troubleshooting
spec:
  selector:
    matchLabels:
      app: debug-tools
  template:
    metadata:
      labels:
        app: debug-tools
    spec:
      tolerations:
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "node-role.kubernetes.io/master"   # legacy taint on pre-1.24 clusters
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: debug-tools
        image: nicolaka/netshoot
        command: ["/bin/sh"]
        args: ["-c", "sleep 3600"]
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 50m
            memory: 64Mi
        volumeMounts:
        - name: root
          mountPath: /host
          readOnly: true   # the host filesystem is mounted for inspection only
      volumes:
      - name: root
        hostPath:
          path: /
      restartPolicy: Always

---
# Troubleshooting service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: troubleshooting-sa
  namespace: troubleshooting

---
# Troubleshooting cluster role
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: troubleshooting-role
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "pods/exec", "pods/portforward"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["nodes", "services", "events"]
  verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: troubleshooting-binding
subjects:
- kind: ServiceAccount
  name: troubleshooting-sa
  namespace: troubleshooting
roleRef:
  kind: ClusterRole
  name: troubleshooting-role
  apiGroup: rbac.authorization.k8s.io
```

***

## 📚 **Resources and References**

### Official Documentation

* [Kubernetes Troubleshooting](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)
* [kubectl Debug](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#debug)
* [Network Policies Troubleshooting](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-network-policies/)

### Troubleshooting Tools

* [kubectl Debug Cheat Sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#debugging)
* [Network Policy Analyzer](https://github.com/mattfenwick/network-policy-analyzer)
* [kube-bench](https://github.com/aquasecurity/kube-bench)

### Cheatsheet Summary

```bash
# Common Troubleshooting Commands
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl top pods --all-namespaces --sort-by=cpu
kubectl debug <pod-name> -n <namespace> --image=busybox

# Network Debugging
kubectl run test-pod --image=nicolaka/netshoot --rm -it --restart=Never -- bash
kubectl port-forward <pod-name> 8080:8080 -n <namespace>

# Storage Debugging
kubectl get pvc --all-namespaces
kubectl describe pvc <pvc-name> -n <namespace>
kubectl exec -it <pod-name> -n <namespace> -- df -h
```

The advanced troubleshooting documentation is ready to use! 🔧
