# Advanced Troubleshooting

## Troubleshooting Methodology

### Systematic Debugging Approach

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Problem ID    │───▶│  Data Collection │───▶│  Hypothesis     │
│                 │    │                  │    │                 │
│ • Symptoms      │    │ • Logs           │    │ • Root Cause    │
│ • Impact        │    │ • Metrics        │    │ • Test Cases    │
│ • Scope         │    │ • Events         │    │ • Validation    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

### Troubleshooting Framework

#### 1. Problem Identification

```bash
# Check overall cluster health (componentstatuses is deprecated since v1.19; prefer the health endpoints)
kubectl get --raw='/readyz?verbose'
kubectl get nodes --show-labels
kubectl get namespaces

# Identify problematic resources
kubectl get pods --all-namespaces -o wide
kubectl get services --all-namespaces
kubectl get deployments --all-namespaces

# Check resource utilization
kubectl top nodes
kubectl top pods --all-namespaces
```
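
On a busy cluster the full pod listing is noisy. A small filter that keeps only pods in a non-Running, non-Completed state speeds up the identification step. The sketch below runs against an illustrative captured sample of `kubectl get pods --all-namespaces --no-headers` output (the pod names are made up); in live use you would pipe real output into the function.

```shell
# filter_unhealthy: print NAMESPACE/NAME -> STATUS for pods that are not healthy.
# Live usage: kubectl get pods --all-namespaces --no-headers | filter_unhealthy
filter_unhealthy() {
  # Columns: NAMESPACE NAME READY STATUS RESTARTS AGE -> STATUS is field 4
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " -> " $4 }'
}

# Illustrative sample output (not from a real cluster):
filter_unhealthy <<'EOF'
default       web-7f9c5       1/1   Running            0    2d
default       worker-5b2d     0/1   CrashLoopBackOff   12   3h
payments      api-66d8f       0/1   ImagePullBackOff   0    10m
kube-system   coredns-558bd   1/1   Running            0    9d
EOF
```

On this sample it prints only the two unhealthy pods (`default/worker-5b2d -> CrashLoopBackOff` and `payments/api-66d8f -> ImagePullBackOff`).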

#### 2. Data Collection

```bash
# Collect events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

# Get detailed resource information
kubectl describe pod <pod-name> -n <namespace>
kubectl describe service <service-name> -n <namespace>
kubectl describe deployment <deployment-name> -n <namespace>

# Check logs
kubectl logs <pod-name> -n <namespace> --previous
kubectl logs <pod-name> -n <namespace> -c <container-name>

# Network diagnostics
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>
kubectl exec -it <pod-name> -n <namespace> -- curl http://<service-name>
```

## Pod Troubleshooting

### Common Pod Issues

#### Pod Pending Issues

```bash
# Check pod status and events
kubectl get pod <pod-name> -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>

# Common pending causes and solutions

# 1. Resource constraints
kubectl describe pod <pod-name> | grep -A 10 "Events"
kubectl get nodes --show-labels
kubectl top nodes

# Check available resources
kubectl describe nodes | grep -A 10 "Allocated resources"
kubectl describe nodes | grep -A 5 "Capacity"

# Solution: Adjust resource requests or add nodes
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"requests":{"cpu":"100m","memory":"128Mi"}}}]}}}}'

# 2. Taints and Tolerations
kubectl describe nodes | grep -A 5 "Taints"
kubectl get pods -o wide | grep Pending

# Check if pod has tolerations for node taints
kubectl get pod <pod-name> -o yaml | grep -A 10 "tolerations"

# Solution: Add tolerations or remove taints
kubectl taint nodes <node-name> <taint-key>-
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"<taint-key>","operator":"Exists","effect":"NoSchedule"}]}}}}'

# 3. Image Pull Issues
kubectl describe pod <pod-name> | grep -A 5 "Events"
kubectl get pods <pod-name> -o yaml | grep -A 10 "imagePullSecrets"

# Test image pull manually
docker pull <image-name>
kubectl run test-pod --image=<image-name> --restart=Never --rm -it

# Solution: Fix image name, credentials, or network
kubectl create secret docker-registry <secret-name> --docker-server=<registry> --docker-username=<username> --docker-password=<password>
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"<secret-name>"}]}}}}'

# 4. Persistent Volume Claims
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get pv

# Check storage class availability
kubectl get storageclass
kubectl describe storageclass <storage-class-name>

# Solution: Create required PV or fix storage class
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: <storage-class-name>
  hostPath:
    path: /tmp/data
EOF
```

#### Pod CrashLoopBackOff

```bash
# Get pod details and logs
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous

# Debug with debug container
kubectl debug <pod-name> -n <namespace> --image=busybox --copy-to=<debug-pod-name>

# Check if it's configuration issue
kubectl get pod <pod-name> -o yaml | grep -A 20 "containers"

# Common causes and solutions:

# 1. Configuration errors
kubectl exec -it <pod-name> -n <namespace> -- env
kubectl exec -it <pod-name> -n <namespace> -- cat /etc/config/file

# Solution: Fix config maps, secrets, or environment variables
kubectl edit configmap <configmap-name> -n <namespace>
kubectl edit secret <secret-name> -n <namespace>

# 2. Database connection issues
kubectl exec -it <pod-name> -n <namespace> -- nc -zv <database-host> <port>
kubectl exec -it <pod-name> -n <namespace> -- telnet <database-host> <port>

# Solution: Check database connectivity, credentials, network policies
kubectl get networkpolicy -n <namespace>
kubectl get service <database-service> -n <namespace>

# 3. Port conflicts
kubectl get pod <pod-name> -o yaml | grep -A 5 "ports"
kubectl get endpoints <service-name> -n <namespace>

# Solution: Change container port or fix service configuration
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","ports":[{"containerPort":<new-port>}]}]}}}}'

# 4. Resource limits too low
kubectl describe pod <pod-name> | grep -A 10 "Limits"
kubectl top pod <pod-name> -n <namespace>

# Solution: Increase resource limits
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"cpu":"500m","memory":"512Mi"}}}]}}}}'
```
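
The `Last State: Terminated` exit code shown by `kubectl describe pod` narrows CrashLoopBackOff causes quickly: containers follow the shell convention of 128 + signal number, so 137 means SIGKILL (often the OOM killer) and 139 means SIGSEGV. A small lookup helper, as a sketch:

```shell
# explain_exit_code: translate a container exit code into a likely cause
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit (check restartPolicy; the process may simply have finished)" ;;
    1)   echo "generic application error (check application logs)" ;;
    137) echo "SIGKILL (128+9): often OOMKilled or forced deletion" ;;
    139) echo "SIGSEGV (128+11): segmentation fault in the application" ;;
    143) echo "SIGTERM (128+15): graceful shutdown requested by kubelet" ;;
    *)   echo "exit code $1: consult the application's documentation" ;;
  esac
}

explain_exit_code 137
```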

#### Pod Not Ready

```bash
# Check readiness probes
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Readiness"
kubectl get pod <pod-name> -o yaml | grep -A 15 "readinessProbe"

# Test readiness manually
kubectl exec -it <pod-name> -n <namespace> -- curl -f http://localhost:<readiness-port>/<readiness-path>
kubectl exec -it <pod-name> -n <namespace> -- wget -qO- http://localhost:<readiness-port>/<readiness-path>

# Common issues and solutions:

# 1. Readiness probe failing
# Check application logs for startup issues
kubectl logs <pod-name> -n <namespace>

# Solution: Fix readiness probe configuration or application startup
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","readinessProbe":{"httpGet":{"path":"/health","port":8080},"initialDelaySeconds":30,"periodSeconds":10}}]}}}}'

# 2. Service endpoint issues
kubectl get endpoints <service-name> -n <namespace>
kubectl get service <service-name> -n <namespace> -o yaml

# Solution: Fix service selector or pod labels
kubectl get service <service-name> -n <namespace> -o yaml | grep -A 5 "selector"
kubectl get pods -l <selector-key>=<selector-value> -n <namespace>

# 3. Network policy blocking
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>

# Solution: Update network policy to allow traffic
kubectl patch networkpolicy <policy-name> -n <namespace> -p '{"spec":{"podSelector":{},"policyTypes":["Ingress"],"ingress":[{"from":[{"podSelector":{"matchLabels":{"app":"<source-app>"}}}],"ports":[{"protocol":"TCP","port":<target-port>}]}]}}'
```
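
The readiness probe applied via `kubectl patch` above is easier to review as a manifest fragment. The `/health` path and port 8080 are placeholders for your application's actual health endpoint:

```yaml
# Deployment container fragment: readiness probe (path/port are placeholders)
readinessProbe:
  httpGet:
    path: /health          # must return 2xx/3xx when the app can serve traffic
    port: 8080
  initialDelaySeconds: 30  # give the app time to start before the first probe
  periodSeconds: 10        # probe interval
  failureThreshold: 3      # consecutive failures before marking the pod NotReady
```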

## Network Troubleshooting

### Service Connectivity Issues

#### Service Not Accessible

```bash
# Check service status
kubectl get service <service-name> -n <namespace> -o wide
kubectl describe service <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints <service-name> -n <namespace>
kubectl describe endpoints <service-name> -n <namespace>

# Test service connectivity
kubectl run test-pod --image=busybox --rm -it --restart=Never -- nslookup <service-name>.<namespace>.svc.cluster.local
kubectl run test-pod --image=busybox --rm -it --restart=Never -- wget -qO- http://<service-name>.<namespace>.svc.cluster.local

# Common issues and solutions:

# 1. No endpoints
kubectl get pods -l <selector-key>=<selector-value> -n <namespace>
kubectl describe pods -l <selector-key>=<selector-value> -n <namespace>

# Solution: Fix pod labels or service selector
kubectl label pod <pod-name> <selector-key>=<selector-value> -n <namespace>

# 2. Port configuration mismatch
kubectl get service <service-name> -n <namespace> -o yaml | grep -A 10 "ports"
kubectl get pods -l <selector-key>=<selector-value> -n <namespace> -o yaml | grep -A 10 "containerPort"

# Solution: Fix service port or container port
kubectl patch service <service-name> -p '{"spec":{"ports":[{"port":<service-port>,"targetPort":<container-port>,"protocol":"TCP"}]}}'

# 3. External IP not assigned
kubectl get service <service-name> -n <namespace> -o wide
kubectl describe service <service-name> -n <namespace>

# For LoadBalancer services
kubectl get service <service-name> -n <namespace> --watch

# For NodePort services
kubectl get nodes -o wide
curl http://<node-ip>:<node-port>

# Solution: Check cloud provider integration or use appropriate service type
```

#### Ingress Issues

```bash
# Check ingress status
kubectl get ingress <ingress-name> -n <namespace>
kubectl describe ingress <ingress-name> -n <namespace>

# Check ingress controller
kubectl get pods -n <ingress-namespace>
kubectl logs -n <ingress-namespace> -l <ingress-controller-label>

# Test ingress from external
curl -H "Host: <hostname>" http://<ingress-ip>/<path>
curl -k https://<hostname>/<path>

# Common issues and solutions:

# 1. Ingress controller not running
kubectl get pods -n <ingress-namespace> | grep ingress
kubectl get svc -n <ingress-namespace>

# Solution: Deploy or fix ingress controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.1/deploy/static/provider/cloud/deploy.yaml

# 2. TLS certificate issues
kubectl describe ingress <ingress-name> -n <namespace> | grep -A 10 "TLS"
kubectl get secret <tls-secret-name> -n <namespace>

# Check certificate
kubectl get secret <tls-secret-name> -n <namespace> -o yaml | grep -A 10 "tls.crt"
openssl x509 -in <cert-file> -text -noout

# Solution: Fix certificate or secret configuration
kubectl create secret tls <secret-name> --key=<key-file> --cert=<cert-file> -n <namespace>

# 3. Backend service not found
kubectl get ingress <ingress-name> -n <namespace> -o yaml | grep -A 10 "backend"
kubectl get service <backend-service> -n <namespace>

# Solution: Fix backend service name or namespace
kubectl patch ingress <ingress-name> -n <namespace> -p '{"spec":{"rules":[{"host":"<hostname>","http":{"paths":[{"path":"<path>","pathType":"Prefix","backend":{"service":{"name":"<service-name>","port":{"number":<service-port>}}}}]}}]}}'

# 4. Path routing issues
curl -H "Host: <hostname>" http://<ingress-ip>/<path> -v
kubectl logs -n <ingress-namespace> -l <ingress-controller-label> | grep <hostname>

# Solution: Fix path configuration or rewrite rules
kubectl patch ingress <ingress-name> -p '{"metadata":{"annotations":{"nginx.ingress.kubernetes.io/rewrite-target":"/<target-path>"}}}'
```
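
Because the inline backend patch above is easy to get wrong, it helps to see the full `networking.k8s.io/v1` backend structure as a minimal manifest. The hostname, class, and service names are placeholders:

```yaml
# Minimal Ingress showing the v1 backend structure (names are placeholders)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  namespace: default
spec:
  ingressClassName: nginx       # must match a deployed ingress controller
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service   # must exist in the same namespace as the Ingress
            port:
              number: 80
```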

### Network Policy Troubleshooting

#### Policy Blocking Traffic

```bash
# Check network policies
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>

# Test connectivity between pods
kubectl run source-pod --image=busybox --rm -it --restart=Never -- /bin/sh
# Inside pod:
# nslookup <target-service>.<namespace>.svc.cluster.local
# wget -qO- http://<target-service>.<namespace>.svc.cluster.local

# Check pod labels and selectors
kubectl get pod <pod-name> -n <namespace> --show-labels
kubectl get networkpolicy <policy-name> -n <namespace> -o yaml | grep -A 10 "podSelector"

# Common issues and solutions:

# 1. Policy too restrictive
kubectl describe networkpolicy <policy-name> -n <namespace> | grep -A 10 "policyTypes"

# Solution: Add allowed sources/destinations
kubectl patch networkpolicy <policy-name> -n <namespace> -p '{"spec":{"ingress":[{"from":[{"podSelector":{"matchLabels":{"app":"<allowed-app>"}}}],"ports":[{"protocol":"TCP","port":<port>}]}]}}'

# 2. Policy not matching due to labels
kubectl get pod -l <label-key>=<label-value> -n <namespace>
kubectl get networkpolicy -n <namespace> -o yaml | grep -A 5 "matchLabels"

# Solution: Fix pod labels or policy selectors
kubectl label pod <pod-name> <label-key>=<label-value> -n <namespace>

# 3. Egress traffic blocked
kubectl exec -it <pod-name> -n <namespace> -- curl -I http://external-api.com
kubectl describe networkpolicy <policy-name> -n <namespace> | grep -A 10 "egress"

# Solution: Add egress rules for DNS and external access
kubectl patch networkpolicy <policy-name> -n <namespace> -p '{"spec":{"egress":[{"ports":[{"protocol":"TCP","port":53},{"protocol":"UDP","port":53}]}]}}'
```
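
Blocked DNS is the most common egress failure mode, since a deny-all egress policy silently breaks service name resolution. The DNS allowance shown in the patch above reads more clearly as a manifest (namespace is a placeholder):

```yaml
# NetworkPolicy allowing DNS egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: default
spec:
  podSelector: {}        # applies to all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - ports:               # no "to" clause: DNS is allowed to any destination
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```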

## Storage Troubleshooting

### Volume Mount Issues

#### Volume Mount Failed

```bash
# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# Check PV status
kubectl get pv
kubectl describe pv <pv-name>

# Check pod volume mounts
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Mounts"
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 15 "volumeMounts"

# Common issues and solutions:

# 1. PVC stuck in Pending
kubectl describe pvc <pvc-name> -n <namespace> | grep -A 10 "Events"
kubectl get storageclass
kubectl describe storageclass <storage-class-name>

# Solution: Fix storage class or create matching PV
kubectl get storageclass <storage-class-name> -o yaml | grep provisioner
kubectl get pods -n <storage-namespace> | grep <provisioner>

# 2. Volume already attached
kubectl get pv | grep <volume-name>
kubectl get pods --all-namespaces -o wide | grep <node-name>

# Solution: Wait for pod to terminate or use different volume
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# 3. Permission issues
kubectl exec -it <pod-name> -n <namespace> -- ls -la /mount/path
kubectl exec -it <pod-name> -n <namespace> -- touch /mount/path/test

# Solution: Fix security context or volume permissions
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"securityContext":{"fsGroup":2000}}}}}'
```

#### Database Storage Issues

```bash
# Check database pod storage
kubectl exec -it <database-pod> -n <namespace> -- df -h
kubectl exec -it <database-pod> -n <namespace> -- ls -la /data

# Check storage usage
kubectl exec -it <database-pod> -n <namespace> -- du -sh /data/*
kubectl top pod <database-pod> -n <namespace>

# Monitor disk I/O
kubectl exec -it <database-pod> -n <namespace> -- iostat -x 1

# Common issues and solutions:

# 1. Disk space full
kubectl exec -it <database-pod> -n <namespace> -- df -h
kubectl describe pv <pv-name> | grep -A 5 "Capacity"

# Solution: Clean up old data or expand volume
kubectl exec -it <database-pod> -n <namespace> -- find /data -name "*.log" -mtime +7 -delete

# 2. Volume mounting with wrong permissions
kubectl exec -it <database-pod> -n <namespace> -- ls -la /data
kubectl get pvc <pvc-name> -n <namespace> -o yaml | grep -A 10 "accessModes"

# Solution: Fix permissions via securityContext; note that accessModes are immutable on an existing
# PVC, so switching to ReadWriteMany means creating a new claim (on a backend that supports RWX)
kubectl get pvc <pvc-name> -n <namespace> -o yaml > pvc.yaml   # adjust accessModes, then recreate the claim

# 3. Database corruption due to storage issues
kubectl exec -it <database-pod> -n <namespace> -- cat /var/log/postgresql/postgresql.log
kubectl logs <database-pod> -n <namespace> | grep -i error

# Solution: Restore from backup. pg_resetwal is a destructive last resort that can discard
# committed transactions; run it as the postgres user against the correct data directory
kubectl exec -it <database-pod> -n <namespace> -- pg_resetwal /data
```
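
Expanding a full volume (rather than deleting data) only works when the StorageClass permits it. A sketch of such a class; the provisioner shown is a placeholder for whatever CSI driver your cluster runs:

```yaml
# StorageClass permitting online volume expansion (provisioner is a placeholder)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable-ssd
provisioner: ebs.csi.aws.com   # replace with your cluster's CSI driver
allowVolumeExpansion: true     # required before a PVC's requested storage can be increased
reclaimPolicy: Retain
```

With expansion enabled, growing the claim is a patch to `spec.resources.requests.storage` on the PVC; the filesystem resize happens automatically for most CSI drivers.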

## Performance Troubleshooting

### Resource Bottlenecks

#### High CPU Usage

```bash
# Identify high CPU pods
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top nodes --sort-by=cpu

# Get detailed metrics
kubectl exec -it <high-cpu-pod> -n <namespace> -- top
kubectl exec -it <high-cpu-pod> -n <namespace> -- ps aux

# Check CPU throttling (path depends on the cgroup version in use)
kubectl describe pod <high-cpu-pod> -n <namespace> | grep -A 10 "Limits"
kubectl exec -it <high-cpu-pod> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat   # cgroup v2; v1: /sys/fs/cgroup/cpu/cpu.stat

# Common causes and solutions:

# 1. CPU limits too low
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits"
kubectl top pod <pod-name> -n <namespace>

# Solution: Increase CPU limits
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"cpu":"1000m"}}}]}}}}'

# 2. Application performance issues (for Go apps exposing net/http/pprof; requires Go tooling in the image)
kubectl exec -it <pod-name> -n <namespace> -- go tool pprof <application-url>/debug/pprof/profile
kubectl logs <pod-name> -n <namespace> | grep -i "slow\|timeout\|error"

# Solution: Profile application and optimize code
kubectl exec -it <pod-name> -n <namespace> -- curl http://localhost:8080/debug/pprof/profile?seconds=30 -o cpu.prof

# 3. Missing resource requests
kubectl get pods -n <namespace> -o yaml | grep -A 10 "resources"
kubectl describe nodes | grep -A 10 "Allocated resources"

# Solution: Set appropriate resource requests
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"requests":{"cpu":"200m"}}}]}}}}'
```

#### High Memory Usage

```bash
# Identify high memory pods
kubectl top pods --all-namespaces --sort-by=memory
kubectl top nodes --sort-by=memory

# Check memory usage details
kubectl exec -it <high-memory-pod> -n <namespace> -- free -h
kubectl exec -it <high-memory-pod> -n <namespace> -- cat /sys/fs/cgroup/memory.current   # cgroup v2; v1: /sys/fs/cgroup/memory/memory.usage_in_bytes

# Check for memory leaks
kubectl logs <pod-name> -n <namespace> | grep -i "out of memory\|oom-killer"
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "State"

# Common causes and solutions:

# 1. Memory leaks
kubectl exec -it <pod-name> -n <namespace> -- cat /proc/<pid>/status | grep -i vm
kubectl logs <pod-name> -n <namespace> | tail -100

# Solution: Restart pod or fix memory leak
kubectl delete pod <pod-name> -n <namespace>

# 2. Memory limits too low
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits"
kubectl top pod <pod-name> -n <namespace>

# Solution: Increase memory limits
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"2Gi"}}}]}}}}'

# 3. Garbage collection issues
kubectl exec -it <pod-name> -n <namespace> -- jstat -gc <java-pid>
kubectl logs <pod-name> -n <namespace> | grep -i "gc\|heap"

# Solution: Tune garbage collection or increase heap size
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","env":[{"name":"JAVA_OPTS","value":"-Xmx1g -XX:+UseG1GC"}]}]}}}}'
```
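
OOM-killed containers record `reason: OOMKilled` (exit code 137) in their last container state, which is faster to check than trawling logs. A sketch that counts such entries in `kubectl get pods -o json` output; the heredoc below is an illustrative fragment, and live usage would pipe real kubectl output in:

```shell
# count_oomkilled: count OOMKilled container states in pod JSON read from stdin.
# Live usage: kubectl get pods -n <namespace> -o json | count_oomkilled
count_oomkilled() {
  grep -c '"reason": "OOMKilled"'
}

# Illustrative fragment of containerStatuses (not real cluster output):
count=$(count_oomkilled <<'EOF'
"lastState": { "terminated": { "exitCode": 137, "reason": "OOMKilled" } }
"lastState": {}
EOF
)
echo "OOMKilled container states: $count"
```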

## Cluster-Level Issues

### API Server Problems

#### API Server Unreachable

```bash
# Check API server connectivity (componentstatuses is deprecated since v1.19)
kubectl get --raw='/readyz?verbose'
kubectl cluster-info
kubectl get nodes

# Check API server logs
kubectl logs -n kube-system -l component=kube-apiserver
kubectl logs -n kube-system -l component=kube-controller-manager
kubectl logs -n kube-system -l component=kube-scheduler

# Check etcd health
kubectl get pods -n kube-system | grep etcd
kubectl logs -n kube-system -l component=etcd

# Common issues and solutions:

# 1. API server overloaded
kubectl top pods -n kube-system
kubectl describe node <master-node> | grep -A 10 "Conditions"

# Solution: Reduce API server load or scale the control plane
kubectl get pods -n kube-system -l component=kube-apiserver
# On kubeadm clusters the API server is a static pod, not a Deployment; edit its manifest on the control-plane node:
# sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml

# 2. etcd cluster issues
kubectl get pods -n kube-system | grep etcd
kubectl logs -n kube-system -l component=etcd | grep -i "leader\|health"

# Solution: Fix etcd cluster (etcdctl usually needs client certificates; kubeadm default paths shown)
kubectl exec -it <etcd-pod> -n kube-system -- etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint health

# 3. Network connectivity to the API server from inside the cluster
kubectl run test-pod --image=nicolaka/netshoot --rm -it --restart=Never -- curl -k https://kubernetes.default.svc.cluster.local/healthz
kubectl get endpoints kubernetes -n default

# Solution: Fix network configuration
kubectl get svc kubernetes -n default -o yaml
```

### Node Issues

#### Node NotReady

```bash
# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check kubelet status (kubelet runs as a node service, not a pod)
ssh <node-user>@<node-ip> "systemctl status kubelet"
ssh <node-user>@<node-ip> "journalctl -u kubelet --since '1 hour ago' | tail -50"

# Check system resources on node (use crictl instead of docker on containerd-based nodes)
ssh <node-user>@<node-ip> "free -h && df -h && sudo crictl ps"

# Common issues and solutions:

# 1. Kubelet not running
ssh <node-user>@<node-ip> "systemctl status kubelet"
ssh <node-user>@<node-ip> "sudo systemctl restart kubelet"

# 2. Disk pressure
ssh <node-user>@<node-ip> "df -h"
ssh <node-user>@<node-ip> "sudo crictl rmi --prune"   # on Docker nodes: sudo docker system prune -f

# 3. Network issues (kube-proxy normally runs as a DaemonSet, not a systemd service)
ssh <node-user>@<node-ip> "curl -k https://<api-server-ip>:6443/healthz"
kubectl -n kube-system delete pod -l k8s-app=kube-proxy --field-selector spec.nodeName=<node-name>

# 4. Resource exhaustion
ssh <node-user>@<node-ip> "top -b -n 1 | head -20 && iostat -x 1 3"
ssh <node-user>@<node-ip> "sudo systemctl restart containerd"   # or docker, depending on the runtime
```

## Advanced Debugging Tools

### kubectl Debug Commands

#### Debug Container

```bash
# Create debug container
kubectl debug <pod-name> -n <namespace> --image=nicolaka/netshoot --share-processes --copy-to=<debug-pod>

# Debug with elevated privileges (there is no --as-root flag; use a debugging profile, kubectl v1.27+)
kubectl debug <pod-name> -n <namespace> --image=busybox --profile=sysadmin

# Debug a specific node (the host filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=nicolaka/netshoot

# Debug with custom command
kubectl debug <pod-name> -n <namespace> --image=ubuntu -- sh -c "apt-get update && apt-get install -y curl vim"
```

#### Port Forwarding

```bash
# Forward local port to pod
kubectl port-forward <pod-name> 8080:8080 -n <namespace>

# Forward to service
kubectl port-forward service/<service-name> 8080:80 -n <namespace>

# Forward database port for debugging
kubectl port-forward <database-pod> 5432:5432 -n <namespace>
psql -h localhost -p 5432 -U <username> -d <database>

# Forward the API server (use an unprivileged local port; binding 443 locally requires root)
kubectl port-forward svc/kubernetes 8443:443 -n default
```

### Advanced Network Debugging

#### Network Policy Testing

```bash
# Create test pod for network debugging
kubectl run network-test --image=nicolaka/netshoot --rm -it --restart=Never -- bash

# Test DNS resolution
nslookup kubernetes.default.svc.cluster.local
dig kubernetes.default.svc.cluster.local

# Test connectivity to services
curl -v http://<service-name>.<namespace>.svc.cluster.local
telnet <service-name>.<namespace>.svc.cluster.local <port>

# Test egress connectivity
curl -v https://google.com
ping 8.8.8.8

# Test pod-to-pod connectivity
kubectl exec -it <pod-1> -- ping <pod-2-ip>
kubectl exec -it <pod-1> -- wget -qO- http://<pod-2-ip>:<port>
```

## Automation and Scripts

### Troubleshooting Scripts

#### Health Check Script

```bash
#!/bin/bash
# k8s-health-check.sh

NAMESPACE=${1:-"all"}
SEVERITY=${2:-"warning"}

echo "🔍 Kubernetes Cluster Health Check"
echo "=================================="

# Check API server
echo "📡 API Server Status:"
kubectl get --raw='/readyz' >/dev/null 2>&1 && echo "✅ API server reachable" || echo "❌ API server not accessible"

# Check nodes
echo -e "\n🖥️  Node Status:"
kubectl get nodes -o wide

# Check pods in namespace
if [ "$NAMESPACE" == "all" ]; then
    echo -e "\n📦 Pod Status (All Namespaces):"
    kubectl get pods --all-namespaces | grep -E "(Error|CrashLoopBackOff|Pending|ImagePullBackOff|ErrImagePull)"
else
    echo -e "\n📦 Pod Status in $NAMESPACE:"
    kubectl get pods -n "$NAMESPACE" | grep -E "(Error|CrashLoopBackOff|Pending|ImagePullBackOff|ErrImagePull)"
fi

# Check resource usage
echo -e "\n📊 Resource Usage:"
kubectl top nodes
kubectl top pods --all-namespaces | head -10

# Check events
echo -e "\n⚠️  Recent Events:"
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -10

# Check storage
echo -e "\n💾 Storage Status:"
kubectl get pvc --all-namespaces | grep -E "(Pending|Failed)"
kubectl get pv | grep -E "(Failed|Released)"

echo -e "\n✅ Health check completed!"
```

#### Pod Debug Script

```bash
#!/bin/bash
# debug-pod.sh

POD_NAME=$1
NAMESPACE=${2:-"default"}

if [ -z "$POD_NAME" ]; then
    echo "Usage: $0 <pod-name> [namespace]"
    exit 1
fi

echo "🔍 Debugging Pod: $POD_NAME in namespace: $NAMESPACE"
echo "=============================================="

# Get pod details
echo "📋 Pod Details:"
kubectl get pod $POD_NAME -n $NAMESPACE -o wide

echo -e "\n📝 Pod Description:"
kubectl describe pod $POD_NAME -n $NAMESPACE

# Get logs
echo -e "\n📜 Pod Logs:"
kubectl logs $POD_NAME -n $NAMESPACE --tail=50

# Get previous logs if exists
echo -e "\n📜 Previous Pod Logs:"
kubectl logs $POD_NAME -n $NAMESPACE --previous --tail=50 2>/dev/null || echo "No previous logs found"

# Check events
echo -e "\n⚠️  Pod Events:"
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$POD_NAME --sort-by='.lastTimestamp'

# Check resource usage
echo -e "\n📊 Resource Usage:"
kubectl top pod $POD_NAME -n $NAMESPACE 2>/dev/null || echo "Metrics not available"

# Create debug container (name it explicitly so exec targets the right container)
echo -e "\n🐛 Creating Debug Container:"
kubectl debug "$POD_NAME" -n "$NAMESPACE" --image=nicolaka/netshoot --share-processes --copy-to="$POD_NAME-debug" --container=debugger
kubectl exec -it "$POD_NAME-debug" -n "$NAMESPACE" -c debugger -- bash

echo -e "\n✅ Debug session completed!"
kubectl delete pod $POD_NAME-debug -n $NAMESPACE --force --grace-period=0 2>/dev/null
```

***

## 🚀 **Production Troubleshooting Setup**

### Troubleshooting Tools Deployment

```yaml
# Troubleshooting namespace
apiVersion: v1
kind: Namespace
metadata:
  name: troubleshooting
  labels:
    name: troubleshooting

---
# Debug tools DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: debug-tools
  namespace: troubleshooting
spec:
  selector:
    matchLabels:
      app: debug-tools
  template:
    metadata:
      labels:
        app: debug-tools
    spec:
      tolerations:
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "node-role.kubernetes.io/master"   # legacy taint on pre-1.24 clusters
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: debug-tools
        image: nicolaka/netshoot
        command: ["/bin/sh"]
        args: ["-c", "sleep 3600"]
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 50m
            memory: 64Mi
        volumeMounts:
        - name: root
          mountPath: /host
          readOnly: true   # the host filesystem is mounted for inspection only
      volumes:
      - name: root
        hostPath:
          path: /
      restartPolicy: Always

---
# Troubleshooting service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: troubleshooting-sa
  namespace: troubleshooting

---
# Troubleshooting cluster role
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: troubleshooting-role
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "pods/exec", "pods/portforward"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["nodes", "services", "events"]
  verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: troubleshooting-binding
subjects:
- kind: ServiceAccount
  name: troubleshooting-sa
  namespace: troubleshooting
roleRef:
  kind: ClusterRole
  name: troubleshooting-role
  apiGroup: rbac.authorization.k8s.io
```

***

## 📚 **Resources and References**

### Official Documentation

* [Kubernetes Troubleshooting](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)
* [kubectl Debug](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#debug)
* [Network Policies Troubleshooting](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-network-policies/)

### Troubleshooting Tools

* [kubectl Debug Cheat Sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#debugging)
* [Network Policy Analyzer](https://github.com/mattfenwick/network-policy-analyzer)
* [kube-bench](https://github.com/aquasecurity/kube-bench)

### Cheatsheet Summary

```bash
# Common Troubleshooting Commands
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl top pods --all-namespaces --sort-by=cpu
kubectl debug <pod-name> -n <namespace> --image=busybox

# Network Debugging
kubectl run test-pod --image=nicolaka/netshoot --rm -it --restart=Never -- bash
kubectl port-forward <pod-name> 8080:8080 -n <namespace>

# Storage Debugging
kubectl get pvc --all-namespaces
kubectl describe pvc <pvc-name> -n <namespace>
kubectl exec -it <pod-name> -n <namespace> -- df -h
```

The advanced troubleshooting documentation is ready to use! 🔧
