# Troubleshooting

> 🚨 **Problem Solving**: A comprehensive guide to debugging, troubleshooting, and performance analysis in Kubernetes.

***

## 📋 **Table of Contents**

### **🔍 Common Issues**

* [Pod Status Problems](#pod-status-problems)
* [Network Issues](#network-issues)
* [Storage Problems](#storage-problems)
* [Resource Issues](#resource-issues)
* [Authentication & Authorization](#authentication--authorization)

### **🔧 Debugging Techniques**

* [Pod Debugging](#pod-debugging)
* [Service Debugging](#service-debugging)
* [Network Debugging](#network-debugging)
* [Resource Monitoring](#resource-monitoring)
* [Log Analysis](#log-analysis)

### **📊 Performance Analysis**

* [CPU Performance](#cpu-performance)
* [Memory Performance](#memory-performance)
* [Network Performance](#network-performance)
* [Storage Performance](#storage-performance)
* [Application Performance](#application-performance)

### **🛠️ Troubleshooting Tools**

* [kubectl Debug Commands](#kubectl-debug-commands)
* [Network Tools](#network-tools)
* [Performance Tools](#performance-tools)
* [Log Management Tools](#log-management-tools)
* [Third-Party Tools](#third-party-tools)

### **📋 Troubleshooting Matrix**

* [Problem Identification](#problem-identification)
* [Solution Framework](#solution-framework)
* [Escalation Path](#escalation-path)
* [Prevention Strategies](#prevention-strategies)

***

## 🔍 **Common Issues**

### Pod Status Problems

**🚨 Common Pod States**

| State | Description | Common Causes | Solutions |
| ----- | ----------- | ------------- | --------- |
| `Pending` | Pod accepted but not yet running (waiting to be scheduled or to pull images) | Insufficient resources, image pull issues, pending PVC | Check resource requests, verify the image, check PVC status |
| `CrashLoopBackOff` | Pod restarts repeatedly | Application error, configuration issues, liveness probe failure | Check pod logs, verify configuration, debug the application |
| `ImagePullBackOff` | Image pull failed | Image does not exist, registry access issues, wrong image tag | Verify the image path, check credentials, fix the tag |
| `Running` but not ready | Health check failing | Misconfigured readiness probe, slow application startup | Check probe configuration, extend the timeout |
| `Terminating` | Pod is being terminated | Grace period timeout, blocking finalizer | Force delete if necessary |
| `Failed` | Containers terminated and at least one exited with an error | Configuration error, resource issues | Check pod events, verify the configuration |

**🔧 Pod Status Diagnosis**

```bash
# Check pod status and events
kubectl get pods -o wide
kubectl describe pod <pod-name>
kubectl get events --sort-by=.metadata.creationTimestamp

# Check pod logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> --previous

# Check pod YAML
kubectl get pod <pod-name> -o yaml

# Check pod resource usage
kubectl top pod <pod-name>
```

### Network Issues

**🌐 Common Network Problems**

| Issue | Symptoms | Diagnosis | Solutions |
| ----- | -------- | --------- | --------- |
| Service unreachable | Connection timeout, DNS resolution failed | Check service endpoints, network policies | Verify service configuration, check network policies |
| Pod cannot access external services | External connection timeout | Check egress policies, DNS configuration | Allow egress traffic, verify DNS setup |
| Inter-pod communication fails | Connection refused between pods | Check network policies, CNI configuration | Verify network policies, troubleshoot CNI |
| DNS resolution fails | NXDOMAIN, timeout | Check CoreDNS, network policies | Restart CoreDNS, check network policies |
| Load balancer issues | External access failed | Check service type, cloud provider configuration | Verify service type, check cloud provider setup |

**🔧 Network Diagnosis**

```bash
# Check service endpoints
kubectl get endpoints <service-name>
kubectl describe \
  service <service-name>

# Test network connectivity
kubectl exec -it <pod-name> -- nslookup <service-name>
kubectl exec -it <pod-name> -- curl http://<service-name>:<port>

# Check network policies
kubectl get networkpolicies --all-namespaces
kubectl describe networkpolicy <policy-name>

# Check DNS resolution
kubectl exec -it <pod-name> -- cat /etc/resolv.conf
kubectl exec -it <pod-name> -- nslookup kubernetes.default.svc.cluster.local

# Check CNI configuration (the label depends on the CNI plugin; Canal shown here)
kubectl get pods -n kube-system -l k8s-app=canal
kubectl logs -n kube-system -l k8s-app=canal
```

### Storage Problems

**💾 Common Storage Issues**

| Issue | Symptoms | Diagnosis | Solutions |
| ----- | -------- | --------- | --------- |
| PVC stuck in Pending | Storage unavailable, StorageClass error | Check storage class, PV availability | Create an appropriate storage class, provision a PV |
| Pod cannot mount volume | Mount error, permission denied | Check volume configuration, permissions | Verify volume config, fix permissions |
| Storage class not found | PVC creation failed | Check storage class existence | Create the missing storage class |
| ReadWriteOnce conflict | Multiple pods trying to mount the same volume | Check access modes | Use an appropriate access mode |
| EBS performance issues | Slow I/O, high latency | Check EBS metrics, volume type | Optimize volume type, increase IOPS |

**🔧 Storage Diagnosis**

```bash
# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>

# Check PV status
kubectl get pv
kubectl describe pv <pv-name>

# Check storage classes
kubectl get storageclass
kubectl describe storageclass <storageclass-name>

# Check volume mount in pod
kubectl describe pod <pod-name> | grep -A 10 Mounts

# Check CSI drivers (the label depends on the driver; app=csi-driver is a placeholder)
kubectl get pods -n kube-system | grep csi
kubectl logs -n kube-system -l app=csi-driver
```

### Resource Issues

**📊 Common Resource Problems**

| Issue | Symptoms | Diagnosis | Solutions |
| ----- | -------- | --------- | --------- |
| Pod OOMKilled | Out of memory error | Check memory limits, usage | Increase memory limits, optimize application |
| Pod throttled | CPU limit reached | Check CPU limits, usage | Increase CPU limits, optimize performance |
| Node pressure | Node not ready, resource exhausted | Check node conditions, resource usage | Add resources, optimize workloads |
| Cluster autoscaler not working | Nodes not scaling | Check autoscaler logs, IAM permissions | Fix IAM permissions, troubleshoot autoscaler |
| Resource quota exceeded | Pod creation failed | Check resource quotas, limits | Increase quotas, optimize resource usage |

**🔧 Resource Diagnosis**

```bash
# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Check node conditions
kubectl describe node <node-name>

# Check resource quotas
kubectl get resourcequota
kubectl describe resourcequota <quota-name>

# Check limit ranges
kubectl get limitrange
kubectl describe limitrange <limitrange-name>

# Check pod resource requests/limits
kubectl get pods -o jsonpath='{.items[*].spec.containers[*].resources}'
```

### Authentication & Authorization

**🔐 Common Auth Issues**

| Issue | Symptoms | Diagnosis | Solutions |
| ----- | -------- | --------- | --------- |
| Unauthorized access | 401 Unauthorized | Check credentials, RBAC | Verify credentials, fix RBAC config |
| Forbidden access | 403 Forbidden | Check permissions, service accounts | Fix RBAC, check service accounts |
| Service account issues | Pod cannot access API | Check service account, tokens | Verify service account, recreate tokens |
| External auth failed | Authentication error | Check OIDC, LDAP config | Fix external auth configuration |
| Token expired | Invalid token error | Check token validity | Refresh token, fix expiration policy |

**🔧 Auth Diagnosis**

```bash
# Check current context
kubectl config current-context
kubectl config view

# Test
# authentication
kubectl auth can-i create pods
kubectl auth can-i --list --as=system:serviceaccount:default:my-sa

# Check RBAC roles
kubectl get roles
kubectl describe role <role-name>

# Check service accounts
kubectl get serviceaccounts
kubectl describe serviceaccount <service-account-name>

# Check token
kubectl get secret <secret-name> -o yaml
```

***

## 🔧 **Debugging Techniques**

### Pod Debugging

**🔍 Step-by-Step Pod Debugging**

**Step 1: Basic Status Check**

```bash
# Check pod status
kubectl get pods -o wide
kubectl get pods --show-labels

# Check pod details
kubectl describe pod <pod-name>

# Check events
kubectl get events --sort-by=.metadata.creationTimestamp
```

**Step 2: Log Analysis**

```bash
# Get container logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> --previous  # Previous container instance

# Follow logs in real time
kubectl logs -f <pod-name>

# Get logs from all containers in pod
kubectl logs <pod-name> --all-containers=true
```

**Step 3: Container Inspection**

```bash
# Exec into container
kubectl exec -it <pod-name> -- /bin/bash
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh

# Check processes in container
kubectl exec <pod-name> -- ps aux

# Check network in container
kubectl exec <pod-name> -- netstat -tulpn
kubectl exec <pod-name> -- ip addr show
```

**Step 4: Resource Analysis**

```bash
# Check resource usage
kubectl top pod <pod-name>
kubectl exec <pod-name> -- top

# Check environment variables
kubectl exec <pod-name> -- env

# Check mounted volumes
kubectl exec <pod-name> -- df -h
kubectl exec <pod-name> -- mount | grep volume
```

**🚨 Emergency Pod Debugging**

```bash
# Create a copy of the pod with a debug container and a shared process namespace
# (--share-processes only takes effect together with --copy-to)
kubectl debug -it <pod-name> --image=busybox --copy-to=my-debug-pod --share-processes

# Create an ephemeral debug container in the running pod
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# Debug node
kubectl debug node/<node-name> -it --image=nicolaka/netshoot
```

### Service Debugging

**🔧 Service Connectivity Debugging**

**Step 1: Service Configuration Check**

```bash
# Check service configuration
kubectl get svc
kubectl describe svc <service-name>

# Check service endpoints
kubectl get endpoints
kubectl get endpoints <service-name>

# Check service details
kubectl get svc <service-name> -o yaml
```

**Step 2: Endpoint
Verification**

```bash
# Check if pods are selected by service
kubectl get pods -l <label-selector>

# Test service connectivity within cluster
kubectl run test-pod --image=busybox --rm -it --restart=Never -- wget -qO- <service-name>.<namespace>.svc.cluster.local

# Test service port
kubectl run test-pod --image=busybox --rm -it --restart=Never -- nc -zv <service-name> <port>
```

**Step 3: DNS Resolution Test**

```bash
# Test DNS resolution
kubectl run test-pod --image=busybox --rm -it --restart=Never -- nslookup <service-name>
kubectl run test-pod --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default.svc.cluster.local

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
```

**Step 4: External Access Test**

```bash
# For LoadBalancer services
kubectl get svc <service-name>
kubectl get svc <service-name> -o wide

# Test external access
curl http://<external-ip>:<port>
telnet <external-ip> <port>

# For Ingress
kubectl get ingress
kubectl describe ingress <ingress-name>
curl -H "Host: <hostname>" http://<ingress-ip>
```

### Network Debugging

**🌐 Network Troubleshooting Tools**

**Basic Network Tests**

```bash
# Test pod-to-pod connectivity
kubectl exec -it <pod-name> -- ping <target-pod-ip>

# Test service connectivity
kubectl exec -it <pod-name> -- curl <service-name>:<port>

# Test external connectivity
kubectl exec -it <pod-name> -- curl -I https://www.google.com

# Check DNS resolution
kubectl exec -it <pod-name> -- nslookup kubernetes.default.svc.cluster.local
```

**Advanced Network Debugging**

```bash
# Port forwarding for testing
kubectl port-forward <pod-name> 8080:80
kubectl port-forward service/<service-name> 8080:80

# Network policy debugging
kubectl exec -it <pod-name> -- telnet <target-host> <port>
kubectl exec -it <pod-name> -- nc -zv <target-host> <port>

# Check routing table
kubectl exec -it <pod-name> -- ip route show
kubectl exec -it <pod-name> -- route -n
```

**Network Tools Installation**

```bash
# Deploy network debugging tools
kubectl apply -f https://raw.githubusercontent.com/nicolaka/netshoot/master/k8s/netshoot.yaml

# Use netshoot pod for debugging
kubectl exec -it netshoot -- bash

# Inside netshoot container:
netshoot# ping google.com
netshoot# nslookup kubernetes.default.svc.cluster.local
netshoot# curl -I http://example.com
netshoot# tcpdump -i any
```

### Resource Monitoring

**📊 Real-time Resource Monitoring**

**Node Resource Analysis**

```bash
# Check node resource usage
kubectl top nodes
kubectl describe node <node-name>

# Check node conditions
kubectl get nodes -o wide
kubectl describe node <node-name> | grep -A 5 "Conditions"

# Check node capacity
kubectl describe node <node-name> | grep -A 10 "Capacity"
```

**Pod Resource Analysis**

```bash
# Check pod resource usage
kubectl top pods
kubectl top pods --all-namespaces

# Check pod resource requests/limits (quoted so the shell does not expand [*])
kubectl get pods -o custom-columns='NAME:.metadata.name,REQ_CPU:.spec.containers[*].resources.requests.cpu,LIM_CPU:.spec.containers[*].resources.limits.cpu'

# Check resource quotas
kubectl get resourcequota
kubectl describe resourcequota <quota-name>
```

**Advanced Resource Debugging**

```bash
# Check detailed metrics through a local API proxy
kubectl proxy &
curl http://localhost:8001/apis/metrics.k8s.io/v1beta1/nodes
curl http://localhost:8001/apis/metrics.k8s.io/v1beta1/pods

# Query current metrics for a single node or pod
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes/<node-name>"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>"
```

### Log Analysis

**📝 Log Collection and Analysis**

**System Logs**

```bash
# Get control plane component logs (where they run as pods, e.g. kubeadm clusters)
kubectl logs -n kube-system -l component=kube-apiserver
kubectl logs -n kube-system -l component=kube-controller-manager
kubectl logs -n kube-system -l component=kube-scheduler

# The kubelet is a node service, not a pod: read its logs on the node (systemd hosts)
journalctl -u kubelet

# Get CNI logs (the label depends on the CNI plugin)
kubectl logs -n kube-system -l k8s-app=canal
kubectl logs -n kube-system -l app=flannel
```

**Application Logs**

```bash
# Get application logs with labels
kubectl logs -l app=<app-label> --all-containers=true

# Get logs from previous container
kubectl logs <pod-name> --previous

# Get logs from all matching pods in a namespace
kubectl logs -l <label-selector> --all-containers=true -n <namespace> --tail=100

# Export logs to file
kubectl logs <pod-name> > pod-logs.txt
```

**Log Aggregation**

```bash
# Using stern for multi-pod log monitoring
stern -n <namespace> <pod-query>
stern -l app=<app-label> --since 10m

# Using kubectl for log analysis
kubectl logs -f deployment/<deployment-name> --all-containers=true

# Filter logs with grep
kubectl logs <pod-name> | grep "ERROR"
kubectl logs <pod-name> | grep -i "exception\|error\|failed"
```

***

## 📊 **Performance Analysis**

### CPU Performance

**🚀 CPU Optimization Analysis**

**CPU Usage Analysis**

```bash
# Check CPU usage by pods
kubectl top pods --sort-by=cpu

# Check CPU usage by nodes
kubectl top nodes --sort-by=cpu

# Detailed CPU metrics
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
```

**CPU Performance Debugging**

```bash
# Check CPU limits and requests (quoted so the shell does not expand [*])
kubectl get pods -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu'

# Check CPU throttling (cgroup v1 paths; on cgroup v2 nodes read /sys/fs/cgroup/cpu.stat)
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu,cpuacct/cpuacct.usage

# Check CPU-intensive processes
kubectl exec -it <pod-name> -- top
kubectl exec -it <pod-name> -- ps aux --sort=-%cpu
```

**CPU Optimization**

```yaml
# CPU optimization example
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    resources:
      requests:
        cpu: "100m"  # Minimum guaranteed CPU
      limits:
        cpu: "500m"  # Maximum CPU limit
    env:
    - name: GOMAXPROCS
      value: "2"  # Number of CPU cores to use
```

### Memory Performance

**💾 Memory Optimization Analysis**

**Memory Usage Analysis**

```bash
# Check memory usage by pods
kubectl top pods --sort-by=memory

# Check memory usage by nodes
kubectl top nodes --sort-by=memory

# Check current memory usage per pod (first container only)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods" | jq '.items[] | {pod: .metadata.name, memory: .containers[0].usage.memory}'
```

**Memory Performance Debugging**

```bash
# Check memory limits and requests (quoted so the shell does not expand [*])
kubectl get pods -o custom-columns='NAME:.metadata.name,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'

# Check OOM events
kubectl describe pod <pod-name> | grep -i oom

# Check memory usage inside container
kubectl exec -it <pod-name> -- free -h
kubectl exec -it <pod-name> -- cat /proc/meminfo

# Check memory leaks (cgroup v1 paths; on cgroup v2 nodes read memory.current and memory.peak)
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes
```

**Memory Optimization**

```yaml
# Memory optimization example
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    resources:
      requests:
        memory: "128Mi"  # Minimum guaranteed memory
      limits:
        memory: "512Mi"  # Maximum memory limit
    env:
    - name: JAVA_OPTS
      value: "-Xms128m -Xmx384m"  # JVM heap kept below the 512Mi limit to leave room for non-heap memory
```

### Network Performance

**🌐 Network Performance Analysis**

**Network Latency Testing**

```bash
# Test latency between pods
kubectl exec -it <pod-name> -- ping -c 10 <target-pod-ip>

# Test DNS resolution latency (time is a shell keyword, so run it through sh -c)
kubectl exec -it <pod-name> -- sh -c 'time nslookup kubernetes.default.svc.cluster.local'

# Test service latency
kubectl exec -it <pod-name> -- sh -c 'time curl -s <service-name>:<port>'
```

**Network Throughput Testing**

```bash
# Deploy network performance testing tools
kubectl apply -f https://raw.githubusercontent.com/istio/istio/master/samples/fortio-deploy.yaml

# Run network tests
kubectl exec -it fortio -- /usr/local/bin/fortio load -c 10 -qps 0 -n 30 -loglevel Warning http://<service-name>:<port>

# Test bandwidth between pods (requires an iperf3 server running in the target pod)
kubectl exec -it <client-pod> -- /usr/bin/iperf3 -c <server-pod-ip> -t 30
```

**Network Performance Optimization**

```yaml
# Network optimization example
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"
    env:
    - name: GOMAXPROCS
      value: "1"
    - name: GOGC
      value: "100"
```

### Storage Performance

**💾 Storage Performance Analysis**

**Disk I/O Testing**

```bash
# Test disk performance in pod
kubectl exec -it <pod-name> -- dd if=/dev/zero of=/tmp/testfile bs=1M count=100 oflag=direct
kubectl exec -it <pod-name> -- dd if=/tmp/testfile of=/dev/null bs=1M iflag=direct

# Check disk usage
kubectl exec -it <pod-name> -- df -h
kubectl exec -it <pod-name> -- du -sh /tmp

# Check I/O stats
kubectl exec -it <pod-name> -- iostat -x 1
kubectl exec -it <pod-name> -- cat /proc/diskstats
```

**Storage Performance Optimization**

```yaml
# Storage optimization example
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    volumeMounts:
    - name: cache-volume
      mountPath: /tmp/cache
  volumes:
  - name: cache-volume
    emptyDir:
      medium: Memory  # Use memory for cache
      sizeLimit: 100Mi
```

### Application Performance

**⚡ Application Performance Analysis**

**Application Profiling**

```bash
# Profile Go applications (needs the Go toolchain in the image and net/http/pprof served on port 6060)
kubectl exec -it <pod-name> -- go tool pprof http://localhost:6060/debug/pprof/profile

# Profile Java applications (<pid> is the JVM process ID inside the container)
kubectl exec -it <pod-name> -- jcmd <pid> GC.run
kubectl exec -it <pod-name> -- jcmd <pid> VM.native_memory summary

# Profile Node.js applications (starts a second process with the inspector enabled)
kubectl exec -it <pod-name> -- node --inspect=0.0.0.0:9229 app.js
```

**Health Check Optimization**

```yaml
# Health check optimization example
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
```

***

## 🛠️ **Troubleshooting Tools**

### kubectl Debug Commands

**🔧 Essential kubectl Commands**

**Pod Debugging**

```bash
# Get detailed pod information
kubectl get pod <pod-name> -o yaml
kubectl describe pod <pod-name>

# Debug pod with ephemeral container
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# Copy files from pod
kubectl cp <pod-name>:/path/to/file ./local-file

# Port forward for debugging
kubectl port-forward <pod-name> 8080:80
```

**Service Debugging**

```bash
# Get service information
kubectl get svc <service-name> -o yaml
kubectl describe svc <service-name>

# Check service endpoints
kubectl get endpoints <service-name>
kubectl describe endpoints <service-name>

# Test service connectivity
kubectl run test-pod --image=busybox --rm -it --restart=Never -- wget -qO- <service-name>:<port>
```

**Network Debugging**

```bash
# Check network policies
kubectl get networkpolicies --all-namespaces
kubectl describe networkpolicy <policy-name>

#
# Check DNS resolution
kubectl run test-pod --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default.svc.cluster.local

# Test network connectivity
kubectl run test-pod --image=nicolaka/netshoot --rm -it --restart=Never -- /bin/bash
```

### Network Tools

**🌐 Network Debugging Tools**

**Network Testing Tools**

```bash
# Deploy network debugging tools
kubectl apply -f https://raw.githubusercontent.com/nicolaka/netshoot/master/k8s/netshoot.yaml

# Use netshoot for network debugging
kubectl exec -it netshoot -- bash
netshoot# ping google.com
netshoot# nslookup kubernetes.default.svc.cluster.local
netshoot# curl -I http://example.com
netshoot# tcpdump -i any -n
netshoot# netstat -tulpn
netshoot# ss -tulpn
netshoot# ip addr show
netshoot# ip route show
```

**Network Policy Debugging**

```bash
# Calico network debugging (assumes calicoctl is installed and configured on your workstation)
calicoctl get workloadendpoints --all-namespaces
calicoctl get networkpolicy --all-namespaces

# Cilium network debugging (the cilium CLI lives in the agent pods, not the operator;
# assumes the default "cilium" DaemonSet in kube-system)
kubectl exec -it -n kube-system ds/cilium -- cilium bpf policy list
kubectl exec -it -n kube-system ds/cilium -- cilium endpoint list
```

### Performance Tools

**📊 Performance Monitoring Tools**

**Resource Monitoring**

```bash
# Deploy metrics server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Get detailed metrics
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"
```

**Performance Profiling Tools**

```bash
# Deploy performance tools
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
# Optional: kube-prometheus-stack already bundles Grafana
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana

# Check performance metrics through the API server proxy
# (kubectl get --raw takes API paths, not URLs; the service name and namespace depend on your install)
kubectl get --raw "/api/v1/namespaces/monitoring/services/prometheus-server:9090/proxy/api/v1/query?query=container_cpu_usage_seconds_total"
```

### Log Management Tools

**📝 Log Collection and Analysis Tools**

**Fluent Bit for Log Collection**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5

    [FILTER]
        Name                 kubernetes
        Match                kube.*
        Kube_URL             https://kubernetes.default.svc:443
        Kube_CA_File         /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File      /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log            On
        Keep_Log             Off
        K8S-Logging.Parser   On
        K8S-Logging.Exclude  On

    [OUTPUT]
        Name   forward
        Match  *
        Host   logstash.logstash.svc.cluster.local
        Port   24224
```

**Stern for Log Management**

```bash
# Install stern
brew install stern                          # macOS
go install github.com/stern/stern@latest    # Go

# Use stern for log monitoring (the pod query is a regular expression; quote it)
stern -n production "app-.*"
stern -l app=my-app --since 10m
stern -n kube-system .                      # all pods in kube-system
```

### Third-Party Tools

**🛠️ Advanced Troubleshooting Tools**

**Lens for GUI Management**

```bash
# Install Lens
# Download from https://k8slens.dev/

# Features:
# - Visual cluster overview
# - Real-time monitoring
# - Log viewer
# - Resource usage charts
# - Workload management
```

**Octant for Web Dashboard**

```bash
# Install Octant
# Download from https://github.com/vmware-tanzu/octant

# Launch Octant
octant --kubeconfig ~/.kube/config
```

**K9s for Terminal UI**

```bash
# Install k9s
brew install k9s                                # macOS
curl -sS https://webinstall.dev/k9s | bash      # Linux

# Use k9s
k9s
k9s -n production
```

***

## 📋 **Troubleshooting Matrix**

### Problem Identification

**🔍 Quick Problem Diagnosis**

| Category | Key Commands | What to Look For |
| -------- | ------------ | ---------------- |
| **Pod Issues** | `kubectl get pods`, `kubectl describe pod` | Status, events, restart count |
| **Network Issues** | `kubectl get svc`, `kubectl describe svc` | Endpoints, service ports |
| **Storage Issues** | `kubectl get pvc`, `kubectl get pv` | Status, capacity, access modes |
| **Resource Issues** | `kubectl top nodes`, `kubectl top pods` | CPU/memory usage, pressure |
| **Auth Issues** | `kubectl auth can-i`, `kubectl get events` | Permissions, authentication |

**🚨 Emergency Response Matrix**

| Issue Severity | Response Time | Escalation | Resolution Target |
| -------------- | ------------- | ---------- | ----------------- |
| **Critical** (Production down) | < 5 minutes | Engineering manager | < 30 minutes |
| **High** (Service degraded) | < 15 minutes | Team lead | < 2 hours |
| **Medium** (Partial impact) | < 1 hour | Senior engineer | < 8 hours |
| **Low** (Minor issue) | < 4 hours | Junior engineer | < 24 hours |

### Solution Framework

**🔧 Systematic Troubleshooting Process**

**Step 1: Problem Definition**

```bash
# Define the problem clearly
echo "Problem: [Clear description of the issue]"
echo "Impact: [What systems/users are affected]"
echo "Scope: [How widespread is the issue]"
echo "Urgency: [How quickly this needs to be resolved]"
```

**Step 2: Information Gathering**

```bash
# Collect basic information
kubectl version
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods --all-namespaces
kubectl get events --sort-by=.metadata.creationTimestamp
```

**Step 3: Analysis**

```bash
# Analyze the collected data
# Look for patterns, correlations, and root causes
# Check recent changes, deployments, or configurations
```

**Step 4: Solution Implementation**

```bash
# Implement fix
# Test in non-production first
# Document the changes made
```

**Step 5: Verification**

```bash
# Verify the fix is working
# Monitor for recurrence
# Update documentation
```

### Escalation Path

**📞 Escalation Guidelines**

**Level 1: Self-Service**

* Use troubleshooting guides
* Check documentation
* Use available tools

**Level 2: Team Support**

* Consult with team members
* Review past incidents
* Use team knowledge base

**Level 3: Engineering Lead**

* Complex technical issues
* Multi-system problems
* Architecture
  concerns

**Level 4: Management**

* Business impact issues
* Resource constraints
* Policy decisions

### Prevention Strategies

**🛡️ Proactive Monitoring**

**Health Checks Implementation**

```yaml
# Comprehensive health checks
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 30
```

**Resource Management**

```yaml
# Resource quotas and limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "10"
    pods: "20"
```

**Monitoring and Alerting**

```yaml
# Prometheus alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-alerts
  namespace: monitoring
spec:
  groups:
  - name: production
    rules:
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"
    - alert: HighMemoryUsage
      # Containers without a memory limit report 0; filter them out to avoid dividing by zero
      expr: container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage in {{ $labels.pod }}"
        description: "Memory usage is above 90% for pod {{ $labels.pod }}"
```

***

## 🎯 **Best Practices**

### **🔍 Troubleshooting Best Practices**

1. **Systematic Approach**
   * Define the problem clearly
   * Gather information systematically
   * Use a structured troubleshooting process
2. **Documentation**
   * Document all findings
   * Track changes made
   * Maintain a knowledge base
3. **Prevention**
   * Implement proactive monitoring
   * Set up proper health checks
   * Review systems regularly

### **🛠️ Tool Usage Best Practices**

1. **Choose the Right Tools**
   * Use appropriate tools for the issue
   * Combine multiple tools for comprehensive analysis
   * Keep tools updated
2. **Efficiency**
   * Use aliases and scripts for common tasks
   * Automate repetitive troubleshooting steps
   * Build a personal troubleshooting toolkit

### **📊 Performance Optimization**

1. **Monitoring**
   * Set up comprehensive monitoring
   * Define meaningful metrics
   * Establish alerting thresholds
2. **Optimization**
   * Regular performance reviews
   * Capacity planning
   * Resource optimization

***

## 🔗 **References**

### **📚 Official Documentation**

* [Kubernetes Troubleshooting](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
* [Debug Pods and ReplicationControllers](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-pods-replicationcontrollers/)
* [Troubleshooting Clusters](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)

### **🛠️ Troubleshooting Tools**

* [Stern Log Tool](https://github.com/stern/stern)
* [K9s Terminal UI](https://github.com/derailed/k9s)
* [Lens IDE](https://k8slens.dev/)
* [Netshoot Network Tools](https://github.com/nicolaka/netshoot)

### **📖 Learning Resources**

* [Kubernetes Debugging Guide](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/)
* [Performance Best Practices](https://kubernetes.io/docs/tasks/debug-application-cluster/resource-usage-monitoring/)
* [Network Troubleshooting](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/)

***

*🔧 Troubleshooting is an essential skill for maintaining a reliable Kubernetes cluster.*
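
**Appendix: Pod State Quick Lookup**

As a rough sketch of how the Common Pod States table can be turned into a first-response helper, the shell function below maps a pod state to one diagnostic command to run first. The function name `first_step`, the example pod name `web-0`, and the choice of a single command per state are assumptions made for illustration; they are not part of kubectl, and the function only prints a suggestion without contacting a cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a suggested first diagnostic command for a pod state.
# The state-to-command mapping is one illustrative reading of the Common Pod
# States table, not an official rule.

first_step() {
  local state="$1"
  local pod="${2:-<pod-name>}"   # falls back to a placeholder when no pod is given

  case "$state" in
    Pending|ImagePullBackOff|Failed)
      # Scheduling, image pull, and startup failures show up in the pod events
      echo "kubectl describe pod $pod" ;;
    CrashLoopBackOff)
      # The crashed container's output is in the previous instance's logs
      echo "kubectl logs $pod --previous" ;;
    Terminating)
      # Finalizers and the deletion timestamp are visible in the pod YAML
      echo "kubectl get pod $pod -o yaml" ;;
    *)
      # Unknown state: start from the basic status view
      echo "kubectl get pod $pod -o wide" ;;
  esac
}

# Example: suggest a first step for a crash-looping pod named web-0
first_step CrashLoopBackOff web-0
```

The helper deliberately prints the command instead of running it, so it can be reviewed before anything touches the cluster.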