# Troubleshooting

> 🚨 **Problem Solving**: A comprehensive guide to debugging, troubleshooting, and performance analysis in Kubernetes.

***

## 📋 **Table of Contents**

### **🔍 Common Issues**

* [Pod Status Problems](#pod-status-problems)
* [Network Issues](#network-issues)
* [Storage Problems](#storage-problems)
* [Resource Issues](#resource-issues)
* [Authentication & Authorization](#authentication--authorization)

### **🔧 Debugging Techniques**

* [Pod Debugging](#pod-debugging)
* [Service Debugging](#service-debugging)
* [Network Debugging](#network-debugging)
* [Resource Monitoring](#resource-monitoring)
* [Log Analysis](#log-analysis)

### **📊 Performance Analysis**

* [CPU Performance](#cpu-performance)
* [Memory Performance](#memory-performance)
* [Network Performance](#network-performance)
* [Storage Performance](#storage-performance)
* [Application Performance](#application-performance)

### **🛠️ Troubleshooting Tools**

* [kubectl Debug Commands](#kubectl-debug-commands)
* [Network Tools](#network-tools)
* [Performance Tools](#performance-tools)
* [Log Management Tools](#log-management-tools)
* [Third-Party Tools](#third-party-tools)

### **📋 Troubleshooting Matrix**

* [Problem Identification](#problem-identification)
* [Solution Framework](#solution-framework)
* [Escalation Path](#escalation-path)
* [Prevention Strategies](#prevention-strategies)

***

## 🔍 **Common Issues**

### Pod Status Problems

**🚨 Common Pod States**

| State                   | Description                      | Common Causes                                                   | Solutions                                               |
| ----------------------- | -------------------------------- | --------------------------------------------------------------- | ------------------------------------------------------- |
| `Pending`               | Pod accepted but not yet running | Insufficient resources, image pull issues, pending PVC          | Check resource requests, verify image, check PVC status |
| `CrashLoopBackOff`      | Container restarts repeatedly    | Application error, configuration issues, liveness probe failure | Check pod logs, verify configuration, debug application |
| `ImagePullBackOff`      | Image pull failed                | Image does not exist, registry access issues, wrong image tag   | Verify image path, check credentials, fix tag           |
| `Running` but not ready | Health check failing             | Misconfigured readiness probe, slow application startup         | Check probe configuration, extend timeout               |
| `Terminating`           | Pod is being terminated          | Grace period timeout, blocking finalizer                        | Inspect finalizers, force delete if necessary           |
| `Failed`                | Pod failed to start              | Configuration error, resource issues                            | Check pod events, verify config                         |

**🔧 Pod Status Diagnosis**

```bash
# Check pod status and events
kubectl get pods -o wide
kubectl describe pod <pod-name>
kubectl get events --sort-by=.metadata.creationTimestamp

# Check pod logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> --previous

# Check pod YAML
kubectl get pod <pod-name> -o yaml

# Check pod resource usage
kubectl top pod <pod-name>
```
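
The table above can be turned into a small triage helper. The sketch below is a hypothetical function, `suggest_pod_check`, that maps a value from the STATUS column of `kubectl get pods` to a reasonable first command. The function name and the exact mapping are illustrative assumptions, not part of kubectl.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a pod STATUS value to a first diagnostic command.
# It only prints a suggestion; it does not contact the cluster.
suggest_pod_check() {
  local status="$1" pod="${2:-<pod-name>}"
  case "$status" in
    Pending)                       echo "kubectl describe pod $pod" ;;
    CrashLoopBackOff)              echo "kubectl logs $pod --previous" ;;
    ImagePullBackOff|ErrImagePull) echo "kubectl describe pod $pod" ;;
    Terminating)                   echo "kubectl get pod $pod -o jsonpath='{.metadata.finalizers}'" ;;
    *)                             echo "kubectl get events --sort-by=.metadata.creationTimestamp" ;;
  esac
}
```

For example, `suggest_pod_check CrashLoopBackOff web-1` prints `kubectl logs web-1 --previous`, which matches the "check pod logs" advice in the table.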

### Network Issues

**🌐 Common Network Problems**

| Issue                               | Symptoms                                  | Diagnosis                                        | Solutions                                            |
| ----------------------------------- | ----------------------------------------- | ------------------------------------------------ | ---------------------------------------------------- |
| Service unreachable                 | Connection timeout, DNS resolution failed | Check service endpoints, network policies        | Verify service configuration, check network policies |
| Pod cannot access external services | External connection timeout               | Check egress policies, DNS configuration         | Allow egress traffic, verify DNS setup               |
| Inter-pod communication fails       | Connection refused between pods           | Check network policies, CNI configuration        | Verify network policies, troubleshoot CNI            |
| DNS resolution fails                | NXDOMAIN, timeout                         | Check CoreDNS, network policies                  | Restart CoreDNS, check network policies              |
| Load balancer issues                | External access failed                    | Check service type, cloud provider configuration | Verify service type, check cloud provider setup      |

**🔧 Network Diagnosis**

```bash
# Check service endpoints
kubectl get endpoints
kubectl describe service <service-name>

# Test network connectivity
kubectl exec -it <pod-name> -- nslookup <service-name>
kubectl exec -it <pod-name> -- curl <service-url>

# Check network policies
kubectl get networkpolicies --all-namespaces
kubectl describe networkpolicy <policy-name>

# Check DNS resolution
kubectl exec -it <pod-name> -- cat /etc/resolv.conf
kubectl exec -it <pod-name> -- nslookup kubernetes.default.svc.cluster.local

# Check CNI pods (the label depends on the CNI plugin; Canal is shown here)
kubectl get pods -n kube-system -l k8s-app=canal
kubectl logs -n kube-system -l k8s-app=canal
```
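
A Service with no endpoints is one of the most common causes of "service unreachable". As a minimal sketch, the hypothetical function `service_has_endpoints` below wraps the endpoint check in a pass/fail test. The function name and the `default` namespace fallback are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the Service has at least one ready endpoint IP.
service_has_endpoints() {
  local svc="$1" ns="${2:-default}" ips
  ips="$(kubectl get endpoints "$svc" -n "$ns" \
           -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)"
  if [ -n "$ips" ]; then
    echo "$svc: endpoints -> $ips"
  else
    echo "$svc: no ready endpoints (check the selector and pod readiness)"
    return 1
  fi
}
```

Calling `service_has_endpoints <service-name> <namespace>` before testing DNS or network policies helps separate selector or readiness problems from genuine network problems.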

### Storage Problems

**💾 Common Storage Issues**

| Issue                   | Symptoms                                   | Diagnosis                               | Solutions                                      |
| ----------------------- | ------------------------------------------ | --------------------------------------- | ---------------------------------------------- |
| PVC stuck in Pending    | Storage unavailable, StorageClass error    | Check storage class, PV availability    | Create appropriate storage class, provision PV |
| Pod cannot mount volume | Mount error, Permission denied             | Check volume configuration, permissions | Verify volume config, fix permissions          |
| Storage class not found | PVC creation failed                        | Check storage class existence           | Create missing storage class                   |
| ReadWriteOnce conflict  | Multiple pods trying to mount same volume  | Check access modes                      | Use appropriate access mode                    |
| EBS performance issues  | Slow I/O, high latency                     | Check EBS metrics, volume type          | Optimize volume type, increase IOPS            |

**🔧 Storage Diagnosis**

```bash
# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>

# Check PV status
kubectl get pv
kubectl describe pv <pv-name>

# Check storage classes
kubectl get storageclass
kubectl describe storageclass <storageclass-name>

# Check volume mount in pod
kubectl describe pod <pod-name> | grep -A 10 Mounts

# Check CSI drivers
kubectl get pods -n kube-system | grep csi
kubectl logs -n kube-system -l app=csi-driver
```
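
To spot stuck claims across the whole cluster, the PVC listing can be filtered by status. The sketch below assumes the column order printed by `kubectl get pvc --all-namespaces --no-headers` (namespace, name, status, ...); `unbound_pvcs` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical filter: read `kubectl get pvc --all-namespaces --no-headers` output
# on stdin and print "namespace/name status" for every claim that is not Bound.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}
```

Usage: `kubectl get pvc --all-namespaces --no-headers | unbound_pvcs`. Each line it prints is a claim worth inspecting with `kubectl describe pvc`.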

### Resource Issues

**📊 Common Resource Problems**

| Issue                          | Symptoms                           | Diagnosis                              | Solutions                                    |
| ------------------------------ | ---------------------------------- | -------------------------------------- | -------------------------------------------- |
| Pod OOMKilled                  | Out of memory error                | Check memory limits, usage             | Increase memory limits, optimize application |
| Pod throttled                  | CPU limit reached                  | Check CPU limits, usage                | Increase CPU limits, optimize performance    |
| Node pressure                  | Node not ready, resource exhausted | Check node conditions, resource usage  | Add resources, optimize workloads            |
| Cluster autoscaler not working | Nodes not scaling                  | Check autoscaler logs, IAM permissions | Fix IAM permissions, troubleshoot autoscaler |
| Resource quota exceeded        | Pod creation failed                | Check resource quotas, limits          | Increase quotas, optimize resource usage     |

**🔧 Resource Diagnosis**

```bash
# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Check node conditions
kubectl describe node <node-name>

# Check resource quotas
kubectl get resourcequota
kubectl describe resourcequota <quota-name>

# Check limit ranges
kubectl get limitrange
kubectl describe limitrange <limit-name>

# Check pod resource requests/limits
kubectl get pods -o jsonpath='{.items[*].spec.containers[*].resources}'
```
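
For the `OOMKilled` case in the table above, the last termination reason of each container is recorded in the pod status. The following sketch lists pods whose containers were last terminated with that reason. `oom_killed_pods` is a hypothetical helper name, and the namespace defaults to `default` as an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list pods in a namespace whose containers were
# last terminated with the reason OOMKilled.
oom_killed_pods() {
  local ns="${1:-default}"
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
    awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

Pods returned by `oom_killed_pods <namespace>` are candidates for a higher memory limit or for application-level memory tuning.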

### Authentication & Authorization

**🔐 Common Auth Issues**

| Issue                  | Symptoms              | Diagnosis                           | Solutions                               |
| ---------------------- | --------------------- | ----------------------------------- | --------------------------------------- |
| Unauthorized access    | 401 Unauthorized      | Check credentials, RBAC             | Verify credentials, fix RBAC config     |
| Forbidden access       | 403 Forbidden         | Check permissions, service accounts | Fix RBAC, check service accounts        |
| Service account issues | Pod cannot access API | Check service account, tokens       | Verify service account, recreate tokens |
| External auth failed   | Authentication error  | Check OIDC, LDAP config             | Fix external auth configuration         |
| Token expired          | Invalid token error   | Check token validity                | Refresh token, fix expiration policy    |

**🔧 Auth Diagnosis**

```bash
# Check current context
kubectl config current-context
kubectl config view

# Test authentication
kubectl auth can-i create pods
kubectl auth can-i --list --as=system:serviceaccount:default:my-sa

# Check RBAC roles
kubectl get roles
kubectl describe role <role-name>

# Check service accounts
kubectl get serviceaccounts
kubectl describe serviceaccount <sa-name>

# Check token
kubectl get secret <sa-token> -o yaml
```
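
When a pod cannot access the API, it often helps to check several verbs at once for its ServiceAccount. The sketch below loops over common verbs with `kubectl auth can-i`. The function name `sa_permissions` and the verb list are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print what a ServiceAccount may do with one resource type.
# `kubectl auth can-i` prints "yes" or "no" and exits non-zero on "no".
sa_permissions() {
  local ns="$1" sa="$2" resource="$3" verb answer
  for verb in get list watch create update delete; do
    answer="$(kubectl auth can-i "$verb" "$resource" -n "$ns" \
                --as="system:serviceaccount:${ns}:${sa}" 2>/dev/null)" || true
    echo "$verb $resource: ${answer:-unknown}"
  done
}
```

Usage: `sa_permissions default my-sa pods`. A `no` for a verb the application needs points to a missing Role or RoleBinding.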

***

## 🔧 **Debugging Techniques**

### Pod Debugging

**🔍 Step-by-Step Pod Debugging**

**Step 1: Basic Status Check**

```bash
# Check pod status
kubectl get pods -o wide
kubectl get pods --show-labels

# Check pod details
kubectl describe pod <pod-name>

# Check events
kubectl get events --sort-by=.metadata.creationTimestamp
```

**Step 2: Log Analysis**

```bash
# Get container logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> --previous  # Previous container

# Follow logs in real-time
kubectl logs <pod-name> -f

# Get logs from all containers in pod
kubectl logs <pod-name> --all-containers=true
```

**Step 3: Container Inspection**

```bash
# Exec into container
kubectl exec -it <pod-name> -- /bin/bash
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh

# Check processes in container
kubectl exec <pod-name> -- ps aux

# Check network in container
kubectl exec <pod-name> -- netstat -tulpn
kubectl exec <pod-name> -- ip addr show
```

**Step 4: Resource Analysis**

```bash
# Check resource usage
kubectl top pod <pod-name>
kubectl exec <pod-name> -- top

# Check environment variables
kubectl exec <pod-name> -- env

# Check mounted volumes
kubectl exec <pod-name> -- df -h
kubectl exec <pod-name> -- mount | grep volume
```

**🚨 Emergency Pod Debugging**

```bash
# Attach an ephemeral debug container to the running pod
kubectl debug <pod-name> -it --image=busybox --target=<container-name>

# Create a copy of the pod with a debug container added (--share-processes applies to copies)
kubectl debug <pod-name> -it --image=nicolaka/netshoot --copy-to=my-debug-pod --share-processes

# Debug node
kubectl debug node/<node-name> -it --image=nicolaka/netshoot
```

### Service Debugging

**🔧 Service Connectivity Debugging**

**Step 1: Service Configuration Check**

```bash
# Check service configuration
kubectl get svc
kubectl describe svc <service-name>

# Check service endpoints
kubectl get endpoints
kubectl get endpoints <service-name>

# Check service details
kubectl get svc <service-name> -o yaml
```

**Step 2: Endpoint Verification**

```bash
# Check if pods are selected by service
kubectl get pods -l <service-selector-label>

# Test service connectivity within cluster
kubectl run test-pod --image=busybox --rm -it --restart=Never -- wget -qO- <service-name>.<namespace>.svc.cluster.local

# Test service port
kubectl run test-pod --image=busybox --rm -it --restart=Never -- nc -zv <service-name> <port>
```
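
The `<service-selector-label>` placeholder above has to match the Service's `spec.selector` exactly. As a sketch, the selector can be read from the Service itself and passed to `kubectl get pods`. `pods_for_service` is a hypothetical helper, and the `default` namespace fallback is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a label selector from a Service's spec.selector
# and list the pods it matches.
pods_for_service() {
  local svc="$1" ns="${2:-default}" selector
  selector="$(kubectl get svc "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')"
  selector="${selector%,}"   # drop the trailing comma left by the template
  if [ -z "$selector" ]; then
    echo "service $svc has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$ns" -l "$selector" -o wide
}
```

If `pods_for_service <service-name>` returns no pods, the Service selector and the pod labels do not match, which explains an empty endpoints list.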

**Step 3: DNS Resolution Test**

```bash
# Test DNS resolution
kubectl run test-pod --image=busybox --rm -it --restart=Never -- nslookup <service-name>
kubectl run test-pod --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default.svc.cluster.local

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
```

**Step 4: External Access Test**

```bash
# For LoadBalancer services
kubectl get svc <service-name>
kubectl get svc <service-name> -o wide

# Test external access
curl http://<external-ip>:<port>
telnet <external-ip> <port>

# For Ingress
kubectl get ingress
kubectl describe ingress <ingress-name>
curl -H "Host: <host-name>" http://<ingress-ip>
```

### Network Debugging

**🌐 Network Troubleshooting Tools**

**Basic Network Tests**

```bash
# Test pod-to-pod connectivity
kubectl exec -it <pod-1> -- ping <pod-2-ip>

# Test service connectivity
kubectl exec -it <pod> -- curl <service-name>:<port>

# Test external connectivity
kubectl exec -it <pod> -- curl -I https://www.google.com

# Check DNS resolution
kubectl exec -it <pod> -- nslookup kubernetes.default.svc.cluster.local
```

**Advanced Network Debugging**

```bash
# Port forwarding for testing
kubectl port-forward <pod-name> 8080:80
kubectl port-forward service/<service-name> 8080:80

# Network policy debugging
kubectl exec -it <pod> -- telnet <target-ip> <port>
kubectl exec -it <pod> -- nc -zv <target-ip> <port>

# Check routing table
kubectl exec -it <pod> -- ip route show
kubectl exec -it <pod> -- route -n
```

**Network Tools Installation**

```bash
# Deploy network debugging tools
kubectl apply -f https://raw.githubusercontent.com/nicolaka/netshoot/master/k8s/netshoot.yaml

# Use netshoot pod for debugging
kubectl exec -it netshoot -- bash

# Inside netshoot container:
netshoot# ping google.com
netshoot# nslookup kubernetes.default.svc.cluster.local
netshoot# curl -I http://example.com
netshoot# tcpdump -i any
```

### Resource Monitoring

**📊 Real-time Resource Monitoring**

**Node Resource Analysis**

```bash
# Check node resource usage
kubectl top nodes
kubectl describe node <node-name>

# Check node conditions
kubectl get nodes -o wide
kubectl describe node <node-name> | grep -A 5 "Conditions"

# Check node capacity
kubectl describe node <node-name> | grep -A 10 "Capacity"
```

**Pod Resource Analysis**

```bash
# Check pod resource usage
kubectl top pods
kubectl top pods --all-namespaces

# Check pod resource requests/limits
kubectl get pods -o custom-columns=NAME:.metadata.name,REQ_CPU:.spec.containers[*].resources.requests.cpu,LIM_CPU:.spec.containers[*].resources.limits.cpu

# Check resource quotas
kubectl get resourcequota
kubectl describe resourcequota <quota-name>
```

**Advanced Resource Debugging**

```bash
# Check detailed metrics
kubectl proxy &
curl http://localhost:8001/apis/metrics.k8s.io/v1beta1/nodes
curl http://localhost:8001/apis/metrics.k8s.io/v1beta1/pods

# Check resource usage trends
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes/<node-name>"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>"
```

### Log Analysis

**📝 Log Collection and Analysis**

**System Logs**

```bash
# Get control plane component logs (static pods, for example on kubeadm clusters)
kubectl logs -n kube-system -l component=kube-apiserver
kubectl logs -n kube-system -l component=kube-controller-manager
kubectl logs -n kube-system -l component=kube-scheduler

# The kubelet runs as a node service, not a pod; read its logs on the node itself
journalctl -u kubelet

# Get CNI logs
kubectl logs -n kube-system -l k8s-app=canal
kubectl logs -n kube-system -l app=flannel
```

**Application Logs**

```bash
# Get application logs with labels
kubectl logs -l app=<app-label> --all-containers=true

# Get logs from previous container
kubectl logs <pod-name> --previous

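# Get logs from all pods in namespace (kubectl logs needs a pod or selector, so iterate)
kubectl get pods -n <namespace> -o name | xargs -I{} kubectl logs -n <namespace> {} --all-containers=true --tail=100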

# Export logs to file
kubectl logs <pod-name> > pod-logs.txt
```

**Log Aggregation**

```bash
# Using stern for multi-pod log monitoring
stern -n <namespace> <pod-name-pattern>
stern -l app=<app-label> --since 10m

# Using kubectl for log analysis
kubectl logs -f deployment/<deployment-name> --all-containers=true

# Filter logs with grep
kubectl logs <pod-name> | grep "ERROR"
kubectl logs <pod-name> | grep -i "exception\|error\|failed"
```
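
Beyond grepping for single patterns, a quick count per severity gives a sense of how noisy a pod is. The filter below is a minimal sketch that assumes the application writes the literal keywords `ERROR` and `WARN` in its log lines; `count_log_levels` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical filter: count log lines per severity keyword read from stdin.
count_log_levels() {
  awk '
    /ERROR/ { err++ }
    /WARN/  { warn++ }
    END     { printf "ERROR=%d WARN=%d TOTAL=%d\n", err, warn, NR }
  '
}
```

Usage: `kubectl logs <pod-name> --since=1h | count_log_levels`. Comparing the counts before and after a deployment is a cheap regression signal.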

***

## 📊 **Performance Analysis**

### CPU Performance

**🚀 CPU Optimization Analysis**

**CPU Usage Analysis**

```bash
# Check CPU usage by pods
kubectl top pods --sort-by=cpu

# Check CPU usage by nodes
kubectl top nodes --sort-by=cpu

# Detailed CPU metrics
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
```

**CPU Performance Debugging**

```bash
# Check CPU limits and requests
kubectl get pods -o custom-columns=NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu

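# Check CPU throttling (cgroup v1 paths)
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu,cpuacct/cpuacct.usage
# On cgroup v2 nodes the throttling counters are in /sys/fs/cgroup/cpu.stat
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu.stat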

# Check CPU-intensive processes
kubectl exec -it <pod-name> -- top
kubectl exec -it <pod-name> -- ps aux --sort=-%cpu
```
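
The raw `cpu.stat` counters are easier to interpret as a ratio. The sketch below computes the share of CFS periods in which the container was throttled, using the `nr_periods` and `nr_throttled` fields that `cpu.stat` exposes. `throttle_percent` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical filter: read cpu.stat on stdin and print the percentage of
# CFS periods in which the container was throttled.
throttle_percent() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else             print "0.0"
    }
  '
}
```

Usage: `kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat | throttle_percent`. A persistently high percentage suggests the CPU limit is too low for the workload.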

**CPU Optimization**

```yaml
# CPU optimization example
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    resources:
      requests:
        cpu: "100m"  # Minimum guaranteed CPU
      limits:
        cpu: "500m"  # Maximum CPU limit
    env:
    - name: GOMAXPROCS
      value: "2"  # Number of CPU cores to use
```

### Memory Performance

**💾 Memory Optimization Analysis**

**Memory Usage Analysis**

```bash
# Check memory usage by pods
kubectl top pods --sort-by=memory

# Check memory usage by nodes
kubectl top nodes --sort-by=memory

# Check memory usage trends
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods" | jq '.items[] | {pod: .metadata.name, memory: .containers[0].usage.memory}'
```

**Memory Performance Debugging**

```bash
# Check memory limits and requests
kubectl get pods -o custom-columns=NAME:.metadata.name,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory

# Check OOM events
kubectl describe pod <pod-name> | grep -i oom

# Check memory usage inside container
kubectl exec -it <pod-name> -- free -h
kubectl exec -it <pod-name> -- cat /proc/meminfo

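# Check memory leaks (cgroup v1 paths; sample repeatedly and watch for steady growth)
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes
# On cgroup v2 nodes current usage is in /sys/fs/cgroup/memory.current
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/memory.current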
```
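
The byte counts read from the cgroup files are easier to compare against the limit as a percentage. This is a small arithmetic sketch; `mem_percent` is a hypothetical helper, and both arguments are assumed to be plain byte values such as those read from `memory.usage_in_bytes` and `memory.limit_in_bytes`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: integer percentage of a memory limit currently in use.
mem_percent() {
  local usage_bytes="$1" limit_bytes="$2"
  if [ "$limit_bytes" -le 0 ]; then
    echo "no limit"
    return 1
  fi
  echo $(( usage_bytes * 100 / limit_bytes ))
}
```

For example, `mem_percent 402653184 536870912` (384Mi used of a 512Mi limit) prints `75`. A value that climbs steadily between samples without dropping is a typical sign of a leak.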

**Memory Optimization**

```yaml
# Memory optimization example
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    resources:
      requests:
        memory: "128Mi"  # Minimum guaranteed memory
      limits:
        memory: "512Mi"  # Maximum memory limit
    env:
    - name: JAVA_OPTS
      value: "-Xms128m -Xmx384m"  # JVM heap; kept below the 512Mi limit to leave room for non-heap memory
```

### Network Performance

**🌐 Network Performance Analysis**

**Network Latency Testing**

```bash
# Test latency between pods
kubectl exec -it <pod-1> -- ping -c 10 <pod-2-ip>

# Test DNS resolution latency (time is a shell keyword, so run it through sh -c)
kubectl exec -it <pod> -- sh -c 'time nslookup kubernetes.default.svc.cluster.local'

# Test service latency
kubectl exec -it <pod> -- sh -c 'time curl -s -o /dev/null <service-name>:<port>'
```

**Network Throughput Testing**

```bash
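# Deploy network performance testing tools (Fortio sample client from the Istio repository)
kubectl apply -f https://raw.githubusercontent.com/istio/istio/master/samples/httpbin/sample-client/fortio-deploy.yaml

# Run network tests (the sample creates a Deployment named fortio-deploy)
kubectl exec -it deploy/fortio-deploy -c fortio -- /usr/bin/fortio load -c 10 -qps 0 -n 30 -loglevel Warning http://<service-name>:<port>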

# Test bandwidth between pods
kubectl exec -it <pod-1> -- /usr/bin/iperf3 -c <pod-2-ip> -t 30
```

**Network Performance Optimization**

```yaml
# Network optimization example
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"
    env:
    - name: GOMAXPROCS
      value: "1"
    - name: GOGC
      value: "100"
```

### Storage Performance

**💾 Storage Performance Analysis**

**Disk I/O Testing**

```bash
# Test disk performance in pod
kubectl exec -it <pod-name> -- dd if=/dev/zero of=/tmp/testfile bs=1M count=100 oflag=direct
kubectl exec -it <pod-name> -- dd if=/tmp/testfile of=/dev/null bs=1M iflag=direct

# Check disk usage
kubectl exec -it <pod-name> -- df -h
kubectl exec -it <pod-name> -- du -sh /tmp

# Check I/O stats
kubectl exec -it <pod-name> -- iostat -x 1
kubectl exec -it <pod-name> -- cat /proc/diskstats
```

**Storage Performance Optimization**

```yaml
# Storage optimization example
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    volumeMounts:
    - name: cache-volume
      mountPath: /tmp/cache
  volumes:
  - name: cache-volume
    emptyDir:
      medium: Memory  # Use memory for cache
      sizeLimit: 100Mi
```

### Application Performance

**⚡ Application Performance Analysis**

**Application Profiling**

```bash
# Profile Go applications
kubectl exec -it <pod-name> -- go tool pprof http://localhost:6060/debug/pprof/profile

# Profile Java applications
kubectl exec -it <pod-name> -- jcmd <pid> GC.run
kubectl exec -it <pod-name> -- jcmd <pid> VM.native_memory summary

# Profile Node.js applications
kubectl exec -it <pod-name> -- node --inspect=0.0.0.0:9229 app.js
```

**Health Check Optimization**

```yaml
# Health check optimization example
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
```

***

## 🛠️ **Troubleshooting Tools**

### kubectl Debug Commands

**🔧 Essential kubectl Commands**

**Pod Debugging**

```bash
# Get detailed pod information
kubectl get pod <pod-name> -o yaml
kubectl describe pod <pod-name>

# Debug pod with ephemeral container
kubectl debug <pod-name> -it --image=nicolaka/netshoot --target=<container-name>

# Copy files from pod
kubectl cp <pod-name>:/path/to/file ./local-file

# Port forward for debugging
kubectl port-forward <pod-name> 8080:80
```

**Service Debugging**

```bash
# Get service information
kubectl get svc <service-name> -o yaml
kubectl describe svc <service-name>

# Check service endpoints
kubectl get endpoints <service-name>
kubectl describe endpoints <service-name>

# Test service connectivity
kubectl run test-pod --image=busybox --rm -it --restart=Never -- wget -qO- <service-name>:<port>
```

**Network Debugging**

```bash
# Check network policies
kubectl get networkpolicies --all-namespaces
kubectl describe networkpolicy <policy-name>

# Check DNS resolution
kubectl run test-pod --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default.svc.cluster.local

# Test network connectivity
kubectl run test-pod --image=nicolaka/netshoot --rm -it --restart=Never -- /bin/bash
```

### Network Tools

**🌐 Network Debugging Tools**

**Network Testing Tools**

```bash
# Deploy network debugging tools
kubectl apply -f https://raw.githubusercontent.com/nicolaka/netshoot/master/k8s/netshoot.yaml

# Use netshoot for network debugging
kubectl exec -it netshoot -- bash
netshoot# ping google.com
netshoot# nslookup kubernetes.default.svc.cluster.local
netshoot# curl -I http://example.com
netshoot# tcpdump -i any -n
netshoot# netstat -tulpn
netshoot# ss -tulpn
netshoot# ip addr show
netshoot# ip route show
```

**Network Policy Debugging**

```bash
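# Calico network debugging (assumes calicoctl is available in the target pod; the namespace depends on the install method)
kubectl exec -it -n kube-system <calico-node-pod> -- calicoctl get workloadendpoints
kubectl exec -it -n kube-system <calico-node-pod> -- calicoctl get networkpolicies

# Cilium network debugging (run inside a Cilium agent pod, not the operator)
kubectl exec -it -n kube-system ds/cilium -- cilium bpf policy get --all
kubectl exec -it -n kube-system ds/cilium -- cilium endpoint list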
```

### Performance Tools

**📊 Performance Monitoring Tools**

**Resource Monitoring**

```bash
# Deploy metrics server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Get detailed metrics
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"
```

**Performance Profiling Tools**

```bash
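# Deploy performance tools (kube-prometheus-stack already bundles Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
helm install grafana grafana/grafana

# Check performance metrics through the API server's service proxy
# (kubectl get --raw only accepts API server paths; adjust the service name and namespace to your installation)
kubectl get --raw "/api/v1/namespaces/monitoring/services/prometheus-server:9090/proxy/api/v1/query?query=container_cpu_usage_seconds_total"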
```

### Log Management Tools

**📝 Log Collection and Analysis Tools**

**Fluent Bit for Log Collection**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [OUTPUT]
        Name  forward
        Match *
        Host  logstash.logstash.svc.cluster.local
        Port  24224
```

**Stern for Log Management**

```bash
# Install stern
brew install stern  # macOS
go install github.com/stern/stern@latest  # Go

# Use stern for log monitoring (pod queries are regular expressions, so quote them)
stern -n production "app-.*"
stern -l app=my-app --since 10m
stern -n kube-system "."
```

### Third-Party Tools

**🛠️ Advanced Troubleshooting Tools**

**Lens for GUI Management**

```bash
# Install Lens
# Download from https://k8slens.dev/

# Features:
# - Visual cluster overview
# - Real-time monitoring
# - Log viewer
# - Resource usage charts
# - Workload management
```

**Octant for Web Dashboard**

```bash
# Install Octant
# Download from https://github.com/vmware-tanzu/octant

# Launch Octant
octant --kubeconfig ~/.kube/config
```

**K9s for Terminal UI**

```bash
# Install k9s
brew install k9s  # macOS
curl -sS https://webinstall.dev/k9s | bash  # Linux

# Use k9s
k9s
k9s -n production
```

***

## 📋 **Troubleshooting Matrix**

### Problem Identification

**🔍 Quick Problem Diagnosis**

| Category            | Key Commands                               | What to Look For               |
| ------------------- | ------------------------------------------ | ------------------------------ |
| **Pod Issues**      | `kubectl get pods`, `kubectl describe pod` | Status, events, restart count  |
| **Network Issues**  | `kubectl get svc`, `kubectl describe svc`  | Endpoints, service ports       |
| **Storage Issues**  | `kubectl get pvc`, `kubectl get pv`        | Status, capacity, access modes |
| **Resource Issues** | `kubectl top nodes`, `kubectl top pods`    | CPU/memory usage, pressure     |
| **Auth Issues**     | `kubectl auth can-i`, `kubectl get events` | Permissions, authentication    |

**🚨 Emergency Response Matrix**

| Issue Severity                 | Response Time | Escalation          | Resolution Target |
| ------------------------------ | ------------- | ------------------- | ----------------- |
| **Critical** (Production down) | < 5 minutes   | Engineering manager | < 30 minutes      |
| **High** (Service degraded)    | < 15 minutes  | Team lead           | < 2 hours         |
| **Medium** (Partial impact)    | < 1 hour      | Senior engineer     | < 8 hours         |
| **Low** (Minor issue)          | < 4 hours     | Junior engineer     | < 24 hours        |

### Solution Framework

**🔧 Systematic Troubleshooting Process**

**Step 1: Problem Definition**

```bash
# Define the problem clearly
echo "Problem: [Clear description of the issue]"
echo "Impact: [What systems/users are affected]"
echo "Scope: [How widespread is the issue]"
echo "Urgency: [How quickly this needs to be resolved]"
```

**Step 2: Information Gathering**

```bash
# Collect basic information
kubectl version
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods --all-namespaces
kubectl get events --sort-by=.metadata.creationTimestamp
```
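
During an incident it helps to capture this information once and share it, instead of re-running commands by hand. The sketch below saves the outputs into a directory. `collect_diagnostics`, the default directory name, and the file names are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: save the output of the information-gathering commands
# into one directory so it can be attached to an incident ticket.
collect_diagnostics() {
  local outdir="${1:-k8s-diagnostics}"
  mkdir -p "$outdir" || return 1
  # Each command is allowed to fail so one error does not stop the collection.
  kubectl version                   > "$outdir/version.txt"      2>&1 || true
  kubectl cluster-info              > "$outdir/cluster-info.txt" 2>&1 || true
  kubectl get nodes -o wide         > "$outdir/nodes.txt"        2>&1 || true
  kubectl get pods --all-namespaces > "$outdir/pods.txt"         2>&1 || true
  kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp \
                                    > "$outdir/events.txt"       2>&1 || true
  echo "diagnostics written to $outdir"
}
```

Usage: `collect_diagnostics "incident-$(date +%Y%m%d-%H%M)"`. Review the files for secrets before sharing them outside the team.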

**Step 3: Analysis**

```bash
# Analyze the collected data
# Look for patterns, correlations, and root causes
# Check recent changes, deployments, or configurations
```

**Step 4: Solution Implementation**

```bash
# Implement fix
# Test in non-production first
# Document the changes made
```

**Step 5: Verification**

```bash
# Verify the fix is working
# Monitor for recurrence
# Update documentation
```

### Escalation Path

**📞 Escalation Guidelines**

**Level 1: Self-Service**

* Use troubleshooting guides
* Check documentation
* Use available tools

**Level 2: Team Support**

* Consult with team members
* Review past incidents
* Use team knowledge base

**Level 3: Engineering Lead**

* Complex technical issues
* Multi-system problems
* Architecture concerns

**Level 4: Management**

* Business impact issues
* Resource constraints
* Policy decisions

### Prevention Strategies

**🛡️ Proactive Monitoring**

**Health Checks Implementation**

```yaml
# Comprehensive health checks
apiVersion: v1
kind: Pod
metadata:
  name: my-app  # example name
spec:
  containers:
  - name: app
    image: my-app:latest
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 30
```
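
The `startupProbe` settings determine how long a slow application may take to start before the kubelet restarts it. As a rough approximation that ignores probe timeouts, the budget is `initialDelaySeconds + periodSeconds × failureThreshold`. The helper name below is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: approximate worst-case time (in seconds) a startupProbe
# allows before the kubelet restarts the container.
startup_budget_seconds() {
  local initial_delay="$1" period="$2" failure_threshold="$3"
  echo $(( initial_delay + period * failure_threshold ))
}
```

With the values from the manifest above, `startup_budget_seconds 10 10 30` prints `310`, so the application has roughly five minutes to start. Liveness and readiness probes only begin once the startup probe has succeeded.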

**Resource Management**

```yaml
# Resource quotas and limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "10"
    pods: "20"
```

**Monitoring and Alerting**

```yaml
# Prometheus alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-alerts
  namespace: monitoring
spec:
  groups:
  - name: production
    rules:
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"

    - alert: HighMemoryUsage
      expr: container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage in {{ $labels.pod }}"
        description: "Memory usage is above 90% for pod {{ $labels.pod }}"
```

***

## 🎯 **Best Practices**

### **🔍 Troubleshooting Best Practices**

1. **Systematic Approach**
   * Define the problem clearly
   * Gather information systematically
   * Use structured troubleshooting process
2. **Documentation**
   * Document all findings
   * Track changes made
   * Maintain knowledge base
3. **Prevention**
   * Implement proactive monitoring
   * Set up proper health checks
   * Regular system reviews

### **🛠️ Tool Usage Best Practices**

1. **Choose Right Tools**
   * Use appropriate tools for the issue
   * Combine multiple tools for comprehensive analysis
   * Keep tools updated
2. **Efficiency**
   * Use aliases and scripts for common tasks
   * Automate repetitive troubleshooting steps
   * Build personal troubleshooting toolkit

### **📊 Performance Optimization**

1. **Monitoring**
   * Set up comprehensive monitoring
   * Define meaningful metrics
   * Establish alerting thresholds
2. **Optimization**
   * Regular performance reviews
   * Capacity planning
   * Resource optimization

***

## 🔗 **References**

### **📚 Official Documentation**

* [Kubernetes Troubleshooting](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
* [Debug Pods and ReplicationControllers](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-pods-replicationcontrollers/)
* [Troubleshooting Clusters](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)

### **🛠️ Troubleshooting Tools**

* [Stern Log Tool](https://github.com/stern/stern)
* [K9s Terminal UI](https://github.com/derailed/k9s)
* [Lens IDE](https://k8slens.dev/)
* [Netshoot Network Tools](https://github.com/nicolaka/netshoot)

### **📖 Learning Resources**

* [Kubernetes Debugging Guide](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/)
* [Performance Best Practices](https://kubernetes.io/docs/tasks/debug-application-cluster/resource-usage-monitoring/)
* [Network Troubleshooting](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/)

***

*🔧 Troubleshooting is an essential skill for maintaining a reliable Kubernetes cluster.*
