# Disaster Recovery

## Disaster Recovery Architecture Overview

### DR Strategy Components

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Backup &      │    │   High           │    │   Recovery      │
│   Restore       │◀───│   Availability   │◀───│   Testing       │
│                 │    │                  │    │                 │
│ • Velero        │    │ • Multi-Region   │    │ • DR Drills     │
│ • Snapshots     │    │ • Active-Passive │    │ • Chaos Eng     │
│ • Offsite       │    │ • Failover       │    │ • RTO/RPO       │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

### RTO/RPO Requirements

#### Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-policy
  namespace: disaster-recovery
data:
  dr-requirements.yaml: |
    service_tiers:
      critical:
        rto: 15 minutes      # Maximum downtime acceptable
        rpo: 5 minutes       # Maximum data loss acceptable
        backup_frequency: "5m"
        replication: "active-active"
        failover: "automatic"
        regions: ["us-west-2", "us-east-1", "eu-west-1"]

      important:
        rto: 1 hour
        rpo: 15 minutes
        backup_frequency: "15m"
        replication: "active-passive"
        failover: "manual"
        regions: ["us-west-2", "us-east-1"]

      standard:
        rto: 4 hours
        rpo: 1 hour
        backup_frequency: "1h"
        replication: "periodic"
        failover: "manual"
        regions: ["us-west-2"]

      archival:
        rto: 24 hours
        rpo: 24 hours
        backup_frequency: "24h"
        replication: "none"
        failover: "manual"
        regions: ["us-west-2"]

    backup_retention:
      daily: 30
      weekly: 12
      monthly: 12
      yearly: 7

    testing_schedule:
      dr_drills: "quarterly"
      restore_tests: "monthly"
      failover_tests: "semi-annually"
```
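The tier table implies a simple invariant: a tier's backup interval must not exceed its RPO, since the worst-case data loss is one full backup interval. A quick sketch of that check (hypothetical helper names; the values are copied from the ConfigMap above):

```python
# Sanity-check that each tier's backup frequency can meet its RPO.
# Tier values mirror the dr-policy ConfigMap; the helpers are
# illustrative, not part of any Velero or Kubernetes API.

def to_minutes(duration: str) -> int:
    """Convert strings like '5m', '1h', '24h' to minutes."""
    value, unit = int(duration[:-1]), duration[-1]
    return value * {"m": 1, "h": 60, "d": 1440}[unit]

TIERS = {
    "critical":  {"rpo": "5m",  "backup_frequency": "5m"},
    "important": {"rpo": "15m", "backup_frequency": "15m"},
    "standard":  {"rpo": "1h",  "backup_frequency": "1h"},
    "archival":  {"rpo": "24h", "backup_frequency": "24h"},
}

def meets_rpo(tier: dict) -> bool:
    # Worst-case data loss equals the backup interval, so the
    # interval must not exceed the RPO.
    return to_minutes(tier["backup_frequency"]) <= to_minutes(tier["rpo"])

violations = [name for name, t in TIERS.items() if not meets_rpo(t)]
print(violations)  # → []
```

A check like this fits naturally into CI for the dr-policy ConfigMap, so a tier edit that silently breaks its own RPO is caught before rollout.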

## Backup Strategy Implementation

### Velero Advanced Configuration

#### Multi-Cloud Velero Setup

```yaml
# Velero server configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: velero
  namespace: velero
spec:
  replicas: 1
  selector:
    matchLabels:
      app: velero
  template:
    metadata:
      labels:
        app: velero
    spec:
      serviceAccountName: velero
      containers:
      - name: velero
        image: velero/velero:v1.11.1
        command:
        - /velero
        args:
        - server
        - --log-level=info
        - --backup-sync-period=30m
        - --default-restore-timeout=30m
        - --uploader-type=restic
        - --fs-backup-timeout=4h  # supersedes the deprecated --restic-timeout
        - --default-volumes-to-fs-backup=true
        env:
        - name: VELERO_SCRATCH_DIR
          value: /scratch
        - name: VELERO_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: LD_LIBRARY_PATH
          value: /plugins
        volumeMounts:
        - name: plugins
          mountPath: /plugins
        - name: scratch
          mountPath: /scratch
        - name: cloud-credentials
          mountPath: /credentials
        resources:
          requests:
            cpu: 500m
            memory: 256Mi
          limits:
            cpu: 2000m
            memory: 2Gi
      volumes:
      - name: plugins
        emptyDir: {}
      - name: scratch
        emptyDir:
          sizeLimit: 2Gi
      - name: cloud-credentials
        secret:
          secretName: cloud-credentials
      nodeSelector:
        node-type: backup
      tolerations:
      - key: "backup"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

---
# Multi-cloud backup storage locations
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-primary
  namespace: velero
  annotations:
    velero.io/storage-type: "aws"
spec:
  provider: aws
  objectStorage:
    bucket: velero-backups-primary
    prefix: production
  config:
    region: us-west-2
    s3ForcePathStyle: "false"
    s3Url: "https://s3.us-west-2.amazonaws.com"
    kmsKeyId: "arn:aws:kms:us-west-2:ACCOUNT_ID:key/KMS_KEY_ID"
    serverSideEncryption: "aws:kms"
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: gcp-secondary
  namespace: velero
  annotations:
    velero.io/storage-type: "gcp"
spec:
  provider: gcp
  objectStorage:
    bucket: velero-backups-secondary
    prefix: production
  config:
    project: my-gcp-project
    location: us-central1
    serviceAccount: gcp-service-account.json
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: azure-tertiary
  namespace: velero
  annotations:
    velero.io/storage-type: "azure"
spec:
  provider: azure
  objectStorage:
    bucket: velero-backups-tertiary
    prefix: production
  config:
    resourceGroup: velero-rg
    storageAccount: velerosa
    subscriptionId: AZURE_SUBSCRIPTION_ID
    environment: AzurePublicCloud
```

#### Advanced Backup Schedules

```yaml
# Critical services - frequent backups
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: critical-services-backup
  namespace: velero
  labels:
    tier: critical
spec:
  schedule: "*/5 * * * *"  # Every 5 minutes
  useOwnerReferencesInBackup: false
  template:
    includedNamespaces:
    - production
    labelSelector:
      matchLabels:
        backup-tier: critical
    # A backup writes to a single BackupStorageLocation; copy to
    # gcp-secondary via a parallel schedule or bucket-level replication.
    storageLocation: aws-primary
    ttl: "720h"  # 30 days
    defaultVolumesToFsBackup: true  # renamed from defaultVolumesToRestic in v1.10
    hooks:
      resources:
      - name: pre-backup-snapshot
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            backup-hook: pre-snapshot
        pre:
        - exec:
            container: app
            command:
            - /bin/bash
            - -c
            - |
              echo "Taking application snapshot before backup"
              /app/scripts/pre-backup.sh
            onError: Fail
            timeout: 5m
      - name: post-backup-verify
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            backup-hook: post-verify
        post:
        - exec:
            container: app
            command:
            - /bin/bash
            - -c
            - |
              echo "Verifying backup completion"
              /app/scripts/post-backup.sh
            onError: Continue
            timeout: 5m

---
# Important services - regular backups
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: important-services-backup
  namespace: velero
  labels:
    tier: important
spec:
  schedule: "*/15 * * * *"  # Every 15 minutes
  useOwnerReferencesInBackup: false
  template:
    includedNamespaces:
    - production
    labelSelector:
      matchLabels:
        backup-tier: important
    storageLocation: aws-primary  # one BackupStorageLocation per backup
    ttl: "2160h"  # 90 days
    defaultVolumesToFsBackup: true
    itemOperationTimeout: 10m
---
# Full cluster backup - daily
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: full-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  useOwnerReferencesInBackup: false
  template:
    includedNamespaces:
    - "*"
    excludedNamespaces:
    - kube-system
    - kube-public
    - kube-node-lease
    storageLocation: aws-primary  # one BackupStorageLocation per backup
    ttl: "8760h"  # 1 year
    defaultVolumesToFsBackup: false
    volumeSnapshotLocations:
    - aws-primary
    - gcp-secondary
---
# Weekly long-term archival backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: archival-backup
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # Sunday at 3 AM
  useOwnerReferencesInBackup: false
  template:
    includedNamespaces:
    - production
    storageLocation: aws-primary  # one BackupStorageLocation per backup
    ttl: "43800h"  # 5 years
    defaultVolumesToFsBackup: true
```
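One practical consequence of these schedules is object-storage growth: the steady-state number of backups a schedule retains is roughly its TTL divided by its backup interval. A rough sketch of that arithmetic (illustrative helper, not a Velero API):

```python
# Steady-state backup counts per schedule, for estimating object-storage
# growth. Figures follow the schedules above (every 5 minutes with a
# 30-day TTL, and so on).

def backups_retained(interval_minutes: int, ttl_hours: int) -> int:
    """Backups taken within one TTL window (the steady-state count)."""
    return (ttl_hours * 60) // interval_minutes

critical = backups_retained(5, 720)           # every 5m, kept 30 days
important = backups_retained(15, 2160)        # every 15m, kept 90 days
daily_full = backups_retained(24 * 60, 8760)  # daily, kept 1 year

print(critical, important, daily_full)  # → 8640 8640 365
```

The 5-minute critical schedule alone holds more than 8,000 concurrent backups, which is worth weighing against Velero's backup-sync overhead and the bucket's lifecycle rules.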

## High Availability Architecture

### Multi-Region Kubernetes Setup

#### Primary-Secondary Cluster Configuration

```yaml
# Primary cluster configuration (us-west-2)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-config
  namespace: kube-system
data:
  cluster-role: "primary"
  region: "us-west-2"
  cluster-name: "production-primary"
  disaster-recovery: "enabled"
  failover-mode: "automatic"
  sync-frequency: "30s"
---
# Secondary cluster configuration (us-east-1)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-config
  namespace: kube-system
data:
  cluster-role: "secondary"
  region: "us-east-1"
  cluster-name: "production-secondary"
  disaster-recovery: "enabled"
  failover-mode: "manual"
  sync-frequency: "30s"
---
# Cluster synchronization service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-sync
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cluster-sync
  template:
    metadata:
      labels:
        app: cluster-sync
    spec:
      containers:
      - name: sync
        image: cluster-sync:v1.0.0
        env:
        - name: PRIMARY_CLUSTER
          value: "production-primary"
        - name: SECONDARY_CLUSTER
          value: "production-secondary"
        - name: SYNC_FREQUENCY
          value: "30s"
        - name: KUBECONFIG_PRIMARY
          valueFrom:
            secretKeyRef:
              name: cluster-credentials
              key: primary-kubeconfig
        - name: KUBECONFIG_SECONDARY
          valueFrom:
            secretKeyRef:
              name: cluster-credentials
              key: secondary-kubeconfig
        command:
        - /bin/sh
        - -c
        - |
          while true; do
            # Sync configurations
            kubectl --kubeconfig=$KUBECONFIG_PRIMARY get configmaps -l sync=true -o yaml | \
              kubectl --kubeconfig=$KUBECONFIG_SECONDARY apply -f -

            # Sync secrets
            kubectl --kubeconfig=$KUBECONFIG_PRIMARY get secrets -l sync=true -o yaml | \
              kubectl --kubeconfig=$KUBECONFIG_SECONDARY apply -f -

            # Deployments sync with reduced replicas
            kubectl --kubeconfig=$KUBECONFIG_PRIMARY get deployments -l ha=true -o yaml | \
              sed 's/replicas: [0-9]*/replicas: 1/' | \
              kubectl --kubeconfig=$KUBECONFIG_SECONDARY apply -f -

            sleep $SYNC_FREQUENCY
          done
        volumeMounts:
        - name: kubeconfig
          mountPath: /etc/kubeconfig
          readOnly: true
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
      volumes:
      - name: kubeconfig
        secret:
          secretName: cluster-credentials
```
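The sync loop above pipes live objects from the primary straight into `kubectl apply` on the secondary. In practice, server-managed metadata (resourceVersion, uid, managedFields) should be stripped first or applies can fail with conflicts. A minimal sanitizer sketch (hypothetical helper, not part of kubectl):

```python
# Strip server-managed fields from a live manifest before re-applying
# it into another cluster. Works on the dict form of a manifest (e.g.
# parsed from `kubectl get -o json`).

SERVER_MANAGED = {"resourceVersion", "uid", "creationTimestamp",
                  "managedFields", "generation", "selfLink"}

def sanitize(manifest: dict) -> dict:
    """Return a copy of the manifest safe to apply into another cluster."""
    clean = {k: v for k, v in manifest.items() if k != "status"}
    meta = dict(clean.get("metadata", {}))
    for field in SERVER_MANAGED:
        meta.pop(field, None)
    clean["metadata"] = meta
    return clean

live = {
    "apiVersion": "v1", "kind": "ConfigMap",
    "metadata": {"name": "app-config", "uid": "abc-123",
                 "resourceVersion": "98765"},
    "data": {"key": "value"},
    "status": {},
}
print(sanitize(live)["metadata"])  # → {'name': 'app-config'}
```

A dedicated sync tool (or a filter between the two kubectl invocations in the loop above) would apply exactly this transformation per object.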

### Application-Level High Availability

#### Geo-Distributed Application

```yaml
# Application Service (per-cluster internal load balancers; global traffic steering is handled by DNS failover)
apiVersion: v1
kind: Service
metadata:
  name: global-app-service
  namespace: production
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
    networking.gke.io/load-balancer-type: "Internal"
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  selector:
    app: geo-distributed-app
  ports:
  - port: 80
    targetPort: 8080
    name: http
  type: LoadBalancer
---
# Regional deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: geo-app-us-west
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: geo-distributed-app
      region: us-west
  template:
    metadata:
      labels:
        app: geo-distributed-app
        region: us-west
        ha: "true"
        sync: "true"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - us-west-2a
                - us-west-2b
                - us-west-2c
      containers:
      - name: app
        image: geo-app:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: REGION
          value: "us-west"
        - name: CLUSTER_ROLE
          value: "primary"
        - name: DATABASE_ENDPOINT
          value: "postgres-primary.production.svc.cluster.local"
        - name: REDIS_ENDPOINT
          value: "redis-primary.production.svc.cluster.local"
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: geo-app-us-east
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: geo-distributed-app
      region: us-east
  template:
    metadata:
      labels:
        app: geo-distributed-app
        region: us-east
        ha: "true"
        sync: "true"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - us-east-1a
                - us-east-1b
                - us-east-1c
      containers:
      - name: app
        image: geo-app:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: REGION
          value: "us-east"
        - name: CLUSTER_ROLE
          value: "secondary"
        - name: DATABASE_ENDPOINT
          value: "postgres-replica.production.svc.cluster.local"
        - name: REDIS_ENDPOINT
          value: "redis-replica.production.svc.cluster.local"
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
```
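A sizing note for this two-region layout: if us-west fails, us-east must absorb all traffic, so its post-failover replica count should be derived from total load rather than mirrored from steady state. A back-of-envelope sketch (the RPS figures are illustrative assumptions, not measured values):

```python
# Replicas one surviving region needs to carry all traffic with headroom.
import math

def surviving_replicas(total_rps: float, rps_per_replica: float,
                       headroom: float = 0.2) -> int:
    """Replica count for a single region serving the full load,
    with a safety margin (default 20%)."""
    return math.ceil(total_rps * (1 + headroom) / rps_per_replica)

# e.g. 600 RPS total, each replica comfortably handles 150 RPS
print(surviving_replicas(600, 150))  # → 5
```

Under these example numbers the secondary's steady-state 3 replicas are not enough on their own, which is why the failover automation below scales deployments up as part of the cutover.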

## Failover Automation

### Automated Failover Controller

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: failover-controller
  namespace: disaster-recovery
spec:
  replicas: 1
  selector:
    matchLabels:
      app: failover-controller
  template:
    metadata:
      labels:
        app: failover-controller
    spec:
      serviceAccountName: failover-controller
      containers:
      - name: controller
        image: failover-controller:v1.0.0
        env:
        - name: PRIMARY_CLUSTER
          value: "production-primary"
        - name: SECONDARY_CLUSTERS
          value: "production-secondary,production-tertiary"
        - name: HEALTH_CHECK_INTERVAL
          value: "30s"
        - name: FAILOVER_THRESHOLD
          value: "3"
        - name: SLACK_WEBHOOK
          valueFrom:
            secretKeyRef:
              name: failover-config
              key: slack-webhook
        - name: PAGERDUTY_KEY
          valueFrom:
            secretKeyRef:
              name: failover-config
              key: pagerduty-key
        command:
        - /bin/bash
        - -c
        - |

          HEALTH_CHECK_INTERVAL=${HEALTH_CHECK_INTERVAL:-30s}
          FAILOVER_THRESHOLD=${FAILOVER_THRESHOLD:-3}
          FAILED_COUNT=0

          while true; do
            echo "Performing health check at $(date)"

            # Check primary cluster health
            PRIMARY_HEALTH=$(kubectl --kubeconfig=/etc/kubeconfig/primary get pods -n production -l ha=true --field-selector=status.phase!=Running --no-headers | wc -l)

            if [ $PRIMARY_HEALTH -eq 0 ]; then
              echo "Primary cluster is healthy"
              FAILED_COUNT=0
            else
              echo "Primary cluster has $PRIMARY_HEALTH unhealthy pods"
              FAILED_COUNT=$((FAILED_COUNT + 1))

              if [ $FAILED_COUNT -ge $FAILOVER_THRESHOLD ]; then
                echo "Initiating failover to secondary cluster"

                # Send alert
                curl -X POST -H 'Content-type: application/json' \
                  --data '{"text":"🚨 Kubernetes Failover: Primary cluster unhealthy, initiating failover"}' \
                  $SLACK_WEBHOOK

                # Trigger failover
                kubectl apply -f /etc/failover/failover.yaml

                # Scale down primary services
                kubectl --kubeconfig=/etc/kubeconfig/primary scale deployment --all --replicas=0 -n production

                # Scale up secondary services
                kubectl --kubeconfig=/etc/kubeconfig/secondary scale deployment --all --replicas=3 -n production

                # Update DNS records
                python3 /etc/failover/dns-update.py --action=failover --cluster=production-secondary

                # Send completion alert
                curl -X POST -H 'Content-type: application/json' \
                  --data '{"text":"✅ Kubernetes Failover: Successfully failed over to secondary cluster"}' \
                  $SLACK_WEBHOOK

                # Wait for manual intervention or recovery
                sleep 3600
              fi
            fi

            # Check for manual failover trigger
            if kubectl get configmap failover-trigger -n disaster-recovery -o yaml | grep -q "trigger: true"; then
              echo "Manual failover triggered"
              kubectl apply -f /etc/failover/failover.yaml
              kubectl delete configmap failover-trigger -n disaster-recovery
            fi

            sleep $HEALTH_CHECK_INTERVAL
          done
        volumeMounts:
        - name: kubeconfig
          mountPath: /etc/kubeconfig
          readOnly: true
        - name: failover-config
          mountPath: /etc/failover
          readOnly: true
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
      volumes:
      - name: kubeconfig
        secret:
          secretName: failover-kubeconfig
      - name: failover-config
        configMap:
          name: failover-config

---
# Failover configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: failover-config
  namespace: disaster-recovery
data:
  failover.yaml: |
    apiVersion: v1
    kind: Service
    metadata:
      name: app-service
      namespace: production
      annotations:
        external-dns.alpha.kubernetes.io/hostname: app.example.com
        external-dns.alpha.kubernetes.io/ttl: "60"
    spec:
      type: ExternalName
      externalName: secondary-cluster.example.com
      sessionAffinity: None
  dns-update.py: |
    #!/usr/bin/env python3

    import argparse
    import os

    import boto3

    def update_dns(action, target_cluster):
        """Update DNS records for failover"""
        route53_client = boto3.client('route53')

        hosted_zone_id = os.environ['HOSTED_ZONE_ID']
        domain_name = os.environ['DOMAIN_NAME']

        if action == 'failover':
            # Update to secondary cluster
            target_ip = get_cluster_ip(target_cluster)
            target_health_check = get_cluster_health_check(target_cluster)
        else:
            # Update to primary cluster
            target_ip = get_cluster_ip('primary')
            target_health_check = get_cluster_health_check('primary')

        # Update A record
        response = route53_client.change_resource_record_sets(
            HostedZoneId=hosted_zone_id,
            ChangeBatch={
                'Changes': [
                    {
                        'Action': 'UPSERT',
                        'ResourceRecordSet': {
                            'Name': domain_name,
                            'Type': 'A',
                            'TTL': 60,
                            'ResourceRecords': [
                                {
                                    'Value': target_ip
                                }
                            ]
                        }
                    }
                ]
            }
        )

        print(f"DNS update initiated: {response['ChangeInfo']['Id']}")

    def get_cluster_ip(cluster_name):
        """Get cluster load balancer IP"""
        # Implementation to get cluster IP
        pass

    def get_cluster_health_check(cluster_name):
        """Get cluster health check configuration"""
        # Implementation to get health check
        pass

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--action', required=True)
        parser.add_argument('--cluster', required=True)
        args = parser.parse_args()

        update_dns(args.action, args.cluster)
```
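The controller's core decision is a consecutive-failure counter: any healthy check resets it, and failover fires once `FAILOVER_THRESHOLD` straight failures accumulate. The same logic as a testable function (a sketch mirroring the shell loop, not the controller's actual code):

```python
# Consecutive-failure failover trigger, as implemented by the loop above.

def should_failover(checks: list, threshold: int = 3) -> bool:
    """checks: health-check history, True = healthy. Returns True once
    `threshold` consecutive failures are observed; any healthy check
    resets the streak."""
    streak = 0
    for healthy in checks:
        streak = 0 if healthy else streak + 1
        if streak >= threshold:
            return True
    return False

print(should_failover([False, False, True, False, False]))  # → False
print(should_failover([True, False, False, False]))         # → True
```

The reset-on-success behavior is what keeps a single flapping health check from triggering a full regional cutover; only a sustained outage crosses the threshold.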

## Chaos Engineering

### Chaos Monkey Implementation

#### Chaos Monkey for Kubernetes

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-monkey
  namespace: chaos
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-monkey
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "patch", "scale"]
- apiGroups: ["batch"]
  resources: ["jobs", "cronjobs"]
  verbs: ["get", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaos-monkey
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaos-monkey
subjects:
- kind: ServiceAccount
  name: chaos-monkey
  namespace: chaos
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-monkey
  namespace: chaos
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaos-monkey
  template:
    metadata:
      labels:
        app: chaos-monkey
    spec:
      serviceAccountName: chaos-monkey
      containers:
      - name: chaos-monkey
        image: chaos-monkey:v1.0.0
        env:
        - name: CHAOS_INTERVAL
          value: "10m"
        - name: CHAOS_DURATION
          value: "5m"
        - name: KILL_PERCENTAGE
          value: "10"
        - name: ENABLED_NAMESPACES
          value: "chaos-testing,staging"
        - name: EXCLUDE_LABELS
          value: "chaos.excluded=true"
        - name: SLACK_WEBHOOK
          valueFrom:
            secretKeyRef:
              name: chaos-config
              key: slack-webhook
        command:
        - /bin/bash   # the script relies on bash arrays and shuf
        - -c
        - |
          CHAOS_INTERVAL=${CHAOS_INTERVAL:-10m}
          KILL_PERCENTAGE=${KILL_PERCENTAGE:-10}
          # ENABLED_NAMESPACES arrives comma-separated; split it into a bash array
          ENABLED_NAMESPACES=(${ENABLED_NAMESPACES//,/ })

          echo "Chaos Monkey started - Interval: $CHAOS_INTERVAL, Kill Percentage: $KILL_PERCENTAGE%"

          while true; do
            echo "Chaos iteration starting at $(date)"

            for namespace in "${ENABLED_NAMESPACES[@]}"; do
              # Get all eligible pods
              eligible_pods=$(kubectl get pods -n $namespace \
                --no-headers \
                --field-selector=status.phase=Running \
                -l 'chaos.enabled=true,chaos.excluded!=true' \
                | awk '{print $1}')

              if [ -n "$eligible_pods" ]; then
                # Calculate number of pods to kill
                total_pods=$(echo "$eligible_pods" | wc -l)
                pods_to_kill=$((total_pods * KILL_PERCENTAGE / 100))

                if [ $pods_to_kill -eq 0 ]; then
                  pods_to_kill=1
                fi

                echo "Namespace $namespace: Killing $pods_to_kill out of $total_pods pods"

                # Randomly select pods to kill
                pods_to_kill_list=$(echo "$eligible_pods" | shuf -n $pods_to_kill)

                # Kill selected pods
                for pod in $pods_to_kill_list; do
                  echo "Killing pod: $namespace/$pod"

                  # Send alert before killing
                  curl -X POST -H 'Content-type: application/json' \
                    --data "{\"text\":\"🐒 Chaos Monkey: Killing pod $namespace/$pod\"}" \
                    $SLACK_WEBHOOK

                  # Delete pod
                  kubectl delete pod $pod -n $namespace --grace-period=0 --force

                  # Record chaos event
                  kubectl create configmap chaos-event-$pod \
                    --from-literal=namespace=$namespace \
                    --from-literal=pod=$pod \
                    --from-literal=timestamp=$(date -Iseconds) \
                    --from-literal=reason="chaos-monkey" \
                    -n chaos \
                    --dry-run=client -o yaml | kubectl apply -f -
                done
              else
                echo "No eligible pods found in namespace $namespace"
              fi
            done

            echo "Chaos iteration completed. Next iteration in $CHAOS_INTERVAL"
            sleep $CHAOS_INTERVAL
          done
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
---
# Chaos testing namespace
apiVersion: v1
kind: Namespace
metadata:
  name: chaos-testing
  labels:
    chaos.enabled: "true"
    chaos.environment: "testing"
---
# Chaos testing deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resilient-app
  namespace: chaos-testing
  labels:
    app: resilient-app
    chaos.enabled: "true"
    chaos.resilience: "high"
spec:
  replicas: 5
  selector:
    matchLabels:
      app: resilient-app
  template:
    metadata:
      labels:
        app: resilient-app
        chaos.enabled: "true"
    spec:
      containers:
      - name: app
        image: resilient-app:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: CHAOS_MODE
          value: "enabled"
        - name: GRACEFUL_SHUTDOWN_TIMEOUT
          value: "30s"
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                echo "Graceful shutdown initiated"
                sleep 10
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
      terminationGracePeriodSeconds: 30
```
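The monkey sizes each strike as `KILL_PERCENTAGE` of the eligible pods, never fewer than one. The selection arithmetic from the shell loop, sketched in Python for clarity (a standalone illustration, not the controller's code):

```python
# Victim selection: kill KILL_PERCENTAGE of eligible pods, rounded
# down, but always at least one when any pods are eligible.
import random

def pick_victims(pods, kill_percentage, rng=None):
    """Randomly choose the pods to delete this iteration."""
    if not pods:
        return []
    count = max(1, len(pods) * kill_percentage // 100)
    return (rng or random).sample(pods, count)

pods = [f"app-{i}" for i in range(20)]
victims = pick_victims(pods, 10, random.Random(42))
print(len(victims))  # → 2
```

The `max(1, ...)` floor matters: with a small deployment (say 5 replicas at 10%), integer division yields zero, and without the floor the experiment would silently do nothing.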

## Recovery Testing

### Automated Recovery Testing

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: recovery-test
  namespace: disaster-recovery
spec:
  schedule: "0 2 * * 1"  # Every Monday at 2 AM
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: recovery-test
          restartPolicy: OnFailure
          containers:
          - name: recovery-test
            image: recovery-test:v1.0.0
            env:
            - name: TEST_NAMESPACE
              value: "recovery-testing"
            - name: BACKUP_NAME_PREFIX
              value: "recovery-test-backup"
            - name: SLACK_WEBHOOK
              valueFrom:
                secretKeyRef:
                  name: recovery-config
                  key: slack-webhook
            - name: EMAIL_RECIPIENT
              valueFrom:
                secretKeyRef:
                  name: recovery-config
                  key: email-recipient
            command:
            - /bin/sh
            - -c
            - |
              #!/bin/bash

              echo "Starting recovery test at $(date)"

              # Send start notification
              curl -X POST -H 'Content-type: application/json' \
                --data "{\"text\":\"🧪 Recovery Test: Starting automated recovery test\"}" \
                $SLACK_WEBHOOK

              # Create test namespace
              kubectl create namespace $TEST_NAMESPACE --dry-run=client -o yaml | kubectl apply -f -

              # Create test application
              cat <<EOF | kubectl apply -f -
              apiVersion: apps/v1
              kind: Deployment
              metadata:
                name: test-app
                namespace: $TEST_NAMESPACE
              spec:
                replicas: 2
                selector:
                  matchLabels:
                    app: test-app
                template:
                  metadata:
                    labels:
                      app: test-app
                  spec:
                    containers:
                    - name: app
                      image: nginx:1.21
                      ports:
                      - containerPort: 80
                      resources:
                        requests:
                          cpu: 50m
                          memory: 64Mi
                        limits:
                          cpu: 200m
                          memory: 256Mi
              ---
              apiVersion: v1
              kind: Service
              metadata:
                name: test-service
                namespace: $TEST_NAMESPACE
              spec:
                selector:
                  app: test-app
                ports:
                - port: 80
                  targetPort: 80
              EOF

              # Wait for deployment to be ready
              kubectl wait --for=condition=available deployment/test-app -n $TEST_NAMESPACE --timeout=300s

              # Create test data
              kubectl exec -n $TEST_NAMESPACE deployment/test-app -- sh -c "echo 'test data' > /usr/share/nginx/html/test.txt"

              # Take backup (--include-namespaces selects what to back up; the
              # CLI's --namespace flag targets the Velero install namespace, so
              # it is omitted here)
              BACKUP_NAME="$BACKUP_NAME_PREFIX-$(date +%Y%m%d-%H%M%S)"
              velero backup create $BACKUP_NAME \
                --include-namespaces $TEST_NAMESPACE \
                --wait

              # Simulate disaster by deleting namespace
              kubectl delete namespace $TEST_NAMESPACE --wait=false

              # Wait for deletion
              sleep 30

              # Restore from backup
              velero restore create --from-backup $BACKUP_NAME \
                --namespace-mappings $TEST_NAMESPACE:$TEST_NAMESPACE-restored \
                --wait

              # Verify restore
              RESTORED_PODS=$(kubectl get pods -n $TEST_NAMESPACE-restored --no-headers | wc -l)
              if [ $RESTORED_PODS -eq 2 ]; then
                echo "✅ Restore successful: All pods restored"

                # Test application functionality (the Service is ClusterIP, so
                # read .spec.clusterIP and curl it from this in-cluster job pod)
                SERVICE_IP=$(kubectl get svc test-service -n $TEST_NAMESPACE-restored -o jsonpath='{.spec.clusterIP}')
                if [ -n "$SERVICE_IP" ]; then
                  TEST_RESULT=$(curl -s http://$SERVICE_IP/test.txt)
                  if [ "$TEST_RESULT" = "test data" ]; then
                    echo "✅ Application functionality verified"
                    TEST_STATUS="SUCCESS"
                  else
                    echo "❌ Application functionality test failed"
                    TEST_STATUS="PARTIAL"
                  fi
                else
                  echo "❌ Service not accessible"
                  TEST_STATUS="PARTIAL"
                fi
              else
                echo "❌ Restore failed: Only $RESTORED_PODS pods restored"
                TEST_STATUS="FAILED"
              fi

              # Cleanup
              kubectl delete namespace $TEST_NAMESPACE-restored
              velero backup delete $BACKUP_NAME
              velero restore delete --from-backup $BACKUP_NAME

              # Send results notification
              if [ "$TEST_STATUS" = "SUCCESS" ]; then
                EMOJI="✅"
                MESSAGE="Recovery Test: SUCCESS - All systems recovered correctly"
              elif [ "$TEST_STATUS" = "PARTIAL" ]; then
                EMOJI="⚠️"
                MESSAGE="Recovery Test: PARTIAL - Systems recovered with some issues"
              else
                EMOJI="❌"
                MESSAGE="Recovery Test: FAILED - Recovery process failed"
              fi

              curl -X POST -H 'Content-type: application/json' \
                --data "{\"text\":\"$EMOJI $MESSAGE\"}" \
                "$SLACK_WEBHOOK"

              # Send email report
              cat <<EOF | sendmail -t
              Subject: Kubernetes Recovery Test Results: $TEST_STATUS
              To: $EMAIL_RECIPIENT

              Kubernetes Recovery Test completed with status: $TEST_STATUS

              Test details:
              - Backup: $BACKUP_NAME
              - Original namespace: $TEST_NAMESPACE
              - Restored namespace: $TEST_NAMESPACE-restored
              - Restored pods: $RESTORED_PODS

              Full test logs are available in the recovery-test namespace.

              This is an automated message from the Kubernetes Recovery Testing System.
              EOF

              echo "Recovery test completed with status: $TEST_STATUS"

              # Exit with appropriate code
              if [ "$TEST_STATUS" = "SUCCESS" ]; then
                exit 0
              else
                exit 1
              fi
            resources:
              requests:
                cpu: 100m
                memory: 128Mi
              limits:
                cpu: 500m
                memory: 512Mi
---
# Recovery test service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: recovery-test
  namespace: disaster-recovery
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: recovery-test
rules:
- apiGroups: [""]
  resources: ["namespaces", "pods", "services"]
  verbs: ["create", "get", "list", "watch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["create", "get", "list", "watch", "delete"]
- apiGroups: ["velero.io"]
  resources: ["backups", "restores"]
  verbs: ["create", "get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: recovery-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: recovery-test
subjects:
- kind: ServiceAccount
  name: recovery-test
  namespace: disaster-recovery
```

***

## 🚀 **Production DR Setup**

### Complete Disaster Recovery Configuration

```yaml
# Disaster recovery namespace
apiVersion: v1
kind: Namespace
metadata:
  name: disaster-recovery
  labels:
    name: disaster-recovery
    criticality: "high"

---
# DR monitoring dashboard
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-dashboard
  namespace: disaster-recovery
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "Disaster Recovery Dashboard",
        "tags": ["disaster-recovery", "backup", "failover"],
        "panels": [
          {
            "title": "Backup Status",
            "type": "stat",
            "targets": [
              {
                "expr": "velero_backup_total",
                "legendFormat": "Total Backups"
              },
              {
                "expr": "velero_backup_failure_total",
                "legendFormat": "Failed Backups"
              }
            ]
          },
          {
            "title": "Restore Success Rate",
            "type": "stat",
            "targets": [
              {
                "expr": "velero_restore_success_total / (velero_restore_success_total + velero_restore_failed_total) * 100",
                "legendFormat": "Restore Success %"
              }
            ]
          },
          {
            "title": "Cluster Health",
            "type": "stat",
            "targets": [
              {
                "expr": "up{job=\"kubernetes-apiservers\"}",
                "legendFormat": "API Server"
              }
            ]
          }
        ]
      }
    }

---
# DR alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dr-alerts
  namespace: disaster-recovery
spec:
  groups:
  - name: disaster-recovery
    rules:
    - alert: BackupFailed
      expr: increase(velero_backup_failure_total[1h]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Backup operation failed"
        description: "Backup operation failed in the last hour"
    - alert: RestoreFailed
      expr: increase(velero_restore_failed_total[1h]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Restore operation failed"
        description: "Restore operation failed in the last hour"
    - alert: PrimaryClusterUnhealthy
      expr: up{job="kubernetes-apiservers"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Primary cluster is unhealthy"
        description: "Primary cluster has been down for more than 5 minutes"
```
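
The restore success-rate panel above is a simple ratio of successes to total attempts. As a quick sanity check of that expression, the same arithmetic can be reproduced in shell with hypothetical counter values (real values come from Velero's metrics endpoint):

```shell
# Sanity-check the success-rate ratio used in the dashboard panel.
# The counter values below are hypothetical stand-ins for
# velero_restore_success_total and velero_restore_failed_total.
success=47
failed=3
rate=$(( success * 100 / (success + failed) ))
echo "success rate: ${rate}%"   # prints "success rate: 94%"
```

Integer division is fine here since the panel only needs a whole-number percentage; Prometheus evaluates the same ratio in floating point.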

***

## 📚 **Resources and References**

### Official Documentation

* [Velero Documentation](https://velero.io/docs/)
* [Kubernetes Disaster Recovery](https://kubernetes.io/docs/tasks/administer-cluster/disaster-recovery/)
* [Chaos Engineering](https://principlesofchaos.org/)

### DR Tools

* [Chaos Monkey](https://github.com/Netflix/chaos-monkey)
* [Litmus Chaos](https://litmuschaos.io/)
* [Chaos Mesh](https://chaos-mesh.org/)

### Cheatsheet Summary

```bash
# Backup and Restore Commands
velero backup create my-backup --include-namespaces production
velero restore create --from-backup my-backup
velero schedule create daily-backup --schedule="0 2 * * *" --include-namespaces production

# DR Testing Commands
kubectl get backups -n velero
kubectl get restores -n velero
kubectl get events --sort-by='.lastTimestamp' | grep -i "backup\|restore"

# Failover Commands
kubectl get nodes
kubectl get pods --all-namespaces --field-selector=status.phase!=Running
kubectl top nodes
```
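
A common follow-up to the commands above is checking backup freshness against the RPO target from the DR policy. A minimal sketch, assuming GNU `date` and hypothetical timestamps (in practice the completion time would be pulled from `velero backup get -o json`):

```shell
# Compare the age of the most recent backup against a 1-hour RPO target.
# Timestamps are hypothetical; "now" would normally be $(date -u +%Y-%m-%dT%H:%M:%SZ).
last_backup="2024-01-01T02:00:00Z"
now="2024-01-01T02:45:00Z"
rpo_seconds=3600

# GNU date parses ISO 8601 timestamps; BSD/macOS date needs different flags.
age=$(( $(date -u -d "$now" +%s) - $(date -u -d "$last_backup" +%s) ))
if [ "$age" -le "$rpo_seconds" ]; then
  echo "RPO OK (${age}s since last backup)"
else
  echo "RPO BREACH (${age}s since last backup)"
fi
```

Wired into a cron job or the Prometheus rules above, the same comparison becomes an automated RPO-compliance alert rather than a manual check.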

The disaster recovery documentation is ready to use! 🆘
