# Enterprise

> 🚀 **Enterprise-Grade Kubernetes**: A comprehensive guide to multi-cluster management, disaster recovery, and enterprise-level strategies.

***

## 📋 **Table of Contents**

### **🌐 Multi-Cluster Management**

* [Cluster Federation](#cluster-federation)
* [Cross-Cluster Networking](#cross-cluster-networking)
* [Cluster API](#cluster-api)
* [Multi-Tenant Architecture](#multi-tenant-architecture)
* [Cluster Lifecycle Management](#cluster-lifecycle-management)

### **🛡️ Disaster Recovery**

* [Backup Strategies](#backup-strategies)
* [Multi-Region Deployment](#multi-region-deployment)
* [Data Replication](#data-replication)
* [Failover Mechanisms](#failover-mechanisms)
* [Recovery Procedures](#recovery-procedures)

### **💾 Backup & Restore**

* [Velero Backup Solution](#velero-backup-solution)
* [Application Backup](#application-backup)
* [Database Backup](#database-backup)
* [Storage Backup](#storage-backup)
* [Restore Procedures](#restore-procedures)

### **🔐 Enterprise Security**

* [Compliance & Auditing](#compliance--auditing)
* [Network Security](#enterprise-network-security)
* [Identity & Access Management](#identity--access-management)
* [Secret Management](#enterprise-secret-management)
* [Security Posture Management](#security-posture-management)

### **📊 Governance & Policy**

* [Policy as Code](#policy-as-code)
* [Resource Quotas](#resource-quotas)
* [Cost Management](#enterprise-cost-management)
* [Audit Trails](#audit-trails)
* [Compliance Frameworks](#compliance-frameworks)

***

## 🌐 **Multi-Cluster Management**

### Cluster Federation

**🏛️ Kubernetes Federation Overview** Cluster Federation allows multiple Kubernetes clusters to be managed as a single logical unit, with synchronized policies and cross-cluster service discovery.

**🔧 Federation v2 Setup**

```bash
# Install the KubeFed control plane via its Helm chart
helm repo add kubefed-charts https://raw.githubusercontent.com/kubernetes-sigs/kubefed/master/charts
helm install kubefed kubefed-charts/kubefed --version 0.8.1 \
  --namespace kube-federation-system --create-namespace

# Join member clusters to the federation host cluster
kubefedctl join cluster1 --cluster-context cluster1 --host-cluster-context main-cluster --v=2
kubefedctl join cluster2 --cluster-context cluster2 --host-cluster-context main-cluster --v=2
```

**📋 Federated Resources Configuration**

```yaml
# Federated namespace
apiVersion: types.kubefed.io/v1beta1
kind: FederatedNamespace
metadata:
  name: production
  namespace: production
spec:
  placement:
    clusters:
    - name: cluster1
    - name: cluster2

---
# Federated deployment
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: web-app
  namespace: production
spec:
  placement:
    clusters:
    - name: cluster1
    - name: cluster2
  template:
    metadata:
      labels:
        app: web-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web-app
      template:
        metadata:
          labels:
            app: web-app
        spec:
          containers:
          - name: web-app
            image: nginx:latest
            ports:
            - containerPort: 80
  overrides:
  - clusterName: cluster1
    clusterOverrides:
    - path: "/spec/replicas"
      value: 5
  - clusterName: cluster2
    clusterOverrides:
    - path: "/spec/replicas"
      value: 2
```
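The overrides above pin fixed per-cluster replica counts; deciding what those counts should be is left to tooling. A hypothetical helper (not part of KubeFed) that splits a replica budget across clusters by weight could look like:

```python
def spread_replicas(total: int, weights: dict[str, int]) -> dict[str, int]:
    """Distribute `total` replicas across clusters proportionally to weights,
    handing leftover replicas to the largest fractional shares first."""
    weight_sum = sum(weights.values())
    # Integer base share per cluster
    shares = {c: total * w // weight_sum for c, w in weights.items()}
    remainder = total - sum(shares.values())
    # Largest-remainder method for the replicas that didn't divide evenly
    by_fraction = sorted(weights, key=lambda c: (total * weights[c]) % weight_sum, reverse=True)
    for c in by_fraction[:remainder]:
        shares[c] += 1
    return shares

# cluster1 gets the heavier share, mirroring the overrides in the manifest above
print(spread_replicas(7, {"cluster1": 5, "cluster2": 2}))  # → {'cluster1': 5, 'cluster2': 2}
```

The returned map can then be templated into per-cluster `/spec/replicas` overrides.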

### Cross-Cluster Networking

**🌐 Multi-Cluster Network Connectivity**

**Submariner for Cross-Cluster Networking**

```bash
# Install Submariner
curl -Ls https://get.submariner.io | bash
export PATH=$PATH:~/.local/bin

# Deploy the broker (writes broker-info.subm)
subctl deploy-broker --kubeconfig broker-config

# Join clusters
subctl join --kubeconfig cluster1-config broker-info.subm --clusterid cluster1
subctl join --kubeconfig cluster2-config broker-info.subm --clusterid cluster2

# Verify connectivity between the two clusters
subctl verify cluster1-config cluster2-config
```

**Service Discovery Across Clusters**

```yaml
# Global Service for cross-cluster discovery
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: web-service
  namespace: production

---
# ServiceImport objects are normally created by the MCS controller from a
# ServiceExport; one is shown here to illustrate the resulting object.
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: web-service
  namespace: production
spec:
  type: ClusterSetIP
  ports:
  - port: 80
    protocol: TCP
  sessionAffinity: ClientIP
```

### Cluster API

**🏗️ Declarative Cluster Management**

**Cluster API Installation**

```bash
# Install Cluster API
curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.4.4/clusterctl-linux-amd64.tar.gz -o clusterctl.tar.gz
tar xzf clusterctl.tar.gz
sudo mv clusterctl /usr/local/bin/

# Initialize management cluster
clusterctl init --infrastructure aws
```

**Cluster Configuration**

```yaml
# Cluster configuration
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-cluster
  namespace: cluster-system
spec:
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: production-cluster-infrastructure
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-cluster-control-plane

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  name: production-cluster-infrastructure
  namespace: cluster-system
spec:
  region: us-west-2
  sshKeyName: my-key-pair
  network:
    vpc:
      id: vpc-12345678

---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: production-cluster-control-plane
  namespace: cluster-system
spec:
  replicas: 3
  version: v1.28.0
  kubeadmConfigSpec: {}  # cluster-specific kubeadm settings go here
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSMachineTemplate
      name: production-cluster-control-plane
```

### Multi-Tenant Architecture

**🏢 Enterprise Multi-Tenancy**

**Namespace as a Service**

```yaml
# Tenant namespace template
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-{{TENANT_ID}}
  labels:
    tenant: {{TENANT_ID}}
    environment: production
  annotations:
    tenant.name: {{TENANT_NAME}}
    owner: {{OWNER_EMAIL}}

---
# Resource quota for tenant
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-{{TENANT_ID}}-quota
  namespace: tenant-{{TENANT_ID}}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "10"
    pods: "50"
    services: "20"
    secrets: "100"
    configmaps: "50"

---
# Network policy for tenant isolation
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-{{TENANT_ID}}-policy
  namespace: tenant-{{TENANT_ID}}
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          tenant: {{TENANT_ID}}
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          tenant: {{TENANT_ID}}
  - to: []
    ports:
    - protocol: TCP
      port: 53
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 443
```
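The `{{TENANT_ID}}`-style placeholders above are not interpreted by Kubernetes; they must be rendered before `kubectl apply`. A minimal renderer (illustrative; real pipelines typically use Helm or Kustomize) might be:

```python
import re

def render(template: str, values: dict) -> str:
    """Substitute {{KEY}} placeholders; raises KeyError if a value is missing."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(values[m.group(1)]), template)

manifest = "metadata:\n  name: tenant-{{TENANT_ID}}\n  annotations:\n    owner: {{OWNER_EMAIL}}"
print(render(manifest, {"TENANT_ID": "acme", "OWNER_EMAIL": "ops@acme.io"}))
```

Failing loudly on a missing value is deliberate: a half-rendered manifest should never reach the API server.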

**Tenant RBAC Configuration**

```yaml
# Tenant admin role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-admin
  namespace: tenant-{{TENANT_ID}}
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-{{TENANT_ID}}-admin-binding
  namespace: tenant-{{TENANT_ID}}
subjects:
- kind: User
  name: {{TENANT_ADMIN_EMAIL}}
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-admin
  apiGroup: rbac.authorization.k8s.io

---
# Tenant user role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-user
  namespace: tenant-{{TENANT_ID}}
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch"]
```

### Cluster Lifecycle Management

**🔄 Automated Cluster Operations**

**Cluster Upgrade Automation**

```yaml
# Cluster upgrade with ArgoCD
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-upgrade
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/mycompany/cluster-configs
    targetRevision: main
    path: cluster-upgrades
  destination:
    server: https://kubernetes.default.svc
    namespace: cluster-upgrade
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true

---
# Upgrade strategy configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: upgrade-strategy
  namespace: cluster-upgrade
data:
  strategy.yaml: |
    clusters:
      - name: production-cluster
        strategy: rolling
        maxUnavailable: 1
        surge: 1
      - name: staging-cluster
        strategy: recreate
        schedule: "0 2 * * 0"  # Weekly on Sunday 2 AM
```
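A controller honoring `maxUnavailable` takes nodes through the upgrade in bounded batches; the batching itself can be sketched as (illustrative, not ArgoCD's actual implementation):

```python
def upgrade_batches(nodes: list[str], max_unavailable: int) -> list[list[str]]:
    """Split nodes into sequential upgrade batches of at most max_unavailable."""
    if max_unavailable < 1:
        raise ValueError("maxUnavailable must be >= 1")
    return [nodes[i:i + max_unavailable] for i in range(0, len(nodes), max_unavailable)]

# With maxUnavailable: 1, a 3-node control plane upgrades one node at a time
print(upgrade_batches(["cp-0", "cp-1", "cp-2"], 1))  # → [['cp-0'], ['cp-1'], ['cp-2']]
```

Each batch must report healthy before the next one starts; that health gate is what keeps a rolling upgrade from becoming an outage.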

***

## 🛡️ **Disaster Recovery**

### Backup Strategies

**💾 Comprehensive Backup Architecture**

**3-2-1 Backup Strategy Implementation**

```yaml
# Backup configuration with Velero
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: primary-backup
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backups-primary
    prefix: cluster-backups
  config:
    region: us-west-2
    s3Url: https://s3.us-west-2.amazonaws.com

---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: secondary-backup
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backups-secondary
    prefix: cluster-backups
  config:
    region: us-east-1
    s3Url: https://s3.us-east-1.amazonaws.com

---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
    - production
    - staging
    storageLocation: primary-backup
    volumeSnapshotLocations:
    - aws-primary
    ttl: "720h"  # 30 days
    hooks:
      resources:
      - name: pre-backup-hook
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            backup-hook: "true"
        pre:
        - exec:
            container: app
            command:
            - /bin/sh
            - -c
            - "echo 'Taking application snapshot before backup'"
            onError: Fail

---
# Cross-region backup replication
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-replication
  namespace: velero
spec:
  schedule: "0 6 * * *"  # Daily at 6 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup-replicator
            image: amazon/aws-cli:latest
            command:
            - /bin/sh
            - -c
            - |
              aws s3 sync s3://velero-backups-primary s3://velero-backups-secondary \
                --delete --region us-east-1
              echo "Backup replication completed at $(date)"
            env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-access-key
          restartPolicy: OnFailure
```
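Whether a given set of copies actually satisfies 3-2-1 (three copies, two distinct locations, one offsite) can be checked mechanically. A hypothetical helper, not part of Velero:

```python
def satisfies_3_2_1(copies: list[dict]) -> bool:
    """3-2-1 rule: at least 3 copies, on at least 2 distinct locations, 1 offsite."""
    locations = {c["region"] for c in copies}
    offsite = [c for c in copies if c.get("offsite")]
    return len(copies) >= 3 and len(locations) >= 2 and len(offsite) >= 1

# The layout built by the manifests above: live data plus two backup buckets
copies = [
    {"name": "live-data", "region": "us-west-2", "offsite": False},
    {"name": "primary-backup", "region": "us-west-2", "offsite": False},
    {"name": "secondary-backup", "region": "us-east-1", "offsite": True},
]
print(satisfies_3_2_1(copies))  # → True
```

Running a check like this in CI against the declared backup locations catches configuration drift before a disaster does.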

### Multi-Region Deployment

**🌍 Geographic Distribution Strategy**

**Multi-Region Cluster Setup**

```yaml
# Primary region cluster
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: primary-cluster
  namespace: production
  labels:
    region: us-west-2
    cluster-role: primary
spec:
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: primary-cluster-infrastructure

---
# Secondary region cluster
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: secondary-cluster
  namespace: production
  labels:
    region: us-east-1
    cluster-role: secondary
spec:
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: secondary-cluster-infrastructure

---
# Global DNS configuration
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: app-global-dns
  namespace: production
spec:
  endpoints:
  # Route 53 failover routing via external-dns provider-specific properties;
  # a health check can be attached with the aws/health-check-id property.
  - dnsName: app.example.com
    recordTTL: 60
    recordType: A
    targets:
    - 1.2.3.4  # Primary cluster LB IP
    setIdentifier: primary
    providerSpecific:
    - name: aws/failover
      value: PRIMARY
  - dnsName: app.example.com
    recordTTL: 60
    recordType: A
    targets:
    - 5.6.7.8  # Secondary cluster LB IP
    setIdentifier: secondary
    providerSpecific:
    - name: aws/failover
      value: SECONDARY
```

### Data Replication

**📊 Cross-Region Data Synchronization**

**Database Replication Strategy**

```yaml
# Primary database with replication
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-primary
  namespace: production
spec:
  instances: 3
  primaryUpdateStrategy: unsupervised
  postgresql:
    parameters:
      max_connections: "200"
      wal_level: "logical"
      max_wal_senders: "10"
      max_replication_slots: "10"
  bootstrap:
    initdb:
      database: appdb
      owner: appuser
      secret:
        name: postgres-credentials
  storage:
    size: 1Ti
    storageClass: gp3-encrypted
  monitoring:
    enabled: true
  externalClusters:
  - name: postgres-replica
    connectionParameters:
      host: postgres-replica-rw.production.svc.cluster.local
      user: streaming_replica
      dbname: appdb
    password:
      name: postgres-replica-credentials
      key: password

---
# Replica cluster, bootstrapped from the primary
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-replica
  namespace: production
spec:
  instances: 2
  replica:
    enabled: true
    source: postgres-primary
  bootstrap:
    pg_basebackup:
      source: postgres-primary
  # The replica must declare its source as an external cluster
  externalClusters:
  - name: postgres-primary
    connectionParameters:
      host: postgres-primary-rw.production.svc.cluster.local
      user: streaming_replica
      dbname: appdb
    password:
      name: postgres-replica-credentials
      key: password
  storage:
    size: 1Ti
    storageClass: gp3-encrypted
  monitoring:
    enabled: true

### Failover Mechanisms

**🔄 Automated Failover Configuration**

**Application Failover Controller**

```yaml
# Failover controller deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: failover-controller
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: failover-controller
  template:
    metadata:
      labels:
        app: failover-controller
    spec:
      containers:
      - name: controller
        image: failover-controller:latest
        env:
        - name: PRIMARY_CLUSTER
          value: "primary-cluster"
        - name: SECONDARY_CLUSTER
          value: "secondary-cluster"
        - name: APP_NAMESPACE
          value: "production"
        - name: HEALTH_CHECK_INTERVAL
          value: "30s"
        - name: FAILOVER_THRESHOLD
          value: "3"
        - name: SLACK_WEBHOOK
          valueFrom:
            secretKeyRef:
              name: notifications
              key: slack-webhook
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"

---
# Service with failover configuration
apiVersion: v1
kind: Service
metadata:
  name: app-service
  namespace: production
  annotations:
    failover-controller/enabled: "true"
    failover-controller/primary-endpoint: "primary-cluster.production.svc.cluster.local:8080"
    failover-controller/secondary-endpoint: "secondary-cluster.production.svc.cluster.local:8080"
    failover-controller/health-path: "/health"
    failover-controller/failover-threshold: "3"
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: web-app
```
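The controller's core decision, implied by `FAILOVER_THRESHOLD`, is to fail over only after N consecutive failed probes. A sketch of that logic (an assumption about the controller's behavior, since `failover-controller:latest` is a placeholder image):

```python
class FailoverDecider:
    """Track consecutive health-check failures; trip after `threshold` in a row."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def observe(self, healthy: bool) -> bool:
        """Record one probe result; return True when failover should trigger."""
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.threshold

d = FailoverDecider(threshold=3)
results = [d.observe(ok) for ok in [False, False, True, False, False, False]]
print(results)  # a success resets the streak; only the 3rd consecutive failure trips
```

Requiring consecutive failures (rather than a total count) keeps a single flaky probe from flipping traffic to the secondary cluster.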

### Recovery Procedures

**🛠️ Disaster Recovery Runbook**

**Recovery Automation**

```yaml
# Recovery procedure automation
apiVersion: batch/v1
kind: Job
metadata:
  name: disaster-recovery
  namespace: recovery
spec:
  template:
    spec:
      containers:
      - name: recovery-runner
        image: disaster-recovery:latest
        command:
        - /bin/bash
        - -c
        - |
          echo "Starting disaster recovery process at $(date)"

          # Step 1: Verify backup availability
          echo "Checking backup availability..."
          velero backup get

          # Step 2: Restore from latest backup
          echo "Restoring from backup..."
          velero restore create --from-backup "$BACKUP_NAME" \
            --namespace-mappings production:recovery \
            --wait

          # Step 3: Verify data integrity
          echo "Verifying data integrity..."
          kubectl get pods -n recovery
          kubectl exec deployment/web-app -n recovery -- \
            curl -f http://localhost:8080/health

          # Step 4: Update DNS if needed
          echo "Updating DNS records..."
          # DNS update logic here

          # Step 5: Notify stakeholders
          echo "Sending notifications..."
          curl -X POST -H 'Content-type: application/json' \
            --data '{"text":"Disaster recovery completed successfully"}' \
            $SLACK_WEBHOOK

          echo "Recovery process completed at $(date)"
        env:
        - name: SLACK_WEBHOOK
          valueFrom:
            secretKeyRef:
              name: notifications
              key: slack-webhook
        - name: BACKUP_NAME
          value: "latest-backup"
      restartPolicy: Never
```

***

## 💾 **Backup & Restore**

### Velero Backup Solution

**🔧 Velero Enterprise Setup**

**Velero Installation and Configuration**

```bash
# Install Velero
curl -fsSL -o velero.tar.gz https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xvf velero.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/

# Install Velero on cluster
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --backup-location-config region=us-west-2 \
  --snapshot-location-config region=us-west-2 \
  --use-node-agent \
  --wait
```

**Comprehensive Backup Configuration**

```yaml
# Backup schedule with different retention policies
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
    - production
    - staging
    excludedNamespaces:
    - kube-system
    - kube-public
    storageLocation: primary-backup
    volumeSnapshotLocations:
    - aws-primary
    ttl: "720h"  # 30 days
    hooks:
      resources:
      - name: database-backup-hook
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            app: database
        pre:
        - exec:
            container: postgres
            command:
            - /bin/bash
            - -c
            - "pg_dump -U postgres appdb > /tmp/pre-backup.sql"
            onError: Fail
      - name: app-consistency-hook
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            backup-hook: "true"
        pre:
        - exec:
            container: app
            command:
            - /bin/sh
            - -c
            - "curl -X POST http://localhost:8080/backup/prepare"

---
# Weekly backup for long-term retention
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly-longterm-backup
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # Weekly on Sunday 3 AM
  template:
    includedNamespaces:
    - production
    storageLocation: secondary-backup
    volumeSnapshotLocations:
    - aws-secondary
    ttl: "8760h"  # 1 year
    hooks:
      resources:
      - name: weekly-backup-hook
        includedNamespaces:
        - production
        pre:
        - exec:
            container: app
            command:
            - /bin/sh
            - -c
            - 'echo "Weekly long-term backup initiated at $(date)"'

---
# On-demand backup configuration
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: pre-upgrade-backup
  namespace: velero
spec:
  includedNamespaces:
  - production
  excludedNamespaces:
  - kube-system
  storageLocation: primary-backup
  volumeSnapshotLocations:
  - aws-primary
  ttl: "168h"  # 7 days
  hooks:
    resources:
    - name: pre-upgrade-hook
      includedNamespaces:
      - production
      pre:
      - exec:
          container: app
          command:
          - /bin/sh
          - -c
          - "echo 'Pre-upgrade backup completed'"
```
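The Velero TTLs above are Go-style duration strings (`720h`, `8760h`); converting them to calendar retention for reporting can be sketched as:

```python
import re
from datetime import timedelta

def parse_ttl(ttl: str) -> timedelta:
    """Parse a simple Go-style duration like '720h', '30m', or '45s'."""
    m = re.fullmatch(r"(\d+)([hms])", ttl)
    if not m:
        raise ValueError(f"unsupported TTL: {ttl!r}")
    value, unit = int(m.group(1)), m.group(2)
    unit_name = {"h": "hours", "m": "minutes", "s": "seconds"}[unit]
    return timedelta(**{unit_name: value})

print(parse_ttl("720h").days)   # daily schedule keeps 30 days
print(parse_ttl("8760h").days)  # weekly long-term schedule keeps 365 days
```

Note this handles only single-unit durations, which is all the schedules above use; Go itself also accepts compound forms like `1h30m`.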

### Application Backup

**📱 Application-Specific Backup Strategies**

**Stateful Application Backup**

```yaml
# Application backup strategy with consistent snapshots
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-backup-script
  namespace: production
data:
  backup.sh: |
    #!/bin/bash
    set -e

    echo "Starting application backup process..."

    # Create application snapshot
    curl -X POST http://localhost:8080/api/v1/backup/create \
      -H "Content-Type: application/json" \
      -d '{"type": "full", "description": "Automated backup"}'

    # Wait for backup completion
    while true; do
      status=$(curl -s http://localhost:8080/api/v1/backup/status | jq -r '.status')
      if [ "$status" = "completed" ]; then
        break
      fi
      echo "Backup in progress... status: $status"
      sleep 30
    done

    # Upload backup to S3
    backup_id=$(curl -s http://localhost:8080/api/v1/backup/status | jq -r '.id')
    aws s3 cp /app/backups/$backup_id.tar.gz s3://app-backups/$backup_id.tar.gz

    echo "Application backup completed successfully"
    echo "Backup ID: $backup_id"
    echo "S3 Location: s3://app-backups/$backup_id.tar.gz"

---
# Backup job for application data
apiVersion: batch/v1
kind: CronJob
metadata:
  name: app-backup-job
  namespace: production
spec:
  schedule: "0 1 * * *"  # Daily at 1 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: app-backup
            image: my-app:latest
            command:
            - /bin/bash
            - /scripts/backup.sh
            env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-access-key
            - name: BACKUP_BUCKET
              value: "app-backups"
            volumeMounts:
            - name: backup-script
              mountPath: /scripts
            - name: backup-storage
              mountPath: /app/backups
          volumes:
          - name: backup-script
            configMap:
              name: app-backup-script
              defaultMode: 0755
          - name: backup-storage
            persistentVolumeClaim:
              claimName: app-backup-pvc
          restartPolicy: OnFailure
```
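`backup.sh` above polls forever if the backup never completes. A bounded variant of the same wait-for-completion loop (sketched in Python, with the status call stubbed out) is safer:

```python
import time

def wait_for_backup(get_status, poll_interval: float = 0.01, timeout: float = 1.0) -> str:
    """Poll get_status() until it reports a terminal state, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("backup did not finish in time")

# Stubbed status source: 'running' twice, then 'completed'
statuses = iter(["running", "running", "completed"])
print(wait_for_backup(lambda: next(statuses)))  # → completed
```

In the CronJob this matters doubly: a Job whose pod never exits also blocks the next scheduled run, so set `activeDeadlineSeconds` as a backstop.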

### Database Backup

**🗄️ Database Backup Strategies**

**PostgreSQL Backup Configuration**

```yaml
# PostgreSQL backup configuration
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-cluster
  namespace: production
spec:
  instances: 3
  postgresql:
    parameters:
      wal_level: "logical"
      # WAL archiving is managed by the operator through the barmanObjectStore
      # settings below; do not set archive_mode/archive_command manually.
  bootstrap:
    initdb:
      database: appdb
      owner: appuser
      secret:
        name: postgres-credentials
  storage:
    size: 1Ti
    storageClass: gp3-encrypted
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: "s3://postgres-backups/"
      s3Credentials:
        accessKeyId:
          name: backup-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-credentials
          key: SECRET_ACCESS_KEY
      wal:
        compression: "gzip"
        encryption: "AES256"
      data:
        compression: "gzip"
        encryption: "AES256"
        jobs: 2
  monitoring:
    enabled: true

---
# Backup monitoring and alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-backup-alerts
  namespace: monitoring
spec:
  groups:
  - name: postgres-backup
    rules:
    - alert: PostgresBackupFailed
      # Fires when the most recent backup failure is less than an hour old
      expr: time() - cnpg_collector_last_failed_backup_timestamp < 3600
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "PostgreSQL backup failed"
        description: "PostgreSQL backup failed for cluster {{ $labels.cluster }}"

    - alert: PostgresBackupStale
      expr: time() - cnpg_collector_last_available_backup_timestamp > 86400
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "PostgreSQL backup is stale"
        description: "PostgreSQL backup for cluster {{ $labels.cluster }} hasn't succeeded in over 24 hours"
```
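The staleness rule fires when `time() - <last success timestamp> > 86400`; the same check, evaluated outside Prometheus, looks like:

```python
import time

def backup_is_stale(last_success_ts: float, max_age_s: int = 86400, now=None) -> bool:
    """Mirror the PromQL expression: stale when the last success is older than max_age_s."""
    now = time.time() if now is None else now
    return now - last_success_ts > max_age_s

now = 1_700_000_000
print(backup_is_stale(now - 3600, now=now))   # 1 hour old  → False
print(backup_is_stale(now - 90000, now=now))  # ~25 hours old → True
```

The 24-hour threshold should track the backup schedule: a daily schedule with a 24h alert leaves no slack, so many teams alert at schedule interval plus a grace window.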

### Storage Backup

**💾 Persistent Volume Backup Strategy**

**Volume Backup Configuration**

```yaml
# Volume snapshot class for backups
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Retain
parameters:
  description: "VolumeSnapshot created by Velero"

---
# Volume snapshot schedule
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: volume-backup-schedule
  namespace: velero
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM
  template:
    labelSelector:
      matchLabels:
        backup-type: volume
    storageLocation: primary-backup
    volumeSnapshotLocations:
    - aws-primary
    ttl: "720h"  # 30 days
    snapshotVolumes: true

---
# Application with volume backup configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: app-with-volumes
  namespace: production
  labels:
    backup-type: volume
spec:
  serviceName: app-with-volumes
  replicas: 3
  selector:
    matchLabels:
      app: app-with-volumes
  template:
    metadata:
      labels:
        app: app-with-volumes
        backup-type: volume
    spec:
      containers:
      - name: app
        image: my-app:latest
        volumeMounts:
        - name: data-volume
          mountPath: /app/data
        - name: config-volume
          mountPath: /app/config
        env:
        - name: BACKUP_ENABLED
          value: "true"
        - name: BACKUP_SCHEDULE
          value: "0 4 * * *"  # Daily at 4 AM
  volumeClaimTemplates:
  - metadata:
      name: data-volume
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3-encrypted
      resources:
        requests:
          storage: 100Gi
  - metadata:
      name: config-volume
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3-encrypted
      resources:
        requests:
          storage: 10Gi
```

### Restore Procedures

**🔄 Restore Automation**

**Restore Configuration**

```yaml
# Restore job configuration
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: disaster-recovery-restore
  namespace: velero
spec:
  backupName: latest-backup
  includedNamespaces:
  - production
  excludedResources:
  - nodes
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  - resticrepositories.velero.io
  namespaceMapping:
    production: recovery-production
  labelSelector:
    matchLabels:
      restore: "true"
  restorePVs: true
  preserveNodePorts: true
  hooks:
    resources:
    - name: post-restore-hook
      includedNamespaces:
      - recovery-production
      labelSelector:
        matchLabels:
          restore-hook: "true"
      postHooks:
      - exec:
          container: app
          command:
          - /bin/sh
          - -c
          - "echo 'Running post-restore configuration'; curl -X POST http://localhost:8080/api/v1/restore/complete"

---
# Restore validation job
apiVersion: batch/v1
kind: Job
metadata:
  name: restore-validation
  namespace: recovery
spec:
  template:
    spec:
      containers:
      - name: validator
        image: restore-validator:latest
        command:
        - /bin/bash
        - -c
        - |
          echo "Starting restore validation..."

          # Check pod status
          kubectl get pods -n recovery-production

          # Validate application health
          for pod in $(kubectl get pods -n recovery-production -o jsonpath='{.items[*].metadata.name}'); do
            echo "Validating pod: $pod"
            kubectl exec -n recovery-production $pod -- curl -f http://localhost:8080/health
          done

          # Check data integrity
          kubectl exec -n recovery-production deployment/database -- \
            psql -U postgres -d appdb -c "SELECT COUNT(*) FROM users;"

          # Verify services are accessible
          kubectl exec -n recovery-production deployment/web-app -- \
            curl -f http://localhost:8080/api/v1/health

          echo "Restore validation completed successfully"
        env:
        - name: TARGET_NAMESPACE
          value: "recovery-production"
      restartPolicy: OnFailure
```

***

## 🔐 **Enterprise Security**

### Compliance & Auditing

**📋 Regulatory Compliance Framework**

**CIS Kubernetes Benchmark Implementation**

```yaml
# Pod Security Standards for compliance
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
    compliance: pci-dss
    environment: production

---
# Network policies for compliance
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: compliance-network-policy
  namespace: production
  annotations:
    compliance: "pci-dss"
    description: "Network policy enforcing PCI-DSS requirements"
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    ports:
    - protocol: TCP
      port: 80
    - protocol: TCP
      port: 443
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: database
    ports:
    - protocol: TCP
      port: 5432
  - to: []
    ports:
    - protocol: TCP
      port: 53
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 443
```

**Audit Configuration**

```yaml
# Comprehensive audit policy
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  namespaces: ["production"]
  resources:
  - group: ""
    resources: ["secrets", "configmaps", "pods"]
  - group: "rbac.authorization.k8s.io"
    resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
- level: Request
  # Omitting `namespaces` matches requests in all namespaces
  resources:
  - group: ""
    resources: ["events"]
- level: RequestResponse
  namespaces: ["production"]
  resources:
  - group: ""
    resources: ["pods"]
  verbs: ["create", "delete", "update", "patch"]
- level: Metadata
  omitStages:
  - RequestReceived
  userGroups:
  - "system:serviceaccounts:kube-system"
  verbs:
  - "get"
  - "list"
  - "watch"
```
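The API server evaluates audit rules top-down and applies the first match. A simplified matcher over rules shaped like the ones above (ignoring stages, user groups, and API groups, purely illustrative) shows why rule order matters:

```python
def audit_level(rules: list[dict], req: dict) -> str:
    """Return the level of the first rule whose declared fields all match."""
    for rule in rules:
        if "namespaces" in rule and req.get("namespace") not in rule["namespaces"]:
            continue
        if "resources" in rule and req.get("resource") not in rule["resources"]:
            continue
        if "verbs" in rule and req.get("verb") not in rule["verbs"]:
            continue
        return rule["level"]
    return "None"  # no rule matched: the request is not audited

rules = [
    {"level": "Metadata", "namespaces": ["production"], "resources": ["secrets", "configmaps", "pods"]},
    {"level": "Request", "resources": ["events"]},
]
print(audit_level(rules, {"namespace": "production", "resource": "secrets", "verb": "get"}))  # → Metadata
print(audit_level(rules, {"namespace": "staging", "resource": "events", "verb": "list"}))    # → Request
```

Because the first match wins, broad low-level rules placed too early can silently shadow the detailed `RequestResponse` rules below them.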

### Enterprise Network Security

**🛡️ Advanced Network Security**

**Calico Enterprise Security**

```yaml
# Global network policy for enterprise security
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: enterprise-security
spec:
  selector: all()
  order: 100
  ingress:
  - action: Allow
    protocol: TCP
    source:
      selector: has(environment)
      namespaceSelector: has(environment)
    destination:
      selector: all()
      ports:
      - 80
      - 443
  - action: Deny
    protocol: TCP
    destination:
      ports:
      - 22
      - 3389
  - action: Allow
    protocol: TCP
    source:
      selector: has(role)
      namespaceSelector: has(name)
    destination:
      selector: has(role)
      namespaceSelector: has(name)
  egress:
  - action: Allow
    protocol: TCP
    destination:
      selector: all()
      ports:
      - 53
      - 443
  - action: Deny
    protocol: UDP
    destination:
      ports:
      - 53
      - 123

---
# Service mesh security with Istio
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT

---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: app-authz
  namespace: production
spec:
  selector:
    matchLabels:
      app: secure-app
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/production/sa/frontend-sa"]
  - to:
    - operation:
        methods: ["GET", "POST"]
```
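
`GlobalNetworkPolicy` is a Calico CRD, so it is applied with `calicoctl` (or plain `kubectl` if the Calico API server is installed); the Istio resources go through `kubectl` as usual. A quick way to confirm the policies landed (the file names are illustrative):

```bash
calicoctl apply -f enterprise-security.yaml
kubectl apply -f istio-security.yaml

# Verify the policies are in place
calicoctl get globalnetworkpolicy enterprise-security
kubectl get peerauthentication,authorizationpolicy -n production
```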

### Identity & Access Management

**👥 Enterprise IAM Integration**

**Azure AD Integration**

```yaml
# Azure AD integration configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: azure-ad-config
  namespace: kube-system
data:
  azure.yaml: |
    apiVersion: aadpodidentity.k8s.io/v1
    kind: AzureIdentity
    metadata:
      name: aad-pod-identity
      namespace: kube-system
    spec:
      type: 0
      resourceID: /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name>
      clientID: <client-id>

---
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
  name: aad-pod-identity-binding
  namespace: production
spec:
  azureIdentity: aad-pod-identity
  selector: aad-pod-identity

---
# Pod with Azure AD integration
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
  namespace: production
  labels:
    aadpodidbinding: aad-pod-identity  # matched by the AzureIdentityBinding selector
spec:
  serviceAccountName: workload-identity
  containers:
  - name: app
    image: my-secure-app:latest
    env:
    - name: AZURE_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: azure-identity
          key: client-id
    - name: AZURE_TENANT_ID
      valueFrom:
        secretKeyRef:
          name: azure-identity
          key: tenant-id
```

### Enterprise Secret Management

**🔧 Advanced Secret Management**

**HashiCorp Vault Integration**

```yaml
# Vault injector configuration
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vault-injector
  namespace: vault-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: vault-injector-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: vault-injector
  namespace: vault-system

---
# Application with Vault secrets
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
  namespace: production
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "production-app"
    vault.hashicorp.com/agent-inject-secret-database-creds: "secret/data/production/database"
    vault.hashicorp.com/agent-inject-template-database-creds: |
      {{- with secret "secret/data/production/database" -}}
      export DB_HOST="{{ .Data.data.host }}"
      export DB_USERNAME="{{ .Data.data.username }}"
      export DB_PASSWORD="{{ .Data.data.password }}"
      {{- end }}
spec:
  serviceAccountName: vault-auth
  containers:
  - name: app
    image: my-secure-app:latest
    command: ["/bin/sh", "-c"]
    args:
    - source /vault/secrets/database-creds && ./run-app.sh
```
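
The `vault.hashicorp.com/role: "production-app"` annotation assumes a matching Kubernetes auth role already exists in Vault. A sketch of the server-side setup (the policy name is an assumption; the service account and namespace match the pod spec above):

```bash
# Enable Kubernetes auth once per Vault cluster
vault auth enable kubernetes

# Bind the role referenced by the injector annotation
vault write auth/kubernetes/role/production-app \
    bound_service_account_names=vault-auth \
    bound_service_account_namespaces=production \
    policies=production-db-read \
    ttl=1h
```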

***

## 📊 **Governance & Policy**

### Policy as Code

**📝 OPA Gatekeeper Policies**

**Enterprise Policy Framework**

```yaml
# Require resource limits
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlimits
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLimits
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8srequiredlimits
      violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
        required := input.parameters.limits
        missing := [limit | limit := required[_]; not container.resources.limits[limit]]
        count(missing) > 0
        msg := sprintf("Container %v is missing required limits: %v", [container.name, missing])
      }

---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLimits
metadata:
  name: must-have-limits
spec:
  enforcementAction: deny
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    limits:
    - cpu
    - memory

---
# Require image from trusted registry
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8strustedregistry
spec:
  crd:
    spec:
      names:
        kind: K8sTrustedRegistry
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8strustedregistry
      violation[{"msg": msg}] {
        image := input.review.object.spec.containers[_].image
        not startswith(image, "registry.trusted.company.com/")
        msg := sprintf("Image %s is not from trusted registry", [image])
      }

---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sTrustedRegistry
metadata:
  name: trusted-registry-only
spec:
  enforcementAction: deny
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
```
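
Constraints can be exercised before rollout with a server-side dry run: a pod that violates either policy is rejected by the Gatekeeper admission webhook. For example (the image name is illustrative):

```bash
# Should be denied twice over: untrusted registry and no resource limits.
# Expect a rejection from the validation.gatekeeper.sh admission webhook.
kubectl run test-pod --image=docker.io/nginx:1.25 \
  --dry-run=server -n production
```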

### Resource Quotas

**📊 Multi-Tenant Resource Management**

**Hierarchical Resource Quotas**

```yaml
# Platform baseline quota (note: ResourceQuota is namespaced, so this caps
# only kube-system; replicate per namespace for cluster-wide limits)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cluster-quota
  namespace: kube-system
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "200Gi"
    limits.cpu: "200"
    limits.memory: "400Gi"
    persistentvolumeclaims: "500"
    pods: "1000"
    services: "200"
    count/deployments.apps: "100"
    count/secrets: "1000"

---
# Department-level quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: engineering-dept-quota
  namespace: engineering
spec:
  hard:
    requests.cpu: "50"
    requests.memory: "100Gi"
    limits.cpu: "100"
    limits.memory: "200Gi"
    persistentvolumeclaims: "100"
    pods: "200"
    services: "50"
    count/deployments.apps: "25"
    count/secrets: "200"

---
# Project-level quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: web-app-project-quota
  namespace: web-app-project
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    persistentvolumeclaims: "20"
    pods: "50"
    services: "10"
    count/deployments.apps: "10"
    count/secrets: "50"
```
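
Current consumption against each quota can be inspected with `kubectl`; `describe` shows used versus hard limits for every tracked resource:

```bash
kubectl describe resourcequota engineering-dept-quota -n engineering
kubectl get resourcequota --all-namespaces
```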

### Enterprise Cost Management

**💰 Multi-Tenant Cost Allocation**

**Cost Allocation Framework**

```yaml
# Cost monitoring and allocation
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-allocation-rules
  namespace: monitoring
data:
  rules.yaml: |
    projects:
      - name: web-app
        department: engineering
        cost_center: 1010
        monthly_budget: 5000
      - name: mobile-app
        department: engineering
        cost_center: 1010
        monthly_budget: 3000
      - name: analytics
        department: data
        cost_center: 2010
        monthly_budget: 8000

    cost_allocation:
      cpu_rate: 0.05  # $0.05 per CPU-hour
      memory_rate: 0.01  # $0.01 per GB-hour
      storage_rate: 0.1  # $0.1 per GB-month
      network_rate: 0.02  # $0.02 per GB-transfer

---
# Cost alerting configuration
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-alerts
  namespace: monitoring
spec:
  groups:
  - name: cost-management
    rules:
    - alert: ProjectBudgetExceeded
      expr: project_monthly_cost > project_monthly_budget * 1.1
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Project {{ $labels.project }} budget exceeded"
        description: "Project {{ $labels.project }} has exceeded its monthly budget by 10%"

    - alert: DepartmentBudgetWarning
      expr: department_monthly_cost > department_monthly_budget * 0.9
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Department {{ $labels.department }} budget warning"
        description: "Department {{ $labels.department }} has used 90% of its monthly budget"
```
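
With the rates above, a workload's monthly charge is a straightforward product of usage and rate. A sketch for a pod using 2 CPUs and 4 GB over a 730-hour month (the rates come from the ConfigMap above; the workload figures are hypothetical):

```shell
# Monthly cost from the allocation rates above
# (cpu_rate = $0.05 per CPU-hour, memory_rate = $0.01 per GB-hour)
cpu_cost=$(awk 'BEGIN { printf "%.2f", 2 * 730 * 0.05 }')
mem_cost=$(awk 'BEGIN { printf "%.2f", 4 * 730 * 0.01 }')
total=$(awk -v a="$cpu_cost" -v b="$mem_cost" 'BEGIN { printf "%.2f", a + b }')
echo "cpu=\$$cpu_cost memory=\$$mem_cost total=\$$total"
# prints: cpu=$73.00 memory=$29.20 total=$102.20
```

In practice these figures come from Prometheus metrics (e.g. `container_cpu_usage_seconds_total`) rather than static numbers.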

### Audit Trails

**📋 Comprehensive Auditing System**

**Audit Log Management**

```yaml
# Audit log collection and analysis
apiVersion: v1
kind: ConfigMap
metadata:
  name: audit-log-config
  namespace: audit-system
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/kubernetes/audit.log
        Parser            json
        Tag               kubernetes-audit
        Refresh_Interval  5
        Rotate_Wait       30

    [FILTER]
        Name                kubernetes
        Match               kubernetes-audit
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [OUTPUT]
        Name  es
        Match kubernetes-audit
        Host  elasticsearch.logging.svc.cluster.local
        Port  9200
        Index audit-logs

---
# Audit retention policy
apiVersion: batch/v1
kind: CronJob
metadata:
  name: audit-log-retention
  namespace: audit-system
spec:
  schedule: "0 0 * * 0"  # Weekly on Sunday
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: log-retention
            image: amazon/aws-cli:latest
            command:
            - /bin/bash
            - -c
            - |
              # Cap retention at one year (note: delete-log-group would remove
              # the entire log group, so a retention policy is the right tool)
              aws logs put-retention-policy \
                --log-group-name /aws/kubernetes/audit \
                --retention-in-days 365 \
                --region us-west-2

              # Archive events older than 90 days to S3
              # (create-export-task takes epoch timestamps in milliseconds)
              aws logs create-export-task \
                --log-group-name /aws/kubernetes/audit \
                --from $(date -d '1 year ago' +%s)000 \
                --to $(date -d '90 days ago' +%s)000 \
                --destination audit-logs-archive \
                --destination-prefix $(date +%Y-%m) \
                --region us-west-2
            env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-access-key
          restartPolicy: OnFailure
```

### Compliance Frameworks

**📋 Multiple Compliance Standards**

**SOC 2 Type II Compliance**

```yaml
# SOC 2 compliance policies
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: soc2-compliance-labels
spec:
  enforcementAction: deny
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod", "Service", "ConfigMap", "Secret"]
  parameters:
    labels:
    - key: "security.classification"
    - key: "data.sensitivity"
    - key: "owner"

---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowedTags
metadata:
  name: soc2-image-policy
spec:
  enforcementAction: deny
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    tags:
    - latest
    - unstable
    - beta
    - alpha
```
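
Under these constraints, in-scope workloads must carry all three labels and pin image tags. A pod that would pass both policies (the label values and image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: compliant-app
  namespace: production
  labels:
    security.classification: internal
    data.sensitivity: low
    owner: platform-team
spec:
  containers:
  - name: app
    image: registry.trusted.company.com/app:1.4.2  # pinned tag, not "latest"
```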

**GDPR Compliance**

```yaml
# GDPR compliance configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: gdpr-compliance-config
  namespace: compliance
data:
  data-retention-policy.yaml: |
    data_classes:
      - name: personal_data
        retention_days: 365
        encryption_required: true
        consent_required: true
      - name: sensitive_data
        retention_days: 180
        encryption_required: true
        consent_required: true
      - name: analytics_data
        retention_days: 730
        encryption_required: false
        consent_required: false
        anonymization: true

    processing_principles:
      - lawful_basis: required
      - purpose_limitation: required
      - data_minimization: required
      - accuracy: required
      - storage_limitation: required
      - integrity_confidentiality: required
      - accountability: required

---
# GDPR compliance monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gdpr-compliance
  namespace: monitoring
spec:
  groups:
  - name: gdpr-monitoring
    rules:
    - alert: PersonalDataRetentionExceeded
      expr: personal_data_retention_days > personal_data_max_retention
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: "Personal data retention period exceeded"
        description: "Personal data has been retained longer than the GDPR-compliant period"

    - alert: MissingEncryptionForSensitiveData
      expr: sensitive_data_encrypted == 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Sensitive data is not encrypted"
        description: "GDPR requires encryption for sensitive personal data"
```

***

## 🎯 **Best Practices**

### **🏗️ Enterprise Architecture**

1. **Multi-Cluster Strategy**
   * Implement federation for unified management
   * Use GitOps for consistent deployments
   * Establish clear cluster roles and responsibilities
2. **Security Framework**
   * Defense in depth approach
   * Zero trust architecture
   * Regular security assessments
3. **Compliance Management**
   * Automated policy enforcement
   * Continuous compliance monitoring
   * Regular audit preparation

### **🔄 Disaster Recovery**

1. **Backup Strategy**
   * 3-2-1 backup principle
   * Regular backup testing
   * Cross-region replication
2. **Recovery Procedures**
   * Documented runbooks
   * Regular recovery drills
   * Automated failover mechanisms
3. **Monitoring and Alerting**
   * Comprehensive monitoring
   * Proactive alerting
   * Performance baselines
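
The backup practices above can be encoded declaratively. A hedged sketch using a Velero `Schedule`, with backups shipped to a storage location in another region (the names, namespace selection, and TTL are assumptions):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"                # daily at 02:00
  template:
    includedNamespaces:
    - production
    snapshotVolumes: true
    storageLocation: s3-cross-region   # a BackupStorageLocation in another region
    ttl: 720h                          # keep backups for 30 days
```

Restores from these backups should be drilled regularly (`velero restore create --from-backup <name>`), not just assumed to work.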

### **📊 Cost Management**

1. **Resource Optimization**
   * Right-sizing strategies
   * Autoscaling implementations
   * Spot instance utilization
2. **Cost Allocation**
   * Department-level budgeting
   * Project-based tracking
   * Regular cost reviews

***

## 🔗 **References**

### **📚 Official Documentation**

* [Kubernetes Federation](https://kubernetes.io/docs/concepts/cluster-administration/federation/)
* [Cluster API](https://cluster-api.sigs.k8s.io/)
* [Velero Documentation](https://velero.io/docs/)
* [OPA Gatekeeper](https://open-policy-agent.github.io/gatekeeper/)

### **🛠️ Enterprise Tools**

* [Submariner](https://submariner.io/)
* [Calico Enterprise](https://www.tigera.io/products/calico-enterprise)
* [HashiCorp Vault](https://www.vaultproject.io/)
* [Kubefed](https://github.com/kubernetes-sigs/kubefed)

### **📖 Learning Resources**

* [Enterprise Kubernetes Patterns](https://kubernetes.io/docs/concepts/cluster-administration/)
* [Disaster Recovery Best Practices](https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade/)
* [Compliance Frameworks](https://kubernetes.io/docs/concepts/security/)

***

*🏢 Enterprise Kubernetes requires comprehensive planning, security, and governance strategies*
