Chapter 13: Deployment and Operations Practices

Key Takeaways

  • Layered deployment: LB/CDN in front, multiple application replicas, read/write splitting paired with a cache layer.

  • Configuration and secrets: layer .env files and environment variables; inject secrets via a secrets manager or mounted volumes.

  • Health checks and self-healing: readiness/liveness probes, canary and rolling releases.

  • Observability: metrics/logs/traces all in place, with explicit SLOs and alert thresholds.

Reference Commands

# Dev stack (Compose)
docker-compose -f docker-compose.dev.yml up -d

# Hot reload and local run
go run main.go  # or: air -c .air.toml

Cross References

  • Chapter 9: configuration management and environment variables.

  • Chapter 11: logging and monitoring baseline; Chapter 18: secrets and the security exposure surface.

13.1 Deployment Architecture Design

For enterprise-grade applications, a sound deployment architecture is the foundation of stable operation. This chapter walks through the deployment architecture and operations practices of the New API project.

13.1.1 Deployment Architecture Overview

graph TB
    subgraph "Load Balancing Layer"
        LB[Load Balancer]
        CDN[CDN]
    end

    subgraph "Application Layer"
        APP1[App Instance 1]
        APP2[App Instance 2]
        APP3[App Instance 3]
    end

    subgraph "Cache Layer"
        REDIS1[Redis Primary]
        REDIS2[Redis Replica]
    end

    subgraph "Data Layer"
        DB1[Database Primary]
        DB2[Database Replica]
    end

    subgraph "Monitoring Layer"
        PROM[Prometheus]
        GRAF[Grafana]
        ALERT[AlertManager]
    end
    
    CDN --> LB
    LB --> APP1
    LB --> APP2
    LB --> APP3
    
    APP1 --> REDIS1
    APP2 --> REDIS1
    APP3 --> REDIS1
    
    REDIS1 --> REDIS2
    
    APP1 --> DB1
    APP2 --> DB1
    APP3 --> DB1
    
    DB1 --> DB2
    
    PROM --> APP1
    PROM --> APP2
    PROM --> APP3
    GRAF --> PROM
    ALERT --> PROM

Figure 1: Deployment architecture overview (traffic entry → application → cache/database → observability)
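The primary/replica split in the data layer can be illustrated with a tiny routing helper. This is a sketch only (the endpoint names are hypothetical, and real deployments usually delegate routing to a driver, proxy, or a plugin such as GORM's dbresolver):

```go
package main

import (
	"fmt"
	"strings"
)

// pickEndpoint sends read-only statements to the replica and everything
// else to the primary, mirroring the data-layer split in Figure 1.
func pickEndpoint(query, primary, replica string) string {
	q := strings.ToUpper(strings.TrimSpace(query))
	if strings.HasPrefix(q, "SELECT") {
		return replica
	}
	return primary // INSERT/UPDATE/DELETE/DDL all go to the primary
}

func main() {
	fmt.Println(pickEndpoint("SELECT id FROM users", "db-primary:5432", "db-replica:5432"))  // db-replica:5432
	fmt.Println(pickEndpoint("UPDATE users SET name = 'a'", "db-primary:5432", "db-replica:5432")) // db-primary:5432
}
```

Replication lag means a replica may briefly serve stale reads; read-your-own-writes flows should still be pinned to the primary.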

13.1.2 Environment Configuration Management

package config

import (
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// Environment type
type Environment string

const (
    EnvDevelopment Environment = "development"
    EnvTesting     Environment = "testing"
    EnvStaging     Environment = "staging"
    EnvProduction  Environment = "production"
)

// Deployment configuration
type DeploymentConfig struct {
    Environment Environment `json:"environment"`
    
    // Application settings
    AppName    string `json:"app_name"`
    AppVersion string `json:"app_version"`
    Port       int    `json:"port"`
    
    // Database settings
    Database DatabaseConfig `json:"database"`
    
    // Redis settings
    Redis RedisConfig `json:"redis"`
    
    // Logging settings
    Logging LoggingConfig `json:"logging"`
    
    // Monitoring settings
    Monitoring MonitoringConfig `json:"monitoring"`
    
    // Security settings
    Security SecurityConfig `json:"security"`
}

// Database configuration
type DatabaseConfig struct {
    Host         string        `json:"host"`
    Port         int           `json:"port"`
    Username     string        `json:"username"`
    Password     string        `json:"password"`
    Database     string        `json:"database"`
    MaxOpenConns int           `json:"max_open_conns"`
    MaxIdleConns int           `json:"max_idle_conns"`
    MaxLifetime  time.Duration `json:"max_lifetime"`
    SSLMode      string        `json:"ssl_mode"`
}

// Redis configuration
type RedisConfig struct {
    Host         string        `json:"host"`
    Port         int           `json:"port"`
    Password     string        `json:"password"`
    DB           int           `json:"db"`
    PoolSize     int           `json:"pool_size"`
    MinIdleConns int           `json:"min_idle_conns"`
    MaxRetries   int           `json:"max_retries"`
    DialTimeout  time.Duration `json:"dial_timeout"`
    ReadTimeout  time.Duration `json:"read_timeout"`
    WriteTimeout time.Duration `json:"write_timeout"`
}

// Logging configuration
type LoggingConfig struct {
    Level      string `json:"level"`
    Format     string `json:"format"`
    Output     string `json:"output"`
    MaxSize    int    `json:"max_size"`
    MaxBackups int    `json:"max_backups"`
    MaxAge     int    `json:"max_age"`
    Compress   bool   `json:"compress"`
}

// Monitoring configuration
type MonitoringConfig struct {
    Enabled        bool   `json:"enabled"`
    MetricsPath    string `json:"metrics_path"`
    PrometheusAddr string `json:"prometheus_addr"`
    JaegerAddr     string `json:"jaeger_addr"`
}

// Security configuration
type SecurityConfig struct {
    JWTSecret     string        `json:"jwt_secret"`
    JWTExpiration time.Duration `json:"jwt_expiration"`
    RateLimitRPS  int           `json:"rate_limit_rps"`
    CORSOrigins   []string      `json:"cors_origins"`
    TLSEnabled    bool          `json:"tls_enabled"`
    TLSCertFile   string        `json:"tls_cert_file"`
    TLSKeyFile    string        `json:"tls_key_file"`
}

// Load the deployment configuration from environment variables
func LoadDeploymentConfig() (*DeploymentConfig, error) {
    config := &DeploymentConfig{
        Environment: Environment(getEnv("ENVIRONMENT", "development")),
        AppName:     getEnv("APP_NAME", "new-api"),
        AppVersion:  getEnv("APP_VERSION", "1.0.0"),
        Port:        getEnvAsInt("PORT", 8080),
    }
    
    // Load database settings
    config.Database = DatabaseConfig{
        Host:         getEnv("DB_HOST", "localhost"),
        Port:         getEnvAsInt("DB_PORT", 5432),
        Username:     getEnv("DB_USERNAME", "postgres"),
        Password:     getEnv("DB_PASSWORD", ""),
        Database:     getEnv("DB_DATABASE", "newapi"),
        MaxOpenConns: getEnvAsInt("DB_MAX_OPEN_CONNS", 25),
        MaxIdleConns: getEnvAsInt("DB_MAX_IDLE_CONNS", 5),
        MaxLifetime:  getEnvAsDuration("DB_MAX_LIFETIME", 5*time.Minute),
        SSLMode:      getEnv("DB_SSL_MODE", "disable"),
    }
    
    // Load Redis settings
    config.Redis = RedisConfig{
        Host:         getEnv("REDIS_HOST", "localhost"),
        Port:         getEnvAsInt("REDIS_PORT", 6379),
        Password:     getEnv("REDIS_PASSWORD", ""),
        DB:           getEnvAsInt("REDIS_DB", 0),
        PoolSize:     getEnvAsInt("REDIS_POOL_SIZE", 10),
        MinIdleConns: getEnvAsInt("REDIS_MIN_IDLE_CONNS", 5),
        MaxRetries:   getEnvAsInt("REDIS_MAX_RETRIES", 3),
        DialTimeout:  getEnvAsDuration("REDIS_DIAL_TIMEOUT", 5*time.Second),
        ReadTimeout:  getEnvAsDuration("REDIS_READ_TIMEOUT", 3*time.Second),
        WriteTimeout: getEnvAsDuration("REDIS_WRITE_TIMEOUT", 3*time.Second),
    }
    
    // Load logging settings
    config.Logging = LoggingConfig{
        Level:      getEnv("LOG_LEVEL", "info"),
        Format:     getEnv("LOG_FORMAT", "json"),
        Output:     getEnv("LOG_OUTPUT", "stdout"),
        MaxSize:    getEnvAsInt("LOG_MAX_SIZE", 100),
        MaxBackups: getEnvAsInt("LOG_MAX_BACKUPS", 3),
        MaxAge:     getEnvAsInt("LOG_MAX_AGE", 28),
        Compress:   getEnvAsBool("LOG_COMPRESS", true),
    }
    
    // Load monitoring settings
    config.Monitoring = MonitoringConfig{
        Enabled:        getEnvAsBool("MONITORING_ENABLED", true),
        MetricsPath:    getEnv("METRICS_PATH", "/metrics"),
        PrometheusAddr: getEnv("PROMETHEUS_ADDR", "localhost:9090"),
        JaegerAddr:     getEnv("JAEGER_ADDR", "localhost:14268"),
    }
    
    // Load security settings
    config.Security = SecurityConfig{
        JWTSecret:     getEnv("JWT_SECRET", "your-secret-key"),
        JWTExpiration: getEnvAsDuration("JWT_EXPIRATION", 24*time.Hour),
        RateLimitRPS:  getEnvAsInt("RATE_LIMIT_RPS", 100),
        CORSOrigins:   getEnvAsSlice("CORS_ORIGINS", []string{"*"}),
        TLSEnabled:    getEnvAsBool("TLS_ENABLED", false),
        TLSCertFile:   getEnv("TLS_CERT_FILE", ""),
        TLSKeyFile:    getEnv("TLS_KEY_FILE", ""),
    }
    
    return config, nil
}

// Validate the configuration, rejecting missing or unsafe values
func (c *DeploymentConfig) Validate() error {
    if c.AppName == "" {
        return fmt.Errorf("app name is required")
    }
    
    if c.Port <= 0 || c.Port > 65535 {
        return fmt.Errorf("invalid port: %d", c.Port)
    }
    
    if c.Database.Host == "" {
        return fmt.Errorf("database host is required")
    }
    
    if c.Redis.Host == "" {
        return fmt.Errorf("redis host is required")
    }
    
    if c.Security.JWTSecret == "" || c.Security.JWTSecret == "your-secret-key" {
        return fmt.Errorf("JWT secret must be set and not use default value")
    }
    
    return nil
}

// Read a string environment variable with a default
func getEnv(key, defaultValue string) string {
    if value := os.Getenv(key); value != "" {
        return value
    }
    return defaultValue
}

// Read an integer environment variable with a default
func getEnvAsInt(key string, defaultValue int) int {
    if value := os.Getenv(key); value != "" {
        if intValue, err := strconv.Atoi(value); err == nil {
            return intValue
        }
    }
    return defaultValue
}

// Read a boolean environment variable with a default
func getEnvAsBool(key string, defaultValue bool) bool {
    if value := os.Getenv(key); value != "" {
        if boolValue, err := strconv.ParseBool(value); err == nil {
            return boolValue
        }
    }
    return defaultValue
}

// Read a time.Duration environment variable with a default
func getEnvAsDuration(key string, defaultValue time.Duration) time.Duration {
    if value := os.Getenv(key); value != "" {
        if duration, err := time.ParseDuration(value); err == nil {
            return duration
        }
    }
    return defaultValue
}

// Read a comma-separated slice environment variable with a default
func getEnvAsSlice(key string, defaultValue []string) []string {
    if value := os.Getenv(key); value != "" {
        return strings.Split(value, ",")
    }
    return defaultValue
}

13.2 Docker Containerized Deployment

13.2.1 Dockerfile Optimization

# Multi-stage build Dockerfile
FROM golang:1.21-alpine AS builder

# Set the working directory
WORKDIR /app

# Install required packages
RUN apk add --no-cache git ca-certificates tzdata

# Copy the go mod files
COPY go.mod go.sum ./

# Download dependencies
RUN go mod download

# Copy the source code
COPY . .

# Build the application
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o main .

# Runtime stage
FROM alpine:latest

# Install ca-certificates and tzdata
RUN apk --no-cache add ca-certificates tzdata

# Set the time zone
ENV TZ=Asia/Shanghai

# Create a non-root user
RUN addgroup -g 1001 -S appgroup && \
    adduser -u 1001 -S appuser -G appgroup

# Set the working directory
WORKDIR /app

# Copy the binary from the build stage
COPY --from=builder /app/main .

# Copy configuration files
COPY --from=builder /app/configs ./configs

# Set file ownership
RUN chown -R appuser:appgroup /app

# Switch to the non-root user
USER appuser

# Expose the port
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1

# Start the application
CMD ["./main"]

13.2.2 Docker Compose Configuration

# docker-compose.yml
version: '3.8'

services:
  # Application service
  app:
    build:
      context: .
      dockerfile: Dockerfile
    image: new-api:latest
    container_name: new-api-app
    restart: unless-stopped
    ports:
      - "8080:8080"
    environment:
      - ENVIRONMENT=production
      - DB_HOST=postgres
      - DB_PORT=5432
      - DB_USERNAME=newapi
      - DB_PASSWORD=${DB_PASSWORD}
      - DB_DATABASE=newapi
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - JWT_SECRET=${JWT_SECRET}
      - LOG_LEVEL=info
      - MONITORING_ENABLED=true
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - app-network
    volumes:
      - ./logs:/app/logs
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  # PostgreSQL database
  postgres:
    image: postgres:15-alpine
    container_name: new-api-postgres
    restart: unless-stopped
    environment:
      - POSTGRES_DB=newapi
      - POSTGRES_USER=newapi
      - POSTGRES_PASSWORD=${DB_PASSWORD}
      - POSTGRES_INITDB_ARGS=--encoding=UTF-8 --lc-collate=C --lc-ctype=C
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./scripts/init.sql:/docker-entrypoint-initdb.d/init.sql
    networks:
      - app-network
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U newapi -d newapi"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Redis cache
  redis:
    image: redis:7-alpine
    container_name: new-api-redis
    restart: unless-stopped
    command: redis-server --requirepass ${REDIS_PASSWORD} --appendonly yes
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
      - ./configs/redis.conf:/usr/local/etc/redis/redis.conf
    networks:
      - app-network
    healthcheck:
      test: ["CMD-SHELL", "redis-cli -a ${REDIS_PASSWORD} ping | grep PONG"]
      interval: 10s
      timeout: 3s
      retries: 5

  # Nginx load balancer
  nginx:
    image: nginx:alpine
    container_name: new-api-nginx
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./configs/nginx.conf:/etc/nginx/nginx.conf
      - ./configs/ssl:/etc/nginx/ssl
      - ./logs/nginx:/var/log/nginx
    depends_on:
      - app
    networks:
      - app-network

  # Prometheus monitoring
  prometheus:
    image: prom/prometheus:latest
    container_name: new-api-prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./configs/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    networks:
      - app-network

  # Grafana visualization
  grafana:
    image: grafana/grafana:latest
    container_name: new-api-grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./configs/grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./configs/grafana/datasources:/etc/grafana/provisioning/datasources
    networks:
      - app-network

networks:
  app-network:
    driver: bridge

volumes:
  postgres_data:
  redis_data:
  prometheus_data:
  grafana_data:

13.2.3 Environment Variable Configuration

# .env file
# Database
DB_PASSWORD=your_secure_db_password

# Redis
REDIS_PASSWORD=your_secure_redis_password

# JWT secret
JWT_SECRET=your_very_secure_jwt_secret_key_here

# Grafana
GRAFANA_PASSWORD=your_grafana_admin_password

# Application
APP_VERSION=1.0.0
ENVIRONMENT=production

# Monitoring
MONITORING_ENABLED=true

# Logging
LOG_LEVEL=info
LOG_FORMAT=json

13.3 Kubernetes Deployment

13.3.1 Kubernetes Configuration Files

# k8s/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: new-api
  labels:
    name: new-api
---
# k8s/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: new-api-config
  namespace: new-api
data:
  app.yaml: |
    environment: production
    app_name: new-api
    port: 8080
    logging:
      level: info
      format: json
    monitoring:
      enabled: true
      metrics_path: /metrics
---
# k8s/secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: new-api-secret
  namespace: new-api
type: Opaque
data:
  db-password: eW91cl9zZWN1cmVfZGJfcGFzc3dvcmQ=  # base64-encoded
  redis-password: eW91cl9zZWN1cmVfcmVkaXNfcGFzc3dvcmQ=
  jwt-secret: eW91cl92ZXJ5X3NlY3VyZV9qd3Rfc2VjcmV0X2tleV9oZXJl
---
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: new-api-deployment
  namespace: new-api
  labels:
    app: new-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: new-api
  template:
    metadata:
      labels:
        app: new-api
    spec:
      containers:
      - name: new-api
        image: new-api:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
        env:
        - name: ENVIRONMENT
          value: "production"
        - name: DB_HOST
          value: "postgres-service"
        - name: DB_PORT
          value: "5432"
        - name: DB_USERNAME
          value: "newapi"
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: new-api-secret
              key: db-password
        - name: DB_DATABASE
          value: "newapi"
        - name: REDIS_HOST
          value: "redis-service"
        - name: REDIS_PORT
          value: "6379"
        - name: REDIS_PASSWORD
          valueFrom:
            secretKeyRef:
              name: new-api-secret
              key: redis-password
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: new-api-secret
              key: jwt-secret
        - name: LOG_LEVEL
          value: "info"
        - name: MONITORING_ENABLED
          value: "true"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        volumeMounts:
        - name: config-volume
          mountPath: /app/configs
        - name: logs-volume
          mountPath: /app/logs
      volumes:
      - name: config-volume
        configMap:
          name: new-api-config
      - name: logs-volume
        emptyDir: {}
      restartPolicy: Always
---
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: new-api-service
  namespace: new-api
  labels:
    app: new-api
spec:
  selector:
    app: new-api
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: ClusterIP
---
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: new-api-ingress
  namespace: new-api
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - api.yourdomain.com
    secretName: new-api-tls
  rules:
  - host: api.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: new-api-service
            port:
              number: 80

13.3.2 Helm Chart Configuration

# helm/new-api/Chart.yaml
apiVersion: v2
name: new-api
description: A Helm chart for New API application
type: application
version: 0.1.0
appVersion: "1.0.0"

# helm/new-api/values.yaml
replicaCount: 3

image:
  repository: new-api
  pullPolicy: Always
  tag: "latest"

nameOverride: ""
fullnameOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name: ""

podAnnotations: {}

podSecurityContext:
  fsGroup: 1001

securityContext:
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1001

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

ingress:
  enabled: true
  className: "nginx"
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
  hosts:
    - host: api.yourdomain.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: new-api-tls
      hosts:
        - api.yourdomain.com

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80

nodeSelector: {}

tolerations: []

affinity: {}

# Application settings
config:
  environment: production
  logLevel: info
  monitoring:
    enabled: true

# Database settings
database:
  host: postgres-service
  port: 5432
  username: newapi
  database: newapi

# Redis settings
redis:
  host: redis-service
  port: 6379

# Secrets (placeholders for demonstration; in real deployments pass these
# via --set or an external secrets manager instead of committing them)
secrets:
  dbPassword: "your_secure_db_password"
  redisPassword: "your_secure_redis_password"
  jwtSecret: "your_very_secure_jwt_secret_key_here"

# helm/new-api/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "new-api.fullname" . }}
  labels:
    {{- include "new-api.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "new-api.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      {{- with .Values.podAnnotations }}
      annotations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      labels:
        {{- include "new-api.selectorLabels" . | nindent 8 }}
    spec:
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "new-api.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.targetPort }}
              protocol: TCP
          env:
            - name: ENVIRONMENT
              value: {{ .Values.config.environment }}
            - name: DB_HOST
              value: {{ .Values.database.host }}
            - name: DB_PORT
              value: "{{ .Values.database.port }}"
            - name: DB_USERNAME
              value: {{ .Values.database.username }}
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: {{ include "new-api.fullname" . }}-secret
                  key: db-password
            - name: DB_DATABASE
              value: {{ .Values.database.database }}
            - name: REDIS_HOST
              value: {{ .Values.redis.host }}
            - name: REDIS_PORT
              value: "{{ .Values.redis.port }}"
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: {{ include "new-api.fullname" . }}-secret
                  key: redis-password
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: {{ include "new-api.fullname" . }}-secret
                  key: jwt-secret
            - name: LOG_LEVEL
              value: {{ .Values.config.logLevel }}
            - name: MONITORING_ENABLED
              value: "{{ .Values.config.monitoring.enabled }}"
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}

13.4 CI/CD Pipeline

graph LR
  Dev[Developer] --> PR[Pull Request]
  PR --> CI[CI Pipeline]
  CI --> Build[Build + Lint]
  Build --> Test[Unit/Integration Tests]
  Test --> Image[Build Image]
  Image --> Staging[Deploy Staging]
  Staging --> Verify[Smoke/Canary]
  Verify --> Prod[Deploy Production]
  Prod --> Rollback{Rollback?}
  Rollback -- yes --> Staging

Figure 2: CI/CD pipeline and rollback path

13.4.1 CI/CD Overview

Continuous integration/continuous deployment (CI/CD) is a core practice of modern software development: it uses automation to safeguard code quality, speed up delivery, and reduce deployment risk.

CI/CD Process Design

graph LR
    A[Commit] --> B[Code Checks]
    B --> C[Unit Tests]
    C --> D[Security Scan]
    D --> E[Build Image]
    E --> F[Deploy to Test Env]
    F --> G[Integration Tests]
    G --> H[Deploy to Production]
    H --> I[Monitoring Verification]

    B --> J[Quality Gate]
    C --> K[Coverage Check]
    D --> L[Vulnerability Detection]
    G --> M[Smoke Tests]
    I --> N[Rollback Mechanism]

CI/CD Best Practices

  1. Branching strategy

    • main: production code

    • develop: development code

    • feature/*: new feature work

    • hotfix/*: urgent fixes

  2. Quality gates

    • Code formatting checks

    • Static code analysis

    • Unit test coverage > 80%

    • Security vulnerability scanning

  3. Deployment strategies

    • Blue-green deployment: zero downtime

    • Canary deployment: progressive rollout

    • Rolling deployment: replace instances gradually
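The canary strategy's progressive rollout can be sketched as deterministic percentage routing: hash a stable request key and send a fixed share of traffic to the new version. This is a toy model (real rollouts steer traffic at the ingress, service mesh, or deployment controller):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routeToCanary returns true when the request key falls inside the canary
// share. Hashing a stable key (user ID, session) keeps each caller pinned
// to one version for the whole rollout instead of flapping per request.
func routeToCanary(key string, canaryPercent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()%100 < canaryPercent
}

func main() {
	canary := 0
	for i := 0; i < 1000; i++ {
		if routeToCanary(fmt.Sprintf("user-%d", i), 10) {
			canary++
		}
	}
	// Roughly 10% of users land on the canary.
	fmt.Printf("canary share: %d/1000\n", canary)
}
```

Raising canaryPercent in steps (1% → 10% → 50% → 100%) while watching error rates is the progressive rollout described above; setting it back to 0 is the rollback.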

13.4.2 GitHub Actions Configuration

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # Code quality checks
  lint:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Go
      uses: actions/setup-go@v4
      with:
        go-version: '1.21'
    
    - name: Cache Go modules
      uses: actions/cache@v3
      with:
        path: ~/go/pkg/mod
        key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
        restore-keys: |
          ${{ runner.os }}-go-
    
    - name: Install dependencies
      run: go mod download
    
    - name: Run golangci-lint
      uses: golangci/golangci-lint-action@v3
      with:
        version: latest
        args: --timeout=5m
    
    - name: Run go vet
      run: go vet ./...
    
    - name: Run go fmt
      run: |
        if [ "$(gofmt -s -l . | wc -l)" -gt 0 ]; then
          echo "Code is not formatted properly:"
          gofmt -s -l .
          exit 1
        fi

  # Unit tests
  test:
    runs-on: ubuntu-latest
    needs: lint
    
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: testdb
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
      
      redis:
        image: redis:7
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 6379:6379
    
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Go
      uses: actions/setup-go@v4
      with:
        go-version: '1.21'
    
    - name: Cache Go modules
      uses: actions/cache@v3
      with:
        path: ~/go/pkg/mod
        key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
        restore-keys: |
          ${{ runner.os }}-go-
    
    - name: Install dependencies
      run: go mod download
    
    - name: Run tests
      env:
        DB_HOST: localhost
        DB_PORT: 5432
        DB_USERNAME: postgres
        DB_PASSWORD: postgres
        DB_DATABASE: testdb
        REDIS_HOST: localhost
        REDIS_PORT: 6379
      run: |
        go test -v -race -coverprofile=coverage.out ./...
        go tool cover -html=coverage.out -o coverage.html
    
    - name: Upload coverage to Codecov
      uses: codecov/codecov-action@v3
      with:
        file: ./coverage.out
        flags: unittests
        name: codecov-umbrella

  # Security scan
  security:
    runs-on: ubuntu-latest
    needs: lint
    steps:
    - uses: actions/checkout@v4
    
    - name: Run Gosec Security Scanner
      uses: securecodewarrior/github-action-gosec@master
      with:
        args: '-fmt sarif -out gosec.sarif ./...'
    
    - name: Upload SARIF file
      uses: github/codeql-action/upload-sarif@v2
      with:
        sarif_file: gosec.sarif

  # Build and push the image
  build:
    runs-on: ubuntu-latest
    needs: [test, security]
    permissions:
      contents: read
      packages: write
    
    steps:
    - name: Checkout repository
      uses: actions/checkout@v4
    
    - name: Log in to Container Registry
      uses: docker/login-action@v3
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    
    - name: Extract metadata
      id: meta
      uses: docker/metadata-action@v5
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
        tags: |
          type=ref,event=branch
          type=ref,event=pr
          type=sha,prefix={{branch}}-
          type=raw,value=latest,enable={{is_default_branch}}
    
    - name: Build and push Docker image
      uses: docker/build-push-action@v5
      with:
        context: .
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}
        cache-from: type=gha
        cache-to: type=gha,mode=max

  # Deploy to staging
  deploy-staging:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/develop'
    environment: staging
    
    steps:
    - name: Checkout repository
      uses: actions/checkout@v4
    
    - name: Configure kubectl
      uses: azure/k8s-set-context@v3
      with:
        method: kubeconfig
        kubeconfig: ${{ secrets.KUBE_CONFIG_STAGING }}
    
    - name: Deploy to staging
      run: |
        kubectl set image deployment/new-api-deployment \
          new-api=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:develop \
          -n new-api-staging
        kubectl rollout status deployment/new-api-deployment -n new-api-staging

  # Deploy to production
  deploy-production:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment: production
    
    steps:
    - name: Checkout repository
      uses: actions/checkout@v4
    
    - name: Configure kubectl
      uses: azure/k8s-set-context@v3
      with:
        method: kubeconfig
        kubeconfig: ${{ secrets.KUBE_CONFIG_PRODUCTION }}
    
    - name: Deploy to production
      run: |
        kubectl set image deployment/new-api-deployment \
          new-api=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest \
          -n new-api-production
        kubectl rollout status deployment/new-api-deployment -n new-api-production
    
    - name: Notify deployment
      uses: 8398a7/action-slack@v3
      with:
        status: ${{ job.status }}
        channel: '#deployments'
        webhook_url: ${{ secrets.SLACK_WEBHOOK }}
      if: always()

13.4.3 GitLab CI/CD Configuration

# .gitlab-ci.yml
stages:
  - lint
  - test
  - security
  - build
  - deploy-staging
  - deploy-production

variables:
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: "/certs"
  GO_VERSION: "1.21"
  REGISTRY: $CI_REGISTRY
  IMAGE_NAME: $CI_PROJECT_PATH

# Code quality checks
lint:
  stage: lint
  image: golangci/golangci-lint:latest
  script:
    - golangci-lint run --timeout=5m
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"
    - if: $CI_COMMIT_BRANCH == "develop"

# Unit tests
test:
  stage: test
  image: golang:$GO_VERSION
  services:
    - postgres:15
    - redis:7
  variables:
    POSTGRES_DB: testdb
    POSTGRES_USER: postgres
    POSTGRES_PASSWORD: postgres
    DB_HOST: postgres
    DB_PORT: 5432
    DB_USERNAME: postgres
    DB_PASSWORD: postgres
    DB_DATABASE: testdb
    REDIS_HOST: redis
    REDIS_PORT: 6379
  before_script:
    - go mod download
  script:
    - go test -v -race -coverprofile=coverage.out ./...
    - go tool cover -func=coverage.out
    # Convert Go coverage to Cobertura so the coverage_report artifact below exists
    - go run github.com/boumenot/gocover-cobertura@latest < coverage.out > coverage.xml
  coverage: '/total:.*?(\d+\.\d+)%/'
  artifacts:
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"
    - if: $CI_COMMIT_BRANCH == "develop"

# Security scan
security:
  stage: security
  image: securecodewarrior/gosec:latest
  script:
    - gosec -fmt json -out gosec-report.json ./...
  artifacts:
    reports:
      sast: gosec-report.json
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"
    - if: $CI_COMMIT_BRANCH == "develop"

# Build the image
build:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  script:
    - docker build -t $REGISTRY/$IMAGE_NAME:$CI_COMMIT_SHA -t $REGISTRY/$IMAGE_NAME:latest .
    - docker push $REGISTRY/$IMAGE_NAME:$CI_COMMIT_SHA
    - docker push $REGISTRY/$IMAGE_NAME:latest
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
    - if: $CI_COMMIT_BRANCH == "develop"

# Deploy to staging
deploy-staging:
  stage: deploy-staging
  image: bitnami/kubectl:latest
  environment:
    name: staging
    url: https://staging-api.yourdomain.com
  before_script:
    - kubectl config use-context staging
  script:
    - kubectl set image deployment/new-api-deployment new-api=$REGISTRY/$IMAGE_NAME:$CI_COMMIT_SHA -n new-api-staging
    - kubectl rollout status deployment/new-api-deployment -n new-api-staging
  rules:
    - if: $CI_COMMIT_BRANCH == "develop"

# 部署到生产环境
deploy-production:
  stage: deploy-production
  image: bitnami/kubectl:latest
  environment:
    name: production
    url: https://api.yourdomain.com
  before_script:
    - kubectl config use-context production
  script:
    - kubectl set image deployment/new-api-deployment new-api=$REGISTRY/$IMAGE_NAME:$CI_COMMIT_SHA -n new-api-production
    - kubectl rollout status deployment/new-api-deployment -n new-api-production
  when: manual
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
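
生产发布出现异常时,可以利用 Kubernetes 自带的回滚能力快速恢复。以下命令仅为示意,deployment 名称与命名空间沿用上文配置:

```
# 查看历史版本
kubectl rollout history deployment/new-api-deployment -n new-api-production

# 回滚到上一个版本
kubectl rollout undo deployment/new-api-deployment -n new-api-production

# 回滚到指定版本并观察恢复状态
kubectl rollout undo deployment/new-api-deployment --to-revision=2 -n new-api-production
kubectl rollout status deployment/new-api-deployment -n new-api-production
```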

13.4.3 Jenkins Pipeline配置

// Jenkinsfile
pipeline {
    agent any
    
    environment {
        REGISTRY = 'your-registry.com'
        IMAGE_NAME = 'new-api'
        KUBECONFIG = credentials('kubeconfig')
        DOCKER_REGISTRY_CREDS = credentials('docker-registry')
    }
    
    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }
        
        stage('Code Quality') {
            parallel {
                stage('Lint') {
                    steps {
                        sh 'golangci-lint run --timeout=5m'
                    }
                }
                
                stage('Format Check') {
                    steps {
                        sh '''
                            if [ "$(gofmt -s -l . | wc -l)" -gt 0 ]; then
                                echo "Code is not formatted properly:"
                                gofmt -s -l .
                                exit 1
                            fi
                        '''
                    }
                }
            }
        }
        
        stage('Test') {
            steps {
                sh '''
                    docker-compose -f docker-compose.test.yml up -d
                    sleep 10
                    go test -v -race -coverprofile=coverage.out ./...
                    go tool cover -html=coverage.out -o coverage.html
                    docker-compose -f docker-compose.test.yml down
                '''
            }
            post {
                always {
                    publishHTML([
                        allowMissing: false,
                        alwaysLinkToLastBuild: true,
                        keepAll: true,
                        reportDir: '.',
                        reportFiles: 'coverage.html',
                        reportName: 'Coverage Report'
                    ])
                }
            }
        }
        
        stage('Security Scan') {
            steps {
                sh 'gosec -fmt json -out gosec-report.json ./...'
            }
            post {
                always {
                    archiveArtifacts artifacts: 'gosec-report.json', fingerprint: true
                }
            }
        }
        
        stage('Build Image') {
            steps {
                script {
                    def image = docker.build("${REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER}")
                    docker.withRegistry("https://${REGISTRY}", DOCKER_REGISTRY_CREDS) {
                        image.push()
                        image.push('latest')
                    }
                }
            }
        }
        
        stage('Deploy to Staging') {
            when {
                branch 'develop'
            }
            steps {
                sh '''
                    kubectl set image deployment/new-api-deployment \
                        new-api=${REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER} \
                        -n new-api-staging
                    kubectl rollout status deployment/new-api-deployment -n new-api-staging
                '''
            }
        }
        
        stage('Deploy to Production') {
            when {
                branch 'main'
            }
            steps {
                input message: 'Deploy to production?', ok: 'Deploy'
                sh '''
                    kubectl set image deployment/new-api-deployment \
                        new-api=${REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER} \
                        -n new-api-production
                    kubectl rollout status deployment/new-api-deployment -n new-api-production
                '''
            }
        }
    }
    
    post {
        always {
            cleanWs()
        }
        success {
            slackSend(
                channel: '#deployments',
                color: 'good',
                message: "✅ Pipeline succeeded for ${env.JOB_NAME} - ${env.BUILD_NUMBER}"
            )
        }
        failure {
            slackSend(
                channel: '#deployments',
                color: 'danger',
                message: "❌ Pipeline failed for ${env.JOB_NAME} - ${env.BUILD_NUMBER}"
            )
        }
    }
}

13.4.4 部署策略实现

蓝绿部署

# k8s/blue-green-deployment.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: new-api-rollout
  namespace: new-api
spec:
  replicas: 3
  strategy:
    blueGreen:
      activeService: new-api-active
      previewService: new-api-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 30
      prePromotionAnalysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: new-api-preview
      postPromotionAnalysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: new-api-active
  selector:
    matchLabels:
      app: new-api
  template:
    metadata:
      labels:
        app: new-api
    spec:
      containers:
      - name: new-api
        image: new-api:latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: new-api-active
  namespace: new-api
spec:
  selector:
    app: new-api
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: new-api-preview
  namespace: new-api
spec:
  selector:
    app: new-api
  ports:
  - port: 80
    targetPort: 8080

金丝雀部署

# k8s/canary-deployment.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: new-api-canary
  namespace: new-api
spec:
  replicas: 5
  strategy:
    canary:
      steps:
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 40
      - pause: {duration: 10m}
      - setWeight: 60
      - pause: {duration: 10m}
      - setWeight: 80
      - pause: {duration: 10m}
      canaryService: new-api-canary
      stableService: new-api-stable
      trafficRouting:
        nginx:
          stableIngress: new-api-stable
          annotationPrefix: nginx.ingress.kubernetes.io
          additionalIngressAnnotations:
            canary-by-header: X-Canary
      analysis:
        templates:
        - templateName: success-rate
        - templateName: latency
        startingStep: 2
        args:
        - name: service-name
          value: new-api-canary
  selector:
    matchLabels:
      app: new-api
  template:
    metadata:
      labels:
        app: new-api
    spec:
      containers:
      - name: new-api
        image: new-api:latest
        ports:
        - containerPort: 8080
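
上述金丝雀步骤(20%→40%→60%→80%,每步暂停并执行分析)的控制流程,可以用下面这段纯 Go 草图来理解。这只是示意,并非 Argo Rollouts 的实现;`analyze` 是一个假设的分析回调,对应 AnalysisTemplate 的评估结果:

```go
package main

import "fmt"

// promoteCanary 依次提升金丝雀流量权重;任一步分析失败则中止并把权重归零。
// analyze 返回 true 表示该步的分析(成功率/延迟)达标。
func promoteCanary(steps []int, analyze func(weight int) bool) (finalWeight int, promoted bool) {
	for _, w := range steps {
		fmt.Printf("setWeight: %d%%, pause...\n", w)
		if !analyze(w) {
			return 0, false // 分析失败:流量切回稳定版本
		}
	}
	return 100, true // 全部步骤通过:完全切换到新版本
}

func main() {
	steps := []int{20, 40, 60, 80}
	// 示例:所有分析均通过
	w, ok := promoteCanary(steps, func(int) bool { return true })
	fmt.Println(w, ok)
}
```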

分析模板

# k8s/analysis-templates.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: new-api
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 10s
    count: 3
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m])) /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency
  namespace: new-api
spec:
  args:
  - name: service-name
  metrics:
  - name: latency
    interval: 10s
    count: 3
    successCondition: result[0] <= 0.5
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[2m]))
            by (le)
          )
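
分析模板中的 successCondition(成功率 ≥ 0.95、P95 延迟 ≤ 0.5s)本质上是对 Prometheus 查询结果的阈值判断。下面的 Go 草图复现这一判定逻辑,阈值与上文模板一致,函数名为示意:

```go
package main

import "fmt"

// successRate 按模板中 PromQL 的语义计算成功率:非 5xx 请求速率 / 总请求速率。
func successRate(total, errors5xx float64) float64 {
	if total == 0 {
		return 0
	}
	return (total - errors5xx) / total
}

// analysisPassed 复现两个 AnalysisTemplate 的 successCondition。
func analysisPassed(total, errors5xx, p95Latency float64) bool {
	return successRate(total, errors5xx) >= 0.95 && p95Latency <= 0.5
}

func main() {
	fmt.Println(analysisPassed(1000, 20, 0.3)) // 成功率 0.98、P95 0.3s → true
	fmt.Println(analysisPassed(1000, 80, 0.3)) // 成功率 0.92,不达标 → false
}
```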

13.5 监控告警系统

13.5.1 监控系统概述

监控告警系统是保障应用稳定运行的重要基础设施,通过收集、存储、分析和可视化各种指标数据,帮助运维团队及时发现和解决问题。

监控架构设计

graph TB
    subgraph "数据采集层"
        A1[应用指标]
        A2[系统指标]
        A3[业务指标]
        A4[日志数据]
    end
    
    subgraph "数据存储层"
        B1[Prometheus]
        B2[InfluxDB]
        B3[Elasticsearch]
    end
    
    subgraph "数据处理层"
        C1[AlertManager]
        C2[Grafana]
        C3[Kibana]
    end
    
    subgraph "通知渠道"
        D1[邮件]
        D2[Slack]
        D3[钉钉]
        D4[短信]
    end
    
    A1 --> B1
    A2 --> B1
    A3 --> B2
    A4 --> B3
    
    B1 --> C1
    B1 --> C2
    B2 --> C2
    B3 --> C3
    
    C1 --> D1
    C1 --> D2
    C1 --> D3
    C1 --> D4

图2:监控系统架构(采集→存储→处理→通知)

监控指标体系

  1. 基础设施指标

    • CPU使用率、内存使用率

    • 磁盘I/O、网络I/O

    • 文件系统使用率

  2. 应用性能指标

    • 请求响应时间

    • 请求成功率

    • 并发连接数

    • 错误率

  3. 业务指标

    • 用户活跃度

    • 交易量

    • 转化率

  4. 可用性指标

    • 服务可用性

    • SLA指标

    • 故障恢复时间
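
应用性能指标通常以 Prometheus 文本格式通过 /metrics 端点暴露。生产环境应使用官方的 prometheus/client_golang 库;下面仅用标准库勾勒该文本格式的结构(指标名与标签为示意):

```go
package main

import "fmt"

// renderMetric 按 Prometheus 文本格式输出一条带 HELP/TYPE 注释的指标样本。
func renderMetric(name, help, typ, labels string, value float64) string {
	return fmt.Sprintf("# HELP %s %s\n# TYPE %s %s\n%s{%s} %g\n",
		name, help, name, typ, name, labels, value)
}

func main() {
	fmt.Print(renderMetric(
		"http_requests_total", "Total HTTP requests.", "counter",
		`method="GET",status="200"`, 1027))
}
```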

13.5.2 Prometheus配置

# configs/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  # 应用指标
  - job_name: 'new-api'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

  # 系统指标
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # 数据库指标
  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Redis指标
  - job_name: 'redis-exporter'
    static_configs:
      - targets: ['redis-exporter:9121']

  # Nginx指标
  - job_name: 'nginx-exporter'
    static_configs:
      - targets: ['nginx-exporter:9113']

13.5.3 告警规则配置

# configs/alert_rules.yml
groups:
- name: new-api-alerts
  rules:
  # 应用可用性告警
  - alert: ApplicationDown
    expr: up{job="new-api"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "New API application is down"
      description: "New API application has been down for more than 1 minute."

  # 高错误率告警
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value }} errors per second."

  # 高响应时间告警
  - alert: HighResponseTime
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High response time detected"
      description: "95th percentile response time is {{ $value }} seconds."

  # CPU使用率告警
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage is {{ $value }}% on {{ $labels.instance }}."

  # 内存使用率告警
  - alert: HighMemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
      description: "Memory usage is {{ $value }}% on {{ $labels.instance }}."

  # 磁盘使用率告警
  - alert: HighDiskUsage
    expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High disk usage detected"
      description: "Disk usage is {{ $value }}% on {{ $labels.instance }}."

  # 数据库连接告警
  - alert: DatabaseConnectionHigh
    expr: pg_stat_activity_count > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High database connections"
      description: "Database has {{ $value }} active connections."

  # Redis内存使用告警
  - alert: RedisMemoryHigh
    expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis memory usage high"
      description: "Redis memory usage is {{ $value }}%."

13.5.4 AlertManager配置

# configs/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'your-email-password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
  - match:
      severity: warning
    receiver: 'warning-alerts'

receivers:
- name: 'default'
  email_configs:
  - to: '[email protected]'
    subject: '[ALERT] {{ .GroupLabels.alertname }}'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}

- name: 'critical-alerts'
  email_configs:
  - to: '[email protected],[email protected]'
    subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Severity: {{ .Labels.severity }}
      {{ end }}
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    channel: '#alerts'
    title: 'Critical Alert'
    text: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}

- name: 'warning-alerts'
  email_configs:
  - to: '[email protected]'
    subject: '[WARNING] {{ .GroupLabels.alertname }}'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'cluster', 'service']

13.5.5 Grafana仪表板配置

仪表板JSON配置

{
  "dashboard": {
    "id": null,
    "title": "New-API监控仪表板",
    "tags": ["new-api", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "请求QPS",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"new-api\"}[5m]))",
            "legendFormat": "总QPS"
          },
          {
            "expr": "sum(rate(http_requests_total{job=\"new-api\",status=~\"2..\"}[5m]))",
            "legendFormat": "成功QPS"
          }
        ],
        "yAxes": [
          {
            "label": "请求/秒",
            "min": 0
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 0
        }
      },
      {
        "id": 2,
        "title": "响应时间",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job=\"new-api\"}[5m])) by (le))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=\"new-api\"}[5m])) by (le))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job=\"new-api\"}[5m])) by (le))",
            "legendFormat": "P99"
          }
        ],
        "yAxes": [
          {
            "label": "秒",
            "min": 0
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 0
        }
      },
      {
        "id": 3,
        "title": "错误率",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"new-api\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"new-api\"}[5m])) * 100",
            "legendFormat": "错误率"
          }
        ],
        "valueName": "current",
        "format": "percent",
        "thresholds": "1,5",
        "colorBackground": true,
        "gridPos": {
          "h": 4,
          "w": 6,
          "x": 0,
          "y": 8
        }
      },
      {
        "id": 4,
        "title": "活跃连接数",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(http_connections_active{job=\"new-api\"})",
            "legendFormat": "活跃连接"
          }
        ],
        "valueName": "current",
        "format": "short",
        "gridPos": {
          "h": 4,
          "w": 6,
          "x": 6,
          "y": 8
        }
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "5s"
  }
}

监控最佳实践

  1. 告警策略设计

    • Critical: 影响服务可用性的严重问题

    • Warning: 需要关注但不影响服务的问题

    • Info: 信息性告警,用于趋势分析

  2. 数据保留策略

    • 短期数据(1-7天): 高精度,用于实时监控

    • 中期数据(1-3个月): 中等精度,用于趋势分析

    • 长期数据(1年以上): 低精度,用于历史对比

  3. 性能优化

    • 使用recording rules预计算复杂查询

    • 合理设置采集间隔和保留时间

    • 避免高基数标签
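
对于仪表板中反复出现的复杂查询(如 P95 延迟),可以用 recording rules 预计算。下面是一个示意配置,规则命名遵循 Prometheus 推荐的 level:metric:operations 约定:

```yaml
# configs/recording_rules.yml(示意)
groups:
- name: new-api-recording
  interval: 30s
  rules:
  - record: job:http_requests:rate5m
    expr: sum(rate(http_requests_total{job="new-api"}[5m]))
  - record: job:http_request_duration_seconds:p95_5m
    expr: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket{job="new-api"}[5m])) by (le))
```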

13.6 日志管理

13.6.1 日志管理概述

日志管理是运维体系中的重要组成部分,通过统一收集、存储、分析和可视化日志数据,帮助开发和运维团队快速定位问题、分析系统行为和优化性能。

日志架构设计

graph TB
    subgraph "应用层"
        A1[Web服务]
        A2[API服务]
        A3[后台任务]
        A4[数据库]
    end
    
    subgraph "日志收集层"
        B1[Filebeat]
        B2[Fluentd]
        B3[Logstash]
    end
    
    subgraph "消息队列"
        C1[Kafka]
        C2[Redis]
    end
    
    subgraph "日志处理层"
        D1[Logstash]
        D2[Fluentd]
    end
    
    subgraph "存储层"
        E1[Elasticsearch]
        E2[ClickHouse]
    end
    
    subgraph "可视化层"
        F1[Kibana]
        F2[Grafana]
    end
    
    A1 --> B1
    A2 --> B1
    A3 --> B2
    A4 --> B3
    
    B1 --> C1
    B2 --> C1
    B3 --> C2
    
    C1 --> D1
    C2 --> D2
    
    D1 --> E1
    D2 --> E2
    
    E1 --> F1
    E2 --> F2

图3:日志系统架构(应用→采集→队列→处理→存储→可视化)

日志分类与规范

  1. 访问日志

    • HTTP请求日志

    • API调用日志

    • 用户行为日志

  2. 应用日志

    • 业务逻辑日志

    • 错误异常日志

    • 性能监控日志

  3. 系统日志

    • 操作系统日志

    • 容器运行日志

    • 基础设施日志

  4. 安全日志

    • 认证授权日志

    • 安全事件日志

    • 审计日志

日志格式标准化

// internal/logger/structured.go
package logger

import (
	"encoding/json"
	"time"
)

// LogEntry 标准化日志条目
type LogEntry struct {
	Timestamp   time.Time              `json:"timestamp"`
	Level       string                 `json:"level"`
	Service     string                 `json:"service"`
	TraceID     string                 `json:"trace_id"`
	SpanID      string                 `json:"span_id"`
	Message     string                 `json:"message"`
	Fields      map[string]interface{} `json:"fields,omitempty"`
	Error       string                 `json:"error,omitempty"`
	StackTrace  string                 `json:"stack_trace,omitempty"`
	UserID      string                 `json:"user_id,omitempty"`
	RequestID   string                 `json:"request_id,omitempty"`
	HTTPMethod  string                 `json:"http_method,omitempty"`
	HTTPPath    string                 `json:"http_path,omitempty"`
	HTTPStatus  int                    `json:"http_status,omitempty"`
	Duration    int64                  `json:"duration_ms,omitempty"`
}

// ToJSON 转换为JSON格式
func (le *LogEntry) ToJSON() ([]byte, error) {
	return json.Marshal(le)
}

// NewLogEntry 创建新的日志条目
func NewLogEntry(level, service, message string) *LogEntry {
	return &LogEntry{
		Timestamp: time.Now(),
		Level:     level,
		Service:   service,
		Message:   message,
		Fields:    make(map[string]interface{}),
	}
}

13.6.2 ELK Stack配置

# docker-compose-elk.yml
version: '3.8'

services:
  # Elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    networks:
      - elk-network

  # Logstash
  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    container_name: logstash
    ports:
      - "5044:5044"
      - "9600:9600"
    volumes:
      - ./configs/logstash/pipeline:/usr/share/logstash/pipeline
      - ./configs/logstash/config:/usr/share/logstash/config
    depends_on:
      - elasticsearch
    networks:
      - elk-network

  # Kibana
  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch
    networks:
      - elk-network

  # Filebeat
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.11.0
    container_name: filebeat
    user: root
    volumes:
      - ./configs/filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - ./logs:/var/log/app:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - logstash
    networks:
      - elk-network

volumes:
  elasticsearch_data:

networks:
  elk-network:
    driver: bridge

# configs/filebeat/filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/app/*.log
  fields:
    service: new-api
    environment: production
  fields_under_root: true
  multiline.pattern: '^\d{4}-\d{2}-\d{2}'
  multiline.negate: true
  multiline.match: after

- type: docker
  containers.ids:
    - '*'
  processors:
    - add_docker_metadata:
        host: "unix:///var/run/docker.sock"

output.logstash:
  hosts: ["logstash:5044"]

processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_cloud_metadata: ~
  - add_docker_metadata: ~

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644

# configs/logstash/pipeline/logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [service] == "new-api" {
    json {
      source => "message"
    }
    
    date {
      match => [ "timestamp", "ISO8601" ]
    }
    
    mutate {
      remove_field => [ "message", "@version" ]
    }
  }
  
  if [container][name] {
    mutate {
      add_field => { "container_name" => "%{[container][name]}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "new-api-logs-%{+YYYY.MM.dd}"
  }
  
  stdout {
    codec => rubydebug
  }
}

13.6.3 日志轮转配置

package logging

import (
    "io"
    "os"
    "path/filepath"
    "time"
    
    "gopkg.in/natefinch/lumberjack.v2"
    "github.com/sirupsen/logrus"
)

// 日志轮转配置
type RotationConfig struct {
    Filename   string `json:"filename"`
    MaxSize    int    `json:"max_size"`    // MB
    MaxBackups int    `json:"max_backups"`
    MaxAge     int    `json:"max_age"`     // days
    Compress   bool   `json:"compress"`
    LocalTime  bool   `json:"local_time"`
}

// 创建轮转日志写入器
func NewRotationWriter(config RotationConfig) io.Writer {
    return &lumberjack.Logger{
        Filename:   config.Filename,
        MaxSize:    config.MaxSize,
        MaxBackups: config.MaxBackups,
        MaxAge:     config.MaxAge,
        Compress:   config.Compress,
        LocalTime:  config.LocalTime,
    }
}

// 日志管理器
type LogManager struct {
    logger   *logrus.Logger
    config   RotationConfig
    writers  []io.Writer
}

// 创建日志管理器
func NewLogManager(config RotationConfig) *LogManager {
    logger := logrus.New()
    
    // 创建轮转写入器
    rotationWriter := NewRotationWriter(config)
    
    // 创建多写入器
    writers := []io.Writer{rotationWriter}
    
    // 如果是开发环境,同时输出到控制台
    if os.Getenv("ENVIRONMENT") == "development" {
        writers = append(writers, os.Stdout)
    }
    
    multiWriter := io.MultiWriter(writers...)
    logger.SetOutput(multiWriter)
    
    // 设置JSON格式
    logger.SetFormatter(&logrus.JSONFormatter{
        TimestampFormat: time.RFC3339,
    })
    
    return &LogManager{
        logger:  logger,
        config:  config,
        writers: writers,
    }
}

// 获取日志器
func (lm *LogManager) GetLogger() *logrus.Logger {
    return lm.logger
}

// 清理旧日志
func (lm *LogManager) CleanupOldLogs() error {
    logDir := filepath.Dir(lm.config.Filename)
    
    return filepath.Walk(logDir, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        
        // 检查是否为日志文件且超过保留期限
        if info.IsDir() {
            return nil
        }
        
        if time.Since(info.ModTime()) > time.Duration(lm.config.MaxAge)*24*time.Hour {
            return os.Remove(path)
        }
        
        return nil
    })
}

// 获取日志统计信息
func (lm *LogManager) GetLogStats() (map[string]interface{}, error) {
    logDir := filepath.Dir(lm.config.Filename)
    
    stats := map[string]interface{}{
        "total_files": 0,
        "total_size":  int64(0),
        "oldest_log":  time.Now(),
        "newest_log":  time.Time{},
    }
    
    err := filepath.Walk(logDir, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        
        if !info.IsDir() {
            stats["total_files"] = stats["total_files"].(int) + 1
            stats["total_size"] = stats["total_size"].(int64) + info.Size()
            
            if info.ModTime().Before(stats["oldest_log"].(time.Time)) {
                stats["oldest_log"] = info.ModTime()
            }
            
            if info.ModTime().After(stats["newest_log"].(time.Time)) {
                stats["newest_log"] = info.ModTime()
            }
        }
        
        return nil
    })
    
    return stats, err
}

13.6.4 日志分析与可视化

Kibana仪表板配置

{
  "version": "7.10.0",
  "objects": [
    {
      "id": "new-api-logs-dashboard",
      "type": "dashboard",
      "attributes": {
        "title": "New-API日志分析仪表板",
        "hits": 0,
        "description": "New-API应用日志分析和监控",
        "panelsJSON": "[\n  {\n    \"id\": \"log-level-distribution\",\n    \"type\": \"pie\",\n    \"gridData\": {\n      \"x\": 0,\n      \"y\": 0,\n      \"w\": 24,\n      \"h\": 15\n    }\n  },\n  {\n    \"id\": \"error-logs-timeline\",\n    \"type\": \"histogram\",\n    \"gridData\": {\n      \"x\": 24,\n      \"y\": 0,\n      \"w\": 24,\n      \"h\": 15\n    }\n  },\n  {\n    \"id\": \"top-error-messages\",\n    \"type\": \"data_table\",\n    \"gridData\": {\n      \"x\": 0,\n      \"y\": 15,\n      \"w\": 48,\n      \"h\": 15\n    }\n  }\n]",
        "timeRestore": false,
        "kibanaSavedObjectMeta": {
          "searchSourceJSON": "{\"query\":{\"match_all\":{}},\"filter\":[]}"
        }
      }
    },
    {
      "id": "log-level-distribution",
      "type": "visualization",
      "attributes": {
        "title": "日志级别分布",
        "visState": "{\"title\":\"日志级别分布\",\"type\":\"pie\",\"params\":{\"addTooltip\":true,\"addLegend\":true,\"legendPosition\":\"right\",\"isDonut\":true},\"aggs\":[{\"id\":\"1\",\"enabled\":true,\"type\":\"count\",\"schema\":\"metric\",\"params\":{}},{\"id\":\"2\",\"enabled\":true,\"type\":\"terms\",\"schema\":\"segment\",\"params\":{\"field\":\"level.keyword\",\"size\":10,\"order\":\"desc\",\"orderBy\":\"1\"}}]}",
        "uiStateJSON": "{}",
        "description": "",
        "kibanaSavedObjectMeta": {
          "searchSourceJSON": "{\"index\":\"new-api-logs-*\",\"query\":{\"match_all\":{}},\"filter\":[]}"
        }
      }
    }
  ]
}

日志告警规则

# configs/log-alerts.yml
rules:
  - alert: HighErrorRate
    expr: |
      (
        sum(rate(log_entries_total{level="error"}[5m]))
        /
        sum(rate(log_entries_total[5m]))
      ) * 100 > 5
    for: 2m
    labels:
      severity: warning
      service: new-api
    annotations:
      summary: "应用错误率过高"
      description: "过去5分钟内错误率为 {{ $value }}%"
      
  - alert: CriticalErrorSpike
    expr: |
      increase(log_entries_total{level="error"}[1m]) > 10
    for: 1m
    labels:
      severity: critical
      service: new-api
    annotations:
      summary: "错误日志激增"
      description: "1分钟内出现 {{ $value }} 条错误日志"
      
  - alert: LogVolumeHigh
    expr: |
      sum(rate(log_entries_total[5m])) > 1000
    for: 5m
    labels:
      severity: warning
      service: new-api
    annotations:
      summary: "日志量过大"
      description: "当前日志生成速率为 {{ $value }} 条/秒"

日志分析脚本

// scripts/log-analyzer.go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"regexp"
	"sort"
	"strings"
	"time"
)

// LogAnalyzer 日志分析器
type LogAnalyzer struct {
	errorPatterns []*regexp.Regexp
	stats         map[string]int
	errorCounts   map[string]int
	timeRange     struct {
		start time.Time
		end   time.Time
	}
}

// NewLogAnalyzer 创建日志分析器
func NewLogAnalyzer() *LogAnalyzer {
	return &LogAnalyzer{
		errorPatterns: []*regexp.Regexp{
			regexp.MustCompile(`(?i)error|exception|failed|panic`),
			regexp.MustCompile(`(?i)timeout|connection.*refused`),
			regexp.MustCompile(`(?i)out.*of.*memory|memory.*leak`),
		},
		stats:       make(map[string]int),
		errorCounts: make(map[string]int),
	}
}

// AnalyzeFile 分析日志文件
func (la *LogAnalyzer) AnalyzeFile(filename string) error {
	file, err := os.Open(filename)
	if err != nil {
		return err
	}
	defer file.Close()
	
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		la.analyzeLine(line)
	}
	
	return scanner.Err()
}

// analyzeLine 分析单行日志
func (la *LogAnalyzer) analyzeLine(line string) {
	// 尝试解析JSON格式日志
	var logEntry map[string]interface{}
	if err := json.Unmarshal([]byte(line), &logEntry); err == nil {
		if level, ok := logEntry["level"].(string); ok {
			la.stats[level]++
		}
		
		if level, ok := logEntry["level"].(string); ok && level == "error" {
			if msg, ok := logEntry["message"].(string); ok {
				la.categorizeError(msg)
			}
		}
	} else {
		// 处理非JSON格式日志
		la.analyzeTextLog(line)
	}
}

// categorizeError 错误分类
func (la *LogAnalyzer) categorizeError(message string) {
	for i, pattern := range la.errorPatterns {
		if pattern.MatchString(message) {
			category := fmt.Sprintf("error_type_%d", i+1)
			la.errorCounts[category]++
			return
		}
	}
	la.errorCounts["other_errors"]++
}

// analyzeTextLog 分析文本格式日志
func (la *LogAnalyzer) analyzeTextLog(line string) {
	lowerLine := strings.ToLower(line)
	
	switch {
	case strings.Contains(lowerLine, "error"):
		la.stats["error"]++
	case strings.Contains(lowerLine, "warn"):
		la.stats["warning"]++
	case strings.Contains(lowerLine, "info"):
		la.stats["info"]++
	default:
		la.stats["other"]++
	}
}

// GenerateReport 生成分析报告
func (la *LogAnalyzer) GenerateReport() {
	fmt.Println("=== 日志分析报告 ===")
	fmt.Println("\n日志级别统计:")
	
	for level, count := range la.stats {
		fmt.Printf("%s: %d\n", level, count)
	}
	
	fmt.Println("\n错误类型统计:")
	for errorType, count := range la.errorCounts {
		fmt.Printf("%s: %d\n", errorType, count)
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("Usage: go run log-analyzer.go <log-file>")
		os.Exit(1)
	}
	
	analyzer := NewLogAnalyzer()
	if err := analyzer.AnalyzeFile(os.Args[1]); err != nil {
		fmt.Printf("Error analyzing file: %v\n", err)
		os.Exit(1)
	}
	
	analyzer.GenerateReport()
}

13.7 备份与恢复

13.7.1 备份策略概述

备份与恢复是保障数据安全和业务连续性的关键措施。通过制定完善的备份策略和恢复流程,确保在系统故障、数据损坏或灾难发生时能够快速恢复业务。

备份策略设计

graph TB
    subgraph "备份类型"
        A1[全量备份]
        A2[增量备份]
        A3[差异备份]
        A4[日志备份]
    end
    
    subgraph "备份对象"
        B1[数据库]
        B2[应用文件]
        B3[配置文件]
        B4[日志文件]
        B5[用户数据]
    end
    
    subgraph "存储位置"
        C1[本地存储]
        C2[网络存储]
        C3[云存储]
        C4[异地备份]
    end
    
    subgraph "恢复策略"
        D1[完全恢复]
        D2[时间点恢复]
        D3[部分恢复]
        D4[灾难恢复]
    end
    
    A1 --> B1
    A2 --> B2
    A3 --> B3
    A4 --> B4
    
    B1 --> C1
    B2 --> C2
    B3 --> C3
    B4 --> C4
    
    C1 --> D1
    C2 --> D2
    C3 --> D3
    C4 --> D4

图2:备份策略设计(备份类型、备份对象、存储位置与恢复策略)

备份策略矩阵

| 数据类型 | 备份频率 | 备份方式 | 保留期限 | 存储位置 |
| --- | --- | --- | --- | --- |
| 核心数据库 | 每日全量 + 每小时增量 | 热备份 | 30天 | 本地+云存储 |
| 应用文件 | 每周全量 | 冷备份 | 90天 | 网络存储 |
| 配置文件 | 变更时备份 | 版本控制 | 永久 | Git仓库 |
| 日志文件 | 每日归档 | 压缩备份 | 180天 | 云存储 |
| 用户上传文件 | 每日增量 | 同步备份 | 365天 | 云存储+异地 |

备份管理器设计

// internal/backup/manager.go
package backup

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

// BackupType 备份类型
type BackupType string

const (
	FullBackup        BackupType = "full"
	IncrementalBackup BackupType = "incremental"
	DifferentialBackup BackupType = "differential"
)

// BackupConfig 备份配置
type BackupConfig struct {
	Name           string        `json:"name"`
	Type           BackupType    `json:"type"`
	Source         string        `json:"source"`
	Destination    string        `json:"destination"`
	Schedule       string        `json:"schedule"`        // Cron表达式
	RetentionDays  int           `json:"retention_days"`  // 保留天数
	Compress       bool          `json:"compress"`        // 是否压缩
	Encrypt        bool          `json:"encrypt"`         // 是否加密
	NotifyOnError  bool          `json:"notify_on_error"` // 错误时通知
	NotifyOnSuccess bool         `json:"notify_on_success"` // 成功时通知
	Timeout        time.Duration `json:"timeout"`         // 超时时间
}

// BackupResult 备份结果
type BackupResult struct {
	ID          string        `json:"id"`
	Name        string        `json:"name"`
	Type        BackupType    `json:"type"`
	StartTime   time.Time     `json:"start_time"`
	EndTime     time.Time     `json:"end_time"`
	Duration    time.Duration `json:"duration"`
	Size        int64         `json:"size"`
	Status      string        `json:"status"`
	Error       string        `json:"error,omitempty"`
	FilePath    string        `json:"file_path"`
	Checksum    string        `json:"checksum"`
}

// BackupManager 备份管理器
type BackupManager struct {
	configs   []BackupConfig
	results   []BackupResult
	notifier  Notifier
	encryptor Encryptor
}

// Notifier 通知接口
type Notifier interface {
	Notify(message string) error
}

// Encryptor 加密接口
type Encryptor interface {
	Encrypt(src, dst string) error
	Decrypt(src, dst string) error
}

// NewBackupManager 创建备份管理器
func NewBackupManager(configs []BackupConfig) *BackupManager {
	return &BackupManager{
		configs: configs,
		results: make([]BackupResult, 0),
	}
}

// SetNotifier 设置通知器
func (bm *BackupManager) SetNotifier(notifier Notifier) {
	bm.notifier = notifier
}

// SetEncryptor 设置加密器
func (bm *BackupManager) SetEncryptor(encryptor Encryptor) {
	bm.encryptor = encryptor
}

// ExecuteBackup 执行备份
func (bm *BackupManager) ExecuteBackup(ctx context.Context, configName string) (*BackupResult, error) {
	config := bm.findConfig(configName)
	if config == nil {
		return nil, fmt.Errorf("backup config not found: %s", configName)
	}
	
	result := &BackupResult{
		ID:        generateBackupID(),
		Name:      config.Name,
		Type:      config.Type,
		StartTime: time.Now(),
		Status:    "running",
	}
	
	// 设置超时
	if config.Timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, config.Timeout)
		defer cancel()
	}
	
	// 执行备份
	err := bm.performBackup(ctx, config, result)
	result.EndTime = time.Now()
	result.Duration = result.EndTime.Sub(result.StartTime)
	
	if err != nil {
		result.Status = "failed"
		result.Error = err.Error()
		
		if config.NotifyOnError && bm.notifier != nil {
			bm.notifier.Notify(fmt.Sprintf("备份失败: %s - %s", config.Name, err.Error()))
		}
	} else {
		result.Status = "completed"
		
		if config.NotifyOnSuccess && bm.notifier != nil {
			bm.notifier.Notify(fmt.Sprintf("备份成功: %s", config.Name))
		}
	}
	
	bm.results = append(bm.results, *result)
	return result, err
}

// findConfig 查找配置
func (bm *BackupManager) findConfig(name string) *BackupConfig {
	for _, config := range bm.configs {
		if config.Name == name {
			return &config
		}
	}
	return nil
}

// generateBackupID 生成备份ID
func generateBackupID() string {
	return fmt.Sprintf("backup_%d", time.Now().Unix())
}

13.7.2 数据库备份脚本

#!/bin/bash
# scripts/backup-database.sh

set -e

# 配置变量
DB_HOST=${DB_HOST:-"localhost"}
DB_PORT=${DB_PORT:-"5432"}
DB_NAME=${DB_NAME:-"newapi"}
DB_USER=${DB_USER:-"newapi"}
BACKUP_DIR=${BACKUP_DIR:-"/backups/database"}
RETENTION_DAYS=${RETENTION_DAYS:-"7"}

# 创建备份目录
mkdir -p "$BACKUP_DIR"

# 生成备份文件名
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
BACKUP_FILE="${BACKUP_DIR}/newapi_backup_${TIMESTAMP}.sql"
COMPRESSED_FILE="${BACKUP_FILE}.gz"

echo "Starting database backup..."
echo "Host: $DB_HOST:$DB_PORT"
echo "Database: $DB_NAME"
echo "Backup file: $COMPRESSED_FILE"

# 执行备份
pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" \
    --verbose --clean --no-owner --no-privileges \
    --format=custom --file="$BACKUP_FILE"

# 压缩备份文件(custom格式已内置压缩,这里再套一层gzip主要是统一.gz归档后缀,便于恢复脚本识别)
gzip "$BACKUP_FILE"

# 验证备份文件
if [ -f "$COMPRESSED_FILE" ]; then
    BACKUP_SIZE=$(du -h "$COMPRESSED_FILE" | cut -f1)
    echo "Backup completed successfully. Size: $BACKUP_SIZE"
else
    echo "Backup failed!"
    exit 1
fi

# 清理旧备份
echo "Cleaning up old backups (older than $RETENTION_DAYS days)..."
find "$BACKUP_DIR" -name "newapi_backup_*.sql.gz" -mtime +"$RETENTION_DAYS" -delete

# 上传到云存储(可选)
if [ -n "$AWS_S3_BUCKET" ]; then
    echo "Uploading backup to S3..."
    aws s3 cp "$COMPRESSED_FILE" "s3://$AWS_S3_BUCKET/database-backups/"
fi

echo "Database backup process completed."

13.7.3 数据库恢复脚本

#!/bin/bash
# scripts/restore-database.sh

set -e

# 检查参数
if [ $# -ne 1 ]; then
    echo "Usage: $0 <backup_file>"
    echo "Example: $0 /backups/database/newapi_backup_20231201_120000.sql.gz"
    exit 1
fi

BACKUP_FILE="$1"

# 配置变量
DB_HOST=${DB_HOST:-"localhost"}
DB_PORT=${DB_PORT:-"5432"}
DB_NAME=${DB_NAME:-"newapi"}
DB_USER=${DB_USER:-"newapi"}

# 检查备份文件是否存在
if [ ! -f "$BACKUP_FILE" ]; then
    echo "Backup file not found: $BACKUP_FILE"
    exit 1
fi

echo "Starting database restore..."
echo "Host: $DB_HOST:$DB_PORT"
echo "Database: $DB_NAME"
echo "Backup file: $BACKUP_FILE"

# 确认操作
read -p "This will overwrite the existing database. Are you sure? (y/N): " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
    echo "Restore cancelled."
    exit 1
fi

# 停止应用服务(可选)
echo "Stopping application services..."
docker-compose stop app || true

# 解压备份文件(如果需要)
if [[ "$BACKUP_FILE" == *.gz ]]; then
    TEMP_FILE="/tmp/restore_$(basename "$BACKUP_FILE" .gz)"
    gunzip -c "$BACKUP_FILE" > "$TEMP_FILE"
    RESTORE_FILE="$TEMP_FILE"
else
    RESTORE_FILE="$BACKUP_FILE"
fi

# 执行恢复
echo "Restoring database..."
pg_restore -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" \
    --verbose --clean --no-owner --no-privileges \
    "$RESTORE_FILE"

# 清理临时文件
if [ -n "$TEMP_FILE" ] && [ -f "$TEMP_FILE" ]; then
    rm "$TEMP_FILE"
fi

# 重启应用服务
echo "Starting application services..."
docker-compose start app

echo "Database restore completed successfully."

13.7.4 自动备份定时任务

# crontab配置
# 每天凌晨2点执行数据库备份
0 2 * * * /path/to/scripts/backup-database.sh >> /var/log/backup.log 2>&1

# 每周日凌晨3点执行完整备份
0 3 * * 0 /path/to/scripts/full-backup.sh >> /var/log/backup.log 2>&1

# 每月1号凌晨4点清理旧备份
0 4 1 * * /path/to/scripts/cleanup-backups.sh >> /var/log/backup.log 2>&1

除系统级crontab外,也可以在应用内基于cron库(如robfig/cron)实现备份调度。下面是一个带调度能力的备份管理器变体,与前文internal/backup/manager.go中的BackupManager是两种独立实现,实际项目中按需选用其一:
package backup

import (
    "context"
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "time"
    
    "github.com/robfig/cron/v3"
    "github.com/sirupsen/logrus"
)

// 备份管理器
type BackupManager struct {
    config BackupConfig
    cron   *cron.Cron
    logger *logrus.Logger
}

// 备份配置
type BackupConfig struct {
    DatabaseURL    string        `json:"database_url"`
    BackupDir      string        `json:"backup_dir"`
    RetentionDays  int           `json:"retention_days"`
    Schedule       string        `json:"schedule"`
    S3Bucket       string        `json:"s3_bucket"`
    S3Region       string        `json:"s3_region"`
    NotifyWebhook  string        `json:"notify_webhook"`
    Timeout        time.Duration `json:"timeout"`
}

// 创建备份管理器
func NewBackupManager(config BackupConfig, logger *logrus.Logger) *BackupManager {
    return &BackupManager{
        config: config,
        cron:   cron.New(),
        logger: logger,
    }
}

// 启动备份调度
func (bm *BackupManager) Start() error {
    // 添加定时任务
    _, err := bm.cron.AddFunc(bm.config.Schedule, bm.performBackup)
    if err != nil {
        return fmt.Errorf("failed to add backup schedule: %w", err)
    }
    
    bm.cron.Start()
    bm.logger.Info("Backup manager started")
    
    return nil
}

// 停止备份调度
func (bm *BackupManager) Stop() {
    bm.cron.Stop()
    bm.logger.Info("Backup manager stopped")
}

// 执行备份
func (bm *BackupManager) performBackup() {
    ctx, cancel := context.WithTimeout(context.Background(), bm.config.Timeout)
    defer cancel()
    
    timestamp := time.Now().Format("20060102_150405")
    backupFile := filepath.Join(bm.config.BackupDir, fmt.Sprintf("backup_%s.sql", timestamp))
    
    bm.logger.Info("Starting database backup")
    
    // 执行pg_dump命令
    cmd := exec.CommandContext(ctx, "pg_dump", bm.config.DatabaseURL, "-f", backupFile)
    if err := cmd.Run(); err != nil {
        bm.logger.WithError(err).Error("Backup failed")
        bm.notifyFailure(err)
        return
    }
    
    // 上传到S3(如果配置了)
    if bm.config.S3Bucket != "" {
        if err := bm.uploadToS3(backupFile); err != nil {
            bm.logger.WithError(err).Error("Failed to upload backup to S3")
        }
    }
    
    // 清理旧备份
    bm.cleanupOldBackups()
    
    bm.logger.Info("Backup completed successfully")
    bm.notifySuccess(backupFile)
}

// 上传到S3
func (bm *BackupManager) uploadToS3(filePath string) error {
    // S3上传逻辑
    return nil
}

// 清理旧备份
func (bm *BackupManager) cleanupOldBackups() {
    cutoff := time.Now().AddDate(0, 0, -bm.config.RetentionDays)
    
    filepath.Walk(bm.config.BackupDir, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        
        if !info.IsDir() && info.ModTime().Before(cutoff) {
            if err := os.Remove(path); err != nil {
                bm.logger.WithError(err).Errorf("Failed to remove old backup: %s", path)
            } else {
                bm.logger.Infof("Removed old backup: %s", path)
            }
        }
        
        return nil
    })
}

// 通知成功
func (bm *BackupManager) notifySuccess(backupFile string) {
    if bm.config.NotifyWebhook == "" {
        return
    }
    
    message := fmt.Sprintf("Backup completed successfully: %s", backupFile)
    bm.sendNotification(message, "success")
}

// 通知失败
func (bm *BackupManager) notifyFailure(err error) {
    if bm.config.NotifyWebhook == "" {
        return
    }
    
    message := fmt.Sprintf("Backup failed: %s", err.Error())
    bm.sendNotification(message, "error")
}

// 发送通知
func (bm *BackupManager) sendNotification(message, level string) {
    // 发送Webhook通知的逻辑
    bm.logger.Infof("Notification sent: %s", message)
}

13.7.5 恢复管理器

恢复管理器设计

// internal/recovery/manager.go
package recovery

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
	"time"
)

// RecoveryType 恢复类型
type RecoveryType string

const (
	FullRecovery         RecoveryType = "full"
	PointInTimeRecovery  RecoveryType = "point_in_time"
	PartialRecovery      RecoveryType = "partial"
	DisasterRecovery     RecoveryType = "disaster"
)

// RecoveryConfig 恢复配置
type RecoveryConfig struct {
	Type            RecoveryType  `json:"type"`
	BackupPath      string        `json:"backup_path"`
	TargetTime      *time.Time    `json:"target_time,omitempty"`
	TargetDatabase  string        `json:"target_database"`
	Tables          []string      `json:"tables,omitempty"`
	VerifyIntegrity bool          `json:"verify_integrity"`
	Timeout         time.Duration `json:"timeout"`
}

// RecoveryResult 恢复结果
type RecoveryResult struct {
	ID               string        `json:"id"`
	Type             RecoveryType  `json:"type"`
	StartTime        time.Time     `json:"start_time"`
	EndTime          time.Time     `json:"end_time"`
	Duration         time.Duration `json:"duration"`
	Status           string        `json:"status"`
	Error            string        `json:"error,omitempty"`
	RecoveredTables  []string      `json:"recovered_tables"`
	RecoveredRecords int64         `json:"recovered_records"`
	IntegrityCheck   bool          `json:"integrity_check"`
}

// RecoveryManager 恢复管理器
type RecoveryManager struct {
	backupDir string
	results   []RecoveryResult
}

// NewRecoveryManager 创建恢复管理器
func NewRecoveryManager(backupDir string) *RecoveryManager {
	return &RecoveryManager{
		backupDir: backupDir,
		results:   make([]RecoveryResult, 0),
	}
}

// ExecuteRecovery 执行恢复
func (rm *RecoveryManager) ExecuteRecovery(ctx context.Context, config RecoveryConfig) (*RecoveryResult, error) {
	result := &RecoveryResult{
		ID:        generateRecoveryID(),
		Type:      config.Type,
		StartTime: time.Now(),
		Status:    "running",
	}
	
	// 设置超时
	if config.Timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, config.Timeout)
		defer cancel()
	}
	
	// 根据恢复类型执行不同的恢复策略
	var err error
	switch config.Type {
	case FullRecovery:
		err = rm.performFullRecovery(ctx, config, result)
	case PointInTimeRecovery:
		err = rm.performPointInTimeRecovery(ctx, config, result)
	case PartialRecovery:
		err = rm.performPartialRecovery(ctx, config, result)
	case DisasterRecovery:
		err = rm.performDisasterRecovery(ctx, config, result)
	default:
		err = fmt.Errorf("unsupported recovery type: %s", config.Type)
	}
	
	result.EndTime = time.Now()
	result.Duration = result.EndTime.Sub(result.StartTime)
	
	if err != nil {
		result.Status = "failed"
		result.Error = err.Error()
	} else {
		result.Status = "completed"
		
		// 执行完整性检查
		if config.VerifyIntegrity {
			result.IntegrityCheck = rm.verifyIntegrity(config.TargetDatabase)
		}
	}
	
	rm.results = append(rm.results, *result)
	return result, err
}

// performFullRecovery 执行完全恢复
func (rm *RecoveryManager) performFullRecovery(ctx context.Context, config RecoveryConfig, result *RecoveryResult) error {
	// 查找最新的备份文件
	backupFile, err := rm.findLatestBackup()
	if err != nil {
		return fmt.Errorf("find latest backup: %w", err)
	}
	
	// 执行恢复
	return rm.restoreFromBackup(ctx, backupFile, config.TargetDatabase)
}

// performPointInTimeRecovery 执行时间点恢复
func (rm *RecoveryManager) performPointInTimeRecovery(ctx context.Context, config RecoveryConfig, result *RecoveryResult) error {
	if config.TargetTime == nil {
		return fmt.Errorf("target time is required for point-in-time recovery")
	}
	
	// 查找目标时间点之前的最新备份
	backupFile, err := rm.findBackupBeforeTime(*config.TargetTime)
	if err != nil {
		return fmt.Errorf("find backup before time: %w", err)
	}
	
	// 执行基础恢复
	if err := rm.restoreFromBackup(ctx, backupFile, config.TargetDatabase); err != nil {
		return err
	}
	
	// 应用WAL日志到目标时间点
	return rm.applyWALToTime(ctx, config.TargetDatabase, *config.TargetTime)
}

// findLatestBackup 查找最新备份
func (rm *RecoveryManager) findLatestBackup() (string, error) {
	files, err := filepath.Glob(filepath.Join(rm.backupDir, "backup_*.sql"))
	if err != nil {
		return "", err
	}
	
	if len(files) == 0 {
		return "", fmt.Errorf("no backup files found")
	}
	
	// 按文件名排序(包含时间戳)
	sort.Strings(files)
	return files[len(files)-1], nil
}

// generateRecoveryID 生成恢复ID
func generateRecoveryID() string {
	return fmt.Sprintf("recovery_%d", time.Now().Unix())
}

13.7.6 灾难恢复流程

灾难恢复计划

graph TB
    A[灾难发生] --> B[评估影响范围]
    B --> C{数据中心可用?}
    C -->|是| D[本地恢复]
    C -->|否| E[异地恢复]
    
    D --> F[启动备用系统]
    E --> G[激活灾备中心]
    
    F --> H[恢复数据库]
    G --> H
    
    H --> I[恢复应用服务]
    I --> J[验证系统功能]
    J --> K[切换用户流量]
    K --> L[监控系统状态]
    L --> M[恢复完成]

图3:灾难恢复流程图

灾难恢复自动化脚本

#!/bin/bash
# scripts/disaster-recovery.sh

set -e

# 配置参数
DR_SITE_HOST="dr.example.com"
DR_DATABASE_URL="postgresql://user:pass@dr-db:5432/newapi"
DR_BACKUP_PATH="/dr/backups"
HEALTH_CHECK_URL="http://dr.example.com/health"
DNS_FAILOVER_SCRIPT="/scripts/dns-failover.sh"

# 日志函数
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a /var/log/disaster-recovery.log
}

# 检查灾备站点状态
check_dr_site() {
    log "Checking disaster recovery site status..."
    
    if curl -f -s "$HEALTH_CHECK_URL" > /dev/null; then
        log "DR site is healthy"
        return 0
    else
        log "DR site is not responding"
        return 1
    fi
}

# 激活灾备站点
activate_dr_site() {
    log "Activating disaster recovery site..."
    
    # 启动灾备数据库
    ssh "$DR_SITE_HOST" "docker-compose -f /opt/newapi/docker-compose-dr.yml up -d db"
    
    # 等待数据库启动
    sleep 30
    
    # 恢复最新备份
    LATEST_BACKUP=$(ssh "$DR_SITE_HOST" "ls -t $DR_BACKUP_PATH/backup_*.sql | head -1")
    if [ -n "$LATEST_BACKUP" ]; then
        log "Restoring from backup: $LATEST_BACKUP"
        ssh "$DR_SITE_HOST" "pg_restore -d '$DR_DATABASE_URL' '$LATEST_BACKUP'"
    else
        log "No backup found for restoration"
        exit 1
    fi
    
    # 启动应用服务
    ssh "$DR_SITE_HOST" "docker-compose -f /opt/newapi/docker-compose-dr.yml up -d app"
    
    # 等待应用启动
    sleep 60
    
    log "DR site activated successfully"
}

# DNS故障转移
perform_dns_failover() {
    log "Performing DNS failover..."
    
    if [ -x "$DNS_FAILOVER_SCRIPT" ]; then
        "$DNS_FAILOVER_SCRIPT" "$DR_SITE_HOST"
        log "DNS failover completed"
    else
        log "DNS failover script not found or not executable"
    fi
}

# 验证恢复结果
verify_recovery() {
    log "Verifying disaster recovery..."
    
    # 检查应用健康状态
    for i in {1..10}; do
        if curl -f -s "$HEALTH_CHECK_URL" > /dev/null; then
            log "Application is healthy after recovery"
            return 0
        fi
        log "Waiting for application to become healthy... ($i/10)"
        sleep 30
    done
    
    log "Application health check failed after recovery"
    return 1
}

# 主流程
main() {
    log "Starting disaster recovery process..."
    
    # 检查灾备站点
    if ! check_dr_site; then
        log "DR site check failed, attempting to activate..."
        activate_dr_site
    fi
    
    # 执行DNS故障转移
    perform_dns_failover
    
    # 验证恢复结果
    if verify_recovery; then
        log "Disaster recovery completed successfully"
        exit 0
    else
        log "Disaster recovery failed"
        exit 1
    fi
}

# 执行主流程
main "$@"
// runScheduledBackup 定时备份入口(备份→清理→通知的完整流程,调用下方BackupDatabase)
func (bm *BackupManager) runScheduledBackup() {
    ctx, cancel := context.WithTimeout(context.Background(), bm.config.Timeout)
    defer cancel()
    
    bm.logger.Info("Starting scheduled backup")
    
    if err := bm.BackupDatabase(ctx); err != nil {
        bm.logger.WithError(err).Error("Backup failed")
        bm.notifyFailure(err)
        return
    }
    
    if err := bm.CleanupOldBackups(); err != nil {
        bm.logger.WithError(err).Warn("Failed to cleanup old backups")
    }
    
    bm.logger.Info("Backup completed successfully")
    bm.notifySuccess()
}

// 备份数据库
func (bm *BackupManager) BackupDatabase(ctx context.Context) error {
    // 创建备份目录
    if err := os.MkdirAll(bm.config.BackupDir, 0755); err != nil {
        return fmt.Errorf("failed to create backup directory: %w", err)
    }
    
    // 生成备份文件名
    timestamp := time.Now().Format("20060102_150405")
    backupFile := filepath.Join(bm.config.BackupDir, fmt.Sprintf("newapi_backup_%s.sql", timestamp))
    compressedFile := backupFile + ".gz"
    
    // 执行pg_dump
    cmd := exec.CommandContext(ctx, "pg_dump", bm.config.DatabaseURL,
        "--verbose", "--clean", "--no-owner", "--no-privileges",
        "--format=custom", "--file="+backupFile)
    
    if err := cmd.Run(); err != nil {
        return fmt.Errorf("pg_dump failed: %w", err)
    }
    
    // 压缩备份文件
    if err := bm.compressFile(backupFile, compressedFile); err != nil {
        return fmt.Errorf("failed to compress backup: %w", err)
    }
    
    // 删除未压缩文件
    os.Remove(backupFile)
    
    // 上传到S3(如果配置了)
    if bm.config.S3Bucket != "" {
        if err := bm.uploadToS3(compressedFile); err != nil {
            bm.logger.WithError(err).Warn("Failed to upload backup to S3")
        }
    }
    
    return nil
}

// 压缩文件
func (bm *BackupManager) compressFile(src, dst string) error {
    cmd := exec.Command("gzip", "-c", src)
    
    output, err := os.Create(dst)
    if err != nil {
        return err
    }
    defer output.Close()
    
    cmd.Stdout = output
    return cmd.Run()
}

// 上传到S3
func (bm *BackupManager) uploadToS3(filePath string) error {
    fileName := filepath.Base(filePath)
    s3Key := fmt.Sprintf("database-backups/%s", fileName)
    
    cmd := exec.Command("aws", "s3", "cp", filePath, fmt.Sprintf("s3://%s/%s", bm.config.S3Bucket, s3Key))
    return cmd.Run()
}

// 清理旧备份
func (bm *BackupManager) CleanupOldBackups() error {
    cutoff := time.Now().AddDate(0, 0, -bm.config.RetentionDays)
    
    return filepath.Walk(bm.config.BackupDir, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        
        if !info.IsDir() && info.ModTime().Before(cutoff) {
            bm.logger.WithField("file", path).Info("Removing old backup")
            return os.Remove(path)
        }
        
        return nil
    })
}

// 通知成功
func (bm *BackupManager) notifySuccess() {
    if bm.config.NotifyWebhook != "" {
        // 发送成功通知
        // 实现webhook通知逻辑
    }
}

// 通知失败
func (bm *BackupManager) notifyFailure(err error) {
    if bm.config.NotifyWebhook != "" {
        // 发送失败通知
        // 实现webhook通知逻辑
    }
}

13.8 性能优化与调优

13.8.1 性能优化概述

性能优化策略

性能优化是一个系统性工程,需要从多个维度进行考虑:

graph TB
    A[性能优化] --> B[应用层优化]
    A --> C[数据库优化]
    A --> D[系统层优化]
    A --> E[网络优化]
    
    B --> B1[代码优化]
    B --> B2[内存管理]
    B --> B3[并发优化]
    B --> B4[缓存策略]
    
    C --> C1[查询优化]
    C --> C2[索引优化]
    C --> C3[连接池]
    C --> C4[分库分表]
    
    D --> D1[CPU优化]
    D --> D2[内存优化]
    D --> D3[IO优化]
    D --> D4[容器优化]
    
    E --> E1[负载均衡]
    E --> E2[CDN加速]
    E --> E3[压缩传输]
    E --> E4[连接复用]

性能优化原则

  1. 测量驱动优化:先测量,后优化

  2. 找到瓶颈:识别真正的性能瓶颈

  3. 渐进式优化:逐步优化,避免过度优化

  4. 权衡取舍:在性能、可维护性、复杂度之间平衡

性能监控体系

// internal/performance/monitor.go
package performance

import (
	"context"
	"runtime"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/sirupsen/logrus"
)

// PerformanceConfig 性能配置
type PerformanceConfig struct {
	EnableMetrics     bool          `json:"enable_metrics"`
	MetricsInterval   time.Duration `json:"metrics_interval"`
	EnableProfiling   bool          `json:"enable_profiling"`
	ProfilingPort     int           `json:"profiling_port"`
	MemoryThreshold   int64         `json:"memory_threshold"`
	GoroutineThreshold int          `json:"goroutine_threshold"`
}

// SystemMetrics 系统指标
type SystemMetrics struct {
	CPUUsage      float64 `json:"cpu_usage"`
	MemoryUsage   int64   `json:"memory_usage"`
	GoroutineCount int    `json:"goroutine_count"`
	GCPauseTime   time.Duration `json:"gc_pause_time"`
	HeapSize      int64   `json:"heap_size"`
	StackSize     int64   `json:"stack_size"`
}

// PerformanceAlert 性能告警
type PerformanceAlert struct {
	Type      string    `json:"type"`
	Level     string    `json:"level"`
	Message   string    `json:"message"`
	Value     float64   `json:"value"`
	Threshold float64   `json:"threshold"`
	Timestamp time.Time `json:"timestamp"`
}

13.8.2 应用性能优化

package performance

import (
    "context"
    "runtime"
    "strconv"
    "time"
    
    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/sirupsen/logrus"
)

// 性能监控器
type PerformanceMonitor struct {
    logger  *logrus.Logger
    metrics *PerformanceMetrics
}

// 性能指标
type PerformanceMetrics struct {
    RequestDuration   *prometheus.HistogramVec
    RequestCount      *prometheus.CounterVec
    ActiveConnections prometheus.Gauge
    MemoryUsage       prometheus.Gauge
    GoroutineCount    prometheus.Gauge
    GCDuration        prometheus.Histogram
}

// 创建性能监控器
func NewPerformanceMonitor(logger *logrus.Logger) *PerformanceMonitor {
    metrics := &PerformanceMetrics{
        RequestDuration: prometheus.NewHistogramVec(
            prometheus.HistogramOpts{
                Name:    "http_request_duration_seconds",
                Help:    "HTTP request duration in seconds",
                Buckets: prometheus.DefBuckets,
            },
            []string{"method", "endpoint", "status"},
        ),
        RequestCount: prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "http_requests_total",
                Help: "Total number of HTTP requests",
            },
            []string{"method", "endpoint", "status"},
        ),
        ActiveConnections: prometheus.NewGauge(
            prometheus.GaugeOpts{
                Name: "active_connections",
                Help: "Number of active connections",
            },
        ),
        MemoryUsage: prometheus.NewGauge(
            prometheus.GaugeOpts{
                Name: "memory_usage_bytes",
                Help: "Current memory usage in bytes",
            },
        ),
        GoroutineCount: prometheus.NewGauge(
            prometheus.GaugeOpts{
                Name: "goroutine_count",
                Help: "Number of goroutines",
            },
        ),
        GCDuration: prometheus.NewHistogram(
            prometheus.HistogramOpts{
                Name:    "gc_duration_seconds",
                Help:    "Garbage collection duration in seconds",
                Buckets: prometheus.DefBuckets,
            },
        ),
    }
    
    // 注册指标
    prometheus.MustRegister(
        metrics.RequestDuration,
        metrics.RequestCount,
        metrics.ActiveConnections,
        metrics.MemoryUsage,
        metrics.GoroutineCount,
        metrics.GCDuration,
    )
    
    return &PerformanceMonitor{
        logger:  logger,
        metrics: metrics,
    }
}

// 启动性能监控
func (pm *PerformanceMonitor) Start(ctx context.Context) {
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            pm.collectMetrics()
        }
    }
}

// 收集指标
func (pm *PerformanceMonitor) collectMetrics() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    
    // 更新内存使用指标
    pm.metrics.MemoryUsage.Set(float64(m.Alloc))
    
    // 更新协程数量指标
    pm.metrics.GoroutineCount.Set(float64(runtime.NumGoroutine()))
    
    // 记录GC信息
    if m.NumGC > 0 {
        gcDuration := time.Duration(m.PauseNs[(m.NumGC+255)%256])
        pm.metrics.GCDuration.Observe(gcDuration.Seconds())
    }
}

// HTTP中间件
func (pm *PerformanceMonitor) HTTPMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        start := time.Now()
        
        // 增加活跃连接数
        pm.metrics.ActiveConnections.Inc()
        defer pm.metrics.ActiveConnections.Dec()
        
        c.Next()
        
        // 记录请求指标
        duration := time.Since(start)
        status := c.Writer.Status()
        // 注意:不能写成string(rune(status)),那会得到码点对应的字符而不是"200"
        statusLabel := strconv.Itoa(status)
        
        pm.metrics.RequestDuration.WithLabelValues(
            c.Request.Method,
            c.FullPath(),
            statusLabel,
        ).Observe(duration.Seconds())
        
        pm.metrics.RequestCount.WithLabelValues(
            c.Request.Method,
            c.FullPath(),
            statusLabel,
        ).Inc()
        
        // 记录慢请求
        if duration > 1*time.Second {
            pm.logger.WithFields(logrus.Fields{
                "method":   c.Request.Method,
                "path":     c.Request.URL.Path,
                "duration": duration,
                "status":   status,
            }).Warn("Slow request detected")
        }
    }
}

13.8.3 数据库性能优化

索引优化策略

-- 数据库性能优化脚本
-- scripts/optimize-database.sql

-- 创建复合索引
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email_status ON users(email, status) WHERE status = 1;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_created_at ON users(created_at DESC);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tokens_user_id_status ON tokens(user_id, status) WHERE status = 1;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tokens_created_at ON tokens(created_at DESC);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_channels_status_type ON channels(status, type);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_logs_user_id_created_at ON logs(user_id, created_at DESC);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_logs_created_at_type ON logs(created_at DESC, type);

-- 部分索引(提高效率)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_active_users ON users(id) WHERE status = 1;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_active_channels ON channels(id) WHERE status = 1;

-- 表达式索引
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email_lower ON users(LOWER(email));

-- 分区表设计
CREATE TABLE IF NOT EXISTS logs_partitioned (
    LIKE logs INCLUDING ALL
) PARTITION BY RANGE (created_at);

-- 自动创建分区的函数
CREATE OR REPLACE FUNCTION create_monthly_partition(table_name text, start_date date)
RETURNS void AS $$
DECLARE
    partition_name text;
    end_date date;
BEGIN
    partition_name := table_name || '_' || to_char(start_date, 'YYYY_MM');
    end_date := start_date + interval '1 month';
    
    EXECUTE format('CREATE TABLE IF NOT EXISTS %I PARTITION OF %I FOR VALUES FROM (%L) TO (%L)',
                   partition_name, table_name, start_date, end_date);
END;
$$ LANGUAGE plpgsql;

-- 创建最近几个月的分区
SELECT create_monthly_partition('logs_partitioned', date_trunc('month', CURRENT_DATE - interval '1 month'));
SELECT create_monthly_partition('logs_partitioned', date_trunc('month', CURRENT_DATE));
SELECT create_monthly_partition('logs_partitioned', date_trunc('month', CURRENT_DATE + interval '1 month'));

-- 更新表统计信息
ANALYZE users;
ANALYZE tokens;
ANALYZE channels;
ANALYZE logs;

-- 查询优化建议
-- 1. 避免SELECT *
-- 2. 使用LIMIT限制结果集
-- 3. 合理使用JOIN
-- 4. 避免在WHERE子句中使用函数

连接池优化

// internal/database/pool.go
package database

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq"
)

// PoolConfig 连接池配置
type PoolConfig struct {
	MaxOpenConns    int           `json:"max_open_conns"`
	MaxIdleConns    int           `json:"max_idle_conns"`
	ConnMaxLifetime time.Duration `json:"conn_max_lifetime"`
	ConnMaxIdleTime time.Duration `json:"conn_max_idle_time"`
}

// OptimizeConnectionPool 优化连接池
func OptimizeConnectionPool(db *sql.DB, config PoolConfig) {
	// 设置最大打开连接数
	// 建议值:CPU核心数 * 2
	db.SetMaxOpenConns(config.MaxOpenConns)
	
	// 设置最大空闲连接数
	// 建议值:MaxOpenConns的一半
	db.SetMaxIdleConns(config.MaxIdleConns)
	
	// 设置连接最大生存时间
	// 建议值:5-10分钟
	db.SetConnMaxLifetime(config.ConnMaxLifetime)
	
	// 设置连接最大空闲时间
	// 建议值:1-2分钟
	db.SetConnMaxIdleTime(config.ConnMaxIdleTime)
}

// GetOptimalPoolConfig 获取最优连接池配置
func GetOptimalPoolConfig(cpuCores int) PoolConfig {
	return PoolConfig{
		MaxOpenConns:    cpuCores * 2,
		MaxIdleConns:    cpuCores,
		ConnMaxLifetime: 5 * time.Minute,
		ConnMaxIdleTime: 1 * time.Minute,
	}
}

13.8.4 系统调优

容器资源优化

# docker-compose.performance.yml
version: '3.8'

services:
  app:
    image: newapi:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G
    environment:
      - GOGC=100
      - GOMEMLIMIT=1800MiB
      - GOMAXPROCS=2
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
    sysctls:
      - net.core.somaxconn=65535
      - net.ipv4.tcp_keepalive_time=600
      - net.ipv4.tcp_keepalive_intvl=60
      - net.ipv4.tcp_keepalive_probes=3

  db:
    image: postgres:15
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G
    environment:
      - POSTGRES_SHARED_BUFFERS=1GB
      - POSTGRES_EFFECTIVE_CACHE_SIZE=3GB
      - POSTGRES_WORK_MEM=64MB
      - POSTGRES_MAINTENANCE_WORK_MEM=256MB
    command: >
      postgres
      -c shared_buffers=1GB
      -c effective_cache_size=3GB
      -c work_mem=64MB
      -c maintenance_work_mem=256MB
      -c max_connections=200
      -c random_page_cost=1.1
      -c effective_io_concurrency=200
      -c checkpoint_completion_target=0.9
      -c wal_buffers=16MB
      -c default_statistics_target=100

  redis:
    image: redis:7-alpine
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M
    command: >
      redis-server
      --maxmemory 800mb
      --maxmemory-policy allkeys-lru
      --save 900 1
      --save 300 10
      --save 60 10000

Go应用调优

// internal/tuning/optimizer.go
package tuning

import (
	"os"
	"runtime"
	"runtime/debug"
	"strconv"
	"time"
)

// TuningConfig 调优配置
type TuningConfig struct {
	GOGC           int           `json:"gogc"`
	GOMAXPROCS     int           `json:"gomaxprocs"`
	GCPercent      int           `json:"gc_percent"`
	MemoryLimit    int64         `json:"memory_limit"`
	ReadTimeout    time.Duration `json:"read_timeout"`
	WriteTimeout   time.Duration `json:"write_timeout"`
	IdleTimeout    time.Duration `json:"idle_timeout"`
}

// ApplyOptimizations 应用优化配置
func ApplyOptimizations(config TuningConfig) {
	// 设置GC目标百分比
	if config.GCPercent > 0 {
		debug.SetGCPercent(config.GCPercent)
	}
	
	// 设置内存限制
	if config.MemoryLimit > 0 {
		debug.SetMemoryLimit(config.MemoryLimit)
	}
	
	// 设置最大处理器数
	if config.GOMAXPROCS > 0 {
		runtime.GOMAXPROCS(config.GOMAXPROCS)
	}
	
	// 从环境变量读取配置作为兜底
	// 注:Go 1.19+的运行时启动时会自行解析GOGC与GOMEMLIMIT
	// (GOMEMLIMIT还支持MiB/GiB等后缀);这里的手动解析只接受
	// 纯字节数,诸如"1800MiB"的取值会因解析失败而被跳过
	if gogc := os.Getenv("GOGC"); gogc != "" {
		if val, err := strconv.Atoi(gogc); err == nil {
			debug.SetGCPercent(val)
		}
	}
	
	if gomemlimit := os.Getenv("GOMEMLIMIT"); gomemlimit != "" {
		if val, err := strconv.ParseInt(gomemlimit, 10, 64); err == nil {
			debug.SetMemoryLimit(val)
		}
	}
}

// GetRecommendedConfig 获取推荐配置
func GetRecommendedConfig() TuningConfig {
	cpuCount := runtime.NumCPU()
	// 注:容器内NumCPU可能返回宿主机核数而非cgroup CPU配额,
	// 需要时可用uber-go/automaxprocs按配额设置GOMAXPROCS
	
	return TuningConfig{
		GOGC:         100, // 默认值
		GOMAXPROCS:   cpuCount,
		GCPercent:    100,
		MemoryLimit:  0, // 由GOMEMLIMIT环境变量控制
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 30 * time.Second,
		IdleTimeout:  120 * time.Second,
	}
}

// MonitorGCStats 读取GC统计信息,供上报到监控系统
func MonitorGCStats() debug.GCStats {
	var stats debug.GCStats
	debug.ReadGCStats(&stats)
	// stats.NumGC、stats.PauseTotal、stats.Pause等字段
	// 可作为指标导出(例如Prometheus Gauge)
	return stats
}

性能调优脚本

#!/bin/bash
# scripts/performance-tuning.sh

set -e

echo "Starting performance tuning..."

# 系统参数优化
echo "Optimizing system parameters..."

# 增加文件描述符限制(幂等:已存在则不重复追加)
if ! grep -q "soft nofile 65536" /etc/security/limits.conf; then
    echo "* soft nofile 65536" >> /etc/security/limits.conf
    echo "* hard nofile 65536" >> /etc/security/limits.conf
fi

# 网络参数优化(sysctl -w仅当次生效;持久化需写入/etc/sysctl.d/并执行sysctl --system)
sysctl -w net.core.somaxconn=65535
sysctl -w net.core.netdev_max_backlog=5000
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=3
sysctl -w net.ipv4.tcp_fin_timeout=30

# 内存参数优化
sysctl -w vm.swappiness=10
sysctl -w vm.dirty_ratio=15
sysctl -w vm.dirty_background_ratio=5

# Docker优化
echo "Optimizing Docker..."

# 设置Docker daemon配置
cat > /etc/docker/daemon.json << EOF
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ],
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 65536,
      "Soft": 65536
    }
  }
}
EOF

# 重启Docker服务
systemctl restart docker

echo "Performance tuning completed!"

13.9 本章小结

本章深入探讨了Go企业级应用的部署与运维实践,涵盖了从容器化部署到性能优化的完整运维体系。通过New-API项目的实际案例,我们学习了:

核心知识点

  1. 容器化部署:掌握了Docker容器化的最佳实践,包括多阶段构建、镜像优化和安全配置

  2. 编排与调度:学习了Kubernetes集群部署、服务发现、负载均衡和自动扩缩容

  3. 配置管理:了解了配置文件管理、环境变量配置和敏感信息保护

  4. CI/CD流水线:构建了完整的持续集成和持续部署流程,包括代码质量检查、自动化测试和部署策略

  5. 监控告警:建立了全方位的监控体系,包括应用监控、基础设施监控和业务监控

  6. 日志管理:实现了集中化日志收集、分析和可视化

  7. 备份恢复:设计了完善的数据备份策略和灾难恢复方案

  8. 性能优化:从应用层、数据库层和系统层进行全面的性能调优

技术要点

  • 容器技术:Docker、Kubernetes、Helm等容器生态工具

  • 监控工具:Prometheus、Grafana、AlertManager等监控组件

  • 日志系统:ELK Stack(Elasticsearch、Logstash、Kibana)

  • CI/CD工具:GitHub Actions、GitLab CI、Jenkins等

  • 数据库优化:索引优化、查询优化、连接池配置

  • 系统调优:资源限制、网络优化、内核参数调整

最佳实践

  1. 基础设施即代码:使用声明式配置管理基础设施

  2. 监控驱动运维:建立完善的监控指标和告警机制

  3. 自动化优先:尽可能自动化运维流程,减少人工干预

  4. 安全第一:在部署和运维的每个环节都要考虑安全因素

  5. 渐进式优化:基于监控数据进行渐进式性能优化

  6. 文档化管理:完善的运维文档和操作手册

13.10 练习题

基础练习

  1. 容器化部署

    • 为New-API项目编写一个优化的Dockerfile

    • 创建docker-compose.yml文件,包含应用、数据库和Redis

    • 实现多环境配置管理(开发、测试、生产)

  2. Kubernetes部署

    • 编写Kubernetes部署清单文件

    • 配置Service和Ingress

    • 实现ConfigMap和Secret管理

  3. 监控配置

    • 配置Prometheus监控New-API应用

    • 创建Grafana仪表板

    • 设置关键指标的告警规则

进阶练习

  1. CI/CD流水线

    • 设计完整的CI/CD流水线

    • 实现自动化测试和部署

    • 配置多环境部署策略

  2. 性能优化

    • 分析New-API的性能瓶颈

    • 优化数据库查询和索引

    • 调优Go应用的内存和GC参数

  3. 高可用架构

    • 设计New-API的高可用部署架构

    • 实现数据库主从复制

    • 配置负载均衡和故障转移

综合项目

  1. 完整运维体系

    • 为New-API构建完整的运维体系

    • 包括部署、监控、日志、备份、性能优化

    • 编写运维文档和应急预案

13.11 扩展阅读

技术书籍

  1. 《Kubernetes权威指南》 - 龚正等著,电子工业出版社

    • 深入理解Kubernetes的架构和实践

    • ISBN: 978-7-121-31682-8

  2. 《Docker技术入门与实战》 - 杨保华等著,机械工业出版社

    • 全面掌握Docker容器技术

    • ISBN: 978-7-111-58804-6

  3. 《SRE:Google运维解密》 - Betsy Beyer等著,电子工业出版社

    • 学习Google的运维理念和实践

    • ISBN: 978-7-121-29094-4

  4. 《高性能MySQL》 - Baron Schwartz等著,电子工业出版社

    • 数据库性能优化的经典之作

    • ISBN: 978-7-121-19885-4

  5. 《Go语言高级编程》 - 柴树杉等著,人民邮电出版社

    • Go语言性能优化和最佳实践

    • ISBN: 978-7-115-49491-9

在线资源

  1. 云原生计算基金会(CNCF)

社区资源

  1. 技术会议

    • KubeCon + CloudNativeCon

    • DockerCon

    • GopherCon

通过本章的学习和实践,读者应该能够掌握Go企业级应用的完整部署与运维体系,为实际项目的生产环境部署打下坚实的基础。运维是一个持续改进的过程,需要结合实际业务场景,不断优化和完善运维体系。
