Observability

Question

Background: In distributed systems, quickly locating and resolving problems is a key challenge. Observability helps development and operations teams understand a system's internal state through three pillars: metrics, logs, and distributed tracing.

Questions

  1. What is observability, and how does it differ from monitoring?
  2. What roles do the three pillars (metrics, logs, tracing) play?
  3. How is a Prometheus + Grafana monitoring architecture designed?
  4. How do you build an ELK (Elasticsearch, Logstash, Kibana) logging stack?
  5. How does distributed tracing (Jaeger/Zipkin) work?
  6. How do you design monitoring alert rules?
  7. How do you implement full-link (end-to-end) tracing?
  8. How do you locate performance bottlenecks?
  9. How do you design a metrics system?
  10. How do you roll out observability in a real project?

Reference Answers

1. Observability Overview

Definitions

Observability:
The ability to infer a system's internal state from its external outputs (metrics, logs, traces).

Monitoring:
Checking whether a system is running normally against predefined indicators.

Comparison

Monitoring
├─ Actively polls system state (predefined rules)
├─ Focuses on known problems (e.g., CPU usage > 80%)
└─ Limitation: cannot surface unknown problems

Observability
├─ Passively collects system outputs (data driven)
├─ Can surface unknown problems
└─ Supports root cause analysis

The Three Pillars

1. Metrics (numeric data)
   - Counter: request count, error count
   - Gauge: CPU usage, memory usage
   - Histogram: request latency distribution

2. Logs (discrete events)
   - Application logs: error logs, debug logs
   - Access logs: Nginx access.log
   - Audit logs: operation records

3. Traces (request paths)
   - Trace: one complete request from client to backend
   - Span: the processing done by a single service
   - Span ID / Trace ID: correlation identifiers

2. The Three Pillars in Detail

Metrics

# Prometheus metric examples
# 1. Counter (monotonically increasing)
http_requests_total{method="GET",path="/api/users",status="200"} 12345

# 2. Gauge (can go up or down)
memory_usage_bytes{instance="localhost:8080"} 1073741824
cpu_usage_percent{instance="localhost:8080"} 45.2

# 3. Histogram (distribution)
http_request_duration_seconds_bucket{le="0.1"} 5000
http_request_duration_seconds_bucket{le="0.5"} 9500
http_request_duration_seconds_bucket{le="+Inf"} 10000

Code example (Spring Boot Actuator / Micrometer)

@RestController
public class UserController {

    private final UserService userService;
    private final Counter requestCounter;

    public UserController(UserService userService, MeterRegistry registry) {
        this.userService = userService;
        this.requestCounter = Counter.builder("http.requests.total")
            .tag("method", "GET")
            .tag("path", "/api/users")
            .register(registry);

        // Gauge sampling current heap usage (total - free)
        Gauge.builder("jvm.memory.used", Runtime.getRuntime(),
                r -> r.totalMemory() - r.freeMemory())
            .register(registry);
    }

    @GetMapping("/api/users")
    public List<User> getUsers() {
        requestCounter.increment();
        return userService.findAll();
    }
}

Logs

// Structured logging (JSON format). This sketch assumes the
// logstash-logback-encoder library, whose StructuredArguments become
// JSON fields without appearing in the message text.
import static net.logstash.logback.argument.StructuredArguments.kv;

@Slf4j
@RestController
public class UserController {

    private final UserService userService;

    public UserController(UserService userService) {
        this.userService = userService;
    }

    @GetMapping("/api/users/{id}")
    public User getUserById(@PathVariable Long id) {
        // traceId/spanId come from the MDC, populated by a tracing filter
        log.info("Get user by id",
            kv("userId", id),
            kv("traceId", MDC.get("traceId")),
            kv("spanId", MDC.get("spanId")));

        User user = userService.findById(id);

        if (user == null) {
            log.warn("User not found",
                kv("userId", id),
                kv("traceId", MDC.get("traceId")));
            throw new UserNotFoundException(id);
        }

        return user;
    }
}

// Log output
{
  "timestamp": "2024-01-01T10:00:00Z",
  "level": "INFO",
  "logger": "com.example.UserController",
  "message": "Get user by id",
  "userId": 123,
  "traceId": "a1b2c3d4e5f6g7h8",
  "spanId": "i9j0k1l2m3n4o5p6",
  "thread": "http-nio-8080-exec-1"
}

Traces

Trace (one complete request):
Client → Gateway → Service A → Service B → Service C
   │         │           │            │            │
   └─────────┴───────────┴────────────┴────────────┘
                    Trace ID: abc123

Span (processing within a single service):
Gateway (Span 1)
  ├─ Service A (Span 2)
  │   └─ Service B (Span 3)
  │       └─ Service C (Span 4)
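
The relationship above boils down to three IDs. A minimal, hypothetical data model (not any particular tracer's API) showing how spans in one request share a traceId and link to their parent:

// Hypothetical span model: children inherit traceId, point at parent.
import java.util.UUID;

record SpanRecord(String traceId, String spanId, String parentSpanId,
                  String service) {

    // Root span starts a new trace (e.g., at the gateway)
    static SpanRecord root(String service) {
        return new SpanRecord(newId(), newId(), null, service);
    }

    // Child span keeps the traceId and references this span as parent
    SpanRecord child(String service) {
        return new SpanRecord(traceId, newId(), spanId, service);
    }

    private static String newId() {
        return UUID.randomUUID().toString().replace("-", "");
    }
}

// Usage: Gateway (root) → Service A → Service B, all sharing one traceId
// SpanRecord gateway = SpanRecord.root("gateway");
// SpanRecord serviceA = gateway.child("service-a");
// SpanRecord serviceB = serviceA.child("service-b");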

3. Prometheus + Grafana Architecture

Architecture diagram (Prometheus both serves Grafana queries and pushes firing alerts to Alertmanager, matching the configuration below):

                    ┌─────────────────┐
                    │   Applications  │
                    │  ( exporters )  │
                    └────────┬────────┘
                             │ /metrics (pull)
                    ┌────────▼────────┐
                    │   Prometheus    │
                    │ (scrape + TSDB) │
                    └─┬─────────────┬─┘
                query │             │ alerts
                ┌─────▼─────┐   ┌───▼───────────┐
                │  Grafana  │   │ Alertmanager  │
                │(dashboard)│   │(alert routing)│
                └───────────┘   └───┬───────────┘
                                    │ notify
                            ┌───────▼───────┐
                            │ Email/Webhook/│
                            │ DingTalk/WeCom│
                            └───────────────┘

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s  # scrape targets every 15s
  evaluation_interval: 15s  # evaluate alert rules every 15s

# Alert rules
rule_files:
  - "alerts/*.yml"

# Scrape configuration
scrape_configs:
  # Spring Boot Actuator
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:8080']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

# Alerting
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

Spring Boot Integration

<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,info
  metrics:
    export:
      prometheus:
        enabled: true
    tags:
      application: ${spring.application.name}
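
Business-level metrics can be registered through the same MeterRegistry that backs the Actuator endpoint. A small sketch (the OrderMetrics class and the orders.created metric name are illustrative, not from the original):

// Custom business counter; appears at /actuator/prometheus as
// orders_created_total{channel="..."} after Micrometer's Prometheus renaming.
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Service;

@Service
public class OrderMetrics {

    private final MeterRegistry registry;

    public OrderMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    // Called from the order-creation path; tagged by sales channel
    public void recordOrderCreated(String channel) {
        registry.counter("orders.created", "channel", channel).increment();
    }
}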

Grafana Dashboard

{
  "dashboard": {
    "title": "Spring Boot Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count[1m])",
            "legendFormat": "{{method}} {{uri}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Request Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[1m]))",
            "legendFormat": "P95 Latency"
          }
        ],
        "type": "graph"
      },
      {
        "title": "JVM Memory Usage",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"}",
            "legendFormat": "Heap Used"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

4. The ELK Logging Stack

Architecture diagram

                    ┌─────────────────┐
                    │   Applications  │
                    │  (log output)   │
                    └─────────────────┘
                           │
                           │ Filebeat/Logstash
                           │
                    ┌─────────────────┐
                    │    Logstash     │
                    │(log processing) │
                    ├─────────────────┤
                    │ - filter        │
                    │ - transform     │
                    │ - parse         │
                    └─────────────────┘
                           │
                           │
                    ┌─────────────────┐
                    │  Elasticsearch  │
                    │  (log storage)  │
                    └─────────────────┘
                           │
                           │ query
                           ▼
                    ┌─────────────────┐
                    │     Kibana      │
                    │ (visualization) │
                    └─────────────────┘

Logstash Configuration

# logstash.conf
input {
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
    codec => json
  }

  beats {
    port => 5044
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }

  # Extract the timestamp
  date {
    match => ["timestamp", "ISO8601"]
  }

  # Extract the Trace ID
  grok {
    match => {
      "message" => '"traceId":"%{DATA:traceId}"'
    }
  }

  # Add the application name
  mutate {
    add_field => {
      "application" => "my-app"
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "my-app-%{+YYYY.MM.dd}"
  }

  stdout {
    codec => rubydebug
  }
}

Filebeat Configuration

# filebeat.yml (multiline settings belong under the input, not at top level)
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      app: my-app
      env: production
    # Multiline handling (e.g., stack traces): lines that do not start
    # with a date are appended to the previous event
    multiline.type: pattern
    multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
    multiline.negate: true
    multiline.match: after

output.logstash:
  hosts: ["logstash:5044"]

Kibana Queries

# 1. Simple query
level: "ERROR"

# 2. Range query
@timestamp: [now-1h TO now]

# 3. Wildcard
message: "*NullPointerException*"

# 4. Regular expression
message: /.*User \d+ not found.*/

# 5. Aggregations (built as Kibana visualizations or via the
#    Elasticsearch aggregation DSL rather than in the query bar):
#    - count of documents grouped by level
#    - date histogram on @timestamp with a 1m interval
#    - terms aggregation on appName

# 6. Full-link tracing: find all logs that share one Trace ID
traceId: "a1b2c3d4e5f6g7h8"

5. Distributed Tracing

How it works

1. The client (or edge service) generates a Trace ID for the request
2. Each service generates a Span as it processes the request, and the
   context travels with the request (see the propagation sketch below)
3. A Span records:
   - Span ID: unique ID of the current span
   - Parent Span ID: ID of the parent span
   - Trace ID: the global trace ID
   - Timestamp: start time
   - Duration: elapsed time
   - Tags: key/value labels
   - Logs: span-scoped events
4. Spans are reported to Jaeger/Zipkin
5. The tracing backend reassembles the call chain
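
A hedged sketch of steps 1-2, propagating the context over HTTP with the plain JDK client. The X-Trace-Id/X-Span-Id header names follow the convention used later in this document; production systems usually use W3C traceparent or B3 headers:

// Reuses an incoming trace id (or starts a new trace at the edge) and
// stamps the outgoing request with trace/span headers.
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

public class TracePropagation {

    public static HttpRequest withTraceContext(URI uri, String incomingTraceId,
                                               String parentSpanId) {
        String traceId = incomingTraceId != null ? incomingTraceId : newId();
        String spanId = newId(); // every hop gets a fresh span id

        HttpRequest.Builder builder = HttpRequest.newBuilder(uri)
                .header("X-Trace-Id", traceId)
                .header("X-Span-Id", spanId);
        if (parentSpanId != null) {
            builder = builder.header("X-Parent-Span-Id", parentSpanId);
        }
        return builder.GET().build();
    }

    private static String newId() {
        return UUID.randomUUID().toString().replace("-", "");
    }
}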

Jaeger Architecture

                    ┌─────────────────┐
                    │   Applications  │
                    │ (Jaeger Client) │
                    └─────────────────┘
                           │
                           │ UDP/HTTP
                           │
                    ┌─────────────────┐
                    │      Agent      │
                    │(data collection)│
                    └─────────────────┘
                           │
                           │
                    ┌─────────────────┐
                    │    Collector    │
                    │(data processing)│
                    └─────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
┌───────▼───────┐    ┌─────▼─────┐      ┌─────▼─────┐
│ Elasticsearch │    │ Cassandra │      │   Kafka   │
└───────┬───────┘    └───────────┘      └───────────┘
        │ query
        ▼
   ┌─────────┐
   │  Query  │
   │ Service │
   └─────────┘
        │
        │ Web UI
        ▼
   ┌─────────┐
   │   Web   │
   │   UI    │
   └─────────┘

Spring Boot Integration with Jaeger

<!-- pom.xml -->
<dependency>
    <groupId>io.opentracing.contrib</groupId>
    <artifactId>opentracing-spring-jaeger-web-starter</artifactId>
</dependency>

# application.yml
opentracing:
  jaeger:
    enabled: true
    service-name: my-app
    udp-sender:
      host: jaeger-agent
      port: 6831
    probabilistic-sampler:
      sampling-rate: 0.1  # sample 10% of traces

Code example

@RestController
public class UserController {

    private final Tracer tracer;
    private final UserService userService;

    public UserController(Tracer tracer, UserService userService) {
        this.tracer = tracer;
        this.userService = userService;
    }

    @GetMapping("/api/users/{id}")
    public User getUserById(@PathVariable Long id) {
        // Create a custom span
        Span span = tracer.buildSpan("getUserById")
            .withTag("userId", id)
            .start();

        try (Scope scope = tracer.scopeManager().activate(span)) {
            User user = userService.findById(id);

            if (user == null) {
                span.setTag("error", true);
                span.log("User not found");
                throw new UserNotFoundException(id);
            }

            return user;
        } finally {
            span.finish();
        }
    }
}

Zipkin Integration

<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>

# application.yml
spring:
  zipkin:
    base-url: http://zipkin:9411
  sleuth:
    sampler:
      probability: 0.1  # sample 10% of traces

6. Monitoring Alert Rules

Prometheus Alert Rules

# alerts.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }} seconds"

      # Service down
      - alert: ServiceDown
        expr: up{job="spring-boot"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} is down"

      # High JVM heap usage
      - alert: HighMemoryUsage
        expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Heap memory usage is {{ $value | humanizePercentage }}"

      # Low disk space
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low"
          description: "Only {{ $value | humanizePercentage }} of disk space remains"

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m

# Routing
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

  routes:
    - match:
        severity: critical
      receiver: 'critical'
      continue: true

    - match:
        severity: warning
      receiver: 'warning'

# Receivers
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook-server/default'

  - name: 'critical'
    webhook_configs:
      - url: 'http://webhook-server/critical'
    email_configs:
      - to: 'oncall@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'

  - name: 'warning'
    webhook_configs:
      - url: 'http://webhook-server/warning'

# Inhibition rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

DingTalk Alerts

# DingTalk webhook receiver example
from flask import Flask, request
import requests

app = Flask(__name__)

@app.route('/alertmanager', methods=['POST'])
def alertmanager():
    data = request.json

    for alert in data.get('alerts', []):
        status = alert.get('status')
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})

        message = {
            "msgtype": "markdown",
            "markdown": {
                "title": f"Alert: {labels.get('alertname')}",
                "text": f"""
### {labels.get('alertname')}

**Status:** {status}
**Severity:** {labels.get('severity')}
**Instance:** {labels.get('instance')}

**Summary:** {annotations.get('summary')}
**Description:** {annotations.get('description')}

**Starts:** {alert.get('startsAt')}
                """
            }
        }

        requests.post(
            'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN',
            json=message
        )

    return 'OK'

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

7. Full-Link Tracing

Implementation approach

1. Generate a Trace ID at the entry point
2. Propagate the Trace ID via HTTP headers
   - X-Trace-Id
   - X-Span-Id
3. Each service records its own Span
4. Spans are reported asynchronously to Jaeger/Zipkin
5. The tracing backend reassembles the call chain

Spring Cloud Sleuth Implementation

// 1. With spring-cloud-starter-sleuth on the classpath, Sleuth creates
//    and propagates Trace/Span IDs automatically for web requests,
//    RestTemplate, Kafka, etc.; no extra configuration is needed for
//    the defaults.

// 2. Manually propagating the Trace ID on RestTemplate calls
//    (illustrative; Sleuth already instruments RestTemplate beans)
@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }

    @Bean
    public RestTemplateCustomizer restTemplateCustomizer(Tracer tracer) {
        return restTemplate -> {
            restTemplate.setInterceptors(Collections.singletonList(new ClientHttpRequestInterceptor() {
                @Override
                public ClientHttpResponse intercept(HttpRequest request, byte[] body, ClientHttpRequestExecution execution) throws IOException {
                    Span span = tracer.activeSpan();
                    if (span != null) {
                        // OpenTracing 0.33 exposes IDs via toTraceId()/toSpanId()
                        request.getHeaders().add("X-Trace-Id", span.context().toTraceId());
                        request.getHeaders().add("X-Span-Id", span.context().toSpanId());
                    }
                    return execution.execute(request, body);
                }
            }));
        };
    }
}

// 3. Kafka: Sleuth wraps the ProducerFactory so trace headers are
//    injected into outgoing records automatically
@Configuration
public class KafkaConfig {

    @Bean
    public ProducerFactory<String, String> producerFactory() {
        Map<String, Object> configProps = new HashMap<>();
        configProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        return new DefaultKafkaProducerFactory<>(configProps,
            new StringSerializer(),
            new StringSerializer());
    }
}

// 4. Database: the trace context cannot be captured once at DataSource
//    creation time (no span is active then); correlating SQL with a trace
//    needs a per-statement hook, e.g., a JDBC proxy/interceptor that reads
//    the current span for every query
@Configuration
public class DatabaseConfig {

    @Bean
    public DataSource dataSource() {
        HikariDataSource dataSource = new HikariDataSource();
        dataSource.setJdbcUrl("jdbc:mysql://localhost:3306/db");
        dataSource.setConnectionTestQuery("SELECT 1");
        return dataSource;
    }
}

Correlating Logs with the Trace ID

// Use the MDC to carry the Trace ID into every log line
@Slf4j
@Component
public class TraceIdFilter implements Filter {

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
        // ServletRequest has no getHeader(); cast to HttpServletRequest
        String traceId = ((HttpServletRequest) request).getHeader("X-Trace-Id");
        if (traceId == null) {
            traceId = UUID.randomUUID().toString();
        }

        MDC.put("traceId", traceId);

        try {
            chain.doFilter(request, response);
        } finally {
            MDC.clear();
        }
    }
}

// Logback configuration
<configuration>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - traceId=%X{traceId} - %msg%n</pattern>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="CONSOLE" />
    </root>
</configuration>

8. Locating Performance Bottlenecks

Workflow

1. Monitoring alert (Prometheus)
   - High CPU usage
   - High memory usage
   - High request latency

2. Distributed tracing (Jaeger)
   - Find the slow requests
   - Identify the service with the longest spans

3. Log analysis (ELK)
   - Search error logs
   - Analyze exception stack traces

4. Profiling
   - CPU profiling
   - Memory profiling
   - Thread dump

Case 1: Locating a slow query

-- 1. Enable the MySQL slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- 2. Analyze the slow log (shell):
--    pt-query-digest /var/log/mysql/slow.log

-- 3. Optimize the SQL
-- Add an index
CREATE INDEX idx_user_email ON users(email);

-- Rewrite the query so the index can be used
-- Before (LOWER() on the column defeats the index)
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';

-- After
SELECT * FROM users WHERE email = 'alice@example.com';

Case 2: Locating a memory leak

# 1. Dump the heap
jmap -dump:format=b,file=heap.hprof <pid>

# 2. Analyze with Eclipse MAT
# - Dominator Tree
# - Leak Suspects report
# - Histogram

# 3. Typical leak sources
# - Unclosed resources (Connection, Stream)
# - Static collections holding large objects (see the sketch below)
# - Caches without expiry
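
A minimal sketch of the "static collection" pattern from the list above, plus the usual fix (class and field names are illustrative):

// Leak: static map, no eviction, grows for the lifetime of the JVM.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SessionCache {

    private static final Map<String, byte[]> CACHE = new ConcurrentHashMap<>();

    public static void put(String sessionId, byte[] payload) {
        CACHE.put(sessionId, payload); // entries are never removed
    }
}

// Fix: bound the cache and expire entries, e.g. with Caffeine:
//   Cache<String, byte[]> cache = Caffeine.newBuilder()
//       .maximumSize(10_000)
//       .expireAfterWrite(Duration.ofMinutes(30))
//       .build();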

Case 3: Locating high CPU load

# 1. Check CPU usage per thread (-H shows threads)
top -H -p <pid>

# 2. Dump the threads
jstack <pid> > thread.dump

# 3. Find the busy thread
printf "%x\n" <tid>  # convert the thread ID to hex
grep -A 20 <tid-hex> thread.dump

# 4. Typical culprits in the code
# - Infinite loops
# - Regular expressions (catastrophic backtracking)
# - Serializing large objects
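
The same busy-thread hunt can also be done in-process through JMX instead of top + jstack; a hedged sketch (the 1-second CPU threshold is arbitrary):

// Ranks threads by accumulated CPU time via the platform ThreadMXBean.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BusyThreads {

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            return;
        }

        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = mx.getThreadCpuTime(id); // -1 if the thread died
            ThreadInfo info = mx.getThreadInfo(id);
            if (cpuNanos > 1_000_000_000L && info != null) { // > 1s of CPU
                System.out.printf("%s cpu=%.1fs state=%s%n",
                        info.getThreadName(), cpuNanos / 1e9, info.getThreadState());
            }
        }
    }
}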

9. Metrics System Design

Layered metrics

1. Infrastructure layer
   - CPU usage
   - Memory usage
   - Disk I/O
   - Network traffic

2. Platform layer
   - Kubernetes cluster health
   - Pod counts
   - Node status

3. Middleware layer
   - Redis: connections, command latency, memory usage
   - MySQL: QPS, slow queries, connections, replication lag
   - Kafka: message backlog, consumption latency

4. Application layer
   - QPS (requests per second)
   - Latency (P50, P95, P99)
   - Error rate
   - Saturation

5. Business layer
   - Order volume
   - Payment success rate
   - User activity

The RED Method

R - Rate
- QPS (queries per second)
- RPS (requests per second)

E - Errors
- HTTP 5xx error rate
- Business exception rate

D - Duration
- P50 (median)
- P95 (95th percentile)
- P99 (99th percentile)
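
A hedged Micrometer sketch emitting all three RED signals for one endpoint (the metric names and wrapper class are illustrative): Rate and Errors come from a status-tagged counter, Duration from a timer with percentiles.

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.time.Duration;
import java.util.function.Supplier;

public class RedMetrics {

    private final MeterRegistry registry;
    private final Timer duration;

    public RedMetrics(MeterRegistry registry) {
        this.registry = registry;
        this.duration = Timer.builder("http.server.duration")
                .publishPercentiles(0.5, 0.95, 0.99) // D: P50/P95/P99
                .register(registry);
    }

    public <T> T record(Supplier<T> handler) {
        long start = System.nanoTime();
        try {
            T result = handler.get();
            count("200"); // R: total requests, broken down by status
            return result;
        } catch (RuntimeException e) {
            count("500"); // E: errors as the 5xx subset of the rate
            throw e;
        } finally {
            duration.record(Duration.ofNanos(System.nanoTime() - start));
        }
    }

    private void count(String status) {
        registry.counter("http.server.requests", "status", status).increment();
    }
}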

The USE Method

U - Utilization
- CPU usage
- Memory usage
- Disk usage

S - Saturation
- CPU run-queue length
- Memory swap usage
- Disk I/O wait time

E - Errors
- Hardware errors (ECC, bad sectors)
- Software errors (OOM, connection timeouts)
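
For a single JVM host, the "U" readings can be sampled in-process; a hedged sketch (relies on the com.sun.management extension of OperatingSystemMXBean; getCpuLoad and the memory accessors are the JDK 14+ names):

// Samples CPU, memory, and disk utilization from inside the JVM.
import com.sun.management.OperatingSystemMXBean;
import java.io.File;
import java.lang.management.ManagementFactory;

public class UtilizationSample {

    public static void main(String[] args) {
        OperatingSystemMXBean os = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();

        double cpu = os.getCpuLoad();            // system CPU load, 0.0-1.0
        long freeMem = os.getFreeMemorySize();   // free physical memory
        long totalMem = os.getTotalMemorySize(); // total physical memory
        File root = new File("/");
        double diskUsed = 1.0 - (double) root.getUsableSpace() / root.getTotalSpace();

        System.out.printf("cpu=%.1f%% mem=%.1f%% disk=%.1f%%%n",
                cpu * 100,
                (1.0 - (double) freeMem / totalMem) * 100,
                diskUsed * 100);
    }
}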

Grafana Dashboard Example

{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "QPS",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[1m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[1m])) / sum(rate(http_requests_total[1m]))"
          }
        ]
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[1m]))"
          }
        ]
      },
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total[1m])"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"} / jvm_memory_max_bytes{area=\"heap\"}"
          }
        ]
      }
    ]
  }
}

10. Rolling It Out in a Real Project

Scenario 1: Monitoring an e-commerce system

Requirements:
- Monitor order API performance
- Find and optimize slow queries
- Monitor the payment success rate

Approach:
1. Prometheus monitoring
   - QPS, latency, error rate
   - JVM metrics
   - MySQL slow queries

2. Jaeger tracing
   - Order creation flow
   - Payment flow

3. ELK log analysis
   - Order logs
   - Payment logs

4. Grafana dashboards
   - Business metrics (order volume, payment success rate)
   - Technical metrics (QPS, latency)

Scenario 2: Tracing across microservices

Requirements:
- Trace cross-service requests
- Locate performance bottlenecks
- Analyze service dependencies

Approach:
1. Spring Cloud Sleuth generates Trace IDs
2. Jaeger collects spans
3. Kibana correlates logs (by Trace ID)
4. Prometheus monitors each service's performance

Example:
Placing an order
├─ Order service (create the order)
├─ Inventory service (deduct stock)
├─ Payment service (create the payment)
└─ Logistics service (assign shipping)

All services' logs are correlated through the Trace ID

11. Alibaba P7 Bonus Points

Architecture design

  • Designed an enterprise observability platform (unified metrics, logs, traces)
  • Experience with multi-cluster, multi-region monitoring architectures
  • Built custom monitoring agents and collectors

Depth of understanding

  • Familiar with Prometheus internals (TSDB, storage engine, query engine)
  • Understand Elasticsearch internals (Lucene, shards, replicas)
  • Have read Jaeger/Zipkin source code

Performance optimization

  • Optimized Prometheus query performance (recording rules, federation)
  • Optimized Elasticsearch indexing (shard strategy, mapping design)
  • Optimized log collection (sampling rates, batched uploads)

Production experience

  • Solved massive-scale storage and query problems (downsampling, hot/cold tiering)
  • Implemented smart alerting (dynamic thresholds, anomaly detection, machine learning)
  • Fast fault localization experience (root cause analysis, postmortems)

Open source contributions

  • Submitted PRs to the Prometheus/Grafana/Jaeger communities
  • Developed custom exporters
  • Wrote technical blog posts or gave talks on the topic

Observability best practices

  • Implemented SLO/SLI (Service Level Objectives/Indicators)
  • Used error budgets to pace releases
  • Practiced chaos engineering
  • Implemented APM (Application Performance Monitoring)

Business monitoring

  • Designed business-metric dashboards
  • Built real-time data dashboards (Druid, ClickHouse)
  • User behavior analytics experience (event tracking, funnel analysis)