# Observability

## Questions

**Background**: In a distributed system, quickly locating and resolving problems is a key challenge. Observability helps development and operations teams understand a system's internal state through its three pillars: metrics, logs, and traces.

**Questions**:

1. What is observability? How does it differ from monitoring?
2. What roles do the three pillars (metrics, logs, traces) play?
3. How is a Prometheus + Grafana monitoring architecture designed?
4. How do you set up an ELK (Elasticsearch, Logstash, Kibana) logging stack?
5. How does distributed tracing (Jaeger/Zipkin) work?
6. How do you design monitoring alert rules?
7. How do you implement end-to-end tracing?
8. How do you locate performance bottlenecks?
9. How do you design a metrics hierarchy?
10. How do you roll out observability in a real project?

---

## Reference Answers

### 1. Observability Overview

#### **Definitions**:

```
Observability:
the ability to infer a system's internal state from its external outputs
(metrics, logs, traces)

Monitoring:
checking whether a system is running normally against predefined indicators
```

#### **Comparison**:

```
Monitoring:
├─ Actively polls system state (predefined rules)
├─ Focuses on known problems (e.g. CPU usage > 80%)
└─ Limitation: cannot surface unknown problems

Observability:
├─ Passively collects system outputs (data-driven)
├─ Can surface unknown problems
└─ Supports root cause analysis
```

#### **The three pillars**:

```
1. Metrics: numeric data
   - Counter: request counts, error counts
   - Gauge: CPU usage, memory usage
   - Histogram: request latency distribution

2. Logs: discrete events
   - Application logs: error logs, debug logs
   - Access logs: Nginx access.log
   - Audit logs: operation records

3. Traces: request paths
   - Trace: one complete request (from client to backend)
   - Span: the processing inside a single service
   - Span ID, Trace ID: correlation identifiers
```

---

### 2. The Three Pillars in Detail

#### **Metrics**:

```yaml
# Prometheus metric examples

# 1. Counter (monotonically increasing)
http_requests_total{method="GET",path="/api/users",status="200"} 12345

# 2. Gauge (can go up or down)
memory_usage_bytes{instance="localhost:8080"} 1073741824
cpu_usage_percent{instance="localhost:8080"} 45.2

# 3. Histogram (distribution)
http_request_duration_seconds_bucket{le="0.1"} 5000
http_request_duration_seconds_bucket{le="0.5"} 9500
http_request_duration_seconds_bucket{le="+Inf"} 10000
```

**Code example (Spring Boot Actuator + Micrometer)**:

```java
@RestController
public class UserController {

    private final UserService userService;
    private final Counter requestCounter;

    public UserController(UserService userService, MeterRegistry registry) {
        this.userService = userService;
        this.requestCounter = Counter.builder("http.requests.total")
            .tag("method", "GET")
            .tag("path", "/api/users")
            .register(registry);
        // Gauge sampling the JVM's used memory (total minus free)
        Gauge.builder("jvm.memory.used", Runtime.getRuntime(),
                rt -> rt.totalMemory() - rt.freeMemory())
            .register(registry);
    }

    @GetMapping("/api/users")
    public List<User> getUsers() {
        requestCounter.increment();
        return userService.findAll();
    }
}
```

#### **Logs**:

```java
// Structured logging (JSON output; assumes the logstash-logback-encoder
// dependency, whose StructuredArguments become JSON fields)
import static net.logstash.logback.argument.StructuredArguments.kv;

@Slf4j
@RestController
public class UserController {

    private final UserService userService;

    public UserController(UserService userService) {
        this.userService = userService;
    }

    @GetMapping("/api/users/{id}")
    public User getUserById(@PathVariable Long id) {
        // traceId/spanId are taken from the MDC by the JSON encoder
        log.info("Get user by id", kv("userId", id));

        User user = userService.findById(id);
        if (user == null) {
            log.warn("User not found", kv("userId", id));
            throw new UserNotFoundException(id);
        }
        return user;
    }
}

// Log output
{
  "timestamp": "2024-01-01T10:00:00Z",
  "level": "INFO",
  "logger": "com.example.UserController",
  "message": "Get user by id",
  "userId": 123,
  "traceId": "a1b2c3d4e5f6g7h8",
  "spanId": "i9j0k1l2m3n4o5p6",
  "thread": "http-nio-8080-exec-1"
}
```

#### **Traces**:

```
Trace (one complete request):
Client → Gateway → Service A → Service B → Service C
  │         │          │           │           │
  └─────────┴──────────┴───────────┴───────────┘
               Trace ID: abc123

Span (processing inside one service):
Gateway (Span 1)
├─ Service A (Span 2)
│   └─ Service B (Span 3)
│       └─ Service C (Span 4)
```
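The Traces pillar is easiest to see in code. Below is a minimal sketch of a parent/child span pair using the OpenTelemetry Java API, assuming an OpenTelemetry SDK has been initialized and registered as the global instance elsewhere; the `checkout-service` instrumentation name and the span names are illustrative:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutHandler {

    // Assumes an OpenTelemetry SDK was configured and set as the global instance
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("checkout-service");

    public void processOrder() {
        Span parent = tracer.spanBuilder("processOrder").startSpan();
        try (Scope ignored = parent.makeCurrent()) {
            // The child is created while the parent is current, so it becomes
            // a child span and shares the parent's Trace ID
            Span child = tracer.spanBuilder("chargeCard").startSpan();
            try (Scope ignored2 = child.makeCurrent()) {
                // ... call the payment service here ...
            } finally {
                child.end();
            }
        } finally {
            parent.end();
        }
    }
}
```

The two spans form exactly the parent/child tree drawn above, sharing a single Trace ID.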
---

### 3. Prometheus + Grafana Architecture

#### **Architecture diagram**:

```
┌─────────────────┐
│  Applications   │
│   (exporters)   │
└────────┬────────┘
         │ /metrics (pull)
         ▼
┌─────────────────┐   alert rules   ┌─────────────────┐
│   Prometheus    │────────────────▶│  Alertmanager   │
│ (TSDB storage)  │                 │ (alert routing) │
└────────┬────────┘                 └────────┬────────┘
         │ PromQL queries                    │ notify
         ▼                                   ▼
┌─────────────────┐                 ┌─────────────────┐
│     Grafana     │                 │ Email/Webhook   │
│  (dashboards)   │                 │ DingTalk/WeCom  │
└─────────────────┘                 └─────────────────┘
```

#### **Prometheus configuration**:

```yaml
# prometheus.yml
global:
  scrape_interval: 15s      # scrape every 15 seconds
  evaluation_interval: 15s  # evaluate alert rules every 15 seconds

# Alert rules
rule_files:
  - "alerts/*.yml"

# Scrape configuration
scrape_configs:
  # Spring Boot Actuator
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:8080']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

# Alert delivery
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```

#### **Spring Boot integration**:

```xml
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```

```yaml
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,info
  metrics:
    export:
      prometheus:
        enabled: true
    tags:
      application: ${spring.application.name}
```
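Beyond the built-in `http_server_requests` metrics, latency for custom operations is usually published as a histogram so that `histogram_quantile()` has buckets to aggregate. A minimal Micrometer sketch (the `order.process.latency` metric name and `OrderService` class are illustrative, not part of the starter):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class OrderService {

    private final Timer processTimer;

    public OrderService(MeterRegistry registry) {
        // publishPercentileHistogram() exports *_bucket series that
        // Prometheus can feed into histogram_quantile()
        this.processTimer = Timer.builder("order.process.latency")
            .publishPercentileHistogram()
            .register(registry);
    }

    public void processOrder(Runnable work) {
        processTimer.record(work);  // records the wall-clock time of the work
    }
}
```

Queries like the P95 panel in the dashboard below can then be pointed at the resulting `order_process_latency_seconds_bucket` series.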
#### **Grafana dashboard**:

```json
{
  "dashboard": {
    "title": "Spring Boot Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count[1m])",
            "legendFormat": "{{method}} {{uri}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Request Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[1m]))",
            "legendFormat": "P95 Latency"
          }
        ],
        "type": "graph"
      },
      {
        "title": "JVM Memory Usage",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"}",
            "legendFormat": "Heap Used"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```

---

### 4. ELK Logging Stack

#### **Architecture diagram**:

```
┌─────────────────┐
│  Applications   │
│  (log output)   │
└────────┬────────┘
         │ Filebeat
         ▼
┌─────────────────┐
│    Logstash     │
│ (log pipeline)  │
├─────────────────┤
│ - filter        │
│ - transform     │
│ - parse         │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Elasticsearch   │
│ (log storage)   │
└────────┬────────┘
         │ query
         ▼
┌─────────────────┐
│     Kibana      │
│ (visualization) │
└─────────────────┘
```

#### **Logstash configuration**:

```conf
# logstash.conf
input {
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
    codec => json
  }
  beats {
    port => 5044
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }

  # Parse the timestamp
  date {
    match => ["timestamp", "ISO8601"]
  }

  # Extract the trace ID (fallback for lines that are not valid JSON)
  grok {
    match => { "message" => '"traceId":"%{DATA:traceId}"' }
  }

  # Add the application name
  mutate {
    add_field => { "application" => "my-app" }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "my-app-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => rubydebug
  }
}
```

#### **Filebeat configuration**:

```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      app: my-app
      env: production
    # Multiline handling (e.g. stack traces): lines that do not start
    # with a date are appended to the previous event
    multiline.type: pattern
    multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
    multiline.negate: true
    multiline.match: after

output.logstash:
  hosts: ["logstash:5044"]
```

#### **Kibana queries**:

```
# 1. Simple query
level: "ERROR"

# 2. Range query
@timestamp: [now-1h TO now]

# 3. Wildcard
message: "*NullPointerException*"

# 4. Regular expression (Lucene syntax, which has no \d shorthand)
message: /.*User [0-9]+ not found.*/

# 5. Aggregations are built in visualizations rather than the query bar:
#    - count of ERROR logs grouped by level (terms aggregation)
#    - date histogram on @timestamp with a 1m interval
#    - terms aggregation on the application/service name field

# 6. End-to-end tracing: find all logs belonging to one trace
traceId: "a1b2c3d4e5f6g7h8"
```

---

### 5. Distributed Tracing

#### **How it works**:

```
1. The client request generates a Trace ID
2. Each service creates a Span while handling the request
3. A Span records:
   - Span ID (unique ID of this span)
   - Parent Span ID
   - Trace ID (global trace ID)
   - Timestamp (start time)
   - Duration
   - Tags
   - Logs
4. Spans are reported to Jaeger/Zipkin
5. The tracing backend reconstructs the call chain
```

#### **Jaeger architecture**:

```
┌─────────────────┐
│  Applications   │
│ (Jaeger Client) │
└────────┬────────┘
         │ UDP/HTTP
         ▼
┌─────────────────┐
│      Agent      │
│  (span intake)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Collector    │
│  (processing)   │
└────────┬────────┘
         │
   ┌─────┴─────────────┬───────────┐
   ▼                   ▼           ▼
┌───────────────┐ ┌───────────┐ ┌───────┐
│ Elasticsearch │ │ Cassandra │ │ Kafka │
└───────┬───────┘ └───────────┘ └───────┘
        │ query
        ▼
┌───────────────┐
│ Query Service │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│    Web UI     │
└───────────────┘
```

#### **Spring Boot + Jaeger integration**:

```xml
<dependency>
    <groupId>io.opentracing.contrib</groupId>
    <artifactId>opentracing-spring-jaeger-web-starter</artifactId>
</dependency>
```

```yaml
# application.yml
opentracing:
  jaeger:
    enabled: true
    service-name: my-app
    udp-sender:
      host: jaeger-agent
      port: 6831
    probabilistic-sampler:
      sampling-rate: 0.1   # sample 10% of requests
```

**Code example**:

```java
@RestController
public class UserController {

    private final Tracer tracer;
    private final UserService userService;

    public UserController(Tracer tracer, UserService userService) {
        this.tracer = tracer;
        this.userService = userService;
    }

    @GetMapping("/api/users/{id}")
    public User getUserById(@PathVariable Long id) {
        // Create a custom span
        Span span = tracer.buildSpan("getUserById")
            .withTag("userId", String.valueOf(id))
            .start();

        try (Scope scope = tracer.scopeManager().activate(span)) {
            User user = userService.findById(id);
            if (user == null) {
                span.setTag("error", true);
                span.log("User not found");
                throw new UserNotFoundException(id);
            }
            return user;
        } finally {
            span.finish();
        }
    }
}
```

#### **Zipkin integration**:

```xml
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>
```

```yaml
# application.yml
spring:
  zipkin:
    base-url: http://zipkin:9411
  sleuth:
    sampler:
      probability: 0.1  # sample 10% of requests
```
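Trace context crosses service boundaries as plain HTTP headers. Sleuth/Zipkin use B3 propagation by default, so one way to see (or inject) the context is with curl; the IDs below are made-up example values:

```bash
# Call a service while injecting B3 trace context by hand
curl http://localhost:8080/api/users/1 \
  -H "X-B3-TraceId: 80f198ee56343ba864fe8b2a57d3eff7" \
  -H "X-B3-SpanId: e457b5a2e4d86bd1" \
  -H "X-B3-Sampled: 1"
```

An instrumented service should continue this trace rather than start a new one, so the call shows up in the same Jaeger/Zipkin trace as its caller.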
---

### 6. Monitoring Alert Rules

#### **Prometheus alert rules**:

```yaml
# alerts.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }} seconds"

      # Service down
      - alert: ServiceDown
        expr: up{job="spring-boot"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} is down"

      # High JVM heap usage
      - alert: HighMemoryUsage
        expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Heap memory usage is {{ $value | humanizePercentage }}"

      # Low disk space
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low"
          description: "Disk space is {{ $value | humanizePercentage }} available"
```
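Rule files are worth validating before reloading Prometheus; both Prometheus and Alertmanager ship with checker CLIs. A quick sketch, assuming `promtool` and `amtool` are on the PATH and the file paths match the layout above:

```bash
# Validate alert rule syntax
promtool check rules alerts/*.yml

# Validate the main Prometheus config (also resolves rule_files)
promtool check config prometheus.yml

# Validate the Alertmanager routing/receiver config
amtool check-config alertmanager.yml
```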
#### **Alertmanager configuration**:

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

# Routing
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      continue: true
    - match:
        severity: warning
      receiver: 'warning'

# Receivers
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook-server/default'
  - name: 'critical'
    webhook_configs:
      - url: 'http://webhook-server/critical'
    email_configs:
      - to: 'oncall@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'
  - name: 'warning'
    webhook_configs:
      - url: 'http://webhook-server/warning'

# Inhibition rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
```

#### **DingTalk alerts**:

```python
# DingTalk webhook bridge for Alertmanager
from flask import Flask, request
import requests

app = Flask(__name__)

@app.route('/alertmanager', methods=['POST'])
def alertmanager():
    data = request.json
    for alert in data.get('alerts', []):
        status = alert.get('status')
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})

        message = {
            "msgtype": "markdown",
            "markdown": {
                "title": f"Alert: {labels.get('alertname')}",
                "text": f"""
### {labels.get('alertname')}
**Status:** {status}
**Severity:** {labels.get('severity')}
**Instance:** {labels.get('instance')}
**Summary:** {annotations.get('summary')}
**Description:** {annotations.get('description')}
**Starts:** {alert.get('startsAt')}
"""
            }
        }

        requests.post(
            'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN',
            json=message
        )
    return 'OK'

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```

---

### 7. End-to-End Tracing

#### **Implementation approach**:

```
1. The client generates a Trace ID
2. The Trace ID is propagated via HTTP headers
   - X-Trace-Id
   - X-Span-Id
3. Each service records a Span
4. Spans are reported asynchronously to Jaeger/Zipkin
5. The tracing backend reconstructs the call chain
```

#### **Spring Cloud Sleuth implementation**:

```java
// 1. Sleuth setup: with spring-cloud-starter-sleuth on the classpath,
//    trace and span IDs are created and propagated automatically; the
//    main knob is the sampling rate in application.yml:
//    spring.sleuth.sampler.probability: 0.1

// 2. Propagating the Trace ID on outgoing RestTemplate calls
@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }

    @Bean
    public RestTemplateCustomizer restTemplateCustomizer(Tracer tracer) {
        return restTemplate -> restTemplate.setInterceptors(
            Collections.singletonList((request, body, execution) -> {
                Span span = tracer.activeSpan();
                if (span != null) {
                    request.getHeaders().add("X-Trace-Id", span.context().toTraceId());
                    request.getHeaders().add("X-Span-Id", span.context().toSpanId());
                }
                return execution.execute(request, body);
            }));
    }
}

// 3. Propagating the Trace ID through Kafka messages
@Configuration
public class KafkaConfig {

    @Bean
    public ProducerFactory<String, String> producerFactory() {
        Map<String, Object> configProps = new HashMap<>();
        configProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Trace context travels as Kafka record headers; Sleuth's
        // instrumentation adds them automatically
        return new DefaultKafkaProducerFactory<>(configProps,
            new StringSerializer(), new StringSerializer());
    }
}

// 4. Correlating database work with the trace: connection init SQL is
//    fixed when the pool is created, so a per-request trace ID cannot be
//    injected there; use a tracing JDBC proxy/interceptor to tag each
//    query with the active span instead
@Configuration
public class DatabaseConfig {

    @Bean
    public DataSource dataSource() {
        HikariDataSource dataSource = new HikariDataSource();
        dataSource.setJdbcUrl("jdbc:mysql://localhost:3306/db");
        dataSource.setConnectionTestQuery("SELECT 1");
        return dataSource;
    }
}
```

#### **Correlating logs via the Trace ID**:

```java
// Use the MDC to attach the Trace ID to every log line
@Slf4j
@Component
public class TraceIdFilter implements Filter {

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        String traceId = httpRequest.getHeader("X-Trace-Id");
        if (traceId == null) {
            traceId = UUID.randomUUID().toString();
        }
        MDC.put("traceId", traceId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.clear();
        }
    }
}
```

```xml
<!-- Logback pattern -->
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - traceId=%X{traceId} - %msg%n</pattern>
```

---

### 8. Locating Performance Bottlenecks

#### **Workflow**:

```
1. Monitoring alert (Prometheus)
   - High CPU usage
   - High memory usage
   - High request latency

2. Tracing (Jaeger)
   - Find the slow requests
   - Identify the service with the longest spans

3. Log analysis (ELK)
   - Search for error logs
   - Analyze exception stack traces

4. Profiling
   - CPU profiling
   - Memory profiling
   - Thread dumps
```

#### **Case 1: slow queries**

```sql
-- 1. Enable the MySQL slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- 2. Analyze the slow query log (shell command)
-- pt-query-digest /var/log/mysql/slow.log

-- 3. Optimize the SQL
-- Add an index
CREATE INDEX idx_user_email ON users(email);

-- Rewrite the query so the index can be used
-- Before (LOWER() on the column defeats the index)
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- After
SELECT * FROM users WHERE email = 'alice@example.com';
```

#### **Case 2: memory leaks**

```bash
# 1. Dump the heap (replace <pid> with the Java process ID)
jmap -dump:format=b,file=heap.hprof <pid>

# 2. Analyze with Eclipse MAT
#    - Dominator Tree
#    - Leak Suspects report
#    - Histogram

# 3. Typical leak sources
#    - Unclosed resources (Connection, Stream)
#    - Static collections holding large objects
#    - Caches without expiry
```

#### **Case 3: high CPU load**

```bash
# 1. Find the busy threads of the process (replace <pid>)
top -H -p <pid>

# 2. Dump the threads
jstack <pid> > thread.dump

# 3. Locate the busy thread in the dump
printf "%x\n" <tid>                    # convert the thread ID to hex
grep -A 20 "nid=0x<hex-tid>" thread.dump

# 4. Analyze the code
#    - Infinite loops
#    - Regular expressions (catastrophic backtracking)
#    - Serialization of large objects
```

---

### 9. Metrics Hierarchy

#### **Layered metrics**:

```
1. Infrastructure
   - CPU usage
   - Memory usage
   - Disk I/O
   - Network traffic

2. Platform
   - Kubernetes cluster health
   - Pod counts
   - Node status

3. Middleware
   - Redis: connections, command latency, memory usage
   - MySQL: QPS, slow queries, connections, replication lag
   - Kafka: message backlog, consumer lag

4. Application
   - QPS (queries per second)
   - Latency (P50, P95, P99)
   - Error rate
   - Saturation

5. Business
   - Order volume
   - Payment success rate
   - User activity
```

#### **The RED method** (a PromQL sketch of these signals follows the USE method below):

```
R - Rate
    - QPS (queries per second)
    - RPS (requests per second)

E - Errors
    - HTTP 5xx error rate
    - Business exception rate

D - Duration
    - P50 (median)
    - P95 (95th percentile)
    - P99 (99th percentile)
```

#### **The USE method**:

```
U - Utilization
    - CPU usage
    - Memory usage
    - Disk usage

S - Saturation
    - CPU run queue length
    - Swap usage
    - Disk I/O wait time

E - Errors
    - Hardware errors (ECC memory, bad disk sectors)
    - Software errors (OOM, connection timeouts)
```
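In PromQL, the RED signals map onto a handful of standard queries. A sketch against the `http_requests_total` and `http_request_duration_seconds` metrics used earlier (label names vary by instrumentation):

```promql
# R - Rate: requests per second over the last minute
sum(rate(http_requests_total[1m]))

# E - Errors: share of requests answered with HTTP 5xx
sum(rate(http_requests_total{status=~"5.."}[1m]))
  / sum(rate(http_requests_total[1m]))

# D - Duration: P95 latency from histogram buckets
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))
```

These are the same expressions the dashboard below uses, keeping the RED view and the Grafana panels consistent.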
#### **Grafana dashboard example**:

```json
{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "QPS",
        "targets": [
          { "expr": "sum(rate(http_requests_total[1m]))" }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "sum(rate(http_requests_total{status=~\"5..\"}[1m])) / sum(rate(http_requests_total[1m]))" }
        ]
      },
      {
        "title": "P95 Latency",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[1m]))" }
        ]
      },
      {
        "title": "CPU Usage",
        "targets": [
          { "expr": "rate(process_cpu_seconds_total[1m])" }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          { "expr": "jvm_memory_used_bytes{area=\"heap\"} / jvm_memory_max_bytes{area=\"heap\"}" }
        ]
      }
    ]
  }
}
```

---

### 10. Real-World Rollout

#### **Scenario 1: e-commerce monitoring**

```
Requirements:
- Monitor the performance of the order APIs
- Find and optimize slow queries
- Monitor the payment success rate

Approach:
1. Prometheus monitoring
   - QPS, latency, error rate
   - JVM metrics
   - MySQL slow queries

2. Jaeger tracing
   - Order-creation flow
   - Payment flow

3. ELK log analysis
   - Order logs
   - Payment logs

4. Grafana dashboards
   - Business metrics (order volume, payment success rate)
   - Technical metrics (QPS, latency)
```

#### **Scenario 2: microservice tracing**

```
Requirements:
- Trace requests across services
- Locate performance bottlenecks
- Analyze service dependencies

Approach:
1. Spring Cloud Sleuth generates Trace IDs
2. Jaeger collects spans
3. Kibana correlates logs (via the Trace ID)
4. Prometheus monitors each service's performance

Example:
Placing an order
├─ Order service (create the order)
├─ Inventory service (decrement stock)
├─ Payment service (create the payment)
└─ Logistics service (assign shipping)

All services' logs are correlated through the Trace ID
```

---

### 11. Alibaba P7 Bonus Points

**Architecture design**:
- Designed an enterprise-grade observability platform (unified metrics, logs, tracing)
- Experience with multi-cluster, multi-region monitoring architectures
- Implemented custom monitoring agents and collectors

**Deep understanding**:
- Familiar with Prometheus internals (TSDB, storage engine, query engine)
- Understands Elasticsearch internals (Lucene, shards, replicas)
- Has read Jaeger/Zipkin source code

**Performance optimization**:
- Optimized Prometheus query performance (recording rules, federation)
- Optimized Elasticsearch indexing performance (sharding strategy, mapping design)
- Optimized log collection (sampling rates, batched uploads)

**Production experience**:
- Solved massive-scale storage and query problems (downsampling, hot/cold data tiering)
- Implemented intelligent alerting (dynamic thresholds, anomaly detection, machine learning)
- Experience with rapid incident diagnosis (root cause analysis, postmortems)

**Open-source contributions**:
- Submitted PRs to the Prometheus/Grafana/Jaeger communities
- Developed custom exporters
- Wrote technical blog posts or gave talks on the topic

**Observability best practices**:
- Implemented SLOs/SLIs (Service Level Objectives/Indicators)
- Used error budgets to pace releases
- Chaos engineering practice
- Implemented APM (Application Performance Monitoring)

**Business monitoring**:
- Designed business-metric dashboards
- Built real-time data screens (Druid, ClickHouse)
- Experience with user behavior analytics (event tracking, funnel analysis)