# Observability

## Questions

**Background**: In distributed systems, quickly locating and resolving problems is a key challenge. Observability helps development and operations teams understand a system's internal state through its three pillars: metrics, logs, and traces.

**Questions**:
1. What is observability, and how does it differ from monitoring?
2. What roles do the three pillars (metrics, logs, traces) play?
3. How is a Prometheus + Grafana monitoring architecture designed?
4. How do you build an ELK (Elasticsearch, Logstash, Kibana) logging stack?
5. How does distributed tracing (Jaeger/Zipkin) work?
6. How do you design monitoring alert rules?
7. How do you implement end-to-end tracing?
8. How do you locate performance bottlenecks?
9. How do you design a monitoring metrics hierarchy?
10. How do you roll out observability in a real project?
---
## Standard Answers

### 1. Observability Overview

#### **Definition**:
```
Observability:
The ability to infer a system's internal state from its external outputs (Metrics, Logs, Traces)

Monitoring:
Checking whether the system is running normally via predefined metrics
```

#### **Comparison**:
```
Monitoring:
├─ Actively polls system state (predefined rules)
├─ Focuses on known problems (e.g. CPU usage > 80%)
└─ Limitation: cannot surface unknown problems

Observability:
├─ Passively collects system outputs (data-driven)
├─ Can surface unknown problems
└─ Supports root cause analysis
```

#### **The Three Pillars**:
```
1. Metrics: numeric data
   - Counter: request count, error count
   - Gauge: CPU usage, memory in use
   - Histogram: request latency distribution

2. Logs: discrete events
   - Application logs: error logs, debug logs
   - Access logs: Nginx access.log
   - Audit logs: operation records

3. Traces: request paths
   - Trace: one complete request (from client to backend)
   - Span: the processing done by a single service
   - Span ID, Trace ID: correlation identifiers
```
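The histogram pillar stores cumulative bucket counts rather than raw latencies; percentiles are then estimated by interpolating inside a bucket, which is what PromQL's `histogram_quantile` does. A minimal Python sketch of that estimation (bucket boundaries and counts are illustrative):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets,
    mirroring PromQL's linear interpolation within a bucket."""
    buckets = sorted(buckets, key=lambda b: b[0])
    total = buckets[-1][1]              # the +Inf bucket holds the total count
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le          # cannot interpolate into +Inf
            in_bucket = count - prev_count
            frac = (rank - prev_count) / in_bucket if in_bucket else 1.0
            return prev_le + (le - prev_le) * frac
        prev_le, prev_count = le, count
    return buckets[-1][0]

buckets = [(0.1, 5000), (0.5, 9500), (float("inf"), 10000)]
print(histogram_quantile(0.95, buckets))  # 0.5
```

This is why bucket boundaries must be chosen near the latencies you care about: the estimate can never be more precise than the bucket it lands in.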
---
### 2. The Three Pillars in Detail

#### **Metrics**:
```yaml
# Prometheus metric examples

# 1. Counter (monotonically increasing)
http_requests_total{method="GET",path="/api/users",status="200"} 12345

# 2. Gauge (can go up or down)
memory_usage_bytes{instance="localhost:8080"} 1073741824
cpu_usage_percent{instance="localhost:8080"} 45.2

# 3. Histogram (distribution)
http_request_duration_seconds_bucket{le="0.1"} 5000
http_request_duration_seconds_bucket{le="0.5"} 9500
http_request_duration_seconds_bucket{le="+Inf"} 10000
```

**Code example (Micrometer with Spring Boot Actuator)**:
```java
@RestController
public class UserController {

    private final UserService userService;
    private final Counter requestCounter;

    public UserController(UserService userService, MeterRegistry registry) {
        this.userService = userService;
        this.requestCounter = Counter.builder("http.requests.total")
                .tag("method", "GET")
                .tag("path", "/api/users")
                .register(registry);

        // Gauge sampling the currently used heap (not total capacity)
        Gauge.builder("jvm.memory.used", Runtime.getRuntime(),
                        rt -> rt.totalMemory() - rt.freeMemory())
                .register(registry);
    }

    @GetMapping("/api/users")
    public List<User> getUsers() {
        requestCounter.increment();
        return userService.findAll();
    }
}
```
#### **Logs**:
```java
// Structured logging (JSON output via logstash-logback-encoder)
import static net.logstash.logback.argument.StructuredArguments.kv;

@Slf4j
@RestController
public class UserController {

    private final UserService userService;

    public UserController(UserService userService) {
        this.userService = userService;
    }

    @GetMapping("/api/users/{id}")
    public User getUserById(@PathVariable Long id) {
        // traceId/spanId are put into the MDC by the tracing filter, so the
        // JSON encoder adds them to every log line automatically
        log.info("Get user by id", kv("userId", id));

        User user = userService.findById(id);
        if (user == null) {
            log.warn("User not found", kv("userId", id));
            throw new UserNotFoundException(id);
        }
        return user;
    }
}

// Log output
{
  "timestamp": "2024-01-01T10:00:00Z",
  "level": "INFO",
  "logger": "com.example.UserController",
  "message": "Get user by id",
  "userId": 123,
  "traceId": "a1b2c3d4e5f6g7h8",
  "spanId": "i9j0k1l2m3n4o5p6",
  "thread": "http-nio-8080-exec-1"
}
```
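For comparison outside the JVM, the same structured-logging idea fits in a few lines of standard-library Python: each record is rendered as one JSON line so a log pipeline can index individual fields. The `JsonFormatter` class and field names here are illustrative:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line, carrying any extra
    context fields attached to the record."""
    def format(self, record):
        doc = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        doc.update(getattr(record, "ctx", {}))   # merge structured context
        return json.dumps(doc)

# Wire the formatter to an in-memory stream so the output is visible here
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("Get user by id", extra={"ctx": {"userId": 123, "traceId": "a1b2"}})
print(buf.getvalue())
```

In production the stream would be stdout or a file tailed by Filebeat; the point is that `userId` and `traceId` become queryable fields, not substrings inside a message.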
#### **Traces**:
```
Trace (one complete request):
Client → Gateway → Service A → Service B → Service C
  │        │           │            │            │
  └────────┴───────────┴────────────┴────────────┘
                 Trace ID: abc123

Span (one service's processing):
Gateway (Span 1)
├─ Service A (Span 2)
│   └─ Service B (Span 3)
│       └─ Service C (Span 4)
```
---
### 3. Prometheus + Grafana Architecture

#### **Architecture** (note that alerts are fired by Prometheus itself, not by Grafana):
```
┌─────────────────┐
│  Applications   │
│  ( exporters )  │
└─────────────────┘
        │
        │ /metrics (pull)
        ▼
┌─────────────────┐    alerts    ┌─────────────────┐
│   Prometheus    │ ───────────▶ │  Alertmanager   │
│ (scrape + TSDB) │              │ (alert routing) │
└─────────────────┘              └─────────────────┘
        │                                │
        │ PromQL queries                 │ notify
        ▼                                ▼
┌─────────────────┐              ┌─────────────────┐
│    Grafana      │              │ Email / Webhook │
│  (dashboards)   │              │ DingTalk/WeCom  │
└─────────────────┘              └─────────────────┘
```
#### **Prometheus configuration**:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s      # scrape every 15 seconds
  evaluation_interval: 15s  # evaluate alert rules every 15 seconds

# Alert rules
rule_files:
  - "alerts/*.yml"

# Scrape configuration
scrape_configs:
  # Spring Boot Actuator
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:8080']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

# Alerting
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```
#### **Spring Boot integration**:
```xml
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```

```yaml
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,info
  metrics:
    export:
      prometheus:
        enabled: true
    tags:
      application: ${spring.application.name}
```
#### **Grafana Dashboard**:
```json
{
  "dashboard": {
    "title": "Spring Boot Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count[1m])",
            "legendFormat": "{{method}} {{uri}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Request Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[1m]))",
            "legendFormat": "P95 Latency"
          }
        ],
        "type": "graph"
      },
      {
        "title": "JVM Memory Usage",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"}",
            "legendFormat": "Heap Used"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```
---
### 4. The ELK Logging Stack

#### **Architecture**:
```
┌─────────────────┐
│  Applications   │
│  (log output)   │
└─────────────────┘
        │
        │ Filebeat / Logstash
        ▼
┌─────────────────┐
│    Logstash     │
│ (log pipeline)  │
├─────────────────┤
│ - filter        │
│ - transform     │
│ - parse         │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│ Elasticsearch   │
│ (log storage)   │
└─────────────────┘
        │
        │ query
        ▼
┌─────────────────┐
│     Kibana      │
│ (visualization) │
└─────────────────┘
```
#### **Logstash configuration**:
```conf
# logstash.conf
input {
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
    codec => json
  }

  beats {
    port => 5044
  }
}

filter {
  # Parse JSON logs (for events that arrive as plain text)
  json {
    source => "message"
  }

  # Parse the timestamp
  date {
    match => ["timestamp", "ISO8601"]
  }

  # Extract the trace ID
  grok {
    match => {
      "message" => '"traceId":"%{DATA:traceId}"'
    }
  }

  # Add the application name
  mutate {
    add_field => {
      "application" => "my-app"
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "my-app-%{+YYYY.MM.dd}"
  }

  stdout {
    codec => rubydebug
  }
}
```
#### **Filebeat configuration** (note that `multiline.*` options belong under the input, not at the top level):
```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      app: my-app
      env: production
    # Multiline handling (e.g. stack traces): lines that do not start
    # with a date are appended to the previous event
    multiline.type: pattern
    multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
    multiline.negate: true
    multiline.match: after

output.logstash:
  hosts: ["logstash:5044"]
```
#### **Kibana queries**:
```
# 1. Simple query
level: "ERROR"

# 2. Range query
@timestamp: [now-1h TO now]

# 3. Wildcard
message: "*NullPointerException*"

# 4. Regular expression
message: /.*User \d+ not found.*/

# 5. Aggregations (built as visualizations, not query-bar syntax)
# Count by error level:  terms aggregation on "level"
# Count over time:       date histogram on @timestamp, interval 1m
# Count by service:      terms aggregation on "appName"

# 6. End-to-end tracing
# Find all log lines belonging to one trace
traceId: "a1b2c3d4e5f6g7h8"
```
---
### 5. Distributed Tracing

#### **How it works**:
```
1. The client request generates a Trace ID
2. Each service generates a Span as it handles the request
3. A Span records:
   - Span ID (unique ID of this Span)
   - Parent Span ID
   - Trace ID (global)
   - Timestamp (start time)
   - Duration
   - Tags
   - Logs
4. Spans are reported to Jaeger/Zipkin
5. The tracing backend reconstructs the call chain
```
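The steps above can be sketched in a few lines of Python: a root context is created at the edge, every downstream hop keeps the trace ID but opens a new span linked to its parent. The `X-Trace-Id`/`X-Span-Id` header names mirror the convention used later in this document, and all function names are illustrative:

```python
import secrets

def new_trace_context():
    """Root context, created where the request enters the system:
    fresh trace ID, fresh span ID, no parent."""
    return {
        "trace_id": secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_span_id": None,
    }

def child_context(parent):
    """Each downstream service keeps the trace ID and links its new
    span back to the caller's span."""
    return {
        "trace_id": parent["trace_id"],
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent["span_id"],
    }

def to_headers(ctx):
    """Serialize the context for HTTP propagation."""
    return {"X-Trace-Id": ctx["trace_id"], "X-Span-Id": ctx["span_id"]}

# Client → Service A → Service B: one trace, three spans
root = new_trace_context()
svc_a = child_context(root)
svc_b = child_context(svc_a)
```

The backend only ever sees individual spans; the shared `trace_id` plus the `parent_span_id` links are what let it rebuild the tree.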
#### **Jaeger architecture**:
```
┌─────────────────┐
│  Applications   │
│ (Jaeger Client) │
└─────────────────┘
        │
        │ UDP/HTTP
        ▼
┌─────────────────┐
│      Agent      │
│  (span intake)  │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│    Collector    │
│  (processing)   │
└─────────────────┘
        │
┌───────┴─────────┬──────────────┐
│                 │              │
┌─────────────┐ ┌──────────┐ ┌──────────┐
│Elasticsearch│ │ Cassandra│ │  Kafka   │
└─────────────┘ └──────────┘ └──────────┘
        │
        │ query
        ▼
┌─────────┐
│  Query  │
│ Service │
└─────────┘
        │
        ▼
┌─────────┐
│ Web UI  │
└─────────┘
```
#### **Spring Boot + Jaeger integration**:
```xml
<!-- pom.xml -->
<dependency>
    <groupId>io.opentracing.contrib</groupId>
    <artifactId>opentracing-spring-jaeger-web-starter</artifactId>
</dependency>
```

```yaml
# application.yml
opentracing:
  jaeger:
    enabled: true
    service-name: my-app
    udp-sender:
      host: jaeger-agent
      port: 6831
    probabilistic-sampler:
      sampling-rate: 0.1   # sample 10% of traces
```

**Code example**:
```java
@RestController
public class UserController {

    private final UserService userService;
    private final Tracer tracer;

    public UserController(UserService userService, Tracer tracer) {
        this.userService = userService;
        this.tracer = tracer;
    }

    @GetMapping("/api/users/{id}")
    public User getUserById(@PathVariable Long id) {
        // Create a custom Span
        Span span = tracer.buildSpan("getUserById")
                .withTag("userId", id)
                .start();

        try (Scope scope = tracer.scopeManager().activate(span)) {
            User user = userService.findById(id);
            if (user == null) {
                span.setTag("error", true);
                span.log("User not found");
                throw new UserNotFoundException(id);
            }
            return user;
        } finally {
            span.finish();
        }
    }
}
```
#### **Zipkin integration**:
```xml
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>
```

```yaml
# application.yml
spring:
  zipkin:
    base-url: http://zipkin:9411
  sleuth:
    sampler:
      probability: 0.1   # sample 10% of traces
```
---
### 6. Monitoring Alert Rules

#### **Prometheus alert rules**:
```yaml
# alerts.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }} seconds"

      # Service down
      - alert: ServiceDown
        expr: up{job="spring-boot"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} is down"

      # High JVM heap usage
      - alert: HighMemoryUsage
        expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Heap memory usage is {{ $value | humanizePercentage }}"

      # Low disk space
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low"
          description: "Only {{ $value | humanizePercentage }} of disk space is available"
```
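The `for:` clause is what separates a transient spike from a sustained breach: the expression must stay over the threshold across consecutive evaluation rounds before the alert moves from pending to firing. A small Python sketch of that state machine (a deliberate simplification of Prometheus's actual evaluation loop; names and sample values are illustrative):

```python
def alert_state(samples, threshold, for_seconds, interval_seconds):
    """Replay one evaluation per sample and track the alert state.
    The alert fires only once the expression has breached the threshold
    continuously for the whole 'for' duration; any recovery resets it."""
    breached_for = 0
    state = "inactive"
    for value in samples:
        if value > threshold:
            breached_for += interval_seconds
            state = "firing" if breached_for >= for_seconds else "pending"
        else:
            breached_for = 0            # one good round resets the timer
            state = "inactive"
    return state

# 0.06 errors/sec against the 0.05 threshold, evaluated every 30s
print(alert_state([0.06] * 4, 0.05, 300, 30))    # pending (120s < 5m)
print(alert_state([0.06] * 10, 0.05, 300, 30))   # firing
```

The reset on recovery is why a flapping signal can sit in pending forever without firing; `group_wait`/`repeat_interval` in Alertmanager then shape how often the firing alert is actually delivered.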
#### **Alertmanager configuration**:
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

# Routing
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

  routes:
    - match:
        severity: critical
      receiver: 'critical'
      continue: true

    - match:
        severity: warning
      receiver: 'warning'

# Receivers
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook-server/default'

  - name: 'critical'
    webhook_configs:
      - url: 'http://webhook-server/critical'
    email_configs:
      - to: 'oncall@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'

  - name: 'warning'
    webhook_configs:
      - url: 'http://webhook-server/warning'

# Inhibition: a firing critical alert mutes matching warnings
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
```
#### **DingTalk alerts**:
```python
# DingTalk webhook receiver for Alertmanager
from flask import Flask, request
import requests

app = Flask(__name__)

@app.route('/alertmanager', methods=['POST'])
def alertmanager():
    data = request.json

    for alert in data.get('alerts', []):
        status = alert.get('status')
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})

        message = {
            "msgtype": "markdown",
            "markdown": {
                "title": f"Alert: {labels.get('alertname')}",
                "text": f"""
### {labels.get('alertname')}

**Status:** {status}
**Severity:** {labels.get('severity')}
**Instance:** {labels.get('instance')}

**Summary:** {annotations.get('summary')}
**Description:** {annotations.get('description')}

**Starts:** {alert.get('startsAt')}
"""
            }
        }

        requests.post(
            'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN',
            json=message
        )

    return 'OK'

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
---
### 7. End-to-End Tracing

#### **Implementation approach**:
```
1. The client generates a Trace ID
2. The Trace ID is propagated via HTTP headers
   - X-Trace-Id
   - X-Span-Id
3. Each service records a Span
4. Spans are reported asynchronously to Jaeger/Zipkin
5. The tracing backend reconstructs the call chain
```
#### **Spring Cloud Sleuth implementation**:
```java
// 1. With spring-cloud-starter-sleuth on the classpath, trace/span IDs are
//    generated and propagated automatically for supported clients. The code
//    below shows the equivalent manual propagation.

// 2. RestTemplate: propagate the trace context on outgoing HTTP requests
@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate(Tracer tracer) {
        RestTemplate restTemplate = new RestTemplate();
        restTemplate.setInterceptors(Collections.singletonList(
                (request, body, execution) -> {
                    Span span = tracer.activeSpan();
                    if (span != null) {
                        request.getHeaders().add("X-Trace-Id", span.context().toTraceId());
                        request.getHeaders().add("X-Span-Id", span.context().toSpanId());
                    }
                    return execution.execute(request, body);
                }));
        return restTemplate;
    }
}

// 3. Kafka: propagate the trace context in message headers
@Component
public class TracingKafkaSender {

    private final KafkaTemplate<String, String> kafkaTemplate;
    private final Tracer tracer;

    public TracingKafkaSender(KafkaTemplate<String, String> kafkaTemplate, Tracer tracer) {
        this.kafkaTemplate = kafkaTemplate;
        this.tracer = tracer;
    }

    public void send(String topic, String payload) {
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, payload);
        Span span = tracer.activeSpan();
        if (span != null) {
            record.headers().add("X-Trace-Id",
                    span.context().toTraceId().getBytes(StandardCharsets.UTF_8));
        }
        kafkaTemplate.send(record);
    }
}

// 4. Database: attach the trace ID to each query as a SQL comment so it
//    shows up in the slow-query log. (Do NOT bake it into the pool's
//    connection-init SQL: that runs once per connection, not per request.)
public String withTraceComment(String sql, Tracer tracer) {
    Span span = tracer.activeSpan();
    return span == null ? sql
            : "/* traceId=" + span.context().toTraceId() + " */ " + sql;
}
```
#### **Correlating logs with the Trace ID**:
```java
// Use the MDC to carry the Trace ID into every log line
@Slf4j
@Component
public class TraceIdFilter implements Filter {

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        String traceId = httpRequest.getHeader("X-Trace-Id");
        if (traceId == null) {
            traceId = UUID.randomUUID().toString();
        }

        MDC.put("traceId", traceId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.clear();
        }
    }
}
```

```xml
<!-- Logback configuration -->
<configuration>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - traceId=%X{traceId} - %msg%n</pattern>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="CONSOLE" />
    </root>
</configuration>
```
---
### 8. Locating Performance Bottlenecks

#### **Workflow**:
```
1. Monitoring alerts (Prometheus)
   - High CPU usage
   - High memory usage
   - High request latency

2. Distributed tracing (Jaeger)
   - Find slow requests
   - Identify the slowest service in the chain

3. Log analysis (ELK)
   - Find error logs
   - Analyze exception stack traces

4. Profiling
   - CPU profiling
   - Memory profiling
   - Thread dumps
```
#### **Case 1: Slow queries**
```sql
-- 1. Enable the MySQL slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- 2. Analyze the slow query log (run from a shell, not the MySQL client)
-- pt-query-digest /var/log/mysql/slow.log

-- 3. Optimize the SQL
-- Add an index
CREATE INDEX idx_user_email ON users(email);

-- Rewrite the query so the index can be used
-- Before: LOWER() on the column prevents index use
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';

-- After (equivalent when the column collation is case-insensitive)
SELECT * FROM users WHERE email = 'alice@example.com';
```
#### **Case 2: Memory leaks**
```bash
# 1. Dump the heap
jmap -dump:format=b,file=heap.hprof <pid>

# 2. Analyze with Eclipse MAT
# - Inspect the Dominator Tree
# - Run Leak Suspects
# - Inspect the Histogram

# 3. Typical leak sources
# - Unclosed resources (Connection, Stream)
# - Static collections holding large objects
# - Caches without an eviction/expiry policy
```
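The third leak source, an unbounded cache, is usually fixed by bounding it with LRU eviction (or a TTL). A minimal Python sketch of the bounded variant (the class name is illustrative; in Java the same idea is a `LinkedHashMap` with `removeEldestEntry`, or a library cache such as Caffeine):

```python
from collections import OrderedDict

class BoundedCache:
    """A size-bounded LRU cache: entries beyond max_size evict the
    least-recently-used one instead of growing forever."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._data = OrderedDict()

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)            # mark as most recently used
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)     # evict least recently used

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]

    def __len__(self):
        return len(self._data)

cache = BoundedCache(max_size=10)
for i in range(100):
    cache.put(i, f"value-{i}")
```

After 100 inserts the cache still holds only 10 entries, so its memory footprint is capped; the heap dump for a leaky cache would instead show the whole collection in the Dominator Tree.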
#### **Case 3: High CPU load**
```bash
# 1. Find the busiest threads of the process
top -H -p <pid>

# 2. Dump the threads
jstack <pid> > thread.dump

# 3. Match the busy thread in the dump
printf "%x\n" <tid>        # convert the thread ID to hex (nid=0x... in the dump)
grep -A 20 <tid-hex> thread.dump

# 4. Common culprits in the code
# - Infinite loops
# - Regular expressions (catastrophic backtracking)
# - Serializing large objects
```
---
### 9. Metrics Hierarchy

#### **Layered metrics**:
```
1. Infrastructure
   - CPU usage
   - Memory usage
   - Disk I/O
   - Network traffic

2. Platform
   - Kubernetes cluster health
   - Pod counts
   - Node status

3. Middleware
   - Redis: connections, command latency, memory usage
   - MySQL: QPS, slow queries, connections, replication lag
   - Kafka: message backlog, consumer lag

4. Application
   - QPS (queries per second)
   - Latency (P50, P95, P99)
   - Error rate
   - Saturation

5. Business
   - Order volume
   - Payment success rate
   - User activity
```
#### **The RED method**:
```
R - Rate (request rate)
    - QPS (Queries Per Second)
    - RPS (Requests Per Second)

E - Errors (error rate)
    - HTTP 5xx error rate
    - Business exception rate

D - Duration (request latency)
    - P50 (median)
    - P95 (95th percentile)
    - P99 (99th percentile)
```

#### **The USE method**:
```
U - Utilization (resource utilization)
    - CPU usage
    - Memory usage
    - Disk usage

S - Saturation (resource saturation)
    - CPU run-queue length
    - Memory swap usage
    - Disk I/O wait time

E - Errors (error counts)
    - Hardware errors (ECC, bad sectors)
    - Software errors (OOM, connection timeouts)
```
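The three RED numbers can be computed directly from raw request samples, which makes the method concrete. A small Python sketch (the function name and sample data are illustrative; the percentile uses nearest-rank on sorted latencies rather than interpolation):

```python
def red_summary(requests, window_seconds):
    """Compute Rate / Errors / Duration from (http_status, latency_seconds)
    samples observed in one time window."""
    latencies = sorted(latency for _, latency in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    p95_index = int(0.95 * (len(latencies) - 1))
    return {
        "rate": len(requests) / window_seconds,   # R: requests per second
        "error_rate": errors / len(requests),     # E: fraction of 5xx responses
        "p95_latency": latencies[p95_index],      # D: 95th-percentile latency
    }

# 95 fast successes and 5 slow server errors over a 60-second window
samples = [(200, 0.05)] * 95 + [(500, 0.40)] * 5
print(red_summary(samples, 60))
```

In production these numbers come from the Prometheus expressions shown earlier (`rate(...)`, a 5xx/total ratio, `histogram_quantile`); the sketch just shows what those expressions compute.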
#### **Grafana dashboard example**:
```json
{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "QPS",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[1m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[1m])) / sum(rate(http_requests_total[1m]))"
          }
        ]
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[1m]))"
          }
        ]
      },
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total[1m])"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"} / jvm_memory_max_bytes{area=\"heap\"}"
          }
        ]
      }
    ]
  }
}
```
---
### 10. Rolling Out Observability in a Real Project

#### **Scenario 1: E-commerce system monitoring**
```
Requirements:
- Monitor order API performance
- Find and optimize slow queries
- Monitor the payment success rate

Approach:
1. Prometheus monitoring
   - QPS, latency, error rate
   - JVM metrics
   - MySQL slow queries

2. Jaeger tracing
   - Order-creation flow
   - Payment flow

3. ELK log analysis
   - Order logs
   - Payment logs

4. Grafana dashboards
   - Business metrics (order volume, payment success rate)
   - Technical metrics (QPS, latency)
```

#### **Scenario 2: Microservice call-chain tracing**
```
Requirements:
- Trace cross-service requests
- Locate performance bottlenecks
- Analyze service dependencies

Approach:
1. Spring Cloud Sleuth generates the Trace ID
2. Jaeger collects the Spans
3. Kibana correlates logs (via the Trace ID)
4. Prometheus monitors each service's performance

Example:
A user places an order
├─ Order service (create order)
├─ Inventory service (decrement stock)
├─ Payment service (create payment)
└─ Logistics service (assign shipment)

All services' logs are correlated through the Trace ID
```
---
### 11. Alibaba P7 Bonus Points

**Architecture design**:
- Designed an enterprise-grade observability platform (unified metrics, logs, and traces)
- Experience with multi-cluster, multi-region monitoring architectures
- Implemented custom monitoring agents and collectors

**Depth of understanding**:
- Familiar with Prometheus internals (TSDB, storage engine, query engine)
- Understand Elasticsearch fundamentals (Lucene, shards, replicas)
- Have read Jaeger/Zipkin source code

**Performance tuning**:
- Optimized Prometheus query performance (recording rules, federation)
- Optimized Elasticsearch indexing performance (shard strategy, mapping design)
- Optimized log collection performance (sampling rates, batched uploads)

**Production experience**:
- Solved large-scale storage and query problems (downsampling, hot/cold tiering)
- Implemented smart alerting (dynamic thresholds, anomaly detection, machine learning)
- Experience with fast incident localization (root cause analysis, postmortems)

**Open-source contributions**:
- Submitted PRs to the Prometheus/Grafana/Jaeger communities
- Developed custom exporters
- Wrote technical blog posts or gave talks on the topic

**Observability best practices**:
- Implemented SLOs/SLIs (Service Level Objectives/Indicators)
- Used error budgets to pace releases
- Practiced chaos engineering
- Implemented APM (Application Performance Monitoring)

**Business monitoring**:
- Designed business-metric dashboards
- Built real-time data screens (Druid, ClickHouse)
- Experience with user behavior analytics (event tracking, funnel analysis)