interview/questions/11-运维/可观测性.md
yasinshaw 0e46a367c4 refactor: rename files to Chinese and organize by category
2026-03-01 00:10:53 +08:00

# 可观测性 (Observability)
## 问题
**背景**:在分布式系统中,如何快速定位和解决问题成为关键挑战。可观测性通过监控、日志和链路追踪三大支柱,帮助开发和运维团队理解系统内部状态。
**问题**
1. 什么是可观测性?它和监控有什么区别?
2. 监控、日志、链路追踪三大支柱的作用是什么?
3. Prometheus + Grafana 监控架构是如何设计的?
4. ELK（Elasticsearch、Logstash、Kibana）日志栈如何搭建?
5. 分布式追踪（Jaeger/Zipkin）的原理是什么?
6. 如何设计监控告警规则?
7. 如何实现全链路追踪?
8. 如何定位性能瓶颈?
9. 如何设计监控指标体系?
10. 在实际项目中如何落地可观测性?
---
## 标准答案
### 1. 可观测性概述
#### **定义**
```
可观测性（Observability）：
通过系统外部输出（Metrics、Logs、Traces）推断系统内部状态的能力

监控（Monitoring）：
通过预定义的指标检查系统是否正常运行
```
#### **对比**
```
监控（Monitoring）
├─ 主动询问系统状态（预设规则）
├─ 关注已知问题（如 CPU 使用率 > 80%）
└─ 问题:无法发现未知问题

可观测性（Observability）
├─ 被动收集系统输出（数据驱动）
├─ 可以发现未知问题
└─ 支持根因分析（Root Cause Analysis）
```
#### **三大支柱**
```
1. Metrics（指标）：数值型数据
   - Counter（计数器）：请求数、错误数
   - Gauge（仪表盘）：CPU 使用率、内存使用量
   - Histogram（直方图）：请求延迟分布
2. Logs（日志）：离散事件
   - 应用日志:错误日志、调试日志
   - 访问日志:Nginx access.log
   - 审计日志:操作记录
3. Traces（追踪）：请求路径
   - Trace：一次完整的请求（从客户端到后端）
   - Span：单个服务的处理过程
   - Span ID、Trace ID：关联标识
```
---
### 2. 三大支柱详解
#### **Metrics（指标）**
```yaml
# Prometheus 指标示例

# 1. Counter（只增不减）
http_requests_total{method="GET",path="/api/users",status="200"} 12345

# 2. Gauge（可增可减）
memory_usage_bytes{instance="localhost:8080"} 1073741824
cpu_usage_percent{instance="localhost:8080"} 45.2

# 3. Histogram（分布，桶为累计计数）
http_request_duration_seconds_bucket{le="0.1"} 5000
http_request_duration_seconds_bucket{le="0.5"} 9500
http_request_duration_seconds_bucket{le="+Inf"} 10000
```
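上面的 Histogram 桶是累计计数，Prometheus 的 `histogram_quantile` 正是在相邻桶之间做线性插值来估算分位数。下面用一小段 Python 示意插值思路（非 Prometheus 实际实现，仅作演示）:

```python
def histogram_quantile(q, buckets):
    """按累计桶做线性插值估算分位数。
    buckets: [(le 上界, 累计计数)]，按 le 升序，最后一个桶为 +Inf。"""
    total = buckets[-1][1]
    rank = q * total  # 目标分位对应的累计计数
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                # 落在 +Inf 桶时，只能返回最后一个有限上界
                return prev_le
            # 在当前桶内线性插值
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# 对应上面的桶数据:P95 落在 (0.1, 0.5] 桶的右端
buckets = [(0.1, 5000), (0.5, 9500), (float("inf"), 10000)]
print(histogram_quantile(0.95, buckets))  # ≈ 0.5
```

这也解释了为什么桶边界的选取会直接影响分位数估算的精度。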
**代码示例（Spring Boot Actuator + Micrometer）**
```java
@RestController
public class UserController {

    private final UserService userService;
    private final Counter requestCounter;

    public UserController(UserService userService, MeterRegistry registry) {
        this.userService = userService;
        this.requestCounter = Counter.builder("http.requests.total")
                .tag("method", "GET")
                .tag("path", "/api/users")
                .register(registry);
        // Gauge:已用堆内存 = totalMemory - freeMemory
        Gauge.builder("jvm.memory.used", Runtime.getRuntime(),
                      r -> r.totalMemory() - r.freeMemory())
             .register(registry);
    }

    @GetMapping("/api/users")
    public List<User> getUsers() {
        requestCounter.increment();
        return userService.findAll();
    }
}
```
#### **Logs（日志）**
```java
// 结构化日志（JSON 格式，这里使用 logstash-logback-encoder 提供的 StructuredArguments）
import static net.logstash.logback.argument.StructuredArguments.kv;

@Slf4j
@RestController
public class UserController {

    private final UserService userService;

    public UserController(UserService userService) {
        this.userService = userService;
    }

    @GetMapping("/api/users/{id}")
    public User getUserById(@PathVariable Long id) {
        log.info("Get user by id, {}, {}, {}",
                kv("userId", id),
                kv("traceId", MDC.get("traceId")),
                kv("spanId", MDC.get("spanId")));

        User user = userService.findById(id);
        if (user == null) {
            log.warn("User not found, {}, {}",
                    kv("userId", id),
                    kv("traceId", MDC.get("traceId")));
            throw new UserNotFoundException(id);
        }
        return user;
    }
}
```

日志输出（JSON）:
```json
{
  "timestamp": "2024-01-01T10:00:00Z",
  "level": "INFO",
  "logger": "com.example.UserController",
  "message": "Get user by id",
  "userId": 123,
  "traceId": "a1b2c3d4e5f6g7h8",
  "spanId": "i9j0k1l2m3n4o5p6",
  "thread": "http-nio-8080-exec-1"
}
```
#### **Traces（追踪）**
```
Trace（一次完整请求）:
Client → Gateway → Service A → Service B → Service C
  │         │          │           │           │
  └─────────┴──────────┴───────────┴───────────┘
               Trace ID: abc123

Span（单个服务处理）:
Gateway (Span 1)
├─ Service A (Span 2)
│   └─ Service B (Span 3)
│       └─ Service C (Span 4)
```
---
### 3. Prometheus + Grafana 架构
#### **架构图**
```
┌─────────────────┐
│  Applications   │
│  ( exporters )  │
└────────┬────────┘
         │ /metrics（Pull）
         ▼
┌─────────────────┐   触发告警规则    ┌─────────────────┐
│   Prometheus    │ ───────────────▶ │  Alertmanager   │
│（内置 TSDB 存储）│                  │   (告警路由)    │
└────────┬────────┘                  └────────┬────────┘
         │ PromQL 查询                        │ 通知
         ▼                                    ▼
┌─────────────────┐                  ┌─────────────────┐
│     Grafana     │                  │  Email/Webhook  │
│  (可视化仪表盘)  │                  │  钉钉/企业微信   │
└─────────────────┘                  └─────────────────┘
```
#### **Prometheus 配置**
```yaml
# prometheus.yml
global:
  scrape_interval: 15s      # 每 15 秒采集一次
  evaluation_interval: 15s  # 每 15 秒评估告警规则

# 告警规则
rule_files:
  - "alerts/*.yml"

# 抓取配置
scrape_configs:
  # Spring Boot Actuator
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:8080']

  # Kubernetes 服务发现
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

# 告警管理
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```
#### **Spring Boot 集成**
```xml
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```
```yaml
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,info
  metrics:
    export:
      prometheus:
        enabled: true
    tags:
      application: ${spring.application.name}
```
#### **Grafana Dashboard**
```json
{
  "dashboard": {
    "title": "Spring Boot Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count[1m])",
            "legendFormat": "{{method}} {{uri}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Request Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[1m]))",
            "legendFormat": "P95 Latency"
          }
        ],
        "type": "graph"
      },
      {
        "title": "JVM Memory Usage",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"}",
            "legendFormat": "Heap Used"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```
---
### 4. ELK 日志栈
#### **架构图**
```
┌─────────────────┐
│  Applications   │
│   (日志输出)    │
└────────┬────────┘
         │ Filebeat/Logstash
         ▼
┌─────────────────┐
│    Logstash     │
│   (日志处理)    │
├─────────────────┤
│ - 过滤          │
│ - 转换          │
│ - 解析          │
└────────┬────────┘
         │ 写入
         ▼
┌─────────────────┐
│  Elasticsearch  │
│   (日志存储)    │
└────────┬────────┘
         │ 查询
         ▼
┌─────────────────┐
│     Kibana      │
│  (日志可视化)   │
└─────────────────┘
```
#### **Logstash 配置**
```conf
# logstash.conf
input {
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
    codec => json
  }
  beats {
    port => 5044
  }
}

filter {
  # 解析 JSON 日志
  json {
    source => "message"
  }
  # 提取时间戳
  date {
    match => ["timestamp", "ISO8601"]
  }
  # 提取 Trace ID
  grok {
    match => {
      "message" => '"traceId":"%{DATA:traceId}"'
    }
  }
  # 添加应用名称
  mutate {
    add_field => {
      "application" => "my-app"
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "my-app-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => rubydebug
  }
}
```
#### **Filebeat 配置**
```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      app: my-app
      env: production
    # 多行日志（如异常堆栈）合并处理，配置在 input 级别
    multiline.type: pattern
    multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
    multiline.negate: true
    multiline.match: after

output.logstash:
  hosts: ["logstash:5044"]
```
#### **Kibana 查询**
```
# 1. 简单查询
level: "ERROR"

# 2. 范围查询
@timestamp: [now-1h TO now]

# 3. 通配符
message: "*NullPointerException*"

# 4. 正则表达式
message: /.*User \d+ not found.*/

# 5. 聚合分析（在 Kibana 的可视化/Lens 中配置，查询栏本身不支持管道语法）
#    - 按错误级别统计:terms 聚合 level 字段
#    - 按时间统计:date_histogram 聚合 @timestamp，interval 1m
#    - 按服务统计:terms 聚合 appName 字段

# 6. 全链路追踪:查询同一 Trace ID 的所有日志
traceId: "a1b2c3d4e5f6g7h8"
```
---
### 5. 分布式追踪
#### **原理**
```
1. 客户端请求生成 Trace ID
2. 每个服务处理时生成 Span
3. Span 记录:
   - Span ID（当前 Span 唯一 ID）
   - Parent Span ID（父 Span ID）
   - Trace ID（全局 Trace ID）
   - Timestamp（开始时间）
   - Duration（耗时）
   - Tags（标签）
   - Logs（日志）
4. Span 上报到 Jaeger/Zipkin
5. 追踪系统构建调用链
```
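上面第 5 步「构建调用链」的核心就是按 Parent Span ID 把同一条 Trace 的 Span 组装成树。用 Python 勾勒一下（字段名为示意，并非 Jaeger/Zipkin 的真实数据模型）:

```python
from collections import defaultdict

def build_trace_tree(spans):
    """spans:各服务上报的 Span 列表（字段为示意）。
    按 parent_id 分组，得到调用树的邻接表，根 Span 的 parent_id 为 None。"""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span["span_id"])
    return dict(children)

# 一条 Trace（abc123）里的三个 Span:gateway → order → stock
spans = [
    {"trace_id": "abc123", "span_id": "s1", "parent_id": None, "service": "gateway", "duration_ms": 120},
    {"trace_id": "abc123", "span_id": "s2", "parent_id": "s1", "service": "order", "duration_ms": 90},
    {"trace_id": "abc123", "span_id": "s3", "parent_id": "s2", "service": "stock", "duration_ms": 40},
]
print(build_trace_tree(spans))  # {None: ['s1'], 's1': ['s2'], 's2': ['s3']}
```

有了这棵树，就能自顶向下累加耗时，找出调用链上最慢的一段。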
#### **Jaeger 架构**
```
┌─────────────────┐
│  Applications   │
│ (Jaeger Client) │
└────────┬────────┘
         │ UDP/HTTP
         ▼
┌─────────────────┐
│      Agent      │
│   (数据采集)    │
└────────┬────────┘
         ▼
┌─────────────────┐
│    Collector    │
│   (数据处理)    │
└────────┬────────┘
         │ 存储
   ┌─────┼───────────────┐
   ▼     ▼               ▼
┌─────────────┐ ┌───────────┐ ┌───────┐
│Elasticsearch│ │ Cassandra │ │ Kafka │
└─────────────┘ └───────────┘ └───────┘
         │ 查询
         ▼
┌─────────────────┐
│  Query Service  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Web UI      │
└─────────────────┘
```
#### **Spring Boot 集成 Jaeger**
```xml
<!-- pom.xml -->
<dependency>
    <groupId>io.opentracing.contrib</groupId>
    <artifactId>opentracing-spring-jaeger-web-starter</artifactId>
</dependency>
```
```yaml
# application.yml
opentracing:
  jaeger:
    enabled: true
    service-name: my-app
    udp-sender:
      host: jaeger-agent
      port: 6831
    probabilistic-sampler:
      sampling-rate: 0.1  # 10% 采样
```
**代码示例**
```java
@RestController
public class UserController {

    private final Tracer tracer;
    private final UserService userService;

    public UserController(Tracer tracer, UserService userService) {
        this.tracer = tracer;
        this.userService = userService;
    }

    @GetMapping("/api/users/{id}")
    public User getUserById(@PathVariable Long id) {
        // 创建自定义 Span
        Span span = tracer.buildSpan("getUserById")
                .withTag("userId", String.valueOf(id))
                .start();
        try (Scope scope = tracer.scopeManager().activate(span)) {
            User user = userService.findById(id);
            if (user == null) {
                span.setTag("error", true);
                span.log("User not found");
                throw new UserNotFoundException(id);
            }
            return user;
        } finally {
            span.finish();
        }
    }
}
```
#### **Zipkin 集成**
```xml
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>
```
```yaml
# application.yml
spring:
  zipkin:
    base-url: http://zipkin:9411
  sleuth:
    sampler:
      probability: 0.1  # 10% 采样
```
---
### 6. 监控告警规则
#### **Prometheus 告警规则**
```yaml
# alerts.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # 高错误率
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      # 高延迟
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }} seconds"

      # 服务下线
      - alert: ServiceDown
        expr: up{job="spring-boot"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} is down"

      # JVM 内存使用率高
      - alert: HighMemoryUsage
        expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Heap memory usage is {{ $value | humanizePercentage }}"

      # 磁盘空间不足
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low"
          description: "Disk space is {{ $value | humanizePercentage }} available"
```
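规则里反复出现的 `rate(http_requests_total[5m])` 表示 Counter 在时间窗口内的每秒平均增速。可以用 Python 粗略示意其计算方式（忽略 Prometheus 实际的外推与多点拟合逻辑，仅演示概念）:

```python
def simple_rate(samples):
    """samples: [(时间戳秒, counter 值)]，按时间升序。
    返回窗口内的每秒平均增速，近似 PromQL 的 rate()。"""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if v1 < v0:
        # Counter 只增不减，值变小说明发生了重置（如进程重启），增量从 0 重新累计
        v0 = 0
    return (v1 - v0) / (t1 - t0)

# 5 分钟窗口内错误计数从 100 涨到 160 → 0.2 次/秒，超过规则里 0.05 的阈值
samples = [(0, 100), (300, 160)]
print(simple_rate(samples))  # 0.2
```

这也是告警规则用 `rate()` 而不用 Counter 原始值的原因:原始值只会单调增长，无法反映「当前」的错误速率。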
#### **Alertmanager 配置**
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

# 路由配置
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      continue: true
    - match:
        severity: warning
      receiver: 'warning'

# 接收器配置
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook-server/default'
  - name: 'critical'
    webhook_configs:
      - url: 'http://webhook-server/critical'
    email_configs:
      - to: 'oncall@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'
  - name: 'warning'
    webhook_configs:
      - url: 'http://webhook-server/warning'

# 抑制规则
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
```
#### **钉钉告警**
```python
# 钉钉 Webhook 示例
from flask import Flask, request
import requests

app = Flask(__name__)

@app.route('/alertmanager', methods=['POST'])
def alertmanager():
    data = request.json
    for alert in data.get('alerts', []):
        status = alert.get('status')
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})
        message = {
            "msgtype": "markdown",
            "markdown": {
                "title": f"Alert: {labels.get('alertname')}",
                "text": f"""
### {labels.get('alertname')}
**Status:** {status}
**Severity:** {labels.get('severity')}
**Instance:** {labels.get('instance')}
**Summary:** {annotations.get('summary')}
**Description:** {annotations.get('description')}
**Starts:** {alert.get('startsAt')}
"""
            }
        }
        requests.post(
            'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN',
            json=message
        )
    return 'OK'

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
---
### 7. 全链路追踪
#### **实现方案**
```
1. 客户端生成 Trace ID
2. HTTP Header 传递 Trace ID
   - X-Trace-Id
   - X-Span-Id
3. 每个服务记录 Span
4. 异步上报到 Jaeger/Zipkin
5. 追踪系统构建调用链
```
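其中第 1、2 步（复用或生成 Trace ID、逐跳生成 Span ID）可以用与框架无关的 Python 勾勒出来（Header 名沿用正文的 X-Trace-Id/X-Span-Id，实际 Sleuth 默认使用 B3 Header）:

```python
import uuid

def ensure_trace_context(headers):
    """入口处:上游带了 Trace ID 就复用，否则生成新的;
    Span ID 则每一跳各自生成。"""
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]
    return {"X-Trace-Id": trace_id, "X-Span-Id": span_id}

# 上游带了 Trace ID → 原样透传
ctx = ensure_trace_context({"X-Trace-Id": "abc123"})
print(ctx["X-Trace-Id"])  # abc123

# 入口请求没有 Trace ID → 生成一个新的（32 位 hex）
ctx2 = ensure_trace_context({})
print(len(ctx2["X-Trace-Id"]))  # 32
```

核心不变式是:Trace ID 全链路只生成一次，Span ID 每个服务各自生成并记录父 Span。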
#### **Spring Cloud Sleuth 实现**
```java
// 1. 引入 Spring Cloud Sleuth 后，HTTP 请求会自动生成并透传 Trace ID（B3 Header），
//    通常无需额外配置;下面演示手动透传的方式（基于 OpenTracing Tracer）

// 2. RestTemplate 拦截器传递 Trace ID
@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate(Tracer tracer) {
        RestTemplate restTemplate = new RestTemplate();
        restTemplate.setInterceptors(Collections.singletonList(
                (request, body, execution) -> {
                    Span span = tracer.activeSpan();
                    if (span != null) {
                        request.getHeaders().add("X-Trace-Id", span.context().toTraceId());
                        request.getHeaders().add("X-Span-Id", span.context().toSpanId());
                    }
                    return execution.execute(request, body);
                }));
        return restTemplate;
    }
}

// 3. Kafka 消息传递 Trace ID:生产者写入消息 Header，
//    消费者取出后写回 MDC（Sleuth 对 spring-kafka 有开箱即用的集成）
@Component
public class OrderProducer {

    private final KafkaTemplate<String, String> kafkaTemplate;
    private final Tracer tracer;

    public OrderProducer(KafkaTemplate<String, String> kafkaTemplate, Tracer tracer) {
        this.kafkaTemplate = kafkaTemplate;
        this.tracer = tracer;
    }

    public void send(String topic, String payload) {
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, payload);
        Span span = tracer.activeSpan();
        if (span != null) {
            record.headers().add("X-Trace-Id",
                    span.context().toTraceId().getBytes(StandardCharsets.UTF_8));
        }
        kafkaTemplate.send(record);
    }
}

// 4. 数据库侧无法自动携带 Trace ID，常见做法是在 SQL 前附加注释
//    （如 /* traceId=abc123 */ SELECT ...），便于在慢查询日志中关联调用链
```
#### **Trace ID 关联日志**
```java
// 使用 MDC 传递 Trace ID
@Slf4j
@Component
public class TraceIdFilter implements Filter {

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // ServletRequest 没有 getHeader，需要先转成 HttpServletRequest
        String traceId = ((HttpServletRequest) request).getHeader("X-Trace-Id");
        if (traceId == null) {
            traceId = UUID.randomUUID().toString();
        }
        MDC.put("traceId", traceId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.clear();
        }
    }
}
```

Logback 配置:
```xml
<configuration>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - traceId=%X{traceId} - %msg%n</pattern>
        </encoder>
    </appender>
    <root level="INFO">
        <appender-ref ref="CONSOLE" />
    </root>
</configuration>
```
---
### 8. 性能瓶颈定位
#### **定位流程**
```
1. 监控告警（Prometheus）
   - CPU 使用率高
   - 内存使用率高
   - 请求延迟高
2. 链路追踪（Jaeger）
   - 定位慢请求
   - 找出耗时最长的服务
3. 日志分析（ELK）
   - 查找错误日志
   - 分析异常堆栈
4. 性能分析（Profiling）
   - CPU Profiling
   - Memory Profiling
   - Thread Dump
```
#### **案例 1：慢查询定位**
```sql
-- 1. 开启 MySQL 慢查询日志
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- 2. 分析慢查询日志（shell 命令）
-- pt-query-digest /var/log/mysql/slow.log

-- 3. 优化 SQL
-- 添加索引
CREATE INDEX idx_user_email ON users(email);

-- 重写查询:避免对索引列使用函数（LOWER 会导致索引失效）
-- Before
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- After
SELECT * FROM users WHERE email = 'alice@example.com';
```
#### **案例 2：内存泄漏定位**
```bash
# 1. 导出堆转储
jmap -dump:format=b,file=heap.hprof <pid>

# 2. 使用 MAT 分析
# - 查看 Dominator Tree
# - 查找 Leak Suspects
# - 查看 Histogram

# 3. 定位泄漏代码
# - 未关闭的资源（Connection、Stream）
# - 静态集合持有大对象
# - 缓存未设置过期时间
```
#### **案例 3：CPU 高负载定位**
```bash
# 1. 查看 CPU 使用率
top -p <pid>

# 2. 导出线程快照
jstack <pid> > thread.dump

# 3. 查找繁忙线程
top -H -p <pid>          # 找到占用 CPU 最高的线程 ID（tid，十进制）
printf "%x\n" <tid>      # 转换为 16 进制（对应 jstack 输出中的 nid）
grep -A 20 <tid-hex> thread.dump

# 4. 分析代码
# - 死循环
# - 正则表达式（回溯）
# - 大对象序列化
```
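第 3 步「tid 转 16 进制再到线程快照里找」也可以脚本化。下面是一段 Python 示意（以 jstack 输出中的 `nid=0x...` 字段为匹配依据，示例 dump 为手工构造的简化片段）:

```python
def find_thread_by_tid(dump, tid):
    """在 jstack 输出中按十进制 tid 查找对应线程行（jstack 里的 nid 是 16 进制）。"""
    nid_hex = format(tid, "x")
    for line in dump.splitlines():
        if f"nid=0x{nid_hex} " in line:
            return line.strip()
    return None

# 手工构造的简化 jstack 片段
dump = """\
"http-nio-8080-exec-1" #15 daemon prio=5 tid=0x00007f1 nid=0x2f63 runnable
"GC Thread#0" #2 daemon prio=9 tid=0x00007f2 nid=0x2f50 runnable
"""
print(find_thread_by_tid(dump, 0x2f63))  # 命中 http-nio-8080-exec-1 这一行
```

找到线程行后，再看它下方的调用栈即可定位具体的热点代码。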
---
### 9. 监控指标体系
#### **分层指标**
```
1. 基础设施层（Infrastructure）
   - CPU 使用率
   - 内存使用率
   - 磁盘 I/O
   - 网络流量
2. 平台层（Platform）
   - Kubernetes 集群健康
   - Pod 数量
   - Node 状态
3. 中间件层（Middleware）
   - Redis：连接数、命令执行时间、内存使用率
   - MySQL：QPS、慢查询、连接数、主从延迟
   - Kafka：消息积压、消费延迟
4. 应用层（Application）
   - QPS（每秒请求数）
   - Latency（延迟 P50、P95、P99）
   - Error Rate（错误率）
   - Saturation（饱和度）
5. 业务层（Business）
   - 订单量
   - 支付成功率
   - 用户活跃度
```
#### **RED 方法**
```
R - Rate（请求速率）
   - QPS（Queries Per Second）
   - RPS（Requests Per Second）
E - Errors（错误率）
   - HTTP 5xx 错误率
   - 业务异常率
D - Duration（请求耗时）
   - P50（中位数）
   - P95（95 分位）
   - P99（99 分位）
```
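RED 三个指标都可以从同一份请求记录里算出来。用 Python 示意（数据结构为假设的访问日志样本，分位数取法做了简化、未插值）:

```python
def red_metrics(requests, window_seconds):
    """requests: [(http_status, 耗时秒)]。返回 (rate, error_rate, p95)。"""
    n = len(requests)
    rate = n / window_seconds                                  # R: 每秒请求数
    errors = sum(1 for status, _ in requests if status >= 500)
    error_rate = errors / n                                    # E: 5xx 占比
    durations = sorted(d for _, d in requests)
    p95 = durations[max(0, int(n * 0.95) - 1)]                 # D: 简单取 95 分位
    return rate, error_rate, p95

# 60 秒窗口内 100 个请求，其中 5 个 5xx，耗时均匀分布在 0.01s ~ 1.00s
reqs = [(500 if i < 5 else 200, (i + 1) / 100) for i in range(100)]
rate, err, p95 = red_metrics(reqs, 60)
print(rate, err, p95)
```

实际系统中这三个值分别对应前文 Prometheus 的 `rate()`、5xx 比值和 `histogram_quantile(0.95, ...)` 三条查询。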
#### **USE 方法**
```
U - Utilization（资源利用率）
   - CPU 使用率
   - 内存使用率
   - 磁盘使用率
S - Saturation（资源饱和度）
   - CPU 运行队列长度
   - 内存 Swap 使用量
   - 磁盘 I/O 等待时间
E - Errors（错误数）
   - 硬件错误（ECC、磁盘坏道）
   - 软件错误（OOM、连接超时）
```
#### **Grafana Dashboard 示例**
```json
{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "QPS",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[1m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[1m])) / sum(rate(http_requests_total[1m]))"
          }
        ]
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[1m]))"
          }
        ]
      },
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total[1m])"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"} / jvm_memory_max_bytes{area=\"heap\"}"
          }
        ]
      }
    ]
  }
}
```
---
### 10. 实际项目落地
#### **场景 1：电商系统监控**
```
需求:
- 监控订单接口性能
- 发现慢查询并优化
- 监控支付成功率

方案:
1. Prometheus 监控
   - QPS、延迟、错误率
   - JVM 指标
   - MySQL 慢查询
2. Jaeger 链路追踪
   - 订单创建流程
   - 支付流程
3. ELK 日志分析
   - 订单日志
   - 支付日志
4. Grafana 仪表盘
   - 业务指标（订单量、支付成功率）
   - 技术指标（QPS、延迟）
```
#### **场景 2：微服务链路追踪**
```
需求:
- 追踪跨服务请求
- 定位性能瓶颈
- 分析服务依赖

方案:
1. Spring Cloud Sleuth 生成 Trace ID
2. Jaeger 收集 Span
3. Kibana 关联日志（通过 Trace ID）
4. Prometheus 监控每个服务性能

示例:
用户下单
├─ 订单服务（创建订单）
├─ 库存服务（扣减库存）
├─ 支付服务（创建支付）
└─ 物流服务（分配物流）
通过 Trace ID 关联所有服务的日志
```
---
### 11. 阿里 P7 加分项
**架构设计能力**
- 设计过企业级可观测性平台(统一监控、日志、追踪)
- 有多集群、多地域的监控架构经验
- 实现过自定义监控 Agent 和 Collector
**深度理解**
- 熟悉 Prometheus 内部机制（TSDB、存储引擎、查询引擎）
- 理解 Elasticsearch 底层原理（Lucene、分片、副本）
- 有 Jaeger/Zipkin 源码阅读经验
**性能优化**
- 优化过 Prometheus 查询性能（Recording Rules、联邦）
- 优化过 Elasticsearch 索引性能（分片策略、Mapping 设计）
- 优化过日志采集性能（采样率、批量上传）
**生产实践**
- 解决过海量数据存储和查询问题(数据降采样、冷热分离)
- 实现过智能告警(动态阈值、异常检测、机器学习)
- 有故障快速定位经验(根因分析、故障复盘)
**开源贡献**
- 向 Prometheus/Grafana/Jaeger 社区提交过 PR
- 开发过自定义 Exporter
- 编写过相关技术博客或演讲
**可观测性最佳实践**
- 实现 SLO/SLI（Service Level Objective/Indicator）
- 使用 Error Budget 管理发布节奏
- 有混沌工程实践（Chaos Engineering）
- 实现 APM（Application Performance Monitoring）
**业务监控**
- 设计过业务指标大盘
- 实现过实时数据大屏（Druid、ClickHouse）
- 有用户行为分析经验(埋点、漏斗分析)