Observability
Questions
Background: in a distributed system, quickly locating and resolving problems is a key challenge. Observability helps development and operations teams understand the internal state of a system through its three pillars: metrics, logs, and traces.
Questions:
- What is observability? How does it differ from monitoring?
- What are the roles of the three pillars: metrics, logs, and traces?
- How is a Prometheus + Grafana monitoring architecture designed?
- How do you set up an ELK (Elasticsearch, Logstash, Kibana) logging stack?
- How does distributed tracing (Jaeger/Zipkin) work?
- How do you design monitoring alert rules?
- How do you implement end-to-end (full-link) tracing?
- How do you locate performance bottlenecks?
- How do you design a metrics system?
- How do you roll out observability in a real project?
Standard Answer
1. Observability Overview
Definition:
Observability:
The ability to infer a system's internal state from its external outputs (Metrics, Logs, Traces)
Monitoring:
Checking whether a system is running normally against predefined metrics
Comparison:
Monitoring:
├─ Actively polls system state (preset rules)
├─ Focuses on known problems (e.g. CPU usage > 80%)
└─ Limitation: cannot surface unknown problems
Observability:
├─ Passively collects system outputs (data-driven)
├─ Can surface unknown problems
└─ Supports root cause analysis (RCA)
Three pillars:
1. Metrics: numeric data
   - Counter: request count, error count
   - Gauge: CPU usage, memory usage
   - Histogram: request latency distribution
2. Logs: discrete events
   - Application logs: error logs, debug logs
   - Access logs: Nginx access.log
   - Audit logs: operation records
3. Traces: request paths
   - Trace: one complete request (from client to backend)
   - Span: the processing inside a single service
   - Span ID, Trace ID: correlation identifiers
2. The Three Pillars in Detail
Metrics:
# Prometheus metric examples
# 1. Counter (monotonically increasing)
http_requests_total{method="GET",path="/api/users",status="200"} 12345
# 2. Gauge (can go up or down)
memory_usage_bytes{instance="localhost:8080"} 1073741824
cpu_usage_percent{instance="localhost:8080"} 45.2
# 3. Histogram (cumulative latency distribution)
http_request_duration_seconds_bucket{le="0.1"} 5000
http_request_duration_seconds_bucket{le="0.5"} 9500
http_request_duration_seconds_bucket{le="+Inf"} 10000
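As a worked example of how PromQL's histogram_quantile() reads these cumulative buckets: for the P90, the target rank is 0.9 × 10000 = 9000 samples, which falls between le="0.1" (5000) and le="0.5" (9500). Linear interpolation inside that bucket gives 0.1 + (9000 − 5000) / (9500 − 5000) × (0.5 − 0.1) ≈ 0.456 s.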
Code example (Spring Boot Actuator / Micrometer):
@RestController
public class UserController {
    private final Counter requestCounter;
    public UserController(MeterRegistry registry) {
        this.requestCounter = Counter.builder("http.requests.total")
                .tag("method", "GET")
                .tag("path", "/api/users")
                .register(registry);
        // Gauge sampling the used heap (total - free)
        Gauge.builder("jvm.memory.used", Runtime.getRuntime(),
                        r -> r.totalMemory() - r.freeMemory())
                .register(registry);
    }
    @GetMapping("/api/users")
    public List<User> getUsers() {
        requestCounter.increment();
        return userService.findAll();
    }
}
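For the Histogram pillar, Micrometer's Timer is the usual way to record latency distributions. A minimal sketch (the OrderService class and the metric name order.create.duration are illustrative):
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class OrderService {
    private final Timer orderTimer;

    public OrderService(MeterRegistry registry) {
        // publishPercentileHistogram() exposes buckets usable with histogram_quantile()
        this.orderTimer = Timer.builder("order.create.duration")
                .publishPercentileHistogram()
                .register(registry);
    }

    public void createOrder() {
        // Records the wall-clock duration of the wrapped work
        orderTimer.record(() -> {
            // ... business logic being timed ...
        });
    }
}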
Logs:
// Structured logging; SLF4J has no fluent context API, so key=value
// placeholders plus the MDC are used here
@Slf4j
@RestController
public class UserController {
    @GetMapping("/api/users/{id}")
    public User getUserById(@PathVariable Long id) {
        log.info("Get user by id, userId={}, traceId={}, spanId={}",
                id, MDC.get("traceId"), MDC.get("spanId"));
        User user = userService.findById(id);
        if (user == null) {
            log.warn("User not found, userId={}, traceId={}",
                    id, MDC.get("traceId"));
            throw new UserNotFoundException(id);
        }
        return user;
    }
}
// Log output (JSON format)
{
  "timestamp": "2024-01-01T10:00:00Z",
  "level": "INFO",
  "logger": "com.example.UserController",
  "message": "Get user by id",
  "userId": 123,
  "traceId": "a1b2c3d4e5f60718",
  "spanId": "0918273645546372",
  "thread": "http-nio-8080-exec-1"
}
Traces:
Trace (one complete request):
Client → Gateway → Service A → Service B → Service C
└──────────────── Trace ID: abc123 ────────────────┘
Span (processing inside one service):
Gateway (Span 1)
└─ Service A (Span 2)
   └─ Service B (Span 3)
      └─ Service C (Span 4)
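The parent/child structure above maps directly onto the OpenTracing API; a minimal sketch, assuming an io.opentracing.Tracer is available (operation names are illustrative):
import io.opentracing.Scope;
import io.opentracing.Span;
import io.opentracing.Tracer;

public class SpanTreeDemo {
    static void handle(Tracer tracer) {
        Span gateway = tracer.buildSpan("gateway").start();           // Span 1
        try (Scope ignored = tracer.scopeManager().activate(gateway)) {
            // asChildOf() records gateway's span ID as this span's parent ID
            Span serviceA = tracer.buildSpan("serviceA")
                    .asChildOf(gateway)                               // Span 2
                    .start();
            serviceA.finish();
        } finally {
            gateway.finish();
        }
    }
}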
3. Prometheus + Grafana Architecture
Architecture diagram:
┌─────────────────┐
│  Applications   │
│  (exporters)    │
└─────────────────┘
        │
        │ /metrics (pull)
        ▼
┌─────────────────┐
│   Prometheus    │
│ (scrapes metrics)│
└─────────────────┘
        │
        │ store
        ▼
┌─────────────────┐
│ TSDB (time-series│
│    database)    │
└─────────────────┘
        │
        │ query
        ▼
┌─────────────────┐
│     Grafana     │
│  (dashboards)   │
└─────────────────┘
        │
        │ alerts (fired by Prometheus rule evaluation)
        ▼
┌─────────────────┐
│  Alertmanager   │
│ (alert routing) │
└─────────────────┘
        │
        │ notify
        ▼
┌─────────────────┐
│ Email/Webhook   │
│ DingTalk/WeCom  │
└─────────────────┘
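For the exporters box at the top of this diagram, a minimal custom exporter built with the Prometheus Java client might look as follows; a sketch assuming the io.prometheus:simpleclient and io.prometheus:simpleclient_httpserver dependencies, with a made-up metric name and port:
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class DemoExporter {
    // Monotonic counter, exported as demo_jobs_processed_total
    static final Counter JOBS = Counter.build()
            .name("demo_jobs_processed_total")
            .help("Number of processed jobs.")
            .register();

    public static void main(String[] args) throws Exception {
        // Serves /metrics on :9400 for Prometheus to scrape
        HTTPServer ignored = new HTTPServer(9400);
        while (true) {
            JOBS.inc();          // simulate work
            Thread.sleep(1000);
        }
    }
}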
Prometheus configuration:
# prometheus.yml
global:
  scrape_interval: 15s      # scrape every 15 seconds
  evaluation_interval: 15s  # evaluate alerting rules every 15 seconds
# Alerting rules
rule_files:
  - "alerts/*.yml"
# Scrape configuration
scrape_configs:
  # Spring Boot Actuator
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:8080']
  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
# Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
Spring Boot integration:
<!-- pom.xml -->
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,info
  metrics:
    export:
      prometheus:
        enabled: true
    tags:
      application: ${spring.application.name}
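On top of this setup, Micrometer's @Timed annotation on controller or service methods only takes effect once a TimedAspect bean is registered (and spring-boot-starter-aop is on the classpath); a minimal sketch:
import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsConfig {
    // Makes @Timed("...") on beans produce timer metrics automatically
    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }
}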
Grafana Dashboard:
{
  "dashboard": {
    "title": "Spring Boot Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count[1m])",
            "legendFormat": "{{method}} {{uri}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Request Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[1m]))",
            "legendFormat": "P95 Latency"
          }
        ],
        "type": "graph"
      },
      {
        "title": "JVM Memory Usage",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"}",
            "legendFormat": "Heap Used"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
4. ELK Logging Stack
Architecture diagram:
┌─────────────────┐
│  Applications   │
│  (log output)   │
└─────────────────┘
        │
        │ Filebeat/Logstash
        ▼
┌─────────────────┐
│    Logstash     │
│ (log processing)│
├─────────────────┤
│ - filter        │
│ - transform     │
│ - parse         │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│ Elasticsearch   │
│  (log storage)  │
└─────────────────┘
        │
        │ query
        ▼
┌─────────────────┐
│     Kibana      │
│(log visualization)│
└─────────────────┘
Logstash configuration:
# logstash.conf
input {
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
    codec => json
  }
  beats {
    port => 5044
  }
}
filter {
  # Parse JSON logs
  json {
    source => "message"
  }
  # Parse the timestamp
  date {
    match => ["timestamp", "ISO8601"]
  }
  # Extract the trace ID
  grok {
    match => {
      "message" => '"traceId":"%{DATA:traceId}"'
    }
  }
  # Add the application name
  mutate {
    add_field => {
      "application" => "my-app"
    }
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "my-app-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => rubydebug
  }
}
Filebeat configuration:
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      app: my-app
      env: production
    # Multiline handling (e.g. stack traces); these are input-level
    # options, so they belong under the input, not under the output
    multiline.type: pattern
    multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
    multiline.negate: true
    multiline.match: after
output.logstash:
  hosts: ["logstash:5044"]
Kibana queries:
# 1. Simple query
level: "ERROR"
# 2. Range query
@timestamp: [now-1h TO now]
# 3. Wildcard
message: "*NullPointerException*"
# 4. Regular expression
message: /.*User \d+ not found.*/
# 5. Aggregations (count by level, date histogram on @timestamp, terms on
#    appName, and similar) are built in Kibana visualizations/Lens rather
#    than typed into the search bar
# 6. End-to-end tracing:
# fetch all logs that share one trace ID
traceId: "a1b2c3d4e5f60718"
5. Distributed Tracing
How it works:
1. The client request generates a Trace ID
2. Each service generates a Span while handling the request
3. Each Span records (sketched as a value type below):
   - Span ID (unique ID of the current span)
   - Parent Span ID
   - Trace ID (the global trace ID)
   - Timestamp (start time)
   - Duration
   - Tags
   - Logs
4. Spans are reported to Jaeger/Zipkin
5. The tracing system reconstructs the call chain
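A sketch of the span record above as a plain Java value type; the field names mirror the list and are illustrative, not Jaeger's actual wire format:
import java.time.Instant;
import java.util.Map;

// Illustrative only: the fields every span carries
public record SpanRecord(
        String traceId,       // global ID shared by all spans of one request
        String spanId,        // unique ID of this span
        String parentSpanId,  // null for the root span
        String operationName,
        Instant startTime,    // timestamp
        long durationMicros,  // duration
        Map<String, String> tags) {
}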
Jaeger architecture:
┌─────────────────┐
│  Applications   │
│ (Jaeger Client) │
└─────────────────┘
        │
        │ UDP/HTTP
        ▼
┌─────────────────┐
│      Agent      │
│ (data collection)│
└─────────────────┘
        │
        ▼
┌─────────────────┐
│    Collector    │
│ (data processing)│
└─────────────────┘
        │
  ┌─────┴────────┬─────────────┐
  ▼              ▼             ▼
┌─────────────┐ ┌──────────┐ ┌──────────┐
│Elasticsearch│ │ Cassandra│ │  Kafka   │
└─────────────┘ └──────────┘ └──────────┘
        │
        │ query
        ▼
┌─────────┐
│  Query  │
│ Service │
└─────────┘
        │
        │ Web UI
        ▼
┌─────────┐
│   Web   │
│   UI    │
└─────────┘
Spring Boot integration with Jaeger:
<!-- pom.xml -->
<dependency>
  <groupId>io.opentracing.contrib</groupId>
  <artifactId>opentracing-spring-jaeger-web-starter</artifactId>
</dependency>
# application.yml
opentracing:
  jaeger:
    enabled: true
    service-name: my-app
    udp-sender:
      host: jaeger-agent
      port: 6831
    probabilistic-sampler:
      param: 0.1   # 10% sampling
Code example:
@RestController
public class UserController {
    private final Tracer tracer;
    @GetMapping("/api/users/{id}")
    public User getUserById(@PathVariable Long id) {
        // Create a custom span
        Span span = tracer.buildSpan("getUserById")
                .withTag("userId", id)
                .start();
        try (Scope scope = tracer.scopeManager().activate(span)) {
            User user = userService.findById(id);
            if (user == null) {
                span.setTag("error", true);
                span.log("User not found");
                throw new UserNotFoundException(id);
            }
            return user;
        } finally {
            span.finish();
        }
    }
}
Zipkin integration:
<!-- pom.xml -->
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>
# application.yml
spring:
  zipkin:
    base-url: http://zipkin:9411
  sleuth:
    sampler:
      probability: 0.1   # 10% sampling
6. Monitoring Alert Rules
Prometheus alerting rules:
# alerts.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"
      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }} seconds"
      # Service down
      - alert: ServiceDown
        expr: up{job="spring-boot"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} is down"
      # High JVM heap usage
      - alert: HighMemoryUsage
        expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Heap memory usage is {{ $value | humanizePercentage }}"
      # Low disk space
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low"
          description: "Only {{ $value | humanizePercentage }} of disk space is available"
Alertmanager configuration:
# alertmanager.yml
global:
  resolve_timeout: 5m
# Routing
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      continue: true
    - match:
        severity: warning
      receiver: 'warning'
# Receivers
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook-server/default'
  - name: 'critical'
    webhook_configs:
      - url: 'http://webhook-server/critical'
    email_configs:
      - to: 'oncall@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'
  - name: 'warning'
    webhook_configs:
      - url: 'http://webhook-server/warning'
# Inhibition rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
DingTalk alerting:
# DingTalk webhook example
from flask import Flask, request
import requests

app = Flask(__name__)

@app.route('/alertmanager', methods=['POST'])
def alertmanager():
    data = request.json
    for alert in data.get('alerts', []):
        status = alert.get('status')
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})
        message = {
            "msgtype": "markdown",
            "markdown": {
                "title": f"Alert: {labels.get('alertname')}",
                "text": f"""
### {labels.get('alertname')}
**Status:** {status}
**Severity:** {labels.get('severity')}
**Instance:** {labels.get('instance')}
**Summary:** {annotations.get('summary')}
**Description:** {annotations.get('description')}
**Starts:** {alert.get('startsAt')}
"""
            }
        }
        requests.post(
            'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN',
            json=message
        )
    return 'OK'

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
7. End-to-End (Full-Link) Tracing
Implementation approach:
1. The client generates a Trace ID
2. The Trace ID is propagated via HTTP headers
   - X-Trace-Id
   - X-Span-Id
3. Each service records its spans
4. Spans are reported asynchronously to Jaeger/Zipkin
5. The tracing system reconstructs the call chain
Spring Cloud Sleuth implementation:
// 1. With spring-cloud-starter-sleuth on the classpath, tracing is
//    auto-configured: trace/span IDs are generated, placed in the MDC,
//    and propagated through the RestTemplate/WebClient/Kafka clients that
//    Sleuth instruments; no extra configuration is needed for the basics
// 2. Propagating the trace ID manually over HTTP (only needed for clients
//    Sleuth does not instrument), via the OpenTracing API:
@Configuration
public class RestTemplateConfig {
    @Bean
    public RestTemplate restTemplate(Tracer tracer) {
        RestTemplate restTemplate = new RestTemplate();
        restTemplate.setInterceptors(Collections.singletonList(
                (request, body, execution) -> {
                    Span span = tracer.activeSpan();
                    if (span != null) {
                        // toTraceId()/toSpanId() are the portable SpanContext accessors
                        request.getHeaders().add("X-Trace-Id", span.context().toTraceId());
                        request.getHeaders().add("X-Span-Id", span.context().toSpanId());
                    }
                    return execution.execute(request, body);
                }));
        return restTemplate;
    }
}
// 3. Kafka: with spring-kafka plus Sleuth, the trace context travels in
//    message headers automatically; a plain producer factory suffices:
@Configuration
public class KafkaConfig {
    @Bean
    public ProducerFactory<String, String> producerFactory() {
        Map<String, Object> configProps = new HashMap<>();
        configProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        return new DefaultKafkaProducerFactory<>(configProps,
                new StringSerializer(),
                new StringSerializer());
    }
}
// 4. Databases: the trace ID cannot be baked into the connection pool's
//    init SQL, since that runs once at startup when no span is active.
//    If SQL must carry the trace ID, attach it per statement instead,
//    e.g. as a comment added by a statement interceptor:
//    /* traceId=... */ SELECT ...
Correlating logs with the Trace ID:
// Propagate the trace ID through the MDC
@Slf4j
@Component
public class TraceIdFilter implements Filter {
    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
        // getHeader() lives on HttpServletRequest, so a cast is required
        String traceId = ((HttpServletRequest) request).getHeader("X-Trace-Id");
        if (traceId == null) {
            traceId = UUID.randomUUID().toString();
        }
        MDC.put("traceId", traceId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.clear();
        }
    }
}
// Logback configuration
<configuration>
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - traceId=%X{traceId} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="CONSOLE" />
  </root>
</configuration>
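One caveat: the MDC is thread-local, so the trace ID set by the filter is lost when work moves to another thread. A minimal sketch of carrying it across an executor boundary (the withMdc helper is illustrative):
import org.slf4j.MDC;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MdcPropagation {
    // Wraps a task so it runs with the submitting thread's MDC contents
    static Runnable withMdc(Runnable task) {
        Map<String, String> context = MDC.getCopyOfContextMap();
        return () -> {
            if (context != null) {
                MDC.setContextMap(context);
            }
            try {
                task.run();
            } finally {
                MDC.clear();
            }
        };
    }

    public static void main(String[] args) {
        MDC.put("traceId", "a1b2c3d4");
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(withMdc(() ->
                System.out.println("traceId=" + MDC.get("traceId"))));
        pool.shutdown();
    }
}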
8. Locating Performance Bottlenecks
Workflow:
1. Monitoring alerts (Prometheus)
   - High CPU usage
   - High memory usage
   - High request latency
2. Distributed tracing (Jaeger)
   - Locate slow requests
   - Identify the slowest service in the chain
3. Log analysis (ELK)
   - Find error logs
   - Analyze exception stack traces
4. Profiling
   - CPU profiling
   - Memory profiling
   - Thread dumps
Case 1: locating slow queries
-- 1. Enable the MySQL slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
-- 2. Analyze the slow query log (shell command)
-- pt-query-digest /var/log/mysql/slow.log
-- 3. Optimize the SQL
-- Add an index
CREATE INDEX idx_user_email ON users(email);
-- Rewrite the query so the index can be used
-- Before (LOWER() on the column prevents index use)
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- After
SELECT * FROM users WHERE email = 'alice@example.com';
Case 2: locating memory leaks
# 1. Dump the heap
jmap -dump:format=b,file=heap.hprof <pid>
# 2. Analyze with MAT (Eclipse Memory Analyzer)
# - Inspect the Dominator Tree
# - Check Leak Suspects
# - Review the Histogram
# 3. Typical leak sources in code (see the sketch below)
# - Unclosed resources (Connection, Stream)
# - Static collections holding large objects
# - Caches without expiry
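A minimal illustration of the "static collection" pattern from the list above (SessionRegistry and its sizes are made up for the example):
import java.util.HashMap;
import java.util.Map;

public class SessionRegistry {
    // Grows without bound: entries are added per request but never evicted,
    // so every byte[] stays reachable until the JVM dies (an OOM in waiting)
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    public static void remember(String sessionId) {
        CACHE.put(sessionId, new byte[1024 * 1024]); // 1 MiB per session
    }
    // Fix: bound the map (LRU), add expiry, or use weak references
}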
Case 3: locating high CPU load
# 1. Check CPU usage per thread
top -H -p <pid>
# 2. Take a thread dump
jstack <pid> > thread.dump
# 3. Find the busy thread
printf "%x\n" <tid>   # convert the thread ID to hex
grep -A 20 <tid-hex> thread.dump
# 4. Analyze the code; common culprits:
# - Infinite loops
# - Regular expressions (catastrophic backtracking)
# - Serializing large objects
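As an in-process complement to the top/jstack workflow, per-thread CPU time can also be read via ThreadMXBean; a minimal sketch (thread CPU time measurement must be supported and enabled on the JVM):
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BusyThreads {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            long cpuNanos = mx.getThreadCpuTime(id); // -1 if unsupported
            if (info != null && cpuNanos > 0) {
                System.out.printf("%s: cpu=%d ms%n",
                        info.getThreadName(), cpuNanos / 1_000_000);
            }
        }
    }
}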
9. Metrics System Design
Layered metrics:
1. Infrastructure layer
   - CPU usage
   - Memory usage
   - Disk I/O
   - Network traffic
2. Platform layer
   - Kubernetes cluster health
   - Pod counts
   - Node status
3. Middleware layer
   - Redis: connections, command latency, memory usage
   - MySQL: QPS, slow queries, connections, replication lag
   - Kafka: message backlog, consumer lag
4. Application layer
   - QPS (queries per second)
   - Latency (P50, P95, P99)
   - Error rate
   - Saturation
5. Business layer
   - Order volume
   - Payment success rate
   - User activity
The RED method:
R - Rate
  - QPS (queries per second)
  - RPS (requests per second)
E - Errors
  - HTTP 5xx error rate
  - Business exception rate
D - Duration (see the percentile sketch after this list)
  - P50 (median)
  - P95 (95th percentile)
  - P99 (99th percentile)
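A minimal sketch of what the Duration percentiles mean, computed nearest-rank over a batch of latency samples; real systems stream this with histograms or sketches rather than sorting raw samples:
import java.util.Arrays;

public class Percentiles {
    // Nearest-rank percentile over latency samples (milliseconds)
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        double[] latencies = {12, 15, 18, 22, 35, 40, 55, 80, 120, 450};
        System.out.println("P50 = " + percentile(latencies, 0.50)); // 35.0
        System.out.println("P95 = " + percentile(latencies, 0.95)); // 450.0
    }
}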
The USE method:
U - Utilization
  - CPU usage
  - Memory usage
  - Disk usage
S - Saturation
  - CPU run-queue length
  - Swap usage
  - Disk I/O wait time
E - Errors
  - Hardware errors (ECC, bad disk sectors)
  - Software errors (OOM, connection timeouts)
Grafana dashboard example:
{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "QPS",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[1m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[1m])) / sum(rate(http_requests_total[1m]))"
          }
        ]
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[1m]))"
          }
        ]
      },
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total[1m])"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"} / jvm_memory_max_bytes{area=\"heap\"}"
          }
        ]
      }
    ]
  }
}
10. Rolling Out in a Real Project
Scenario 1: e-commerce system monitoring
Requirements:
- Monitor the performance of the order APIs
- Find and optimize slow queries
- Monitor the payment success rate
Approach:
1. Prometheus for monitoring
   - QPS, latency, error rate
   - JVM metrics
   - MySQL slow queries
2. Jaeger for tracing
   - Order creation flow
   - Payment flow
3. ELK for log analysis
   - Order logs
   - Payment logs
4. Grafana dashboards
   - Business metrics (order volume, payment success rate)
   - Technical metrics (QPS, latency)
Scenario 2: microservice tracing
Requirements:
- Trace requests across services
- Locate performance bottlenecks
- Analyze service dependencies
Approach:
1. Spring Cloud Sleuth generates the Trace ID
2. Jaeger collects the spans
3. Kibana correlates logs (via the Trace ID)
4. Prometheus monitors each service's performance
Example:
A user places an order
├─ Order service (create the order)
├─ Inventory service (deduct stock)
├─ Payment service (create the payment)
└─ Logistics service (assign shipping)
All services' logs are correlated through the Trace ID
11. Bonus Points for Alibaba P7
Architecture design:
- Designed an enterprise-grade observability platform (unified metrics, logs, traces)
- Experience with multi-cluster, multi-region monitoring architectures
- Built custom monitoring agents and collectors
Deep understanding:
- Familiar with Prometheus internals (TSDB, storage engine, query engine)
- Understands Elasticsearch fundamentals (Lucene, shards, replicas)
- Has read Jaeger/Zipkin source code
Performance optimization:
- Optimized Prometheus query performance (recording rules, federation)
- Optimized Elasticsearch indexing (sharding strategy, mapping design)
- Optimized log collection (sampling rates, batched uploads)
Production experience:
- Solved storage and query problems at scale (downsampling, hot/cold data separation)
- Implemented intelligent alerting (dynamic thresholds, anomaly detection, machine learning)
- Fast fault localization experience (root cause analysis, postmortems)
Open-source contributions:
- Submitted PRs to the Prometheus/Grafana/Jaeger communities
- Developed custom exporters
- Published technical blog posts or talks
Observability best practices:
- Implemented SLOs/SLIs (service level objectives/indicators)
- Used error budgets to pace releases
- Practiced chaos engineering
- Implemented APM (application performance monitoring)
Business monitoring:
- Designed business metric dashboards
- Built real-time data walls (Druid, ClickHouse)
- Experience with user behavior analytics (event tracking, funnel analysis)