# Container Orchestration (Kubernetes)

## Questions

**Background**: As containerization has become widespread, container orchestration has become the key to managing large container fleets. Kubernetes, the de facto standard, automates the deployment, scaling, and management of containerized applications.

**Questions**:

1. What is container orchestration, and why do we need Kubernetes?
2. What are the core components of the Kubernetes architecture?
3. How do Pod, Deployment, and Service relate to each other?
4. Describe the Kubernetes network model.
5. How does Kubernetes implement service discovery and load balancing?
6. What are ConfigMap and Secret, and how are they used?
7. What types of volumes does Kubernetes support?
8. Describe the Kubernetes scheduling flow.
9. What is Ingress, and how does it differ from NodePort and LoadBalancer?
10. What pitfalls have you hit running Kubernetes in production?

---
## Model Answers

### 1. Container Orchestration Overview

#### **Why container orchestration is needed**:
```
Pain points of single-host Docker:
├─ Complex container lifecycle management
├─ Difficult service discovery and load balancing
├─ Complicated rolling updates and rollbacks
├─ Poor resource scheduling and utilization
├─ High availability and self-healing are hard to achieve
└─ Complex multi-host networking

What Kubernetes provides:
├─ Automated deployment and rollback
├─ Service discovery and load balancing
├─ Self-healing (restart on failure, rescheduling to healthy nodes)
├─ Autoscaling (HPA)
├─ Storage orchestration
└─ Configuration and secret management
```

---
### 2. Kubernetes Core Architecture

#### **Architecture diagram**:
```
┌────────────────────────────────────────────────────┐
│               Control Plane (master)               │
│                                                    │
│  ┌────────────┐ ┌───────────┐ ┌──────────────────┐ │
│  │ API Server │ │ Scheduler │ │Controller Manager│ │
│  └────────────┘ └───────────┘ └──────────────────┘ │
│  ┌────────────┐ ┌────────────────────────┐         │
│  │    etcd    │ │Cloud Controller Manager│         │
│  │ (storage)  │ └────────────────────────┘         │
│  └────────────┘                                    │
└────────────────────────────────────────────────────┘
                         │
                         │ HTTP/REST API
                         │
┌────────────────────────────────────────────────────┐
│                    Worker Nodes                    │
│                                                    │
│    Node 1                    Node 2                │
│  ┌──────────────────┐     ┌──────────────────┐     │
│  │ kubelet          │     │ kubelet          │     │
│  │ kube-proxy       │     │ kube-proxy       │     │
│  │ Container Runtime│     │ Container Runtime│     │
│  │ (Docker/…)       │     │                  │     │
│  │                  │     │                  │     │
│  │  Pods:           │     │  Pods:           │     │
│  │  ┌───────┐       │     │  ┌───────┐       │     │
│  │  │ App 1 │       │     │  │ App 2 │       │     │
│  │  └───────┘       │     │  └───────┘       │     │
│  └──────────────────┘     └──────────────────┘     │
└────────────────────────────────────────────────────┘
```

Note that kube-proxy runs on every node (not in the control plane), alongside the kubelet.
#### **Core components in detail**:

**1. API Server (kube-apiserver)**
- The front door of Kubernetes; every request goes through the API Server
- Performs authentication, authorization, and admission control
- Exposes a RESTful API

**2. etcd**
- Distributed key-value store
- Holds all cluster state
- Watch mechanism pushes changes to clients

**3. Scheduler**
- Decides which node each Pod runs on
- Scheduling inputs: resource requests, hardware constraints, affinity/anti-affinity

**4. Controller Manager**
- Drives the cluster toward its desired state
- Common controllers:
  - Node Controller: handles node failures
  - Replication Controller: maintains replica counts
  - Endpoint Controller: manages Service endpoints

**5. kubelet**
- Runs on every node
- Manages the Pod lifecycle on that node
- Reports node status to the API Server

**6. kube-proxy**
- Maintains network rules on each node
- Implements Service load balancing

---
### 3. Pod, Deployment, and Service

#### **How they relate**:
```
Deployment (declarative deployment)
│
└── manages ReplicaSet (replica set)
    │
    └── manages Pod (smallest schedulable unit)
        │
        ├── Container 1 (application container)
        ├── Container 2 (sidecar)
        └── Shared Volume (shared storage)

Service (service discovery)
│
├── selects Pods via a label selector
│
└── provides a stable access point (IP/DNS)
```
#### **Pod example**:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  labels:
    app: nginx
    env: prod
spec:
  containers:
  - name: nginx
    image: nginx:1.21
    ports:
    - containerPort: 80
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
  - name: sidecar
    image: fluentd:1.12
    volumeMounts:
    - name: log-volume
      mountPath: /var/log
  volumes:
  - name: log-volume
    emptyDir: {}
```
#### **Deployment example**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3              # three replicas
  selector:
    matchLabels:
      app: nginx
  template:                # Pod template
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 3
          periodSeconds: 3
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 3
          periodSeconds: 3
```
#### **Service example**:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx           # selects matching Pods
  ports:
  - protocol: TCP
    port: 80             # Service port
    targetPort: 80       # Pod port
  type: ClusterIP        # Service type
```
**The three Service types**:
```yaml
# 1. ClusterIP (default)
type: ClusterIP
# Reachable only from inside the cluster

# 2. NodePort
type: NodePort
ports:
- port: 80
  targetPort: 80
  nodePort: 30080    # port 30080 is exposed on every node

# 3. LoadBalancer
type: LoadBalancer
# The cloud provider provisions an external load balancer
```

---
### 4. The Kubernetes Network Model

#### **Network requirements**:
```
1. All Pods can communicate with each other without NAT
2. All nodes can communicate with all Pods
3. The IP a Pod sees for itself is the same IP others see for it
```
#### **Network architecture**:
```
                 Internet
                    │
              ┌──────────┐
              │ Ingress  │
              └──────────┘
                    │
       Service (ClusterIP: 10.0.0.1)
                    │
   ┌────────────────┼────────────────┐
   │                │                │
Pod (10.244.1.2) Pod (10.244.1.3) Pod (10.244.2.5)
   Node 1           Node 1           Node 2
```
#### **Network plugins (CNI)**:

| Plugin | Type | Characteristics |
|------|------|------|
| Flannel | VXLAN/host-gw | Simple; moderate performance |
| Calico | BGP | Good performance; supports network policies |
| Cilium | eBPF | High performance; supports transparent proxying |
| Weave | VXLAN | Simple; supports encryption |
**Calico example**:
```bash
# Install Calico
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
```

```yaml
# Network policy: only allow ingress from Pods in the same namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-from-other-namespaces
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: nginx
  ingress:
  - from:
    - podSelector: {}
```

---
### 5. Service Discovery and Load Balancing

#### **Service discovery**:
```yaml
# 1. Environment variables
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app
    env:
    - name: DB_SERVICE_HOST
      value: "mysql-service"
    - name: DB_SERVICE_PORT
      value: "3306"

# 2. DNS (recommended)
# Pods can reach a Service by its DNS name:
# mysql-service.default.svc.cluster.local
```
**Kubernetes DNS architecture**:
```
Pod starts → /etc/resolv.conf is configured:
  nameserver 10.96.0.10    # ClusterIP of kube-dns/CoreDNS
  search default.svc.cluster.local svc.cluster.local cluster.local

        ↓
Resolve the name
  mysql-service.default.svc.cluster.local
        ↓
DNS returns the Service ClusterIP (e.g. 10.0.0.1)
        ↓
kube-proxy load-balances the connection to a backend Pod
```
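The in-cluster DNS name above follows a fixed `<service>.<namespace>.svc.<cluster-domain>` pattern. A tiny sketch, just to make the naming scheme explicit (plain Python, not a Kubernetes API):

```python
def service_fqdn(service: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Build the fully qualified in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"

print(service_fqdn("mysql-service"))
# mysql-service.default.svc.cluster.local
```

Thanks to the `search` suffixes in `/etc/resolv.conf`, a Pod in the same namespace can also reach the Service by the short name `mysql-service`.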
#### **Load-balancing strategies**:

**The three kube-proxy modes**:
```yaml
# 1. userspace (legacy; poor performance)
mode: userspace

# 2. iptables (default)
mode: iptables
# Load balancing implemented with iptables rules

# 3. ipvs (recommended for large clusters)
mode: ipvs
# Uses IPVS; better performance and more scheduling algorithms
```
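IPVS's default scheduler is `rr` (round-robin). The algorithm itself can be sketched in a few lines — an illustration of the idea, not kube-proxy's actual implementation:

```python
from itertools import cycle

class RoundRobin:
    """Minimal round-robin endpoint picker, like IPVS's `rr` scheduler."""
    def __init__(self, endpoints):
        self._it = cycle(endpoints)  # endless rotation over the endpoint list

    def pick(self):
        return next(self._it)

lb = RoundRobin(["10.244.1.2:80", "10.244.1.3:80", "10.244.2.5:80"])
print([lb.pick() for _ in range(4)])
# ['10.244.1.2:80', '10.244.1.3:80', '10.244.2.5:80', '10.244.1.2:80']
```

IPVS also offers weighted round-robin, least-connections, and other schedulers, which is part of why it scales better than plain iptables rules.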
**Service session affinity (sticky sessions)**:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  sessionAffinity: ClientIP      # stick each client IP to one Pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800      # 3 hours
```

---
### 6. ConfigMap and Secret

#### **ConfigMap (configuration management)**:
```yaml
# 1. Create a ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  application.properties: |
    server.port=8080
    spring.datasource.url=jdbc:mysql://localhost:3306/db
  log-level: "info"
  feature-flags: |
    featureA=true
    featureB=false

# 2. Consume the ConfigMap
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app
    env:
    - name: LOG_LEVEL
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: log-level
    volumeMounts:
    - name: config-volume
      mountPath: /etc/config
  volumes:
  - name: config-volume
    configMap:
      name: app-config
```
#### **Secret (sensitive data)**:
```yaml
# 1. Create a Secret
apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
data:
  username: YWRtaW4=             # base64-encoded
  password: MWYyZDFlMmU2N2Rm     # base64-encoded

# 2. Consume the Secret
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app
    env:
    - name: DB_USERNAME
      valueFrom:
        secretKeyRef:
          name: db-secret
          key: username
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-secret
          key: password
```
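The `data` values in a Secret are plain base64 — an encoding, not encryption, so anyone with read access to the Secret can recover the plaintext. You can verify the values in the manifest above yourself:

```python
import base64

# Encode the way `kubectl create secret` does
print(base64.b64encode(b"admin").decode())            # YWRtaW4=

# Decode the password field from the manifest
print(base64.b64decode("MWYyZDFlMmU2N2Rm").decode())  # 1f2d1e2e67df
```

This is why production clusters typically add encryption at rest for etcd and restrict Secret access via RBAC.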
**Creating Secrets from files**:
```bash
# Create a TLS Secret
kubectl create secret tls my-tls-secret \
  --cert=path/to/cert.crt \
  --key=path/to/cert.key

# Create a Docker registry Secret
kubectl create secret docker-registry my-registry-secret \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=password
```

---
### 7. Volumes

#### **Common volume types**:

| Type | Description | Typical use |
|------|------|----------|
| emptyDir | Temporary directory; data is lost when the Pod is deleted | Scratch space, caches |
| hostPath | Path on the host; data survives Pod deletion | Log collection, monitoring agents |
| PersistentVolumeClaim | Persistent storage | Databases, application data |
| ConfigMap | Configuration files | Application config |
| Secret | Sensitive data | Keys, certificates |
#### **PV/PVC example**:
```yaml
# 1. PersistentVolume (PV)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-example
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.1.100
    path: /data/nfs

# 2. PersistentVolumeClaim (PVC)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-example
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

# 3. Use the PVC in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app
    volumeMounts:
    - name: data-volume
      mountPath: /data
  volumes:
  - name: data-volume
    persistentVolumeClaim:
      claimName: pvc-example
```
**StorageClass (dynamic provisioning)**:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1          # iopsPerGB only applies to io1 volumes
  iopsPerGB: "10"
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

---
### 8. The Kubernetes Scheduling Flow

#### **Scheduling flow**:
```
1. A Pod is created
      ↓
2. The API Server accepts the request and writes it to etcd
      ↓
3. The Scheduler watches for unscheduled Pods
      ↓
4. Filtering (predicates): rule out nodes that don't qualify
   - Enough free resources (CPU, memory)?
   - nodeSelector constraints
   - Affinity / anti-affinity
   - Taints and tolerations
      ↓
5. Scoring (priorities): rank the remaining nodes
   - Resource utilization
   - Image already cached locally
   - Pod spreading
      ↓
6. Pick the highest-scoring node
      ↓
7. Binding: bind the Pod to that node
      ↓
8. The API Server updates the Pod's status
      ↓
9. The kubelet on that node sees the assignment and starts the containers
```
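Steps 4–6 of the flow above (filter, score, pick the best node) can be sketched as a toy model — this illustrates the two-phase idea only, not the real scheduler's plugins or scoring functions:

```python
def schedule(pod, nodes):
    """Toy two-phase scheduler: filter nodes that fit, then score them."""
    # Filtering (predicates): a node must have enough free CPU and memory
    feasible = [n for n in nodes
                if n["free_cpu"] >= pod["cpu"] and n["free_mem"] >= pod["mem"]]
    if not feasible:
        return None  # no node fits: the Pod stays Pending

    # Scoring (priorities): prefer the node with the most headroom left
    def score(n):
        return (n["free_cpu"] - pod["cpu"]) + (n["free_mem"] - pod["mem"])

    return max(feasible, key=score)["name"]

nodes = [
    {"name": "node-1", "free_cpu": 2.0, "free_mem": 4.0},
    {"name": "node-2", "free_cpu": 0.2, "free_mem": 8.0},
    {"name": "node-3", "free_cpu": 4.0, "free_mem": 16.0},
]
print(schedule({"cpu": 0.5, "mem": 1.0}, nodes))  # node-3
```

The real scheduler runs many such filters and scorers as plugins (and weights scores differently, e.g. spreading vs. bin-packing), but the filter-then-score shape is the same.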
#### **Scheduling constraint examples**:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  # 1. nodeSelector
  nodeSelector:
    disktype: ssd

  # 2. Node affinity and 3. Pod affinity
  # (both must live under a single `affinity:` key)
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - cn-shanghai-a
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - nginx
        topologyKey: kubernetes.io/hostname

  # 4. Tolerations
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```

---
### 9. Ingress vs. NodePort vs. LoadBalancer

#### **Comparison**:

| Type | Use case | Pros | Cons |
|------|----------|------|------|
| NodePort | Testing, development | Simple | Port management is messy; moderate performance |
| LoadBalancer | Production (cloud providers) | Managed load balancing | Costly (one LB per Service); cloud-vendor dependency |
| Ingress | Production (recommended) | Flexible; L7 routing | More configuration |
#### **NodePort example**:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-nodeport
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30080    # allowed range: 30000-32767
  selector:
    app: nginx
```
#### **LoadBalancer example**:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-lb
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: nginx
```
#### **Ingress example**:
```bash
# 1. Install an Ingress controller (e.g. ingress-nginx)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.0/deploy/static/provider/cloud/deploy.yaml
```

```yaml
# 2. Create the Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: example.com          # host-based routing
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: nginx-service
            port:
              number: 80
  - host: api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080
```
**Advanced Ingress configuration**:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-ingress
  annotations:
    # TLS via cert-manager
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # Rate limiting
    nginx.ingress.kubernetes.io/limit-rps: "10"
    # Timeouts
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
spec:
  tls:
  - hosts:
    - example.com
    secretName: example-tls
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: nginx-service
            port:
              number: 80
```

---
### 10. 生产环境踩坑经验
|
||
|
||
#### **坑 1:Pod 无法启动(ImagePullBackOff)**
|
||
```bash
|
||
# 问题:镜像拉取失败
|
||
kubectl get pods
|
||
NAME READY STATUS RESTARTS AGE
|
||
nginx-pod 0/1 ImagePullBackOff 0 2m
|
||
|
||
# 排查
|
||
kubectl describe pod nginx-pod
|
||
# Events: Failed to pull image "nginx:latest": rpc error: code = Unknown
|
||
|
||
# 解决
|
||
# 1. 检查镜像名称和标签
|
||
# 2. 检查私有仓库凭证
|
||
kubectl create secret docker-registry my-registry-secret \
|
||
--docker-server=registry.example.com \
|
||
--docker-username=user \
|
||
--docker-password=password
|
||
|
||
# 3. 在 Pod 中引用 Secret
|
||
spec:
|
||
imagePullSecrets:
|
||
- name: my-registry-secret
|
||
```
|
||
|
||
#### **Pitfall 2: CrashLoopBackOff**
```bash
# Symptom: the Pod keeps restarting
kubectl get pods
# NAME        READY   STATUS             RESTARTS   AGE
# nginx-pod   0/1     CrashLoopBackOff   5          10m

# Diagnose
kubectl logs nginx-pod
# Error: Cannot connect to database

# Fix
# 1. Read the application logs
# 2. Check the configuration
# 3. Check dependent services (database, Redis)
kubectl describe pod nginx-pod
# inspect the Events section
```
#### **Pitfall 3: Badly sized resource limits**
```yaml
# Symptom: the Pod is killed (OOMKilled)
# Cause: the memory limit is too low

# Fix
resources:
  requests:
    memory: "256Mi"   # guaranteed minimum memory
    cpu: "500m"       # guaranteed minimum CPU
  limits:
    memory: "512Mi"   # memory ceiling
    cpu: "1000m"      # CPU ceiling
```

```bash
# Watch actual resource usage
kubectl top pods
kubectl top nodes
```
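The quantity suffixes above (`m` for milli-CPU, `Mi`/`Gi` for binary memory units) are easy to misread when sizing limits. A simplified sketch of how they translate into plain numbers (Kubernetes' real quantity parser handles more suffixes and decimal forms):

```python
def parse_cpu(q: str) -> float:
    """'500m' -> 0.5 cores; '2' -> 2.0 cores."""
    return int(q[:-1]) / 1000 if q.endswith("m") else float(q)

def parse_mem(q: str) -> int:
    """'512Mi' -> bytes (binary suffixes only, for brevity)."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return int(q[:-len(suffix)]) * factor
    return int(q)  # a bare number is already bytes

print(parse_cpu("500m"))    # 0.5
print(parse_mem("512Mi"))   # 536870912
```

A common mistake is assuming `Mi` equals `M`: `512Mi` is 512 × 1024² bytes, while `512M` would be 512 × 1000².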
#### **Pitfall 4: Failed rolling update**
```bash
# Symptom: after an update, no Pod becomes available
kubectl rollout status deployment/nginx-deployment
# Waiting for deployment "nginx-deployment" to progress

# Fix
# 1. Roll back to the previous revision
kubectl rollout undo deployment/nginx-deployment

# 2. Inspect the revision history
kubectl rollout history deployment/nginx-deployment
```

```yaml
# 3. Add health checks so a bad rollout stops before replacing all Pods
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```
#### **Pitfall 5: DNS resolution failures**
```bash
# Symptom: a Pod cannot reach a Service by name
curl http://nginx-service.default.svc.cluster.local
# curl: (6) Could not resolve host

# Diagnose
kubectl exec -it my-app -- cat /etc/resolv.conf
# nameserver 10.96.0.10

# Fix
# 1. Check that kube-dns/CoreDNS is running
kubectl get pods -n kube-system

# 2. Inspect the DNS configuration
kubectl get configmap coredns -n kube-system -o yaml

# 3. Restart CoreDNS
kubectl rollout restart deployment/coredns -n kube-system
```

---
### 11. Real-World Project Experience

#### **Scenario 1: Highly available deployment**
```yaml
# Goal: keep the service highly available
# Approach:

# 1. Multiple replicas
replicas: 3

# 2. Pod anti-affinity (spread replicas across nodes)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - nginx
        topologyKey: kubernetes.io/hostname

# 3. Health checks
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

# 4. Resource limits
resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"
    cpu: "1000m"
```
#### **Scenario 2: Autoscaling (HPA)**
```bash
# 1. Install the Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

```yaml
# 2. Create the HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50    # scale out when average CPU exceeds 50%
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80    # scale out when average memory exceeds 80%
```
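The HPA controller's documented scaling rule is `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, clamped to the min/max bounds. Applied to the CPU target above:

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 2, max_r: int = 10) -> int:
    """HPA scaling formula, clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

# 3 replicas averaging 80% CPU against a 50% target -> scale out to 5
print(desired_replicas(3, 80, 50))  # 5
```

With multiple metrics (as in the manifest above), the HPA computes a desired count per metric and uses the largest one.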
#### **Scenario 3: Configuration management**
```yaml
# Goal: different configuration per environment
# Approach: one ConfigMap per environment

# Development
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-dev
  namespace: dev
data:
  spring.profiles.active: "dev"
  spring.datasource.url: "jdbc:mysql://dev-mysql:3306/db"

# Production
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-prod
  namespace: prod
data:
  spring.profiles.active: "prod"
  spring.datasource.url: "jdbc:mysql://prod-mysql:3306/db"

# Pod consuming it
# (note: envFrom skips keys that are not valid env-var names,
#  e.g. keys containing ".")
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: prod
spec:
  containers:
  - name: app
    image: my-app
    envFrom:
    - configMapRef:
        name: app-config-prod
```
### 12. Bonus Points for an Alibaba P7 Interview

**Architecture design**:
- Designed large Kubernetes clusters (1000+ nodes)
- Multi-cluster / multi-cloud Kubernetes management experience
- Built custom Controllers and Operators

**Deep understanding**:
- Familiar with the Kubernetes source code (scheduler, controllers, network model)
- Understands container runtimes (Docker, containerd, CRI-O)
- CNI plugin development experience

**Performance tuning**:
- Tuned etcd (compaction, snapshot policy)
- Tuned kubelet parameters (max Pods, image garbage collection)
- Optimized network performance (CNI plugin choice, MTU configuration)

**Production experience**:
- Led a migration from Docker Swarm to Kubernetes
- Resolved hard production incidents (network partitions, etcd data recovery)
- Implemented multi-tenant isolation in Kubernetes

**Cloud-native ecosystem**:
- Helm chart development and templated deployments
- Monitored Kubernetes with Prometheus + Grafana
- Built Kubernetes CI/CD pipelines (GitOps, ArgoCD)

**Security practice**:
- Implemented Pod Security Standards (the successor to Pod Security Policy)
- RBAC permission-management experience
- Enforced security policies with Falco/Kyverno
- Implemented image signing and verification (Notary)

**Cost optimization**:
- Autoscaled nodes with Cluster Autoscaler
- Implemented Pod priority and preemption
- Reduced cost with Spot instances
- Managed ResourceQuotas and LimitRanges