
Deploying AI Models with Docker + Kubernetes: A Hands-On MLOps Guide from Development to Production

2026-05-28 0 views admin

Why Is AI Model Deployment So Hard?

Moving a trained AI model from a Jupyter Notebook into production is often the most challenging stage of the machine learning lifecycle. Model inference brings a string of operational concerns: GPU resource scheduling, high-concurrency request handling, version management, canary releases, and monitoring and alerting. This article walks step by step through the full path from containerizing an AI model to deploying it on a Kubernetes cluster, building a production-grade MLOps deployment setup.

Architecture Overview

    ┌─────────────────────────────────────────────────────┐
    │                 Kubernetes Cluster                  │
    │                                                     │
    │  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
    │  │ Ingress  │→ │ Service  │→ │ Model Deployment │   │
    │  │Controller│  │   (LB)   │  │   (Pods + GPU)   │   │
    │  └──────────┘  └──────────┘  └──────────────────┘   │
    │                                       ↓             │
    │                                ┌──────────────┐     │
    │                                │  Prometheus  │     │
    │                                │  + Grafana   │     │
    │                                └──────────────┘     │
    └─────────────────────────────────────────────────────┘

Step 1: Containerizing the AI Model

Writing the FastAPI Inference Service

First, we create a FastAPI-based model inference service:

    # app/main.py
    import logging
    import os
    import time
    from typing import List, Optional

    import numpy as np
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel, Field

    # import mlflow  # enable when loading the model from an MLflow registry

    # Configure logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    app = FastAPI(
        title="AI Model Serving",
        version="1.0.0",
        docs_url="/docs",
    )

    # Global model instance
    model = None
    model_version = os.getenv("MODEL_VERSION", "latest")


    class PredictionRequest(BaseModel):
        """Prediction request"""
        features: List[float] = Field(..., description="Input feature vector")
        model_version: Optional[str] = Field(None, description="Model version to use")


    class PredictionResponse(BaseModel):
        """Prediction response"""
        prediction: float
        confidence: float
        model_version: str
        latency_ms: float


    class BatchPredictionRequest(BaseModel):
        """Batch prediction request"""
        batch_features: List[List[float]] = Field(..., description="Batch of input features")
        model_version: Optional[str] = Field(None, description="Model version to use")


    class HealthResponse(BaseModel):
        """Health check response"""
        status: str
        model_loaded: bool
        model_version: str
        gpu_available: bool


    @app.on_event("startup")
    async def load_model():
        """Load the model at startup"""
        global model
        try:
            # Load the model from MLflow or a local path
            model_path = os.getenv("MODEL_PATH", "./models/model.pkl")
            logger.info(f"Loading model: {model_path}")
            # Simulated model loading (replace with a real model in a real project)
            model = {"version": model_version, "loaded": True}
            logger.info(f"Model loaded, version: {model_version}")
        except Exception as e:
            logger.error(f"Model loading failed: {e}")
            raise


    @app.get("/health", response_model=HealthResponse)
    async def health_check():
        """Health check endpoint"""
        try:
            import torch
            gpu_available = torch.cuda.is_available()
        except ImportError:
            # Without torch installed, report no GPU instead of failing the probe
            gpu_available = False
        return HealthResponse(
            status="healthy" if model else "unhealthy",
            model_loaded=model is not None,
            model_version=model_version,
            gpu_available=gpu_available,
        )


    @app.post("/predict", response_model=PredictionResponse)
    async def predict(request: PredictionRequest):
        """Single prediction"""
        if not model:
            raise HTTPException(status_code=503, detail="Model not loaded")
        start_time = time.time()
        try:
            features = np.array(request.features).reshape(1, -1)
            # Simulated inference (replace with real model inference logic)
            prediction = float(np.random.random())
            confidence = float(np.random.uniform(0.7, 0.99))
            latency = (time.time() - start_time) * 1000
            logger.info(
                f"Prediction done - latency: {latency:.2f}ms, "
                f"result: {prediction:.4f}"
            )
            return PredictionResponse(
                prediction=prediction,
                confidence=confidence,
                model_version=model_version,
                latency_ms=round(latency, 2),
            )
        except Exception as e:
            logger.error(f"Prediction failed: {e}")
            raise HTTPException(status_code=500, detail=str(e))


    @app.post("/predict/batch")
    async def batch_predict(request: BatchPredictionRequest):
        """Batch prediction"""
        if not model:
            raise HTTPException(status_code=503, detail="Model not loaded")
        start_time = time.time()
        results = []
        try:
            batch = np.array(request.batch_features)
            # The source is truncated after the line above. The rest of this
            # function is a minimal placeholder that mirrors the simulated
            # single-prediction path; it is an assumption, not the original code.
            for _ in batch:
                results.append({
                    "prediction": float(np.random.random()),
                    "confidence": float(np.random.uniform(0.7, 0.99)),
                })
            latency = (time.time() - start_time) * 1000
            return {
                "predictions": results,
                "model_version": model_version,
                "latency_ms": round(latency, 2),
            }
        except Exception as e:
            logger.error(f"Batch prediction failed: {e}")
            raise HTTPException(status_code=500, detail=str(e))

Step 2: Kubernetes Deployment

GPU Node Configuration

First, make sure the Kubernetes cluster has a GPU device plugin installed; the `nvidia.com/gpu` resource requested below is only available once the NVIDIA device plugin is running on the GPU nodes. Then label the GPU nodes so that Pods can be scheduled onto them:

    # gpu-node-label.yaml
    # Label GPU nodes for Pod scheduling
    apiVersion: v1
    kind: Node
    metadata:
      name: gpu-node-01
      labels:
        node-type: gpu
        gpu-type: nvidia-a100
        nvidia.com/gpu.present: "true"

Deployment Configuration

    # k8s/deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ai-model-service
      namespace: ml-serving
      labels:
        app: ai-model-service
        version: v1
    spec:
      replicas: 2
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1        # at most 1 extra Pod during a rolling update
          maxUnavailable: 0  # no Pod may be unavailable during the update
      selector:
        matchLabels:
          app: ai-model-service
      template:
        metadata:
          labels:
            app: ai-model-service
            version: v1
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "8000"
            prometheus.io/path: "/metrics"
        spec:
          # GPU placement: schedule only onto labeled GPU nodes
          nodeSelector:
            node-type: gpu
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - name: model-server
              image: your-registry.com/ai-model-service:v1.0.0
              imagePullPolicy: IfNotPresent
              ports:
                - containerPort: 8000
                  protocol: TCP
              env:
                - name: MODEL_PATH
                  value: "/app/models"
                - name: MODEL_VERSION
                  valueFrom:
                    configMapKeyRef:
                      name: model-config
                      key: model-version
                - name: LOG_LEVEL
                  value: "info"
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                  nvidia.com/gpu: "1"  # request 1 GPU
                limits:
                  cpu: "4"
                  memory: "8Gi"
                  nvidia.com/gpu: "1"  # limit to at most 1 GPU
              # Readiness probe
              readinessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 30
                periodSeconds: 10
                failureThreshold: 3
              # Liveness probe
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 60
                periodSeconds: 30
                failureThreshold: 5
              # Startup probe (gives the model more time to load)
              startupProbe:
                httpGet:
                  path: /health
                  port: 8000
                failureThreshold: 30
                periodSeconds: 10

Step 3: Monitoring and Observability

Prometheus Metrics Collection

Add Prometheus metrics to the FastAPI application:

    # app/metrics.py
    from prometheus_fastapi_instrumentator import Instrumentator
    from prometheus_client import Counter, Histogram, Gauge
    import time

    # Custom metrics
    PREDICTION_COUNT = Counter(
        "model_predictions_total",
        "Total number of predictions",
        ["model_version", "status"]
    )

    PREDICTION_LATENCY = Histogram(
        "model_prediction_latency_seconds",
        "Prediction latency in seconds",
        ["model_version"],
        buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
    )

    MODEL_INFERENCE_ACTIVE = Gauge(
        "model_inference_active",
        "Number of active inference requests"
    )

    GPU_MEMORY_USAGE = Gauge(
        "gpu_memory_usage_bytes",
        "GPU memory usage in bytes",  # description assumed; the source is truncated here
    )

Step 4: CI/CD Pipeline

    # .github/workflows/ml-deploy.yaml
    on:
      push:
        branches: [main]
        paths:
          - 'app/**'
          - 'models/**'
          - 'Dockerfile'

    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Set up Python
            uses: actions/setup-python@v5
            with:
              python-version: '3.11'
          - name: Install dependencies
            run: pip install -r requirements.txt
          - name: Run unit tests
            run: pytest tests/ -v --cov=app
          - name: Run model validation
            run: python scripts/validate_model.py

      build-and-push:
        needs: test
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Build Docker image
            run: |
              docker build -t your-registry.com/ai-model-service:${{ github.sha }} .
          - name: Push to registry
            run: |
              echo "${{ secrets.REGISTRY_PASSWORD }}" | docker login your-registry.com -u "${{ secrets.REGISTRY_USER }}" --password-stdin
              docker push your-registry.com/ai-model-service:${{ github.sha }}

      deploy-staging:
        needs: build-and-push
        # (the source is truncated here; the staging deployment steps are not shown)
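The workflow's test job runs `python scripts/validate_model.py`, but the article does not show that script. As a rough sketch of what such a gate might look like, the example below compares recorded metrics against thresholds and returns a non-zero exit code on regression. The metric names, threshold values, the `metrics.json` file name, and the `check_metrics` and `main` helpers are all assumptions made for illustration, not the author's script.

```python
"""Hypothetical scripts/validate_model.py: fail the CI job when metrics regress."""
import json
from pathlib import Path

# Assumed acceptance thresholds; real projects would tune these per model.
THRESHOLDS = {
    "accuracy": ("min", 0.90),
    "p95_latency_ms": ("max", 200.0),
}


def check_metrics(metrics: dict) -> list:
    """Return a list of human-readable failures (an empty list means pass)."""
    failures = []
    for name, (kind, limit) in THRESHOLDS.items():
        if name not in metrics:
            failures.append(f"missing metric: {name}")
            continue
        value = metrics[name]
        if kind == "min" and value < limit:
            failures.append(f"{name}={value} is below the minimum {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}={value} is above the maximum {limit}")
    return failures


def main(path: str = "metrics.json") -> int:
    """Load metrics from a JSON file and return a process exit code."""
    metrics_file = Path(path)
    if not metrics_file.exists():
        print(f"metrics file not found: {path}")
        return 1
    failures = check_metrics(json.loads(metrics_file.read_text()))
    for failure in failures:
        print(f"FAIL: {failure}")
    return 1 if failures else 0
```

In CI, the script would typically end with `raise SystemExit(main())` so that any failed check stops the pipeline before the `build-and-push` job runs.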
Cost Optimization Strategies

GPU resources are expensive. The key directions for optimization are:

- Autoscaling: set a sensible HPA policy so that replicas scale down automatically during off-peak hours
- GPU sharing: use NVIDIA MIG to partition an A100 into multiple smaller GPU instances
- Model quantization: use INT8/FP16 quantization to reduce GPU memory usage and increase throughput
- Request batching: implement dynamic batching on the server side to raise GPU utilization
- Spot instances: run non-critical inference workloads on cloud providers' Spot instances to cut costs

    # Dynamic batching example
    from collections import deque
    import asyncio
    import threading


    class DynamicBatcher:
        """Dynamic batching inference engine"""

        def __init__(self, model, max_batch_size=32, max_wait_ms=50):
            self.model = model
            self.max_batch_size = max_batch_size
            self.max_wait_ms = max_wait_ms
            self.queue = deque()
            self.lock = threading.Lock()

        async def predict(self, features):
            """Submit an inference request and wait for the batched result"""
            future = asyncio.get_running_loop().create_future()
            with self.lock:
                self.queue.append((features, future))
            # The source is truncated here: the worker that drains the queue,
            # runs the batch, and resolves each future is not shown.
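The `DynamicBatcher` example in the article shows only how requests are enqueued. One way the remaining half could work is sketched below: a background worker waits for the first request, keeps collecting until either `max_batch_size` is reached or `max_wait_ms` has elapsed, runs the whole batch once, and resolves each caller's future. This is a separate, simplified variant built on `asyncio.Queue` rather than the article's `deque` plus lock; the class name `AsyncDynamicBatcher`, the `predict_batch` callable, and the sum-of-features stand-in model are assumptions for illustration.

```python
import asyncio
from typing import Callable, List, Sequence


class AsyncDynamicBatcher:
    """Collect concurrent requests and run them through the model in batches."""

    def __init__(
        self,
        predict_batch: Callable[[List[Sequence[float]]], List[float]],
        max_batch_size: int = 32,
        max_wait_ms: float = 50.0,
    ) -> None:
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, features: Sequence[float]) -> float:
        """Enqueue one request and wait until the worker resolves its future."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((features, future))
        return await future

    async def run(self) -> None:
        """Worker loop: group queued requests into batches until cancelled."""
        loop = asyncio.get_running_loop()
        while True:
            items = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_wait_s
            while len(items) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            batch = [features for features, _ in items]
            try:
                # One model call for the whole batch
                outputs = self.predict_batch(batch)
                for (_, future), output in zip(items, outputs):
                    if not future.done():
                        future.set_result(output)
            except Exception as exc:
                # Propagate the failure to every caller in this batch
                for _, future in items:
                    if not future.done():
                        future.set_exception(exc)


async def demo() -> List[float]:
    # Stand-in model (assumption): the "prediction" is the sum of the features.
    batcher = AsyncDynamicBatcher(
        lambda batch: [float(sum(f)) for f in batch],
        max_batch_size=4,
        max_wait_ms=20,
    )
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict([i, i + 1]) for i in range(6)))
    worker.cancel()
    return list(results)
```

Running `asyncio.run(demo())` sends six concurrent requests through the batcher. Note that `predict_batch` is called synchronously here, so a slow model would block the event loop; for real GPU inference it would likely need to be moved off the loop, for example with `loop.run_in_executor`.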
Summary

This article covered the full path for an AI model from containerization to production deployment on Kubernetes. Key takeaways:

1. A multi-stage Dockerfile effectively reduces image size and makes good use of layer caching.
2. GPU resource management is the core challenge of AI deployment; requests and limits need to be configured sensibly.

---

> 📢 This is a condensed version of the article. The full version, which includes exclusive tool recommendations and in-depth analysis, is available at [WD Tech Blog](https://wdsega.github.io). Follow my blog for the latest tech news, AI tutorials, and productivity tool recommendations!
