Deploying AI Models with Docker + Kubernetes: A Hands-On MLOps Guide from Development to Production
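Why is AI model deployment so hard?
Architecture overview
Step 1: Containerize the AI model
Write the FastAPI inference service
gpu-node-label.yaml
Label GPU nodes for Pod scheduling
k8s/deployment.yaml
Step 3: Monitoring and observability
Prometheus metrics collection
.github/workflows/ml-deploy.yaml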
Moving a trained AI model out of a Jupyter Notebook and into production is often the most challenging stage of the entire machine learning lifecycle. Model inference brings a series of operational problems with it: GPU resource scheduling, high-concurrency requests, version management, canary releases, and monitoring and alerting.

This article walks step by step through the full workflow, from containerizing an AI model to deploying it on a Kubernetes cluster, and builds a production-grade MLOps deployment setup.

First, we create a FastAPI-based model inference service (app/main.py).

# gpu-node-label.yaml
# Label GPU nodes for Pod scheduling
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
  labels:
    node-type: gpu
    gpu-type: nvidia-a100
    nvidia.com/gpu.present: "true"

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-model-service
  namespace: ml-serving
  labels:
    app: ai-model-service
    version: v1
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most 1 extra Pod during a rolling update
      maxUnavailable: 0  # no Pod may be unavailable during the update
  selector:
    matchLabels:
      app: ai-model-service
  template:
    metadata:
      labels:
        app: ai-model-service
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      # GPU resource constraints
      nodeSelector:
        node-type: gpu
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: model-server
          image: your-registry.com/ai-model-service:v1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              protocol: TCP
          env:
            - name: MODEL_PATH
              value: "/app/models"
            - name: MODEL_VERSION
              valueFrom:
                configMapKeyRef:
                  name: model-config
                  key: model-version
            - name: LOG_LEVEL
              value: "info"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
              nvidia.com/gpu: "1"  # request 1 GPU
            limits:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: "1"  # limit to at most 1 GPU
          # Readiness probe
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          # Liveness probe
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
            failureThreshold: 5
          # Startup probe (gives the model more time to load)
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 30
            periodSeconds: 10

Add Prometheus metrics to the FastAPI application (app/metrics.py).

# .github/workflows/ml-deploy.yaml
on:
  push:
    branches: [main]
    paths:
      - 'app/**'
      - 'models/**'
      - 'Dockerfile'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/ -v --cov=app
      - name: Run model validation
        run: python scripts/validate_model.py

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: |
          docker build -t your-registry.com/ai-model-service:${{ github.sha }} .
      - name: Push to registry
        run: |
          echo ${{ secrets.REGISTRY_PASSWORD }} | docker login your-registry.com -u ${{ secrets.REGISTRY_USER }} --password-stdin
          docker push your-registry.com/ai-model-service:${{ github.sha }}

  deploy-staging:
    needs: build-and-push

Cost optimization strategies

GPU resources are expensive; several key optimization directions help keep the cost down.
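To put a number on "expensive", the sketch below estimates the monthly GPU bill for the Deployment shown earlier (2 replicas, 1 GPU each). The hourly price and the 730-hour month are illustrative assumptions, not figures from any provider's price list, and `monthly_gpu_cost` is a hypothetical helper.

```python
def monthly_gpu_cost(replicas: int, gpus_per_replica: int,
                     price_per_gpu_hour: float, hours: float = 730.0) -> float:
    """Return the estimated cost of keeping the replicas running for `hours`."""
    return replicas * gpus_per_replica * price_per_gpu_hour * hours


# Assumed price of 3.00 per GPU-hour (placeholder; check your provider).
always_on = monthly_gpu_cost(replicas=2, gpus_per_replica=1, price_per_gpu_hour=3.0)

# Same service, but scaled down to 1 replica for half of the month's hours.
scaled = (monthly_gpu_cost(2, 1, 3.0, hours=365.0)
          + monthly_gpu_cost(1, 1, 3.0, hours=365.0))

print(f"always on: {always_on:.2f}, with off-peak scale-down: {scaled:.2f}")
```

Under these assumed numbers, dropping to one replica for half of the hours cuts the bill by a quarter, which is why replica count is usually the first lever to look at.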
┌─────────────────────────────────────────────────────┐
│                 Kubernetes Cluster                  │
│                                                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│  │ Ingress  │→ │ Service  │→ │ Model Deployment │   │
│  │Controller│  │   (LB)   │  │   (Pods + GPU)   │   │
│  └──────────┘  └──────────┘  └──────────────────┘   │
│                                       ↓             │
│                                ┌──────────────┐     │
│                                │  Prometheus  │     │
│                                │  + Grafana   │     │
│                                └──────────────┘     │
└─────────────────────────────────────────────────────┘
# app/main.py
from fastapi import FastAPI, HTTPException
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field
import numpy as np
import logging
import time
from typing import List, Optional
import mlflow
import os

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(
    title="AI Model Serving",
    version="1.0.0",
    docs_url="/docs",
)

# Global model instance
model = None
model_version = os.getenv("MODEL_VERSION", "latest")


class PredictionRequest(BaseModel):
    """Prediction request."""
    features: List[float] = Field(..., description="Input feature vector")
    model_version: Optional[str] = Field(None, description="Model version to use")


class PredictionResponse(BaseModel):
    """Prediction response."""
    prediction: float
    confidence: float
    model_version: str
    latency_ms: float


class BatchPredictionRequest(BaseModel):
    """Batch prediction request."""
    batch_features: List[List[float]] = Field(..., description="Batch of input features")
    model_version: Optional[str] = Field(None, description="Model version to use")


class HealthResponse(BaseModel):
    """Health check response."""
    status: str
    model_loaded: bool
    model_version: str
    gpu_available: bool


@app.on_event("startup")
async def load_model():
    """Load the model at startup."""
    global model
    try:
        # Load the model from MLflow or a local path
        model_path = os.getenv("MODEL_PATH", "./models/model.pkl")
        logger.info(f"Loading model: {model_path}")
        # Simulated model loading (replace with a real model in an actual project)
        model = {"version": model_version, "loaded": True}
        logger.info(f"Model loaded successfully, version: {model_version}")
    except Exception as e:
        logger.error(f"Model loading failed: {e}")
        raise


@app.get("/health", response_model=HealthResponse)
async def health_check():
    """Health check endpoint."""
    import torch
    gpu_available = torch.cuda.is_available()
    return HealthResponse(
        status="healthy" if model else "unhealthy",
        model_loaded=model is not None,
        model_version=model_version,
        gpu_available=gpu_available,
    )


@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    """Single prediction."""
    if not model:
        raise HTTPException(status_code=503, detail="Model not loaded")
    start_time = time.time()
    try:
        features = np.array(request.features).reshape(1, -1)
        # Simulated inference (replace with the real model inference logic)
        prediction = float(np.random.random())
        confidence = float(np.random.uniform(0.7, 0.99))
        latency = (time.time() - start_time) * 1000
        logger.info(
            f"Prediction done - latency: {latency:.2f}ms, "
            f"result: {prediction:.4f}"
        )
        return PredictionResponse(
            prediction=prediction,
            confidence=confidence,
            model_version=model_version,
            latency_ms=round(latency, 2),
        )
    except Exception as e:
        logger.error(f"Prediction failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/predict/batch")
async def batch_predict(request: BatchPredictionRequest):
    """Batch prediction."""
    if not model:
        raise HTTPException(status_code=503, detail="Model not loaded")
    start_time = time.time()
    results = []
    try:
        batch = np.array(request.batch_features)
        # The source listing ends at this point. The rest of the function is an
        # assumed minimal completion that mirrors the simulated /predict path.
        for _ in batch:
            results.append(float(np.random.random()))
        latency = (time.time() - start_time) * 1000
        return {
            "predictions": results,
            "model_version": model_version,
            "latency_ms": round(latency, 2),
        }
    except Exception as e:
        logger.error(f"Batch prediction failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))
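A quick way to exercise the `/predict` endpoint is a small client. The sketch below uses only the Python standard library and assumes the service is reachable at `http://localhost:8000`, which is a placeholder; `build_predict_request` and `predict` are hypothetical helper names, not part of the service code.

```python
import json
import urllib.request


def build_predict_request(features, base_url="http://localhost:8000"):
    """Build a POST request whose JSON body matches PredictionRequest."""
    body = json.dumps({"features": list(features)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, base_url="http://localhost:8000", timeout=5.0):
    """Send the request and return the decoded JSON response as a dict."""
    req = build_predict_request(features, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request needs no running server; calling predict() does.
req = build_predict_request([0.1, 0.2, 0.3])
print(req.get_method(), req.full_url)
```

With the service running, `predict([0.1, 0.2, 0.3])` should return a dict with the `prediction`, `confidence`, `model_version`, and `latency_ms` fields declared in `PredictionResponse`.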
Step 2: Kubernetes deployment
GPU node configuration

First, make sure the GPU plugin is installed in the Kubernetes cluster.
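Once the NVIDIA device plugin is running, each GPU node advertises the extended resource `nvidia.com/gpu` under `status.allocatable`. The sketch below is one way to check that from the JSON printed by `kubectl get nodes -o json`; the helper name `gpu_nodes` and the sample data are made up for illustration.

```python
import json


def gpu_nodes(node_list: dict) -> dict:
    """Map node name -> allocatable GPU count for nodes that expose GPUs."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        count = int(allocatable.get("nvidia.com/gpu", "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


# Trimmed-down sample in the shape of `kubectl get nodes -o json` (made-up data).
sample = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-01"},
   "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "1"}}},
  {"metadata": {"name": "cpu-node-01"},
   "status": {"allocatable": {"cpu": "16"}}}
]}
""")

print(gpu_nodes(sample))
```

In practice the input could come from `json.loads(subprocess.check_output(["kubectl", "get", "nodes", "-o", "json"]))`. An empty result suggests that the plugin is not installed or that no GPU node has joined the cluster.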
Deployment configuration
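The probe settings in `k8s/deployment.yaml` translate into time budgets. While a startup probe is defined, Kubernetes holds back the liveness and readiness probes until it succeeds, and a probe is considered failed after `failureThreshold` consecutive failures spaced `periodSeconds` apart. A small sketch of that arithmetic with the values from the manifest follows; `probe_budget_seconds` is a hypothetical helper, and the result is an approximation that ignores per-probe timeouts.

```python
def probe_budget_seconds(failure_threshold: int, period_seconds: int,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate time before a continuously failing probe gives up."""
    return initial_delay_seconds + failure_threshold * period_seconds


# Values taken from the startupProbe, livenessProbe, and readinessProbe blocks.
startup = probe_budget_seconds(failure_threshold=30, period_seconds=10)
liveness = probe_budget_seconds(failure_threshold=5, period_seconds=30)
readiness = probe_budget_seconds(failure_threshold=3, period_seconds=10)

print(startup, liveness, readiness)
```

So the model has roughly 300 seconds to finish loading before the container is restarted. After startup, a hung server is restarted after about 150 seconds of failed liveness checks, and a Pod stops receiving traffic after about 30 seconds of failed readiness checks.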
Step 4: CI/CD pipeline

# app/metrics.py
from prometheus_fastapi_instrumentator import Instrumentator
from prometheus_client import Counter, Histogram, Gauge
import time

# Custom metrics
PREDICTION_COUNT = Counter(
    "model_predictions_total",
    "Total number of predictions",
    ["model_version", "status"],
)

PREDICTION_LATENCY = Histogram(
    "model_prediction_latency_seconds",
    "Prediction latency in seconds",
    ["model_version"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

MODEL_INFERENCE_ACTIVE = Gauge(
    "model_inference_active",
    "Number of active inference requests",
)

GPU_MEMORY_USAGE = Gauge(
    "gpu_memory_usage_bytes",
    # The source listing is cut off after the metric name; the help text
    # below is an assumed completion so that the module imports cleanly.
    "GPU memory usage in bytes",
)
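Prometheus histograms are cumulative: each bucket with upper bound `le` counts every observation less than or equal to that bound, and an implicit `+Inf` bucket counts them all. The sketch below reproduces that bookkeeping in plain Python for the bucket bounds used by `model_prediction_latency_seconds`. The latency samples are invented, and the code only mimics the semantics rather than calling `prometheus_client`.

```python
import math

BUCKETS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


def cumulative_bucket_counts(observations, buckets=BUCKETS):
    """Return {upper_bound: number of observations <= upper_bound}."""
    bounds = list(buckets) + [math.inf]
    return {le: sum(1 for v in observations if v <= le) for le in bounds}


# Invented latency samples, in seconds.
latencies = [0.05, 0.2, 0.2, 0.7, 3.0]
counts = cumulative_bucket_counts(latencies)

for le, count in counts.items():
    print(f"le={le}: {count}")
```

With these bounds, every request faster than 100 ms lands in the first bucket, so differences below 100 ms are invisible. For fast models it may be worth adding finer buckets at the low end.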
# Dynamic batching example
from collections import deque
import asyncio
import threading


class DynamicBatcher:
    """Dynamic batching inference engine."""

    def __init__(self, model, max_batch_size=32, max_wait_ms=50):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = deque()
        self.lock = threading.Lock()

    async def predict(self, features):
        """Submit an inference request and wait for the batched result."""
        future = asyncio.get_running_loop().create_future()
        with self.lock:
            self.queue.append((features, future))
        # (the listing ends here in this abridged version of the article)
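The `DynamicBatcher` listing above stops at the point where a request is queued. One way such a batcher could continue is sketched below. This is not the author's implementation: it swaps the `deque` and `threading.Lock` for an `asyncio.Queue`, and `AsyncBatcher`, `batch_fn`, and `fake_model` are hypothetical names used only for illustration.

```python
import asyncio


class AsyncBatcher:
    """Collect requests and run them through a batch function together."""

    def __init__(self, batch_fn, max_batch_size=32, max_wait_ms=50):
        self.batch_fn = batch_fn  # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000.0
        self.queue = asyncio.Queue()

    async def predict(self, features):
        """Enqueue one request and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((features, future))
        return await future

    async def run(self):
        """Worker loop: flush when the batch is full or the wait time is up."""
        loop = asyncio.get_running_loop()
        while True:
            items = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_wait
            while len(items) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                outputs = self.batch_fn([features for features, _ in items])
            except Exception as exc:
                # Propagate a batch failure to every waiting caller.
                for _, future in items:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), output in zip(items, outputs):
                if not future.done():
                    future.set_result(output)


async def demo():
    batch_sizes = []

    def fake_model(batch):
        # Stand-in for real inference: record the batch size, sum each input.
        batch_sizes.append(len(batch))
        return [sum(features) for features in batch]

    batcher = AsyncBatcher(fake_model, max_batch_size=4, max_wait_ms=20)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict([i, i]) for i in range(6)))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

Each caller still gets its own result in request order, while the model sees at most `max_batch_size` inputs per call. In this sketch `batch_fn` runs synchronously on the event loop; real GPU inference would more likely be handed to `loop.run_in_executor` so that it does not block other requests.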
- Autoscaling: set a sensible HPA policy so that replicas are scaled down automatically during off-peak hours
- GPU sharing: use NVIDIA MIG to partition an A100 into multiple smaller GPU instances
- Model quantization: use INT8/FP16 quantization to reduce GPU memory usage and increase throughput
- Request batching: implement dynamic batching on the server side to improve GPU utilization
- Spot instances: run non-critical inference workloads on cloud providers' Spot instances to lower cost

Summary

This article covered the full path of an AI model from containerization to production deployment on Kubernetes. Key takeaways:

1. A **multi-stage build** Dockerfile effectively reduces image size and makes use of layer caching
2. **GPU resource management** is the core challenge of AI deployment and requires sensible requests and limits

---

> 📢 **This article is an abridged version. The full version, with exclusive tool recommendations and in-depth analysis, is available at [WD Tech Blog](https://wdsega.github.io).**