Zero-to-Scale ML: Deploying ONNX Models on Kubernetes with FastAPI and HPA
2025-12-15
admin
The path to scalable ML deployment requires high-performance APIs and robust orchestration. This post walks through setting up a local, highly available, auto-scaling inference service, using FastAPI for speed and Kind for local Kubernetes orchestration.

## Phase 1: The FastAPI Inference Service

Our Python service handles ONNX model inference. The critical component for Kubernetes stability is the `/health` endpoint, which the liveness and readiness probes will hit:

```python
# app.py snippet
# ... model loading logic ...

@app.get("/health")
def health_check():
    # K8s probes hit this endpoint frequently, so keep it cheap
    return {"status": "ok", "model_loaded": True}

# ... /predict endpoint ...
```
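The snippet above reports success unconditionally once the app is up. One refinement worth sketching (an assumption beyond the original snippet, not part of the service as shown): return 503 until the ONNX session has actually loaded, so the readiness probe keeps the pod out of Service endpoints during startup. `health_status` is a hypothetical helper separating that decision logic from the route handler:

```python
# Hypothetical helper: decision logic for a readiness-aware /health.
# Kubernetes treats any 2xx response as probe success, so returning 503
# until the model is loaded keeps traffic away from a cold pod.

def health_status(model_loaded: bool) -> tuple[int, dict]:
    """Return the (http_status, body) the probe handler should emit."""
    if model_loaded:
        return 200, {"status": "ok", "model_loaded": True}
    return 503, {"status": "loading", "model_loaded": False}
```

In the route handler you would pass in whether the ONNX session exists and set `response.status_code` accordingly.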
## Phase 2: Docker and Kubernetes Deployment

After building the image (`clothing-classifier:latest`) and loading it into Kind, we define the Deployment. Note the crucial resource constraints and probes:

```yaml
# deployment.yaml (snippet focusing on probes and resources)
resources:
  requests:
    cpu: "250m"       # Used by the scheduler for placement
    memory: "500Mi"
  limits:
    cpu: "500m"       # Prevents the pod from monopolizing the node
    memory: "1Gi"
livenessProbe:
  httpGet: {path: /health, port: 8000}
  initialDelaySeconds: 5
readinessProbe:
  httpGet: {path: /health, port: 8000}
  initialDelaySeconds: 5  # Gives time for the ONNX model to load
```
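The millicpu notation trips people up, so here is a small conversion sketch (`parse_cpu_quantity` is a hypothetical helper, not part of the deployment). A detail that matters later: the HPA computes CPU utilization against the *request* (250m), not the limit.

```python
def parse_cpu_quantity(q: str) -> float:
    """Convert a Kubernetes CPU quantity to cores: "250m" is 250 millicores."""
    if q.endswith("m"):
        return int(q[:-1]) / 1000.0  # millicores -> cores
    return float(q)

# The Deployment requests a quarter core and caps usage at half a core:
request_cores = parse_cpu_quantity("250m")  # 0.25
limit_cores = parse_cpu_quantity("500m")    # 0.5
```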
## Phase 3: Implementing Horizontal Pod Autoscaler (HPA)

Scalability is handled by the HPA, which requires the Metrics Server to be running in the cluster:

```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clothing-classifier-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clothing-classifier-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50  # Scale up when average CPU exceeds 50%
```
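To build intuition for when this manifest scales, the HPA control loop's core rule from the Kubernetes docs, `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, can be sketched in plain Python with the bounds above (ignoring the real controller's tolerance band and stabilization window):

```python
import math

def desired_replicas(current: int, avg_cpu_utilization: float,
                     target: float = 50.0,
                     min_replicas: int = 2, max_replicas: int = 5) -> int:
    """Approximate the HPA rule: ceil(current * metric / target), clamped
    to the minReplicas/maxReplicas bounds from the manifest."""
    desired = math.ceil(current * avg_cpu_utilization / target)
    return max(min_replicas, min(max_replicas, desired))

# e.g. two pods averaging 100% of their CPU request scale out to four pods
```

Because utilization is measured against the 250m request, a pod using a full 250m core already reads as 100%.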
Result: under load, the HPA dynamically adjusts the replica count between two and five pods. This is the definition of elastic, cost-effective MLOps.

Read the full guide here. If you're deploying any Python API, adopting this pattern for resource management and scaling will save you major headaches down the road.
Tags: how-to, tutorial, guide, dev.to, ai, ml, server, docker, node, python, kubernetes, k8s