# Tools: Serverless ML Inference with AWS Lambda + Docker

Running ML models in production sounds simple until you realize you're paying for servers 24/7 even when nobody is using them. That was my situation: I had a model running on EC2, serving predictions through Flask. It worked. It also quietly burned money every hour of the day. So I rebuilt the entire inference pipeline on AWS Lambda and reduced costs to almost zero during idle time. This post walks through exactly how I did it.

## The Problem with "Always-On" ML Inference

When I first deployed a machine learning model, I followed the standard approach:

- EC2 instance
- Load model at startup
- Serve predictions over HTTP
- Pay for compute 24/7, even at 3 AM when traffic is zero

For systems like AquaChain, inference is event-driven:

- Bursts of requests from devices
- Long idle periods

Running a server continuously for this pattern is wasteful.

## Enter: Serverless ML Inference

With AWS Lambda:

- You pay only when your model runs
- No idle infrastructure
- Fully event-driven execution

## The Stack

- Python 3.11
- scikit-learn 1.4.0
- XGBoost 2.0.3
- numpy 1.26.3 + pandas 2.1.4
- AWS Lambda (container image)
- Amazon ECR (container registry)
- S3 (model artifact storage)

## Project Structure

```
ml_inference/
├── handler.py            # Lambda entry point
├── model_loader.py       # S3 model caching logic
├── feature_extractor.py
├── Dockerfile
└── requirements.txt
```

## The Dockerfile

The key is using AWS's official Lambda base image. It includes the Lambda runtime interface client, so your container behaves exactly like a standard Lambda function.

```dockerfile
FROM public.ecr.aws/lambda/python:3.11

# Copy requirements first for layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy function code
COPY handler.py model_loader.py feature_extractor.py ./

# Lambda handler entrypoint
CMD ["handler.lambda_handler"]
```

The requirements.txt for the ML stack:

```
scikit-learn==1.4.0
xgboost==2.0.3
numpy==1.26.3
pandas==2.1.4
boto3==1.34.34
joblib==1.3.2
```

One important detail: put `COPY requirements.txt` and `RUN pip install` before copying your application code. Docker caches each layer, so if your code changes but your dependencies don't, the pip install layer is reused and your build takes seconds instead of minutes.

## The Handler

```python
import json
import logging

from model_loader import get_model
from feature_extractor import extract_features

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    try:
        readings = event.get('readings', {})
        device_id = event.get('deviceId', 'unknown')

        # Validate inputs before touching the model
        required = ['pH', 'turbidity', 'tds', 'temperature']
        missing = [f for f in required if f not in readings]
        if missing:
            return {
                'statusCode': 400,
                'body': json.dumps({
                    'error': f"Missing fields: {missing}",
                    'code': 'VALIDATION_ERROR'
                })
            }

        # Extract features (includes trend calculations)
        features = extract_features(readings)

        # Get model (cached in /tmp after first load)
        model = get_model()

        # Run inference
        wqi = float(model.predict([features])[0])
        confidence = float(model.predict_proba([features]).max())
        quality = classify_wqi(wqi)

        logger.info("Inference complete", extra={
            'deviceId': device_id,
            'wqi': wqi,
            'quality': quality,
            'confidence': confidence
        })

        return {
            'statusCode': 200,
            'body': json.dumps({
                'wqi': round(wqi, 2),
                'quality': quality,
                'confidence': round(confidence, 4),
                'deviceId': device_id
            })
        }

    except Exception as e:
        logger.error(f"Inference error: {e}", exc_info=True)
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': 'Inference failed',
                'code': 'INFERENCE_ERROR'
            })
        }


def classify_wqi(wqi: float) -> str:
    if wqi >= 90:
        return 'Excellent'
    if wqi >= 70:
        return 'Good'
    if wqi >= 50:
        return 'Fair'
    if wqi >= 25:
        return 'Poor'
    return 'Very Poor'
```

## Model Caching: The Most Important Optimization

Lambda's /tmp directory persists across warm invocations of the same container instance. Loading the model from S3 on every request would add 200–500 ms of latency and unnecessary S3 GET costs. Cache it in /tmp on first load:

```python
import logging
import os

import boto3
import joblib

logger = logging.getLogger()

MODEL_S3_BUCKET = os.environ['MODEL_BUCKET']
MODEL_S3_KEY = os.environ['MODEL_KEY']
LOCAL_MODEL_PATH = '/tmp/model.joblib'

_model_cache = None  # Module-level cache: survives across warm invocations


def get_model():
    global _model_cache
    if _model_cache is not None:
        logger.debug("Using in-memory model cache")
        return _model_cache

    # Check /tmp first (warm container, model already downloaded)
    if os.path.exists(LOCAL_MODEL_PATH):
        logger.info("Loading model from /tmp cache")
        _model_cache = joblib.load(LOCAL_MODEL_PATH)
        return _model_cache

    # Cold start: download from S3
    logger.info(f"Downloading model from s3://{MODEL_S3_BUCKET}/{MODEL_S3_KEY}")
    s3 = boto3.client('s3')
    s3.download_file(MODEL_S3_BUCKET, MODEL_S3_KEY, LOCAL_MODEL_PATH)
    _model_cache = joblib.load(LOCAL_MODEL_PATH)
    logger.info("Model loaded and cached")
    return _model_cache
```

Two levels of caching here:

- `_model_cache`: in-memory, fastest possible, survives as long as the container is warm
- `/tmp/model.joblib`: survives container reuse even if the Python process restarts

On a cold start you pay the S3 download once. Every subsequent warm invocation skips it entirely.

## Building and Pushing to ECR

```shell
# Authenticate Docker with ECR
aws ecr get-login-password --region ap-south-1 | \
  docker login --username AWS --password-stdin \
  758346259059.dkr.ecr.ap-south-1.amazonaws.com

# Build the image
docker build -t aquachain-ml-inference .

# Tag for ECR
docker tag aquachain-ml-inference:latest \
  758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest

# Push
docker push \
  758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest
```

Then deploy the Lambda pointing at the ECR image:

```shell
aws lambda update-function-code \
  --function-name aquachain-function-ml-inference-dev \
  --image-uri 758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest \
  --region ap-south-1
```

## CDK Definition

```python
from aws_cdk import (
    aws_lambda as lambda_,
    aws_ecr as ecr,
    Duration
)

# Reference existing ECR repo
repo = ecr.Repository.from_repository_name(
    self, "MLInferenceRepo", "aquachain-ml-inference"
)

ml_inference_fn = lambda_.DockerImageFunction(
    self, "MLInferenceFunction",
    function_name="aquachain-function-ml-inference-dev",
    code=lambda_.DockerImageCode.from_ecr(
        repo,
        tag_or_digest="latest"
    ),
    memory_size=1024,  # ML models benefit from more memory
    timeout=Duration.seconds(30),
    environment={
        "MODEL_BUCKET": "aquachain-models-dev",
        "MODEL_KEY": "wqi/model_v2.joblib",
        "LOG_LEVEL": "INFO"
    }
)

# Grant S3 read access for model download
model_bucket.grant_read(ml_inference_fn)
```

Memory sizing matters here. I started at 512 MB and saw ~180 ms inference times. Bumping to 1024 MB dropped it to ~85 ms: Lambda allocates CPU proportionally to memory, so more memory means faster CPU and faster inference. Run a few tests at different memory sizes; the cost difference is often negligible compared to the latency improvement.

## Handling Cold Starts

Cold starts for container-based Lambdas are longer than zip-based ones: typically 2–5 seconds for a 500 MB image. For AquaChain this is acceptable because inference is triggered asynchronously (the data processing Lambda doesn't wait for the result). But if you need synchronous inference with strict latency SLAs, there are two options:

1. Provisioned Concurrency keeps N container instances warm at all times. It eliminates cold starts, but you pay for idle time. Only worth it if your p99 latency requirement is under 500 ms and you have consistent traffic.
2. A scheduled warm-up ping, an EventBridge rule that invokes the function every 5 minutes with a dummy payload, is cheap and effective for low-traffic functions, but not a guarantee.

For most ML inference use cases, async invocation plus accepting occasional cold starts is the right trade-off.

## Updating the Model

One of the best things about this setup: updating the model doesn't require a code deployment. You just upload a new model.joblib to S3 with the same key, and the next cold start picks it up automatically.

```shell
# Upload new model version
aws s3 cp model_v3.joblib s3://aquachain-models-dev/wqi/model_v2.joblib

# If you need to roll back, just update the env var to point at the previous version
aws lambda update-function-configuration \
  --function-name aquachain-function-ml-inference-dev \
  --environment "Variables={MODEL_KEY=wqi/model_v1.joblib,MODEL_BUCKET=aquachain-models-dev}" \
  --region ap-south-1
```

For versioned rollouts, use S3 versioning and point the Lambda env var at a specific version ID.

## The Numbers

Running in production on AquaChain, at our current inference volume, Lambda costs under $1/month. Compare that to a t3.small EC2 instance running 24/7: ~$15/month regardless of traffic.

## When NOT to Use Lambda

Serverless ML is not a silver bullet. It's the wrong fit when:

- You need ultra-low latency (<50 ms)
- You have constant high traffic
- Your model is extremely large (>5 GB and slow to load)

In those cases, a dedicated endpoint (SageMaker / ECS / EC2) is a better fit.

## What I'd Do Differently

1. Use multi-stage Docker builds. The current image includes build tools that aren't needed at runtime. A multi-stage build copies only the installed packages into the final image, reducing image size by 30–40% and speeding up cold starts.
2. Pin the base image digest, not just the tag. `python:3.11` tags can change. Use the SHA256 digest for reproducible builds in production.
3. Add model validation on load. Before caching the model, run a quick sanity check: predict on a known input and assert the output is in the expected range. This catches corrupted model files before they serve bad predictions.

Serverless ML inference isn't for every system. But for event-driven workloads like AquaChain, it hits a rare sweet spot: low cost, zero idle infrastructure, and production-grade performance. If your model doesn't need to run 24/7, your infrastructure shouldn't either.
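update the">
The multi-stage build and digest pinning from "What I'd Do Differently" might combine into something like the following. This is a sketch under assumptions: the `sha256:<digest>` is a placeholder for the real digest you'd resolve from ECR Public, and the site-packages path reflects where the AWS Lambda Python base image installs packages by default (verify against the image you actually pull).

```dockerfile
# Build stage: dependencies are installed with build tools available
FROM public.ecr.aws/lambda/python:3.11 AS build
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Final stage, pinned by digest for reproducible builds
# (<digest> is a placeholder; resolve the real one from the registry)
FROM public.ecr.aws/lambda/python:3.11@sha256:<digest>

# Copy only the installed packages, leaving build tools behind
# (path assumed from the Lambda Python base image layout)
COPY --from=build /var/lang/lib/python3.11/site-packages /var/lang/lib/python3.11/site-packages

COPY handler.py model_loader.py feature_extractor.py ./
CMD ["handler.lambda_handler"]
```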
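update the">
As a footnote to the "Updating the Model" section, the versioned-rollout idea could be sketched with the AWS CLI. This is a sketch, not tested config: `MODEL_VERSION_ID` is an env var I'm inventing here (the `model_loader.py` shown earlier would need to pass it to S3 as a `VersionId` download argument), and `<version-id>` is a placeholder you'd copy from the first command's output.

```shell
# List the version IDs of the model artifact (assumes S3 versioning is
# enabled on the bucket from the post)
aws s3api list-object-versions \
  --bucket aquachain-models-dev \
  --prefix wqi/model_v2.joblib \
  --query 'Versions[].{id:VersionId,when:LastModified}'

# Pin the function to one exact object version via a hypothetical
# MODEL_VERSION_ID env var
aws lambda update-function-configuration \
  --function-name aquachain-function-ml-inference-dev \
  --environment "Variables={MODEL_BUCKET=aquachain-models-dev,MODEL_KEY=wqi/model_v2.joblib,MODEL_VERSION_ID=<version-id>}" \
  --region ap-south-1
```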
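update the">
The model-validation idea from "What I'd Do Differently" could be sketched like this. The helper name, the known input, and the 0–100 WQI range are my stand-ins; real code would call this from `get_model()` right before caching the freshly loaded joblib artifact.

```python
# Sketch of "validate before caching": one prediction on a known-good input,
# with a range check, so a corrupted model file fails loudly instead of
# silently serving bad predictions.

def validate_model(model, known_input, expected_range=(0.0, 100.0)):
    """Run one prediction on a known input and check the output is sane."""
    lo, hi = expected_range
    prediction = float(model.predict([known_input])[0])
    if not (lo <= prediction <= hi):
        raise ValueError(
            f"Model sanity check failed: predicted {prediction}, "
            f"expected a value in [{lo}, {hi}]"
        )
    return prediction


# Minimal stub standing in for the real scikit-learn model:
class _StubModel:
    def predict(self, X):
        return [72.5 for _ in X]


# Hypothetical known-good reading: [pH, turbidity, tds, temperature]
wqi = validate_model(_StubModel(), known_input=[7.0, 1.2, 150.0, 24.0])
print(wqi)  # 72.5
```

In `get_model()`, a failed check would mean deleting the cached `/tmp/model.joblib` and raising, so the next invocation retries the download rather than reusing the bad file.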