LLM Model Storage with NFS: Download Once, Infer Everywhere

By Joe Keegan and Anish Singh Walia

Your vLLM pods are probably downloading the same massive model file every time they start. If you've deployed LLM inference on Kubernetes, you may have taken the straightforward path: point vLLM at HuggingFace and let it download the model when the pod starts. It works. But here's what happens next: every pod restart, scale-up event, and failure recovery triggers another multi-gigabyte download.

There's a better way: download the model once to shared storage, then let every pod load directly from that source. No redundant downloads. No external runtime dependencies. Fast access for any new pod you deploy.

In this guide, you'll deploy vLLM on DigitalOcean Kubernetes Service (DOKS) using Managed NFS for model storage. We'll use a single H100 GPU node to keep things simple, but the pattern scales to as many nodes as you need, and that's the point. Once your model is on NFS, adding GPU capacity means instant model access, not another lengthy download.

Let's be specific about the costs of downloading models at pod startup.

Model files are large. Mistral-7B-Instruct-v0.3, the model we'll use in this guide, is approximately 15GB. Larger models like Llama 70B can exceed 140GB. Every time a pod starts and downloads from HuggingFace, that's 15GB (or more) traversing the internet.

Every pod restart means another download. Pod crashes happen. Node maintenance happens. Deployments happen. With the "download every time" approach, each of these events triggers a fresh download. If your inference pod crashes and restarts three times in a day, you've downloaded the same model files three times.

Scaling becomes a bandwidth competition. When your Horizontal Pod Autoscaler adds replicas during a traffic spike, each new pod downloads the model simultaneously. Three new pods means three concurrent multi-gigabyte downloads, all competing for bandwidth. Instead of responding to increased demand, you're waiting for downloads to complete.

HuggingFace becomes a runtime dependency. This is the subtle one. During normal operations, HuggingFace availability doesn't seem like a concern; it's almost always up. But consider the 2 AM scenario: your GPU node fails, Kubernetes schedules a replacement pod, and HuggingFace is rate-limiting your IP or experiencing an outage. Your ability to recover from a failure now depends on an external service you don't control.

The principle here is simple: control your dependencies. External services like HuggingFace should be sources for initial acquisition, not runtime dependencies. When a pod needs to start, whether due to scaling, deployment, or failure recovery, it should pull from infrastructure you control.

The pattern is straightforward: download the model once to an NFS share, then have every vLLM pod mount that share and load the model directly from it. This approach eliminates the problems we just discussed:

- ReadWriteMany access: NFS supports concurrent read access from all pods. Whether you have one replica or ten, they all read from the same source simultaneously.
- Persistence: Model files survive pod restarts, node replacements, and cluster upgrades. Download the model once and it's available until you explicitly remove it.
- In-region, no external dependency: Your NFS share is in the same DigitalOcean region as your cluster. Loading happens over the private network, which is fast, reliable, and independent of external services.
- Managed infrastructure: DigitalOcean handles the NFS servers and availability. You don't need to manage NFS infrastructure yourself.

The scaling story is where this really shines. When you add a new GPU node tomorrow and deploy another vLLM replica, that pod has instant access to the model. No download step. The pod starts, mounts NFS, loads the model into GPU memory, and starts serving requests. Startup time is just load time, not download time plus load time. Contrast this with the per-pod download approach: adding a new replica means waiting many minutes for yet another download before that capacity is usable.
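To make the pattern concrete before we build it, here's a minimal sketch of a vLLM pod spec that loads the model from an NFS-backed volume instead of a HuggingFace model ID. The pod name, namespace, image tag, and model path here are illustrative assumptions; the rest of this guide creates the real resources step by step.

```yaml
# Illustrative sketch only: names, namespace, image tag, and paths are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-sketch
  namespace: vllm                     # assumed namespace
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest  # pin a specific tag in practice
      args:
        # Point vLLM at a filesystem path on the shared volume rather than a
        # HuggingFace repo ID, so startup never triggers an external download.
        - "--model"
        - "/models/Mistral-7B-Instruct-v0.3"
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: models
          mountPath: /models
  volumes:
    - name: models
      persistentVolumeClaim:
        claimName: vllm-models-pvc    # the ReadWriteMany claim created later in this guide
```

The only difference from the "download every time" approach is that the --model argument is a path on the mounted share, so startup time is purely load time.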
Here's what we're building: a VPC, a DOKS cluster with an H100 GPU node, and a Managed NFS share in the same region, plus a one-time Kubernetes Job that downloads the model to the NFS share and a vLLM deployment that mounts that share at startup.

The data flow has two distinct phases:

- One-time setup (happens once): a Kubernetes Job downloads the model from HuggingFace and writes it to the NFS share.
- Every pod startup (happens on each start): the pod mounts the NFS share, loads the model files into GPU memory, and starts serving.

Compare this to the "download every time" flow, where every startup includes the HuggingFace download before the model can even begin loading into GPU memory.

Before starting this tutorial, you'll need:

- A DigitalOcean account with H100 GPU quota approved
- A HuggingFace account with access to your chosen model
- kubectl installed locally
- Basic Kubernetes familiarity

We need three DigitalOcean resources: a VPC, a DOKS cluster with a GPU node, and an NFS share. All must be in the same region. Not all DigitalOcean regions have both H100 GPUs and Managed NFS; at the time of writing, NYC2 and ATL1 support both, so choose one of these for your deployment. Check the NFS section of the DigitalOcean Control Panel for the latest list of supported regions.

The VPC provides private networking between your cluster and NFS share. For detailed instructions, see How to Create a VPC.

Next, create the DOKS cluster. On the GPU node pool configuration page, select the H100 Droplet type for LLM inference workloads. For detailed instructions, see How to Create a Kubernetes Cluster.

Once your cluster is running, configure kubectl access and list the nodes with kubectl get nodes. You should see at least two nodes: your management nodes and one GPU node (node names will vary). For detailed instructions, see How to Connect to a Cluster.

Now create the NFS share. This is the key step: you're creating the persistent storage that will hold your model files. On the NFS creation page, choose a descriptive name and select your VPC, then click Create NFS Share and wait for the status to become ACTIVE. For detailed instructions, see How to Create an NFS Volume.

Once the share is active, note the Mount Source value from the NFS list; it contains the host IP and mount path you'll need for the Kubernetes configuration. The Mount Source has the format <HOST>:<PATH> (e.g., 10.100.32.2:/2633050/7d1686e4-9212-420f-a593-ab544993d99b). You'll split this into two parts for the PersistentVolume configuration: the host becomes <NFS_HOST> and the path becomes <NFS_MOUNT_PATH>.

Why this matters: this NFS share is now your persistent model library. Any model you download here is accessible to every pod in your cluster, today and in the future. There's no need to re-download when pods restart or when you scale up.

Now we'll create the Kubernetes resources that let pods access your NFS share. First, create a dedicated namespace for your vLLM resources with kubectl create namespace.

The PersistentVolume (PV) tells Kubernetes how to connect to your NFS share: using the Mount Source you noted earlier, replace <NFS_HOST> with the IP address and <NFS_MOUNT_PATH> with the path. The PersistentVolumeClaim (PVC) is what pods actually reference to access storage. Apply both resources with kubectl apply, then verify the PVC is bound with kubectl get pvc. The STATUS should show Bound; if it shows Pending, double-check that your PV configuration matches the PVC selector and that the NFS host/path are correct. A sketch of both manifests appears at the end of this step.

Now Kubernetes knows how to access your NFS share. Any pod that mounts vllm-models-pvc gets access to the shared storage.
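For reference, here's a minimal sketch of what the PV and PVC manifests for this step might look like. The PVC name (vllm-models-pvc) and the ReadWriteMany access mode follow the tutorial, and the 100Gi capacity mirrors the 100GB share used here; the PV name, label selector, namespace, and file name are assumptions to adjust for your setup.

```yaml
# Minimal sketch: the PV name, label selector, and namespace are assumptions;
# the PVC name (vllm-models-pvc) and ReadWriteMany access mode follow the tutorial.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vllm-models-pv
  labels:
    app: vllm-models              # matched by the PVC selector below
spec:
  capacity:
    storage: 100Gi                # nominal for NFS; mirrors the 100GB share used here
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""            # static provisioning, no storage class
  nfs:
    server: <NFS_HOST>            # host IP from the Mount Source
    path: <NFS_MOUNT_PATH>        # path from the Mount Source
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models-pvc
  namespace: vllm                 # assumed namespace created earlier
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 100Gi
  selector:
    matchLabels:
      app: vllm-models
# Apply and verify (the file name is an assumption):
#   kubectl apply -f nfs-model-storage.yaml
#   kubectl -n vllm get pvc vllm-models-pvc   # STATUS should show Bound
```

Leaving storageClassName empty and using a label selector keeps the claim statically bound to this specific NFS-backed volume instead of triggering dynamic provisioning.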
This step is the key to the "download once" pattern. We'll use a Kubernetes Job to download the model to NFS. A Job runs once and completes; it's not a long-running deployment.

First, create a Secret containing your HuggingFace token, replacing <YOUR_HUGGINGFACE_TOKEN> with your actual token, then create the Job that downloads the model. Watch the Job's logs to follow its progress: you'll see pip installing huggingface_hub, then progress output as the model files download. The download takes approximately 5-10 minutes depending on network conditions. Wait for the Job to complete and verify that it succeeded.

This is the crucial point: this download happens once. Every pod you deploy from now on, today, tomorrow, next month, will use these same model files. No more waiting for downloads on pod restarts.

Now we deploy vLLM itself. The pod will mount NFS and load the model directly; no download step from HuggingFace is required. The key configuration details are the volume mount for vllm-models-pvc and the model argument pointing at the model's path on the share rather than a HuggingFace model ID.

A note on container images: this tutorial pulls the vLLM image from Docker Hub for simplicity. For production deployments, mirror the image to DigitalOcean Container Registry (DOCR) and reference it from there. This applies the same "control your dependencies" principle: Docker Hub becomes a one-time source rather than a runtime dependency.

Check the pod status and notice what's not happening here: there's no download step. vLLM starts, mounts NFS, and loads the model directly into GPU memory. If the pod had to download the model, you'd be waiting several additional minutes.

Let's verify everything works by sending requests to your vLLM deployment. Since we're using a ClusterIP service, use kubectl port-forward to access it locally. Keep the port-forward running in one terminal and open another terminal to send test requests. You have a working LLM inference endpoint backed by shared NFS storage.

For production deployments, you'd expose this via Gateway API or a LoadBalancer service instead of port-forwarding. DOKS includes a pre-configured Cilium integration that makes setting up Gateway API straightforward; see How To Route HTTPS Traffic Using Gateway API and Cilium on DigitalOcean Kubernetes for a detailed walkthrough. For more advanced vLLM deployment strategies and model caching techniques, see vLLM Kubernetes: Model Loading & Caching Strategies.

Keep in mind that this tutorial focuses specifically on model storage. A production LLM deployment involves many other considerations, such as authentication, rate limiting, efficient request routing across replicas, and observability, all beyond the scope of this guide.

This is where the NFS pattern pays off. Imagine you've been running a single vLLM replica and traffic is increasing; you need more capacity. First, add another GPU node to your cluster through the DigitalOcean Control Panel. Once it's ready, scale your deployment and watch the new pod come up (example commands follow at the end of this section).

Notice the timing: the new pod went from Pending to Running in about 45 seconds. That's model load time alone, not several minutes of download time on top of it. With the per-pod download approach, this would have taken much longer: the new pod would have downloaded the entire model from HuggingFace while your first pod handled all the traffic alone.

The same principle applies when you add more GPU nodes tomorrow, next week, or next month. The model is already on NFS, so new pods have instant access: no downloads, no waiting, no bandwidth competition.
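As a rough sketch of the scale-out commands described above, assuming the Deployment is named vllm in a vllm namespace (both names are assumptions):

```bash
# Assumed names: Deployment "vllm" in namespace "vllm".
# Scale from one replica to two once the new GPU node is ready:
kubectl -n vllm scale deployment vllm --replicas=2

# Watch the new pod get scheduled onto the new node and go Running.
# Because the model is already on NFS, there is no multi-gigabyte download phase.
kubectl -n vllm get pods -w

# Optionally confirm the rollout has finished:
kubectl -n vllm rollout status deployment/vllm
```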
For deployments with multiple replicas, you'll want to put a load balancer in front of vLLM; see the Gateway API tutorial for how to set this up with DOKS. To learn more about Kubernetes storage fundamentals, see How to Use Persistent Volumes in DigitalOcean Kubernetes.

GPU nodes are expensive. When you're done testing, clean up your resources through the DigitalOcean Control Panel to avoid unnecessary charges. Delete resources in roughly the reverse order you created them to avoid dependency issues: the vLLM workloads and the PVC/PV first, then the NFS share, the DOKS cluster, and finally the VPC.

Why NFS instead of block or object storage? NFS provides ReadWriteMany access, meaning multiple pods can read the same model files simultaneously. This is essential for horizontal scaling of LLM inference workloads. Block storage options like DigitalOcean Block Storage only support ReadWriteOnce, which limits you to one pod per volume. Object storage like DigitalOcean Spaces could work, but requires additional tooling to mount as a filesystem and typically has higher latency than NFS for model loading operations.

Can you use this pattern with inference frameworks other than vLLM? Yes. This pattern works with any LLM inference framework that can load models from a filesystem path, including TensorRT-LLM, llama.cpp, and others. The key requirement is that your inference container can mount the NFS PersistentVolumeClaim and read model files from the mounted path. Simply adjust the model path in your deployment configuration to point to the NFS mount location.

What if you need more space? DigitalOcean Managed NFS shares can be resized at any time through the Control Panel or API. When you resize, the additional space becomes available immediately without downtime. For this tutorial, we started with 100GB, which is sufficient for Mistral-7B-Instruct-v0.3 (approximately 15GB). Larger models like Llama 70B (140GB+) will require more space. Plan your initial NFS size based on your model requirements, and scale up as needed. See the NFS resize documentation for details.

Is NFS fast enough? For model loading operations, NFS performance is typically sufficient because loading happens once per pod startup, not during inference. The model files are read into GPU memory at startup, and subsequent inference operations use the in-memory model, not the NFS share. However, if you're doing frequent model reloads or checkpointing during training, you may see better performance with local NVMe storage. For inference workloads where models load once and stay in memory, NFS provides the scalability benefits without noticeable performance impact.

How do you update a model? You have a few options. First, you can download a new model version to a different directory on the same NFS share (e.g., /models/Mistral-7B-Instruct-v0.4), then update your vLLM deployment to point to the new path. This allows you to test the new model while keeping the old one available for rollback. Alternatively, you can delete the old model directory and download the new version to the same path. Since the download happens via a Kubernetes Job, you can run the download job again with updated parameters. The NFS share persists across pod restarts, so your model updates remain available.
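To make the rollback-friendly update option concrete, here's a minimal sketch, assuming the Deployment is named vllm in a vllm namespace and its manifest lives in vllm-deployment.yaml (all three names are assumptions):

```bash
# Assumed names: Deployment "vllm", namespace "vllm", manifest vllm-deployment.yaml.
# 1. Re-run the download Job pointed at the new model version and a new target
#    directory on the same NFS share (e.g. /models/Mistral-7B-Instruct-v0.4),
#    leaving the old model directory in place.
# 2. Edit the manifest so the vLLM --model argument points at the new path,
#    then apply it and watch the rollout:
kubectl -n vllm apply -f vllm-deployment.yaml
kubectl -n vllm rollout status deployment/vllm

# 3. If the new version misbehaves, roll back to the previous revision, which
#    still points at the old model path on NFS:
kubectl -n vllm rollout undo deployment/vllm
```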
We've gone from "download the model every time a pod starts" to "download once, infer everywhere." The principle behind this approach, control your dependencies, applies beyond just model storage. External services should be sources for initial acquisition, not runtime dependencies. When your infrastructure needs to respond to failures, traffic spikes, or routine deployments, it should rely only on components you control.

Now that you've set up shared model storage with NFS, explore the resources linked throughout this guide to build out your production LLM deployment. Ready to deploy your LLM inference workloads? Get started with DigitalOcean Managed Kubernetes and Managed NFS to build scalable, production-ready AI infrastructure. With features like automated backups, high availability, and seamless scaling, DigitalOcean provides the foundation you need for enterprise LLM deployments.