The Math Behind VM Right-Sizing (Stop Guessing Your Azure SKU)
2026-02-13
admin
## Problem Statement: The Cost of Static Sizing

We have all done this at some point. You are deploying a new application, and the manager asks, "What size VM do we need?" You don't want to be the person who crashed the production server because of low RAM. So, what do you do? You take the estimated requirement and multiply it by 2 or 4. "Just to be safe." If the load test hit 60% CPU on 4 vCPUs, we request 8 vCPUs. The VM goes live, runs at 12% utilization, and we never look at it again.

This "safety-margin culture" is the single biggest reason for cloud waste. I am currently building CloudSavvy.io to automate this problem, but today I want to share the core engineering logic and the math you need to implement right-sizing yourself without breaking production.

Most organizations size VMs at deployment time and never revisit the decision. This is a structural issue. Consider a D8s_v5 (8 vCPU, 32 GiB) in East US:

- Cost: ~$280/month.
- Actual usage: 11% CPU, 22% memory.

A D4s_v5 (4 vCPU, 16 GiB) costs ~$140/month. It would handle that load with plenty of buffer. If you have 200 VMs like this, the annual waste reaches six figures. The problem is not that engineers over-provision deliberately. The problem is that right-sizing requires continuous, metrics-driven evaluation, and most teams lack the instrumentation to do it systematically.

## Core Metrics Required (CPU is not enough)

Many scripts just look at "average CPU" and suggest a downsize. This is dangerous. You need to analyze four resource dimensions over a 30-day window.

## 1. CPU Utilization

Raw average is insufficient. You need three statistical views:

- Average: If it is below 20% sustained for 30 days, it is a downsizing candidate.
- P95 (95th percentile): This captures the realistic peak. If P95 is below 50%, you are definitely over-provisioned.
- Peak (P99/Max): If the peak is high (90%+) but P95 is low, the workload is "bursty." Do not switch to a smaller fixed SKU; consider a B-series (Burstable) instead.

## 2. Memory Utilization

This is the most neglected metric. A VM can run at 10% CPU while using 85% of available memory (common for databases and caching workloads). Formula:

```
memory_utilization_pct = ((total_memory - available_memory) / total_memory) * 100
```

If average memory utilization exceeds 80% sustained, the VM is a candidate for upsizing or a family change (e.g., to E-series), regardless of CPU. If you ignore this, you risk Out of Memory (OOM) crashes.

## 3. Disk IOPS and Throughput

Disk performance constrains VM sizing independently of CPU. Azure VM SKUs have hard ceilings:

- Standard_D4s_v5: Max 6,400 IOPS.
- Standard_D2s_v5: Max 3,200 IOPS.

If your workload sustains 5,800 IOPS and you downsize to a D2s because "CPU is low," you will hit I/O throttling and the application will lag. Always compare P95 IOPS against the target SKU limit.

## 4. Network Throughput

Similar to disk, network bandwidth is SKU-dependent. If sustained network throughput exceeds 60% of the target SKU's ceiling, block the downsize. Network-bound workloads (like API gateways) often have low CPU but cannot tolerate bandwidth reduction.
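The three CPU views above are cheap to compute once you have the samples. Here is a minimal sketch in Python (standard library only); the function name, the nearest-rank percentile method, and the sample series are my own illustrations, not part of any Azure SDK:

```python
import statistics

def cpu_views(samples: list[float]) -> dict:
    """Summarize a window of CPU samples (percent) into average, P95, and peak."""
    s = sorted(samples)
    avg = statistics.fmean(s)
    return {
        "avg": avg,
        "p95": s[min(len(s) - 1, int(0.95 * len(s)))],  # nearest-rank 95th percentile
        "peak": s[-1],
        # stddev/mean > 0.6 flags a "bursty" workload (B-series candidate)
        "bursty": statistics.pstdev(s) / avg > 0.6 if avg else False,
    }

# Mostly idle with rare spikes: low average and P95, high peak -> bursty.
views = cpu_views([8.0] * 97 + [95.0] * 3)
```

Note how this sample VM would look like an easy downsize on averages alone (P95 of 8%), yet the burstiness flag correctly steers it toward a B-series instead of a smaller fixed SKU.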
## Sizing Decision Logic

You cannot rely on simple thresholds. You need a decision framework. Here is the logic flow:

**Step 1: Coverage Gate**

If cpu_hours < 648 (90% of the 720 hours in a 30-day window), BLOCK. Do not guess with insufficient data.
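A sketch of this gate in Python, assuming hourly samples so that one sample equals one metric-hour; the constant and function names are illustrative:

```python
WINDOW_HOURS = 720   # 30-day evaluation window
MIN_COVERAGE = 0.90  # block below 90% metric coverage (648 hours)

def coverage_gate(cpu_hours_observed: int) -> bool:
    """Return True only when there is enough telemetry to recommend anything."""
    return cpu_hours_observed >= WINDOW_HOURS * MIN_COVERAGE

# An agent outage that cost 100 hours of metrics blocks the recommendation.
print(coverage_gate(648), coverage_gate(620))  # → True False
```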
**Step 2: Classification**

- cpu_sustained_low = (cpu_p95 < 20%) AND (cpu_avg < 15%)
- memory_low = (memory_p95 < 40%)
- memory_high = (memory_p95 >= 75%)

**Step 3: Action Determination**

- IF cpu_sustained_low AND memory_low: Action: DOWNSIZE within same family. Example: D8s_v5 → D4s_v5.
- IF cpu_sustained_low AND memory_high: Action: SWITCH FAMILY to memory-optimized (E-series). Example: D8s_v5 → E4s_v5 (fewer vCPUs, same memory).
- IF cpu_high AND memory_low: Action: SWITCH FAMILY to compute-optimized (F-series). Example: D8s_v5 → F8s_v2 (same vCPU, less memory, higher clock speed).
- IF CPU variability is high (stddev/mean > 0.6): Action: RECOMMEND BURSTABLE (B-series). Example: D4s_v5 → B4ms.
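Steps 2 and 3 fold naturally into one decision function. A sketch: the cpu_sustained_low, memory_low, and memory_high thresholds come straight from Step 2, while the 70% cpu_high cutoff and the choice to let burstiness override the other branches are my assumptions, since the post defines neither explicitly:

```python
def decide_action(cpu_avg: float, cpu_p95: float,
                  cpu_stddev: float, mem_p95: float) -> str:
    """Map 30-day CPU/memory statistics to a resize action."""
    cpu_sustained_low = (cpu_p95 < 20) and (cpu_avg < 15)
    memory_low = mem_p95 < 40
    memory_high = mem_p95 >= 75
    cpu_high = cpu_p95 >= 70  # assumed cutoff; not stated in the post
    bursty = cpu_avg > 0 and (cpu_stddev / cpu_avg) > 0.6

    if bursty:
        return "RECOMMEND_BURSTABLE"      # e.g. D4s_v5 -> B4ms
    if cpu_sustained_low and memory_low:
        return "DOWNSIZE_SAME_FAMILY"     # e.g. D8s_v5 -> D4s_v5
    if cpu_sustained_low and memory_high:
        return "SWITCH_FAMILY_E_SERIES"   # e.g. D8s_v5 -> E4s_v5
    if cpu_high and memory_low:
        return "SWITCH_FAMILY_F_SERIES"   # e.g. D8s_v5 -> F8s_v2
    return "NO_ACTION"
```

Checking burstiness first is deliberate: a bursty VM can also satisfy cpu_sustained_low on its averages alone, and moving it to a smaller fixed SKU is exactly the mistake the B-series branch exists to prevent.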
Before acting on any of these recommendations, apply three safety gates:

- IOPS Safety: IF target_sku_max_iops < current_disk_iops_p95 * 1.2 → BLOCK.
- Production Tag: IF resource is tagged "Production" → apply 30% stricter headroom margins.
- Compliance: IF tagged "PCI-DSS" or "HIPAA" → BLOCK automated resize.

## Example Scenarios

**Scenario A: The Memory-Bound Database**

- Current: Standard_D8s_v5 (8 vCPU, 32 GiB) — USD 280/month
- Metrics: CPU avg 12%, P95 25% | Memory avg 78%, P95 89%
- Analysis: CPU is underutilized, but memory is near capacity. Downsizing D-series reduces RAM, risking OOM.
- Recommendation: Switch to Standard_E4s_v5 (4 vCPU, 32 GiB).
- Savings: USD 85/month. Memory preserved, CPU reduced to match actual utilization.

**Scenario B: The GPU Mistake**

- Current: Standard_NC24s_v3 (24 vCPU, 4x V100 GPUs) — USD 9,204/month
- Metrics: GPU utilization avg 22% (single GPU active).
- Analysis: Only 1 of 4 GPUs is active. The workload is a single-model inference service that does not parallelize.
- Recommendation: Downsize to Standard_NC6s_v3 (6 vCPU, 1x V100).
- Savings: USD 6,903/month.

## Data Engineering Considerations

If you are implementing this, keep in mind:

- Telemetry: Use the Azure Monitor Metrics API (Microsoft.Compute/virtualMachines), not Resource Graph. Resource Graph provides metadata, not performance history.
- Sampling Window: 30 days is the mandatory minimum to capture monthly batch jobs. 7 days is too risky.
- Missing Data: Missing metric hours are not zero-utilization hours. If the agent was down, do not interpolate. Block the recommendation.
- ROI Check: Calculate the exact monthly cost delta. If savings < USD 5/month, skip it. It's not worth the engineering effort.
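The ROI check above, together with the earlier IOPS and compliance gates, composes into a single veto function. A sketch with the USD 5 floor and 1.2x IOPS headroom factor from the post; the function and parameter names are mine:

```python
def passes_safety_gates(
    target_max_iops: int,
    disk_iops_p95: float,
    tags: set[str],
    monthly_savings_usd: float,
) -> bool:
    """Final veto: any single failed gate blocks the resize recommendation."""
    if target_max_iops < disk_iops_p95 * 1.2:  # IOPS headroom gate
        return False
    if tags & {"PCI-DSS", "HIPAA"}:            # compliance gate
        return False
    if monthly_savings_usd < 5.0:              # ROI gate
        return False
    return True

# The 5,800-IOPS workload from earlier: a D4s_v5 (6,400 IOPS cap) is vetoed,
# because 6,400 < 5,800 * 1.2 leaves no throttling headroom.
print(passes_safety_gates(6400, 5800, set(), 140.0))  # → False
```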
## Conclusion

Right-sizing is not just about cost minimization; it is cost-to-performance optimization. The goal is to eliminate waste without introducing performance risk. A one-time audit is not enough because workloads change. If you automate this logic effectively, you can maintain performance while significantly reducing your Azure bill.

If you are looking for a tool that automates this entire decision framework, do check out CloudSavvy.io. Let me know in the comments if you have faced issues with IOPS throttling after resizing!