SQL Server Performance Tuning

How We Found the VM Cap Costing Our Client $30K in Wasted SQL Server Spend

Updated
9 min read
Written by
Saulius Baskevicius
Reviewed by
Mark Varnas
TL;DR

An Azure VM instance type caps total disk throughput across all attached disks, so fast provisioned disks get silently throttled until you move to a VM with a higher throughput ceiling.

  • The VM-level throughput cap overrides per-disk IOPS settings, and although Azure exposes disk throughput metrics and limits, nothing raises an explicit alert when your provisioned disks exceed the VM cap.
  • Switching to an E5_BDS instance gave 4x the disk throughput for $400 to $500 more per month and beat the original $30,000/month RAID servers.
  • Disk latency dropped from 300 milliseconds to around 37 milliseconds after the instance change.
  • To check your own VMs, total the provisioned throughput across every attached disk and compare it to the instance type’s max uncached disk throughput.

The Starting Point: $30,000/Month in SQL Server Azure Cost With Zero Optimization

This client was running SQL Server on Azure with hosting costs of around $30,000 per month.

The infrastructure was built to handle peak workloads with no optimization applied afterward. The team specced out what they thought the environment needed, built the servers, and let them run for years.

The storage configuration was impressive. Custom RAID arrays built on the fastest disks available in Azure. From a database perspective, the setup was well-engineered. IOPS were strong. The RAID configuration increased overall throughput by 2 to 4x compared to standalone disks.

But the environment was over-provisioned. The servers had far more capacity than the workloads required. And nobody had reviewed the configuration since it was originally deployed.

Our goal was to cut SQL Server Azure cost by 50 to 70%. That meant moving off the expensive RAID arrays, right-sizing the instances, and optimizing the queries running on them.

PCI DSS compliance requirements meant every change needed approval chains that stretched weeks. This project started in August and was still moving through approvals months later. Azure SQL cost optimization in regulated industries takes patience. The ROI is there, but the compliance process does not move at the speed of technology.

The First Attempt: Standard Disks. It Broke.

The first move was replacing the custom RAID arrays with standard Azure disks. On paper, this was straightforward. The RAID arrays were expensive. Standard disks with high IOPS provisioning should handle the workload at a fraction of the cost.

It did not work.

The standard disks could not deliver results in time. Database responsiveness dropped. The system was hammering the disk subsystem and struggling to keep up. End users were waiting for query results that previously returned in acceptable timeframes.

The RAID arrays had been masking inefficiencies in the workload. Poorly optimized queries that generated excessive IO were running fine because the storage was fast enough to brute-force through them. Remove that raw storage power, and those same queries choked the system.

Storage configuration details like disk sector alignment also compound these issues — a misconfigured disk adds latency before you even touch VM-level caps.

The Second Attempt: Upgraded Disks. Hit a Different Wall.

We upgraded to a faster disk type. Performance improved, but not enough. Queries were still slower than expected. The disk IOPS should have been sufficient based on the provisioned configuration.

This is where we found the problem that most teams miss.

Azure VM instance types have a maximum aggregate disk throughput. This is a hard cap on the total bandwidth available across all attached disks combined.

Here is what that means in practice.

You configure a single disk with 5 gigabits per second of throughput. Looks good in the Azure portal. But your VM instance type has a total throughput cap of 1.5 gigabits per second across all disks. And you have other disks attached that are already consuming part of that bandwidth.

Your disk will never reach the promised performance. The VM-level cap silently throttles everything.

This is not surfaced prominently in the Azure portal. You configure each disk individually. The settings look correct. The provisioned IOPS look right. But the VM instance type is the bottleneck, and nothing in the standard monitoring dashboards flags it.

You can provision 4 gigabits on one disk, but if the VM throughput cap is 4 gigabits total and your other disks are already consuming bandwidth, that single disk will never get what you paid for.

The Fix: Change the VM Instance Type

We changed the VM to an E5_BDS instance type. This provided four times the disk throughput capacity compared to the previous instance type.

The cost increase was $400 to $500 per month.

The resulting Azure SQL database performance exceeded the original expensive RAID-based servers.

Read that again.

$400-500 per month more on the VM instance type, and the new environment outperformed the setup that cost tens of thousands per month in SQL Server Azure pricing.

Disk latency, which had been spiking to 300 milliseconds on the worst-performing disks, dropped to around 37 milliseconds. Queue length, previously growing to 250, stabilized. CPU utilization showed significant headroom on the new instances.

One of the two original servers was decommissioned. The client is now running two new servers with one old server kept as a backup during the transition period. The final cutover is expected within weeks.

Performance Tuning on the Smaller Footprint

With the infrastructure stabilized, we moved into the next phase: query optimization on the right-sized environment.

The results from index optimization were immediate and significant.

On one query, a single index reduced IO reads from approximately 8.5 million to 319,000. That index took about 20 minutes to create.

On another query, a single index dropped IO reads from 38 million to 5,500. This fix also required changing a WHERE condition in the application code. Average query duration dropped instantly after implementation.

These are not marginal improvements. IO reads dropping from 38 million to 5,500 is a 99.98% reduction. The Azure SQL database performance impact was immediate — the database is doing almost no unnecessary work on that query path.

And this is where Azure SQL cost optimization compounds. The right-sized infrastructure costs less. The optimized queries need fewer resources. Which means the infrastructure can be right-sized again. Each cycle delivers additional savings on a smaller base cost.

We’ve seen similar compounding effects in other cases — for example, fixing a single data type issue saved $50K+ in hardware costs by eliminating unnecessary IO at the query level.

We estimate the next round of downsizing will save this client an additional $30,000 per year once the current performance tuning work is fully implemented.

Why This Pattern Repeats Across Azure SQL Environments

This is not an isolated case. The pattern shows up in almost every SQL Server Azure cost project we run.

  • Step one: the environment is over-provisioned because it was built for peak capacity and never reviewed.
  • Step two: someone tries to downsize by swapping to cheaper storage or a smaller instance. Performance breaks because the optimization was never done to make the workload efficient enough for a smaller infrastructure.
  • Step three: the team assumes the workload needs the expensive infrastructure and gives up.

The missing step is always the same.

Tune the workload first, or at least understand the infrastructure constraints before making changes. The Azure VM throughput limit is one example of a constraint that is invisible until you hit it. Disk latency, memory pressure, and CPU throttling all have similar gotchas that don’t surface in basic monitoring. In another engagement, a licensing and architecture review saved a client $117K/year on Azure SQL — a different root cause, the same pattern of unchecked infrastructure spend.

Workload validation before and after changes is critical. We use distributed replay and workload testing on Azure Hyperscale migrations to catch exactly these kinds of performance regressions before they reach production.

How to Check Your Own Environment

If you’re running SQL Server on Azure VMs, this is a 15-minute check worth doing today.

  1. Look up your VM instance type in the Azure VM sizes documentation. Find the “Max uncached disk throughput” and “Max uncached disk IOPS” columns.
  2. Add up the provisioned throughput across every attached disk — OS, data, log, tempdb.
  3. If the total provisioned throughput exceeds the VM-level cap, your disks are being throttled. You’re paying for IOPS you can’t use.
  4. Check whether a different VM instance type with higher throughput would cost less than the over-provisioned disk IOPS you’re currently running.

If you’d like us to run this check on your environment, our SQL performance tuning services include a full infrastructure and query review — we identify exactly where your Azure SQL spend is going and what to cut first.

Frequently Asked Questions

How do I know if my Azure SQL environment is over-provisioned?

Check CPU utilization, memory usage, and disk IOPS against provisioned capacity over a 30-day window. If average utilization is below 30-40% across all metrics, you are likely over-provisioned. A proper health check will identify where the waste is.

Can I downsize Azure SQL without performance testing first?

You can, but we don’t recommend it. Workloads that look lightweight in monitoring dashboards often have peak windows or batch processes that need the extra capacity. Test changes in a non-production environment first, or capture a workload baseline and validate Azure SQL database performance after the change.

What’s the difference between disk-level IOPS and VM-level throughput limits?

Disk-level IOPS is the performance you provision on each individual disk. VM-level throughput is the total bandwidth the instance type allows across all attached disks combined. The VM limit overrides the disk-level settings. You can provision more disk IOPS than the VM can deliver, and Azure will not warn you.

How long does a SQL Server Azure cost optimization project take in a regulated industry?

For regulated industries with PCI DSS requirements, plan for 3 to 6 months. The technical work (analysis, tuning, migration) takes weeks. The compliance approvals take months. Start early and build the approval timeline into your project plan.

Should I optimize queries before or after right-sizing the infrastructure?

Both approaches work, but the most efficient path is: right-size first to a level the workload can tolerate, then tune queries on the smaller footprint, then right-size again. Each cycle delivers compounding savings. The first right-sizing removes obvious waste. The query tuning makes the workload efficient. The second right-sizing captures the efficiency gains.

What is a realistic SQL Server Azure cost reduction target?

We typically see 50 to 70% reductions in over-provisioned environments. The savings come from three areas: instance right-sizing, storage optimization, and licensing changes (Enterprise to Standard edition where enterprise features are not being used). Query tuning adds additional savings by reducing the resources the workload needs. For more on this, see our breakdown of SQL Server Enterprise vs Standard edition savings — switching editions is often the fastest win on Azure SQL cost.

Speak with a SQL Expert

In just 30 minutes, we will show you how we can eliminate your SQL Server headaches and provide 
operational peace of mind

Article by
Saulius Baskevicius
Hey, I’m Saulius, part of the team behind Red9. SQL Server is my thing. Complex challenges - my passion.

Discover More

SQL Server Health Check SQL Server Migrations & Upgrades SQL Server Performance Tuning SQL Server Security SQL Server Tips

Discover what clients are saying about Red9

Red9 has incredible expertise both in SQL migration and performance tuning.

The biggest benefit has been performance gains and tuning associated with migrating to AWS and a newer version of SQL Server with Always On clustering. Red9 was integral to this process. The deep knowledge of MSSQL and combined experience of Red9 have been a huge asset during a difficult migration. Red9 found inefficient indexes and performance bottlenecks that improved latency by over 400%.

Rich Staats 5 stars
Rich Staats
Cloud Engineer
MetalToad

Always willing to go an extra mile

Working with Red9 DBAs has been a pleasure. They are great team players and have an expert knowledge of SQL Server database administration. And are always willing to go the extra mile to get the project done.
5 stars
Evelyn A.
Sr. Database Administrator

Boosts server health and efficiency for enhanced customer satisfaction

Since adding Red9 to the reporting and DataWarehousing team, Red9 has done a good job coming up to speed on our environments and helping ensure we continue to meet our customer's needs. Red9 has taken ownership of our servers ensuring they remain healthy by monitoring and tuning inefficient queries.
5 stars
Andrew F.
Datawarehousing Manager
See more testimonials