VM Snapshots Are Not SQL Server Backups: A Recovery Story

TL;DR

A VM snapshot is not a SQL Server backup unless it’s application-consistent. One client found out when storage corruption hit the master database and their only “backups” were nightly snapshots that didn’t coordinate with SQL Server.

A crash-consistent snapshot loses the dirty pages SQL Server hadn’t flushed yet, like pulling the power cord. An application-consistent one uses VSS to freeze I/O and flush buffers first, so it restores reliably.
Native BACKUP DATABASE and BACKUP LOG are the only methods SQL Server’s recovery engine fully trusts. An untested backup is just a file, so restore it to a spare server and run DBCC CHECKDB against it.
Back up master, msdb, and model too, and script your logins (SIDs and password hashes) to a file off the server. That turns a master loss into a 5-minute rebuild instead of database surgery.

What Happened

A new client called because their SQL Server cluster was down. Their words: “We rebooted both servers but now nothing works.”

The event logs showed DNS registration errors, but that was a symptom. The real problem was underneath: shared storage corruption. It had hit both the master database and a critical user database.

If your master database is corrupted, SQL Server won’t start. There’s no workaround, no degraded mode, no “start anyway and we’ll fix it later.” Master holds the server configuration, all database metadata, and every login account. Without it, you have nothing.

So the first question: where are the backups?

“Oh yeah, we do VM snapshots every night.”

Why VM Snapshots Aren’t SQL Server Backups

VM snapshots can work for SQL Server disaster recovery. But only if they’re configured for application consistency.

Here’s the problem. SQL Server has data files (.mdf/.ndf) and transaction log files (.ldf). At any given moment, these files are in different states — committed transactions in the data files, in-flight transactions in the log, dirty pages in memory that haven’t been flushed to disk yet. SQL Server manages the relationship between these files to guarantee transactional consistency.

A crash-consistent VM snapshot captures the disk state at a point in time, but it doesn’t coordinate with SQL Server. The data files and transaction logs may reflect different in-flight states, and dirty pages in memory that were not flushed at the time of the snapshot are lost. The result is the equivalent of pulling the power cord — SQL Server might recover from it through crash recovery, or it might not.

An application-consistent snapshot uses VSS (Volume Shadow Copy Service) to tell SQL Server “freeze I/O, flush your buffers, we’re taking a snapshot.”SQL Server cooperates, the snapshot captures an application-consistent state, and restoration is significantly more reliable.

This client’s snapshots weren’t application-consistent. Some were partially usable — likely captured during low-activity periods — but none provided the same restoration guarantees as native SQL Server backups.

As one colleague put it: “backups exist” is not the same as “backups work.” VM snapshots without application consistency are hope-based disaster recovery.

Recovery

The client wanted to “upgrade” to SQL Server 2022 during the recovery. During active disaster recovery is not the time to test version compatibility. We matched what they had — SQL Server 2014 — to eliminate variables.

Step 1: Build a new server. Fresh SQL Server 2014 installation with a clean master database.

Step 2: Attempt file-level recovery. We stopped SQL Server, swapped in database files salvaged from the VM snapshots, and attempted to start the instance. SQL Server refused — the master database files were too damaged.

Step 3: Restore master as a user database. This is an unusual technique, but it works. Instead of trying to use the corrupted master as the system database (which SQL Server won’t allow if it can’t pass recovery), we attached it as a regular user database on the new instance. This let us run DBCC CHECKDB against it to see exactly what was damaged, and — more importantly — query the system tables directly.

Step 4: Extract logins and credentials. The corrupted master database still had the system catalog tables containing all user logins and their password hashes. We queried sys.sql_logins and sys.server_principals from the attached database, extracted the SIDs and password hashes, and built scripts to recreate hundreds of user accounts on the new server with their original credentials intact.

Step 5: Restore user databases. Using the salvageable VM snapshot files, we brought the user databases online. Some required repair operations (with potential data loss).

Step 6: Reconnect everything. With logins restored and databases online, applications could connect again.

Everything came back. All users reconnected. But this recovery took far longer and carried far more risk than it should have.

Backup Anti-Patterns We See Constantly

This client’s situation was extreme, but the individual mistakes are common. Here are the backup anti-patterns we run into regularly:

“We have VM snapshots.” MSPs especially tend to treat VM snapshots as gospel. For file servers and application servers, crash-consistent snapshots are often fine. For database servers, they’re unreliable without VSS integration. SQL Server has its own backup mechanism for a reason — it’s the product of decades of lessons learned from failed improvisation.
Backups on the same storage as the database. We’ve seen backup files stored on the same disk subsystem as the database. If the storage fails, you lose both. One client had backup retention configured for 7 days, but the backups grew until they couldn’t even keep one — all on the same disk that eventually failed. Backups need to be on separate storage, ideally offsite or in a separate fault domain.
No system database backups. “Those are just small databases that don’t change much.” Until they do. The master database holds every login, every server configuration, every linked server definition. The msdb database holds every SQL Agent job, every backup history record, every maintenance plan. Losing master without a backup is how you end up doing the kind of surgery we did for this client.
No backup validation. An untested backup is just a file. You don’t know if it’s restorable until you try. RESTORE VERIFYONLY is a start (it checks the backup media is readable and the structure is valid), but the only real test is restoring to a separate server and running DBCC CHECKDB against the result.
No transaction log backups. Full backups every night with no log backups in between means your recovery point objective (RPO) is up to 24 hours of data loss. For many environments, that’s unacceptable. Log backups every 15–60 minutes close that gap to minutes instead of hours.
“The tape guy handles it.” One case involved an accounting firm during tax season. The “backup plan” was a secretary swapping tapes at the front desk every night with no success verification. When the database needed restoration, the most recent recoverable backup was several days old. During tax season.

What a Proper SQL Server Backup Strategy Looks Like

This doesn’t need to be complicated. The basics:

Native SQL Server backups. Use BACKUP DATABASE and BACKUP LOG. These are the only backup methods that SQL Server’s recovery engine fully trusts. They handle the coordination between data files, transaction logs, and in-flight transactions correctly every time.

A backup schedule that matches your RPO.

Full backups: weekly or daily, depending on database size and change rate
Differential backups: daily (if fulls are weekly)
Transaction log backups: every 15–60 minutes for databases in Full recovery model

Backup storage on a separate infrastructure. Not the same disk. Not the same SAN. Ideally offsite or in cloud storage. If the thing that kills your database also kills your backups, you don’t have backups.

System database backups. Master, msdb, and model — backed up regularly. They’re small. There’s no reason to skip them.

DBCC CHECKDB on a schedule. Run integrity checks regularly — weekly at minimum, daily if your maintenance window allows it. Catching corruption early is the difference between restoring one page and rebuilding a server.

Tested restores. Periodically restore your backups to a test server and validate them. Automate this if you can. The worst time to discover your backups don’t work is when you need them.

Script your logins. Regularly export your login definitions (including SIDs and password hashes) to a script file stored separately from the server. If you lose master, this script lets you recreate every account in minutes instead of performing database surgery.

Upgrade Question

Worth addressing directly: the client wanted to jump from SQL Server 2014 to 2022 during disaster recovery.

Don’t do this.

Version upgrades can introduce compatibility issues — query behavior changes, deprecated features, cardinality estimator differences. In normal circumstances, you’d test the upgrade in a non-production environment, validate application compatibility, and plan a rollback strategy.

During disaster recovery, you have none of that. You’re trying to restore service as fast as possible. Every variable you add increases the risk that something else breaks and extends the outage.

Match what you had. Get back online. Plan the upgrade as a separate project when you’re stable.

Bottom Line

This recovery worked. Everything came back online. But it required techniques that should never be necessary — restoring system databases as user databases, extracting password hashes from corrupted catalog tables, manually rebuilding server security from scratch.

All of it is preventable with basics that cost almost nothing:

Native SQL Server backups on a schedule
Transaction log backups for point-in-time recovery
System database backups
DBCC CHECKDB on a schedule
Tested restores
Scripted login exports

Dealing with backups is a lot like being a pilot. It’s a lonely job until an engine starts to misfire. Then everybody is suddenly very interested in what you do.

Don’t wait for the engine to misfire.

Speak with a SQL Expert

In just 30 minutes, we will show you how we can eliminate your SQL Server headaches and provide  operational peace of mind

Schedule My Call Now ➜

Article by

Mark Varnas

Founder | CEO | SQL Veteran

Hey, I'm Mark, one of the guys behind Red9. I make a living performance tuning SQL Servers and making them more stable.