SQL Server Health Check

VM Snapshots Are Not SQL Server Backups: A Disaster Recovery Story

Updated
11 min read
Written by
Mark Varnas
Reviewed by
Saulius Baskevicius
TLDR: A client's SQL Server cluster went down after routine maintenance. Shared storage corruption hit the master database and a critical user database. Their only backups were VM snapshots — not configured for application consistency. No native SQL Server backups. No integrity checks. 
Recovery required restoring the corrupted master as a user database to extract login credentials, then rebuilding from scratch. Everything came back online — but native backups and basic integrity checks would have made this a routine recovery.

What Happened

A new client called because their SQL Server cluster was down. Their words: “We rebooted both servers but now nothing works.”

The event logs showed DNS registration errors, but that was a symptom. The real problem was underneath: shared storage corruption. It had hit both the master database and a critical user database.

If your master database is corrupted, SQL Server won’t start. There’s no workaround, no degraded mode, no “start anyway and we’ll fix it later.” Master holds the server configuration, all database metadata, and every login account. Without it, you have nothing.

So the first question: where are the backups?

“Oh yeah, we do VM snapshots every night.”

Why VM Snapshots Aren’t SQL Server Backups

VM snapshots can work for SQL Server disaster recovery. But only if they’re configured for application consistency.

Here’s the problem. SQL Server has data files (.mdf/.ndf) and transaction log files (.ldf). At any given moment, these files are in different states — committed transactions in the data files, in-flight transactions in the log, dirty pages in memory that haven’t been flushed to disk yet. SQL Server manages the relationship between these files to guarantee transactional consistency.

A crash-consistent VM snapshot captures the disk state at a point in time, but it doesn’t coordinate with SQL Server. The data files and transaction logs may reflect different in-flight states, and dirty pages in memory that were not flushed at the time of the snapshot are lost. The result is the equivalent of pulling the power cord — SQL Server might recover from it through crash recovery, or it might not.

An application-consistent snapshot uses VSS (Volume Shadow Copy Service) to tell SQL Server “freeze I/O, flush your buffers, we’re taking a snapshot.”SQL Server cooperates, the snapshot captures an application-consistent state, and restoration is significantly more reliable.

This client’s snapshots weren’t application-consistent. Some were partially usable — likely captured during low-activity periods — but none provided the same restoration guarantees as native SQL Server backups.

As one colleague put it: “backups exist” is not the same as “backups work.” VM snapshots without application consistency are hope-based disaster recovery.

Other Problems

The storage corruption and bad snapshots weren’t the only issues. The full picture:

  • No native SQL Server backups. No BACKUP DATABASE. No BACKUP LOG. No differential backups. Nothing that SQL Server’s own recovery engine could use.
  • No database integrity checks. DBCC CHECKDB had never been run. There was no way to know whether corruption had been creeping in before the failure.
  • No DBAs on staff. Nobody on the team had SQL Server expertise. The cluster was managed by general IT staff.
  • Hundreds of user accounts with forgotten passwords. All stored in the master database that was now corrupted.

This combination — no native backups, no integrity checks, no database expertise — turns a recoverable incident into an emergency.

Recovery

The client wanted to “upgrade” to SQL Server 2022 during the recovery. During active disaster recovery is not the time to test version compatibility. We matched what they had — SQL Server 2014 — to eliminate variables.

Step 1: Build a new server. Fresh SQL Server 2014 installation with a clean master database.

Step 2: Attempt file-level recovery. We stopped SQL Server, swapped in database files salvaged from the VM snapshots, and attempted to start the instance. SQL Server refused — the master database files were too damaged.

Step 3: Restore master as a user database. This is an unusual technique, but it works. Instead of trying to use the corrupted master as the system database (which SQL Server won’t allow if it can’t pass recovery), we attached it as a regular user database on the new instance. This let us run DBCC CHECKDB against it to see exactly what was damaged, and — more importantly — query the system tables directly.

Step 4: Extract logins and credentials. The corrupted master database still had the system catalog tables containing all user logins and their password hashes. We queried sys.sql_logins and sys.server_principals from the attached database, extracted the SIDs and password hashes, and built scripts to recreate hundreds of user accounts on the new server with their original credentials intact.

Step 5: Restore user databases. Using the salvageable VM snapshot files, we brought the user databases online. Some required repair operations (with potential data loss).

Step 6: Reconnect everything. With logins restored and databases online, applications could connect again.

Everything came back. All users reconnected. But this recovery took far longer and carried far more risk than it should have.

Backup Anti-Patterns We See Constantly

This client’s situation was extreme, but the individual mistakes are common. Here are the backup anti-patterns we run into regularly:

  • “We have VM snapshots.” MSPs especially tend to treat VM snapshots as gospel. For file servers and application servers, crash-consistent snapshots are often fine. For database servers, they’re unreliable without VSS integration. SQL Server has its own backup mechanism for a reason — it’s the product of decades of lessons learned from failed improvisation.
  • Backups on the same storage as the database. We’ve seen backup files stored on the same disk subsystem as the database. If the storage fails, you lose both. One client had backup retention configured for 7 days, but the backups grew until they couldn’t even keep one — all on the same disk that eventually failed. Backups need to be on separate storage, ideally offsite or in a separate fault domain.
  • No system database backups. “Those are just small databases that don’t change much.” Until they do. The master database holds every login, every server configuration, every linked server definition. The msdb database holds every SQL Agent job, every backup history record, every maintenance plan. Losing master without a backup is how you end up doing the kind of surgery we did for this client.
  • No backup validation. An untested backup is just a file. You don’t know if it’s restorable until you try. RESTORE VERIFYONLY is a start (it checks the backup media is readable and the structure is valid), but the only real test is restoring to a separate server and running DBCC CHECKDB against the result.
  • No transaction log backups. Full backups every night with no log backups in between means your recovery point objective (RPO) is up to 24 hours of data loss. For many environments, that’s unacceptable. Log backups every 15–60 minutes close that gap to minutes instead of hours.
  • “The tape guy handles it.” One case involved an accounting firm during tax season. The “backup plan” was a secretary swapping tapes at the front desk every night with no success verification. When the database needed restoration, the most recent recoverable backup was several days old. During tax season.

What a Proper SQL Server Backup Strategy Looks Like

This doesn’t need to be complicated. The basics:

Native SQL Server backups. Use BACKUP DATABASE and BACKUP LOG. These are the only backup methods that SQL Server’s recovery engine fully trusts. They handle the coordination between data files, transaction logs, and in-flight transactions correctly every time.

A backup schedule that matches your RPO.

  • Full backups: weekly or daily, depending on database size and change rate
  • Differential backups: daily (if fulls are weekly)
  • Transaction log backups: every 15–60 minutes for databases in Full recovery model

Backup storage on a separate infrastructure. Not the same disk. Not the same SAN. Ideally offsite or in cloud storage. If the thing that kills your database also kills your backups, you don’t have backups.

System database backups. Master, msdb, and model — backed up regularly. They’re small. There’s no reason to skip them.

DBCC CHECKDB on a schedule. Run integrity checks regularly — weekly at minimum, daily if your maintenance window allows it. Catching corruption early is the difference between restoring one page and rebuilding a server.

Tested restores. Periodically restore your backups to a test server and validate them. Automate this if you can. The worst time to discover your backups don’t work is when you need them.

Script your logins. Regularly export your login definitions (including SIDs and password hashes) to a script file stored separately from the server. If you lose master, this script lets you recreate every account in minutes instead of performing database surgery.

Upgrade Question

Worth addressing directly: the client wanted to jump from SQL Server 2014 to 2022 during disaster recovery.

Don’t do this.

Version upgrades can introduce compatibility issues — query behavior changes, deprecated features, cardinality estimator differences. In normal circumstances, you’d test the upgrade in a non-production environment, validate application compatibility, and plan a rollback strategy.

During disaster recovery, you have none of that. You’re trying to restore service as fast as possible. Every variable you add increases the risk that something else breaks and extends the outage.

Match what you had. Get back online. Plan the upgrade as a separate project when you’re stable.

Bottom Line

This recovery worked. Everything came back online. But it required techniques that should never be necessary — restoring system databases as user databases, extracting password hashes from corrupted catalog tables, manually rebuilding server security from scratch.

All of it is preventable with basics that cost almost nothing:

  • Native SQL Server backups on a schedule
  • Transaction log backups for point-in-time recovery
  • System database backups
  • DBCC CHECKDB on a schedule
  • Tested restores
  • Scripted login exports

Dealing with backups is a lot like being a pilot. It’s a lonely job until an engine starts to misfire. Then everybody is suddenly very interested in what you do.

Don’t wait for the engine to misfire.

Speak with a SQL Expert

In just 30 minutes, we will show you how we can eliminate your SQL Server headaches and provide 
operational peace of mind

Article by
Mark Varnas
Founder | CEO | SQL Veteran
Hey, I'm Mark, one of the guys behind Red9. I make a living performance tuning SQL Servers and making them more stable.

Discover More

SQL Server Health Check SQL Server Migrations & Upgrades SQL Server Performance Tuning SQL Server Security SQL Server Tips

Discover what clients are saying about Red9

Red9 has incredible expertise both in SQL migration and performance tuning.

The biggest benefit has been performance gains and tuning associated with migrating to AWS and a newer version of SQL Server with Always On clustering. Red9 was integral to this process. The deep knowledge of MSSQL and combined experience of Red9 have been a huge asset during a difficult migration. Red9 found inefficient indexes and performance bottlenecks that improved latency by over 400%.

Rich Staats 5 stars
Rich Staats
Cloud Engineer
MetalToad

Always willing to go an extra mile

Working with Red9 DBAs has been a pleasure. They are great team players and have an expert knowledge of SQL Server database administration. And are always willing to go the extra mile to get the project done.
5 stars
Evelyn A.
Sr. Database Administrator

Boosts server health and efficiency for enhanced customer satisfaction

Since adding Red9 to the reporting and DataWarehousing team, Red9 has done a good job coming up to speed on our environments and helping ensure we continue to meet our customer's needs. Red9 has taken ownership of our servers ensuring they remain healthy by monitoring and tuning inefficient queries.
5 stars
Andrew F.
Datawarehousing Manager
See more testimonials