SQL Server Tips

A Client Lost 1.5 Months of Data and It Wasn’t a Cyberattack

Updated
7 min read
Written by
Mark Varnas
Reviewed by
Saulius Baskevicius
TLDR: A client panicked when their database wouldn’t come online after a physical server move. They forced it into emergency mode, overwrote the MDF file, and discovered their backup chain was broken. Six weeks of production data, gone. Every step of the disaster was preventable.

What Happened

The client moved a physical SQL Server to a new location. When they brought it back up, the database was stuck in RESTORING/recovery mode.

This is normal. SQL Server replays transaction logs during recovery to bring the database to a consistent state. The correct response is to let it finish. Depending on the volume of uncommitted transactions, this can take a while — and in the absence of corruption or storage issues, it will complete on its own.

They didn’t wait. Instead, they:

  1. Forced the database into emergency mode
  2. Purchased third-party recovery tools
  3. Accidentally overwrote the original MDF file with a recovery attempt

At that point, the original data file was gone.

Backup Chain Was Already Broken

When we got the call, they had 12TB of backup files on hand. Plenty of transaction log backups. But no usable recent full backup.

Why? Two decisions made weeks earlier:

  1. They stopped running full backups because of “performance issues”
  2. Their cleanup job deleted anything older than two weeks — including the last full backup

Transaction log backups contain all log records generated since the previous log backup. They only work when you can restore them on top of a full backup baseline. Without that baseline, 12TB of log files is just disk space.

They eventually found an older full backup from November buried in a staging environment refresh. That got them back online — minus six weeks of production data.

The Panic Problem

The database would have come back on its own. SQL Server’s crash recovery process is designed exactly for this scenario — unclean shutdowns, power losses, server moves without a clean stop. The engine replays the write-ahead log, rolls back incomplete transactions, and brings the database to a consistent state.

Setting a database to EMERGENCY mode changes its state and bypasses normal recovery behavior, and should only be used in corruption scenarios. It’s a last resort for corruption scenarios, not an impatience shortcut. And once recovery tools overwrote the MDF, the original data was unrecoverable.

Lesson: when a database is in recovery, let it recover. Monitor the error log for progress. Don’t touch it.

“Stop Backups for Performance” Is a Governance Failure

Pressure to disable backups because they “slow things down” is one of the most common anti-patterns we encounter.

The performance impact of a backup is temporary. The impact of not having one is permanent.

If backups are causing measurable performance issues, investigate why. Compress them. Schedule them during low-activity windows. Move them to a different storage target. Reduce full backup frequency and lean on differentials. There are solutions that don’t involve disabling the one safety net that matters.

When non-technical leadership pressures a DBA to stop backups, the risk needs to be communicated in writing. If the decision stands, document it. When something goes wrong, there should be a paper trail showing who made that call.

Your Cleanup Job Might Be Deleting Your Last Restore Point

A retention policy that deletes “anything older than 14 days” without logic to preserve the last full backup is a ticking clock. The moment full backups stop — for any reason — the countdown begins. After 14 days, the cleanup job removes the last viable restore point and the entire backup chain collapses.

At minimum, cleanup logic should preserve at least one valid full backup and its associated differential and log chain required for restore. Date-based retention alone isn’t enough.

Schrödinger’s Backups

Backups exist in an uncertain state: they’re both valid and corrupt until you actually restore one. One DBA we spoke with described a company that kept tape cartridges in a cabinet bolted to a wall — on the other side of which sat a rack of industrial magnets. Those backups were worthless for years before anyone checked.

Testing restores is how you collapse that uncertainty.

For non-DBAs wondering what “test your restores” looks like in practice: take your most recent backups and restore them to a separate server on a regular schedule. Validate that the data is intact and free of corruption. The backup itself is just a file. Proving you can turn that file back into a working database is the part that matters.

Weekly restore tests for critical systems. Monthly at minimum for everything else. Automate where possible — tools like the dbatools PowerShell module can run restore validation on a schedule — but perform manual restores periodically so the process stays familiar.

Related: SQL Server Backup Best Practices Worth Bookmarking

Layers Worth Having

Backups are the foundation. They’re also the last line of defense.

  • Always On Availability Groups give you a live copy of the database on a secondary replica. If the primary goes down, the secondary is already caught up. An AG could have significantly reduced the impact, depending on the nature of the storage failure.
  • Multiple backup targets — don’t store backups only on the same server. Copy them to separate storage, a different site, or cloud. A single point of failure for your backups is a single point of failure for your business.
  • Storage-level snapshots add another recovery layer on virtualized infrastructure, though they supplement native backups — they don’t replace them.

Takeaway

Three rules that would have prevented this entire incident:

  1. Let recovery finish. SQL Server’s crash recovery is built for exactly this scenario. Monitor the error log and wait.
  2. Monitor your backup chain. Know when your last full backup ran. Know where it is. Get alerted if one fails.
  3. Test restores regularly. The only backup you can trust is the one you’ve successfully restored.

Your backup strategy exists for the day everything goes wrong. Build it for that day.

Related: Another Backup Failure Story — This Time It Was a Bank

Speak with a SQL Expert

In just 30 minutes, we will show you how we can eliminate your SQL Server headaches and provide 
operational peace of mind

Article by
Mark Varnas
Founder | CEO | SQL Veteran
Hey, I'm Mark, one of the guys behind Red9. I make a living performance tuning SQL Servers and making them more stable.

Discover More

SQL Server Health Check SQL Server Migrations & Upgrades SQL Server Performance Tuning SQL Server Security SQL Server Tips

Discover what clients are saying about Red9

Red9 has incredible expertise both in SQL migration and performance tuning.

The biggest benefit has been performance gains and tuning associated with migrating to AWS and a newer version of SQL Server with Always On clustering. Red9 was integral to this process. The deep knowledge of MSSQL and combined experience of Red9 have been a huge asset during a difficult migration. Red9 found inefficient indexes and performance bottlenecks that improved latency by over 400%.

Rich Staats 5 stars
Rich Staats
Cloud Engineer
MetalToad

Always willing to go an extra mile

Working with Red9 DBAs has been a pleasure. They are great team players and have an expert knowledge of SQL Server database administration. And are always willing to go the extra mile to get the project done.
5 stars
Evelyn A.
Sr. Database Administrator

Boosts server health and efficiency for enhanced customer satisfaction

Since adding Red9 to the reporting and DataWarehousing team, Red9 has done a good job coming up to speed on our environments and helping ensure we continue to meet our customer's needs. Red9 has taken ownership of our servers ensuring they remain healthy by monitoring and tuning inefficient queries.
5 stars
Andrew F.
Datawarehousing Manager
See more testimonials