Service King reduces SQL Server outages from 90 per quarter down to zero

Service King

Challenges

  • Unstable SQL Server
  • Slowness
  • Lack of experience

Solution

Benefits

  • Stability
  • SQL Speed
  • Improved SLAs
  • Faster processing
  • Increased SQL Server capacity
  • Savings in hardware upgrade

Background

Service King Collision is one of the largest and most respected automotive repair shops in the US.  They have been in business for over 40 years starting with a single location, and have now grown to 345 locations across 24 states.  

When Service King approached Red9, they had opened 68 locations and were looking to expand further. 

A couple of major technical issues were holding them back – daily SQL server outages and severe performance issues. Their SQL Server would fail on a nearly daily basis causing 30-45 minute outages (sometimes multiple times per day). During these down periods, agents across all of their locations would be locked out from checking in new customers or executing any of their other administrative tasks. 

The lost productivity was not only hurting their bottom line, it was inconveniencing their customers and preventing them from expanding further. On top of that, performance issues were causing every action taken in their systems to move at a snail’s pace, further adding to employee and customer frustration. 

Service King’s leadership team was worried that these outages and performance issues would hold them back from continuing to acquire new locations and expand their business if left unresolved. At the time, they believed their SQL Server had hit capacity and were considering making a $1m+ hardware purchase. 

Red9 was able to identify and resolve the core issues causing the downtime and slowness without an additional hardware purchase and ultimately saved Service King Collision hundreds of thousands of dollars in the process.

Problem

There were many.

Unscheduled Windows Cluster failovers in the middle of business hours. Cluster failovers were taking 30-45 minutes. Only 25% of Windows Cluster nodes were functional. Out of four node Windows Cluster, only one was capable of running production workload. SAN storage array was misconfigured and working at 40MB/s (or USB2 speeds). SQL Backups could not be taken because it was taking longer than 24h to run a full backup. SQL Filestreaming feature was being used but outgrown manageability. There was SQL corruption in multiple databases. SQL Server was not configured properly, leaving significant gaps in maintenance, monitoring and alerting, and feature usage. Massive security problems, which accidentally gave admin-level access to everyone in the company. VMWare was not configured properly causing further stability and performance issues. Severe SQL Server performance problems

At Red9 we always start with our proprietary 145 point SQL Server Health Check which we used to uncover a variety of technical issues.

Solution

Unlike clients initial instinct to buy more hardware, we fixed many problems: Windows Failover Cluster misconfiguration, networking configuration changes, SAN storage speed problems, backups and restores, SQL data corruption in one of the key databases (which Microsoft Support could not even fix), third-party software corrupting databases, tuned SQL Server with best practices, fixed misconfiguration, started utilizing all SQL Server features available, fixed numerous SQL performance problems, configured VMWare host and VMWare guess for best practices, addressed lack of monitoring, alerting, and proactive care.

Results

  • Unscheduled Windows Cluster Failovers from about 90 per quarter down to zero. 
  • Full backups were taking 30h+. We were able to cut backup duration by 2x and set up a proper backup schedule for fulls, differentials, and transactional backups.
  • SAN storage speeds were increased 11-15x. Storage was further improved. We  created multiple data file groups. We then handpicked the most used tables and indexes and placed those on the new filegroups, and we placed those files groups on three SAN speed layers: fast, medium, and slowest. The most frequently accessed data was layered on the fastest layer of the SAN (SSD array). Data accessed with medium frequency was placed on mid-layer (10,000 RPM drives). Data rarely accessed, was put on the slowest SAN layer (7,500 RPM drives). We then created an automated process to migrate data that was no longer needed out of production databases to keep db files as small and as fast as possible.
  • We fixed data corruption – one corrupted db page at a time. Microsoft Support’s recommended solution would have caused too much downtime and was not practical. Luckily corruption was all in filestream data.
  • We reconfigured third-party antivirus software which was the culprit of corruption.
  • Multiple SQL Server settings changed, making SQL more stable, and faster.
  • Massively improved RPO (recovery point objective) and RTO (recovery time objective) times 
  • Fixed massive security issue, where one of them was causing 1,000+ employees to have sysadmin rights on the server.
  • Implemented multiple new SQL Server features, such as Compression, partial availability, AlwaysOn, filestream optimizations, filegroup backups, resource governor, column store indexing, partitioning
  • Old data was migrated out of production database into data warehouse.
  • Deployed proper monitoring and alerting.
  • We used SQL performance tuning and improved the speed of about 200 most frequently run and most resources intensive TSQL queries. SQL performance improvements ranged from 10x-5,000x.

Testimonials

Contact Us

Schedule a call to discuss your SQL Server needs.