Service King reduces SQL Server outages from 90 per quarter down to zero

Challenges

  • Unstable SQL Server
  • Slowness
  • Lack of experience

Solution

Benefits

  • Stable SQL Server
  • Significantly faster SQL processing
  • Satisfactory SLAs
  • Faster key business processing
  • More SQL Server capacity
  • Savings in hardware upgrade

Background

Service King Collision is one of the largest and most respected automotive repair shops in the US.  They have been in business for over 40 years starting with a single location, and have now grown to 345 locations across 24 states.  

When Service King approached Red9, they had opened 68 locations and were looking to expand further. 

A couple of major technical issues were holding them back – daily SQL server outages and severe performance issues. Their SQL Server would fail on a nearly daily basis causing 30-45 minute outages (sometimes multiple times per day). During these down periods, agents across all of their locations would be locked out from checking in new customers or executing any of their other administrative tasks. 

The lost productivity was not only hurting their bottom line, it was inconveniencing their customers and preventing them from expanding further. On top of that, performance issues were causing every action taken in their systems to move at a snail’s pace, further adding to employee and customer frustration. 

Service King’s leadership team was worried that these outages and performance issues would hold them back from continuing to acquire new locations and expand their business if left unresolved. At the time, they believed their SQL Server had hit capacity and were considering making a $1m+ hardware purchase. 

Red9 was able to identify and resolve the core issues causing the downtime and slowness without an additional hardware purchase and ultimately saved Service King Collision hundreds of thousands of dollars in the process.

Problem

There were many.

Unscheduled Windows Cluster failovers in the middle of business hours. Cluster failovers were taking 30-45 minutes. Only 25% of Windows Cluster nodes are functional. Out of 4 node Windows cluster, only one was capable of running production workload. SAN storage array was misconfigured and working at 40MB/s (or USB2 speeds). SQL Backups could not be taken because it was taking longer than 24h run a full backup. SQL Filestreaming feature implemented but outgrown manageability. SQL Corruption in multiple databases. SQL Server was not configured properly, leaving significant gaps in maintenance, monitoring and alerting, and feature usage. Massive security problems, which accidentally gave admin-level access to everyone in the company. VMWare not configured properly causing further stability and performance issues. Severe SQL Server performance problems

At Red9 we always start with our proprietary 145 point SQL Server Health Check which we used to uncover a variety of technical issues.

Solution

Unlike clients initial instinct to buy more hardware, we fixed many problems: Windows Failover Cluster misconfiguration, networking configuration changes, SAN storage speed problems were addressed, db backing up and restore because of time and size, SQL data corruption in one of the key databases, which Microsoft Support could not even fix, third-party software corrupting databases, SQL Server settings, misconfiguration, not utilizing all SQL Server features available, lots of SQL performance problems, VMWare host and guess best practices, lack of monitoring, alerting, and proactive care.

Results

  • Unscheduled Windows Cluster Failovers from about 90 per quarter down to zero. 
  • Full backups were taking 30h+. We were able to cut backup duration by 2x and set up a proper backup schedule for fulls, differentials, and transactional backups.
  • SAN storage speeds were increased 11-15x. Storage was further improved by implementing: created multiple data file groups. We then handpicked the most used tables and indexes to new filegroups. And then layers those on three SAN speed layers: fast, medium, and slowest. The most frequently accessed data was put on the fastest layer of the SAN (SSD array). Data accessed less frequently, was put on mid-layer (10,000 RPM drives). Data rarely accessed, was put on the slowest SAN layer (7,500 RPM drives). Created a process to migrate data that was no longer needed out of production databases to keep them as small and as fast as possible
  • Data corruption was fixed manually, one corrupted db page at a time. Microsoft Support’s recommended solution would have caused too much downtime and was not practical. So red9 chose a different solution and fixed corruption manually, one page at a time.
  • Third-party software reconfigured not to interfere with SQL Server and stopped corrupting the database
  • Multiple SQL Server settings changes, making it more stable, and faster
  • Massively improved RPO (recovery point objective) and RTO (recovery time objective) times 
  • Fixed massive security issue, where one of them was causing 1000+ employees to have sysadmin rights on the server
  • Implemented multiple new SQL Server features, such as Compression, partial availability, AlwaysOn, filestream optimizations, filegroup backups, resource governor, column store indexing, partitioning
  • Old data migration out of production database into data warehousing
  • Deployed proper monitoring and alerting
  • Tuned about 200 queries. Performance gains ranged from 10x-5,000x

Testimonials

Contact Us

Schedule a call to discuss your SQL Server needs.