Service King*

Service King reduces SQL Server outages from 90 per quarter down to zero.

*Now Crash Champions

Challenges

Unstable SQL Server
Slowness
Lack of experience

Solutions

110-Point SQL Server Health Check
SQL Server Performance Tuning

Benefits

Stability
SQL Speed
Improved SLAs
Faster processing
Increased SQL Server capacity
Savings in hardware upgrade

Background

Service King Collision (now Crash Champions Collision Repair) is one of the largest and most respected automotive repair shops in the US. They have been in business for over 40 years, starting with a single location, and have now grown to 345 locations across 24 states.

When Service King approached Red9, they had opened 68 locations and were looking to expand further.

A couple of major technical issues were holding them back: daily SQL Server outages and severe performance issues. Their SQL Server would fail on a nearly daily basis, causing 30-45 minute outages (sometimes multiple times per day). During these down periods, agents across all of their locations would be locked out from checking in new customers or executing any of their other administrative tasks.

The lost productivity was not only hurting their bottom line, but it was also inconveniencing their customers and preventing them from expanding further. On top of that, performance issues were causing every action taken in their systems to move at a snail’s pace, further adding to employee and customer frustration.

Service King’s leadership team was worried that these outages and performance issues would hold them back from continuing to acquire new locations and expand their business if left unresolved.

At the time, Service King believed their SQL Server had hit capacity and was considering a $1M+ hardware purchase.

Red9 was able to identify and resolve the core issues causing the downtime and slowness without an additional hardware purchase and ultimately saved Service King Collision hundreds of thousands of dollars in the process.

Problem

There were many:

Unscheduled Windows Cluster Failovers:
- Occurred during business hours.
- Each failover lasted 30-45 minutes.
- Only 25% of cluster nodes were functional; out of four nodes, only one could handle the production workload.
The SAN storage array was misconfigured and running at 40 MB/s (roughly USB 2.0 speeds).
SQL Server Challenges:
- SQL backups could not be taken because full backups took over 24 hours to complete.
- The SQL Server FILESTREAM feature was being used but had outgrown manageability.
- Multiple databases had SQL corruption.
Configuration Issues:
- Significant gaps in maintenance, monitoring, alerting, and feature usage.
- Severe performance problems.
Massive Security and Configuration Problems:
- VMware was improperly configured, causing further stability and performance issues.
- Admin-level access was accidentally granted to all employees.
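
To put the 40 MB/s storage figure in perspective, a quick back-of-the-envelope calculation shows how much data that rate can move in a day. This is an illustrative sketch only: the function name is hypothetical, and it assumes sustained sequential throughput and decimal units (1 TB = 1,000,000 MB), not measurements from this engagement.

```python
SECONDS_PER_HOUR = 3600


def data_moved_tb(throughput_mb_per_s: float, hours: float) -> float:
    """Return terabytes moved at a sustained rate (decimal units, assumed)."""
    megabytes = throughput_mb_per_s * hours * SECONDS_PER_HOUR
    return megabytes / 1_000_000


if __name__ == "__main__":
    # 40 MB/s sustained for a full day
    print(f"{data_moved_tb(40, 24):.2f} TB in 24 hours at 40 MB/s")
```

Under those assumptions, the storage could move only about 3.5 TB in 24 hours, so a full backup of anything larger could not finish within a day, even before accounting for other overhead.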

At Red9, we always start with our proprietary 110-Point SQL Server Health Check, which we use to uncover a variety of technical issues.

Solution

Rather than following the client’s initial instinct to buy more hardware, we resolved the underlying issues.

We fixed the Windows Failover Cluster misconfiguration, made networking configuration changes, improved SAN storage speed, and addressed backup and restore issues.

Additionally, we corrected SQL data corruption in a key database (which even Microsoft Support couldn’t fix), resolved database corruption caused by third-party software, and tuned SQL Server with best practices.

Finally, we configured the VMware host and guest for best practices and implemented monitoring, alerting, and proactive care.

Results

We’ve achieved remarkable improvements in system performance and reliability.

Here are some highlights:

We reduced unscheduled Windows Cluster failovers from about 90 per quarter to zero.
Full backups were taking 30h+. We cut backup duration in half and set up a proper backup schedule for full, differential, and transaction log backups.
Enhanced SAN storage speeds by 11-15 times and optimized data storage through strategic file grouping and layering:
- Created multiple data file groups.
- Handpicked the most used tables and indexes, placing them on the new file groups.
- Organized these file groups across three SAN speed layers: fast (SSD array), medium (10,000 RPM drives), and slowest (7,500 RPM drives).
- Automated the process of migrating unneeded data out of production databases to keep DB files small and fast.
We fixed data corruption – one corrupted DB page at a time. Microsoft Support’s recommended solution would have caused too much downtime and was not practical. Luckily, corruption was all in FILESTREAM data.
We reconfigured the third-party antivirus software, which was the cause of the corruption.
Multiple SQL Server settings were changed, making SQL more stable and faster.
Massively improved RPO (recovery point objective) and RTO (recovery time objective).
Fixed massive security issues, one of which gave 1,000+ employees sysadmin rights on the server.
Implemented multiple new SQL Server features, such as data compression, partial availability, Always On, FILESTREAM optimizations, filegroup backups, Resource Governor, columnstore indexing, and partitioning.
Old data was migrated out of the production database into the data warehouse.
Deployed proper monitoring and alerting.
We applied SQL performance tuning to about 200 of the most frequently run and most resource-intensive T-SQL queries, improving their speed.

SQL performance improvements ranged from 10x to 5,000x!
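
The backup schedule mentioned in the results combines full, differential, and transaction log backups. As a rough sketch of how those three types fit together at restore time, the Python below picks the backups needed to reach the most recent point: the latest full, the latest differential taken after it, and every log backup after that. The `Backup` type, its field names, and the sample timestamps are hypothetical and for illustration only; they are not Red9 tooling or Service King's actual schedule, and the timestamp comparison is a simplification of how SQL Server tracks backup chains.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    kind: str           # "full", "diff", or "log" (hypothetical labels)
    finished: datetime  # when the backup completed


def restore_chain(backups: List[Backup]) -> List[Backup]:
    """Return the backups to restore, in order, to reach the latest point."""
    ordered = sorted(backups, key=lambda b: b.finished)
    fulls = [b for b in ordered if b.kind == "full"]
    if not fulls:
        raise ValueError("a restore needs at least one full backup")
    base = fulls[-1]
    chain = [base]
    start = base.finished
    # Only the newest differential after the chosen full is needed.
    diffs = [b for b in ordered if b.kind == "diff" and b.finished > start]
    if diffs:
        chain.append(diffs[-1])
        start = diffs[-1].finished
    # Every log backup after that point is replayed in order.
    chain += [b for b in ordered if b.kind == "log" and b.finished > start]
    return chain


if __name__ == "__main__":
    sample = [
        Backup("full", datetime(2024, 1, 7, 1, 0)),
        Backup("log", datetime(2024, 1, 8, 0, 30)),
        Backup("diff", datetime(2024, 1, 8, 1, 0)),
        Backup("log", datetime(2024, 1, 8, 1, 15)),
        Backup("log", datetime(2024, 1, 8, 1, 30)),
    ]
    for b in restore_chain(sample):
        print(b.kind, b.finished.isoformat())
```

With a schedule like this, the worst-case data loss (RPO) is bounded by the interval between log backups rather than by the time since the last full backup, and restore time (RTO) benefits because a single differential replaces many older log backups.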
