SQL Server Health Check – Case Study

 

Industry vertical: National Automotive Service

Client background:

2000+ employees

$1 Billion+ in revenues

300 locations nationwide

 

Problem Summary:

  1. Client is experiencing severe issues with the Microsoft Windows Failover Clustering stability. Four node Windows Cluster goes down about two-three times per day and takes about 30 minutes to recover each time. While about 2,000 employees can not work and clients can not be serviced.
  2. Severe SQL Server performance issues. Multiple queries were taking 1, 5, 10 minute or longer or timing out. Employees and customers wait… frustrated.
  3. Windows Cluster has 4 nodes. Three of them do not work.

 

After a while client realized that skill set did not exist internally to resolve stability and SQL performance problems quickly. Therefore Red9 got the call.

 

What was done:

  1. Red9 performed “Comprehensive SQL Server Health Check”. This outlined 41 issues in:
    1. Reliability (which included maintenance)
    2. Performance
    3. Security
    4. Comprehensive methodology was applied to look at various levels Cluster and SQL depended on: CPU, RAM, networking, BIOS, VMWare setup, Windows Operating System settings, Windows Group Policy, Active Directory, multiple error logs, storage, networking and few more areas. It may seem like an overkill to look at all the layers, but Red9 practices prove that comprehensive analysis is effective. Especially when stability and speed matter.
  2. Red9 performed “Microsoft Windows Cluster Health Check”.
    1. After checking Win Clustering best practices we found severe problems in the cluster setup. Which caused nodes going down randomly. And on top of that Cluster could not recover. Since hardware was all different all four nodes and severely underpowered, only single node could realistically handle the SQL Server workload. Therefore we took other 3 useless nodes. Temporarily – until second node was fixed that could become the second node that can handle the workload.
      1. Many cluster options were simply wrong.
      2. Storage misconfigured.
      3. Networking misconfigured.
      4. Priorities and thresholds were misconfigured.
  3. Red9 did “SAN Performance testing”. This revealed severe slowness.
    1. SAN which was just bought and cost $250K, was working at USB2 speed or 60 megabytes per second (MBps).
    2. SAN had three level of drives: fast, medium and slow. We worked with SAN/sysadmin and figured out which is which, and moved SQL Server files around by placing data that got “hit” the most on fastest drives..
  4. Red9 found a contractor which was periodically bringing SQL down.
    1. This external contractor in charge of SQL Server attempted to fix SQL performance problems by running SQL Server Profiler. By using the tool improperly (collecting few wrong events) he was bringing SQL to its knees. Incorrectly concluding that SQL was going to go down anyway, he just caught the moment, when it was about to go down and SQL did go down while he watched. After Red9 identified the problem, security permissions were tightened.

 

Situation after Red9 work was complete:

  1. SQL Server Cluster does not go down at random times and is fully stable. Scheduled failovers are now done periodically every 60-90 days.
  2. SQL Server Failover Cluster reboot time is now under 5 minutes (was 30min+).
  3. About 20-30 SQL queries with performance issues were addresses by significantly cutting resources used or run time duration required. You can see several case studies here: https://red9.com/case-studies/
  4. Internal application that accessed SQL databases in Headquarters are running a lot faster. Employees are happier. Customers don’t have to wait as long as before.

 

Both the Red9 analysis reports and the recommendations received praising feedback from the client.

We continued the work for this client for 2 years+.

 

 

 

Leave a Comment