SQL Server Monitoring

How to do this properly: requirements, goals, software tools, alternatives, processes, and caution points.

Why is SQL Server Monitoring important? For me as a DBA, there are several reasons:

  1. Some servers are simply important. I just can’t have anything happen to them and not know. Why? For starters, I don’t want to look like an idiot when someone asks me what is going on with the server. I want to be the first person to find out, not the other way around. And I strive to be good DBA.
  2. Proactive monitoring catches issues before they become problems at 3AM on a Saturday night while I am holding a beer.
  3. I want to avoid after-hours work as much as possible. Good monitoring lets me “migrate” issues from sleeping hours into business hours.

SQL Monitoring – Questions It Needs To Answer

Monitoring is a wide topic. It begins with “is my server up?” to “how is query X performing now, versus a week ago?” Therefore, it helps to break this subject down. Here is how I do it:

  1. Is stuff down or about to go down? In other words, are the main SQL Server resources online and working?
  2. Has anything significant changed? Or, has the VMware admin stolen a couple CPU cores away or changed RAM size?
  3. How is performance? Only when I have the first two issues handled will I deal with performance. Performance doesn’t matter if SQL is down, right?
  4. The next goal is to help with trending and capacity planning. This is a product of the first three. You (almost) don’t need to do anything specific here, because if you set up the first three properly, you will be logging data. Then, it becomes easy to look into the past and predict the future. It is simple to look at the table that had 5 billion rows last year and now has 7, and predict how much more storage will be needed. You will look like a DBA hero when you can tell your manager “We will run out of disk space next October”.

SQL Server Monitoring – What It Needs To Achieve

  1. Catch the issue – first I want to be able to catch the problem.
  2. Self-heal – I don’t want notifications. I want to be able to run scripts after I know something is wrong.
  3. Alert –notify (and eliminate useless junk alerts) if self-healing is not possible or we just haven’t built it yet.
  4. Log – log to SQL tables when possible. I will be using this data later.
  5. Agentless – I want to deploy code in a form of stored procedures, SQLAgent jobs, and PowerShell scripts. This solves a couple of important problems: it means on most servers, I can probably sneak this code in (believe me, I test the crap out of it before I do that on PROD). It means I can probably avoid creating a hundred JIRA tickets and avoid bureaucratic processes and red tape, when it makes sense. I would not do it on PROD servers, but for DEV and QA environments, it most likely can be done safely.
  6. Alerting solution should work on all SQL versions (2005 and up).
  7. If all of the above is done right, there will be no limit on how many SQL Servers can be monitored. I should be able to use the same monitoring on environments with a single SQL Server or 500.
  8. It will have minimal or no impact on the SQL Server Performance that is being monitored.

A Couple of Caution Points

The more people you involve in deciding what needs to be monitored, the worse it will get. Managers are scared to “miss something”. And that’s complete BS. The method that has worked for me is to ask my manager, “Um, would you like for me to handle SQL monitoring?” He will say “Yes, of course”. And I will take it from there.

Don’t ask for anything else. If you do, your manager will want every alert under the sun and will gladly volunteer your time off-hours. Just don’t provide that option.

One simple caveat. You have to do a superb job with SQL monitoring. You now own it, buddy. If you do that – there will be no issues. The manager won’t care some alert triggered and you didn’t do anything until the next day. If the manager is not in trouble, you won’t be either. But it better be an alert that can be ignored.

It’s when you mess up and start having meetings about alerting; that’s when stuff gets out of hand.

Now that I know what I need to pull off, let’s see where that data lives.

  1. Server stats (like CPU & RAM usage)
  2. Event viewer logs
  3. SQL Error log
  4. SQLAgent failed jobs (or msdb)
  5. SQL DMVs

This means I will need to use a combination of SQL and PowerShell scripts.

I/we have bunch of these written, and are being used on many SQL Servers already. The next step is to clean the code up and write some more, so I will not have to turn red when you find something embarrassing in it. We will see how that works out.

I hope you see the approach we will take if we monitor your SQL Server.