How to do this properly: requirements, goals, software tools, alternatives, processes, and caution points.
Why is SQL Server Monitoring important? For me as a DBA, there are several reasons:
- Some servers are simply important. I just can’t have anything happen to them and not know. Why? For starters, I don’t want to look like an idiot when someone asks me what is going on with the server. I want to be the first person to find out, not the other way around. And I strive to be good DBA.
- Proactive monitoring catches issues before they become problems at 3AM on a Saturday night while I am holding a beer.
- I want to avoid after-hours work as much as possible. Good monitoring lets me “migrate” issues from sleeping hours into business hours.
What questions does SQL monitoring need to answer?
Monitoring is a wide topic. It begins with “is my server up?” to “how is query X performing now, versus a week ago?” Therefore, it helps to break this subject down. Here is how I do it:
- Is stuff down or about to go down? In other words, are the main SQL Server resources online and working?
- Has anything significant changed? Or, has the VMware admin stolen a couple CPU cores away or changed RAM size?
- How is performance? Only when I have the first two issues handled will I deal with performance. Performance doesn’t matter if SQL is down, right?
- The next goal is to help with trending and capacity planning. This is a product of the first three. You (almost) don’t need to do anything specific here, because if you set up the first three properly, you will be logging data. Then, it becomes easy to look into the past and predict the future. It is simple to look at the table that had 5 billion rows last year and now has 7, and predict how much more storage will be needed. You will look like a DBA hero when you can tell your manager “We will run out of disk space next October”.
What does monitoring need to achieve?
- Catch the issue – first I want to be able to catch the problem.
- Self-heal – I don’t want notifications. I want to be able to run scripts after I know something is wrong.
- Alert –notify (and eliminate useless junk alerts) if self-healing is not possible or we just haven’t built it yet.
- Log – log to SQL tables when possible. I will be using this data later.
- Agentless – I want to deploy code in a form of stored procedures, SQLAgent jobs, and PowerShell scripts. This solves a couple of important problems: it means on most servers, I can probably sneak this code in (believe me, I test the crap out of it before I do that on PROD). It means I can probably avoid creating a hundred JIRA tickets and avoid bureaucratic processes and red tape, when it makes sense. I would not do it on PROD servers, but for DEV and QA environments, it most likely can be done safely.
- Alerting solution should work on all SQL versions (2005 and up).
- If all of the above is done right, there will be no limit on how many SQL Servers can be monitored. I should be able to use the same monitoring on environments with a single SQL Server or 500.
- It will have minimal or no impact on the SQL Server Performance that is being monitored.
SQL Monitoring Software, Tools, Alternatives and Other Options
- Don’t do monitoring. Wait… What? Well, that won’t work for me. Moving on.
- 3rd party tools:
Red-Gate – I have heard good things. Never used it in a real world situation, so I can’t tell you much.
Idera SQL Diagnostics Manager – this is a crappy tool for monitoring. I’ve run into three DBAs that say “its great”. Every time, I thought I was missing something, but sadly in all cases those DBAs were just using defaults, and have not used monitoring to the level I am describing here. And that’s just not good enough (sloppy, in my book). After countless calls with Idera Support, Idera account managers, and submitting bugs – I just gave up on it. Oh, and Idera thought the stuff I was asking for was so good, they added those items to their future road map. Yeah… I was just getting started.
Dell FogLight – tried using this too. This was about 5 years ago. It was so complex, two DBAs attempted to set it up, and we never got past monitoring free disk space. Maybe it’s good, but any software these days that requires 2 weeks of consultants on-site is just not going to work for me. So yeah, moving on.
- Solarwinds – probably has some tools that do this, but I am not aware of it. It’s not a DPA (Database Performance Analyzer).
- Splunk – this has a lot of promising features. I dabbled in it, but it seems too complex for what I need.
- SQL Profiler – you can probably achieve a lot with it, but not 100%. We are also talking an extra load on the server, and this is not acceptable.
- SQL Extended Events – you can probably do more with it than with SQL Profiler, but since I need to monitor all the way down to SQL 2005 or SQL2008 – that won’t work. Maybe in a few years, when old SQL versions are extinct.
- Perfmon – It’s good, but not for my purpose here.
- SSMS Activity Monitor – nope, won’t work at all for this.
- PAL tool – This is a great tool, but not for on-going monitoring purposes.
- Wait Stats – Not a good tool for monitoring. Not enough data, and the data is too general.
- Plan cache – well, now we’re getting too far off course. It’s really good for performance monitoring, but can’t get me everything.
- Build your own SQL/PowerShell scripts – I am not a big fan of doing this. But, since SQL software vendors suck at this, I have had no choice but to roll my own.
Here are couple of other miscellaneous SQL monitoring requirements, and overall process-related items that don’t fit anywhere else:
- The goal is never to wake-up the DBA.
- Do not alert on anything that is not actionable.
- Non-DBA alerts should not come to DBAs (same as #1, just in different words). You need some of these points, because you want to create a written policy document. So, when some whacko manager tries to point a bunch of cisco alerts at DBAs, thinking everyone should know, you can point to the policy and say “Nope, no way, my friend”.
- Each alert triggering during non-business hours needs to be addressed the next day. The main question we are trying to answer: what can be done so this never happens again? Can we add some code to self-heal the issue? If not, can we write a Standard Operating Procedure so 24×7 level 1 support can deal with this, and DBAs won’t have to be woken up? (I have declared war on waking up DBAs!)
- You will need an SQL Server list, first. My list contains a bunch of columns. But the most important pieces of data are server names. The second most important item is how important the server is. In smaller environments (under 20-50 servers), DBAs tend to keep server lists in their heads. With larger environments, you need to have a written list. I build server lists regardless of how many servers there are. Just do the same.
- Alert messages should be as succinct as possible. They need to be worded correctly, so there is no question about what happened and how to fix it. Every second matter. Each character matters (yes, each character). Why? Because when the issue is critical, I can’t waste time. At 4 AM, I want to read less, not more. And I want to be able to tell from the subject line how critical the issue is, and if I should be getting up and looking for my slippers. If I must interrupt someone’s dinner, I want to be as brief as possible and communicate the problem quickly.
- Each alert should become a ticket. I hate documentation just like you do. However, this time, I can use that for doing good. Often, it’s hard to see a forest because I am looking at a tree from 5 inches away. When I know a reoccurring problem takes X hours per week to fix, it becomes easier to say to my manager, “This month we wasted 18 hours of DBA time on this, and 314 hours this year. Therefore, dedicating a few days of DBA time to write a kickass PowerShell script is worth the investment”.
- When a script cannot be written to fix the triggering issue, then the next level of defense is a Standard Operating Procedure (SOP). I want a SOP so I can give it to level 1 support and have them fix the problem. Many companies have 24×7 support staff anyway, so if we educate them on how an issue can be fixed, we avoid waking DBAs! So many companies don’t use level 1 support correctly – if each problem contains a SOP for how it should be fixed, we can actually make those people do some valuable work, instead of sleeping through the night or playing games. And in my experience with level 1 support guys, their goal in life, 90% of the time, is it to get the hell out from doing support. So, they actually appreciate a well-written SOP and learning a bit about SQL.
- I will need two DBA emails for this:
- DBA_Emergency – only important stuff is sent here. Alerts that come into this email box mean “wake someone up at 4 AM. It’s worth it.”
- DBA_InfoOnly – this email is for info “I’d like to know, but it’s not important to interrupt my dinner. I will check it out tomorrow, when I get it.”
A couple of caution points
The more people you involve in deciding what needs to be monitored, the worse it will get. Managers are scared to “miss something”. And that’s complete BS. The method that has worked for me is to ask my manager, “Um, would you like for me to handle SQL monitoring?” He will say “Yes, of course”. And I will take it from there.
Don’t ask for anything else. If you do, your manager will want every alert under the sun and will gladly volunteer your time off-hours. Just don’t provide that option.
One simple caveat. You have to do a superb job with SQL monitoring. You now own it, buddy. If you do that – there will be no issues. The manager won’t care some alert triggered and you didn’t do anything until the next day. If the manager is not in trouble, you won’t be either. But it better be an alert that can be ignored.
It’s when you mess up and start having meetings about alerting; that’s when stuff gets out of hand.
Now that I know what I need to pull off, let’s see where that data lives.
- Server stats (like CPU & RAM usage)
- Event viewer logs
- SQL Error log
- SQLAgent failed jobs (or msdb)
- SQL DMVs
This means I will need to use a combination of SQL and PowerShell scripts.
I/we have bunch of these written, and are being used on many SQL Servers already. The next step is to clean the code up and write some more, so I will not have to turn red when you find something embarrassing in it. We will see how that works out.
I hope you see the approach we will take if we monitor your SQL Server.
If you have questions, please reach out using the “Contact Us” form or call us at 1-800-984-5704.