SQL Monitoring Software & Tools
- You don’t do monitoring? Wait… What? Well, that won’t work for me. Moving on.
- 3rd party SQL Software tools:
Red-Gate – I have heard good things. Never used it in a real world situation, so I can’t tell you much.
Idera SQL Diagnostics Manager – this isn’t the best tool for monitoring. I’ve run into three DBAs that say “its great”. Every time, I thought I was missing something, but sadly in all cases those DBAs where just using defaults, and have not used monitoring to the level I am describing here. And that’s just not good enough (sloppy, in my book). After countless calls with Idera Support, Idera account managers, and submitting bugs – I just gave up on it. Oh, and Idera thought the stuff I was asking for was so good, they added those items to their future road map. Yeah… I was just getting started.
Dell FogLight – tried using this too. This was about 5 years ago. It was so complex, two DBAs attempted to set it up, and we never got past monitoring free disk space. Maybe it’s good, but any software these days that requires 2 weeks of consultants on-site is just not going to work for me. So yeah, moving on.
- Solarwinds – probably has some tools that do this, but I am not aware of it. It’s not a DPA (Database Performance Analyzer).
- Splunk – this has a lot of promising features. I dabbled in it, but it seems too complex for what I need.
- SQL Profiler – you can probably achieve a lot with it, but not 100%. We are also talking an extra load on the server, and this is not acceptable.
- SQL Extended Events – you can probably do more with it than with SQL Profiler, but since I need to monitor all the way down to SQL 2005 or SQL2008 – that won’t work. Maybe in a few years, when old SQL versions are extinct.
- Perfmon – It’s good, but not for my purpose here.
- SSMS Activity Monitor – nope, won’t work at all for this.
- PAL tool – This is a great tool, but not for on-going monitoring purposes.
- Wait Stats – Not a good tool for monitoring. Not enough data, and the data is too general.
- Plan cache – well, now we’re getting too far off course. It’s really good for performance monitoring, but can’t get me everything.
- Build your own SQL/PowerShell scripts – I am not a big fan of doing this. But, since SQL software vendors suck at this, I have had no choice but to roll my own.
Miscellaneous SQL Monitoring Tools
- The goal is never to wake-up the DBA.
- Do not alert on anything that is not actionable.
- Non-DBA alerts should not come to DBAs (same as #1, just in different words). You need some of these points, because you want to create a written policy document. So, when some whacko manager tries to point a bunch of cisco alerts at DBAs, thinking everyone should know, you can point to the policy and say “Nope, no way, my friend”.
- Each alert triggering during non-business hours needs to be addressed the next day. The main question we are trying to answer: what can be done so this never happens again? Can we add some code to self-heal the issue? If not, can we write a Standard Operating Procedure so 24×7 level 1 support can deal with this, and DBAs won’t have to be woken up? (I have declared war on waking up DBAs!)
- You will need an SQL Server list, first. My list contains a bunch of columns. But the most important pieces of data are server names. The second most important item is how important the server is. In smaller environments (under 20-50 servers), DBAs tend to keep server lists in their heads. With larger environments, you need to have a written list. I build server lists regardless of how many servers there are. Just do the same.
- Alert messages should be as succinct as possible. They need to be worded correctly, so there is no question about what happened and how to fix it. Every second matter. Each character matters (yes, each character). Why? Because when the issue is critical, I can’t waste time. At 4 AM, I want to read less, not more. And I want to be able to tell from the subject line how critical the issue is, and if I should be getting up and looking for my slippers. If I must interrupt someone’s dinner, I want to be as brief as possible and communicate the problem quickly.
- Each alert should become a ticket. I hate documentation just like you do. However, this time, I can use that for doing good. Often, it’s hard to see a forest because I am looking at a tree from 5 inches away. When I know a reoccurring problem takes X hours per week to fix, it becomes easier to say to my manager, “This month we wasted 18 hours of DBA time on this, and 314 hours this year. Therefore, dedicating a few days of DBA time to write a kickass PowerShell script is worth the investment”.
- When a script cannot be written to fix the triggering issue, then the next level of defense is a Standard Operating Procedure (SOP). I want a SOP so I can give it to level 1 support and have them fix the problem. Many companies have 24×7 support staff anyway, so if we educate them on how an issue can be fixed, we avoid waking DBAs! So many companies don’t use level 1 support correctly – if each problem contains a SOP for how it should be fixed, we can actually make those people do some valuable work, instead of sleeping through the night or playing games. And in my experience with level 1 support guys, their goal in life, 90% of the time, is it to get the hell out from doing support. So, they actually appreciate a well-written SOP and learning a bit about SQL.
- I will need two DBA emails for this:
- DBA_Emergency – only important stuff is sent here. Alerts that come into this email box mean “wake someone up at 4 AM. It’s worth it.”
- DBA_InfoOnly – this email is for info “I’d like to know, but it’s not important to interrupt my dinner. I will check it out tomorrow, when I get it.”
A couple of caution points
The more people you involve in deciding what needs to be monitored, the worse it will get. Managers are scared to “miss something”. And that’s complete BS. The method that has worked for me is to ask my manager, “Um, would you like for me to handle SQL monitoring?” He will say “Yes, of course”. And I will take it from there.
Don’t ask for anything else. If you do, your manager will want every alert under the sun and will gladly volunteer your time off-hours. Just don’t provide that option.
One simple caveat. You have to do a superb job with SQL monitoring. You now own it, buddy. If you do that – there will be no issues. The manager won’t care some alert triggered and you didn’t do anything until the next day. If the manager is not in trouble, you won’t be either. But it better be an alert that can be ignored.
It’s when you mess up and start having meetings about alerting; that’s when stuff gets out of hand.
Now that I know what I need to pull off, let’s see where that data lives.
- Server stats (like CPU & RAM usage)
- Event viewer logs
- SQL Error log
- SQLAgent failed jobs (or msdb)
- SQL DMVs
This means I will need to use a combination of SQL and PowerShell scripts.
I/we have bunch of these written, and are being used on many SQL Servers already. The next step is to clean the code up and write some more, so I will not have to turn red when you find something embarrassing in it. We will see how that works out.
I hope you see the approach we will take if we monitor your SQL Server.