Monitoring is an essential part of any critical SQL Server infrastructure.
How and what to monitor can be tricky.
Here are a few ways to accomplish SQL Server monitoring:
3rd-party SQL Server software tools:
- Redgate – I have heard good things. I’ve never used it in a real-world situation, so I can’t tell you much.
- Idera SQL Diagnostic Manager – Not the best. I’ve run into three DBAs who say, “It’s great.”
Every time, I thought I was missing something, but sadly, in all cases, those DBAs were just using defaults and had not used monitoring to the level I am describing here. And that’s just not good enough for me.
After countless calls with Idera Support, Idera account managers, and submitting bugs, I just gave up on this tool.
Oh, and Idera thought the stuff I was asking for was so good that they added those items to their future roadmap.
Yeah… I was just getting started.
- Dell Foglight – I tried using this, too, a few years ago. It was so complex.
Two DBAs before me attempted to set up this tool but never got past monitoring free disk space. Maybe it’s good, but any software these days that requires two weeks of consultants on-site is just not going to work for me.
- SolarWinds – Probably has some tools that do this, but I am not aware of them; what I need is not a Database Performance Analyzer (DPA).
- Splunk – This has a lot of promising features. I dabbled in it, but it seems too complex for my needs.
- SQL Server Profiler – You can probably achieve a lot with it, but not 100%. We are also talking about an extra load on the server, which is not acceptable.
- SQL Server Extended Events – You can probably do more with it than with SQL Profiler, but since I need to monitor all the way down to SQL 2005 or SQL 2008, that won’t work. Maybe in a few years, when older SQL versions are extinct.
- Performance Monitor – It’s good, but not for my purpose here.
- SSMS Activity Monitor – Nope, won’t work at all for this.
- The PAL (Performance Analysis of Logs) tool – This is a great tool, but not for ongoing monitoring purposes.
- Wait stats – Not a good tool for monitoring. Not enough data and the data is too general.
- Plan cache – Well, now we’re getting too far from the goal. It’s really good for performance monitoring, but it can’t provide everything I need.
- Build your own SQL/PowerShell performance and monitoring scripts – This is not recommended when you manage a small number of SQL Servers, especially if you are a DBA taking care of one production server.
But that’s what we chose to do. We decided to write our own monitoring tool.
Why? Because when a client has 200+ SQL Servers, you will never have the budget to buy the many software licenses needed.
Miscellaneous SQL monitoring thoughts, or how good monitoring should work
- Never wake up the DBA unless it’s absolutely necessary.
- Do not issue alerts for items that are not actionable or are not sufficiently important.
- Non-DBA alerts should not come to DBAs.
You need some of these points written down because you want a formal policy document. So, when some whacko manager tries to point a bunch of Cisco alerts at DBAs, thinking everyone should know, you can point to the policy and say, ‘Nope, no way, my friend.’
- Each alert that triggers during non-business hours needs to be addressed the next day.
The main question we are trying to answer is: What can be done so this never happens again? Can we add some code to self-heal the issue? If not, can we write a Standard Operating Procedure (SOP) so 24×7 Level 1 support can deal with it, and DBAs won’t have to be woken up? (I have declared war on waking up DBAs!)
- You will need a SQL Server list first.
My list contains many columns, but the most important piece of data is the server name. The second most important item is each server’s importance. In smaller environments (under 20-50 servers), DBAs often memorize the server list. However, in larger environments, maintaining a written list is necessary.
I compile a server list regardless of the number of servers, and I recommend doing the same; a rough sketch of such a list is shown below.
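To make that concrete, here is a minimal sketch of an inventory table; the table name, columns, and importance scale are my own assumptions for illustration, not a standard, so adapt them to whatever your environment actually tracks:

```sql
-- Hypothetical server inventory table (name, columns, and scale are assumptions)
CREATE TABLE dbo.ServerList
(
    ServerName   sysname       NOT NULL PRIMARY KEY,  -- the most important column
    Importance   tinyint       NOT NULL,              -- e.g. 1 = critical production, 5 = sandbox
    Environment  varchar(20)   NOT NULL,              -- Prod / QA / Dev
    SqlVersion   varchar(20)   NULL,                  -- e.g. '2008 R2', '2016'
    Notes        nvarchar(500) NULL
);

-- Example row
INSERT INTO dbo.ServerList (ServerName, Importance, Environment, SqlVersion, Notes)
VALUES (N'PRODSQL01', 1, 'Prod', '2016', N'Main OLTP server; wake a DBA if it is down');
```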
- Alert messages should be as succinct as possible.
They need to be worded correctly, so there’s no question about what happened and how to fix it. Every second matters. Each character matters (yes, each one).
Why? Because when the issue is critical, I can’t afford to waste time.
At 4 AM, I prefer to read less, not more. I want to discern the issue’s urgency from the subject line, determining whether it’s worth getting up and searching for my slippers.
If I must interrupt someone’s dinner, I aim to be as brief as possible and communicate the problem quickly.
- Each alert should become a ticket.
I hate documentation just like you do. However, this time, I can use it to do some good. Often, it’s hard to see the forest because I am looking at a tree from 5 inches away.
When I know a recurring problem takes X hours per week to fix, it becomes easier to say to my manager, “This month we wasted 18 hours of DBA time on this, and 314 hours this year. Therefore, dedicating a few days of DBA time to write a kickass PowerShell script is worth the investment.”
- When a script cannot be written to fix the triggering issue, then the next level of defense is a Standard Operating Procedure (SOP).
I want an SOP so I can give it to Level 1 support and have them fix the problem. Many companies have 24×7 support staff anyway, so if we educate them on how an issue can be fixed, we avoid waking DBAs!
So many companies don’t use Level 1 support correctly – if each alert comes with an SOP for how the problem should be fixed, we can have those people do some valuable work instead of sleeping through the night or playing games.
In my experience with Level 1 support guys, their goal in life, 90% of the time, is to get the hell out of doing support. So, they actually appreciate a well-written SOP and learning a bit about SQL.
- I will need two DBA emails for this:
- DBA_Emergency – only important stuff is sent here. Alerts that come into this mailbox mean “wake someone up at 4 AM; it’s worth it.”
- DBA_InfoOnly – this email is for info only: “I’d like to know, but it’s not important enough to interrupt my dinner. I will check it out tomorrow when I get in.” A sketch of how an alert could be routed to these mailboxes follows below.
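As a rough sketch of that routing, here is a Database Mail call using msdb.dbo.sp_send_dbmail; the profile name, recipient address, subject wording, and SOP number are placeholders I made up for illustration:

```sql
-- Minimal sketch: send a critical alert to the emergency mailbox with a succinct subject.
-- 'DBA Mail Profile', the recipient address, and SOP-017 are placeholders (assumptions).
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = N'DBA Mail Profile',
    @recipients   = N'DBA_Emergency@yourcompany.com',
    @subject      = N'PRODSQL01: SQL Server service down',  -- urgency readable from the subject alone
    @body         = N'Service MSSQLSERVER stopped at 03:52. See SOP-017 before escalating to a DBA.';
```

Informational alerts would go to the DBA_InfoOnly address instead, with the same subject-line discipline.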
A couple of caution points
The more people you involve in deciding what needs to be monitored, the worse it will get.
Managers are scared to “miss something”. And that’s complete BS.
The method that has worked for me involves asking my manager, “Um, would you like me to handle SQL monitoring?” He will respond with “Yes, of course,” and I’ll take it from there.
Don’t ask for anything else. If you do, your manager will want every alert imaginable and will gladly volunteer your time during off-hours. Simply don’t provide that option.
One simple caveat: you have to do a superb job with SQL monitoring. You now own it, buddy. If you do that, there will be no issues.
The manager won’t care if an alert is triggered and you don’t do anything until the next day, as long as the manager is not in trouble. You won’t be either. But it better be an alert that can be ignored.
It’s when you mess up and start having meetings about alerting that things get out of hand.
Now that I know what I need to accomplish, let’s see where that data resides:
- Server stats (like CPU & RAM usage)
- Event viewer logs
- SQL error log
- SQL Server Agent failed jobs (or msdb)
- SQL DMVs
This means I will need to use a combination of SQL and PowerShell scripts.
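To give a taste of the SQL side, here is a minimal sketch that pulls one of those data points, recent SQL Server Agent job failures, straight from msdb; the one-day lookback and the columns returned are arbitrary choices for illustration:

```sql
-- Minimal sketch: list SQL Server Agent jobs that failed in roughly the last day.
-- In msdb.dbo.sysjobhistory, run_date is an int in YYYYMMDD form,
-- run_status 0 = Failed, and step_id 0 is the job-outcome row.
SELECT  j.name AS job_name,
        h.run_date,
        h.run_time,
        h.message
FROM    msdb.dbo.sysjobs       AS j
JOIN    msdb.dbo.sysjobhistory AS h ON h.job_id = j.job_id
WHERE   h.step_id    = 0
  AND   h.run_status = 0
  AND   h.run_date  >= CAST(CONVERT(char(8), DATEADD(DAY, -1, GETDATE()), 112) AS int)
ORDER BY h.run_date DESC, h.run_time DESC;
```

Similar queries against the DMVs (for example, sys.dm_os_sys_info or sys.dm_exec_requests) cover the server-stats side, while PowerShell is the natural fit for the event viewer logs.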