Ember’s Take on Monitoring DB2 LUW
I have some very specific perspectives on monitoring DB2. In addition to regular consulting in my day job, we also provide full-service virtual DBA services, including monitoring. The monitoring we choose to do is very much under my control, and I’m constantly working on enhancements. I thought I’d blog on what I like to monitor and alert on without going too deeply into how these things can be monitored. There are excellent tools on the market, and excellent ways of scripting your own monitoring.
Categories of Monitoring
I am big these days of properly organizing what I do. Monitoring is no exception. I break monitoring into several main categories:
- Recoverability/high availability
- Diagnostic log parsing (which ties into several other categories)
- Performance alerting
- Historical performance and capacity metric collection
All of these categories have a time-based component to them. Alerts occur for a specific time or time period. Historical data reports data over a specific time period.
I work with a number of clients with a number of different approaches to monitoring. Often I find people monitoring at the application and the system level, but neglecting monitoring at the database level. I also see large holes in monitoring at times.
This is the most basic thing to monitor, but consists of several factors. There are several questions I ask in this area:
- Is the DB2 instance up?
- Is the DB2 database up?
- Am I receiving monitoring data?
Knowing if I’m receiving data is critical to detect a server down or network problem. The answers to these three questions are critical to be constantly asking. In fact, I like to ensure they’re being asked every 5-15 minutes, 24/7 and if the answer to any of them is “No”, it is something that needs to be addressed immediately.
Recoverability is often my second priority. If something bad happens, I need to be able to get the database back quickly and effectively. How quickly and with what level of data loss depends on the priority and money that each client has given to the issue. The goals around RTO and RPO play a major role here. In monitoring, I’m often looking for things that could cause my recovery plan to be derailed. The questions I ask here are:
- When was my last successful backup?
- Is HADR up and in the proper state?
- What is the HADR log gap?
- If using LOAD with HADR, are all the load copy locations correct?
- Is TSAMP or other automatic failover solution in the proper state?
Capacity can also be an availability issue. If the database does not have the most basic room to grow, then inserts and updates will fail. Preferably we are alerted long before it gets to that point so we can prevent the outage. The questions I ask here are:
- Are any DB2-used file systems at a warning level (getting full, but not dangerously so)?
- Are any DB2-used file systems at an error level (dangerously full, requiring action 24/7)?
- Are any non-autoresize table spaces at a warning level (getting full, but not dangerously so)?
- Are any non-autoresize table spaces at an error level (dangerously full, requiring action 24/7)?
- Are any table spaces approaching their maximum possible size?
- Is a high percentage of the (primary) active transaction log space used?
This is one of the areas that crosses over with diagnostic log parsing, because I also parse the diagnostic log for the various full conditions to detect if a capacity error has occurred that needs to be immediately dealt with.
Capturing historical data on the size of objects is also very useful for capacity planning and ongoing growth analysis.
Diagnostic log parsing
Parsing the DB2 diagnostic log is a critical component of monitoring to me. There are two ways to approach this.
Positive Parsing – Looking for Specific Messages
Positive parsing involves having a list of error messages that you periodically look for in the diagnostic log. The disadvantage here is that if your list is not comprehensive, you can miss major problems. However, when you do have an alert fire, you’re much more sure that it’s an important issue that must be addressed.
Negative Parsing – Looking for “Error” Messages
Negative parsing involves alerting on any message that DB2 classifies as an Error or Critical message. The problem here is that there may be many messages that DB2 classifies this way that you may not need to take any action on. Negative parsing can lead to a lot of false alerts.
This is an area where every DBA will have their own metrics that they prefer to alert on and their own values they consider warning and error conditions. I’m not sure you could find two DBAs who would agree 100% on this topic. That’s why off-the-shelf performance tools offer so many metrics and options. DBAs also may disagree on whether they want alerts 24/7, or want to deal with those alerts during business hours. These things may also vary by database supported by the same DBA team. I generally want to only get alerts on these during business hours (though I want to know what happened off-hours the next day), but there may be times for certain databases for certain metrics that I want to be alerted 24/7. Many of these will lead me to deeper investigation. Here are my critical performance alerts:
- Index Read Efficiency (Rows Read/Rows Returned) – This tells me how well indexed the database is for the workload. It is not valid in BLU databases. If this is high, it tells me that I need to spend time finding problem SQL and analyzing that SQL.
- Overall Buffer Pool Hit Ratio – In most databases, STMM will tune buffer pool size pretty well. However, a lower buffer pool hit ratio can indicate that the workload is too much for the memory. This can lead me either to look at the workload and see what I can do to work on it, or request additional memory.
- Package Cache Hit Ratio – I like a static package cache size, which means I have to pay attention to tuning the size. If the database is an OLTP database, watching the ratio is a critical part of tuning the size.
- Package Cache Overflows – I like a static package cache size, which means I have to pay attention to tuning the size. No matter the purpose of the database, I want to tune the size of the package cache to avoid overflows.
- Catalog Cache Hit Ratio – It has been a while since I’ve had to take action on this metric, but if there are issues, this can be a severe performance impact. Low ratios indicate the Catalog Cache should be increased in size.
- Catalog Cache Overflows – It has been a while since I’ve had to take action on this metric, but if there are issues, this can be a severe performance impact. Overflows indicate the Catalog Cache should be increased in size.
- Log Pages Read vs. Log Pages Written – A large number of log pages read can indicate that LOGBUFSZ is too small. This can impact performance significantly.
- Number of Log Files Archived per Hour – Too many means transaction logs may be under sized. Too few means they may be too large. Archiving requires system resources, and transaction log files that are too large increase database start time.
- Deadlocks – A deadlock here or there may be OK, but more than a few can indicate a problem. There are a few things that might work at the database level to reducce deadlocks, but most deadlock problems must be solved at the application level. Either way, knowing there’s a problem here before you get to a peak period or before code gets to production is half the battle.
- Lock Escalations – Lock escalations are really the most severe of the locking problems. A lock escalation reduces concurrency, and therefore perceived performance. Even one lock escalation is too many and should be investigated. Sometimes this problem can be addressed by increasing LOCKLIST, though an AUTOMATIC setting sure makes sense for LOCKLIST.
- Lock Timeouts – Lock Timeouts are normal and will occur in any database. But when they reach over a certain level, they can be an indicator that there is a major application or database problem. You may see these increase before you see actual deadlocks. They may also occur as a symptom of lock escalation. Even application slowness can cause them.
- Sort Overflow Percent – Sort is a four-letter word. Particularly in OLTP databases. Generally keeping an eye on sorts is good, and making sure that they are efficient when they do occur is the goal of monitoring how often they overflow. Less than 3% of sorts should overflow, preferably much less.
- Files Closed – This used to be a much bigger problem area than it usually is today. The default value used to be lower, and more files are used with SMS table spaces than DMS. With automatic storage being the default go-to these days instead of SMS, it is much less likely to run into issues with this. However, if you have a lower value, a lot of tables in SMS table spaces, a lot of table spaces, or a lower than recommended ulimit, you can still run into problems in this area.
- Server CPU Utilization I would prefer to catch a problem before it escalates to the level where it shows up in system resource utilization, but that is not always possible. It is important to keep an eye on overall CPU Utilization, whether DB2 is running on a dedicated or shared server.
There are absolutely other metrics I look at, but for me, these are the key performance indicators that tell me to dig deeper and where to start digging. There are also some critical system-level metrics in this area – CPU utilization and disk statistics – that it can be useful to alert on.
Historical performance and capacity metric collection
Metrics on everything listed above should be kept, so that when a problem is detected, context is available to define if one area really does appear to be a problem, or if those values are just normal for a particular database. For me, this covers as many metrics as I can grab and convince someone to pay to store. I love to write out values from tables like mon_get_database, mon_get_table, mon_get_index, mon_get_bufferpool, mon_get_hadr, mon_get_locks, mon_get_tablespace, mon_get_transaction_log, and even mon_get_workload on a regular basis. I also like to store periodic looks at the table row size. I run a locking event monitor (not the default one) to make sure I have information on deadlocks and lock timeouts if they become a problem. Keeping historical data on system-level metrics such as CPU utilization and disk activity is also useful, though this is often done by system administrators at a different level than the database level that other monitoring covered here occurs at.
Given unlimited space, I would love to use a package cache eviction event monitor to store essentially every statement ever executed against a database. Or even an activity event monitor, though the impact of that is questionable. In the real world, I settle for capturing the “worst” SQL and stats on it periodically. This allows me to see if a statement has consistently been a problem or is a new issue.
No One Answer
While there are certainly some wrong answers (not monitoring, missing a few indisputably key things), monitoring is an area where there is a lot of variation among DBAs on what should be monitored. What are your must-haves for monitoring?