Ongoing Support of DB2’s HADR
There are some things to be aware of with ongoing support of a HADR system. I thought I’d group them together to provide a primer of do’s and don’ts for support of HADR.
HADR does occasionally stop all by itself. System events can also cause it to not be active. For these reasons, it is critical that you have some sort of monitoring for HADR. I have my monitoring solution treat HADR down as a sev-1 event that pages my team out in the middle of the night to get it resolved. My reasoning for this is that HADR is often part of a recovery plan, and it only takes one subsequent event to cause a major lapse in High Availability or Recoverability. I have personally seen a severe RAID array failure (which resulted in the disk being unrecoverable on its own) just 3 hours (in the middle of the night) after HADR went down. In that case, we were luckily able to read from (but not write to) the failed RAID array for about an hour before it died completely, and were able to get all of the transaction logs from that 3-hour lapse copied. But it was a lesson for my team at the time to treat HADR failures as immediate emergencies themselves so we’re covered for any subsequent emergencies. It is really amazing how many statistically unlikely failures can and do occur.
Thus, I monitor HADR using whatever monitoring system I have available. My preferred way to monitor HADR is using db2pd. With 10.1, IBM changed the db2pd output for the -hadr option to be easier to parse for scripting. And while the MON_GET_HADR table function is nice, it requires a connection to the database. I also ran across an APAR in 10.5 fixpack 4 where the entire DB2 instance crashed whenever I queried MON_GET_HADR (APAR IT04151), which makes me irrationally afraid of MON_GET_HADR.
To monitor HADR, use syntax like this (substituting your database name for SAMPLE):
db2pd -db SAMPLE -hadr
The primary things you are looking for in that output are that the HADR_ROLE is what you expect (no unexpected failover has occurred), the HADR_STATE is PEER (assuming SYNCMODE of ASYNC or higher), and the HADR_CONNECT_STATUS is CONNECTED. Or as I say in my head when looking at this output: PRIMARY, PEER, CONNECTED. Something else in that output that is worth monitoring is the HADR_LOG_GAP. I generally treat anything over 10% of a single log file as an indication of a severe issue.
Obviously, you have to script the parsing of this output to feed it into most monitoring infrastructures.
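As a rough illustration of that kind of scripting, here is a minimal Python sketch that parses the KEY = VALUE lines of the 10.1 and later db2pd -hadr output and applies the PRIMARY, PEER, CONNECTED and log gap checks described above. The function names, the sample text, and the default LOGFILSIZ of 1024 pages are assumptions made for illustration, not part of any DB2 tooling. The sample is a trimmed stand-in for real output, and with multiple standbys this sketch only looks at the first block of values.

```python
import re

# Trimmed, illustrative stand-in for `db2pd -db SAMPLE -hadr` output (10.1+ layout).
SAMPLE_OUTPUT = """
Database Member 0 -- Database SAMPLE -- Active

                HADR_ROLE = PRIMARY
            HADR_SYNCMODE = NEARSYNC
               HADR_STATE = PEER
      HADR_CONNECT_STATUS = CONNECTED
      HADR_LOG_GAP(bytes) = 0
"""


def parse_db2pd_hadr(text):
    """Collect KEY = VALUE pairs; the first occurrence of each key wins."""
    fields = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop unit suffixes such as "(bytes)" so the key is just HADR_LOG_GAP.
        key = re.sub(r"\(.*?\)", "", key).strip()
        if key and key not in fields:
            fields[key] = value.strip()
    return fields


def check_hadr(fields, expected_role="PRIMARY", logfilsiz_pages=1024,
               gap_fraction=0.10):
    """Return a list of problem descriptions; an empty list means healthy."""
    problems = []
    if fields.get("HADR_ROLE") != expected_role:
        problems.append("HADR_ROLE is %s, expected %s"
                        % (fields.get("HADR_ROLE"), expected_role))
    if fields.get("HADR_STATE") != "PEER":
        problems.append("HADR_STATE is %s, expected PEER"
                        % fields.get("HADR_STATE"))
    if fields.get("HADR_CONNECT_STATUS") != "CONNECTED":
        problems.append("HADR_CONNECT_STATUS is %s, expected CONNECTED"
                        % fields.get("HADR_CONNECT_STATUS"))
    try:
        gap = int(fields.get("HADR_LOG_GAP"))
    except (TypeError, ValueError):
        problems.append("HADR_LOG_GAP missing or unreadable")
    else:
        # LOGFILSIZ is expressed in 4 KB pages, so convert it to bytes.
        threshold = gap_fraction * logfilsiz_pages * 4096
        if gap > threshold:
            problems.append("HADR_LOG_GAP of %d bytes exceeds %d bytes"
                            % (gap, threshold))
    return problems


print(check_hadr(parse_db2pd_hadr(SAMPLE_OUTPUT)))
```

In a real monitor you would feed the actual db2pd output into parse_db2pd_hadr (for example by capturing it with subprocess), pass your database's real LOGFILSIZ, and page on any non-empty result.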
Stopping/Starting Systems using HADR
There are several circumstances under which you need to have your procedures straight for what to do with HADR. If you’re using TSAMP, I recommend that a DBA is always personally involved with any reboot or failover. Without TSAMP, you may be able to train someone junior or a system administrator to take correct action and tell them when to engage a DBA for help.
Standby HADR Server Reboot
If you need to reboot a standby database server (due to system maintenance or maintenance at other non-database levels), you will usually not deactivate HADR. However, you will have to activate HADR after the reboot is complete. HADR will become active when the database is activated, and unless you have added some scripting, the database on a standby server is NOT automatically activated.
If TSAMP is being used
If TSAMP is used to automate HADR failover (using db2haicu), you should disable TSAMP prior to the standby database server becoming unavailable. This is fairly simple: issue the db2haicu command with the -disable option prior to any planned outage. After the outage, issue the db2haicu command without any options, and then select 1 to enable TSAMP again. Always check the TSAMP states using lssam (or your favorite other way of looking at them) to ensure that all states are blue or green.
Primary HADR Server Reboot
Usually if you need to reboot or patch or otherwise affect the primary node, you will first fail DB2 over to the principal standby node using the TAKEOVER HADR command on the standby node. Then the reboot of the former primary node is treated just like rebooting a standby server, as described above. Afterwards, some companies prefer to run on the standby node for a while, while others prefer to immediately fail back to the original primary.
If the primary HADR server is rebooted without a failover, it is less likely to need DBA involvement when it comes back up, because HADR will automatically come up when the database is activated, and the primary HADR database is nearly always activated either explicitly or implicitly on first connection.
If TSAMP is being used
As with a reboot of the standby server, if the primary server is to be rebooted, no matter whether the database is failed over or not, you should disable TSAMP using the -disable option on the db2haicu command. After the server is back up, issue the db2haicu command without any options, and then select 1 to enable TSAMP again. Always check the TSAMP states using lssam (or your favorite other way of looking at them) to ensure that all states are blue or green.
Reboot of Both HADR Servers at Once
It is rare, especially in production, but if you reboot both servers at once, always ensure the standby comes up first. The reason is that when the primary comes back online, on the first connection or activation attempt it will first check to see if it can get to the standby server. If it cannot, then it will not allow any incoming connections. The primary assumes that there may be a network issue and refuses to allow connections so that a scary condition called split-brain does not occur. You can force the primary to start using the “BY FORCE” keywords on the “START HADR” command on the primary. However, if you do, it is possible you will have to reset HADR with a restore on the standby database server.
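The ordering can be captured in a runbook helper. The following Python sketch only builds the list of commands and where to run them; it does not execute anything. The function name and the (server, command) tuple layout are hypothetical, and it assumes the database already has its HADR role configured so that a plain ACTIVATE DATABASE brings HADR up, as described earlier.

```python
def restart_plan(db_name, standby_available=True):
    """Return ordered (server, command) steps after both HADR servers reboot."""
    steps = []
    if standby_available:
        # Standby first, so the primary can reach it when it activates.
        steps.append(("standby", "db2 activate database %s" % db_name))
        steps.append(("primary", "db2 activate database %s" % db_name))
    else:
        # Bypass the standby check. The standby may later need to be
        # re-established with a restore, so treat this as a last resort.
        steps.append(
            ("primary",
             "db2 start hadr on database %s as primary by force" % db_name))
    return steps


for server, command in restart_plan("SAMPLE"):
    print(server + ": " + command)
```

Keeping the forced start as a separate, explicit branch makes it harder for someone following the runbook to reach for BY FORCE when simply starting the standby first would have worked.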
I had a client once with one of the most unstable networks I had ever seen. They were trying to do first-line DB2 support on their own, and became very frustrated when on multiple occasions, they had a network issue that forced them to reboot both live production database servers at the same time. Each time, their primary database did not become available until the standby database server came up. Their conclusion was that it was a flawed HA solution because the database could not be up unless both servers were up, but they simply did not understand the order of bringing things up or the commands to use to bypass that order.
Applying DB2 Fixpack
DB2 fixpacks can be applied to HADR servers in a rolling fashion. I have performed, and trained others to perform, fixpack installations with zero observable downtime going back to DB2 version 8. The failover will necessarily roll back transactions already in progress, so whether the downtime is truly unobservable depends on how robust your application is at restarting work. The general order of events is this:
- All pre-fixpack prep work is performed on the primary and standby servers
- DB2 is deactivated on the standby server
- The fixpack code is installed on the standby server
- The DB2 instance is updated on the standby server
- DB2 is restarted on the standby server
- HADR is restarted and the databases are brought back in sync
- The database is failed over from the primary server to the standby server
- Post-fixpack database actions such as binds and db2updvXX are performed on the database
- HADR is stopped
- The fixpack is installed and applied on the primary database server
- HADR is started and the databases are brought back in sync
- If desired, the database may be failed back to the primary
- Post-fixpack instance actions are performed
This work is still performed at off-peak times to minimize the impact of the failovers.
Maintaining and Verifying Settings that Are Not Logged
HADR replicates most things that are logged. It does not copy changes to the database configuration, the database manager configuration, the DB2 registry, or changes made by STMM to bufferpools. Of course, DB2 also cannot copy changes to OS-level parameters, filesystem sizes, and that sort of thing. It is important to be diligent in making these types of changes on all HADR servers. Because it is easy to miss one, you should also manually compare all configurations between the two servers from time to time.
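Part of that comparison can be scripted. Here is a minimal Python sketch that compares two saved copies of "db2 get db cfg" output, one from each server, and reports the parameters whose values differ. The function names, the sample text, and the set of HADR host and service parameters that are ignored because they are expected to differ between the servers are all assumptions for illustration. The same approach should work for database manager configuration output, but verify it against your own output first.

```python
import re

# Parameters that normally differ between primary and standby (assumed list).
EXPECTED_TO_DIFFER = frozenset([
    "HADR_LOCAL_HOST", "HADR_LOCAL_SVC",
    "HADR_REMOTE_HOST", "HADR_REMOTE_SVC", "HADR_REMOTE_INST",
])

# Matches a trailing "(PARAM_NAME)" on the description side of a cfg line.
_PARAM = re.compile(r"\(([A-Z][A-Z0-9_]*)\)\s*$")


def parse_cfg(text):
    """Map each parameter name (or description, if unnamed) to its value."""
    cfg = {}
    for line in text.splitlines():
        left, sep, right = line.partition("=")
        if not sep or not left.strip():
            continue
        match = _PARAM.search(left)
        key = match.group(1) if match else left.strip()
        cfg[key] = right.strip()
    return cfg


def diff_cfg(primary, standby, ignore=EXPECTED_TO_DIFFER):
    """Return {name: (primary_value, standby_value)} for mismatches."""
    diffs = {}
    for key in sorted(set(primary) | set(standby)):
        if key in ignore:
            continue
        a, b = primary.get(key), standby.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs


# Illustrative fragments, not real output from any system.
PRIMARY_CFG = """
 Log file size (4KB)                         (LOGFILSIZ) = 1024
 Number of primary log files                (LOGPRIMARY) = 13
 HADR local host name                  (HADR_LOCAL_HOST) = host1
"""
STANDBY_CFG = """
 Log file size (4KB)                         (LOGFILSIZ) = 1024
 Number of primary log files                (LOGPRIMARY) = 10
 HADR local host name                  (HADR_LOCAL_HOST) = host2
"""

for name, (p, s) in diff_cfg(parse_cfg(PRIMARY_CFG),
                             parse_cfg(STANDBY_CFG)).items():
    print("%s: primary=%s standby=%s" % (name, p, s))
```

A script like this does not replace a manual review of the registry, OS-level parameters, and filesystem sizes, which still have to be checked separately.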
Keeping Copies of Maintenance Scripts and Crontab
In the case of a failover due to a failure of the primary, you’ll want to make sure that any maintenance, monitoring, or data pruning scripts that you use on the primary are copied to the standby. You can automate this with rsync or manually copy them, but in any case, you want to ensure that everything is on the standby that is on the primary so you can easily pick up operations on the standby without access to the primary.
Health Check on HADR
Periodically, you should perform a health check on the HADR pair using the HADR calculator. This can point out areas where a busy network or other factors might cause HADR to impact database performance on your primary database. See my series on the HADR Tools for details on how to do this.