Three HADR Failures
When you move DB2 servers from one hosting provider to another, you have to redo HADR. Along with another DBA, I had 6 production databases across 3 instances to move last Saturday night, plus their HADR databases. The entire change had been tested ahead of time at least once, and probably more like 3 times, with no hiccups. But when we went to do it for real, we ran into three different HADR failures, so I thought I’d share them with my readers. Our solutions worked, but there may have been other options.
HADR isn’t “configured” properly
This is actually a common error message when you’re first setting up HADR. In that case, it most often means you have a typo in one of the HADR parameters in the database configuration. HADR appears to start successfully, then it stops, and you get errors like this in the db2diag.log:
2012-09-30-00.33.14.123075-240 I33173537E425       LEVEL: Error
PID     : 9586                 TID  : 182959892064 PROC : db2hadrs (PWPTST) 0
INSTANCE: db2wps1              NODE : 000          DB   : PWPTST
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduS, probe:20750
RETCODE : ZRC=0x87800140=-2021654208=HDR_ZRC_CONFIGURATION_ERROR
          "One or both databases of the HADR pair is configured incorrectly"
In this case, I checked the HADR configuration on both servers and found it was all correct.
$ db2 get db cfg for PWPTST |grep HADR
 HADR database role                                     = STANDARD
 HADR local host name                 (HADR_LOCAL_HOST) = srv-db2p05
 HADR local service name               (HADR_LOCAL_SVC) = 18950
 HADR remote host name               (HADR_REMOTE_HOST) = srv-db2p08
 HADR remote service name             (HADR_REMOTE_SVC) = 18951
 HADR instance name of remote server (HADR_REMOTE_INST) = db2wps1
 HADR timeout value                      (HADR_TIMEOUT) = 120
 HADR log write synchronization mode    (HADR_SYNCMODE) = NEARSYNC

$ db2 get db cfg for PWPTST |grep HADR
 HADR database role                                     = STANDARD
 HADR local host name                 (HADR_LOCAL_HOST) = srv-db2p08
 HADR local service name               (HADR_LOCAL_SVC) = 18951
 HADR remote host name               (HADR_REMOTE_HOST) = srv-db2p05
 HADR remote service name             (HADR_REMOTE_SVC) = 18950
 HADR instance name of remote server (HADR_REMOTE_INST) = db2wps1
 HADR timeout value                      (HADR_TIMEOUT) = 120
 HADR log write synchronization mode    (HADR_SYNCMODE) = NEARSYNC
But I’ve seen this before, so I knew to check /etc/hosts. My HADR configuration uses host names, and DB2 then resolves those host names to IP addresses using the hosts file (/etc/hosts). In this case, /etc/hosts was correct on the standby server, but incorrect on the primary server. If you didn’t already know, this confirms that the standby talks to the primary even before you start HADR on the primary: I went through all of this before even issuing the command to start HADR on the primary server.
As soon as the hosts file was corrected, HADR stayed up on the standby, and I was able to start HADR on the primary and see them sync up.
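The hosts file check described above can be scripted so it is quick to run on both servers. What follows is only a minimal sketch, not something we ran that night: the function names hadr_hosts and lookup_in_hosts are made up for illustration, and it assumes the db cfg output looks like the listings earlier in this post and that the hosts file uses the usual "address name [alias...]" layout. It reads the hosts file directly, so it does not tell you what a DNS lookup would return.

```shell
#!/bin/sh
# Sketch only: hadr_hosts and lookup_in_hosts are hypothetical helper names.

# Print the values of HADR_LOCAL_HOST and HADR_REMOTE_HOST found in
# "db2 get db cfg" output read from stdin, one per line.
hadr_hosts() {
    awk -F' = ' '/\(HADR_(LOCAL|REMOTE)_HOST\)/ {
        gsub(/[ \t]+$/, "", $2)    # trim trailing whitespace from the value
        print $2
    }'
}

# Print the address that a hosts-format file ($2, default /etc/hosts)
# maps host name $1 to. Returns non-zero if the name has no entry.
lookup_in_hosts() {
    awk -v h="$1" '
        { sub(/#.*/, "") }         # ignore comments
        {
            for (i = 2; i <= NF; i++)
                if (tolower($i) == tolower(h)) { print $1; found = 1; exit }
        }
        END { exit !found }
    ' "${2:-/etc/hosts}"
}

# Usage on each server (PWPTST is the database from this post):
#   db2 get db cfg for PWPTST | hadr_hosts | while read -r h; do
#       printf '%s -> %s\n' "$h" "$(lookup_in_hosts "$h" || echo 'NOT IN HOSTS FILE')"
#   done
```

Run it on the primary and the standby and compare the two outputs by eye; a mismatch or a missing entry on either side is the kind of problem that bit us here.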
An Odd One
This failure was the oddest of the three. When the command was issued to start HADR on the standby database, the following error was returned:
SQL1224N A database agent could not be started to service a request, or was terminated as a result of a database system shutdown or a force command. SQLSTATE=*
My first thoughts on this error were to try it again, to try a db2 terminate, and to try stopping and starting the instance (not a big deal since this is a standby), but even after a db2stop/db2start, this message was still returned. There were no errors in db2diag.log. We then fell back on our general HADR setup bag of tricks. First, we took a new backup and tried restoring it to see if that would work. It didn’t. Finally we resorted to dropping the database on the standby server, restoring the same image we had been using, and then re-configuring HADR. That was what finally worked. Keep in mind that if you end up dropping your HADR standby database, you lose the HADR settings and have to set them again.
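Because dropping the standby loses its HADR settings, it can help to capture them before the drop. Here is a rough sketch of one way to do that, assuming you saved the output of "db2 get db cfg" to a file first. The function name hadr_cfg_to_update is hypothetical; it only prints the update commands so you can review them before running anything.

```shell
#!/bin/sh
# Sketch only: hadr_cfg_to_update is a hypothetical helper name.

# Read saved "db2 get db cfg" output on stdin and print one
# "db2 update db cfg" command per HADR_* parameter that has a value.
# $1 is the database name to put in the generated commands.
hadr_cfg_to_update() {
    awk -F' = ' -v db="$1" '
        match($1, /\(HADR_[A-Z_]+\)/) {
            # strip the surrounding parentheses from the parameter name
            param = substr($1, RSTART + 1, RLENGTH - 2)
            value = $2
            gsub(/[ \t]+$/, "", value)
            if (value != "")
                printf "db2 update db cfg for %s using %s %s\n", db, param, value
        }'
}

# Usage, before dropping the standby (file names are examples):
#   db2 get db cfg for PWPTST > /tmp/PWPTST.cfg
#   hadr_cfg_to_update PWPTST < /tmp/PWPTST.cfg > /tmp/PWPTST_hadr.sh
```

The "HADR database role" line is skipped on purpose, since it has no parameter name in parentheses and the role is set by starting HADR rather than by updating the configuration.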
Bad Log File
The third failure we saw on Saturday night was a bad log file. After a restore, DB2 should delete all the active log files. It usually gives you a warning message that it is doing this, but occasionally it doesn’t, for whatever reason. In this case, again HADR looked like it started successfully, but then went down with the following in the db2diag.log:
2012-09-30-01.43.50.480225-240 I16080925E381       LEVEL: Error
PID     : 4826                 TID  : 182959892064 PROC : db2hadrs (TS256P01) 0
INSTANCE: db2inst1             NODE : 000          DB   : TS256P01
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduEntry, probe:21150
RETCODE : ZRC=0x87800148=-2021654200=HDR_ZRC_BAD_LOG
          "HADR standby found bad log"
In this case, at least the solution is simple. We deleted all the logs on the standby and then did the restore again, and it worked. As much as I hate deleting log files – especially since they were in the active path – it was necessary.
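If deleting log files outright makes you as nervous as it makes me, a more cautious variation is to move them out of the log path instead. To be clear, this is not what we did that night (we deleted them); it is just a sketch of the alternative. The function name stash_logs and the directories in the usage comment are hypothetical, and it assumes the log files follow the usual S0000000.LOG naming and that the standby database is not active when you run it.

```shell
#!/bin/sh
# Sketch only: stash_logs is a hypothetical helper name.

# Move DB2 log files (S0000000.LOG style names) from directory $1 into
# directory $2, leaving every other file in place.
# Prints the number of files moved.
stash_logs() {
    src=$1
    dest=$2
    n=0
    mkdir -p "$dest" || return 1
    for f in "$src"/S[0-9][0-9][0-9][0-9][0-9][0-9][0-9].LOG; do
        [ -f "$f" ] || continue    # the pattern matched nothing
        mv "$f" "$dest"/ || return 1
        n=$((n + 1))
    done
    echo "$n"
}

# Usage on the standby, before redoing the restore (paths are examples):
#   stash_logs /db2/logs/TS256P01 /db2/stash/TS256P01_logs
```

Once the restore works and HADR is back in peer state, the stashed files can be removed at your leisure.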
No matter how many times you test it, you can still run into issues when you go to do a change in production. It is important to be prepared to try different solutions. Sometimes you don’t even know the true problem – you simply have a set of solutions that have worked in the past for a given problem, and you work through them until something works.
Special thanks to Jim Reutener who I was assisting in this situation.