Three HADR Failures

When you move DB2 servers from one hosting provider to another, you have to re-do hadr. Along with another DBA, I had 6 production databases across 3 instances to move last Saturday night, along with the HADR databases. The entire change had been tested ahead of time at least once, and probably more like 3 times, with no hiccups. But when we went to do it for real, we ran into three different HADR failures, so I thought I’d share them with my readers. Our solutions worked, but there may have been other options.

HADR isn’t “configured” properly

This is actually a common error message when you’re first setting up HADR. In that case, it most often means you have a typeo in one of the HADR parameters in the database configuration. AFTER HADR appears to have started successfully, it stops, and then you get errors like this in the db2diag.log:

2012-09-30-00.33.14.123075-240 I33173537E425      LEVEL: Error
PID     : 9586                 TID  : 182959892064PROC : db2hadrs (PWPTST) 0
INSTANCE: db2wps1              NODE : 000         DB   : PWPTST
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduS, probe:20750
RETCODE : ZRC=0x87800140=-2021654208=HDR_ZRC_CONFIGURATION_ERROR
          "One or both databases of the HADR pair is configured incorrectly"

In this case, when I checked the HADR configuration and found it was all correct.

Primary

$ db2 get db cfg for PWPTST |grep HADR
 HADR database role                                      = STANDARD
 HADR local host name                  (HADR_LOCAL_HOST) = srv-db2p05
 HADR local service name                (HADR_LOCAL_SVC) = 18950
 HADR remote host name                (HADR_REMOTE_HOST) = srv-db2p08
 HADR remote service name              (HADR_REMOTE_SVC) = 18951
 HADR instance name of remote server  (HADR_REMOTE_INST) = db2wps1
 HADR timeout value                       (HADR_TIMEOUT) = 120
 HADR log write synchronization mode     (HADR_SYNCMODE) = NEARSYNC

Standby

$ db2 get db cfg for PWPTST |grep HADR
 HADR database role                                      = STANDARD
 HADR local host name                  (HADR_LOCAL_HOST) = srv-db2p08
 HADR local service name                (HADR_LOCAL_SVC) = 18951
 HADR remote host name                (HADR_REMOTE_HOST) = srv-db2p05
 HADR remote service name              (HADR_REMOTE_SVC) = 18950
 HADR instance name of remote server  (HADR_REMOTE_INST) = db2wps1
 HADR timeout value                       (HADR_TIMEOUT) = 120
 HADR log write synchronization mode     (HADR_SYNCMODE) = NEARSYNC

But I’ve seen these before, so I knew to check /etc/hosts. My HADR configuration uses host names, and DB2 then resolves those host names to IP addresses using the hosts file (/etc/hosts). In this case, /etc/hosts was correct on the standby Server, but incorrect on the Primary server. If you didn’t already know, this is a confirmation that the standby talks to the primary even before you get HADR started on the primary – I went through all of this before even issuing the command to start HADR on the primary server.
As soon as the hosts file was corrected, HADR stayed up on the standby, and I was able to start HADR on the primary and see them sync up.

An Odd One

This failure was the oddest of the three. When the command was issued to start HADR on the standby database, the following error was returned:

SQL1224N A database agent could not be started to service a request, or was terminated as a result of a database system shutdown or a force command.
SQLSTATE=*

My first thoughts on this error are to try it again, to try a db2 terminate, and to try stopping and starting the instance (not a big deal since this is a standby), but even after a db2stop/db2start, this message was still returned. There were no errors in db2diag.log. We then fail back to our general HADR setup bag of tricks. First, we took a new backup, and tried to restore it to see if that would work. It didn’t. Finally we resorted to dropping the database on the standby server, restoring the same image we had been using, and then re-configuring HADR. If you end up dropping your HADR standby database, you lose the HADR settings. That was what finally worked.

Bad Log File

The third failure we saw on Saturday night was a bad log file. After the restore, DB2 should be deleting all active log files. It usually gives you a warning message that it is doing this. But occasionally it doesn’t for whatever reason. In this case, again HADR looked like it started successfully, but then went down with the following in the db2diag.log:

2012-09-30-01.43.50.480225-240 I16080925E381      LEVEL: Error
PID     : 4826                 TID  : 182959892064PROC : db2hadrs (TS256P01) 0
INSTANCE: db2inst1             NODE : 000         DB   : TS256P01
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduEntry, probe:21150
RETCODE : ZRC=0x87800148=-2021654200=HDR_ZRC_BAD_LOG
          "HADR standby found bad log"

In this case, at least the solution is simple. We deleted all the logs on the standby and then did the restore again, and it worked. As much as I hate deleting log files – especially since they were in the active path – it was necessary.

Conclusion

No matter how many times you test it, you can still run into issues when you go to do a change in production. It is important to be prepared to try different solutions. Sometimes you don’t even know the true problem – you simply have a set of solutions that have worked in the past for a given problem, and you work through them until something works.

Special thanks to Jim Reutener who I was assisting in this situation.

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *