When Both HADR Servers are Rebooted at the Same Time, DB2 Won’t Work
I have a client that has been having some issues that cause both HADR servers to go down at the same time. Ignoring for the moment that it’s a bad idea to share resources that can cause this to happen, they seem frustrated that when both servers come back up, DB2 does not automatically come back up. If I have only one client with this issue, and they think it’s a basic failing of DB2 HADR, then there must be others out there, so I thought I would explain it.
When Both HADR Servers are Rebooted (or crash) at Once
IBM clearly did not design HADR to protect against failures of both servers at once. HADR protects against the failure of one server at a time, and with TSA (configured through db2haicu) the takeover can be automated. When the primary restarts and cannot reach the standby, it cannot tell whether the standby is really down or merely cut off by a network issue, so DB2 protects against the possibility of having two database servers actively taking transactions that cannot later be reconciled (a nightmarish condition called ‘Split-Brain’). That protection prevents the primary database from being started until one of the following happens:
- HADR starts on the standby server and the primary server is able to talk to it:
(on standby) db2 "START HADR ON DATABASE SAMPLE AS STANDBY"
- A start command is issued with a “by force” keyword, potentially breaking HADR and introducing the possibility of Split-Brain:
(on primary) db2 "START HADR ON DATABASE SAMPLE AS PRIMARY BY FORCE"
Please use the “by force” option with extreme care.
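In practice, recovery after a double reboot is mostly a matter of order: HADR on the standby first, then the primary. Here is a minimal sketch of that as a shell wrapper. The function name `start_hadr` and the `DB2_CMD` override are hypothetical names of my own (the override exists only so the function can be dry-run with a stub), and the wrapper deliberately offers no “by force” path.

```shell
#!/bin/sh
# Sketch: start HADR in a given role without ever using BY FORCE.
# DB2_CMD is a hypothetical override for dry runs; it defaults to the real db2 CLP.
start_hadr() {
    role="$1"
    dbname="$2"
    case "$role" in
        standby|primary)
            ${DB2_CMD:-db2} "start hadr on database $dbname as $role"
            ;;
        *)
            echo "usage: start_hadr standby|primary DBNAME" >&2
            return 2
            ;;
    esac
}

# Intended order after both servers come back, run as the instance owner:
#   on the standby server:  start_hadr standby SAMPLE
#   on the primary server:  start_hadr primary SAMPLE
```

If the primary still refuses to start after the standby is up, that is the point to investigate connectivity rather than reach for “by force”.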
I happen to agree with IBM on the approach they’ve taken here.
The problem comes in with the next part:
When DB2 starts on the Primary, HADR is also automatically started. BUT when DB2 starts on the standby, HADR DOES NOT START, and MUST BE STARTED MANUALLY.
Now that is the effect. What actually happens is that HADR is started on either database when the database is activated. On the Primary, connections from the application activate the database almost immediately. On the standby, nothing connects, so the only thing that can activate the database is an explicit “activate database” command. So one solution here may be to add explicit database activation to your startup scripts.
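A minimal sketch of what that startup-script addition could look like is below. The function name `activate_hadr_db` and the `DB2START_CMD` / `DB2_CMD` overrides are hypothetical (the overrides are there only so the function can be tested with stubs). It assumes it runs as the instance owner with the DB2 profile already sourced.

```shell
#!/bin/sh
# Sketch: boot-time hook for an instance that hosts an HADR database.
# DB2START_CMD and DB2_CMD are hypothetical overrides for dry runs.
activate_hadr_db() {
    dbname="$1"
    # Start the instance. Its exit status is ignored here because the
    # instance may already be running; the activate step reports the real result.
    ${DB2START_CMD:-db2start} || true
    # Explicit activation is what brings HADR up in the database's configured role.
    ${DB2_CMD:-db2} "activate database $dbname"
}

# Usage from a startup script:
#   activate_hadr_db SAMPLE
```

On the primary this does not bypass the Split-Brain protection: activation there still needs the standby to be reachable.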
This difference in how the database gets activated on each server is what causes the problem. If HADR started automatically on the standby, then there would be no problem in a double-reboot scenario. But again, HADR is not designed to protect against double-outage scenarios, so it makes sense to me that an actual person has to be engaged and involved in these situations.
If Only the Standby Crashes
This is one of the failure scenarios HADR is designed to deal with. If only the standby crashes, there is no outage. However, when the standby comes back up, HADR itself will be down. This is one of the reasons why it is critical to monitor HADR itself. I treat an HADR outage as if it were a sev 1 (get it back up and running, day or night), at least for systems where I expect takeover to occur in a few minutes or less. If the outage is caught soon enough, a single command is all it takes to manually start HADR on the standby database. If HADR has been down too long (‘too long’ depends heavily on your transaction volume), you will have to take a backup from the primary and restore it on the standby before HADR will start.
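As a starting point for that monitoring, here is a small sketch that reads `db2pd -db SAMPLE -hadr` output and reports whether HADR is connected and in peer state. It rests on several assumptions: the function name `hadr_is_healthy` is hypothetical, the output is in the `KEY = VALUE` layout used by DB2 10.1 and later (older releases print a different layout), there is a single standby, and the sync mode is one that reaches PEER state.

```shell
#!/bin/sh
# Sketch: read `db2pd -db <name> -hadr` output on stdin and exit 0 only when
# HADR_STATE is PEER and HADR_CONNECT_STATUS is CONNECTED.
# Assumes "KEY = VALUE" lines (DB2 10.1+ layout) and a single standby.
hadr_is_healthy() {
    awk '
        $1 == "HADR_STATE"          { state = $3 }
        $1 == "HADR_CONNECT_STATUS" { conn = $3 }
        END { exit !(state == "PEER" && conn == "CONNECTED") }
    '
}

# Usage from cron or a monitoring agent:
#   db2pd -db SAMPLE -hadr | hadr_is_healthy || echo "HADR problem on SAMPLE"
```

Missing or unparseable output counts as unhealthy, which is the safer default for something you plan to page on.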
If Only the Primary Crashes
This is the failure HADR is meant to protect against. When using db2haicu/TSA, if the Primary database server crashes or is rebooted, the database will fail over to the standby. The time it takes to fail over depends on two settings in the DB cfg and the transaction volume on the database. When the primary database server comes back up, it checks with the standby before it starts, learns that it is now the new standby, and starts itself as a standby.
TSA states for Unexpected Failures
TSA can deal with many different failure scenarios on its own. But some failures can leave TSA stuck in a “pending online” or other unhealthy state. Whenever you restart HADR, always check TSA status using the lssam command or db2pd -ha.
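A rough sketch of automating that check is below. It filters `lssam` output for states that usually need attention. The function name `tsa_unhealthy` is hypothetical, and the list of state names in the pattern is my assumption about what counts as unhealthy, so adjust it to what your cluster actually prints. Plain “Offline” is deliberately not flagged, because some resources are expected to be offline on one of the two nodes.

```shell
#!/bin/sh
# Sketch: read `lssam` output on stdin and print lines showing a suspicious state.
# Exits 0 when at least one such line is found, 1 when none are found.
# The state list is an assumption; extend or trim it for your environment.
tsa_unhealthy() {
    grep -E -i 'pending (online|offline)|failed offline|stuck online|unknown'
}

# Usage after restarting HADR:
#   lssam | tsa_unhealthy && echo "TSA needs attention"
```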
How to Properly Stop TSA if you’re Stopping Things on Purpose
It is best for TSA if you stop it fully before a planned outage of one or more servers. Details on how to do that are in this document: http://db2commerce.com/wp-content/uploads/2012/01/Shutdown-Procedure-for-an-Automated-HADR-Environment_11152010.pdf
Simply put, DB2 is not a product you can set up and expect to run without human involvement. It requires frequent attention from a DBA, especially in outage scenarios.