What is HADR?
HADR is DB2’s implementation of log shipping, which makes it a shared-nothing kind of product. But it ships logs at the log buffer level instead of the log file level, so the standby can be extremely up to date. It even has a SYNC mode that guarantees that transactions committed on one server are also on the other server (in years of experience on dozens of clients, I’ve only ever seen NEARSYNC used). It can only handle two servers (there’s no adding a third), and it is active/warm spare: only with 9.7 Fixpack 1 and later can you do reads on the standby, and you can never do writes on the standby.
How much does it cost?
As always, verify with IBM, because licensing changes by region and by other factors I’m not aware of. But generally HADR is included with DB2 licensing; the rub is usually in licensing DB2 on the standby server. Usually the standby server can be licensed at only 100 PVU, which is frequently much cheaper than full DB2 licensing. If you want to be able to do reads on the standby, though, you’ll have to go in for full licensing. Usually clients run HADR only in production, though I have seen a couple lately doing it in QA as well to have a testing ground.
What failures does it protect against?
HADR protects against hardware failures: CPU, disk, memory, controllers, and other hardware components. Tools like HACMP and Veritas use a shared-disk implementation, so they cannot protect against disk failure. I have seen both SAN failures and RAID array (the whole array) failures, so it may seem like one in a million, but even the most redundant disks can fail. HADR can also be used to facilitate rolling hardware maintenance and rolling FixPacks, though you are not guaranteed to be able to keep the database up during a full DB2 version upgrade. HADR does not detect failures on its own; it must be combined with other (included) products to automatically sense failures and fail over.
What failures does it not protect against?
HADR does not protect against human error, data issues, or failures of HADR itself. If someone deletes everything from a table and commits the delete, HADR is not going to be able to recover from that. It is not a replacement for a good backup and recovery strategy. You must also monitor HADR. I treat HADR down in production as a sev 1 issue where a DBA needs to be called out of bed to fix it. I have actually lost a production RAID array around 5 am when HADR had gone down around 1 am. Worst-case scenarios do happen.
How to set it up
HADR is really not too difficult to set up on its own. Configuring automatic failover is a bit more difficult, though DB2 has made it significantly easier in 9.5 and above with the introduction of bundled TSA and the db2haicu tool. I’m not going to list every detail here because there are half a dozen white papers out there on how to set it up. The general idea is:
1. Set the HADR parameters on each server
HADR local host name (HADR_LOCAL_HOST) = your.primary.hostname
HADR local service name (HADR_LOCAL_SVC) = 18819
HADR remote host name (HADR_REMOTE_HOST) = your.secondary.hostname
HADR remote service name (HADR_REMOTE_SVC) = 18820
HADR instance name of remote server (HADR_REMOTE_INST) = inst1
HADR timeout value (HADR_TIMEOUT) = 120
HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 120
2. Set the Alternate Servers on the Primary and the standby (for Automatic Client Reroute)
3. Set db configuration parameters INDEXREC to RESTART and LOGINDEXBUILD to ON
4. Take a backup (preferably Offline) of the database on the primary server
5. Restore the database on the standby server, leaving it in rollforward pending state
6. Start HADR on the standby
7. Start HADR on the primary
8. Wait 5 minutes and check HADR status
9. Run db2haicu to set up TSA for automated failover
10. Test multiple failure scenarios at the app and database level
For chunks of this, your database will be unavailable. There are also a number of inputs you need to have ready for running db2haicu, and you will need ongoing sudo authority to execute at least one TSA related command.
Remember that the primary and standby servers should be as identical as possible – filesystems, hardware, and software.
Some clients also neglect step #10, testing of failovers. This is an important step to make sure you really can fail over. It is possible to think you have everything set up right, do a failover, and then find it does not work properly from the application’s perspective.
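As a rough sketch of steps 1, 6 and 7, the helper below builds the `db2 update db cfg` and `db2 start hadr` commands for both servers so that the local/remote values are mirrored by construction rather than typed twice. The database name SAMPLE, the function names, and the assumption that both instances are called inst1 are illustrative and not from any particular environment; the hostnames, ports and other values are the ones listed in step 1. It only produces the command strings and does not run anything.

```python
def hadr_cfg_commands(db, local_host, local_svc, remote_host, remote_svc,
                      remote_inst, syncmode="NEARSYNC", timeout=120,
                      peer_window=120):
    """Return the 'db2 update db cfg' commands for ONE side of an HADR pair."""
    params = [
        ("HADR_LOCAL_HOST", local_host),
        ("HADR_LOCAL_SVC", local_svc),
        ("HADR_REMOTE_HOST", remote_host),
        ("HADR_REMOTE_SVC", remote_svc),
        ("HADR_REMOTE_INST", remote_inst),
        ("HADR_TIMEOUT", timeout),
        ("HADR_SYNCMODE", syncmode),
        ("HADR_PEER_WINDOW", peer_window),
    ]
    return [f"db2 update db cfg for {db} using {name} {value}"
            for name, value in params]


def hadr_setup_plan(db, primary, standby):
    """Build the per-server command lists; each side's 'remote' is the other's 'local'."""
    plan = {
        "primary": hadr_cfg_commands(db, primary["host"], primary["svc"],
                                     standby["host"], standby["svc"],
                                     standby["inst"]),
        "standby": hadr_cfg_commands(db, standby["host"], standby["svc"],
                                     primary["host"], primary["svc"],
                                     primary["inst"]),
    }
    # Steps 6 and 7: HADR is started on the standby first, then on the primary.
    plan["standby"].append(f"db2 start hadr on db {db} as standby")
    plan["primary"].append(f"db2 start hadr on db {db} as primary")
    return plan


if __name__ == "__main__":
    # SAMPLE and inst1-on-both-servers are assumptions for illustration.
    plan = hadr_setup_plan(
        "SAMPLE",
        primary={"host": "your.primary.hostname", "svc": 18819, "inst": "inst1"},
        standby={"host": "your.secondary.hostname", "svc": 18820, "inst": "inst1"},
    )
    for server in ("standby", "primary"):
        print(f"-- run on the {server} --")
        print("\n".join(plan[server]))
```

The backup, restore, alternate server and db2haicu steps are deliberately left out, since their paths and inputs depend on the environment.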
This section represents hours spent troubleshooting different problems or recovering from them. I hope it can help someone find an issue faster.
HADR is extremely picky about its variables. They must be exactly right with no typos, or HADR will not work. I have, on several occasions, had numbers reversed or the instance name off, and spent a fair amount of time looking for the error before finding it. Because of this, it can help to have another DBA look over the basics if things aren’t working on setup. HADR is also picky about hosts file and/or db2nodes.cfg setup, and in some cases you may end up using an IP address in the db cfg parameters instead of a hostname.
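One way to catch reversed digits before spending an hour on them is to compare the two configurations mechanically. The sketch below assumes you have copied the HADR values from `db2 get db cfg` on each server into two dictionaries by hand; the function name and the dictionary layout are hypothetical. It checks that each side's local host and service match the other side's remote values, and that HADR_SYNCMODE and HADR_TIMEOUT agree. It compares strings only, so a hostname on one side and an IP address on the other will be reported even if they resolve to the same machine.

```python
def _norm(value):
    """Compare values as trimmed, case-insensitive strings."""
    return str(value).strip().upper()


def hadr_mismatches(primary, standby):
    """Return a list of human-readable problems; an empty list means the pair mirrors."""
    problems = []
    mirrored = (("HADR_LOCAL_HOST", "HADR_REMOTE_HOST"),
                ("HADR_LOCAL_SVC", "HADR_REMOTE_SVC"))
    for local, remote in mirrored:
        # The primary's local value must be what the standby calls remote...
        if _norm(primary.get(local)) != _norm(standby.get(remote)):
            problems.append(f"primary {local}={primary.get(local)} "
                            f"but standby {remote}={standby.get(remote)}")
        # ...and the other way around.
        if _norm(standby.get(local)) != _norm(primary.get(remote)):
            problems.append(f"standby {local}={standby.get(local)} "
                            f"but primary {remote}={primary.get(remote)}")
    for name in ("HADR_SYNCMODE", "HADR_TIMEOUT"):
        if _norm(primary.get(name)) != _norm(standby.get(name)):
            problems.append(f"{name} differs: primary={primary.get(name)}, "
                            f"standby={standby.get(name)}")
    return problems


if __name__ == "__main__":
    primary_cfg = {
        "HADR_LOCAL_HOST": "your.primary.hostname", "HADR_LOCAL_SVC": 18819,
        "HADR_REMOTE_HOST": "your.secondary.hostname", "HADR_REMOTE_SVC": 18820,
        "HADR_SYNCMODE": "NEARSYNC", "HADR_TIMEOUT": 120,
    }
    standby_cfg = {
        "HADR_LOCAL_HOST": "your.secondary.hostname", "HADR_LOCAL_SVC": 18820,
        "HADR_REMOTE_HOST": "your.primary.hostname",
        "HADR_REMOTE_SVC": 18891,  # digits reversed on purpose
        "HADR_SYNCMODE": "NEARSYNC", "HADR_TIMEOUT": 120,
    }
    for problem in hadr_mismatches(primary_cfg, standby_cfg):
        print(problem)
```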
HADR also sometimes fails after it tells you it has successfully started, so you must check the status after you start it.
Occasionally HADR doesn’t like to work from an Online backup, so an Offline one will be required. I have one note about it not going well with a compressed backup, but that was years ago, and I have frequently used compressed backups without trouble.
HADR does not copy things that aren’t logged, so it is not a good choice if you have non-logged LOBs or if you do non-recoverable loads. If you do a non-recoverable load on a database using HADR, any table loaded that way will not be copied to the standby, nor will future changes to it, and after a failover you will not be able to access that table. The only fix is to take a backup on the primary, restore it into the standby, and start HADR again. For this reason, I wouldn’t use HADR in a scenario where you don’t have good control over how data is loaded into the database.
HADR does go down sometimes without warning, so you must monitor it using whatever monitoring tools you have, and ensure that you respond very quickly when it goes down. I use db2pd to monitor (parsing the output with scripts), partially because db2pd works when other monitoring tools hang. We look at ConnectStatus, State, and LogGapRunAvg.
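A minimal sketch of that kind of parsing script is below. It assumes the tabular layout where a header line (containing names such as State or ConnectStatus) is followed by a line of values, and it picks each value by its position in the line. The sample text is illustrative rather than captured from a real server, the alert threshold is an arbitrary placeholder, and the layout of `db2pd -hadr` output differs between DB2 versions, so check it against what your own servers print.

```python
import subprocess

# Illustrative sample only; real db2pd output has more lines and may differ.
SAMPLE_OUTPUT = """\
HADR Information:
Role    State                SyncMode HeartBeatsMissed   LogGapRunAvg (bytes)
Primary Peer                 Nearsync 0                  0

ConnectStatus ConnectTime                           Timeout
Connected     Sun Jun 15 03:09:35 2008 (1213528175) 120
"""


def run_db2pd(db):
    """Run 'db2pd -db <db> -hadr' and return its text output."""
    result = subprocess.run(["db2pd", "-db", db, "-hadr"],
                            capture_output=True, text=True)
    return result.stdout


def hadr_fields(text, wanted=("ConnectStatus", "State", "LogGapRunAvg")):
    """Pick values by column position from header/value line pairs."""
    lines = text.splitlines()
    found = {}
    for header_line, value_line in zip(lines, lines[1:]):
        header = header_line.split()
        values = value_line.split()
        for name in wanted:
            if name in header and name not in found:
                idx = header.index(name)
                # Positional lookup is only valid while every value to the
                # left of this column is a single token (assumed here).
                if idx < len(values):
                    found[name] = values[idx]
    return found


def hadr_alerts(fields, max_log_gap_bytes=1_000_000):
    """Return alert messages; the gap threshold is a hypothetical default."""
    alerts = []
    for name in ("ConnectStatus", "State", "LogGapRunAvg"):
        if name not in fields:
            alerts.append(f"{name} missing from db2pd output")
    if "ConnectStatus" in fields and fields["ConnectStatus"] != "Connected":
        alerts.append(f"ConnectStatus is {fields['ConnectStatus']}")
    if "State" in fields and fields["State"] != "Peer":
        alerts.append(f"State is {fields['State']}")
    gap = fields.get("LogGapRunAvg", "")
    if gap.isdigit() and int(gap) > max_log_gap_bytes:
        alerts.append(f"LogGapRunAvg is {gap} bytes")
    return alerts


if __name__ == "__main__":
    # Swap SAMPLE_OUTPUT for run_db2pd("SAMPLE") on a real server.
    print(hadr_alerts(hadr_fields(SAMPLE_OUTPUT)) or "HADR looks healthy")
```

Treating missing fields as alerts is a deliberate choice: if db2pd prints nothing useful, that should page someone instead of passing silently.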
On reboot, HADR comes up with database activation. This means it usually comes up just fine on your primary database, but not on your standby database (there are no connections to prompt activation). So you’ll generally need to manually start HADR on your standby after a reboot. The primary database will not allow connections on activation until after it can communicate with the standby. This is to prevent a DBA’s worst nightmare, ‘Split Brain’. DB2’s protections against split-brain are pretty nifty. But this means that if you reboot both your primary and your standby at the same time and your primary comes up first, then your primary will not allow any connections until your standby is also up. This can be very confusing the first time or two that you see it. You can manually force the primary to start if you’re sure that the standby is not also up and taking transactions. Or, if you’re rebooting both, just do the standby first and do the primary after the standby is back up and activated. If you need your standby down for a while, then stop HADR before you stop the servers. I would recommend NOT stopping HADR automatically on reboot, because the default behavior protects you from split-brain.
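The restart order described above can be written down as a small decision helper. This is only a sketch: the function name and the SAMPLE database name are hypothetical, it returns the commands rather than running them, and whether the standby is really coming back is something a DBA has to decide, not the script.

```python
def restart_commands(db, standby_available):
    """Return (server, command) pairs in the order they should be run."""
    if standby_available:
        # Normal case: bring HADR up on the standby first, then activate the
        # primary, which can now reach its partner and accept connections.
        return [
            ("standby", f"db2 start hadr on db {db} as standby"),
            ("primary", f"db2 activate db {db}"),
        ]
    # Standby is staying down: force the primary up. Only do this when you are
    # sure the standby is not also up and taking transactions (split-brain).
    return [("primary", f"db2 start hadr on db {db} as primary by force")]


if __name__ == "__main__":
    for server, command in restart_commands("SAMPLE", standby_available=True):
        print(f"{server}: {command}")
```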
What is split-brain? It is simply both your primary and standby databases thinking they are the primary and taking transactions – getting you into a nearly impossible to resolve data conflict.
You must keep the same ids/groups on the primary and standby database servers. I’ve seen a situation on initial set up where the id that Commerce was using to connect to the database was only on the primary server, and not on the standby server, and thus on failover, the database looked fine, but Commerce could not connect.
You also want to be aware of any batch jobs, data loads, or even scheduled maintenance like runstats or backups. When you fail over, you’ll need to run these on the other database server. Or you can run them from a client that will get the ACR value and always point to the active database server. Frequently we don’t care which database server the database is running on, and may have it on what was initially the “standby” for months at a time.
Overall, I really like HADR and its ease of administration. The level of integration for TSA in 9.5/9.7 is great.