HADR_TIMEOUT vs. HADR_PEER_WINDOW
It has taken me a while to fully understand the difference between HADR_TIMEOUT and HADR_PEER_WINDOW. I think there is some confusion here, so I’d like to address what each means and some considerations when setting them. In general, you’ll only need HADR_TIMEOUT when using HADR with manual failover, and you’ll only need HADR_PEER_WINDOW when using TSA (db2haicu) or some other automated failover tool.
HADR_TIMEOUT defines, in seconds, how long after the other HADR server is first noticed to be unavailable the HADR state changes from connected to disconnected. If you are starting HADR on the primary server and the primary cannot connect to the standby within this number of seconds, the start will fail and HADR will not be running. Assuming no failover software and HADR_PEER_WINDOW set to 0, the primary server will continue processing transactions without sending them to the standby. It will periodically retry the connection to the standby, and if the standby becomes available it will again start processing transactions with commits tied to the requirements of the SYNCMODE being used.
If attempting a takeover without force, DB2 will wait this amount of time to attempt to communicate with the other server before failing and returning an error message.
The real point of this time period is to let minor network hiccups pass without any other action being taken, while still declaring the connection failed after a reasonable period so that transactions are not impeded indefinitely.
Setting this value depends on your network. I have a client with frequent network issues where I keep this value at 300. I have other clients where I use simply 120, which seems to work well for most environments. I have seen it set as low as 10 seconds for a very highly available network where even seconds of slowdown are not acceptable, but I would be very cautious about setting it that low.
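One way to reason about the trade-off is to compare candidate values against the network outages you have actually seen. The following is only a rough sketch in Python: the function name and the sample outage durations are made up for illustration, and it assumes an outage counts as a disconnect once it lasts at least HADR_TIMEOUT seconds.

```python
def outages_exceeding_timeout(outage_seconds, hadr_timeout):
    """Return the outages that last at least hadr_timeout seconds.

    Assumption for this sketch: shorter outages are ridden out as
    network hiccups, longer ones cause the connection to be
    considered failed.
    """
    return [s for s in outage_seconds if s >= hadr_timeout]


# Made-up outage durations in seconds, for illustration only.
sample_outages = [2, 5, 8, 15, 45, 130, 400]

# The three values mentioned in the text above.
for timeout in (10, 120, 300):
    hits = outages_exceeding_timeout(sample_outages, timeout)
    print(f"HADR_TIMEOUT={timeout}: {len(hits)} of "
          f"{len(sample_outages)} outages would be treated as a disconnect")
```

A lower value means more outages are treated as a disconnect. A higher value means transactions may be held up longer before the connection is given up on.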
HADR_PEER_WINDOW is not usually used when only HADR is in place with manual failover, but it is critical if you are using automated failover for HADR, such as TSA (db2haicu) or others. It tells DB2 how long AFTER the connection is considered failed to continue to behave as if the connection had not failed. Now that may sound a bit odd. But the real intention here is to allow the connection to be considered failed, and then give the failover automation software time to detect that failure before any transactions are allowed to complete and compromise the data. This means you can easily have connections waiting for as much as HADR_TIMEOUT plus HADR_PEER_WINDOW before a failover is completed and your database is again available.
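To make the timeline concrete, here is a simplified model in Python of how the two parameters stack up. It is a sketch of the behavior described above, not of DB2 internals: the function names and phase labels are my own, and it assumes the peer window starts exactly when the timeout expires.

```python
def primary_phase(seconds_since_contact_lost, hadr_timeout, hadr_peer_window):
    """Classify what the primary is doing at a given point in an outage.

    Simplified model; the phase labels are illustrative and are not
    DB2 state names.
    """
    if seconds_since_contact_lost < hadr_timeout:
        # Still riding out a possible network hiccup.
        return "within_timeout"
    if seconds_since_contact_lost < hadr_timeout + hadr_peer_window:
        # Connection considered failed, but the primary still behaves
        # as if it were not, so failover software can react.
        return "peer_window"
    # Primary processes transactions without the standby.
    return "disconnected"


def worst_case_wait_seconds(hadr_timeout, hadr_peer_window):
    """Upper bound on how long connections may wait, per the text above."""
    return hadr_timeout + hadr_peer_window


# Example with the values used in this post: 120 + 300 seconds.
print(worst_case_wait_seconds(120, 300))
print(primary_phase(200, 120, 300))
```

With HADR_PEER_WINDOW set to 0, the middle phase disappears and the model goes straight from waiting out the timeout to running without the standby, which matches the manual failover case.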
Most frequently I see HADR_PEER_WINDOW set to 300 out of an abundance of caution – actual takeovers do not generally take that long, though in a failure state there may be multiple factors slowing down the failover.