Going Beyond 2 Servers – HADR as a High Availability/Disaster Recovery option.
When the client considers high availability and disaster recovery, they often do not know what they are talking about. Many times the client may be dropping buzzwords like “five nines”. To them this is the definition of disaster recovery. In other cases, they are thinking of a worst-case scenario where a whole data center falls off the face of the earth and they need high availability.
If you read that carefully, you should have stopped and said “Er… wait … didn’t he get that backwards?”.
Yes. My statements were backwards to prove a point. The client often doesn’t know what he wants and may be spouting buzzwords or a skewed understanding of recovery. It is often the DBA’s job to ask the right questions and propose the proper solution.
Inspired by a presentation from Dale McInnis on High Availability (HA) and Disaster Recovery (DR), Ember wrote an excellent article on what each means. Her article gives you a starting point to help the client define exactly what they need. It’s very in depth, but for the purpose of this article I want to over simplify the definition.
- High Availability (HA): Two servers, same data center, quick failover, and usually little or no data loss.
- Disaster Recovery (DR): Two servers, separate data centers, quick failover, little or no data loss or some possible data loss depending on purpose.
Using only two servers, these modes are mutually exclusive. You are either High Availability or Disaster Recovery. You can’t have a second server in the same data center (HA) as your database and simultaneously in a geographically dispersed data center (DR). You choose one or the other.
IBM’s pureScale is the Gold Standard for recoverability offering the possibility of High Availability and Disaster Recovery in one solution. However, this is a more expensive option, requiring many servers and a shared-disk architecture that may be foreign to a DB2 database administrator.
The use of multiple servers in your HADR configuration is a simpler solution.
Please note: Additional HADR servers beyond the principal failover server are commonly referred to as “auxiliary servers”. For the purposes of this article, I will refer to them as secondary and tertiary servers to help walk the reader through various examples.
With this design, two things are accomplished. First, high availability is handled by the principal standby server (Server B), within the same data center, via automatic failover through TSAMP. You receive a robust high availability option with TSAMP automatically handling failover and NEARSYNC providing little or no data loss.
Second, disaster recovery is accomplished to a geographically dispersed data center. Failover to auxiliary HADR servers [secondary (SERVER C) or tertiary (SERVER D)] is done manually with some possible data loss with a SUPERASYNC setting. However, most clients understand this to be more of a doomsday scenario where slower recovery time or data loss is tolerated. In this case, the goal is to be functional for a major issue, not instantaneous recovery.
Configuring a Multiple Server HADR Cluster
This article is not meant to be a primer for DB2’s native HADR ability. The assumption is you know of the feature, are aware of its purpose, and have a basic understanding on how to configure, activate, and monitor a simple two-server HADR cluster. If you need a primer, I have just the article for you.
The main concept you will need to understand beyond the concepts involved with a two-server HADR implementation is the HADR Target List (HADR_TARGET_LIST). This essentially tells each HADR server what role other servers play when it is the primary database server. The standby servers use the list on the primary to know what role they are currently playing.
Setting up multiple HADR standby servers is no different than a traditional two-server standby with the exception of this database configuration setting.
Setting this database configuration setting was easy, keeping all my hostnames, ports, and roles straight was not.
For reference, here is the example configuration for the beginning of this article.
- Server A – Primary Database Server, Texas Data Center
- Server B – Principal Standby Database Server, Texas Data Center
- Server C – Secondary Standby Database Server, Washington Data Center
- Server D – Tertiary Standby Database Server, Washington Data Center
In the end, the configuration setting will be set with the following syntax:
UPDATE DB CFG FOR <DBNAME> USING hadr_target_list "SERVER B:PORTB|SERVER C:PORTC|SERVER D:PORTD";
When you look at your database configuration setting, you will see something similar to:
HADR target list (HADR_TARGET_LIST) = SERVER B:PORTB|SERVER C:PORTC|SERVER D:PORTD
The catch is as you move through setting each server’s HADR_TARGET_LIST you need to roll out your specific hostname/port and roll in the previous servers hostname/port. The first database server in the list will be the principal standby when the server it is set on is the primary database server. The example below will help clarify what I mean.
To help with this exercise, I would first make a list of all servers and ports, in order you wish to failover.
SERVER A:PORTA|SERVER B:PORTB|SERVER C:PORTC|SERVER D:PORTD
As you prepare your HADR_TARGET_LIST setting, remove the server you are on. For example:
- Server A – SERVER B:PORTB|SERVER C:PORTC|SERVER D:PORTD
- Server B – SERVER A:PORTA|SERVER C:PORTC|SERVER D:PORTD
- Server C – SERVER A:PORTA|SERVER B:PORTB|SERVER D:PORTD
- Server D – SERVER A:PORTA|SERVER B:PORTB|SERVER C:PORTC
It may be a good habit to set this parameter in a two cluster configuration as well. Many multiple server configurations originally start as a two cluster configuration that expands. If HADR_TARGET_LIST is already set, you can incorporate an auxiliary server online. If it is not set, it will require a database deactivate/activate to take effect.
Another topic to grasp is the role of HADR_SYNCMODE within a multiple server cluster. No matter how confused you may become when looking at various output you can always fall back to this concept.
- The primary database and principal standby database can use any of the HADR sync modes. The mode use is determined by the setting of the HADR_SYNCMODE configuration parameter.
- Communication with all other standby servers must be SUPERASYNC on DB2 10.5 and 10.1.
Period. That’s it.
What you are actually setting is the sync mode that should occur between the primary database server and the principal standby server if that standby ever takes over and is the new primary database.
In the example above, the primary database server (SERVER A) is communicating via SUPERASYNC to the tertiary standby server (SERVER D). If failover to the tertiary standby server (SERVER D) is initiated the tertiary standby server takes over as primary and starts to communicate with the old primary database server (SERVER A) via NEARSYNC. All other connections to standbys servers become SUPERASYNC.
Setting your HADR Database Configuration Parameters
As you work through HADR setup. There are a few caveats to be aware of:
- You need a port for each server to use for HADR.
For example, 18819, 18820, 18821, 18822
- You need to make sure PEER_WINDOW is set if you plan on using TSAMP for automatic failover. TSAMP will not finish configuration without this. Check out Ember’s blog on PEER_WINDOW.
- Although HADR will set up without downtime, some non-critical but important database configuration parameters like LOGINDEXBUILD (and previously unconfigured HADR_TARGET_LIST) require a database restart to take effect.
Setting all the parameters while keeping your HADR_TARGET_LIST and various ports right gets tedious and is prone to error. My recommendation is to keep a text file with your expected settings and walk away after initially writing it. Come back an hour later with fresh eyes and double-check yourself.
A snippet of a text file I keep in my documentation (and used for initial setup) is below. Feel free to model your own file after this one.
Starting and Failover of the HADR Cluster
Starting HADR within the cluster is not really different than a simple two-server HADR configuration. The only difference is to start HADR on all standby servers first and then initiate HADR on the primary database server.
- Principal Standby – START HADR ON DATABASE <DBNAME> as STANDBY
- Secondary Standby – START HADR ON DATABASE <DBNAME> as STANDBY
- Tertiary Standby – START HADR ON DATABASE <DBNAME> as STANDBY
- Primary Database – START HADR ON DATABASE <DBNAME> as PRIMARY
Manually issuing a failover command is the same as a two-server cluster as well.
For example, on Standby X:
TAKEOVER HADR ON DATABASE <DBNAME>
Be careful when manually issuing takeover to the secondary and tertiary standby servers. Because HADR_SYNCMODE must be SUPERASYNC you will never be listed in PEER state. You will need to determine the amount of lag, if any, or risk data loss. You can look at output from the “db2pd” tool to help determine if your logs are relatively in sync.
Interpreting Status of the HADR Cluster
Although there are multiple ways to determine the status of your HADR cluster, the command I lean on the most is the “db2pd” command. To me, it quickly provides the most detail in a format that can be visually parsed. It also runs without a database connection.
When using “db2pd” there are a few idiosyncrasies to watch for in the primary database server output. Output from the standby servers will be relatively the same.
The biggest difference is the lengthy output when issued on the primary database server. This is due to DB2 showing you what it understands that server’s role to be in comparison to all of the other standby servers.
This means on the primary server you will see three separate and distinct pages of output. In our example (primary database with three standby servers) one page of “db2pd” output will be displayed for each server the primary database server is communicating with.
Some unusual differences (when issuing from the primary database server):
- DB2 is displaying what it understands the primary database server’s role to be within the configuration. So you will see the HADR_ROLE = PRIMARY listed on all three pages of db2pd output.
- The first page of db2pd output will show a different HADR_SYNCMODE for the principal standby server, while all other pages of db2pd output will read HADR_SYNCMODE = SUPERASYNC.Pages 2+ (representing connectivity to each second and tertiary standby servers) will read HADR_STATE = REMOTE_CATCHUP. This is due to the mandatory SUPERASYNC sync mode for additional standby servers on DB2 10.5 and 10.1. It will always be in this state.Where page 1 (representing connectivity to principal standby) will display HADR_STATE = PEER.
- You can reference STANDBY_ID within each section of output to view DB2’s understanding of the HADR_TARGET_LIST order.
TSAMP and the Multiple Server HADR Configuration
TSAMP is compatible with a multiple server cluster, but will only work with automating failover between the primary database server and the principal standby server.
TSAMP will NOT automate failover with any standby servers beyond principal standby server. This is probably a good thing, as your additional standby servers are in SUPERASYNC mode and may not be in peer state. Therefore you would want manual failover commands issued.
As far as TSAMP is concerned, any server beyond the principal standby server does not exist. Those nodes could be removed or added, stopped or started, without proactively managing TSAMP .
HADR is not HARD
To borrow from Dale McInnis, HADR is not HARD. When I initially looked at setting up a multiple server cluster I was overwhelmed. Things looked to be much more complicated, when in the end it was nothing more than a glorified two server HADR cluster.
This is not complex, it’s tedious. Keeping your hostnames, ports, and target list order takes some diligence while you work server to server. Knowing what is normal in “db2pd” output or why servers communicate a certain way is the only real learning curve.
Once the cluster is set up, it’s easy to manage, and provides your client the unusual ability of high availability and disaster recovery at the same time.
Michael Krafick is an IBM Champion and occasional contributor to db2commerce.com. He has been a DB2 LUW DBA for over 15 years in data warehousing, transactional OLTP environments, and e-commerce databases. He was recently inducted into the IDUG “Speaker Hall of Fame” and given the IBM “DB2 Community Leader Award” in 2015. Michael also has extensive experience in setting up monitoring configurations for DB2 Databases as well as preparing for high availability failover, backup, and recovery. He can be reached at “Michael.Krafick (at) icloud (dot) com”. Linked-in Profile: http://www.linkedin.com/in/michaelkrafick. Twitter: @mkrafick