Architecting High-Availability and Disaster Recovery Solutions with DB2
NOTE: Please do not attempt to architect a high-availability solution without a DB2 expert involved. There are many questions and answers needed beyond those here to make the best decision and ask all the tough questions.
High Availability vs. Disaster Recovery
High Availability and Disaster Recovery often have different goals and priorities. High Availability can be considered the ability to keep a database available with little to no data loss during common problems. Disaster Recovery usually protects against more major and unusual failures, but may take longer to complete and carry a somewhat higher risk of data loss. Meeting goals for either may have an impact on database performance depending on hardware and network details. I once heard a client frame Disaster Recovery on a call as “What if we lost Dallas? The whole City?”
When embarking on a design process, the first thing to lay out is what your goals are for RTO and RPO. Often you may have different goals for High Availability vs. Disaster Recovery, and you may also define different percentages of uptime required for planned outages vs. unplanned outages. Some of these may be defined in an SLA (Service Level Agreement) either internally to your organization or to an external customer. Always design your system(s) for less downtime than you promise the client to allow for unexpected scenarios. When talking through goals, it can be useful to come up with specific scenarios that you want to protect against, such as disk failure, major RAID or SAN failure, network failure, server component failure, power failure, natural disaster, or human error.
RTO (Recovery Time Objective)
This goal is expressed in units of time. If there is an outage, how soon are you expected to have the database up and running? Usually whether this is achieved via the High Availability solution or the Disaster Recovery solution is determined by the type of failure that caused the initial outage. There may be different Recovery Time Objectives defined for High Availability events versus Disaster Recovery events, or the same objective may be defined for both.
RPO (Recovery Point Objective)
The Recovery Point Objective is also defined as a duration of time. But in this case, it is the maximum amount of recent data, measured in time, that you can afford to lose in the case of a major failure. This usually applies more strongly to Disaster Recovery plans, as High Availability often accounts for less than a few seconds of data loss, if any. Often RPO will be defined for Disaster Recovery in hours.
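To make the two objectives concrete, here is a minimal Python sketch that records separate RTO and RPO targets for High Availability and Disaster Recovery events and checks an outage against them. The class name, function name, and all of the sample numbers are illustrative assumptions, not values from any particular SLA.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    """Targets for one class of event (hypothetical structure)."""
    rto: timedelta  # maximum time until the database is back up
    rpo: timedelta  # maximum amount of recent data that may be lost


def meets_objective(objective, time_to_recover, data_lost):
    """Return True when an outage stayed within both the RTO and the RPO."""
    return time_to_recover <= objective.rto and data_lost <= objective.rpo


# Illustrative targets only: tight for HA events, looser for DR events.
ha_objective = RecoveryObjective(rto=timedelta(minutes=5), rpo=timedelta(seconds=5))
dr_objective = RecoveryObjective(rto=timedelta(hours=4), rpo=timedelta(hours=1))

# A failover that took 2 minutes and lost no data is within the HA targets.
print(meets_objective(ha_objective, timedelta(minutes=2), timedelta(0)))
# A site rebuild that took 6 hours exceeds the 4-hour DR RTO.
print(meets_objective(dr_objective, timedelta(hours=6), timedelta(minutes=30)))
```

Writing the targets down per event type like this makes it harder to quietly apply the High Availability numbers to a Disaster Recovery scenario, or the reverse.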
One of the most common challenges I see in defining High Availability and Disaster Recovery requirements is an executive who has become familiar with the terms “four nines” or “five nines” and asks for this level of availability. Each Nine breaks down this way:
| Nines | % Uptime | Time Per Year Down |
| One Nine | 90% | 36.5 days |
| Two Nines | 99% | 3.65 days |
| Three Nines | 99.9% | 8.76 hours |
| Four Nines | 99.99% | 52.56 minutes |
| Five Nines | 99.999% | 5.26 minutes |
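The figures in the table come from simple arithmetic on a 365-day year. The short Python sketch below shows the calculation; the function name is my own, and it ignores leap years.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(nines):
    """Allowed downtime in hours per year for a given number of nines."""
    unavailable_fraction = 10 ** -nines  # e.g. 3 nines -> 0.001
    return HOURS_PER_YEAR * unavailable_fraction


for nines in range(1, 6):
    hours = downtime_hours_per_year(nines)
    print(f"{nines} nines: {hours:.4f} hours ({hours * 60:.2f} minutes)")
```

Three nines works out to 8.76 hours and four nines to 52.56 minutes, matching the table.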
There are several things to keep in mind with these numbers. First is that as you increase the uptime, the cost goes up in at least an exponential manner. I can accomplish one nine quite easily with one database server and some offsite data movement (depending on RPO). If I had to guarantee Five Nines, my knee-jerk implementation would involve 10 servers.
Second, the uptime may be defined either over unplanned down time or over both planned and unplanned down time. Since very few methods allow for an on-line version-level upgrade of DB2, that alone pushes us back to three nines depending on database size, or to some more complicated implementations.
Third, uptime may be defined at the application level, so the database’s share of any down-time may be smaller than you think. If the infrastructure team, the database team, and the application team all plan 50 minutes of down time per year, they’re unlikely to be able to overlap all of that, and unlikely to make a 99.99% uptime goal.
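The three-team example is easy to check with a few lines of Python. This sketch assumes the worst case, where none of the teams' downtime overlaps; the function name and team labels are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def combined_uptime_percent(downtime_minutes_by_team):
    """Application-level uptime if no team's downtime overlaps another's."""
    total_down = sum(downtime_minutes_by_team.values())
    return 100 * (1 - total_down / MINUTES_PER_YEAR)


planned = {"infrastructure": 50, "database": 50, "application": 50}
uptime = combined_uptime_percent(planned)
print(f"Worst-case uptime: {uptime:.4f}%")
print("Meets 99.99%:", uptime >= 99.99)
```

With 150 minutes of combined downtime against a four-nines budget of about 52.56 minutes, the application lands near 99.97%, so the teams would have to overlap almost all of their outages to hit the goal.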
There is also the definition of what “downtime” is. My database can be up and available 99.99% of the time, but if an application pounds it with volumes of bad SQL and it becomes too slow to use, that could be considered downtime. The same workload that achieves 99.99% uptime on one set of hardware could be unable to even hit three nines on significantly less powerful hardware.
High Availability Options
I don’t pretend to cover all options, but want to offer a few of the more common ones.
HADR
The first solution that comes to mind that meets many implementations’ needs for High Availability is HADR. HADR allows us to have a second (or third or fourth) DB2 server that is rolling forward through transaction logs and can be failed over to in a matter of minutes or seconds. This failover can even be initiated automatically when a problem is detected, using TSAMP (often included with DB2 Licensing). Often the HADR standby servers can be licensed at just 100 PVU instead of the full PVU count for the servers. Usually the standby database servers are not connectable, but in some limited situations, read-only traffic can be directed to them if you’re willing to fully license the database server. Please verify all licensing details with IBM.
HADR has the advantage that it is often free for use with DB2 and it is easy to set-up, monitor, and administer for qualified personnel. It also can be configured to share almost nothing, protecting against a very wide range of failures including disk/RAID/SAN failures. I’ve seen two different situations where a client lost an entire RAID array at once on a production database server, and HADR was a lifesaver for their data.
For High-Availability implementations with HADR, the two database servers should be connected by a high-speed network and be no more than 100 km apart. SYNC and NEARSYNC HADR modes are appropriate for High Availability.
Fixpacks can be applied in a rolling fashion with the only downtime being up to a few minutes for a failover. Full DB2 version upgrades require a full outage.
HADR can also run on most hosting providers with no special configurations other than additional network interfaces for high-throughput systems.
HADR can often be implemented by any DB2 DBA with the assistance of operating system and network personnel with average skill levels.
Shared Disk Outside of DB2
I have seen this solution successfully used. Two servers are configured to share one set of disks using software such as HACMP, Power-HA, RHCS, Veritas, TSAMP, or others. The disk is only mounted on one server at a time, and if a failure is detected, the disk is unmounted from one server and mounted on another server.
The thing I don’t like about this solution is that the disk is a frequent point of failure for database systems, and this implementation does not protect against disk failure. If you go with this option, please make sure your RAID Array or SAN is extremely robust with as many parity disks as you can handle and disks manufactured at different times are used, and very detailed monitoring is done to immediately act on any single disk failure. It’s also advisable that you couple this with a Disaster-Recovery solution.
Many hosting providers will support this kind of solution, but may charge more to do so.
This type of shared-disk solution also requires the support of a talented expert in whatever software you use – A DBA and typical system administrators may not be able to do this without help.
PureScale
PureScale is DB2’s answer to RAC. It is a shared-disk implementation that allows for multiple active database servers at the same time. A typical configuration would include three database servers and two “CF” servers to control load distribution and locking. It is a very robust solution, mostly appropriate for OLTP and other transaction processing databases. The complexity is an order of magnitude beyond HADR, and you will need a talented DB2 DBA or consultant to work with your internal network and system administration teams. It uses GPFS and a high-speed interconnect between the servers (RoCE or InfiniBand). You can use TCP/IP for test environments.
Many hosting providers cannot easily support this – you have to ensure that yours will. IBM-related hosting providers like SoftLayer tend to be easier to talk to about this. Earlier versions had hardware restrictions that are progressively being eased as time goes on.
If you have a good relationship with IBM or a contractor who does, additional help may be available from IBM for this.
Unless you’re engaging IBM Lab services, all database servers should be in the same data center unless you’re combining this with HADR.
DB2 Replication (SQL, Q, CDC)
DB2 Replication can also be used to establish a standby database server. With Replication, it is best to have one server defined as the master and the others defined as slaves. One advantage is that you can actually have different indexing on your standby if you’re using it for read-only reporting.
Replication is complicated to set up, and if you have too much transactional volume for SQL replication to handle, the licensing costs for Q-rep or CDC can be very significant. Since it is set up on a table-by-table basis, the DBA effort and the time to implement depend heavily on the number of tables involved. It can also be complicated to integrate with HADR.
The big drivers for choosing replication as an option are the need to access more than one server or to use a second server as a reporting server, and also the fact that this is the only method that will allow you to do a nearly-online version-level upgrade. That is a BIG plus, and any solution for 5 nines or for 4 nines including planned downtime would likely include this.
Disaster Recovery Options
Many of the options from above can also be used as Disaster Recovery options with a few tweaks. But if you’re trying to meet both High Availability and Disaster Recovery goals, you’re nearly always going to be looking at 3 or more database servers. Two servers is rarely going to be able to do both High Availability and Disaster Recovery.
HADR
When used for Disaster Recovery, HADR should incorporate a distance between the database servers – certainly in different data centers and often in different areas of the country. HADR SYNCMODES that are appropriate for Disaster Recovery include ASYNC and SUPERASYNC.
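The sync mode guidance in this article (SYNC or NEARSYNC for High Availability, ASYNC or SUPERASYNC for Disaster Recovery) can be captured in a small lookup. The Python sketch below is only a planning aid: the function names are my own, `SAMPLE` is a placeholder database name, and the command it builds sets the `HADR_SYNCMODE` database configuration parameter. Verify the details against IBM's documentation for your DB2 version.

```python
# Sync mode guidance as stated in this article; confirm against IBM
# documentation for your DB2 version before relying on it.
SYNCMODES_BY_GOAL = {
    "HA": ("SYNC", "NEARSYNC"),
    "DR": ("ASYNC", "SUPERASYNC"),
}


def suggested_syncmodes(goal):
    """Return the HADR sync modes the article considers appropriate."""
    return SYNCMODES_BY_GOAL[goal.upper()]


def syncmode_command(db_name, syncmode):
    """Build the DB2 CLP command that sets HADR_SYNCMODE for a database."""
    valid = {mode for modes in SYNCMODES_BY_GOAL.values() for mode in modes}
    if syncmode not in valid:
        raise ValueError(f"unknown HADR sync mode: {syncmode}")
    return f"db2 update db cfg for {db_name} using HADR_SYNCMODE {syncmode}"


print(suggested_syncmodes("dr"))
# SAMPLE is a placeholder database name.
print(syncmode_command("SAMPLE", "SUPERASYNC"))
```

The sketch only builds the command string rather than running it, since changing the sync mode is something you want a DBA to do deliberately.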
Shared Disk Outside of DB2
Shared disk clusters with a geographic factor are much harder to implement and often include some sort of disk-level replication between disks in two data centers. This relies on how fast the disk replication is as a major component, and involves some complicated fail-over scenarios for virtual IP addresses.
PureScale
PureScale can be geographically distributed, but the experts say that it is very complicated and should not be done without engaging IBM Lab Services. Even then, the cluster cannot be distributed across more than 100 km, and it requires a very fast network connection.
DB2 Replication (SQL, Q, CDC)
DB2 replication is also a good choice for Disaster Recovery, assuming a very fast network connection.
Solutions Combining High Availability and Disaster Recovery
My favorite combination is probably a three or four server HADR cluster with NEARSYNC between the primary database server and the principal standby, and SUPERASYNC with a tertiary database server in a different data center. This is economical and easily meets three nines if properly architected and on robust hardware.
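As a rough illustration of that layout, the Python sketch below models a three-server cluster and checks that it has both a near-synchronous standby for High Availability and an asynchronous standby at another site for Disaster Recovery. The host names and data center labels are hypothetical, and treating "same data center" as the High Availability test is a simplifying assumption of mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str           # hypothetical host name
    role: str           # "primary" or "standby"
    data_center: str    # hypothetical site label
    syncmode: str = ""  # HADR sync mode for a standby; empty for the primary


def covers_ha_and_dr(servers):
    """Check for a SYNC/NEARSYNC standby at the primary's site (HA) and an
    ASYNC/SUPERASYNC standby at a different site (DR)."""
    primary = next(s for s in servers if s.role == "primary")
    standbys = [s for s in servers if s.role == "standby"]
    has_ha = any(
        s.data_center == primary.data_center and s.syncmode in ("SYNC", "NEARSYNC")
        for s in standbys
    )
    has_dr = any(
        s.data_center != primary.data_center and s.syncmode in ("ASYNC", "SUPERASYNC")
        for s in standbys
    )
    return has_ha and has_dr


cluster = [
    Server("db-a", "primary", "site-1"),
    Server("db-b", "standby", "site-1", "NEARSYNC"),
    Server("db-c", "standby", "site-2", "SUPERASYNC"),
]
print(covers_ha_and_dr(cluster))      # all three servers
print(covers_ha_and_dr(cluster[:2]))  # primary plus local standby only
```

Dropping the remote standby makes the check fail, which is the same point made earlier: two servers rarely cover both High Availability and Disaster Recovery.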
I have also seen shared-disk clusters outside of DB2 used for high availability, with HADR used in ASYNC between two data centers. That seemed to work well enough.
PureScale can now be used with HADR to more easily meet both HA and DR requirements, but remember there will still be downtime for DB2 version level upgrades.
Replication can be used in combination with any of the other options.
Are Database Backups Still Needed?
One of the frequent questions I get with High Availability or Disaster Recovery solutions is “Do I even need database backups?” My response is that you absolutely do. On at least one occasion, I have had a database backup save my client’s data even when HADR was used. One reason is human error, which is among the most common causes of restores or failovers. Most replication/HADR/shared disk solutions will immediately replicate a human error from one server to another. Another reason is all the failure scenarios that you didn’t plan for: it is hard to imagine everything that can go wrong, and a backup with transaction log files can go miles towards being ready for the unexpected. Backups can also be useful for data movement to a test, development, or QA environment.
There are a number of High Availability and Disaster Recovery solutions. Knowing your minimum requirements and needs is critical to architecting a cost-effective and robust solution.