An Unfortunate Series of TSAMP Events

A story of failure and recovery.

Problem Discovery and Description

Sometimes, I think that I subconsciously knew that something was wrong. I woke up before 5 AM and couldn’t get back to sleep for no real reason that I could figure out. I gave up on sleep around 5:15 and went to take a shower. On the way to the shower, I read work email and found this:

[Image: TSAMP_issue]

At the same time, I received a note from the project manager indicating that the hosting provider had apparently rebooted all of our “PROD” and “UAT” database servers in the night in an attempt to uninstall HACMP.

This is a newer hosting provider for us, chosen by our client without much input from us. The databases in question do not yet support a “live” site; they are only being configured and used for development by developers from several companies. With tight timelines and globally distributed companies, they’re in use most of the time.

We had discovered two weeks earlier that the hosting provider was not taking any system level backups because the systems were not yet considered live. This was a surprise to us, as we are used to hosting providers providing basic services like backup from the moment servers are provided. We had brought it up as a risk and a problem with the hosting provider.

Anyway, I went straight to my home office and logged on to the server in question. The first issue I saw was that the many-line /etc/services file had been replaced by a simple two-line file that did not list any database entries at all. That’s an easy fix on its own, and the SA and I quickly worked to get the required lines back in place.
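A quick way to confirm that the required entries are back in a services file is to search for each expected service name. The following is only a minimal sketch: the function name check_services is hypothetical, and so are the service names in the usage note, so substitute whatever your own instances use.

```shell
#!/bin/sh
# Report any expected service names that are missing from a services file.
# Usage: check_services FILE NAME [NAME...]
# The names passed in are assumptions; use the ones your instances rely on.
check_services() {
    file=$1
    shift
    missing=0
    for name in "$@"; do
        # A service line starts with the name, then whitespace, then port/protocol.
        if ! grep -Eq "^${name}[[:space:]]+[0-9]+/(tcp|udp)" "$file"; then
            echo "missing: $name"
            missing=1
        fi
    done
    # Non-zero return means at least one entry was not found.
    return $missing
}
```

For example, `check_services /etc/services db2c_db2inst2 DB2_db2inst2` would print one "missing:" line per absent entry, where both names are placeholders rather than the real entries from these servers.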

Did I mention that this happened on Friday the 13th?

Once /etc/services was corrected, db2 still would not start, this time with:

db2start
06/16/2014 16:25:13     0   0   SQL1042C  An unexpected system error occurred.
SQL1032N  No start database manager command was issued.  SQLSTATE=57019

An error message that strikes fear into the heart of any DB2 DBA.

I found that I could get the db2 instance up on the primary server in each TSA/HADR cluster by doing the following. I don’t pretend to understand why, but doing parts of it in various combinations did not work.

  1. db2iupdt
  2. installSAM – which failed because it found TSA was already installed
  3. uninstallSAM – which failed because it thought TSA was still active
  4. installSAM – which failed because it found TSA was already installed
  5. db2iupdt

Once I had the db2 instance up, I could not use db2haicu -delete. My lssam on the primary at this point looked like this:

[Image: TSAMP_Issue2]

This was ugly. But it actually wasn’t as scary as what I got on the standby. I couldn’t get the standby DB2 instance to start with any combination of commands, and it similarly would not let me install or uninstall TSAMP. This is what I got from an lssam there:

[Image: TSAMP_issue3]

On the standby, any attempt at db2haicu failed because the DB2 instance was down.

Clearly, the hosting provider uninstalling HACMP had uninstalled some files that TSAMP uses, and had also altered /etc/services and wiped out most of the entries there. Because the hosting provider had not been taking system backups, there was no way to restore the system and apparently no rollback plan. Reportedly, the hosting provider was following a series of steps provided by IBM. To exacerbate the problem, the hosting provider performed these steps on 12 database servers in 6 HADR/TSAMP clusters at the same time.

Resolving the problem

After opening a PMR with DB2 support (and getting DB2 support to consult someone with TSA expertise), our system admin was able to get TSAMP to a point where I could uninstall it successfully. I uninstalled it and re-installed it. I was then still unable to start the DB2 instances. I got the same error:

db2start
06/16/2014 16:25:13     0   0   SQL1042C  An unexpected system error occurred.
SQL1032N  No start database manager command was issued.  SQLSTATE=57019

And in the db2diag.log, I found this:

2014-06-16-16.25.13.072800+000 E1823213A907         LEVEL: Error
PID     : 40960086             TID : 1              PROC : db2star2
INSTANCE: db2inst2             NODE : 000
HOSTNAME: redacted
EDUID   : 1
FUNCTION: DB2 UDB, high avail services, sqlhaGetObjectState2, probe:400
MESSAGE : ECF=0x90000552=-1879046830=ECF_SQLHA_OBJECT_DOES_NOT_EXIST
          Cluster object does not exist
DATA #1 : String, 35 bytes
Error during vendor call invocation
DATA #2 : unsigned integer, 4 bytes
29
DATA #3 : String, 28 bytes
db2_db2inst2_redacted02s_0-rs
DATA #4 : signed integer, 4 bytes
4
DATA #5 : unsigned integer, 4 bytes
1
DATA #6 : String, 0 bytes
Object not dumped: Address: 0x000000011018C324 Size: 0 Reason: Zero-length data
DATA #7 : unsigned integer, 8 bytes
1
DATA #8 : signed integer, 4 bytes
0
DATA #9 : String, 0 bytes
Object not dumped: Address: 0x000000011018B11C Size: 0 Reason: Zero-length data

2014-06-16-16.25.13.073839+000 E1824121A586         LEVEL: Error
PID     : 40960086             TID : 1              PROC : db2star2
INSTANCE: db2inst2             NODE : 000
HOSTNAME: redacted_02s
EDUID   : 1
FUNCTION: <0>, <0>, <0>, probe:1164
RETCODE : ECF=0x90000552=-1879046830=ECF_SQLHA_OBJECT_DOES_NOT_EXIST
          Cluster object does not exist
DATA #1 : String, 63 bytes
libsqlha: sqlhaGetObjectState() call error from wrapper library
DATA #2 : String, 0 bytes
Object not dumped: Address: 0x000000011018B11C Size: 0 Reason: Zero-length data
DATA #3 : signed integer, 4 bytes
0

2014-06-16-16.25.13.097492+000 E1824708A387         LEVEL: Error
PID     : 40960086             TID : 1              PROC : db2star2
INSTANCE: db2inst2             NODE : 000
HOSTNAME: redacted02s
EDUID   : 1
FUNCTION: DB2 UDB, high avail services, sqlhaSetStartPreconditions, probe:18246
RETCODE : ECF=0x90000557=-1879046825=ECF_SQLHA_CLUSTER_ERROR
          Error reported from Cluster

2014-06-16-16.25.13.097733+000 E1825096A465         LEVEL: Severe
PID     : 40960086             TID : 1              PROC : db2star2
INSTANCE: db2inst2             NODE : 000
HOSTNAME: redacted02s
EDUID   : 1
FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:5104
MESSAGE : ZRC=0x827300D4=-2106392364=HA_ZRC_CLUSTER_ERROR
          "Error reported from Cluster"
DATA #1 : String, 66 bytes
An error was encountered when interacting with the cluster manager
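
When a db2diag.log excerpt like the one above runs long, it can help to reduce it to the distinct symbolic return codes. This is a rough sketch using only awk and sort; the function name diag_codes is hypothetical, and it assumes each code appears in the ECF=hex=decimal=NAME or ZRC=hex=decimal=NAME form seen in this excerpt.

```shell
#!/bin/sh
# List the distinct symbolic return codes (ECF=... or ZRC=...) found in a
# db2diag.log-style file, one per line.
# Usage: diag_codes FILE
diag_codes() {
    awk '{
        # Find the code field, e.g. ECF=0x90000552=-1879046830=ECF_SQLHA_CLUSTER_ERROR
        if (match($0, /(ECF|ZRC)=0x[0-9A-Fa-f]+=-?[0-9]+=[A-Z0-9_]+/)) {
            code = substr($0, RSTART, RLENGTH)
            # The symbolic name is the fourth "="-separated part.
            split(code, parts, "=")
            print parts[4]
        }
    }' "$1" | sort -u
}
```

Run against the excerpt above, this should reduce the four entries to ECF_SQLHA_OBJECT_DOES_NOT_EXIST, ECF_SQLHA_CLUSTER_ERROR, and HA_ZRC_CLUSTER_ERROR, which makes it easier to see that every failure points at the cluster manager.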

I could not get db2haicu -delete to work at any point. Apparently in 10.1/10.5 the CLUSTER_MGR DBM cfg parameter became informational and can only be set through db2haicu. This meant that in order to un-set it so I could get back to a point where I could reconfigure TSAMP, I had to drop and re-create every DB2 instance. There were 14 of them.

I made sure I had database backups before undertaking this. I then did the following:

  1. db2cfexp backup.db2cfexp backup
  2. db2 get dbm cfg |tee dbmcfg.out
  3. db2 list db directory |tee dbdir.out
  4. db2set -all |tee db2set.out
  5. db2 list node directory
  6. switched to root and did a db2idrop
  7. re-created the instance with a db2icrt
  8. db2cfimp backup.db2cfexp
  9. Set the parameters not covered by cfexp. In my case, this included:
    1. all DFT_MON parameters
    2. SVCENAME
    3. SYSMON group
  10. I then compared the dbm cfg and db2set from before and after to make sure everything was fine
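
The comparison in the last step can be done with plain diff against the files captured earlier. The sketch below is one way to do it, not necessarily how I did it at the time; the function name compare_capture and the ".after" file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Compare a configuration capture taken before the instance was dropped with
# one taken after it was re-created. Prints the differing lines and returns
# non-zero if the two captures differ.
# Usage: compare_capture BEFORE_FILE AFTER_FILE
compare_capture() {
    before=$1
    after=$2
    if diff "$before" "$after" > /dev/null; then
        echo "no differences: $before vs $after"
        return 0
    fi
    echo "differences: $before vs $after"
    # Lines prefixed with '<' come from the "before" capture, '>' from "after".
    diff "$before" "$after" | grep '^[<>]'
    return 1
}
```

After re-creating the instance, something like `db2 get dbm cfg > dbmcfg.after` followed by `compare_capture dbmcfg.out dbmcfg.after` would list any parameters that still need to be set by hand; the same can be done for the db2set output.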

The db2cfimp re-cataloged the database for me, meaning I did not have to restore it.

After I had all of the DB2 instances re-created and started, I was then able to fully re-do the TSAMP configuration on all 6 HADR/TSAMP clusters. This is where I was so grateful I had taken complete documentation when I originally set up the clusters. I had Word documents with all the info I needed to re-do each and every cluster.

At some point, DB2 support referred to the approach I was taking as a sledgehammer approach. It may well have been, but having software partially uninstalled is a scary thing to me. How do I know I’m not missing some critical files somewhere? Also, when I asked, DB2 support confirmed that CLUSTER_MGR is not configurable other than through db2haicu, and that dropping and re-creating the DB2 instances was my only option at that point. I call on IBM to make the CLUSTER_MGR parameter configurable again! I could have saved several hours of work by not having to do that part, at least.

I thought I would share this issue in the hopes that it helps someone else. I don’t claim that the actions above are the best – if you’re in a similar scenario, please consult IBM support to get what you need to recover. With luck, others are taking system level backups or have hosting providers that have a rollback plan for every system change, so no one will ever encounter this but me.

6 Responses

  1. Éric Castro says:

    Hi Ember, great post and story.

    I did not know it’s not possible to set CLUSTER_MGR anymore except through db2haicu.

    I think what you could have done to completely erase the TSA configuration is to list and remove the domain as root. Like this:

    [root@server1 ~]# lsrpdomain
    Name OpState RSCTActiveVersion MixedVersions TSPort GSPort
    dpa_domain Pending online 3.1.5.2 No 12347 12348
    [root@server1 ~]#

    [root@server1 ~]# rmrpdomain -f dpa_domain

    I’ve done this a few times when I couldn’t use db2haicu to delete the TSA configuration. But the instance parameter CLUSTER_MGR is still there.

    I don’t know if it’s an IBM best practice, but it worked.

    • Ember Crooks says:

      I believe my SA did just that for me.

      • Kishore says:

        Hello Ember,

        Great post. Initially I tried your solution.

        When I had the same issue again, I carefully removed all the resources associated with the database and was able to bring the instance up.

        Thanks

  2. KR says:

    Hi,

    Nice article, great blog.

    Not sure if this was an option at your DB2 level (or would have helped even if it was) but I had a similar issue and the following allowed me to start DB2 again:

    db2haicu -disable

    I notice after running this, CLUSTER_MGR gets unset.
