Using TSA/db2haicu to automate failover Part 4: Dealing with Problems After Setup

Most of what you’ll need to set up and test TSA using db2haicu is in my first few posts on the topic:
Using TSA/db2haicu to automate failover – Part 1: The Preparation
Using TSA/db2haicu to automate failover – Part 2: How it looks if it goes smoothly
Using TSA/db2haicu to Automate Failover Part 3: Testing, Ways Setup can go Wrong and What to do.

But there is one ongoing issue that I’ve seen and thought I would share. Most of the time, it comes from not shutting down the two database servers in the right order when both have to go down. Most of my clients never, ever, ever shut down both servers at once anyway.

TSA States

From the time you first get db2haicu set up, you should be looking at the states of the TSA resources and resource groups, so you know what looks normal for your implementation. I’ve found minor differences in different implementations done in the same way – I don’t know if that’s tied to the Fix Pack or what, but there are a few different things that can be normal.

Viewing States Using TSA Commands as Root

On one of my systems, this is what normal TSA states look like in lssam output:

Online IBM.ResourceGroup:db2_db2inst1_Prod-db1.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs:Prod-db1
Online IBM.ResourceGroup:db2_db2inst1_Prod-db2.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs:Prod-db2
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCP01-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCP01-rs
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:Prod-db1
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:Prod-db2
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:Prod-db1
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:Prod-db2
Online IBM.Equivalency:db2_db2inst1_Prod-db1.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db1.adomain.com:Prod-db1
Online IBM.Equivalency:db2_db2inst1_Prod-db2.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db2.adomain.com:Prod-db2
Online IBM.Equivalency:db2_db2inst1_db2inst1_WCP01-rg_group-equ
        |- Online IBM.PeerNode:Prod-db1.adomain.com:Prod-db1
        '- Online IBM.PeerNode:Prod-db2.adomain.com:Prod-db2
Online IBM.Equivalency:db2_public_network_0
        |- Online IBM.NetworkInterface:bond0:Prod-db2
        '- Online IBM.NetworkInterface:bond0:Prod-db1

Now, if you’re viewing that on Linux, the “Online”s are all green, and the expected “Offline”s are all blue. If there’s a problem it will be in red.

This is my favorite way of looking at it. The red highlighting made it easy to understand if there was a problem, even when I understood very little about what it all meant.
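
If you capture lssam output to a file or send it through a pipe, the colors may not come along, so here is a minimal sketch of a filter that prints any line whose state is not plain Online or Offline. The function name flag_problem_states is my own invention, and it assumes the tree layout shown above. Treat it as a starting point, not a supported tool.

```shell
# Hypothetical helper: reads lssam output on stdin and prints every resource
# line whose state is something other than Online or Offline.
# Exit status is 1 if any such line was printed, 0 otherwise.
flag_problem_states() {
    esc=$(printf '\033')
    # Strip ANSI color sequences, then the tree-drawing prefix (|- or '-).
    sed -e "s/${esc}\[[0-9;]*m//g" -e "s/^[[:space:]]*[|']-[[:space:]]*//" |
    awk '
        /^(Online|Offline) IBM\./ { next }           # expected states
        / IBM\./                  { print; bad = 1 } # e.g. Pending online, Failed offline
        END                       { exit bad + 0 }
    '
}
```

Run as root, usage would be something like: lssam | flag_problem_states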

Viewing States Using db2pd

You can also use db2pd to look at the states. I’m not as big of a fan of this method, but I think it’s a matter of preference. Here’s what the same system as above looks like using that method:

$ db2pd -d wc005p01 -ha

Option -ha is an instance scope option.  The database option has been ignored.
           DB2 HA Status
Instance Information:
Instance Name                  = db2inst1
Number Of Domains              = 1
Number Of RGs for instance     = 2

Domain Information:
Domain Name                    = prod_db2ha
Cluster Version                = 3.1.0.3
Cluster State                  = Online
Number of nodes                = 2

Node Information:
Node Name                     State
---------------------         -------------------
Prod-db1.adomain.com         Online
Prod-db2.adomain.com          Online

Resource Group Information:
Resource Group Name            = db2_db2inst1_db2inst1_WCP01-rg
Resource Group LockState       = Unlocked
Resource Group OpState         = Online
Resource Group Nominal OpState = Online
Number of Group Resources      = 2
Number of Allowed Nodes        = 2
   Allowed Nodes
   -------------
   Prod-db1.adomain.com
   Prod-db2.adomain.com
Member Resource Information:
   Resource Name                  = db2_db2inst1_db2inst1_WCP01-rs
   Resource State                 = Online
   Resource Type                  = HADR
   HADR Primary Instance          = db2inst1
   HADR Secondary Instance        = db2inst1
   HADR DB Name                   = WCP01
   HADR Primary Node              = Prod-db1.adomain.com
   HADR Secondary Node            = Prod-db2.adomain.com

   Resource Name                  = db2ip_172_12_12_12-rs
   Resource State                 = Online
   Resource Type                  = IP

Resource Group Name            = db2_db2inst1_Prod-db1.adomain.com_0-rg
Resource Group LockState       = Unlocked
Resource Group OpState         = Online
Resource Group Nominal OpState = Online
Number of Group Resources      = 1
Number of Allowed Nodes        = 1
   Allowed Nodes
   -------------
   Prod-db1.adomain.com
Member Resource Information:
   Resource Name                  = db2_db2inst1_Prod-db1.adomain.com_0-rs
   Resource State                 = Online
   Resource Type                  = DB2 Partition
   DB2 Partition Number           = 0
   Number of Allowed Nodes        = 1
      Allowed Nodes
      -------------
      Prod-db1.adomain.com

Network Information:
Network Name                  Number of Adapters
-----------------------       ------------------
db2_public_network_0          2

   Node Name                     Adapter Name
   -----------------------       ------------------
   Prod-db2                      bond0
   Prod-db1                      bond0

Quorum Information:
Quorum Name                                  Quorum State
------------------------------------         --------------------
Operator                                     Offline
db2_Quorum_Network_172_10_10_10:11_36_34     Online
Fail                                         Offline

I guess I can see how this method might be more understandable. But it doesn’t highlight problems in red!

It also has the advantage of being something you can execute as the db2 instance owner rather than as root.
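
Since db2pd does not highlight anything, one way to get a similar "is anything wrong?" answer is to compare each resource group's OpState with its Nominal OpState. The sketch below does that against the field names shown in the output above. The function name rg_state_mismatches is hypothetical, and I have only matched it against the layout in this post.

```shell
# Hypothetical helper: reads `db2pd -ha` output on stdin and prints each
# resource group whose OpState differs from its Nominal OpState.
# Exit status is 1 if any mismatch was found, 0 otherwise.
rg_state_mismatches() {
    awk -F'= *' '
        function trim(s) { sub(/[ \t\r]+$/, "", s); return s }
        /^[ \t]*Resource Group Name/            { name = trim($2) }
        /^[ \t]*Resource Group OpState/         { op = trim($2) }
        /^[ \t]*Resource Group Nominal OpState/ {
            nominal = trim($2)
            if (op != nominal) {
                printf "%s: OpState=%s Nominal=%s\n", name, op, nominal
                bad = 1
            }
        }
        END { exit bad + 0 }
    '
}
```

As the instance owner, usage would be something like: db2pd -ha | rg_state_mismatches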

Changing States

So, what do you do if things are highlighted in red?

Well, the first course of action is to check into HADR. First make sure that neither database is waiting on the other to start. Then verify that HADR shows as “Connected” in “Peer” state with little or no log gap, using db2pd -d <dbname> -hadr:

$ db2pd -d wcp01 -hadr

Database Partition 0 -- Database WCP01 -- Active -- Up 71 days 16:06:16 -- Date 01/29/2013 20:33:23

HADR Information:
Role    State                SyncMode HeartBeatsMissed   LogGapRunAvg (bytes)
Primary Peer                 Nearsync 0                  1238

ConnectStatus ConnectTime                           Timeout
Connected     Mon Nov 19 04:27:21 2012 (1353320841) 120

PeerWindowEnd                         PeerWindow
Tue Jan 29 20:37:59 2013 (1359513479) 300

LocalHost                                LocalService
Prod-db1.adomain.com                     18819

RemoteHost                               RemoteService      RemoteInstance
Prod-db2.adomain.com                     18820              db2inst1

PrimaryFile  PrimaryPg  PrimaryLSN
S0009993.LOG 9847       0x00000081FF427CE6

StandByFile  StandByPg  StandByLSN
S0009993.LOG 9846       0x00000081FF426FF3
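
If you want to script that check, here is a rough sketch that pulls State, ConnectStatus, and LogGapRunAvg out of the tabular layout shown above. The function name hadr_is_healthy is made up for this example. It assumes the header-line-then-value-line format in this output; I believe newer DB2 releases print the HADR fields one per line instead, which this would not parse.

```shell
# Hypothetical helper: reads `db2pd -d <dbname> -hadr` output (the tabular
# layout shown above) on stdin, prints a one-line summary, and succeeds only
# if State is Peer and ConnectStatus is Connected.
hadr_is_healthy() {
    awk '
        $1 == "Role" && $2 == "State" { want = "state"; next }
        $1 == "ConnectStatus"         { want = "conn";  next }
        want == "state" && NF         { state = $2; gap = $5; want = "" }
        want == "conn"  && NF         { conn = $1; want = "" }
        END {
            printf "State=%s ConnectStatus=%s LogGapRunAvg=%s\n", state, conn, gap
            exit !(state == "Peer" && conn == "Connected")
        }
    '
}
```

Usage would be something like: db2pd -d wcp01 -hadr | hadr_is_healthy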

If HADR is working properly, then you may want to try to disable and re-enable db2haicu.

Finally, if your situation matches the one below, you can try (at your own risk) the following procedure.

“Pending Online”

This is an issue that pops up sometimes with a running setup. If you ever have to take both servers down, please follow the steps in section 7 of this document: http://download.boulder.ibm.com/ibmdl/pub/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf. If you don’t, you’re likely to leave TSA in an inconsistent state and then spend a while fighting with it. I’m going to share the steps that I use to get TSA out of this pending online state, but please note that they can be extremely dangerous. If you don’t understand what you’re doing, you probably don’t want to use them; contact IBM to see whether these steps are appropriate for you. Use at your own risk. I got these steps from a colleague who got them from support, but support later told him they might be dangerous.

You’ll need to run these as root. Even if your instance owner can run lssam, you still need root for the rest of these commands.

After you have verified that HADR is properly running, look at the states of the resources to ensure that your problem matches the one I am describing:

> su - root
# lssam
Pending online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs Control=SuspendedPropagated
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs Control=SuspendedPropagated
                |- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs Control=SuspendedPropagated
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02

After you have confirmed that this matches your issue, find which is the master in the resource group:

# lssamctrl -V
Starting to list SAM Control information.
lssamctrl: Executed on Fri Apr 22 12:45:08 2011 at "dbserver01", master node "dbserver01".
Displaying SAM Control information:
SAMControl:
        TimeOut                = 60
        RetryCount             = 3
        Automation             = Auto
        ExcludedNodes          = {}
        ResourceRestartTimeOut = 5
        ActiveVersion          = [3.1.0.1,Fri Mar 11 16:10:54 EST 2011]
        EnablePublisher        = Disabled
        TraceLevel             = 31
        ActivePolicy           = []
        CleanupList            = {}
        PublisherList          = {}
Completed Listing SAM Control information.
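
The master node name sits in the "Executed on" line of that output. If you are scripting around this, a small sketch like the following can pull it out. The function name master_node is hypothetical, and it relies on the exact wording of the line shown above.

```shell
# Hypothetical helper: reads `lssamctrl -V` output on stdin and prints the
# master node name taken from the "Executed on ..." line.
master_node() {
    sed -n 's/.*master node "\([^"]*\)".*/\1/p' | head -n 1
}
```

Usage would be something like: lssamctrl -V | master_node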

That told us: master node “dbserver01”
Now, on the master node, get the process id for the recovery manager:

# ps -ef |grep -i recoveryrm  
    root  7929864  3866752   0   Apr 07      -  0:36 /usr/sbin/rsct/bin/IBM.RecoveryRMd

Now kill that process id:

# kill 7929864

Next, confirm that the recovery manager starts a new process:

# ps -ef |grep -i recoveryrm 
    root  7929866  3866752   1 12:54:17      -  0:00 /usr/sbin/rsct/bin/IBM.RecoveryRMd 

Validate that the “In Config State” is TRUE:

# lssrc -ls IBM.RecoveryRM |grep "In Config State"
   In Config State      : TRUE

Now see the changes in status. The Pending online state is now Offline, the Nominal state has changed to Offline, and the Control=SuspendedPropagated flag has been removed:

# lssam  
Offline IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Offline
        |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs
                |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Offline IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Offline
        '- Offline IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Offline IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Offline IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Offline
        '- Offline IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Offline IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02

Now issue commands to properly set the Resource Groups – first set the Resource Group online for the Master server, and then set it online for the Standby server:

# chrg -o online db2_db2inst1_dbserver01_0-rg 
# chrg -o online db2_db2inst1_dbserver02_0-rg 

Check the status again, and note the differences – the Resource groups at the bottom now show as online:

# lssam
Offline IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Offline
        |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs
                |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02

Now, set the Resource Group online for the Database:

# chrg -o online db2_db2inst1_db2inst1_WCQ01-rg

You may note a Lock state while the Resource Group switches to ONLINE:

# lssam  
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs Control=SuspendedPropagated
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02

After a bit, everything should show as normal again:

# lssam 
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02
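
Rather than re-running lssam by hand while you wait, you could poll it. This is only a sketch: both function names are my own, the retry count and sleep interval are arbitrary assumptions, and "normal" here just means every IBM.ResourceGroup line starts with Online.

```shell
# Hypothetical helper: reads lssam output on stdin and succeeds only if at
# least one IBM.ResourceGroup line is present and all of them are Online.
all_groups_online() {
    esc=$(printf '\033')
    # Strip ANSI color sequences before looking at the state column.
    sed "s/${esc}\[[0-9;]*m//g" |
    awk '
        /IBM\.ResourceGroup:/ { seen = 1; if ($1 != "Online") bad = 1 }
        END { exit ((seen && !bad) ? 0 : 1) }
    '
}

# Poll lssam until the groups settle or the attempts run out.
# The defaults (30 tries, 10 seconds apart) are arbitrary assumptions.
wait_for_groups_online() {
    tries=${1:-30}
    while [ "$tries" -gt 0 ]; do
        if lssam | all_groups_online; then
            return 0
        fi
        sleep "${2:-10}"
        tries=$((tries - 1))
    done
    return 1
}
```

As root, usage would be something like: wait_for_groups_online 30 10 && echo "all resource groups Online"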

What TSA Looks Like if HADR is Simply Down

Always make sure you get HADR up before digging into TSA states. It looks similar (but slightly different) if HADR is just down. Notice the “Request=Lock” that’s in there – that’s different than the issue above.

Online IBM.ResourceGroup:db2_db2inst1_Prod-db1.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_Prod-db2.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs:dbserver02
Pending online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCP01-rg Request=Lock Nominal=Online
        |- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs Control=SuspendedPropagated
                |- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:dbserver02
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.Equivalency:db2_db2inst1_Prod-db1.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db1.adomain.com:dbserver01
Online IBM.Equivalency:db2_db2inst1_Prod-db2.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db2.adomain.com:dbserver02
Online IBM.Equivalency:db2_db2inst1_db2inst1_WCP01-rg_group-equ
        |- Online IBM.PeerNode:Prod-db1.adomain.com:dbserver01
        '- Online IBM.PeerNode:Prod-db2.adomain.com:dbserver02
Online IBM.Equivalency:db2_public_network_0
        |- Online IBM.NetworkInterface:bond0:dbserver02
        '- Online IBM.NetworkInterface:bond0:dbserver01
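
Going only by the two listings in this post, the quickest thing to compare is whether the "Pending online" resource group line carries Request=Lock. Here is a small sketch that lists pending groups along with that flag. The function name pending_groups and the "no-lock" label are made up for this example, and the presence or absence of the lock is a hint to go check HADR, not a diagnosis.

```shell
# Hypothetical helper: reads lssam output on stdin and, for each resource
# group in "Pending online", prints its name and whether Request=Lock is set.
pending_groups() {
    esc=$(printf '\033')
    # Strip ANSI color sequences before matching the state.
    sed "s/${esc}\[[0-9;]*m//g" |
    awk '
        /^Pending online IBM\.ResourceGroup:/ {
            lock = ($0 ~ /Request=Lock/) ? "Request=Lock" : "no-lock"
            name = $3
            sub(/^IBM\.ResourceGroup:/, "", name)
            print name, lock
        }
    '
}
```

As root, usage would be something like: lssam | pending_groups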

I’d love to hear problems that others have encountered and how you’ve resolved them to help others! Leave a comment with your situation and solution.

Other Posts In This Series

This series consists of four posts:
Using TSA/db2haicu to automate failover – Part 1: The Preparation
Using TSA/db2haicu to automate failover – Part 2: How it looks if it goes smoothly
Using TSA/db2haicu to Automate Failover Part 3: Testing, Ways Setup can go Wrong and What to do.
Using TSA/db2haicu to automate failover Part 4: Dealing with Problems After Setup

Search this blog on “TSA” for other posts on TSA issues and tips.

10 Responses

  1. Henry says:

    Situation:
    The server that holds the standby database went down. After it came back up,
    you can see Control=SuspendedPropagated,
    with no lock on the resource group.
    What should I do to remove this flag?

    Thank you.
    DB21085I Instance “db2pb1” uses “64” bits and DB2 code release “SQL09075” with
    level identifier “08060107”.
    Informational tokens are “DB2 v9.7.0.5”, “special_28492”, “IP23285_28492”, and
    Fix Pack “5”.
    Product is installed at “/db2/db2pb1/db2_software”.
    arlpb1ci:db2pb1 7> oslevel -s
    7100-01-05-1228

    Online IBM.ResourceGroup:db2_db2pb1_db2pb1_PB1-rg Nominal=Online
            |- Online IBM.Application:db2_db2pb1_db2pb1_PB1-rs Control=SuspendedPropagated
                    |- Online IBM.Application:db2_db2pb1_db2pb1_PB1-rs:arlpsap11
                    '- Offline IBM.Application:db2_db2pb1_db2pb1_PB1-rs:arlpsap12
            |- Online IBM.ServiceIP:db2ip_10_180_0_111-rs Control=SuspendedPropagated
                    |- Online IBM.ServiceIP:db2ip_10_180_0_111-rs:arlpsap11
                    '- Offline IBM.ServiceIP:db2ip_10_180_0_111-rs:arlpsap12
            '- Online IBM.ServiceIP:db2ip_10_194_6_209-rs Control=SuspendedPropagated
                    |- Online IBM.ServiceIP:db2ip_10_194_6_209-rs:arlpsap11
                    '- Offline IBM.ServiceIP:db2ip_10_194_6_209-rs:arlpsap12
    Resource Group Information:
    Resource Group Name = db2_db2pb1_db2pb1_PB1-rg
    Resource Group LockState = Unlocked
    Resource Group OpState = Online
    Resource Group Nominal OpState = Online
    Number of Group Resources = 3
    Number of Allowed Nodes = 2

    • Ember Crooks says:

      The only series of steps I have to try are the ones in this blog entry. Did you resolve this? Sorry for the late response, I was taking a vacation – camping with the family.

  2. Gene Torres says:

    On the Pending Online issue, my problems were as follows:

    Softdog issues:
    I viewed the lssam output and can see that the instance on db2prod02 is showing “Pending online”. The reason for this is a 3rd party watchdog module that is preventing IBM’s cluster software from loading its own (there can only be one watchdog module active for a given server). The syslog shows the problem:

    Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: Loading watchdog softdog, timeout = 8000 ms.
    Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: Found loaded iTCO_vendor_support with count 1
    Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: iTCO_vendor_support has a use count of 1 and cannot be unloaded

    The “iTCO_vendor_support” module needs to be disabled (preferably uninstalled). You should check db2prod01 as well so there is no unexpected issue in the future. This is the advice I asked Adam to pass on to you last Friday. Looks like you’re still working on this, with your SysAdmin I’m assuming.

    Once the instance is able to reach an “Online” state, db2haicu will be able to add HADR databases again.

    and then just permissions issues getting db2haicu to run:

    I had to do the following to get it to work as well as to do a hadr takeover before it would let me add secondary and tertiary db’s into the cluster. On the primary, it would refuse to add databases into the cluster stating a problem with error:

    2014-02-27-15.11.02.709792-420 E51459483E655 LEVEL: Error
    PID : 28178 TID : 139851322767136 PROC : db2haicu
    INSTANCE: atlinst NODE : 000
    FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlhaUICreateHADR, probe:900
    RETCODE : ECF=0x9000056F=-1879046801=ECF_SQLHA_HADR_VALIDATION_FAILED
    The HADR DB failed validation before being added to the cluster
    MESSAGE : Please verify that HADR_REMOTE_INST and HADR_REMOTE_HOST are correct
    and in the exact format and case as the Standby instance name and
    hostname.
    DATA #1 : String, 7 bytes
    atlinst
    DATA #2 : String, 9 bytes
    db2prod02

    On new instances, I would get the following technote issue regarding db2havend and the library file:

    http://www-01.ibm.com/support/docview.wss?uid=swg21649212

    Also had issue on CT_MANAGEMENT_SCOPE:

    http://www-01.ibm.com/support/docview.wss?uid=swg1IC64785
    db2set DB2_DIRECT_IO=false
    export CT_MANAGEMENT_SCOPE=2

    But my main hurdle I spent all of last Fri/Sat night on was:
    - change setuid permissions on db2havend(s) and lib32
    - http://www-01.ibm.com/support/docview.wss?uid=swg21649212

    MUST BE:
    -r-sr-xr-x 1 root db2inst1 4642211 Apr 3 18:17 db2havend
    -r-sr-xr-x 1 root db2inst1 3990657 Apr 3 18:17 db2havend32

    lrwxrwxrwx 1 root root 14 Apr 11 13:10 libdb2tsa.so -> libdb2tsa.so.1
    -r-xr-xr-x 1 bin bin 152529 Mar 19 01:32 libdb2tsa.so.1

    check by using
    ls -l | grep db2have

    FIX by using:

    chmod 555 on libdb2tsa.so.1 in dir sqllib\lib64
    chmod 4555 on db2havend and db2havend64 in sqllib\adm

    Thank you as your post did help me… Not same issue but it did good to know I wasn’t alone … Thank you Ember

  3. milind Taralkar says:

    Hi Ember,

    Can you please let me know what can be done in below situation.

    Failed offline IBM.ResourceGroup:db2_tdbin02_tdbin02_XXX-rg Nominal=Online
            |- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs
                    |- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs:IDOCTOHADR01
                    '- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs:IDOCTOHADR02
            '- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs
                    |- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs:IDOCTOHADR01
                    '- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs:IDOCTOHADR02

    When I try to switch over from server 1 to server 2, some of the DBs go into Failed Offline mode. There are 14 DBs in one instance.

    • Ember Crooks says:

      Does only one database go into failed offline or all 14? Do you have all 14 fully configured in TSAMP? How are you doing the failover – through TAKEOVER command or db2haicu?

      Multiple databases on one instance can be problematic with TSAMP – especially when using the VIP as you are, as you have to ensure that all databases fail over at the same time or you have to define different virtual IP addresses for each database.

      • milind Taralkar says:

        Hi Ember,

        I’m doing the failover using the db2haicu command.
        All 14 DBs are configured in TSAMP with different VIPs. Out of the 14, sometimes 3 or 4 DBs go into Failed Offline mode.

  4. harsha says:

    hi Ember

    How much time will the standby take to take over if the primary fails, when using TSA with DB2?

    • Ember Crooks says:

      Maximum time should be hadr_peer_window plus hadr_timeout. The actual failover, once initiated, depends on volume, but is frequently less than 30 seconds.
