Using TSA/db2haicu to automate failover Part 4: Dealing with Problems After Setup

14 Responses

  1. Henry says:

    Situation:
    The server that holds the standby database went down. After it came back up,
    the resources show Control=SuspendedPropagated,
    but there is no lock on the resource group.
    What should I do to remove this flag?

    Thank you.
    DB21085I Instance "db2pb1" uses "64" bits and DB2 code release "SQL09075" with
    level identifier "08060107".
    Informational tokens are "DB2 v9.7.0.5", "special_28492", "IP23285_28492", and
    Fix Pack "5".
    Product is installed at "/db2/db2pb1/db2_software".
    arlpb1ci:db2pb1 7> oslevel -s
    7100-01-05-1228

    Online IBM.ResourceGroup:db2_db2pb1_db2pb1_PB1-rg Nominal=Online
        |- Online IBM.Application:db2_db2pb1_db2pb1_PB1-rs Control=SuspendedPropagated
            |- Online IBM.Application:db2_db2pb1_db2pb1_PB1-rs:arlpsap11
            '- Offline IBM.Application:db2_db2pb1_db2pb1_PB1-rs:arlpsap12
        |- Online IBM.ServiceIP:db2ip_10_180_0_111-rs Control=SuspendedPropagated
            |- Online IBM.ServiceIP:db2ip_10_180_0_111-rs:arlpsap11
            '- Offline IBM.ServiceIP:db2ip_10_180_0_111-rs:arlpsap12
        '- Online IBM.ServiceIP:db2ip_10_194_6_209-rs Control=SuspendedPropagated
            |- Online IBM.ServiceIP:db2ip_10_194_6_209-rs:arlpsap11
            '- Offline IBM.ServiceIP:db2ip_10_194_6_209-rs:arlpsap12
    Resource Group Information:
    Resource Group Name = db2_db2pb1_db2pb1_PB1-rg
    Resource Group LockState = Unlocked
    Resource Group OpState = Online
    Resource Group Nominal OpState = Online
    Number of Group Resources = 3
    Number of Allowed Nodes = 2

    • Ember Crooks says:

      The only series of steps I have to try are the ones in this blog entry. Did you resolve this? Sorry for the late response, I was taking a vacation – camping with the family.

  2. Gene Torres says:

    On the Pending Online issue, my problems were as follows:

    Softdog issues:
    I viewed the lssam output and could see that the instance on db2prod02 was showing “Pending online”. The reason is a third-party watchdog module that prevents IBM’s cluster software from loading its own (only one watchdog module can be active on a given server). The syslog shows the problem:

    Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: Loading watchdog softdog, timeout = 8000 ms.
    Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: Found loaded iTCO_vendor_support with count 1
    Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: iTCO_vendor_support has a use count of 1 and cannot be unloaded

    The “iTCO_vendor_support” module needs to be disabled (preferably uninstalled). You should check db2prod01 as well so there is no unexpected issue in the future. This is the advice I asked Adam to pass on to you last Friday. It looks like you’re still working on this, with your SysAdmin I’m assuming.

    Once the instance is able to reach an “Online” state, db2haicu will be able to add HADR databases again.

    And then there were permissions issues getting db2haicu to run:

    I had to do the following to get it to work, as well as do an HADR takeover, before it would let me add secondary and tertiary databases into the cluster. On the primary, it refused to add databases into the cluster, reporting this error:

    2014-02-27-15.11.02.709792-420 E51459483E655 LEVEL: Error
    PID : 28178 TID : 139851322767136 PROC : db2haicu
    INSTANCE: atlinst NODE : 000
    FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlhaUICreateHADR, probe:900
    RETCODE : ECF=0x9000056F=-1879046801=ECF_SQLHA_HADR_VALIDATION_FAILED
    The HADR DB failed validation before being added to the cluster
    MESSAGE : Please verify that HADR_REMOTE_INST and HADR_REMOTE_HOST are correct
    and in the exact format and case as the Standby instance name and
    hostname.
    DATA #1 : String, 7 bytes
    atlinst
    DATA #2 : String, 9 bytes
    db2prod02

    On new instances, I would get the following technote issue regarding db2havend and the library file:

    http://www-01.ibm.com/support/docview.wss?uid=swg21649212

    Also had issue on CT_MANAGEMENT_SCOPE:

    http://www-01.ibm.com/support/docview.wss?uid=swg1IC64785
    db2set DB2_DIRECT_IO=false
    export CT_MANAGEMENT_SCOPE=2

    But the main hurdle, which I spent all of last Fri/Sat night on, was:
    - change setuid permissions on db2havend(s) and lib32
    - http://www-01.ibm.com/support/docview.wss?uid=swg21649212

    MUST BE:
    -r-sr-xr-x 1 root db2inst1 4642211 Apr 3 18:17 db2havend
    -r-sr-xr-x 1 root db2inst1 3990657 Apr 3 18:17 db2havend32

    lrwxrwxrwx 1 root root 14 Apr 11 13:10 libdb2tsa.so -> libdb2tsa.so.1
    -r-xr-xr-x 1 bin bin 152529 Mar 19 01:32 libdb2tsa.so.1

    Check by using:
    ls -l | grep db2have

    Fix by using:

    chmod 555 on libdb2tsa.so.1 in dir sqllib/lib64
    chmod 4555 on db2havend and db2havend32 in sqllib/adm

    Thank you, Ember. Your post did help me. It was not the same issue, but it was good to know I wasn’t alone.
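A minimal sketch of the permission check described in the comment above, assuming a POSIX shell. The function names `check_setuid` and `report_havend` are hypothetical, and the directory you pass in (for example the instance's sqllib/adm) is your own assumption to fill in. It tests only the setuid bit, not the root ownership shown in the listing.

```shell
#!/bin/sh
# check_setuid FILE: succeed only if FILE is a regular file with the setuid bit set.
check_setuid() {
    [ -f "$1" ] && [ -u "$1" ]
}

# report_havend DIR: report the setuid status of db2havend and db2havend32 in DIR.
# (Hypothetical helper; it does not verify that the owner is root.)
report_havend() {
    dir=$1
    for name in db2havend db2havend32; do
        if check_setuid "$dir/$name"; then
            echo "OK      $dir/$name"
        else
            echo "PROBLEM $dir/$name (missing or not setuid)"
        fi
    done
}
```

It could be called as `report_havend "$HOME/sqllib/adm"` when run as the instance owner. Any PROBLEM line points at a file that may need the chmod 4555 fix described above.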

  3. milind Taralkar says:

    Hi Ember,

    Can you please let me know what can be done in the situation below?

    Failed offline IBM.ResourceGroup:db2_tdbin02_tdbin02_XXX-rg Nominal=Online
        |- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs
            |- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs:IDOCTOHADR01
            '- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs:IDOCTOHADR02
        '- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs
            |- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs:IDOCTOHADR01
            '- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs:IDOCTOHADR02

    When I try to switch over from server 1 to server 2, some of the databases go into Failed Offline mode. There are 14 databases in one instance.

    • Ember Crooks says:

      Does only one database go into failed offline or all 14? Do you have all 14 fully configured in TSAMP? How are you doing the failover: through the TAKEOVER command or db2haicu?

      Multiple databases on one instance can be problematic with TSAMP – especially when using the VIP as you are, as you have to ensure that all databases fail over at the same time or you have to define different virtual IP addresses for each database.

      • milind Taralkar says:

        Hi Ember,

        I’m doing the failover using the db2haicu command.
        All 14 databases are configured in TSAMP with different VIPs. Out of the 14, sometimes 3 or 4 databases go into Failed Offline mode.

  4. harsha says:

    Hi Ember,

    How long will the standby take to take over if the primary fails, when using TSA with DB2?

    • Ember Crooks says:

      Maximum time should be hadr_peer_window plus hadr_timeout. The actual failover, once initiated, depends on volume, but is frequently less than 30 seconds.
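As a worked example of the estimate above, here is a small sketch that adds the two values as they appear in `db2pd -hadr` output. The function name `max_failover_seconds` is hypothetical, and the sketch assumes the output uses the `HADR_TIMEOUT(seconds) = N` and `PEER_WINDOW(seconds) = N` line format.

```shell
#!/bin/sh
# max_failover_seconds: read `db2pd -hadr` output on stdin and print
# HADR_TIMEOUT + PEER_WINDOW in seconds. Exits non-zero if either
# field is missing from the input.
max_failover_seconds() {
    awk -F'[ \t]*=[ \t]*' '
        /^[ \t]*HADR_TIMEOUT\(seconds\)/ { timeout = $2 }
        /^[ \t]*PEER_WINDOW\(seconds\)/  { window = $2 }
        END {
            if (timeout == "" || window == "") exit 1
            print timeout + window
        }'
}
```

Usage might look like `db2pd -db testdb -hadr | max_failover_seconds`, where the database name is only an illustration. With a timeout of 300 and a peer window of 300 it prints 600, that is, an upper bound of roughly ten minutes before takeover under this estimate.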

  5. Suvradeep Sensarma says:

    Hi Ember,
    I have gone through your article and it is very descriptive and easy to understand. However, I have recently run into a strange issue and I am unable to figure out what is going wrong. Any input would be very helpful.
    Recently the server that hosts the PRIMARY database of our HADR pair went down, but it did not automatically fail over to the STANDBY. I had to do a manual TAKEOVER. Once the PRIMARY came back up, I switched back to the original setup.
    To find out why automatic failover did not work, I first issued the lssam command and saw the unusual output below. The HADR database status shows as Pending Online and Unknown. Googling did not help much, but I found one link (http://www-01.ibm.com/support/docview.wss?uid=swg21961711) suggesting that TSAMP is not able to monitor the DB2 HADR status. I tried running the db2pd -hadr command as root, and it works perfectly fine.
    Can you please suggest what can be done to diagnose this further?

    Pending online IBM.ResourceGroup:db2_dbins371_dbins371_DSIMPR-rg Nominal=Online
        |- Unknown IBM.Application:db2_dbins371_dbins371_DSIMPR-rs
            |- Unknown IBM.Application:db2_dbins371_dbins371_DSIMPR-rs:server1
            '- Unknown IBM.Application:db2_dbins371_dbins371_DSIMPR-rs:server2

    • Ember Crooks says:

      TSAMP state problems can be difficult. You can try running db2haicu and see if it just needs to be enabled after an extended outage. There are some suggestions on other approaches in my blog articles, but they should be used at your own risk.
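When diagnosing state problems like the ones in this thread, it can help to reduce the lssam output to only the lines that need attention. The following is a rough sketch, assuming the state names seen in the outputs above (Pending online, Failed offline, Unknown, and Control=Suspended flags). The function name `lssam_problems` is hypothetical.

```shell
#!/bin/sh
# lssam_problems: read lssam output on stdin and print only the lines that
# show a problem state or a suspended control flag. Returns non-zero when
# nothing matches, i.e. when no such lines were found.
lssam_problems() {
    grep -E 'Pending online|Failed offline|Unknown|Control=Suspended'
}
```

It would typically be used as `lssam | lssam_problems`. An empty result only means none of these particular patterns appeared; it is not a full health check of the domain.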

  6. miguel martin says:

    Your post about HADR and TSA was very helpful to me, and I would like to ask some questions.

    I have an HADR environment with TSA on DB2 v10.5, and it currently works well. I intend to add an auxiliary standby and add it to the TSA cluster. I have read that the TSA cluster does not support switching roles to a second standby, but in my test environment I have created a cluster with 3 nodes (primary, principal standby, and auxiliary standby):

    Online IBM.ResourceGroup:db2_db2inst1_db2inst1_TESTDB-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_TESTDB-rs
            |- Online IBM.Application:db2_db2inst1_db2inst1_TESTDB-rs:primary1
            '- Offline IBM.Application:db2_db2inst1_db2inst1_TESTDB-rs:standby1
        '- Online IBM.ServiceIP:db2ip_10_120_202_58-rs
            |- Online IBM.ServiceIP:db2ip_10_120_202_58-rs:primary1
            '- Offline IBM.ServiceIP:db2ip_10_120_202_58-rs:standby1
    Online IBM.ResourceGroup:db2_db2inst1_db2inst1_QADB-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_QADB-rs
            |- Online IBM.Application:db2_db2inst1_db2inst1_QADB-rs:primary1
            '- Offline IBM.Application:db2_db2inst1_db2inst1_QADB-rs:standby1
        '- Online IBM.ServiceIP:db2ip_10_120_202_59-rs
            |- Online IBM.ServiceIP:db2ip_10_120_202_59-rs:primary1
            '- Offline IBM.ServiceIP:db2ip_10_120_202_59-rs:standby1
    Online IBM.ResourceGroup:db2_db2inst1_primary1_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_primary1_0-rs
            '- Online IBM.Application:db2_db2inst1_primary1_0-rs:primary1
    Online IBM.ResourceGroup:db2_db2inst1_standby1_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_standby1_0-rs
            '- Online IBM.Application:db2_db2inst1_standby1_0-rs:standby1
    Online IBM.ResourceGroup:db2_db2inst1_standby2_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_standby2_0-rs
            '- Online IBM.Application:db2_db2inst1_standby2_0-rs:standby2
    Online IBM.Equivalency:db2_db2inst1_db2inst1_TESTDB-rg_group-equ
        |- Online IBM.PeerNode:primary1:primary1
        '- Online IBM.PeerNode:standby1:standby1
    Online IBM.Equivalency:db2_db2inst1_db2inst1_QADB-rg_group-equ
        |- Online IBM.PeerNode:primary1:primary1
        '- Online IBM.PeerNode:standby1:standby1
    Online IBM.Equivalency:db2_db2inst1_primary1_0-rg_group-equ
        '- Online IBM.PeerNode:primary1:primary1
    Online IBM.Equivalency:db2_db2inst1_standby1_0-rg_group-equ
        '- Online IBM.PeerNode:standby1:standby1
    Online IBM.Equivalency:db2_db2inst1_standby2_0-rg_group-equ
        '- Online IBM.PeerNode:standby2:standby2
    Online IBM.Equivalency:db2_public_network_0
        |- Online IBM.NetworkInterface:eth1:standby1
        |- Online IBM.NetworkInterface:eth1:primary1
        '- Online IBM.NetworkInterface:eth1:standby2
    Online IBM.Equivalency:db2_public_network_1
        |- Online IBM.NetworkInterface:eth0:standby1
        |- Online IBM.NetworkInterface:eth0:primary1
        '- Online IBM.NetworkInterface:eth0:standby2
    [db2inst1@primary1 ~]$ db2pd -db deltas -hadr

    But something goes wrong when I switch roles from the primary to the auxiliary standby. I run the takeover manually from the auxiliary standby with “db2 takeover hadr on db testdb”. The command executes successfully:

    HADR_ROLE = PRIMARY
    REPLAY_TYPE = PHYSICAL
    HADR_SYNCMODE = NEARSYNC
    STANDBY_ID = 1
    LOG_STREAM_ID = 0
    HADR_STATE = PEER
    HADR_FLAGS =
    PRIMARY_MEMBER_HOST = standby2
    PRIMARY_INSTANCE = db2inst1
    PRIMARY_MEMBER = 0
    STANDBY_MEMBER_HOST = primary1
    STANDBY_INSTANCE = db2inst1
    STANDBY_MEMBER = 0
    HADR_CONNECT_STATUS = CONNECTED
    HADR_CONNECT_STATUS_TIME = 10/03/2017 07:49:06.422433 (1507042146)
    HEARTBEAT_INTERVAL(seconds) = 30
    HEARTBEAT_MISSED = 0
    HEARTBEAT_EXPECTED = 83
    HADR_TIMEOUT(seconds) = 300
    TIME_SINCE_LAST_RECV(seconds) = 0
    PEER_WAIT_LIMIT(seconds) = 0
    LOG_HADR_WAIT_CUR(seconds) = 0.000
    LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000000
    LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.000
    LOG_HADR_WAIT_COUNT = 0
    SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 16384
    SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 87380
    PRIMARY_LOG_FILE,PAGE,POS = S0000002.LOG, 86, 49264705
    STANDBY_LOG_FILE,PAGE,POS = S0000002.LOG, 86, 49264705
    HADR_LOG_GAP(bytes) = 0
    STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000002.LOG, 86, 49264705
    STANDBY_RECV_REPLAY_GAP(bytes) = 0
    PRIMARY_LOG_TIME = 10/03/2017 08:18:32.000000 (1507043912)
    STANDBY_LOG_TIME = 10/03/2017 08:18:32.000000 (1507043912)
    STANDBY_REPLAY_LOG_TIME = 10/03/2017 08:18:32.000000 (1507043912)
    STANDBY_RECV_BUF_SIZE(pages) = 512
    STANDBY_RECV_BUF_PERCENT = 0
    STANDBY_SPOOL_LIMIT(pages) = 13000
    STANDBY_SPOOL_PERCENT = 0
    STANDBY_ERROR_TIME = NULL
    PEER_WINDOW(seconds) = 300
    PEER_WINDOW_END = 10/03/2017 08:26:24.000000 (1507044384)
    TAKEOVER_APP_REMAINING_PRIMARY = 0
    READS_ON_STANDBY_ENABLED = N

    And suddenly it goes down:

    db2pd -db testdb -hadr

    Database TESTDB not activated on database member 0 or this database name cannot be found in the local database directory.

    And the primary automatically took over control of the database, and I don’t know why.

    I would like to know the correct way to make the auxiliary standby the new primary.

    • Ember Crooks says:

      Adding a third node to the TSAMP domain is not an officially supported solution from IBM. It is also not one that I have ever attempted to implement. One of the reasons is that often the auxiliary standby(s) are used for DR purposes and not HA. Often they are in a geographically separate location with a similar set of application servers that would be activated if they were ever used. Usually communication between the app servers and the database servers if they were across a significant distance would not achieve reliable or acceptable performance for the end-user. I have a number of clients using auxiliary standbys, but never integrating them into the TSAMP domain.

      I suspect you have some sort of problem in the TSAMP setup that is causing TSAMP to issue commands to re-establish the primary. Clustering scripts can get complex if you’ve ever tried to build them yourself – I suspect there is additional scripting you would need to do for the cluster to make this work.

  1. September 25, 2013

    […] As stated before, I wish there was an option on db2haicu that basically said “I’ve fixed the original problem, reset the TSA states”. This one is a bit easier than the problem and reset I describe in Using TSA/db2haicu to automate failover Part 4: Dealing with Problems After Setup […]
