A Better Understanding of TSA Resources and States

It is entirely possible to support DB2’s HADR with TSA automating failover without understanding many of the TSA components and details, or even the TSA commands. I’ve posted several blog entries on TSA issues that I resolved without any deeper knowledge. In this post, I hope to shed a bit more light on the details of TSA. This means I’ll be sharing commands that are TSA commands rather than DB2 commands, and that I’ve run as root.

Environment

The output I’m sharing in this blog entry comes from a system that I set up using the guidelines set forth in my HADR/TSA Series:
HADR
Using TSA/db2haicu to automate failover – Part 1: The Preparation
Using TSA/db2haicu to automate failover – Part 2: How it looks if it goes smoothly
Using TSA/db2haicu to Automate Failover Part 3: Testing, Ways Setup can go Wrong and What to do.

It’s a simple two-server cluster using a network quorum. I’m using DB2 10.5, FixPack 3.

Basic Status Check

My favorite way to check the status has always been lssam:
[Image: TSA_States_01]

And since I have to redact a fair amount in that image, here’s what that looks like in text:

Online IBM.ResourceGroup:db2_db2inst1_db2inst1_SAMPLE-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_SAMPLE-rs
                |- Online IBM.Application:db2_db2inst1_db2inst1_SAMPLE-rs:host01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_SAMPLE-rs:host02
        '- Online IBM.ServiceIP:db2ip_111_00_00_30-rs
                |- Online IBM.ServiceIP:db2ip_111_00_00_30-rs:host01
                '- Offline IBM.ServiceIP:db2ip_111_00_00_30-rs:host02
Online IBM.ResourceGroup:db2_db2inst1_host01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_host01_0-rs
                '- Online IBM.Application:db2_db2inst1_host01_0-rs:host01
Online IBM.ResourceGroup:db2_db2inst1_host02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_host02_0-rs
                '- Online IBM.Application:db2_db2inst1_host02_0-rs:host02
Online IBM.Equivalency:db2_db2inst1_db2inst1_SAMPLE-rg_group-equ
        |- Online IBM.PeerNode:host01:host01
        '- Online IBM.PeerNode:host02:host02
Online IBM.Equivalency:db2_db2inst1_host01_0-rg_group-equ
        '- Online IBM.PeerNode:host01:host01
Online IBM.Equivalency:db2_db2inst1_host02_0-rg_group-equ
        '- Online IBM.PeerNode:host02:host02
Online IBM.Equivalency:db2_public_network_0
        |- Online IBM.NetworkInterface:en12:host02
        '- Online IBM.NetworkInterface:en12:host01

The first thing I check is the color. Things should generally be green, with a few areas of blue. The blue items are “offline”, but that’s ok, because a database or IP can only be online on one of the nodes at a time. If for some reason I can’t see the color (for example, if a teammate sends me the text), then I check three things: that each larger group shows as “Online”, that there are no groups where both items in the group are “Offline”, and that there are no occurrences of the terms “SuspendedPropagated” or “Pending online”, both of which indicate common issues.
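
When all I have is the text, part of that check can be scripted. Here is a minimal sketch, not an official TSA tool: check_lssam is a name I made up, and it assumes color-free lssam text on stdin (pasted text, or output from lssam -nocolor if your version has that option). It only covers the top-level state and the two state names mentioned above, not the "both members Offline" case.

```shell
# check_lssam (hypothetical helper): read lssam text on stdin, print any
# lines that look wrong, and return non-zero if at least one was found.
# Example use on a cluster node: lssam -nocolor | check_lssam
check_lssam() {
    awk '
        # The two states called out in the text as signs of common problems.
        /SuspendedPropagated|Pending online/ { print "STATE: " $0; bad = 1; next }
        # Top-level lines (no indentation) should begin with "Online".
        /^[A-Za-z]/ && $1 != "Online"        { print "GROUP: " $0; bad = 1 }
        END { exit bad + 0 }
    '
}
```

A clean run prints nothing and returns 0, so it can be chained with || to raise an alert.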

But those are very basic, visual cues. What are we really looking at here?

Domain

It is unlikely that the domain is going to appear to be offline, but here’s how to check it:

$ lsrpdomain
Name       OpState RSCTActiveVersion MixedVersions TSPort GSPort
UATec_db2h Online  3.1.4.4           No            12347  12348

If the domain were offline, you could start it with:

startrpdomain UATec_db2h

You can also stop the domain:

stoprpdomain UATec_db2h

Starting and stopping at this level is not something you’re likely to do, except perhaps during an upgrade.
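
If you want a script to confirm the domain state rather than reading it by eye, something like the following could work. It is only a sketch: domain_is_online is a hypothetical helper name, and it assumes the column layout shown in the lsrpdomain output above (header row first, domain name in column 1, OpState in column 2).

```shell
# domain_is_online NAME (hypothetical helper): read lsrpdomain output on
# stdin and succeed only if the row for NAME shows OpState "Online".
# Example use: lsrpdomain | domain_is_online UATec_db2h || echo "domain is not online"
domain_is_online() {
    awk -v name="$1" '
        # Skip the header row, then look for the requested domain.
        NR > 1 && $1 == name { found = 1; if ($2 == "Online") ok = 1 }
        END { exit !(found && ok) }
    '
}
```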

Resource Groups

A resource is something that can be controlled by TSA (hardware or software). A resource group is a virtual group of resources. In the lssam output, we see that there is a resource group that contains the database and the virtual or floating IP. These are in the same resource group because they should always fail over together. This is the output from above that represents that resource group:

Online IBM.ResourceGroup:db2_db2inst1_db2inst1_SAMPLE-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_SAMPLE-rs
                |- Online IBM.Application:db2_db2inst1_db2inst1_SAMPLE-rs:host01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_SAMPLE-rs:host02
        '- Online IBM.ServiceIP:db2ip_111_00_00_30-rs
                |- Online IBM.ServiceIP:db2ip_111_00_00_30-rs:host01
                '- Offline IBM.ServiceIP:db2ip_111_00_00_30-rs:host02

Note that I can also limit the lssam output to a single resource group using the -g option (lssam -g db2_db2inst1_db2inst1_SAMPLE-rg). This is a useful way of filtering the output to look quickly at the most interesting of my resource groups.
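
The -g option only helps when you can run lssam yourself. For lssam text that someone has sent you, a similar filter can be approximated with awk. This is a rough sketch: show_group is a made-up name, and it assumes the layout shown above, where each top-level line starts in column 1 and its members are indented.

```shell
# show_group NAME (hypothetical helper): read lssam text on stdin and print
# only the block that belongs to resource group NAME.
# Example use: show_group db2_db2inst1_db2inst1_SAMPLE-rg < lssam_output.txt
show_group() {
    awk -v grp="$1" '
        # Each unindented line starts a new block; keep it only if it names our group.
        /^[A-Za-z]/ { show = (index($0, " IBM.ResourceGroup:" grp " ") > 0) }
        show
    '
}
```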

There is also a resource group for the db2 instance on the primary server and a resource group for the db2 instance on the standby server. Both of these should be online, simultaneously, unless you have an instance down for maintenance. This is the output from above that represents those two resource groups:

Online IBM.ResourceGroup:db2_db2inst1_host01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_host01_0-rs
                '- Online IBM.Application:db2_db2inst1_host01_0-rs:host01
Online IBM.ResourceGroup:db2_db2inst1_host02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_host02_0-rs
                '- Online IBM.Application:db2_db2inst1_host02_0-rs:host02

The last remaining section of the lssam output shows us Equivalencies. Equivalencies are fixed sets of resources that provide the same function. A good example of this is the network interface card, which is en12 in the output above. There is one of these on each of the servers in our cluster. The other resources may only use one of these at a time, and it’s not something that TSA can fail over.

Floating IP Address

Additional information is available on resources. For the floating IP address, for example, we can get more detail like this:

$ lsrsrc -Ab IBM.ServiceIP
Resource Persistent and Dynamic Attributes for IBM.ServiceIP
resource 1:
        Name              = "db2ip_111_00_00_30-rs"
        ResourceType      = 0
        AggregateResource = "0x2029 0xffff 0x454eac36 0x0cb8029a 0x9377a68d 0xa5d2f010"
        IPAddress         = "111.00.00.30"
        NetMask           = "255.255.255.0"
        ProtectionMode    = 1
        NetPrefix         = 0
        ActivePeerDomain  = "UATec_db2h"
        NodeNameList      = {"host01"}
        OpState           = 1
        ConfigChanged     = 0
        ChangedAttributes = {}
resource 2:
        Name              = "db2ip_111_00_00_30-rs"
        ResourceType      = 1
        AggregateResource = "0x3fff 0xffff 0x00000000 0x00000000 0x00000000 0x00000000"
        IPAddress         = "111.00.00.30"
        NetMask           = "255.255.255.0"
        ProtectionMode    = 1
        NetPrefix         = 0
        ActivePeerDomain  = "UATec_db2h"
        NodeNameList      = {"host01","Unknown_Node_Name"}
        OpState           =
        ConfigChanged     =
        ChangedAttributes =
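
That output is long for what I usually want to know, so it can help to boil it down to one line per resource. The sketch below is my own illustration, not part of TSA: summarize_resources is a hypothetical name, and it assumes the "Attribute = value" layout shown above. An OpState that is blank in the output is printed as "-".

```shell
# summarize_resources (hypothetical helper): read "lsrsrc -Ab <class>" output
# on stdin and print one line per resource: name, ResourceType, OpState.
# Example use: lsrsrc -Ab IBM.ServiceIP | summarize_resources
summarize_resources() {
    awk '
        function emit() {
            if (name != "") print name, "type=" type, "opstate=" (state == "" ? "-" : state)
        }
        # A "resource N:" line starts a new record, so print the previous one.
        /^resource [0-9]+:/               { emit(); name = type = state = "" }
        $1 == "Name" && $2 == "="         { name = $3; gsub(/"/, "", name) }
        $1 == "ResourceType" && $2 == "=" { type = $3 }
        $1 == "OpState" && $2 == "="      { state = $3 }
        END { emit() }
    '
}
```

The same function should work unchanged on the IBM.Application class, since it only looks at the Name, ResourceType, and OpState lines.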

Application Details

Additional information, including the exact scripts that TSA uses to manage db2, is available with this command:

$ lsrsrc -Ab IBM.Application
Resource Persistent and Dynamic Attributes for IBM.Application
resource 1:
        Name                  = "db2_db2inst1_db2inst1_SAMPLE-rs"
        ResourceType          = 0
        AggregateResource     = "0x2028 0xffff 0x454eac36 0x0cb8029a 0x9377a677 0x8bef7c78"
        StartCommand          = "/usr/sbin/rsct/sapolicies/db2/hadrV105_start.ksh db2inst1 db2inst1 SAMPLE"
        StopCommand           = "/usr/sbin/rsct/sapolicies/db2/hadrV105_stop.ksh db2inst1 db2inst1 SAMPLE"
        MonitorCommand        = "/usr/sbin/rsct/sapolicies/db2/hadrV105_monitor.ksh db2inst1 db2inst1 SAMPLE"
        MonitorCommandPeriod  = 21
        MonitorCommandTimeout = 29
        StartCommandTimeout   = 900
        StopCommandTimeout    = 900
        UserName              = "root"
        RunCommandsSync       = 1
        ProtectionMode        = 1
        HealthCommand         = ""
        HealthCommandPeriod   = 10
        HealthCommandTimeout  = 5
        InstanceName          = ""
        InstanceLocation      = ""
        SetHealthState        = 0
        MovePrepareCommand    = ""
        MoveCompleteCommand   = ""
        MoveCancelCommand     = ""
        CleanupList           = {}
        CleanupCommand        = ""
        CleanupCommandTimeout = 130
        ProcessCommandString  = ""
        ResetState            = 0
        ReRegistrationPeriod  = 0
        CleanupNodeList       = {}
        MonitorUserName       = ""
        ActivePeerDomain      = "UATec_db2h"
        NodeNameList          = {"host01"}
        OpState               = 1
        ConfigChanged         = 1
        ChangedAttributes     = {}
        HealthState           = 0
        HealthMessage         = ""
        MoveState             = [32768,{},0x0000 0x0000 0x00000000 0x00000000 0x00000000 0x00000000]
        RegisteredPID         = 0
resource 2:
        Name                  = "db2_db2inst1_db2inst1_SAMPLE-rs"
        ResourceType          = 1
        AggregateResource     = "0x3fff 0xffff 0x00000000 0x00000000 0x00000000 0x00000000"
        StartCommand          = "/usr/sbin/rsct/sapolicies/db2/hadrV105_start.ksh db2inst1 db2inst1 SAMPLE"
        StopCommand           = "/usr/sbin/rsct/sapolicies/db2/hadrV105_stop.ksh db2inst1 db2inst1 SAMPLE"
        MonitorCommand        = "/usr/sbin/rsct/sapolicies/db2/hadrV105_monitor.ksh db2inst1 db2inst1 SAMPLE"
        MonitorCommandPeriod  = 21
        MonitorCommandTimeout = 29
        StartCommandTimeout   = 900
        StopCommandTimeout    = 900
        UserName              = "root"
        RunCommandsSync       = 1
        ProtectionMode        = 1
        HealthCommand         = ""
        HealthCommandPeriod   = 10
        HealthCommandTimeout  = 5
        InstanceName          = ""
        InstanceLocation      = ""
        SetHealthState        = 0
        MovePrepareCommand    = ""
        MoveCompleteCommand   = ""
        MoveCancelCommand     = ""
        CleanupList           = {}
        CleanupCommand        = ""
        CleanupCommandTimeout = 130
        ProcessCommandString  = ""
        ResetState            = 0
        ReRegistrationPeriod  = 0
        CleanupNodeList       = {}
        MonitorUserName       = ""
        ActivePeerDomain      = "UATec_db2h"
        NodeNameList          = {"host01","Unknown_Node_Name"}
        OpState               =
        ConfigChanged         =
        ChangedAttributes     =
        HealthState           =
        HealthMessage         =
        MoveState             =
        RegisteredPID         =
resource 3:
        Name                  = "db2_db2inst1_host01_0-rs"
        ResourceType          = 0
        AggregateResource     = "0x2028 0xffff 0x454eac36 0x0cb8029a 0x9377a673 0x4fc0f21c"
        StartCommand          = "/usr/sbin/rsct/sapolicies/db2/db2V105_start.ksh db2inst1 0"
        StopCommand           = "/usr/sbin/rsct/sapolicies/db2/db2V105_stop.ksh db2inst1 0"
        MonitorCommand        = "/usr/sbin/rsct/sapolicies/db2/db2V105_monitor.ksh db2inst1 0"
        MonitorCommandPeriod  = 10
        MonitorCommandTimeout = 120
        StartCommandTimeout   = 900
        StopCommandTimeout    = 900
        UserName              = "root"
        RunCommandsSync       = 1
        ProtectionMode        = 1
        HealthCommand         = ""
        HealthCommandPeriod   = 10
        HealthCommandTimeout  = 5
        InstanceName          = ""
        InstanceLocation      = ""
        SetHealthState        = 0
        MovePrepareCommand    = ""
        MoveCompleteCommand   = ""
        MoveCancelCommand     = ""
        CleanupList           = {}
        CleanupCommand        = ""
        CleanupCommandTimeout = 130
        ProcessCommandString  = ""
        ResetState            = 0
        ReRegistrationPeriod  = 0
        CleanupNodeList       = {}
        MonitorUserName       = ""
        ActivePeerDomain      = "UATec_db2h"
        NodeNameList          = {"host01"}
        OpState               = 1
        ConfigChanged         = 0
        ChangedAttributes     = {}
        HealthState           = 0
        HealthMessage         = ""
        MoveState             = [32768,{},0x0000 0x0000 0x00000000 0x00000000 0x00000000 0x00000000]
        RegisteredPID         = 0
resource 4:
        Name                  = "db2_db2inst1_host01_0-rs"
        ResourceType          = 1
        AggregateResource     = "0x3fff 0xffff 0x00000000 0x00000000 0x00000000 0x00000000"
        StartCommand          = "/usr/sbin/rsct/sapolicies/db2/db2V105_start.ksh db2inst1 0"
        StopCommand           = "/usr/sbin/rsct/sapolicies/db2/db2V105_stop.ksh db2inst1 0"
        MonitorCommand        = "/usr/sbin/rsct/sapolicies/db2/db2V105_monitor.ksh db2inst1 0"
        MonitorCommandPeriod  = 10
        MonitorCommandTimeout = 120
        StartCommandTimeout   = 900
        StopCommandTimeout    = 900
        UserName              = "root"
        RunCommandsSync       = 1
        ProtectionMode        = 1
        HealthCommand         = ""
        HealthCommandPeriod   = 10
        HealthCommandTimeout  = 5
        InstanceName          = ""
        InstanceLocation      = ""
        SetHealthState        = 0
        MovePrepareCommand    = ""
        MoveCompleteCommand   = ""
        MoveCancelCommand     = ""
        CleanupList           = {}
        CleanupCommand        = ""
        CleanupCommandTimeout = 130
        ProcessCommandString  = ""
        ResetState            = 0
        ReRegistrationPeriod  = 0
        CleanupNodeList       = {}
        MonitorUserName       = ""
        ActivePeerDomain      = "UATec_db2h"
        NodeNameList          = {"host01"}
        OpState               =
        ConfigChanged         =
        ChangedAttributes     =
        HealthState           =
        HealthMessage         =
        MoveState             =
        RegisteredPID         =

Note the full path to the db2 scripts, which could be useful to know if you want to change things. Note also that it’s not just the start and stop scripts that are listed, but the monitoring script as well. There is a verbose mode for the monitoring script that you can use if you see any issues with a failure being detected. See the first article in the Reference section for more details on this.
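
To pull just those script paths out of the long listing, a small filter like this could be used. It is a sketch under assumptions: list_scripts is a name I invented, and it relies on the StartCommand, StopCommand, and MonitorCommand lines being formatted as in the output above, with the command in double quotes and the script path as its first word.

```shell
# list_scripts (hypothetical helper): read "lsrsrc -Ab IBM.Application"
# output on stdin and print each distinct start/stop/monitor script path.
# Example use: lsrsrc -Ab IBM.Application | list_scripts
list_scripts() {
    awk -F'"' '
        # With a double quote as separator, $1 is the attribute name and "=",
        # and $2 is the full command line. Lines such as MonitorCommandPeriod
        # have no quotes and do not match.
        $1 ~ /^[ \t]*(Start|Stop|Monitor)Command[ \t]*=[ \t]*$/ {
            split($2, parts, " ")
            print parts[1]
        }
    ' | sort -u
}
```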

Reference

Great Article on TSA and HADR: http://www.ibm.com/developerworks/data/tutorials/dm-1009db2hadr/ Seriously, read this one.
Great article on TSA: https://www.ibm.com/developerworks/tivoli/library/tv-tivoli-system-automation/
