Another TSA State Issue
One of the features I would most like IBM to add to DB2 is a way to manage TSA states through DB2 commands. I’m not a TSA expert, but I would like to find a class or opportunity to study it further. There are plenty of scenarios where I have to go in as root and run various commands to get TSA out of an odd state.
I saw a new one last night, and thought I’d share it with my readers. In this case, we had a takeover for maintenance (DB2 FixPack) that took an extraordinarily long time. While we were investigating, it finally completed. But the db2diag.log was full of noise (mostly messages about version differences, even though we did the nodes in the correct order), and I think there were some odd failures in there. When we were all done with the FixPack and went to bring TSA back up, we got the following from lssam:
Online IBM.ResourceGroup:db2_db2inst1_Prod-db1.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs:Prod-db1
Online IBM.ResourceGroup:db2_db2inst1_Prod-db2.adomain.com_0-rg Nominal=Online Control=MemberInProblemState
        '- Failed Offline IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs
                '- Failed Offline IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs:Prod-db2 Control=MemberInProblemState
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCP01-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCP01-rs
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:Prod-db1
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:Prod-db2
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:Prod-db1
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:Prod-db2
This was after we had HADR back up and running and the FixPack applied on both servers. Having a bit of knowledge, I suspected that TSA had gotten into a state that required manual intervention. Even using db2haicu to disable and then re-enable TSA did not change this. After googling a bit, I came up with this command that got things back to normal:
resetrsrc -s "Name = 'db2_db2inst1_Prod-db2.adomain.com_0-rs'" IBM.Application
After running this and giving it a few minutes, all the states looked good. There’s not a lot when I google this stuff, but in this case, this link helped a lot: https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014297038
Please be cautious about running commands simply from forum suggestions, as you don’t always understand the source or the implications. However, in this case, I’ve used this command before at DB2 support’s recommendation. Apparently “Failed Offline” is a permanent state that requires human intervention. The command had to be run on the appropriate node, in this case Prod-db2.adomain.com.
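To make the check repeatable, here is a rough sketch of a shell helper that reads lssam output and prints (but does not run) a resetrsrc command for each IBM.Application resource shown as Failed Offline. The function name failed_offline_resets is my own invention, and the sketch assumes uncolored lssam text in the layout shown above and a grep that supports -o. Treat it as a starting point, not a tested procedure.

```shell
#!/bin/sh
# Hypothetical helper: reads lssam output on stdin and prints one resetrsrc
# command per IBM.Application resource reported as "Failed Offline".
# It only prints the commands; review them, then run them as root on the
# node that owns the failed resource.
failed_offline_resets() {
    # Pull out every "Failed Offline IBM.Application:<name>" token.
    grep -o 'Failed Offline IBM\.Application:[^ ]*' |
        # Keep just the resource name.
        sed 's/^Failed Offline IBM\.Application://' |
        # Skip the per-node member entries, which look like <name>:<node>.
        grep -v ':' |
        # Report each resource only once.
        sort -u |
        while read -r name; do
            printf "resetrsrc -s \"Name = '%s'\" IBM.Application\n" "$name"
        done
}
```

Something like lssam | failed_offline_resets would then list the candidate commands. It prints nothing when no resource is in Failed Offline, and it deliberately leaves the decision to run anything to a human.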
As stated before, I wish there were an option in db2haicu that basically said “I’ve fixed the original problem, reset the TSA states”. This one is a bit easier than the problem and reset I describe in Using TSA/db2haicu to automate failover Part 4: Dealing with Problems After Setup.
IBM: Can you please add some state management into TSA work done through db2haicu or as the instance owner using db2 commands?