Example of Troubleshooting a Locking Problem

I work on a team of DBAs, and we will frequently talk through an issue together to find the root cause. We are all capable of getting there on our own, but sometimes we realize that we’re not seeing something or think one of the others might see an issue faster.

This post is based on a real issue one of my fellow DBAs saw and that we talked through to find the root cause after the immediate issue had been resolved.

The Problem

The client reports that the site is very slow, to the point of being non-functional. Our DBA logs in and sees that there are 200 applications in a lock-wait status, and analyzing a bit further discovers they all appear to be hung up waiting on one other application. She takes an application snapshot and a lock snapshot, then forces off the offending application. It takes about 5 minutes to roll back, and the immediate problem is resolved – no more applications in lock-wait, and no more site slowness.

We know our application (WebSphere Commerce) well enough to know that forcing off applications in this manner is safe as long as we're not forcing off too many of them. When we're dealing with a site-down situation in the middle of a business day, it is frequently most important to resolve the outage quickly and then figure out what went wrong. That approach requires gathering some data before resolving the immediate issue, so that the data is not lost.
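As a sketch, the gather-then-force sequence looks something like this at the DB2 command line (the database name SAMPLE and the application handle 1234 are placeholders; the real handle comes from the snapshot output):

```shell
# Capture the state first -- these snapshots are exactly the data that
# would be lost once the offending application is forced off.
db2 "get snapshot for applications on SAMPLE" > app_snapshot.out
db2 "get snapshot for locks on SAMPLE" > lock_snapshot.out

# Force off the offending application by its handle (1234 is a
# placeholder); the rollback may take several minutes.
db2 "force application (1234)"
```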

Analyzing What Really Happened

Upon looking at the application snapshot for the forced off application, we find this statement was the last statement executed:

SELECT T1.CALCODE_ID FROM CALCODE T1 WHERE T1.PUBLISHED=1 AND T1.STOREENT_ID IN(?,?,?) AND T1.CALUSAGE_ID=? AND ((T1.CALCODE_ID NOT IN (SELECT T4.CALCODE_ID FROM CALCODEMGP T4)) OR (T1.CALCODE_ID IN (SELECT T5.CALCODE_ID FROM CALCODEMGP T5 WHERE T5.MBRGRP_ID IN (?)))) AND (T1.CALCODE_ID IN (SELECT T10.CALCODE_ID FROM CLCDPROMO T10 WHERE T10.PX_PROMOTION_ID IN (SELECT T11.PX_PROMOTION_ID FROM PX_PROMOTION T11 WHERE T11.CDREQUIRED=0))) AND T1.CALCODE_ID IN (SELECT T2.CALCODE_ID FROM CATENCALCD T2 WHERE (T2.CATENTRY_ID = ? OR T2.CATENTRY_ID = (SELECT T3.CATENTRY_ID_PARENT FROM CATENTREL T3 WHERE T3.CATENTRY_ID_CHILD=? AND T3.CATRELTYPE_ID='PRODUCT_ITEM')) AND T2.STORE_ID IN(?,?) AND (T2.TRADING_ID IS NULL OR T2.TRADING_ID IN (?))) UNION SELECT T7.CALCODE_ID FROM CATGPCALCD T7 WHERE T7.CATGROUP_ID IN (?,?,?,?,?,?) AND ( ( T7.CALCODE_ID NOT IN (SELECT T8.CALCODE_ID FROM CALCODEMGP T8)) OR (T7.CALCODE_ID IN (SELECT T5.CALCODE_ID FROM CALCODEMGP T5 WHERE T5.MBRGRP_ID IN (?))) ) AND ( T7.CALCODE_ID IN (SELECT T9.CALCODE_ID FROM CALCODE T9 WHERE T9.PUBLISHED=1 AND T9.STOREENT_ID IN(?,?,?) AND T9.CALUSAGE_ID=?)) AND (T7.CALCODE_ID IN (SELECT T13.CALCODE_ID FROM CLCDPROMO T13 WHERE T13.PX_PROMOTION_ID IN (SELECT T12.PX_PROMOTION_ID FROM PX_PROMOTION T12 WHERE T12.CDREQUIRED=0)))

That’s a base-Commerce SQL statement and not a custom one.

However, the statements from the applications that were in lock wait on the table were pretty much all:

INSERT INTO DMEXPLOG (DMEXPLOG_ID, STOREENT_ID, EXPERIMENT_ID, TESTELEMENT_ID, PERSONALIZATIONID, CREATED, CLICKED, VALUE, CONTENTTYPE, OPTCOUNTER) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)

The lock-waits were all on the DMEXPLOG table. My fellow DBA puzzled over this for a bit, because the two statements don't really look like they should conflict. It is important to remember that the SQL statement supplied in an application snapshot is really the LAST statement executed, not the one that is CURRENTLY executing, and this is one of the cases where that matters. So how do we get the statement that is currently executing? We can't, really, without a statement event monitor, which we were not running at the time. If it's a long-running statement, we might be able to catch it with db2top.

I don’t recommend running a statement event monitor on a running production e-commerce database unless you are really desperate and have a repeating problem, and even then only for a very short duration. The large number of statements executed in e-commerce databases generates a large amount of data in a very short time.
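For completeness, a minimal write-to-table statement event monitor looks roughly like this – again, something to run briefly and deliberately, never left on. The monitor name STMTMON is my own, and STMT_STMTMON is the default table name DB2 generates for it:

```shell
# Create and activate a statement event monitor (writes to default tables)
db2 "create event monitor stmtmon for statements write to table"
db2 "set event monitor stmtmon state 1"

# ... capture a short window of activity, then turn it off immediately ...
db2 "set event monitor stmtmon state 0"

# Review captured statement text, then clean up
db2 "select substr(stmt_text, 1, 100) from stmt_stmtmon"
db2 "drop event monitor stmtmon"
```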

The next question we considered was why the locks weren't timing out. We verified that our LOCKTIMEOUT database configuration parameter was set properly using:

db2 "get db cfg for sample" |grep LOCKTIMEOUT
 Lock timeout (sec)                        (LOCKTIMEOUT) = 45

OK, so that’s set properly. Probably what was happening was that the locks were timing out, but then the application was simply re-connecting or picking up another existing connection and trying again – good application logic when it’s a step in the transaction that cannot be skipped.

While we quickly ruled it out, we also considered that a reorg might be involved. Reorgs do not respect LOCKTIMEOUT in many situations, and so can be problematic in lock-chaining issues. However, we're smart enough not to run reorgs in the middle of the day, and we verified that none were unexpectedly running here.
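Verifying that no reorgs are in flight can be done with db2pd or by listing active utilities; a quick sketch (SAMPLE is a placeholder database name):

```shell
# Show any table reorgs currently in progress for the database
db2pd -db SAMPLE -reorgs

# Active utilities (reorgs, backups, loads) also show up here
db2 "list utilities show detail"
```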

We next looked at the snapshot information we had about the locks:

Lock object type = Table 
Lock mode = Exclusive Lock  
Lock mode requested = Intention Exclusive Lock (IX) 
Name of tablespace holding lock = USERSPACE1 
Schema of table holding lock = WSCOMUSR 
Name of table holding lock = DMEXPLOG

Hang on, is it saying there’s an Exclusive(X) lock on the entire table?!?!?

The insert needs an Intent Exclusive(IX) lock on the table and will acquire exclusive locks on individual rows. X locks and IX locks are not compatible, and so the two cannot access the table at the same time. See the lock compatibility chart in the Info Center if you need to check the compatibility of two (or more) locks.

In an e-commerce database during regular operation, we NEVER expect to see an exclusive table lock. That immediately made me think of Lock Escalation, and so looking at db2diag.log, we found:

2013-01-18-12.16.34.057220-360 E82054608E917       LEVEL: Warning
PID     : 32104                TID  : 47614489782592 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000          DB   : SAMPLE
APPHDL  : 0-50232              APPID: 173.10.105.33.59586.13011817552
AUTHID  : WSCOMUSR
EDUID   : 8208                 EDUNAME: db2agent (SAMPLE) 0
FUNCTION: DB2 UDB, data management, sqldEscalateLocks, probe:1

MESSAGE : ADM5501I  DB2 is performing lock escalation. The affected application 
          is named "db2jcc_application", and is associated with the workload 
          name "SYSDEFAULTUSERWORKLOAD" and application ID 
          "173.10.105.33.59586.13011817552"  at member "0". The total number of 
          locks currently held is "1249935", and the target number of locks to 
          hold is "624967". The current statement being executed is "delete 
          from DMEXPLOG where CREATED < ?".
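Messages like this can be pulled out of the diagnostic log directly with the db2diag tool; a sketch, assuming the default diagnostic path under the instance home:

```shell
# Filter the diagnostic log for the lock escalation message (ADM5501I)
db2diag -g "message:=ADM5501I"

# Or simply grep the raw log, with some surrounding context
grep -B 5 -A 8 "ADM5501I" ~/sqllib/db2dump/db2diag.log
```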

There it is, our smoking gun. A delete was doing a table scan, and lock escalation had turned its row locks into an exclusive lock on the whole of a table that every single transaction touches.
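Lock escalation kicks in when an application's locks threaten to overflow the lock memory, which is governed by the LOCKLIST and MAXLOCKS database configuration parameters. They can be checked the same way we checked LOCKTIMEOUT:

```shell
# LOCKLIST is the lock memory size (4KB pages); MAXLOCKS is the percent
# of it one application may consume before DB2 escalates its locks.
db2 "get db cfg for sample" | grep -E "LOCKLIST|MAXLOCKS"
```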

That delete is actually out-of-the-box Commerce code too.

WebSphere Commerce is actually pretty good at leaving deletes to be done at another time and simply marking rows for deletion. So our ongoing investigation will be in two directions:

  1. Why does this delete need to be done in real time? Can we work it into our dbclean/pruning process instead?
  2. There’s an index that would prevent the table scan during the delete, and we probably want to implement it. However, since this is an 88-million-row table that every transaction seems to hit, we’re going to use a performance testing environment first: we’ll add the index during a load test to see whether the index itself causes any performance problems before we attempt it on production.
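As a sketch of the second direction, an index on the delete's predicate column would allow index access instead of a table scan. The index name here is my own invention, and as noted above, this should be tested under load before going anywhere near production:

```shell
# Hypothetical index supporting "delete from DMEXPLOG where CREATED < ?".
# Test the creation and its impact in a load-test environment first.
db2 "create index WSCOMUSR.IX_DMEXPLOG_CREATED on WSCOMUSR.DMEXPLOG (CREATED)"

# Refresh statistics so the optimizer knows about the new index
db2 "runstats on table WSCOMUSR.DMEXPLOG and indexes all"
```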

Summary

Solving this problem was three major steps:

  1. Gather info that will disappear and then get the database functioning normally again
  2. Find the root cause of the problem
  3. Work on a long-term solution for the true problem

Note that the problem here was not caused by DB2 or the DBMS itself. It is actually an application problem that manifests in the database. The application should probably not be doing that delete as a part of normal operation, and if it must, then an index needs to be added to support the delete.
