Locks – Timeout vs. Deadlock
So questions in this area most often come from DBAs from other RDBMS’s or from developers. I’ll probably write other posts on locks, but here are the basics on deadlocks vs. timeouts.
Why we get them confused at all
Ah the wisdom that went into grouping these under a single error code with different return/reason codes. For whatever reason, both a deadlock and a lock timeout return an SQL0911 error code. A deadlock is return/reason code 2, while a lock timeout is return/reason code 68. If you’ve been a DB2 dba for more than a couple of years, you know these by heart, largely because of the number of times a developer has called you about deadlocks when they are only lock timeouts.
What are they?
Db2 does more locking than most databases. Even reads can cause locks, with the number/scale of the locks depending on your isolation level. When some other connection is holding on to a lock that is not compatible with the lock you want, you must wait. Waiting for locks is entirely normal, and unless it is eating significantly into your response time, is not a problem. DB2 allows you to put a limit on the amount of time you’ll wait at a database level using the LOCKTIMEOUT configuration parameter. Generally it should be set to between 30 and 90 seconds for OLTP databases. It can be longer for DW/DSS databases. It is rarely appropriate to keep the default value of -1 (which means wait forever), but I have seen it once.
My expertise on Oracle is clearly not at the level of my DB2 knowledge, but my understanding is that Oracle does not allow you to set a database-wide value for lock timeout – you can specify nowait on the end of SQL statements to not allow any wait time at all. Oracle’s opinion seems to be that this is not needed because in Oracle “readers don’t block writers and writers don’t block readers”. I’ve seen issues with “blocking locks” (as I believe Oracle calls them) in a production environment caused by “for update” selects, though, so I’m not sure I agree with eliminating the idea of a lock timeout.
Deadlocks have a very specific meaning to any DBA. It’s not just two connections going after the same exact lock. It’s when my connection has a lock on the employees table and won’t give up that lock until it can finish the transaction with a lock on the departments table. But your connection has a lock on the departments table and won’t give up that lock until it can finish the transaction with a lock on the employees table. I have the cereal and won’t give it up until I get the milk. But you have the milk and won’t give it up until you get the cereal.
Oracle and db2 both have facilities to detect and resolve deadlocks by choosing one connection as the victim and giving it an error message. In DB2, that error message is SQL0911 with RC 2.
When to be concerned
First of all, significant issues with either deadlocks or lock timeouts are attributable to database design, application design, or the interaction of multiple applications. They are NOT inherently a DBMS problem, and most of the time there may not be much you can do about them at the DBMS level. The fully in the box system/physical DBAs out there go “not MY problem”. So the real solution comes from application developers most of the time for Commerce databases – or from Commerce support themselves.
I wory if I see more than 1 deadlock a day, and I consider it a fairly urgent problem if I’m seeing more than 10 deadlocks a day. I’ve seen up to 5,000 per HOUR in extremely bad issues.
On lock timeouts, I look at a trend over the last few days to determine what is acceptable. I’d be worried at more than 200 timeouts per hour, and might address it before then if there are other issues being caused.
How to track and troubleshoot them
Lock timeouts can be tracked using snapshots (including db2top), and through the administrative views. This link is the Info Center entry on the System Monitor Element for them: http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.admin.mon.doc/doc/r0001290.html
Generally, I recommend tracking key indicators every hour so you can compare what you’re seeing to the past days or weeks – this can be very helpful when looking at problems. Here is a methodology I like for doing that in 9.7: http://www.ibm.com/developerworks/data/library/techarticle/dm-1009db2monitoring1/index.html?ca=drs-
So all of that simply tells you the number of lock timeouts – since the database restart or since the last monitor reset or simulated monitor reset. The real question you’ll be asking is what are the connections doing when they get the timeout. Before version 9.5, that was harder – you would have to trap the error using db2cos, and gather the data at that time (in theory, I never actually implemented it). In db2 9.7 you can set the db2 registry variable (using db2set) DB2_CAPTURE_LOCKTIMEOUT to ON. Then when lock timeouts are encountered, files on them will be written to your DIAGPATH that should capture some data about the lock timeouts. In my limited experience in this area, they tend to capture the most recent SQL of ONE of the connections involved, which may be enough to help, and is certainly better than the information we used to have.
Again, the ultimate solution to excessive lock timeouts comes from the application – with Commerce databases, it is frequently the order or way in which different things are used by developers.
Deadlocks are generally the more serious of the two. On the default DIAGLEVEL you will see them noted in your db2diag.log. Also by default, there is a deadlock event monitor running to collect details around what is going on. You might have to alter what level of details/history this collects (I don’t usually, but you can). Most frequently it will run out of space and simply quit gathering data, so you need to drop and recreate it every now and then, prefferably with a larger number of files. Every database I build gets the default one dropped, and one with more space created in its place. I use something like this:
mkdir /db_data/db2inst1/NODE0000/SQL00001/db2event/my_detaildeadlock (note this location will change if the database directory is somewhere other than /db_data, or if the instance name is different, or if there is more than one database in the DB2 instance)
db2 "create event monitor my_detaildeadlock for deadlocks with details write to file 'my_detaildeadlock' maxfiles 2000 maxfilesize 10000 blocked append autostart"
db2 "set event monitor my_detaildeadlock state=1"
db2 "set event monitor db2detaildeadlock state=0"
db2 drop event monitor db2detaildeadlock
The data that you get out of this is really quite good. If you parse it (or get the deadlock event monitor to write to tables and query it), you get both (or all of) the statements involved in each deadlocks, their exact times, the tables involved, etc. It also holds quite a bit of history, so I like to look at it over time and see things like deadlocks per day or hour to determine exactly when a problem occurred or started.
A few configuration parameters
So one thing that can cause issues with deadlocks or lock timeouts is if you’re doing too many table scans. A good measure of this is your Read Efficiency. A good link on Read Efficiency is here: http://www.dbisoftware.com/brother-eagle/advice/db2dbiref.php, though I would set the limits for acceptable a bit higher.
Inaccurate or old runstats is the first thing to check when you have issues with deadlocks, believe it or not. If all the transactions are taking 10X as long to run because they have bad access plans, then they’re also holding their locks for 10X as long (and possibly locking more rows), and that tends to lead to problems.
Commerce supports the use of 3 registry variables that reduce locking. These variables are set as a part of DB2_WORKLOAD=WC, which you should absolutely have set on every Commerce DB2 instance. They are:
Info Center entry on locktimeout db configuration parameter: http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.admin.config.doc/doc/r0000329.html