Issues with STMM
I thought I’d share some issues with the Self-Tuning Memory Manager (STMM) that I’ve seen on Linux lately. I’ve mostly been a fan of STMM, and I still am for small environments that are largely transaction processing and have only one instance on a server.
Here are the details of this environment. The database is a small analytics environment. It used to be a BCU environment on DB2 9.5, with four data nodes and one coordinator node. The database was less than a TB, uncompressed. There were also some single-partition databases for various purposes on the coordinator node. I recently migrated it to BLU – DB2 10.5 on Linux. I largely built the environment and moved the data about six months ago, and the users are just starting to make heavier use of it. The client does essentially a full re-load of all data once a month.
The new environment has two DB2 instances – one for the BLU database, and one for a transaction-processing database that replaces most of the smaller databases from the old coordinator node. Each instance has only one database. The server has 8 CPUs and about 64 GB of memory – the minimums for a BLU environment.
The first crash we saw was both instances going down within 2 seconds of each other. The last message before the crash looked like this:
```
2015-08-06-17.58.02.253956+000 E548084503E579       LEVEL: Severe
PID     : 20773                TID : 140664939472640 PROC : db2wdog
INSTANCE: db2inst1             NODE : 000
HOSTNAME: dbserver1
EDUID   : 2                    EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, base sys utilities, sqleWatchDog, probe:20
MESSAGE : ADM0503C  An unexpected internal processing error has occurred. All
          DB2 processes associated with this instance have been shutdown.
          Diagnostic information has been recorded. Contact IBM Support for
          further assistance.

2015-08-06-17.58.02.574134+000 E548085083E455       LEVEL: Error
PID     : 20773                TID : 140664939472640 PROC : db2wdog
INSTANCE: db2inst1             NODE : 000
HOSTNAME: dbserver1
EDUID   : 2                    EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, base sys utilities, sqleWatchDog, probe:8959
DATA #1 : Process ID, 4 bytes
20775
DATA #2 : Hexdump, 8 bytes
0x00007FEF1BBFD1E8 : 0201 0000 0900 0000                        ........

2015-08-06-17.58.02.575748+000 I548085539E420       LEVEL: Info
PID     : 20773                TID : 140664939472640 PROC : db2wdog
INSTANCE: db2inst1             NODE : 000
HOSTNAME: dbserver1
EDUID   : 2                    EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, base sys utilities, sqleCleanupResources, probe:5475
DATA #1 : String, 24 bytes
Process Termination Code
DATA #2 : Hex integer, 4 bytes
0x00000102

2015-08-06-17.58.02.580890+000 I548085960E848       LEVEL: Event
PID     : 20773                TID : 140664939472640 PROC : db2wdog
INSTANCE: db2inst1             NODE : 000
HOSTNAME: dbserver1
EDUID   : 2                    EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, oper system services, sqlossig, probe:10
MESSAGE : Sending SIGKILL to the following process id
DATA #1 : signed integer, 4 bytes
...
```
The most frequent cause of this kind of error, in my experience, is memory pressure at the OS level – the OS saw that too much memory was in use and, rather than crashing itself, chose the biggest consumer of memory to kill. On a DB2 database server, that is almost always db2sysc or another DB2 process. I still chose to open a ticket with support, to get confirmation on this and to see if there was a known issue.
IBM support pointed me to this technote, confirming my suspicions: http://www-01.ibm.com/support/docview.wss?uid=swg21449871. They also recommended “have a Linux system administrator review the system memory usage and verify that there is available memory, including disk swap space. Most Linux kernels now allow for the tuning of the OOM-killer. It is recommended that a Linux system administrator perform a review and determine the appropriate settings.” I was a bit frustrated with this response as this box runs on a PureApp environment and runs only DB2. The solution is to tune the OOM-killer at the OS level?
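For what it's worth, a review like the one support described might look something like the sketch below – confirming from the kernel logs that the OOM killer fired, checking memory and swap, and optionally exempting the DB2 engine process from being targeted. The log file path and the decision to fully exempt db2sysc are assumptions on my part, not anything IBM recommended:

```shell
# Confirm the kernel's OOM killer terminated a process
# (entries appear in the kernel ring buffer and syslog)
dmesg | grep -i -E "out of memory|oom-killer"
grep -i "oom" /var/log/messages   # path varies by distribution

# Review current memory and swap usage
free -m

# Optionally exempt the DB2 engine from the OOM killer.
# A score of -1000 disables OOM killing for that process entirely;
# note this must be reapplied after every instance restart.
for pid in $(pgrep db2sysc); do
    echo -1000 > /proc/$pid/oom_score_adj
done
```

The trade-off with exempting db2sysc is that the kernel will then kill something else – possibly something you need even more – so it is a judgment call for the system administrator.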
While working on the issue, I discovered that I had neglected to set INSTANCE_MEMORY/DATABASE_MEMORY to fixed values, which is best practice when you’re trying to use STMM on a system with more than one DB2 instance. So I set them for both instances and databases, allowing the BLU instance to have most of the memory. I went with the idea that this crash was basically my fault for not better limiting the two DB2 instances on the box. Though I wish STMM played more nicely with multiple instances.
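Setting those fixed values is straightforward. The sketch below shows one possible split of a 64 GB server; the instance and database names are hypothetical, and the page counts (both parameters are in 4 KB pages) are illustrative, not what I actually used:

```shell
# BLU instance: ~44 GB  (44 * 1024 * 1024 KB / 4 KB = 11534336 pages)
# Run from the BLU instance owner's environment.
db2 update dbm cfg using INSTANCE_MEMORY 11534336

# Pin DATABASE_MEMORY to a number rather than AUTOMATIC
# (here ~40 GB, leaving the rest for instance-level memory)
db2 update db cfg for bludb using DATABASE_MEMORY 10485760

# OLTP instance: ~12 GB  (3145728 pages).
# Run from the other instance owner's environment.
db2 update dbm cfg using INSTANCE_MEMORY 3145728
```

That accounts for about 56 GB of the 64 GB, leaving headroom for the OS and anything else on the box.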
Several weeks later, I had another crash, though this time only of the BLU instance, not of the other instance. It was clearly the same issue. I re-opened the PMR with support, and asked for help identifying what tuning I needed to do to keep these two instances from stepping on each other. IBM support again confirmed that it was a case of the OS killing DB2 due to memory pressure. This time, they recommended setting the Linux kernel parameter vm.swappiness to 0. While I worked on getting approvals for that, I tweeted about it. The DB2 Knowledge Center does recommend it be set to 0. I had it set to the default of 60.
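For reference, checking and changing vm.swappiness looks like this (root privileges assumed; whether 0 or a less aggressive value is appropriate is the judgment call discussed below):

```shell
# Check the current value (the default on most distributions is 60)
sysctl vm.swappiness

# Change it for the running kernel
sysctl -w vm.swappiness=0

# Persist the setting across reboots
echo "vm.swappiness = 0" >> /etc/sysctl.conf
```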
Scott Hayes reached out to me on Twitter because he had recently seen a similar issue. After discussing the details with him, I decided to implement a less drastic setting for vm.swappiness, and to instead abandon the use of STMM. I always set the package cache manually anyway, and I had set the catalog cache manually as well. Due to problems with LOADs, I had already set the utility heap manually. And in BLU databases, STMM cannot tune the sort memory areas. All of this meant that the only areas STMM could even tune in my BLU database were DBHEAP, LOCKLIST, and the buffer pools. I looked at the current settings and set these remaining areas to just below where STMM had them. I have already encountered one minor problem – apparently STMM had been increasing DBHEAP each night during LOADs, so the first night’s LOADs after the change failed due to insufficient DBHEAP. That was easy to fix: the errors in the diagnostic log specified exactly how much DBHEAP was needed, so I manually increased it. I will have to keep a closer eye on performance tuning, but my monitoring already does things like send me an email when buffer pool hit ratios or other KPIs are off, so that’s not much of a stretch for me.
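The general shape of that change is sketched below – capture what STMM has currently computed, switch self-tuning off, and pin the remaining areas to explicit values. The database name, grep pattern, and all of the numbers are placeholders; base yours on the values STMM was actually using:

```shell
# See the values STMM has currently computed (requires a connection)
db2 connect to bludb
db2 get db cfg for bludb show detail | grep -E "DBHEAP|LOCKLIST|SELF_TUNING_MEM"

# Turn off STMM for the database
db2 update db cfg for bludb using SELF_TUNING_MEM OFF

# Pin the formerly tuned areas to explicit values (in 4 KB pages)
db2 update db cfg for bludb using DBHEAP 20000
db2 update db cfg for bludb using LOCKLIST 65536

# Give each buffer pool a fixed size instead of AUTOMATIC
db2 "alter bufferpool IBMDEFAULTBP size 500000"
```

Note that a fixed DBHEAP is exactly what bit me on the first night of LOADs, so it pays to size it with peak (LOAD-time) usage in mind rather than steady-state usage.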