Clustering a Table on an Index
I have been playing a fair amount lately with clustering indexes and have been rehashing my views on reorging tables on an index. This is still a work in progress, but thought I’d share some details and see if others out there have any thoughts to share with me and others on it.
Clustering a Table on an Index
I’ve long been of the opinion that as long as I’m doing a table reorg in an OLTP database, I might as well reorg the table on an index. If there is a clustering index on the table, this is the easiest call in the world – of course the table will be reorged on that index – you can’t do it on any other index. But when you don’t have a clustering index, determining the right index is more difficult.
I have long reorged tables on the primary key when there is no clustering index defined. But with research I’ve done over the last couple of years, I have come around on this. I don’t see the performance gain there in many cases. When my tables are being accessed, how likely is it that they’ll be accessed by sequential primary key, especially in my WebSphere Commerce databases where the primary keys are almost always simply generated numbers in sequential order?
I’m increasingly convinced that my tables should be clustered on a low-cardinality index – I can get the most performance gain there. WebSphere Commerce databases have plenty of low-cardinality indexes that I’m not ready to drop outright. For a low-cardinality index, I am much more likely to be accessing a chunk of the data by the index value at a time. See my developerWorks article, Why low cardinality indexes negatively impact performance for a description of why clustering can make such a performance difference when these indexes are used to access a table.
What is a Clustering Index?
Before we dig in farther, let’s first start with the basics. A clustering index is created when you specify the CLUSTER keyword on index creation(if a new index on an existing table, the table should then be reorged on the clustering index). DB2 then attempts to maintain the physical data in the table in the same order as the index. It respects a clustering index on LOAD and even on insert, DB2 attempts to physically place rows where they belong in the order of the clustering index. However, the clustering is not guaranteed, and there are a number of situations in which DB2 is unable to maintain the table in the order of the index. This is where regular reorgs come into play – assuming data is changed or inserted across the range of values in the index column or columns, regular reorgs will be needed to keep the table clustered on the index.
MDC is a concept introduced all the way back in DB2 version 8. It is guaranteed clustering over one or more columns. I’d love to get my hands on it, but WebSphere Commerce considers it a “customization”, so I won’t be talking much about it here.
Performance Impact on Reorg of Reorging on an Index
The Reorg itself will run longer if you’re doing inplace reorgs on a index as opposed to a reorg that does not cluster the table on any index. The reason for this is that a reorg without specifying an index will scan the table starting at the end, and move things around as it makes sense, and at the end may have a range of pages that can be released from the table. It never has to scan an index. An inplace reorg on an index, though, must first clear a range of pages at the beginning of the table, then scan an index to determine what rows need to be in what order and move those rows into the proper positions. This is a lot more data to move around, and can take significantly longer. Especially in larger databases, the amount of time a reorg takes may be a limiting factor in the kinds of reorgs you do, especially considering that inplace reorgs are not fully online (see When is a Reorg Really Online?).
Determining Indexes for Clustering and Clustering Reorgs
The absolute best way to determine the ideal clustering indexes that you’re going to use is to look at your SQL workload on your database and find where a clustering index might help you there. This is time consuming as you consider each table and the SQL related to that table (which can be hard to find without a tool such as DBI’s Brother-Panther). It is also rigid, and likely requires you to define your indexes either as clustering indexes or to store the indexes you want to reorg tables on somewhere in an unchanging format. For DBAs supporting hundreds of databases with hundreds of tables in each database, this may be impossible.
Even when I consider only WebSphere Commerce databases, there is a vast variety in where clustering indexes are going to make the most impact. I have one client that uses multiple stores, but only one language, so there I’m looking strongly at clustering several tables on storeent_id. Another client doesn’t have many stores, but does have many languages that are being used. Certainly for the DESC tables, there I need to work more towards using language_id as a common clustering index. I wrote recently about how I helped a critical query using a couple of clustering indexes, and that was based on SOLR functionality that is only in the latest WCS Fix/Feature Pack, so changes between WCS versions can change things as well.
Some Queries in the Right Direction
What I really want is a query I can run that will tell me not just what index for any given table will make the most sense as the clustering index or index that I regularly cluster on using a reorg, but also how much of an improvement in my workload they will give me. There may be a way I can script some of that – explaining and/or advising the SQL in my package cache, and perhaps looking at explain/advis tables? Sounds like a fun science project, but I’m not there yet.
What I’m starting out with is something a bit more simple. The below two ksh scripts make a start in this direction. The first simply describes all the indexes in the table, in a way I like better than describe indexes. I’m still playing with it:
$ cat desc_ind.sh tabname=`echo $1 | tr '[:lower:]' '[:upper:]'`; db2 connect to wc005p01; db2 -v "select i.lastused, substr(indname,1,20) as indname, substr(colnames,1,30) as colnames, fullkeycard as fullkeycard, card as table_card, decimal(clusterfactor,10,5) as clustefactor, indextype, index_scans from syscat.indexes i join syscat.tables t on i.tabname=t.tabname and i.tabschema=t.tabschema join table(mon_get_index( i.tabschema, i.tabname, -2 )) mgi on mgi.iid = i.iid where t.tabschema='WSCOMUSR' and t.tabname='$tabname' with ur"; db2 connect reset;
This script is executed simply with
./desc_ind.sh TABNAME. It could easily be altered to also specify a schema name – that’s just not something I need at the moment. The output of this script looks something like this:
$ ./desc_ind.sh XSTORESKU Database Connection Information Database server = DB2/LINUXX8664 9.7.7 SQL authorization ID = DB2INST1 Local database alias = WC005P01 select i.lastused, substr(indname,1,20) as indname, substr(colnames,1,30) as colnames, fullkeycard as fullkeycard, card as table_card, decimal(clusterfactor,10,5) as clustefactor, indextype, index_scans from syscat.indexes i join syscat.tables t on i.tabname=t.tabname and i.tabschema=t.tabschema join table(mon_get_index( i.tabschema, i.tabname, -2 )) mgi on mgi.iid = i.iid where t.tabschema='WSCOMUSR' and t.tabname='XSTORESKU' with ur LASTUSED INDNAME COLNAMES FULLKEYCARD TABLE_CARD CLUSTEFACTOR INDEXTYPE INDEX_SCANS ---------- -------------------- ------------------------------ -------------------- -------------------- ------------ --------- -------------------- 09/15/2014 P_XSTORESKU +XSTORESKU_ID 3550167 3550167 0.85567 REG 586462 09/04/2014 I_XSTORESKU01 +STLOC_ID 2681 3550167 0.18429 REG 41 09/16/2014 I_XSTORESKU02 +UPC 15955 3550167 0.03371 REG 334 09/16/2014 U_XSTORESKU01 +STLOC_ID+UPC 3550167 3550167 0.15293 REG 406417 01/01/0001 U_XSTORESKU02 +UPC+STLOC_ID 3550167 3550167 0.00156 REG 0 5 record(s) selected. DB20000I The SQL command completed successfully.
This brings data together from several sources I want to know about when looking at my indexes in this context. Notice that I’m pulling the lastused to see if the index is used at all (in this case, there are some stupid indexes on the table), and also pulling from MON_GET_INDEX to look at the number of index scans.
The next script I’ve been playing with is designed to return the name of a single index that appears to be the best candidate for a clustering index without actually looking at the SQL. It finds indexes where the index cardinality is less than 50% of the table (not sure I’m happy with that number yet), but with a cardinality of greater than 1, because clustering a table on an index with a cardinality of one is patently ridiculous. Then it orders them in descending order of the number of index scans reported in MON_GET_INDEX, and returns only the index with the most scans for this table. This tells me that DB2 is already using this low-cardinality index even without the clustering, which tends to tell me that it may be causing potential performance problems.
$ cat clus_ind.sh tabname=`echo $1 | tr '[:lower:]' '[:upper:]'`; db2 connect to wc005p01; db2 -v "select substr(indname,1,20) as indname, float(fullkeycard)/float(card) from syscat.indexes i join syscat.tables t on i.tabname=t.tabname and i.tabschema=t.tabschema join table(mon_get_index( i.tabschema, i.tabname, -2 )) mgi on mgi.iid = i.iid where t.tabschema='WSCOMUSR' and t.tabname='$tabname' and float(fullkeycard)/float(card) < .5 and fullkeycard > 1 order by index_scans desc fetch first 1 row only with ur"; db2 connect reset;
The problem with this approach is that I’m not looking at what index could have the most impact – maybe one of these indexes would be used very heavily if only the cluster factor was higher, since DB2 tends to avoid the usage of low-cardinality indexes when a table is not well clustered over them.
What other factors should I be considering here? Does anyone have any suggestions on how to make this methodology (or some other method) more relevantly find indexes on which a table would most optimally be clustered on? I feel like my approach could be heavily improved on, but wonder how to do it without extremes like explaining all the SQL in my package cache.
I’m still sure I should be dropping the lowest cardinality indexes in many situations. Perhaps choosing a clustering index is really more applicable to the “medium” cardinality indexes.