DB2 Basics: What is an Index?
Don’t call them indices in the United States – most American DBAs agree that they are ‘indexes’, though either plural form is grammatically acceptable in American English. Indexes are arguably one of the most important aspects of database design. They’re also something that we can pretty easily change and add to over time. Even overly restrictive vendors tend to allow new non-unique indexes to be added to their vended DB2 databases.
What is an Index
An index is a data structure that exists physically on disk. It consists of one or more columns in order and pointers to rows in the table. It is by definition a partial duplication of the data, but is tied to the data by the DBMS so that an index cannot be different than the data in the table
Realistically, DB2 supports only a B-tree style of index. There are other types you might hear about like block indexes for MDC tables, but they are only used in special situations.
The diagram below shows what the structure of a B-tree index looks like. On the root node, there are values and pointers. Each value represents the last value represented on an intermediate node. Each intermediate node contains values and pointers to leaf pages. Each leaf page contains values and row ids (rids) that point to individual rows in the table. Additional intermediate nodes may be added depending on cardinality of the index and the size of the table.
While there are other index types listed in the db2 Info Center (http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0020180.html), those are options on the index, and the underlying structure is always B-tree.
DB2 does not have hash or bit map or other index types
Index Creation Options
Unique indexes enforce the uniqueness of a single column or group of columns. In DB2 there is no concept of a unique constraint outside of a unique index. Primary keys are created via CREATE TABLE or ALTER TABLE statements, and may use either a previously existing unique index or create their own unique index.
Clustered indexes are indexes where the actual data in the table is placed at least roughly in order of the index. If a clustered index exists on a table, DB2 will attempt to insert data in the order of the clustering index. Order is not guaranteed, and may degrade over time depending on insert patterns, data volume, and settings controlling the structure of data pages. Tables cannot be REORGed on an index other than the clustering index. Using the CLUSTER option on the create index statement creates a clustering index. Only one clustering index may exist on a table at a time. Clustered indexes are a wholly different concept from tables defined with Multi-Dimensional Clustering(MDC).
Direction of Index Scan
By specifying DISALLOW REVERSE SCANS, index scans can only occur in the order in which the index was defined and not in reverse order. I cannot think of a technical reason you would want to do this – it used to be the default behavior, so I’m in the habit of always specifying ALLOW REVERSE SCANS, though it is not technically required since it is the default.
When using table partitioning, we can choose to have indexes partitioned along with your data or to have them non-partitioned, at least in DB2 9.7.
Why Care About Indexes?
One of the strengths of any RDBMS system is the ability to find an access path to the data without the end user having to understand how to physically retrieve the data. Accessing the data well is critical to database performance. CRITICAL.
I have seen missing indexes in e-commerce databases bring not just the database server to its knees, but have also seen that cause high utilization on all tiers (app and web) with an impact on site availability. And this is on database servers that were over-sized in the first place.
Indexes reduce the amount of time, memory, I/Os, sort time, cpu cycles, and really hardware needed to return the data in a reasonable amount of time. This is especially true as the data grows in size. With small tables, you may be able to get by without indexes, but as the tables grow, it becomes more and more important to make sure they are properly indexed. That’s not to say that small tables don’t need indexes too – they do.
The best analogy I have here is imagine walking into a library or bookstore, knowing the title of a book you want to find. But the books aren’t in any order and there are no card catalogs or computer catalogs. You would have to look at each and every book before you found the one you were interested in. That’s a table scan, and it works to find something every time, but it sure does take a lot of time. In fact if a human were faced with that prospect, most of us would just give up. Instead, if you have a card or computer catalog(analogous to an index in our example), you can look up the exact location of the book and go straight to it.
If indexes are so powerful, why don’t we index on every single column and every possible combination of columns?
The first reason is insert performance. Each added index will decrease insert performance, and certainly in e-commerce database, insert performance can be quite critical on some tables. Indexes exist physically on disk, so each insert also has to update each index. More indexes = more time to insert.
The second reason is disk space. Technically indexes are duplicate data. They’re storing data that is already in the table, just in a specific order. Since indexes exist physically on disk, they take up disk space. I have actually seen a database where the indexes took up as much space as the data did, because every column in the table was in an index. In that case it was necessary, but such duplication should not be needed for every database or table. I can easily see some tables taking more space in indexes than the data themselves.
Maintenance for Indexes
One of the things I was shocked about when I took an Oracle class was that in Oracle databases, indexes can be considered invalid and need to be rebuilt. This doesn’t happen in DB2 databases. DB2 might choose not to use the index if statistics are old or if it doesn’t match what the query needs well enough, but the index is always valid.
For DB2 to select the best indexes to use, it is important to regularly (weekly) be running RUNSTATS on the tables in your database with this syntax:
db2 runstats on schema.table with distribution and detailed indexes all
This ensures that DB2 has correct statistical information on the distribution of data and on all indexes so it can choose the best path to retrieve data.
If you use clustering indexes, you should also use REORGCHK on a weekly basis to check for tables that need to be re-ordered to match their clustering indexes better.
Indexing as an art
Overall, indexing is an art that it takes a while to master. I hope the basics in this post give some of my readers a good starting point on understanding indexes.