Calculating Row Width and Choosing a Page Size in DB2
A DB2 DBA must be able to calculate the row width. The row width can be used to estimate table size, and also to pick a tablespace with the correct page size. DB2 is more strict about a row fitting on a page. For general purposes, a DB2 table must be in a tablespace where any row, minus LOBs fits on a single page. There is an option for extended row size. This feature, in my head, is only there for Oracle compatibility, and should not be used when there are other choices.
How to Calculate Row Width
For Tables that Do Not Yet Exist
Working from the create table statement, this is simply an exercise in detail-oriented work. For each column, you need to map the data type and specification to the number of bytes that will be used when the table is created. This sounds simple, but can be a bit more complex. I recommend a spreadsheet with a row for each column so you can figure them out one by one and then add all the numbers together.
Before going through how to determine the minimum or maximum bytes a column will consume, there are two pieces of information you need about every column, in addition to the data type specification. These two factors determine how much space any column takes up:
- Is value compression active? Value compression is invoked via create or alter table statements. If this is a new table, you will see `COMPRESS SYSTEM DEFAULT` in the definition for the individual column. “Value Compression” is NOT active when the `COMPRESS YES` and related keywords are used at the table level.
- Is the column nullable? A column is nullable if the key words `NOT NULL` are not specified for that column.
When you know these details for each column, you can use the table from the IBM DB2 Knowledge Center to understand just how much space each column would take up.
By data type, there are some considerations and details:
Most of the numeric data types take up a fixed amount of space in the database, no matter what number is represented within them. For these, each number must fit within the range defined for the type. These columns are easy to map a data type to a specific number of bytes based on the two factors previously discussed. For example, an INTEGER in a nullable column will take up 5 bytes:
With the decimal data type, the size depends on the precision (total number of digits), so you must also have the defined precision before mapping things back to that table from the IBM DB2 Knowledge center. That table defines the bytes that the column takes up in terms of precision or P. When you define columns with these numbers, you must include the precision, so a create table statement will have this value available. For example, if a column is defined as
NOT NULL is not part of the column definition, the precision part of this definition is 10, and using the table, that would be (10/2)+2 or 7:
There are three date/time data types in DB2. The most frequently used is timestamp, but date and time types are also available. A precision can be specified with the timestamp data type, but if no precision is specified, the default of 6 (microseconds) is used. Precision on timestamp can be from 0 to 12. For date and time, the number of bytes can be read directly from the table. For timestamp, it is a function of precision. For example, if a column is defined simply as
TIMESTAMP, the precision is 6, and the number of bytes is (6+1)/2+8 or 11 (taking the integral part of the number):
All character types include some value or limit for their length. For CHAR, the value supplied is the exact value of space occupied. For VARCHAR, the number supplied is the upper limit of space that may be occupied. When we’re calculating the row size in terms of varying data types like this, we consider the maximum row size instead of minimum or average size, because our pages have to have enough space to accommodate the maximum. For example, a column defined as
VARCHAR(3000 OCTETS) would have a maximum size of 3005 bytes:
Octets means that the number specified is in bytes, and that column can hold 3000 bytes of data – the actual number of characters allowed depends on whether those characters are single-byte or double-byte characters.
For small columns that you expect to be populated all or most of the time, consider this – a
CHAR(2) column would take up only 3 bytes per row (if nullable), while a
VARCHAR(2) column would take up 7 bytes per row. If most rows have values for a column, CHAR is often a more efficient choice than VARCHAR for very short strings.
VARCHAR(1) should never be used and any value in the single digits for VARCHAR is questionable and valuable only if a majority of the values are left empty in that column.
LOB and XML types are a special case. For XML and LOB data, the data is stored outside of the traditional table structure. This allows for larger amounts of text or other data that would never fit on a single page. LOBs can be performance problems, particularly for transactional databases. For LOBs, a LOB Locator is stored within the table, and the size of that LOB locator depends on the maximum size of the LOB allowed in that column. Additionally, if LOBs of a smaller size are common, the table specification can allow the smaller lobs to be “inline” with the other table data. These inline lobs must fit on the page, and some LOBs may fit for some rows, while other LOBs for other rows do not fit. So if INLINE is specified, the LOB contributes that INLINE size plus a couple of bytes to the row. For example, if a column is defined as
CLOB(2M), without inlining, the space in the row would be 169 bytes:
You don’t have to account for the full size of the LOB when calculating row sizes, however, when calculating the table size or tablespace size, you still have to account for the LOB space, because it will be occupied on disk. Note that when a table is created, a “LONG” tablespace can be specified and in this case LOBs may be stored in a different tablespace than the rest of the data for a table.
For Existing Tables
This is all much easier for tables that already exist. This excellent blog entry includes a stored procedure with all the SQL to calculate this for an existing table. I can’t improve on excellent work like this by Serge Rileau.
A rough estimate can be achieved for an existing table by doing a
DESCRIBE TABLE and add up the numbers, which are generally roughly in bytes, though the LOB values will include the full size of the LOB which may be stored in a different place, and may not need to be a part of your row width calculation, depending on your purpose in calculating.
$ db2 describe table sales Data type Column Column name schema Data type name Length Scale Nulls ------------------------------- --------- ------------------- ---------- ----- ------ SALES_DATE SYSIBM DATE 4 0 Yes SALES_PERSON SYSIBM VARCHAR 15 0 Yes REGION SYSIBM VARCHAR 15 0 Yes SALES SYSIBM INTEGER 4 0 Yes
Using the rough calculation method for this table, we come up with 38 for an estimated row width. If we refer back to the tables in the IBM DB2 Knowledge Center, we would come up with 20 for each VARCHAR(15) column, 5 for the date column, and 5 for the Integer column for an actual row width of 50. Not exact, but may get you to the ballpark if you need a fast answer.
It may also be useful for tables that have variable columns to include a look at the average row size. This is calculated when runstats are collected and stored in syscat.tables. You can query it like this:
$ db2 "select AVGROWSIZE from syscat.tables where tabschema='DB2INST1' and tabname='SALES' with ur" AVGROWSIZE ---------- 44 1 record(s) selected.
Of course, you’ll need to replace DB2INST1 and SALES with the table schema and table name for the table you’re looking at.
How to Choose a Page Size
Once you have the row width, you can decide which page size is best for a table. Generally in OLTP or transaction processing databases, you want to place most tables on the smallest possible page size that will work. The reason for this is that it optimizes for the singleton-row queries that are the most critical performance focus of these databases. With the smallest possible page size, the database has to handle the smallest possible amount of data for small queries that should run fast. For Data Warehouses, usually the best idea is to go with 32K pages because the critical workload is handling large volumes of data, and it is more efficient to fetch more data at once.
In any case, it is a better idea to choose one or two page sizes that work for you rather than managing tablespaces and bufferpools for all 4 possible page sizes. This simplifies administration.
Consider things other than just “does this row fit on X page”, but also what makes sense geometrically. For example, if you have a row size of 2003 bytes that is always fully used, that means that only one row will ever fit on a page, and the other 2002 bytes per page will be wasted, leaving nearly half of the allocated space always empty. If you instead place that table in a tablespace with a 32K page size, you’ll be wasting 620 bytes per page, and with fewer pages, you’ll have to waste only about 2% of the space in the tablespace. That’s a huge difference that a few minutes of math can save, especially when multiplied across the multiple MB or GB of a good-sized table.