dbclean and data pruning intro
One of the most important factors in the performance of Commerce databases is the proper use of dbclean and data pruning.
It is a simple fact of life for OLTP databases that the same database with less data will perform better. This is especially true of Commerce databases, and even before a site has gone live, I try to have data pruning planned, scripted, and running on a regular basis.
In reality, some Commerce databases perform fine for months or even years without data pruning, but I've also seen severe performance issues caused by a lack of pruning as soon as a month or two after go-live.
If you google dbclean, you'll find dozens if not hundreds of miscellaneous references on the topic, most saying that you must run dbclean and giving some basic syntax example. What I'd like to do in future posts is cover the most critical areas in excruciating detail and talk about methodologies for dealing with common issues. This is my general post on dbclean and data pruning.
How long do I keep this stuff?
For most pruning areas, you'll need to work out how long to keep the data. This decision is often heavily influenced by business requirements. How long do users need to be able to access these objects? How long do you want to be able to access the objects for marketing or selling purposes? How long is the data even relevant? Is the Commerce database your "system of record" for the pruning area? For each area, I will cover some of the considerations you may want to weigh.
How do I get rid of data safely?
There are also multiple methods for pruning, though not every area can be pruned by every method. The major methods include:
- Direct programmatic pruning
- Commerce Scheduler Jobs
- DBClean standard jobs
- DBClean custom jobs
- Other custom pruning scripts
There are also cases where you may be able to reduce or limit creation of data.
dbclean itself has a number of options. Let's look at some of the highlights.
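To anchor the discussion, here is a representative invocation, assembled as a string so each part is labeled. The flags shown (-object, -type, -days, -db, -dbuser, -dbpwd, -commit) are the common ones, but check the exact syntax against your Commerce version; the database name and ids below are placeholders:

```shell
# Build a representative dbclean command; all values are placeholders.
DBCLEAN="./dbclean.sh"      # typically found under the Commerce bin directory
OBJECT="user"               # pruning area
TYPE="guest"                # subtype: guest users only
DAYS=30                     # prune objects older than 30 days
COMMIT=500                  # commit interval; see "Deletes are expensive" below
CMD="$DBCLEAN -object $OBJECT -type $TYPE -days $DAYS \
-db MALL -dbuser wcsadmin -dbpwd XXXXXXXX -commit $COMMIT"
echo "$CMD"
```

Note that the password appears in clear text on the command line, which matters for the id discussion below.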
Where dbclean runs
DBClean can run from any Commerce application server, against any database it can connect to. Some clients prefer to run all of their pruning from one application server; others run cleaning on an application server that matches the environment (stage, prod, dev, etc.). If your database server(s) sit on a separate tier, you cannot run dbclean there unless you also install Commerce on them, which is not normally done, for licensing reasons.
What ID runs DBClean
There are two ids we're talking about here. The first is the id from whose crontab dbclean is run, that is, the id you log in to in order to execute the command. This can be any id on the app server that has access to the dbclean executable, the logs directory, and the instance XML.
The other id is the one you specify as the "dbuser" on execution. This id must have select and delete privileges on the cleanconf table and on any tables being deleted from. It must also be one whose password you're willing to have exposed in a 'ps' process list. That's right: the full text of the dbclean command, including the password, is visible in a process list.
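In DB2 terms, the grants for the dbuser id look like the following. They are echoed here rather than executed; the WCS schema and DBCLEANER id are placeholder names, and the member table stands in for whatever tables your jobs actually delete from:

```shell
# Grants the "dbuser" id needs, echoed as db2 CLP commands rather than run.
# WCS (schema) and DBCLEANER (id) are placeholders for your own names.
GRANTS='db2 "GRANT SELECT ON TABLE WCS.CLEANCONF TO USER DBCLEANER"
db2 "GRANT SELECT, DELETE ON TABLE WCS.MEMBER TO USER DBCLEANER"'
echo "$GRANTS"
```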
Deletes are expensive
And deletes in a database with referential integrity (RI) like a Commerce database are really, really expensive. A delete on the member table, for example, can cascade to 300-400 tables. Because of this, deletes must be broken up into smaller chunks. DBClean offers the "-commit" option for this, which I recommend always using and setting to something between 500 and 1000.
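If you were doing the same chunking by hand at the db2 level instead of letting dbclean manage it, it would look roughly like this, using DB2's delete-over-fullselect idiom. The loop is echoed as a sketch rather than run; guest users (registertype 'G') are the classic example, but a real predicate will be more involved:

```shell
# Chunked delete sketch: delete at most 1000 rows, commit, repeat.
# Echoed rather than executed; table and predicate are illustrative only.
LOOP='while true; do
  db2 "DELETE FROM (SELECT 1 FROM USERS WHERE REGISTERTYPE = '\''G'\'' FETCH FIRST 1000 ROWS ONLY)"
  db2 "COMMIT"
  # stop when db2 returns SQL0100W (no more rows found)
done'
echo "$LOOP"
```

Small chunks keep each transaction's lock footprint and log usage bounded, which is exactly what "-commit 500" to "-commit 1000" buys you inside dbclean.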
What about things that DBClean doesn’t do out of the box?
There are two options for pruning things that DBClean doesn't already have figured out. The first is to create your own entry in cleanconf and then run dbclean for this custom area. This often works well and keeps things fairly standardized. The other option is to write your own script to delete things using any of a half-dozen or more different methods at the db2 level. I'll give you some ideas on this in a few of the area-specific posts. The key thing for either option is to test it very thoroughly, with data as close to real-world as you can get. You'd be amazed at some of the data inconsistencies you can run across.
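For the first option, a custom area is just a row in cleanconf. The sketch below uses the column names I'd expect (objectname, type, statement, plus the yes/no flags telling dbclean whether the -name and -days arguments apply), but verify them against the cleanconf schema in your Commerce version; the object name, table, and predicate are all hypothetical:

```shell
# Hypothetical custom CLEANCONF row; the ? placeholder is filled from the
# -days argument at run time. Verify the column list for your version.
SQL="INSERT INTO CLEANCONF
  (OBJECTNAME, TYPE, STATEMENT, NAMEARG, DAYSARG, SEQUENCE)
VALUES
  ('mycustomarea', 'obsolete',
   'delete from mytable where lastupdate < (current timestamp - ? days)',
   'no', 'yes', 0)"
echo "$SQL"
```

It would then be run as dbclean with -object mycustomarea -type obsolete -days 90 plus the usual connection arguments.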
Do you have time to clean?
If you start out behind on cleaning, you may never be able to catch up. Why? Remember that deletes are horribly expensive in Commerce databases. My base assumption, when I have no data on hardware or database size, is that I can delete only 150 users per minute. Consider a database with 5,000,000 guest users to delete: at that rate I'm looking at about 23 days of cleaning running nonstop, assuming I can run it during normal volume without performance impact, which is rarely true. It's unlikely in that scenario that I'll ever be able to delete the extra users. Most of the time in this kind of scenario, I have far fewer users that I actually want to keep. In that case, I can take an outage to export the data I want to keep, clear out the table, import everything I want to keep back into the table using replace, and then very, very carefully set integrity. Usually I can do that in 2-8 hours (after 20-40 hours of testing the method). Think these numbers are preposterous? They come from at least one real-world scenario.
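The export/clear/replace approach can be sketched in db2 CLP terms. It is deliberately echoed rather than run: registertype 'G' marks guest users in the standard schema, but the real keep criteria, the list of dependent tables needing set integrity, and the outage choreography all have to come from your own testing:

```shell
# Outage-window rebuild sketch, echoed as db2 CLP commands rather than run.
# LOAD ... REPLACE leaves dependent tables in check-pending state, so SET
# INTEGRITY has to cover them as well, in dependency order.
PLAN='db2 "EXPORT TO users_keep.ixf OF IXF SELECT * FROM USERS WHERE REGISTERTYPE <> '\''G'\''"
db2 "LOAD FROM users_keep.ixf OF IXF REPLACE INTO USERS"
db2 "SET INTEGRITY FOR USERS IMMEDIATE CHECKED"'
echo "$PLAN"
```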
The other thing implied here is that I can keep up if I delete every day. Those 5,000,000 users could have accumulated over years; if I'm looking at only, say, 40,000 new guest users per day, then I can keep up with a smaller window of deletes every night.
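The arithmetic behind both scenarios is worth making explicit; the 150 deletes per minute figure is the pessimistic baseline assumption from above:

```shell
# Back-of-the-envelope pruning throughput (shell integer math, rounds down).
RATE=150                                    # assumed users deleted per minute
BACKLOG=5000000                             # guest-user backlog
backlog_days=$(( BACKLOG / RATE / 60 / 24 ))
echo "backlog: about ${backlog_days} days of nonstop deleting"

DAILY=40000                                 # new guest users per day
nightly_minutes=$(( DAILY / RATE ))
echo "steady state: about ${nightly_minutes} minutes per night"
```

Even the steady-state case is roughly four and a half hours a night at that rate, which is why measuring your real delete throughput early matters so much.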
Keeping a Commerce database as clean as possible contributes to performance. It will also help if you have a performance-related PMR open with IBM support: they can't tell you to run dbclean if you're already doing it.
Over the next weeks (months?), I'll go through entries on the specific pruning areas and go into more detail than you ever thought you wanted.