On December 12th, I presented on the DB2 Night Show on HADR and TSAMP Advanced Topics. See the replay here: http://www.dbisoftware.com/blog/db2nightshow.php?id=531

The Basics of Code Pages in DB2

You’ve got your database code page set as UTF-8, what more do you need to know, right? DB2 automatically handles all that conversion stuff for you, right?

I’ve been dealing with significant amounts of data using UTF-8 characters that differ from the unix/linux standard code page(in the US) of 819 for the first time in a couple of years, and thought I would share some of the things I’ve learned.

First, this article is awesome for understanding code pages and their importance. http://www.joelonsoftware.com/articles/Unicode.html

Seriously, go read that article now, and then come back for the rest of this post.

The important thing is that it is not just the codepage of your database, but also of the application you’re using, and of any files you’re working with. In my case, I noticed that characters came out ok when they got into the database through JDBC (through WebsphereCommerce or the Control Center or Squirrel or DataStudio), but not when I did things directly from the command line. After a bunch of testing and some pretty in-depth conversations with developers both on top of it and not so on top of it, I figured out that the problem is that when I’m at the command line, I was using code page 819.

How the Problem Looked

When querying the database at the command line, I would see this:

select CITY from dbatest.test_jurst where jurst_id = 100103

CITY
--------------------------------------------------------------------------------------------------------------------------------
LA UNIÃN

  1 record(s) selected.

In fact, all “special” characters showed up as “Ô at my command line, but turned into a number of other characters when accessed through a JDBC client.

The application or any JDBC application would return the data properly:

 CITY     
 -------- 
 LA UNIÓN

Naturally, my first (wrong) conclusion was that the data was being stored in the database incorrectly – the Command Line is always right, right?

But that was wrong. When I updated data from the command line, then tried to query it, it looked fine at the command line:

CITY
--------------------------------------------------------------------------------------------------------------------------------
LA UNIÓN

But when the data looked like this, my JDBC apps would literally fail to read it. Here’s what I got using Data Studio:

 CITY 
 ---- 
 NULL

That value for CITY sure isn’t NULL. Sometimes it would be outright errors.

Fixing the Problem

The fix was super easy for loading/importing CSVs – you can specify the code page the csv is encoded with using “modified by CODEPAGE=819″ or “modified by CODEPAGE=1208″.

It was actually tougher at the command line – for insert and update statements. What I finally came to, is that I had to change the codepage in PuTTY (which I use for connections). In the settings, under Window -> Translation, you can set “Received data assumed to be in which character set”. It looks like this:
Image of putty translation screen

If you’re using PuTTYCM, this is easy to do across all sessions by setting the default session settings that way. Not sure if that works for plain vanilla PuTTY too or not.

Now when I test the same actions that got me data that caused any JDBC client to fail in the table before, I get the expected results.

There may be downsides. My db2top looks a bit odd now:
Image of db2top using UTF-8 encoding

But I can live with that if it means I don’t get problem data in my database.

I hope this helps someone else who is looking into a similar issue. I thought for a while that I’d have to set DB2CODEPAGE, but there were enough warnings about it that I took longer to research and understand what I was really doing first.

Any good suggestions or links about dealing with code page issues, whether simple or complicated?

4 comments… add one

  • There are plenty of other issues, too:

    * On some Linux and UNIX systems you have to make sure that the appropriate language code pages have been installed with the operating system. You can check what languages are available using the `locale -a` command. Linux typically has UTF-8 codepages installed, but AIX often does not.

    * On Linux and UNIX systems, make sure you also look at the LANG environment variable, which controls what codepage you are using. The default value of this setting when you log in may not be Unicode (typically `en_US.UTF-8`, but also `EN_US.UTF-8` on AIX), so you may have to set this environment variable in your .profile. You can see all of your language-level settings by executing the `locale` command.

    * Even if you’ve got everything else set up, if the display font you are using is missing glyphs for various Unicode code points, you may see blanks or “garbage” on screen even when everything else is talking happily in UTF-8.

    Reply
    • I actually checked some of that first. LANG is particularly important, and was set to en_US.UTF-8 already in my environment. Good tips in this comment. For other readers, here is an example of executing the locale command:

      $ locale
      LANG=en_US.UTF-8
      LC_CTYPE="en_US.UTF-8"
      LC_NUMERIC="en_US.UTF-8"
      LC_TIME="en_US.UTF-8"
      LC_COLLATE="en_US.UTF-8"
      LC_MONETARY="en_US.UTF-8"
      LC_MESSAGES="en_US.UTF-8"
      LC_PAPER="en_US.UTF-8"
      LC_NAME="en_US.UTF-8"
      LC_ADDRESS="en_US.UTF-8"
      LC_TELEPHONE="en_US.UTF-8"
      LC_MEASUREMENT="en_US.UTF-8"
      LC_IDENTIFICATION="en_US.UTF-8"
      LC_ALL=
      
      Reply
  • A few additional comments that might help with understanding conversion a little bit better or at least give you an idea how it works internally:

    DB2 codepage conversion always happens on the receiving side. In UTF8 databases there is one exception where data is not stored in UTF8, but in UTF16BE (there is also one exception to that).
    It also depends on how data is bound by the client application.
    Here are all the exceptions as test cases (client is using UTF16LE, database is UTF8):

    case 1a) client inserts a varchar (bound as wchar)
    CLI does the byte swapping and sends UTF16BE. server converts UTF16BE to UTF8

    case 1b) client inserts a graphic (bound as dbchar)
    data is sent as is – no conversion
    the assumption is that the dbchar value is always in big endian even on a little endian machine

    case 1c) client inserts a graphic (bound as wchar)
    CLI does the byte swapping and sends UTF16BE. no conversion at server

    case 2a) client selects a varchar (bound as wchar)
    server sends UTF8. client converts from UTF8 to UTF16LE

    case 2b) client selects a graphic (bound as dbchar)
    data is sent as is – no conversion
    the assumption is that the dbchar value is always in big endian even on a little endian machine

    case 3c) client selects a graphic (bound as wchar)
    server sends UTF16BE. CLI does the byte swapping to UTF16LE


    Helmut K. C. Tessarek
    DB2 Performance and Development
    IBM Toronto Lab

    Reply
  • Hi,

    What value should I add in locale to display the hebrew character in AIX environment?
    When I used “locale -a” command, I have got the below reult.
    locale -a
    C
    POSIX
    EN_US.UTF-8
    EN_US
    en_US.8859-15
    en_US.ISO8859-1
    en_US

    Thank You!!!

    Reply

Leave a Comment