What to do with a Character you Cannot Query

This post comes out of one of the more challenging technical problems I have encountered recently.

Problem Description

An international client went live recently. We did the work of building their new WebSphere Commerce 7 website, and go-live involved loading some data from their old site. I’m not sure exactly how that load was done, but I suspect massload (a WebSphere Commerce Utility) was used. Somehow the data in that load file was messed up, and there were issues where the last name should have been:
MUÑOZ
but instead showed up as:
MU&#x2592OZ
This was put very literally into the database, so we had bad data in the database.

It wasn’t just that character. I identified several other incorrect representations of either the same character or other “special” charaters, including:



ã

Ã

The thing is that all of the others were easy to query for and find. As inefficient as a LIKE on ‘%string%’ can be, when it is a one-time thing and critical, it worked.

But when I tried a LIKE on ‘%▒%’, I got no results, even though I knew the value existed. Even a search on = ‘MU▒OZ’ returned nothing when I knew there were dozens of them. Even a count on it pulled up nothing. I even tried LIKE ‘%@▒%’ escape ‘@’, and that did not work.

The even bigger problem here is that when anything with this character was queried using application tools, the application tool (such as CMC) simply stopped and reported “error”. The application was unable to proceed. This actually appeared to happen for all JDBC tools since I got similar results in CC and Data Studio (yes, I stooped to the use of GUIs to check things out).

The problem statement became: How do I query a character that does not really seem to be there – that DB2 SQL for some reason cannot read?

Finding a Resolution

I am more and more convinced that a expert is not someone who knows everything about a topic, but someone who knows a lot and knows who to ask when they get stuck. After googling and searching the info center and IBM support site, I sent an email out to all of my DBA friends looking for some help. One person was able to get the question in front of a well-known IBMer who gave me the suggestion that worked.

They suggested looking at the actual binary values that made up the string. Using that, I was able to come up with the answers. To figure out what binary the character was, I looked at some of them using this query:

select substr(lastname,1,18) as lastname, cast(lastname as varchar(50) for bit data) as bin_lstname from wscomusr.address where lastname like 'MU%' with ur

LASTNAME           BIN_LSTNAME
------------------ -------------------------------------------------------------------------------------------------------
MU;OZ              x'4D553B4F5A'
MU▒OZ              x'4D55D14F5A'
MUÑOZ              x'4D55C3914F5A'
...

From this, I could tell that the problem character was ‘D1’. So assuming that D1 wasn’t the random combination of the end of some other character and beginning of some other character, I found the rows using:

select lastname from wscomusr.address where locate(x'D1',lastname) > 0 with ur

LASTNAME
--------------------------------------------------------------------------------------------------------------------------------
MU▒OZ
MU▒OZ
MU▒OZ
...

Since we had proof that ▒ was not always Ñ, we chose not to try to replace anything. We were dealing with just a few thousand occurrences, so I queried the primary keys of the rows with issues, and the developer used the input file to write appropriate update statements for each one.

This was all days of work to work through, of course – don’t get the impression that any of this was easy.

I put this out there so that anyone else who runs across this or a similar issue can learn from my mistakes, and hopefully find an answer without having to email every DBA they know.

You may also like...

3 Responses

  1. Éric says:

    Great post Ember!!

    Congrats

  2. Manali says:

    Good one!

Leave a Reply

Your email address will not be published. Required fields are marked *