Thursday, April 19, 2007

Compress follies

The excellent Data Steps blog discusses the compress() function here. It's terrific, particularly with the latest expansions that let you add characters by classes instead of just individually.

But yesterday I found one of these (╔) in a delimited file I had to parse. Compress would not filter it by character class no matter what I did with the incantations, and if Ron Cody knew how I didn't spot it in his book. This persisted even after a call to tech support. So I treated it as a special case and filtered it individually, because it seemed to be the only really pathological character in the data anyway, and I really didn't intend to make a career of this.

I already had the data in SAS by this time. I tried cutting and pasting one of the little buggers from a VIEWTABLE display, but that didn't work and in fact the variable didn't even display the same way it did in PROC PRINT.

Next step - show it in hex. So I wrote something like this:

data _null_;
set data_with_goofy_char'
put field_with_goofy_char=;
put field_with_goofy_char= $hex2100.;
if _n_>20 then stop;
run;

Here we use the $hexw. format. Make sure that w is at least twice the SAS field length because the hex will display 2 chars for each char of input.

If the data hadn't already been in SAS instead of in a text file, there really aren't any universally available utilities on Windows that I'm aware of that would have let me see the actual ASCII codes in the data. Another approach would have been to look at ASCII charts. In Unix the od command would work, and on mainframes I probably would have tried to open the flat file with ISPF in hex.

(Ever find an unprintable character in source code? My comments are unprintable too. The compiler can see it even if you can't, so looking at your source in hex might be a late-stage debugging method.)

Anyway, I looked at the log and there it cowered before me. Now that I knew what to filter, I wound up using some code that looked something like this:

retain kludge 'C9'x;
field_with_goofy_char=compress(field_with_goofy_char,kludge,"og");

Actually I did a lot more than that to get the job done. A little engineering was appropriate because I'm expecting to see more data from that source. But this is enough for this post, anyway.

No comments: