Friday, April 27, 2007

Compression tricks

Well, it's been over a year since the last update of this blog, and I see I've actually had more hits from other people than I've had from me by now. And I see that this post has been sitting around waiting to be published. So hey, let's throw it out there and see what happens. Here we go...

Saving resources is always a good idea, and often there's a lot of low hanging fruit. Every extra byte you have to store and transmit means it'll be that much later before you can go home.

So compression is a good way to go because it saves you disk space, bandwidth and I/O. Of course nothing is free - the CPUs have to make up the difference. When you consider how much faster CPUs are than comm lines and disk I/O, that's a good trade to make, especially on a PC where the CPU is goofing off much of the time anyway.

First the obvious - if you don't need a variable, drop it. Especially little throwaway accumulators/counters you might be using in loop or array processing.
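
For instance, a drop= data set option (or a DROP statement) keeps a throwaway loop counter out of the output table. A quick sketch - the table and variable names here are made up:

data monthly(drop=i);
   set work.raw;                 /* hypothetical input table */
   array m{12} month1-month12;   /* assumes month1-month12 exist on work.raw */
   total=0;
   do i=1 to 12;                 /* i is only needed inside the loop */
      total+m{i};
   end;
run;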

SAS compression is the next easiest thing to do, either via an OPTIONS statement or a data set option. But it doesn't always save space, and in fact can make tables bigger, so check the log to see whether it actually helped - RTFM if you want details.
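
Roughly, the two flavors look like this - a global option that covers everything you create from then on, or a data set option for just one table (the names are invented):

options compress=yes;                     /* everything created from here on */

data work.claims_packed(compress=yes);    /* or just this one table */
   set work.claims;
run;

Either way the log tells you whether the table got smaller or bigger, so look at it.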

Better yet is to squeeze the space out of the data at the design stage. For instance, we often use variables that only assume a very small range of integer values. In this case there's no point in using the default numeric variable size of 8 bytes - squeeze it down to the minimum size permitted by your system (2 on mainframes or 3 on Unix and PCs if I remember right). Or maybe make them 1 byte character variables and let SAS change their data types at run time if you're really tight for space and have CPU to spare. The smaller the data is to start with the less good compression will do you, but it still pays to be frugal with space.

Character variables can be set no longer than what you'll ever need, but this is riskier if you're not familiar with your data. Let the compression capability earn its keep.
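
A sketch of both ideas - shrink the small-integer numerics and right-size the character fields (the lengths here are invented, so check what your data really needs before copying this):

data work.slim;
   length flag 3             /* small integer - 3 bytes is the floor on Unix and PCs */
          status_cd $ 2;     /* character field cut down to what it really holds */
   set work.fat;             /* hypothetical fat table */
run;

The LENGTH statement needs to come before the SET statement, or SAS keeps the lengths it finds on the incoming table.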

If you're really hurting for space you can use other compression utilities like gzip, pkzip or winzip. They'll shrink the file even tighter than SAS can, but then of course you have to have the space to make the compressed files. So this works best when you have a lot of smaller files rather than one huge file, so you can use the same workspace over and over again.

If you're working with flat files, compress them and use the PIPE access method. I've used it on Unix - I'm not sure if it works on others, but the companion manual for your operating system should tell you if it will work there. Anyway, the idea is that you write a process that writes to standard output, and if you work with Unix and don't know what that is it's time you learned. You'd best RTFM for details - working from memory you'd write something like this:

filename bigflat pipe "gunzip -c big_flat_file.gz";   /* -c sends the decompressed data to stdout for SAS to read */
data out(compress=yes);
infile bigflat missover lrecl=whatever........;
input ..........;
run;

Likewise there may be a use for the FTP or other newer access methods. The FTP access method lets you transfer a file and read it in one step without forcing you to have two copies of a big flat file lying around. Yep, RTFM again.
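
From memory it looks something like this - host, user and file names are obviously invented, and your network and security folks may have opinions:

filename extract ftp 'big_extract.txt'
   host='somehost.example.com'
   user='myuserid' prompt;        /* prompt asks for the password at run time */

data work.extract;
   infile extract missover lrecl=300;
   input id $ 1-10 amount 11-22;
run;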

Wanna get kinkier yet? If you're also writing code that writes the flat file in something like COBOL, CorWorks, Pro*C or the like, or you have a friendly geek to do it for you, try something like this:

Design your flat file generating process to write to the named pipe.
Write your SAS process to read from the named pipe.
Create the named pipe (mkfifo if I recall correctly).
Start the SAS process.
Start the FFG process.
Run it all from a shell script.
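
The SAS side of it is surprisingly boring - on Unix a named pipe reads like any ordinary file. A sketch, with the pipe path and record layout invented for illustration (the mkfifo and the flat file generator get kicked off from the wrapper shell script, not from SAS):

filename ffpipe '/tmp/ff_pipe';   /* created beforehand with: mkfifo /tmp/ff_pipe */

data work.loaded(compress=yes);
   infile ffpipe missover lrecl=200;
   input key $ 1-10 amount 11-20;
run;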

It's very complicated and a very special situation. But if you do this then the flat file you're generating never ever lands on the disk, which saves space, I/O and runtime, and believe it or not I've actually done this.

Thursday, April 19, 2007

Less is more

Yep, it's true, at least wrt I/O and performance - less I/O yields more performance. If you're only working with a few hundred or thousand records, I/O and runtime are trivial unless you're doing multiple levels of Cartesian joins (more later). If you need more, here's some schematic code (ie untested) that can help.

data presort/view=presort;
set that;
new_variable=whatever;
.....
run;

proc sort data=presort out=sorted noequals;
by key;
run;

What happens? The data in "presort" is never written to the disk (at least not as a permanent file). Then when you use the data, you don't read it *again*. That's two trips to the disk you saved (writing it out and reading it in), and with big tables it adds up big time.

And of course that's one less big object sitting around. I remember a Unix filesystem of terabyte scale that sometimes recovered ~20% or more of its capacity on Friday afternoons. Why? People shut their PCs down and finally released the huge objects they'd had stashed in their work directories courtesy of Enterprise Guide et al. Of course it's no trick to delete them as you go (more later), but better yet if they never even existed.

Data can't get stale that way either. If it's in a view, you know it's fresh picked.

Of course then you don't have a copy of your data either. If you pulled a tiny subset of the input file and you're going to hit the results again and again, a view makes no sense - every use re-reads the big input, so you aren't saving I/O anymore. Fine - use this for intermediate files like in the example. I do it a lot if only because I'm always having to do some last minute finagling before running a sort - I'll never need such sort input datasets for anything else, and this way they're self-disposing.

Now let's get kinkier:

data detail(drop=great_big_honking_string_label)
ids(keep=key great_big_honking_string_label compress=yes)
/view=detail;

set denormalized;
by key;
...
if first.key then output ids;
output detail;
run;

proc summary data=detail nway;   /* nway keeps just the by-key rows so the merge by key below lines up */
class key;
output out=summarized sum=;
run;

data presort/view=presort;
merge summarized ids;
by key;
run;

proc sort data=presort out=lib.keeper noequals;
by great_big_honking_string_label;
run;

So what's the big deal? You're running the proc on dataset "detail" at the same time you're writing "ids". Yeah, you do have to write "ids", but you can compress it and you can design things so it's naturally smaller too (more later). The total runtime in my experience is shorter if done this way than if "detail" were also written to the disk and then the procs were run.

Maybe you never run into scenarios where this would be useful, but when your datasets get bigger you'll start looking for ways to get more done in a given pass through the data to minimize your I/O.

Compress follies

The excellent Data Steps blog discusses the compress() function here. It's terrific, particularly with the latest expansions that let you add characters by classes instead of just individually.

But yesterday I found one of these (╔) in a delimited file I had to parse. Compress would not filter it by character class no matter what incantations I tried, and if Ron Cody knows how, I didn't spot it in his book. This persisted even after a call to tech support. So I treated it as a special case and filtered it individually, because it seemed to be the only really pathological character in the data anyway, and I really didn't intend to make a career of this.

I already had the data in SAS by this time. I tried cutting and pasting one of the little buggers from a VIEWTABLE display, but that didn't work and in fact the variable didn't even display the same way it did in PROC PRINT.

Next step - show it in hex. So I wrote something like this:

data _null_;
set data_with_goofy_char;
put field_with_goofy_char=;
put field_with_goofy_char= $hex2100.;
if _n_>20 then stop;
run;

Here we use the $hexw. format. Make sure that w is at least twice the SAS field length because the hex will display 2 chars for each char of input.

If the data had still been in a text file rather than already in SAS, I don't know of any universally available Windows utility that would have let me see the actual ASCII codes in the data. Another approach would have been to look at ASCII charts. On Unix the od command would work, and on mainframes I probably would have tried opening the flat file with ISPF in hex mode.

(Ever find an unprintable character in source code? My comments are unprintable too. The compiler can see it even if you can't, so looking at your source in hex might be a late-stage debugging method.)

Anyway, I looked at the log and there it cowered before me. Now that I knew what to filter, I wound up using some code that looked something like this:

retain kludge 'C9'x;   /* 'C9'x is the hex code for that stray box-drawing character */
field_with_goofy_char=compress(field_with_goofy_char,kludge,"o");   /* "o" scans the character list only once */

Actually I did a lot more than that to get the job done. A little engineering was appropriate because I'm expecting to see more data from that source. But this is enough for this post, anyway.

Wednesday, April 18, 2007

Explicit...

I'm a raving SAS partisan. Yeah, it's my current meal ticket, but it's not the only thing on my resume - it's just that good. But that's not to say that there isn't room for improvement.

For instance, SAS doesn't require declaration of variables. That's part of why it's very nimble - it's incredible what you can do with very little code. If you only write tiny code blocks and run them one at a time, it's terrific. And if you're the kind of person who writes a data step to rename a variable you might never regret this.

Today's hardware can be lightning fast, CPUs and disks are multiplying, and SAS procs have been refined to the point that running through a few million records isn't the kind of thing you need to coordinate any more. Run time for processing a few billion is on the order of hours instead of days - or instead of being simply impossible.

Yet capacity is still finite, and I/O takes forever relatively speaking. I remember the old days, and when I read gigabytes of data I want to wring it dry before I let it go. That means writing more code and thus having more opportunities for error.

Many of those opportunities involve typos. Mistyping a variable name in data step code gives you a new variable, a syntax error, or a crossed wire. If it's not a syntax error, SAS can't help you.

If you're paying attention to your SAS log, those typo variables will generally result in a note like "variable TYPO is uninitialized". (If you can't explain notes like that I might have to come by and bop you upside the head.)
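
For instance, a little sketch with a deliberate typo:

data typo_demo;
   amount=100;
   total=ammount*2;   /* meant "amount" - SAS quietly creates a new variable */
run;

No error, no warning - just "NOTE: Variable ammount is uninitialized.", a note about missing values being generated, and a missing value in total.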

Other languages like perl and BASIC derivatives like VB let you do without declarations too, so the compiler can't help keep you out of trouble. But perl and VB* also permit you to force declarations (use strict, Option Explicit).

It seems that providing a SAS option like strict or explicit could do a lot of good without a tremendous amount of work. It's not as if you'd have to rewrite every PROC - my SWAG is that the impact would be confined to the compiler, and it's not as if nobody ever taught a compiler to demand declarations.

Enough for now. Don't go mistyping any variable names, OK?

Tuesday, April 17, 2007

Grand opening

Welcome to what I hope will be a regularly updated blog focused on SAS.

Among other things in a bizarre career, I've used SAS on mainframes, various Unixen and PCs since version 6. Part of that time was spent as a contractor in about a dozen locations in several industries. A common theme was very large scale data processing, millions of records at a time, coming from wherever it might be found. That usually has meant DB2 or Oracle, but SAS can use it wherever it comes from.

More often than not I've been the local SAS geek, applying it to whatever might come up for greenhorns, statisticians, business analysts and others. But there's lots of expertise out there, and I hope some of it starts showing up here. I don't know everything (yet...), so let's share ideas, code and war stories!

And let's *not* share politics. I'm not innocent, but I have another blog for that. Heaven forbid that even SAS should become a partisan issue.

Comments are welcome, but my word is law and I will moderate and delete as ruthlessly as necessary to satisfy my whims. Hint - if it's about personalities, it's likely to be out of bounds.