Friday, April 27, 2007

Compression tricks

Well, it's been over a year since the last update of this blog, and I see I've actually had more hits from other people than I've had from me by now. And I see that this post has been sitting around waiting to be published. So hey, let's throw it out there and see what happens. Here we go...

Saving resources is always a good idea, and often there's a lot of low hanging fruit. Every extra byte you have to store and transmit means it'll be that much later before you can go home.

So compression is a good way to go because it saves you disk space, bandwidth and I/O. Of course nothing is free - the CPUs have to make up the difference. When you consider how much faster CPUs are than comm lines and disk I/O, that's a good trade to make, especially on a PC where the CPU is goofing off much of the time anyway.

First the obvious - if you don't need a variable, drop it (a DROP statement or a DROP=/KEEP= data set option does it). Especially those little throwaway accumulators and counters you use in loop or array processing.
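
Something like this - the variable names are made up, but the pattern is everywhere:

data out;
  set in;
  array amt{12} amt1-amt12;
  do i = 1 to 12;
    amt{i} = round(amt{i}, .01);
  end;
  drop i;    /* throwaway counter - no reason to store it on every row */
run;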

SAS compression is the next easiest thing to do, either session-wide with OPTIONS COMPRESS=YES; or table by table with the COMPRESS= data set option (YES/CHAR for run-length encoding, or BINARY, which tends to do better on numeric-heavy data). But it doesn't always save space - every compressed observation carries a few bytes of overhead, so short records can actually make tables bigger. The log tells you how much you gained or lost, so check it.
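
Both flavors, for the record (mylib.claims and work.staging are stand-ins):

options compress=yes;                  /* default for everything created from here on */

data mylib.claims(compress=binary);    /* or override it one table at a time */
  set work.staging;
run;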

Better yet is to squeeze the space out of the data at the design stage. For instance, we often use variables that only ever hold a small range of integers. There's no point keeping those at the default numeric size of 8 bytes - use a LENGTH statement to shrink them to the minimum your system permits (2 bytes on mainframes, 3 on Unix and PCs). Or make them 1-byte character variables and let SAS convert them at run time, if you're tight enough on space to spend the extra CPU. The smaller the data is to start with the less good compression will do you, but it still pays to be frugal with space.
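
For instance (age and nkids are made-up names - and never shorten a numeric that holds fractions or large values, or you'll lose precision):

data out;
  length age 3 nkids 3;   /* 3-byte floor on Unix and PCs; integers up to 8,192 stay exact */
  set in;
run;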

Character variables can be set no longer than you'll ever need, but this is riskier if you're not familiar with your data - SAS truncates the overflow silently. When in doubt, let the compression capability earn its keep instead.
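
Before shrinking anything, it's worth a quick look at what's actually in there (last_name and in are hypothetical):

proc sql;
  select max(length(last_name)) as longest   /* how long does the column really need to be? */
  from in;
quit;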

If you're really hurting for space you can use other compression utilities like gzip, pkzip or winzip. They'll shrink a file even tighter than SAS can, but then of course you need scratch space to decompress a file whenever you want it back. So this works best when you have a lot of smaller files rather than one huge one, so you can reuse the same workspace over and over.
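
With gzip it's one command each way (pkzip and winzip are the same idea):

gzip big_flat_file           # replaces it with big_flat_file.gz
gunzip big_flat_file.gz      # and back again when you need it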

If you're working with flat files, compress them and use the PIPE access method. I've used it on Unix - I'm not sure it works elsewhere, but the companion manual for your operating system will tell you. The idea is that you give SAS a command that writes the file's contents to standard output, and SAS reads that stream as if it were a file (if you work with Unix and don't know what standard output is, it's time you learned). It goes something like this:

filename bigflat pipe "gunzip -c big_flat_file.gz";   /* -c decompresses to standard output */
data out(compress=yes);
  infile bigflat missover lrecl=whatever........;     /* INFILE, not FILE - we're reading */
  input ..........;
run;

Likewise there may be a use for the FTP and other newer access methods. The FTP access method lets you transfer a file and read it in one step, so you never have two copies of a big flat file lying around. Yep, RTFM again.
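
Roughly like this - the host, user, password and input layout are all placeholders, and mind that the password sits in your program in cleartext:

filename remote ftp 'big_flat_file.txt' host='ftp.example.com'
                user='myuser' pass='mypassword' cd='/data';
data out(compress=yes);
  infile remote missover lrecl=132;     /* use the real record length, of course */
  input claim_id $ 1-10 amount 11-20;   /* made-up layout */
run;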

Wanna get kinkier yet? If you're also writing code that writes the flat file in something like COBOL, CorWorks, Pro*C or the like, or you have a friendly geek to do it for you, try something like this:

Design your flat file generating process to write to the named pipe.
Write your SAS process to read from the named pipe.
Create the named pipe (mkfifo).
Start the SAS process.
Start the FFG process.
Run it all from a shell script - see the sketch below.
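
A minimal sketch, assuming the flat file generator is a program called ffg_writer and the reader is read_pipe.sas (both names made up); the SAS side just points its INFILE at the FIFO like any other file:

#!/bin/sh
mkfifo /tmp/bigflat.fifo             # create the named pipe
sas read_pipe.sas &                  # start the reader; it blocks until data shows up
ffg_writer > /tmp/bigflat.fifo       # the writer fills the pipe as SAS drains it
wait                                 # let the background SAS job finish
rm /tmp/bigflat.fifo                 # the "file" never actually landed on disk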

It's complicated, and it's for a pretty special situation. But if you do this, the flat file you're generating never lands on disk at all, which saves space, I/O and runtime - and believe it or not, I've actually done this.
