Thursday, April 19, 2007

Less is more

Yep, it's true, at least with respect to I/O and performance: less I/O yields more performance. If you're only working with a few hundred or a few thousand records, I/O and runtime are trivial unless you're doing multiple levels of Cartesian joins (more later). If you need more, here's some schematic code (i.e., untested) that can help.

data presort/view=presort;
   set that;
   new_variable = whatever;   /* derive whatever you need before the sort */
   /* ... any other last-minute finagling ... */
run;

proc sort data=presort out=sorted noequals;
   by key;
run;

What happens? The data in "presort" never lands on the disk; the view stores the logic, not the rows, and the records stream straight into PROC SORT. That's two trips to the disk you saved (writing the dataset out and reading it back in), and with big tables that adds up fast.
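For comparison, here's the conventional version that pays for both trips. It's the same code minus /view=, but now "presort" gets written to WORK in full and PROC SORT has to read every row back:

data presort;   /* no view: every row lands in WORK */
   set that;
   new_variable = whatever;
run;

proc sort data=presort out=sorted noequals;   /* ...and every row gets read back */
   by key;
run;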

And of course that's one less big object sitting around. I remember a Unix filesystem of terabyte scale that sometimes recovered 20% or more of its capacity on Friday afternoons. Why? People shut their PCs down and finally released the huge objects they'd stashed in their work directories courtesy of Enterprise Guide et al. Of course it's no trick to delete them as you go (more later), but better yet if they never even existed.
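If you do keep intermediates around, cleaning up as you go is one PROC DATASETS call; a quick sketch (the dataset names are just placeholders):

proc datasets lib=work nolist;
   delete presort sorted;   /* placeholders: whatever you're done with */
quit;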

Data can't get stale that way either. If it's in a view, you know it's fresh picked.

Of course then you don't have a copy of your data either. If you pulled a tiny subset of the input file and you're going to hit the results again and again, a view makes no sense: every reference re-executes the whole extraction, so you'd be paying the I/O over and over instead of saving it. Fine; use views for intermediate files like in the example. I do it a lot, if only because I'm always having to do some last-minute finagling before running a sort. I'll never need such sort input datasets for anything else, and this way they're self-disposing.
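When that's your situation, write the subset once and hit the copy; a minimal sketch with made-up names:

data subset;   /* small extract, written once, read many times */
   set big.master(where=(region='NE'));   /* hypothetical source and filter */
run;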

Now let's get kinkier:

data detail(drop=great_big_honking_string_label)
     ids(keep=key great_big_honking_string_label compress=yes)
     /view=detail;
   set denormalized;
   by key;
   /* ... build whatever the summary needs ... */
   if first.key then output ids;   /* one label row per key */
   output detail;
run;

proc summary data=detail nway;   /* nway: one row per key, no overall _TYPE_=0 row to foul the merge */
   class key;
   var amount;   /* hypothetical: the numeric variable(s) to sum */
   output out=summarized sum=;
run;

data presort/view=presort;
   merge summarized ids;
   by key;
run;

proc sort data=presort out=lib.keeper noequals;
   by great_big_honking_string_label;
run;

So what's the big deal? You're running the proc on "detail" at the same time you're writing "ids": the view executes once and feeds both. Yeah, you do have to write "ids", but you can compress it, and you can design things so it's naturally smaller too (more later). In my experience the total runtime is shorter this way than if "detail" were also written to the disk and then the procs were run.
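By the way, compress=yes as a dataset option only touches "ids"; if you'd rather compress everything for the session, there's a system option for that (compress=binary sometimes does better on numeric-heavy rows):

options compress=yes;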

Maybe you never run into scenarios where this would be useful, but when your datasets get bigger you'll start looking for ways to get more done in a given pass through the data to minimize your I/O.
