Recently, I was tasked with reducing the processing time of a long-running datastep.
The datastep read in a large dataset and split it into new ones according to the value of a variable, taking over 5 hours to complete the subsetting. Using a hash table approach reduced processing time to 5 minutes, mainly because of efficiencies in writing outputs. This was a special case for my SAS environment, but had a large impact thanks to the 60x increase in speed.
Input and Output
Input data was a single SAS dataset that was both long and reasonably wide: 120 million rows and 120GB in total, meaning around 1KB per row. This is easily the largest single dataset that gets used in my SAS environment and is aptly named huge.sas7bdat. Importantly, the input dataset was unsorted, and any solution needed to avoid explicitly sorting it; the overhead of sorting 120GB by the splitting variable was unworkable. The dataset could be read from disk or produced in-memory; the end user was flexible on this point.
Output was over 1000 SAS datasets stored on disk, split according to the value of a variable in the input dataset.
With the available information, my suspicion was that the limiting factor in this datastep was how quickly SAS could write the output datasets. This makes sense because:
- The CPU is only making a simple comparison, not doing lots of computation;
- The dataset can fit easily into the memory of our SAS server;
- The variable determining which output set a row should go in was unsorted. This can result in a lot of file open/close operations as SAS sprays rows across the 1000+ output files.
Filing Cabinet Problem
I’d like to expand on the last point using a thought experiment, to help demonstrate why the new approach improved performance. This is a crude simplification of what’s happening in different subsetting techniques, but it works well for our situation.
Imagine you have an unsorted pile of documents and a filing cabinet in which to store them. Each drawer will be used to store the documents relating to an individual client. Only one drawer of the filing cabinet may be open at any time. However, you can insert more than one document into a drawer, as long as documents from the same client appear sequentially within the pile. When filing the documents you spend time doing one of three actions:
- Opening a drawer;
- Inserting document(s);
- Closing a drawer.
For any unsorted pile of documents, the time spent inserting them into drawers is related to the size of the pile. Inserting 100 documents into a drawer will take roughly 100x the time of inserting 1 document; this much is obvious. However, time spent opening/closing drawers isn’t related to the size of the pile but to how random its order is. If you only had two drawers marked A and B, but a pile of documents in A, B, A, B… order, then you’d spend an awful lot of time opening and closing those drawers!
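That open/close cost can be made concrete with a small Python sketch. This is purely an illustration of the thought experiment, not of SAS internals; the function name and example piles are invented.

```python
def drawer_operations(pile):
    """Count drawer open/close pairs when filing a pile of client
    labels one document at a time, keeping a drawer open only while
    consecutive documents belong to the same client."""
    operations = 0
    current = None  # the drawer currently open, if any
    for client in pile:
        if client != current:
            operations += 1  # close the old drawer, open the new one
            current = client
    return operations

# A grouped pile touches each drawer once...
print(drawer_operations(["A", "A", "B", "B"]))  # 2
# ...while an alternating pile opens a drawer for every document.
print(drawer_operations(["A", "B", "A", "B"]))  # 4
```

Both piles are the same size, so insertion time is identical; only the ordering changes, and with it the number of drawer operations.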
If we’re going to minimise the time spent opening/closing drawers, we need a way to reliably group our documents; then we can lump them into their correct drawer in one hit. We could explicitly sort the pile first, but that would involve reading each document and then reinserting it back into the pile according to the client. With 120 million documents (or, rows in a dataset) and over 1000 clients (groups for output files), this would be a large task. When it came to insertion, we would also then have to read every document in the pile a second time, as we wouldn’t easily know where the boundary between two clients’ documents was.
Instead, we could nip on down to our local office furniture store and pick up some paper trays; one paper tray for each drawer in our filing cabinet. Paper trays give really quick access: you can fit a lot into them, and you don’t have to open or close them to add documents. Now, rather than filing our documents directly into the cabinet, we group them first in paper trays. When the pile is exhausted, simply empty the contents of each paper tray into the correct drawer.
This approach means we only read each document once, to check which client it belongs to. It also means the time spent opening and closing drawers is now related to the number of drawers (one open/close for each drawer), rather than how random our pile is. We haven’t explicitly sorted the data either, because the groupings are separate piles rather than one large one; although we could join them back together to finish the sort. If our thought experiment were a very dull short film, the credits would look something like this…
| Role | Played by |
| --- | --- |
| Pile of documents | Input dataset |
| Paper trays | Memory of the computer |
| Filing cabinet | Disk |
| You | CPU running SAS |
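The paper-tray technique can be sketched in a few lines of Python. This is a toy simulation of the idea only (the function name and documents are invented); the real mechanism in SAS is the hash object, covered below.

```python
from collections import defaultdict

def file_with_trays(pile):
    """Group documents into in-memory 'paper trays' in a single pass,
    so each drawer later needs exactly one open/close."""
    trays = defaultdict(list)           # one tray per client
    for client, document in pile:
        trays[client].append(document)  # no drawer is touched here
    return trays

pile = [("A", "doc1"), ("B", "doc2"), ("A", "doc3")]
trays = file_with_trays(pile)
# Each tray holds that client's documents in arrival order, so each
# drawer is opened once no matter how shuffled the pile was.
print(dict(trays))  # {'A': ['doc1', 'doc3'], 'B': ['doc2']}
```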
Enough About Filing Cabinets!
Yes, enough about filing cabinets.
With the above out of the way, let’s look at two pieces of code. The first is a simplified version of the original 5 hour datastep. In reality, a macro loop produced a huge number of “when” statements, but this is functionally comparable.
```sas
Data Out.A Out.B Out.C;
    Set myData;
    Select (myVar);
        When ("A") Output Out.A;
        When ("B") Output Out.B;
        When ("C") Output Out.C;
        Otherwise;
    End;
Run;
```
myData is read in, then split into various output datasets according to the value of myVar. Ignoring efficiency tricks at the hardware level (buffering etc.), every time an output statement is hit, the current row is written into another dataset on disk. This is like the first way of filing documents, where the number of file (drawer) open/close operations for each output dataset is dependent on how random the data is.
This next extract is pseudo-code, because SAS hash table code is pretty darn ugly. I’ll put a macro I experimented with in another post.
```sas
Data _NULL_;
    *Initialise hash table to contain the others;
    *Loop over data;
        *New hash table if new value for myVar, add to container;
        *Otherwise add data to an existing hash table;
    *End;
    *Loop over the hash tables;
        *Output each hash table;
    *End;
Run;
```
In this datastep, the data is read from the input once; there’s only one set statement. Every row of the data is read and grouped into a hash table (paper tray) according to the value of a variable. When this process is finished, each individual hash table is output to disk as a SAS dataset; much like the second approach to filing documents.
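As a toy end-to-end analogue in Python (an illustration only; the file layout, field names, and `split_out` directory are invented, and the real implementation is the SAS hash object code), the whole approach boils down to: read once, group in memory, write each group once.

```python
import csv
from collections import defaultdict
from pathlib import Path

def split_dataset(rows, key, out_dir):
    """Read rows once, group them in memory by `key`, then write each
    group to its own CSV with a single open/close per output file."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)     # the 'paper tray' pass
    out_dir = Path(out_dir)
    out_dir.mkdir(exist_ok=True)
    for value, group in groups.items():  # one open/close per output
        with (out_dir / f"{value}.csv").open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)
    return sorted(groups)

rows = [
    {"myVar": "A", "amount": "10"},
    {"myVar": "B", "amount": "20"},
    {"myVar": "A", "amount": "30"},
]
print(split_dataset(rows, "myVar", "split_out"))  # ['A', 'B']
```

However shuffled the input is, each output file is opened and closed exactly once, which is the whole point of the hash table version.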
After all that about filing cabinets earlier, there’s not too much more to say, apart from reiterating that this approach reduced a 5 hour datastep to 5 minutes. In reproducing this approach for an internal seminar, it was tried on a 30GB dataset with rows about 1.5KB wide, halving subsetting time from 40 to 20 minutes. Finally, it’s worth noting that subsetting using hash tables has extra efficiency built in, thanks to how they work internally; subsetting into SAS in-memory datasets first and then copying them to disk was no quicker.
If you want a really comprehensive explanation of how to use hash tables in SAS, I’d recommend this paper.