SAS Hash Table Subsetting - In Memory Splitting

Subsetting long and wide data (30Gb+) to disk, can be quicker using a hash table compared to a Select ... When.

Code

This code came about after I needed to do some quicker subsetting. It’s a macro based on examples 8-10 in this paper.

Obviously, you should understand this code before running it. It’s also provided as an example of something that worked for me, without warranty or support (along with everything else on this blog).

%Macro hashSubset(in=, out=, by=, keep=);
%let colon = :;
%If %index(&out., .) %then %do;
  %Put ERROR&colon. This macro should only output to the WORK library (in memory), it does not work with encrypted locations on disk.;
%Abort cancel;
%End;

%let keepVars = ;
%do iVar = 1 %to %sysFunc(countW(&keep.));
  %let keepVars = &keepVars., "%scan(&keep., &iVar.)";
%end;
%global lastOut;

data _null_ ;
  dcl hash hoh (ordered: 'a') ;
  dcl hiter hih ('hoh' ) ;
  hoh.definekey ("&by." ) ;
  hoh.definedata ("&by.", 'hh' ) ;
  hoh.definedone () ;

  dcl hash hh () ;
  do _n_ = 1 by 1 until ( eof ) ;
    set &in. end = eof ;
    if hoh.find () ne 0 then do ;
      hh = _new_ hash (ordered: 'a') ;
      hh.definekey ("&by.", '_n_') ;
      hh.definedata ("&by." &keepVars. );
      hh.definedone () ;
      hoh.replace () ;
    end ;
    hh.replace() ;
  end ;

  iOut = 0;
  do rc = hih.next () by 0 while ( rc = 0 ) ;
    iOut = iOut + 1;
    hh.output (dataset: "work.&out." || strip(put(iOut, $16.))) ;
    rc = hih.next() ;
  end ;
  call symputx("lastOut", iOut);
stop;
run;

%put NOTE&Colon. &lastOut. subsets created (access this number using lastOut macro var).;
%MEnd hashSubset;
%hashSubset(in=[dataset], out=[output dataset], by=[splitting variable], keep=[space separated list of variables to keep]);