Dynamic Parallel Processing in SAS (RSubmit)

Code with lots of datasteps doesn't tend to take advantage of multi-core environments. If you've got the right hardware, data and code to manage it, there is potential for large performance increases.

This post follows on from the fundamentals of parallel processing; reading that first will help you understand why this approach can improve performance.

## Datasteps, Procs and Multi-Threading

The vast majority of code run in my environment does not take advantage of our processor-rich servers. Code written by my current users tends to be datastep-heavy, with some procs involved for SQL statements, sorting and summarising.

Whilst there are multi-threading opportunities in the procs, datastep code remains stubbornly single-threaded. This comes from how the SAS Program Data Vector works: it operates on one row of a dataset at a time, from the beginning to the end of the file. So despite the advantages of specialised and highly optimised procs, on average each running SAS process only utilises a single core.
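For contrast, here's a hedged sketch of letting one of those multi-threaded procs loose (the dataset and variable names are placeholders, and whether Proc Sort actually spreads across cores depends on your site's THREADS and CPUCOUNT settings):

*allow threaded procs to use the cores the machine reports;
Options threads cpuCount=actual;

*Proc Sort can multi-thread, but any surrounding datastep still runs on one core;
Proc Sort data=work.big out=work.sorted threads;
  By someKey;
Run;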

Given that we already store our permanent data on fast disk, have in-memory work libraries and super-fast connections to our file servers, the limiting factor in most of our long-running code is the speed of a single core. The logical step to increase performance is therefore to split the work across many more cores, which is achieved by utilising more SAS sessions.

## SignOn and RSubmit

If you don’t have a SAS grid, or other explicit parallel computing solution attached to your SAS install, then RSubmit is still an option to run code in parallel. There’s lots in the SAS documentation about RSubmit that you should get familiar with, but I’ll go over some of the headlines here.

The SignOn and RSubmit commands are used in SAS to create a new SAS process and then send code to it. Assuming you can split your data appropriately, you can therefore have more than one SAS session processing it at the same time.
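In its simplest form the pattern looks like this (the session name is a placeholder, and this assumes a plain SignOn works on your install):

SignOn worker1;
RSubmit worker1;
  *this block executes in the new SAS process;
  %Put Hello from the remote session;
EndRSubmit;
SignOff worker1;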

The main technical difficulty with parallel processing in this way is ensuring that you specify the correct sascmd= option on the SignOn command. If you're lucky, a plain SignOn [SessionName] will create a new remote session for your current one to interact with. However, you may need to check with your SAS administrator if this doesn't work.

An error stating "A communication subsystem partner link setup request failure has occurred" indicates you probably need to manually specify startup options using sascmd=. This shouldn't be too much of a barrier; you just need to specify where the SAS.exe file lives and some basic options. Occasionally I have problems starting remote SAS sessions, and find that the following works as a minimum (though I add more performance-specific options)…

SignOn [valid sas name] sascmd="[path to SAS.exe] -Work ""[path to work lib]""";

Other important things to note about remote sessions include:

  • they can't share work libraries (which is a good thing, as it keeps jobs isolated);
  • macro variables must be manually pushed across;
  • macro code must be compiled within them (it can't be pushed across);
  • you should specify a log= option, so the logs don't come back to the spawning session and interleave (which gets ugly quickly!);
  • they can be told to exit immediately after execution has finished (connectPersist=no), or be manually disposed of using a SignOff command;
  • the spawning session can wait for _all_, _any_ or named remote sessions to finish processing using a WaitFor command (see the sketch after this list).
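To make those points concrete, here's a hedged sketch of the typical asynchronous pattern; the session name, macro variable and paths are all placeholders:

*start a remote session (add sascmd= here if your site needs it);
SignOn rs1;

*macro variables must be pushed across explicitly;
%Let inFile = \\server\data\input1.xml;
%SysLPut inFile=&inFile.;

RSubmit rs1 Wait=No ConnectPersist=No Log="\\server\logs\rs1.log";
  *any macros used here must be compiled in this session, e.g. via %Include;
  %Put Processing &inFile.;
EndRSubmit;

*block the spawning session until the remote one finishes;
WaitFor rs1;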

## Dynamic Parallel Processing

The approach I'll present here has been termed dynamic because it adapts to variable-sized jobs. A previous version of this code assigned jobs to sessions during initialisation, which meant that if one session ended up with far more work than the others (think processing XML files of different sizes), performance became limited by the speed of that single session.

In this approach, crafty use of WaitFor _any_ and the cMacVar macro variable means that each session only ever processes one job, then dies, and a new session starts for the next job. This ensures that the number of running sessions is always as close to the upper limit as possible.

For ease of use, the code below is split into three macros: one to initialise a job queue, one to add jobs to the queue, and one to run the jobs across remote sessions. The breakthrough that made this dynamic came from a colleague of mine (Jade Taffs), who had the good sense to turn my previous approach to the problem through 90 degrees and use WaitFor _any_.

### Initialise Job Queue

This macro sets up a dataset to contain the jobs that will be processed by the remote sessions. A space-separated list of variable names is passed in; the values of these will be set for each job as it is added to the queue.

%Macro initialiseJobQueue(vars=);
  %Local iVar;

  Proc DataSets Lib=Work noDetails noList;
    Delete _jobQueue;
  Quit;

  Data _jobQueue;
    Format code
    %Do iVar = 1 %to %sysFunc(countW(&vars.));
      %scan(&vars., &iVar.)
    %End;
    $256.;

    *need to initialise to avoid error...;
    code = "";
    %Do iVar = 1 %to %sysFunc(countW(&vars.));
      %scan(&vars., &iVar.) = "";
    %End;

    *...but delete the blank row;
    Delete;
  Run;
%MEnd initialiseJobQueue;
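For example, to set up a queue whose jobs will each carry two (hypothetical) values:

%initialiseJobQueue(vars=region year);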

### Add To Job Queue

This macro specifies the code file to run for the job and adds the values for the variables defined during %initialiseJobQueue. You'll need to make sure the values are provided in the same order that the variables were defined. Note too that the values list is split using %scan's default delimiters, so individual values can't contain spaces, periods or the other default delimiters.

%Macro addToJobQueue(code=, values=);
  %Local iValue valuesList;

  %Let valuesList = "&code.", ;
  %Do iValue = 1 %to %sysFunc(countW(&values.));
    %If %eval(&iValue. < %sysFunc(countW(&values.))) %then %do;
      %Let valuesList = &valuesList. "%scan(&values., &iValue.)",;
    %End;
    %Else %do;
      %Let valuesList = &valuesList. "%scan(&values., &iValue.)";
    %End;
  %End;

  %Put &valuesList.;
  Proc SQL;
    Insert into _jobQueue
    Values (&valuesList.);
  Quit;
%MEnd addToJobQueue;
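Continuing the hypothetical queue above, each job names its code file and supplies values in the order the variables were defined (the path is a placeholder):

%addToJobQueue(code=\\server\jobs\processRegion.sas, values=north 2015);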

### Run Jobs

This is the macro that actually kicks off the remote sessions. First the jobs are numbered, making management of variables easier. A loop from 1 to the number of jobs is entered, with the index used to pick up the variable values for each job. Each job's macro variable names and values are created in the spawning SAS session using Proc SQL. Once the session has started via SignOn, the values are pushed across to the remote session with %sysLPut _all_.

The number of sessions is managed by counting the &cMacVar. variables that equal 2; a value of 2 means the related session is still running. As soon as the number of remote sessions reaches the upper limit, the spawning session waits for one of them to finish. When one finishes, the loop continues and launches another session. Once the number of sessions launched equals the number of jobs, the loop stops launching more. The final WaitFor _all_ ensures the macro waits for the last sessions to finish before execution continues in the spawning session.

%Macro runJobs(sessions=, force=N);
  %Local colon iJob iJobVar jobVar running iCMacVar nJobs jobVars nJobVars code;

  %*build the colon separately so log-scanning tools don't flag this source;
  %Let colon=:;
  %If %eval(&sessions. > 12) and &force. eq N %then %do;
    %Put WARNING&colon. runJobs is limited to 12 sessions by default. If you know what you are doing, use Force=Y to ignore.;
    %Abort Cancel;
  %End;

*number up the jobs;
Data _jobQueue;
  Set _jobQueue;
  job = _N_;
Run;

*find out how many jobs we have;
Proc SQL noPrint;
  Select count(job)
  Into :nJobs
  From _jobQueue
  ;
Quit;

*find out what variables we have to pass over;
Proc Contents data=_jobQueue noDetails noPrint out=_jobVars;
Run;

Proc SQL noPrint;
  *job vars (excluding the code file and the bookkeeping job number);
  Select name
  Into :jobVars separated by "|"
  From _jobVars
  Where lower(name) not in ("code", "job")
  ;

  *number of job vars;
  Select count(name)
  Into :nJobVars
  From _jobVars
  Where lower(name) not in ("code", "job")
  ;
Quit;

*loop that launches remote sessions;
%Do iJob = 1 %to &nJobs.;
  Proc SQL noPrint;
    *pick up the code file for this job;
    Select code
    Into :code
    From _jobQueue
    Where job = &iJob.
    ;

    *create jobVar values locally so they can be pushed across;
    %Do iJobVar = 1 %to &nJobVars.;
      %Let jobVar = %scan(&jobVars., &iJobVar., |);

      Select &jobVar.
      Into :&jobVar.
      From _jobQueue
      Where job = &iJob.
      ;
    %End;
  Quit;

  SignOn R&iJob. Wait=Yes;

  %sysLPut _ALL_;

  RSubmit Log="[\\path to some log location]\R&iJob..log" New Wait=No ConnectPersist=No cMacVar=R&iJob.;
    Option noSyntaxCheck;
    %Include "&code.";
  EndRSubmit;

  *use cMacVar to tell how many are still running;
  %Let running = 0;
  %Do iCMacVar = 1 %to &iJob.;
    %If &&R&iCMacVar. = 2 %then %do;
      %Let running = %eval(&running. + 1);
    %End;
  %End;

  %Put running=&running.;

  %If %eval(&running. >= &sessions.) %then %do;
    WaitFor _ANY_;
  %End;
%End;
WaitFor _ALL_;
%MEnd runJobs;

Phew! That's a lot of code! As usual, it's here as a reference of something that worked for me; you're welcome to use it, but do so at your own risk. If you'd like to test the code, construct a datastep that uses the `sleep()` function, with the length of the sleep determined by a macro variable. Then create and run some jobs using different values for the `&sleepLength.` variable, to simulate jobs with differing workloads.
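As a hedged sketch of that test (the job file path is a placeholder, and `sleep()`'s unit argument is given explicitly so the length is in seconds):

*contents of the job file, e.g. \\server\jobs\sleepJob.sas;
Data _null_;
  *&sleepLength. arrives via the %sysLPut _all_ inside %runJobs;
  slept = sleep(&sleepLength., 1); *unit of 1 means seconds;
Run;

*driver code in the spawning session;
%initialiseJobQueue(vars=sleepLength);
%addToJobQueue(code=\\server\jobs\sleepJob.sas, values=5);
%addToJobQueue(code=\\server\jobs\sleepJob.sas, values=30);
%addToJobQueue(code=\\server\jobs\sleepJob.sas, values=10);
%runJobs(sessions=2);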