Batching jobs: multiple CPUs, multiple hosts, long list...

Further addendum: now I've also found GNU parallel, http://www.gnu.org/software/parallel/ , which works rather like xargs, reading commands from stdin; it permits multiple simultaneous jobs on the local and remote machines, transfer of files, and so on.
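
For example (a rough sketch only, with invented file names: nodes.txt listing the hosts, commands.txt holding one command per line), a whole list of commands can be spread over several machines with something like

        parallel --sshloginfile nodes.txt < commands.txt

where an entry such as `4/somehost' in nodes.txt should ask for four simultaneous jobs on that host; check the parallel manual for the details.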

Addendum: I see a very similar purpose is served by 'ppss', http://code.google.com/p/ppss/. This will almost certainly be more 'professionally' packaged than the offering below. But you're welcome to it anyway...

Purpose

The scripts offered here are useful when one has several non-interactive jobs to be done on a computer. The user sets up a set of jobs, each as a single executable file (any script or binary). A 'coordinator' script is run; every time its current job finishes, it finds the next un-taken job and runs that. Multiple coordinators can be run, to allow multiple CPUs to be used. Coordinators can even be run on multiple computers, using a shared filesystem to share the jobs and data and to coordinate which next job to take.

There is no need to have multiple computers or multiple CPUs: the scripts can run a list of jobs in sequence on one CPU (but perhaps the common command `batch', part of `at', would be more suitable?), or indeed on multiple CPUs on one computer. These are just particular cases of the general setup.

The methods used here have been extremely helpful in many applications, and the scripts have been combined with several other environments (perl, matlab, other shell scripts) for tasks ranging from FEM simulations within optimisation loops through to video transcoding.

Limitations

This system is no (direct) help in getting a single job to run faster (multi-threading, etc.); it only helps in running many jobs automatically. However, a lot of the heavy simulation work that I've been involved in has the feature of one or a few programs that need to be applied to many independent sets of input parameters. These parameter sets can then be run in completely separate instances of the program, a perfect situation for the use of these scripts.

Since the scripts are slow, and they need some communication through a shared filesystem to choose which job to take next, this method is not suitable for running really short jobs (of the order of 1 s each). In that case there will be a lot of overhead, and a higher risk of trouble from simultaneous attempts to run the same job (file-locking over NFS could doubtless be done better than I've done here).

Files

The loose files are in batch_jobs/. Download instead a tarball of the whole lot if you actually want to try running this.

Variations

Note that all the setting up of a batch base-directory, scripts, etc., and collection of results, can be done from many other environments, e.g. directly from octave or matlab, which can then run the shell scripts that start and control the batch. An example can be seen in grsim_simsweeppar.m, which was done for an older version of the batching scripts, in which each job's executable was in batchbase/exec/00000N.

The README


-----------------

Flexible, simple batching of jobs.
Multiple processor, multiple hosts.

Nathaniel, 2007-06.  Modified up to 2009-06.
Done as a quick fix for some simulation work.
Later handy for video transcoding jobs.
Public domain.

-----------------

The means of sharing information about jobs, results and progress is a shared
directory.  This is expected to be an NFS mount, with little (preferably no)
caching, and with the same path on each host.

The means of accessing the multiple hosts for starting jobs is ssh with
public-key authentication.  Assuming that the user's home directory is
shared over NFS by all hosts, it should only be necessary to do
        ssh-keygen -t rsa
(then press Enter a few times; no passphrase on the key) and then
        cat ~/.ssh/id_rsa.pub >>~/.ssh/authorized_keys
to make ssh between the hosts possible without a password.

----------------

A new batch of jobs has a new base directory created, in which a directory
called `work' contains a numbered child directory for each job.  These job
directories are given six-digit numbers, sequentially from work/000001 .
Another subdirectory of the batch's base directory is `logs', in which
information is kept about the running of the jobs.
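
In outline, then, a batch's base directory looks like this (the base
directory itself can be wherever and whatever the user chooses):

        <batch base directory>/
            work/000001/        one directory per job, each holding that job's
            work/000002/        executable and any files of its own
            work/000003/
            ...
            logs/               information about the running of the jobs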

An executable file named `run' must be in each job directory in order for
the job to be attempted.  This can be any type of executable file, e.g.
a binary, or a script (shell, perl, python, octave, ...).  Some useful
possibilities are:  a shell script that compiles a piece of source code
(from the job directory or from some other, common, file), then runs it,
perhaps with particular parameters for that job number; or a shell script
that starts for example scilab or matlab with a list of commands as its
input.
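
As a concrete (but made-up) illustration, a run-file could be a small shell
script along these lines; the program and file names here are only examples,
not anything provided with these scripts:

        #!/bin/sh
        # Example run-file for one job (a sketch only).
        # Work in this job's own directory, wherever the coordinator was started from.
        cd "$(dirname "$0")"
        # Run some program of the user's own on this job's input, keeping the
        # output and any error messages in the job's directory.
        "$HOME/bin/my_simulation" params.txt >output.txt 2>errors.txt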

The provided script ./new_batch can generate the directory layout of a new
batch, along with an example run-script in each job directory.


The file `hosts.list' gives hostnames and numbers of simultaneous
processes to run on each.  This list is used for starting and for checking
on the jobs.
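
The exact file format isn't shown here, but the idea is one host per line
together with the number of processes to run on it, as in this invented
illustration (take the real layout from the hosts.list that ends up in the
batch's base directory):

        node01    2
        node02    2
        bighost   8

which would ask for two simultaneous jobs on each of node01 and node02, and
eight on bighost.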

The scripts from  ./batching_scripts/ are copied into each new batch's
base directory.  The copies can then be run: they identify their own
parent directory as the base directory of the batch.

 = start_distributed_jobs
 logs into each host in the hosts.list, and starts
 local_job_background_start on each

 = local_job_background_start
 starts a specified number of backgrounded instances of
 local_job_coordinator on the host where it is run

 = local_job_coordinator
 go through numbers 000001, 000002, ... ; if a directory of this
 name doesn't exist under work/, then exit; else, if no run-file
 exists in that directory, try the next number; else, run the
 run-file, waiting for its exit (a rough sketch of this loop follows the list)

 = report_job_status
 run with argument hosts, jobs or findur (final durations) for some neatly
 printed information obtained from the files under logs/

 = check_running_processes
 log into the hosts listed in hosts.list; check the current user's
 running processes, CPU and memory use

 = mail_when_finished
 wait till all jobs are finished, then send email to a given address
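
In outline, the loop inside local_job_coordinator behaves something like the
sketch below.  This is simplified: in particular, the way a job is marked as
taken, and what gets written under logs/, are not shown faithfully here.

        #!/bin/sh
        # Simplified outline of a job coordinator (a sketch, not the real script).
        n=1
        while true
        do
            jobdir=$(printf "work/%06d" "$n")
            [ -d "$jobdir" ] || exit 0         # no such directory: nothing more to do
            n=$((n + 1))
            [ -x "$jobdir/run" ] || continue   # no run-file here: try the next number
            # Claim the job so no other coordinator takes it; mkdir is atomic,
            # though the real scripts' way of marking a job as taken may differ.
            mkdir "$jobdir/taken" 2>/dev/null || continue
            ( cd "$jobdir" && ./run )          # run the job, waiting for it to finish
        done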

--------------

So, to get started (a condensed example session follows the list):

* make sure a shared NFS dir and ssh public-key login are available
if multiple hosts are to be used

* modify new_batch to have suitable paths, number of jobs, content of job run-file

* run ./new_batch , and cd to the new batch's directory

* edit hosts.list if necessary; do any necessary changes to the run-files
or to other possible (e.g. input data) files in the job directories

* for multiple-host batches, run ./start_distributed_jobs ; for single-host
batches local_job_background_start can be used, though start_distributed_jobs
is fine too

* run ./report_job_status jobs
to check on progress

* write your own program for assembling all the output data, if it's written
to each working directory separately rather than being appended to a common
file
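
Put together, a typical multiple-host session might look roughly like this;
the batch directory path is invented, and $EDITOR just stands for however the
files get edited:

        $EDITOR new_batch                # set paths, number of jobs, run-file contents
        ./new_batch
        cd /shared/batches/my_batch      # the base directory that new_batch created
        $EDITOR hosts.list               # hosts and processes per host
        ./start_distributed_jobs
        ./report_job_status jobs         # repeat now and then to watch progress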

--------------



Page started: 2009-06-17
Last change: 2010-08-22