Further addendum: I've now also found GNU parallel, http://www.gnu.org/software/parallel/ , which works rather like xargs, reading commands from stdin; it permits multiple simultaneous jobs on the local and remote machines, transfer of files, etc. etc.
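For example (a rough sketch only; jobs.txt and the host names are placeholders, and the exact options should be checked against the parallel documentation):

    # run the commands listed in jobs.txt, four at a time, on this machine
    cat jobs.txt | parallel -j4

    # or spread the same commands over two hosts reachable by ssh, two jobs on each
    cat jobs.txt | parallel -j2 -S host1,host2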
Addendum: I see a very similar purpose is served by 'ppss', http://code.google.com/p/ppss/. This will almost certainly be more 'professionally' packaged than the offering below. But you're welcome to it anyway...
The scripts offered here are useful when one has several non-interactive jobs to be done on a computer. The user sets up a set of jobs, each as a single executable file (any script or binary). A 'coordinator' script is run; every time its current job finishes, it finds the next un-taken job and runs that. Multiple coordinators can be run, to allow multiple CPUs to be used. Coordinators can even be run on multiple computers, using a shared filesystem to share the jobs and data and to coordinate which next job to take.
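In outline, each coordinator does something like the following. This is only a simplified sketch of the idea, not the actual local_job_coordinator script, and the way a job is marked as taken is glossed over:

    #!/bin/sh
    # simplified coordinator: walk the numbered job directories in order
    # (run from the batch base directory)
    n=1
    while true; do
        d=$(printf 'work/%06d' "$n")
        [ -d "$d" ] || exit 0          # no such directory: no more jobs, stop
        if [ -x "$d/run" ]; then
            # (the real script also records the job as taken, so that other
            # coordinators skip it; that detail is omitted in this sketch)
            ( cd "$d" && ./run )       # run the job and wait for it to finish
        fi
        n=$((n + 1))
    done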
There is no need to have multiple computers or multiple CPUs: the scripts can run a list of jobs in sequence on one CPU (but perhaps the common command `batch', part of `at', would be more suitable?), or indeed on multiple CPUs in one computer. These are just special cases of the general setup.
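For the single-computer case, `batch' can indeed be used directly; a sketch follows, assuming the work/NNNNNN/run layout described further down, and noting that `batch' decides when to start each queued job from the system load rather than running them strictly one after another:

    # queue every job's run-file with batch(1); atd starts them when load permits
    for d in work/*/ ; do
        echo "cd $PWD/$d && ./run" | batch
    done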
These methods have been extremely helpful, and the scripts have been used together with several other environments (perl, matlab, other shell-scripts), in applications ranging from FEM simulations within optimisation loops to video transcoding.
This system is not a (direct) help in getting a single job to run faster (multi-threading, etc.); it only helps in running many jobs automatically. However, a lot of the heavy simulation work I've been involved in has the feature of a single program, or a few programs, that need to be applied to many independent sets of input parameters. These parameter sets can then be run in completely separate instances of the program: a perfect situation for the use of these scripts.
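As an illustration of that pattern (hypothetical names throughout; the ./new_batch script described further down does essentially this job), a parameter sweep could be set up by writing one run-file per parameter value:

    # one job directory per parameter value, each with its own run-file;
    # ./simulate is a stand-in for whatever program is being swept
    n=1
    for p in 0.1 0.2 0.5 1.0 2.0; do
        d=$(printf 'work/%06d' "$n")
        mkdir -p "$d"
        printf '#!/bin/sh\n./simulate --param %s >result.txt\n' "$p" >"$d/run"
        chmod +x "$d/run"
        n=$((n + 1))
    done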
Since the scripts themselves are slow, and they need some communication through a shared filesystem to choose which job is next free, this method is not suitable for running really short jobs (of the order of 1 s). In that case there will be a lot of overhead, and a higher risk of trouble from simultaneous attempts to run one job (file-locking over NFS could doubtless be done better than I've done here).
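One common way of making the claim of a job reasonably atomic over a shared filesystem, given here only as a sketch of the general technique and not as what these scripts actually do, is to use mkdir, which succeeds for exactly one of several simultaneous callers:

    # try to claim job directory $d: creating a marker directory either
    # succeeds (we won the job) or fails (another coordinator already has it)
    if mkdir "$d/claimed" 2>/dev/null; then
        ( cd "$d" && ./run )
    fi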
-----------------
Flexible, simple batching of jobs. Multiple processors, multiple hosts.
Nathaniel, 2007-06. Modified up to 2009-06.
Done as a quick fix for some simulation work. Later handy for video transcoding jobs.
Public domain.
-----------------
The means of sharing information about jobs, results and progress is a shared directory. This is expected to be an NFS mount, with little (preferably no) caching, and with the same path on each host.
The means of accessing the multiple hosts for starting jobs is ssh with a public key. Assuming that the user's home directory is shared over NFS by all hosts, it should only be necessary to do
    ssh-keygen -t rsa
(then press enter a few times; no password on the key) and then
    cat ~/.ssh/id_rsa.pub >>~/.ssh/authorized_keys
to make ssh between the hosts possible without a password.
----------------
A new batch of jobs has a new base directory created for it, in which a directory called `work' contains a numbered child directory for each job. These job directories are given six-digit numbers, sequentially from work/000001 . Another subdirectory of the batch's base directory is `logs', in which information is kept about the running of the jobs.
An executable file named `run' must be in each job directory in order for the job to be attempted. This can be any type of executable file, e.g. a binary or a script (shell, perl, python, octave, ...). Some useful possibilities are: a shell script that compiles a piece of source code (from the job directory or from some other, common, file) then runs it, perhaps with particular parameters for that job number; or a shell script that starts for example scilab or matlab with a list of commands as its input.
The provided script ./new_batch can generate the directory layout of a new batch, along with an example run-script in each job directory.
The file `hosts.list' gives hostnames and the number of simultaneous processes to run on each. This list is used for starting and for checking on the jobs.
The scripts from ./batching_scripts/ are copied into each new batch's base directory. The copies can then be run: they identify their own parent directory as the base directory of the batch.
= start_distributed_jobs
    logs into each host in hosts.list, and starts local_job_background_start on each
= local_job_background_start
    starts a specified number of backgrounded instances of local_job_coordinator on the host where it is run
= local_job_coordinator
    goes through numbers 000001, 000002, ...; if a directory of this name doesn't exist under work/, exit; else, if no run-file exists in that directory, try the next number; else, run the run-file, waiting for its exit
= report_job_status
    run with argument hosts, jobs or findur (final durations) for some neatly printed information obtained from the files under logs/
= check_running_processes
    logs into the hosts listed in hosts.list; checks the current user's running processes, CPU and memory use
= mail_when_finished
    waits till all jobs are finished, then sends email to a given address
--------------
So, to get started (an example session is sketched after this list):
* make sure a shared NFS dir and ssh public-key login are available, if multiple hosts are to be used
* modify new_batch to have suitable paths, number of jobs, and content of the job run-file
* run ./new_batch , and cd to the new batch's directory
* edit hosts.list if necessary; make any necessary changes to the run-files or to other possible (e.g. input data) files in the job directories
* for multiple-host batches, run ./start_distributed_jobs ; for a single host, local_job_background_start can be used, though start_distributed_jobs is fine too
* run ./report_job_status jobs to check on progress
* write your own program for assembling all the output data, if it's written to each working directory separately rather than being appended to a common file
--------------
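By way of illustration, a session for a small two-host batch might look roughly like the following. The paths, host names and the hosts.list format shown here are made up for the example, and mail_when_finished is assumed to take the address as an argument; the real details are set by new_batch and by the scripts themselves.

    $ ./new_batch                        # create base dir, work/000001 ..., logs/, copy the scripts
    $ cd /shared/batches/example_batch   # hypothetical shared (NFS) path
    $ cat hosts.list                     # hostname and simultaneous processes per host (assumed format)
    host1 4
    host2 2
    $ ./start_distributed_jobs           # ssh to each host and start the coordinators
    $ ./report_job_status jobs           # check on progress now and then
    $ ./mail_when_finished me@example.org &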
Page started: 2009-06-17
Last change: 2010-08-22