Faster calculations allow more work to be done. Waiting long for calculations to finish is, anyway, psychologically disturbing(!): it puts one off trying new things out if the input is almost forgotten when the output arrives. Unless the accuracy of results is compromised or the user time taken on improving a program's speed exceeds the consequent saving in time, some work on optimisation is worthwhile.
Sometimes, a better understanding of the nature of the calculation or of the computer would allow programs to be used or written more efficiently. Sometimes, the user needs to be aware of the existence of profilers and of programs that report system use so as to show where bottlenecks are occurring.
Repeatable results are reasonably expected from computers, but certain optimisations of speed can cause variation of results depending on which part of the processor is used for calculations. This is worth reading about when trying to use, or especially to optimise, calculations that deal with extreme values or that tend to accumulate errors.
Some pointers and links are given here on ways to speed up calculations, along with a little on floating-point arithmetic.
In other words, improving the speed of a single run of a program on a single processor: even if you're going to use multiple threads and parallel jobs, etc., getting the basics efficient will be a help!
If a scripting language (e.g. Python or Matlab) is being used, the calculation will likely be much faster in a compiled language such as C or Fortran. Of course, if the interpreted program is actually spending most of its time running a few big compiled modules, e.g. some sort of solver, there won't be much difference. Changing the language of an existing program is usually not a practical option, so choose wisely at the outset, weighing the time expected for programming against the time expected for running: if it's a small program that takes a lot of time and will be run thousands of times during your work, then a harder language is perhaps worthwhile. If using Matlab, note that using the `compiler' (mcc) to make a stand-alone executable will not necessarily improve speed: depending on which functions you're using, the stand-alone executable may be an enormous mess of Java run-time and Java functions linked with a little C, ending up (in my ODE case) slower than running a script (and much slower than using a free solver in Fortran).
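The point about compiled modules can be tried out in a few lines. The following is only a rough sketch in Python: `python_loop_sum' and `compare' are names made up for this illustration, and the built-in `sum' simply stands in for `a big compiled module'. The timings will differ from machine to machine, so no particular ratio is claimed.

```python
import timeit


def python_loop_sum(values):
    """Add the values one at a time in interpreted code."""
    total = 0.0
    for v in values:
        total += v
    return total


def builtin_sum(values):
    """Same task, but the loop runs inside the interpreter's compiled code."""
    return sum(values)


def compare(n=100_000, repeats=5):
    """Return (loop_time, builtin_time): best of `repeats`, 10 calls each."""
    values = [float(i) for i in range(n)]
    loop_t = min(timeit.repeat(lambda: python_loop_sum(values),
                               number=10, repeat=repeats))
    builtin_t = min(timeit.repeat(lambda: builtin_sum(values),
                                  number=10, repeat=repeats))
    return loop_t, builtin_t


if __name__ == "__main__":
    loop_t, builtin_t = compare()
    print(f"interpreted loop: {loop_t:.4f} s, built-in sum: {builtin_t:.4f} s")
```

If the two times are close for your real program's main step, rewriting it in a compiled language is unlikely to repay the effort.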
`Profilers' are readily available for most languages: they report the time taken by the various parts of a running program. A common maxim is that 90% of the running time is generally taken in 10% of the program, but that even experts cannot reliably predict which part; use of a profiler allows programming time to be spent efficiently on the most critical sections. The GNU profiler for GCC is `gprof', and is described briefly in the GCC Introduction as well as in its manual page (`man gprof'). Other environments' profilers may be found by searching their menus, documentation or the web. `A little more on optimisation' is good, short reading. Matlab has a `profile' command; in some ODE/PDE solution cases, I've managed to get the time down to less than half by generating a new m-file for each case with the parameters already written as numbers rather than being passed as local or global variables.
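As an illustration of what a profiler reports, here is a minimal sketch using Python's standard `cProfile' and `pstats' modules. The functions `cheap_setup', `expensive_kernel' and `run_case' are invented stand-ins for a real program, with one deliberately slow part.

```python
import cProfile
import io
import pstats


def cheap_setup(n):
    """Hypothetical inexpensive step: build the input data."""
    return list(range(n))


def expensive_kernel(values):
    """Hypothetical hot spot: a deliberately slow pure-Python double loop."""
    total = 0
    for v in values:
        for k in range(200):
            total += (v * k) % 7
    return total


def run_case(n=2000):
    """The whole 'program': setup followed by the expensive calculation."""
    return expensive_kernel(cheap_setup(n))


def profile_report(func, *args, top=5):
    """Run func under the profiler; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the costliest call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_report(run_case)
    print(report)
```

The report lists each function with its call count and time, so the section worth optimising (here `expensive_kernel') shows up without guesswork.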
This is not very easy, since hardware of a modern computer is so complex, with several different places where calculations may be performed, pipelining of instructions, pre-fetching of data from memory, several levels (speed/size) of cache, etc. Consider, at least, checking that the program's memory isn't being swapped out to disk during the calculation (too little physical memory), which could slow things down a lot. Consider that intensive work with small amounts of data may be much faster if the data-block being worked on fits within the CPU's few-megabyte L2 cache rather than being in main memory.
Sometimes one of two computers (e.g. with AMD and Intel CPUs) is as much as twice as fast as the other for one sort of calculation, while the speeds are the other way round for another sort (real example: LU-factorization [Intel wins] versus ODE solution [AMD wins even more strongly] with a particular case of LU and ODE solver). We have several computers available, so this could be worth trying.
If using a compiled language, the options given to the compiler can make a big difference (doubling) in the speed of the executable output. The optimisation switch `-O2' (optimisation level 2) to GCC allows a lot of changes to the structure of the code, such as making short functions be `inline' (part of the main program) and unrolling short loops. Permitting the compiler to use the SSE extension for some floating point operations may increase speed. Other options can set the exact CPU family to allow best choices of optimisations: the settings found in the CFLAGS variable in the file /etc/make.conf will give (for each system) the extra gcc options used in compiling the operating system; these are considered `stable' but were aimed only at specifying the processor, not the floating point behaviour. Allowing the compiled program to use the machine's native precision, rather than some arbitrary precision, may make large differences in speed. The gcc optimisation parent-page (and the following pages) of the GCC Introduction explain more. Bear in mind that these sorts of optimisations may alter the numerical behaviour in some special cases unless, for example, the `-mieee-fp' switch is given to the compiler (see more below on floating point).
`Shared memory' systems have multiple processors (CPUs) sharing the same memory, which can then be used by different `threads' of execution running on the CPUs. This can't be used between different computers of the simple sort that we have. The `Message Passing Interface' (MPI) is a system for coordinating processes on different CPUs and potentially on different systems. It can be applied more generally, via shared memory or filespace or network connections.
Any sort of multi-processor programming is harder than single-threaded programming. Existing libraries with multi-processor ability are probably the only sensible way for us to make use of multiple processors. Some BLAS libraries, along with some parts of recent Scilab, Matlab and Comsol, support some multi-threading. The CPUs used must, in these programs, be on the local computer.
It should be noted that increasing the number of program threads doesn't necessarily increase the program speed, even with many processors available: the need for some steps in a program to be performed sequentially, and the delay in communicating between threads, can make some programs very inefficient at multi-processor work.
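The effect of the sequential steps is usually estimated with Amdahl's law, a standard textbook model rather than anything measured here. The sketch below adds a crude per-thread `overhead' term to imitate communication delay; that term, and the function name `estimated_speedup', are assumptions made only for this illustration.

```python
def estimated_speedup(parallel_fraction, n_threads, overhead=0.0):
    """Estimate speed-up relative to one thread.

    parallel_fraction: share of single-thread run time that can be split
                       across threads (0 to 1).
    overhead:          assumed extra cost per additional thread, as a
                       fraction of the single-thread run time (a crude,
                       hypothetical model of communication delay).
    With overhead=0 this is plain Amdahl's law.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    serial = 1.0 - parallel_fraction
    relative_time = (serial
                     + parallel_fraction / n_threads
                     + overhead * (n_threads - 1))
    return 1.0 / relative_time


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n,
              round(estimated_speedup(0.8, n), 2),
              round(estimated_speedup(0.8, n, overhead=0.02), 2))
```

With 80% of the work parallelisable the plain model can never exceed a factor of 5, however many threads are used; with even a small overhead per thread, the estimate peaks at a few threads and then falls again.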
This is not an exhaustive list, just hints on widely used programs. It is fairly safe to assume that any proprietary program not mentioned here doesn't have multithreading support.
Matlab v7.5 (R2007b) has some support for multithreading, for some functions. The maximum number of threads can be set within a session (command line) or script with the function `maxNumCompThreads(N);' which sets the maximum to N and returns the old number. Multithreading settings can be set permanently (for all sessions) in the "File" -> "Preferences" -> "General" dialogue.
Comsol 3.3a has some multi-threading support in the "pardiso" solver: we see this a little, but other solvers such as umfpack are about as fast for our models. Comsol 3.4 is claimed to have multi-threading support in all of its main steps of solution. If run within Matlab (by `comsol-3.4 matlab') then it's Matlab's settings, mentioned above, that determine Comsol's threading. If run outside Matlab, there are command-line options and environment variables that set the number of processors to use. `comsol-3.4 -np 4' sets this to 4. More options for multithreading and alternative libraries can be seen by typing `comsol-3.4 -help'.
Warning: So far, it seems that any amount of multithreading in Comsol achieves only a meagre increase in performance (e.g. ~20%, for wasting much of 4 CPUs' time) and uses up quite a lot of CPU time that other processes might be able to use, and that increasing the number of threads beyond about 4 could make things slower overall! When running multiple processes (batched simulation or sharing the computer) the multithreading slows things a lot. It was also noticed that comsol-3.3 was faster than single-threaded comsol-3.4 for our examples. Therefore, 3.3 has been left (for now) as the default, and Comsol's multithreading is suggested as not being worth trying except for mild improvements in a single simulation's speed.
Several MPI-based additions exist for Matlab, to allow multiple computers to be used, e.g. MatlabMPI from MIT, mexMPI from Ohio. A `Matlab Parallel' overview page has more links.
In many cases that we work with, a program has been written, and it must be run with many different sets of parameters. This is an ideal case for treating each set of parameters as an independent job, and running one job on each CPU until all are finished. For a few, this could be done manually. The commands `at' and `batch' allow jobs to be submitted to be run at a certain time or when the system load is low. Scripting in shell or Perl or Matlab could be used to run each of a list of parameters, and one such script could be started for each processor.
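A minimal sketch of such a script follows, in Python rather than shell: it runs a list of external commands, never more than one per CPU at a time. The names `run_job' and `run_batch' are made up for this example, and the small Python one-liner merely stands in for the real program and its parameters.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(command):
    """Run one external command; return (command, exit code, captured stdout)."""
    done = subprocess.run(command, capture_output=True, text=True)
    return command, done.returncode, done.stdout


def run_batch(commands, max_parallel=None):
    """Run all commands, at most max_parallel at once (default: CPU count).

    Threads are enough here: each one only waits for its external process,
    which does the real work on its own CPU.
    """
    workers = max_parallel or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns the results in the same order as the commands.
        return list(pool.map(run_job, commands))


if __name__ == "__main__":
    # Stand-in for the real program: one tiny job per parameter value.
    params = [0.1, 0.2, 0.5, 1.0]
    commands = [[sys.executable, "-c", f"print({p} ** 2)"] for p in params]
    for command, code, out in run_batch(commands):
        print(command[-1], "->", code, out.strip())
```

Each job is independent, so a failed job shows up only as a non-zero exit code in its own result and the rest of the batch carries on.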
There exist operating-system kernel level programs for computers to distribute jobs transparently to the least loaded one, but these require considerable changes by the administrator (and seem not very much updated). A set of shell-scripts that uses ssh to start jobs on remote machines, and a shared NFS home-directory for communication between jobs, has been thoroughly adequate for the needs of several users for whom each job of a large batch takes at least many seconds to run. This has been used to batch compiled Fortran code and Matlab/Comsol programs. The programs must run in purely text mode, though modifications could be made to cater for non-interactive GUI programs. Apparently fairly similar functionality for Matlab alone is offered by Matlab's own distributed toolbox extension (which we don't use).
This section may be of interest as an introduction to some modern CPU terminology that can then be looked up further by any user so concerned with speed or numerical results as to want to have a good grasp of the compiler options and of roughly what's going on in the CPU.
The simple picture of a microprocessor is a few `registers' (each is a memory of a few bytes, operated on all at once) whose data can be read from and written to locations in memory (RAM), and whose contents can be read into and written from operations (e.g. bit-shifts, additions, logical operations) in the arithmetic and logic unit (ALU), according to a program's instruction codes stored in memory.
A modern microprocessor for desktop computers makes the picture, though fundamentally similar, very much more complex.
The hideous complexity of all this is generally hidden behind the compiler. Options given to the compiler may make some difference in speed, and may also show a difference between floating-point operations in extended precision in the FPU as opposed to normal double precision in the SSE2, due to rounding. In particular, comparisons for equality that may be true when rounded might never return true when under higher precision. Since the order of calculations -- and therefore the particular registers and calculation units used -- can vary with the compiler and its optimisations and other options, it may be important for some algorithms that suitable options are chosen for forcing all rounding to be the same.
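The same class of problem can be shown without any special hardware. The sketch below uses Python, whose floats are normally plain doubles, so it does not reproduce the extended-precision effect itself; it only shows that changing the order of operations changes the rounding, and that equality tests are therefore fragile. The function names and the tolerance are illustrative choices.

```python
import math


def sums_in_two_orders(a, b, c):
    """Return the same three-term sum evaluated in two different orders."""
    return (a + b) + c, a + (b + c)


def naive_sum(values):
    """Plain left-to-right accumulation, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def nearly_equal(x, y, rel_tol=1e-12):
    """Tolerant comparison; rel_tol is an arbitrary illustrative choice."""
    return math.isclose(x, y, rel_tol=rel_tol)


if __name__ == "__main__":
    first, second = sums_in_two_orders(0.1, 0.2, 0.3)
    print(first == second)              # False
    print(nearly_equal(first, second))  # True

    tenths = [0.1] * 10
    print(naive_sum(tenths) == 1.0)     # False: rounding errors accumulate
    print(math.fsum(tenths) == 1.0)     # True: fsum tracks the lost bits
```

An algorithm that relies on `==' between computed values can thus change behaviour when the compiler merely reorders the arithmetic; comparing within a tolerance is the usual defence.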
In the 1980s the IEEE compiled a standard for floating-point arithmetic, which has been widely adopted and can be considered as `the' floating-point standard supported by common hardware and software. IEEE 754 specifies various standard precisions, such as single (32-bit) and double (64-bit). It also specifies features of the floating-point implementation that allow adjustment to different problems, for example by modifying the way that rounding is done or the way that problems with an operation are dealt with. IEEE 754 is implemented in practically every bit of hardware around, but often not all of the subtler options are accessible through compilers and software.
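A few of these properties can be inspected directly. The following sketch assumes, as is normal on common hardware, that Python's float is an IEEE 754 double; `to_single' and `double_summary' are names invented for the example.

```python
import math
import struct
import sys


def to_single(x):
    """Round a double to IEEE 754 single precision and convert it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def double_summary():
    """Report a few properties of the platform's float (normally a double)."""
    info = sys.float_info
    return {
        "significand_bits": info.mant_dig,  # 53 for an IEEE double
        "epsilon": info.epsilon,            # gap between 1.0 and the next float
        "largest": info.max,
    }


if __name__ == "__main__":
    print(double_summary())
    # 0.1 has no exact binary representation, so the two precisions differ.
    print(0.1, to_single(0.1))
    # Overflow in a multiplication yields the special value infinity,
    # and infinity minus infinity yields `not a number' (NaN).
    overflow = sys.float_info.max * 10
    print(math.isinf(overflow), math.isnan(overflow - overflow))
```

The special values are one of the standard's ways of dealing with `problems with an operation': the calculation continues and the trouble is visible in the result.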
The following work is referenced pretty much everywhere as giving a good base (probably rather above what we normally need): What every computer scientist should know about floating-point arithmetic
Sun has a good introduction: the Numerical Computation Guide
Section 9, `Scientific Computation' in Introduction to Programming [in Java] gives a lot of examples of floating-point troubles and `good practices'. Later sections go into other interesting matters of simulation, differential equations, etc.
The GNU Scientific Library manual has a little to say about using floating point in the GSL.
The main page of W. Kahan, a numerical analyst with a long list of interesting (though often hideously typeset) easily accessible works. E.g. `How JAVA's Floating-Point Hurts Everyone Everywhere', `Matlab's Loss is Nobody's Gain', `Why we needed a floating-point standard'.
Several pages refer to problems with porting of not-very-portably-written code to the Linux-on-Intel platform which defaults to extended precision: here, here and here
A recent book (LiU, no less) has interesting-looking contents but wasn't available when requested as an acquisition for KTHB: Accuracy and Reliability in Scientific Computing
Page started: 2007-11-xx
Last change: 2008-01-30