2.27. Example: Parallel execution (MPI)

|Nmag|’s numerical core (which is part of the |nsim| multi-physics library) has been designed to carry out numerical computation on several CPUs simultaneously. The protocol we use for this is the widespread Message Passing Interface (MPI). There are a number of MPI implementations; the best known ones are probably MPICH1, MPICH2 and LAM-MPI. Currently, we support MPICH1 and MPICH2.

Which MPI version to use? Whether you want to use mpich1 or mpich2 will depend on your installation: currently, the installation from source provides mpich2 (which is also used in the virtual machines), whereas the Debian package relies on mpich1 (no Debian package is provided after release 0.1-6163).

2.27.1. Using mpich2

When using MPICH2, a multi-purpose daemon (mpd) must be started before the actual simulation.


The ``.mpd.conf`` file

MPICH2 will look for a configuration file with the name .mpd.conf in the user’s home directory. If this file is missing, an attempt to start the multi-purpose daemon will result in an error message like this:

$> mpd
configuration file /Users/fangohr/.mpd.conf not found
A file named .mpd.conf file must be present in the user's home
directory (/etc/mpd.conf if root) with read and write access
only for the user, and must contain at least a line with:
MPD_SECRETWORD=<secretword>
One way to safely create this file is to do the following:
  cd $HOME
  touch .mpd.conf
  chmod 600 .mpd.conf
and then use an editor to insert a line like
  MPD_SECRETWORD=mr45-j9z
into the file.  (Of course use some other secret word than mr45-j9z.)

If you don’t have this file in your home directory, just follow the instructions above to create it with some secret word of your choice. (Note that the above example is from a Mac OS X system: on Linux the home directory is usually under /home/USERNAME rather than /Users/USERNAME as shown here.)
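For convenience, the following commands carry out these steps in one go (replace <your-secret-word> with a secret word of your choice; using echo here is just a shortcut for opening the file in an editor):

$> cd $HOME
$> touch .mpd.conf
$> chmod 600 .mpd.conf
$> echo 'MPD_SECRETWORD=<your-secret-word>' >> .mpd.conf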


Let’s assume we have a multi-core machine with more than one CPU. This makes the mpi setup slightly easier, and is also likely to be more efficient than running a job across the network between different machines.

First, we need to start the multi-purpose daemon:

$> mpd &

It will look for the file ~/.mpd.conf as described above. If found, it will start silently. Otherwise it will complain.

2.27.1.1. Testing that nsim executes in parallel

First, let’s make sure that nsim is in the search path. The command which nsim will return the location of the executable if it can be found. For example:

$> which nsim
/home/fangohr/new/nmag-0.1/bin/nsim

To execute nsim using two processes, we can use the command:

$> mpiexec -n 2 nsim

There are two useful commands to check whether nsim is aware of the intended MPI setup. The first one is ocaml.mpi_status(), which reports the total number of processes in the MPI setup:

$> mpiexec -n 2 nsim
>>> ocaml.mpi_status()
MPI-status: There are 2 nodes (this is the master, rank=0)
>>>

The other command is ocaml.mpi_hello(), which prints a short ‘hello’ from each process:

>>> ocaml.mpi_hello()
>>> [Node   0/2] Hello from beta.kk.soton.ac.uk
[Node   1/2] Hello from beta.kk.soton.ac.uk

For comparison, let’s look at the output of these commands if we start nsim without MPI, in which case only one MPI node is reported:

$> nsim
>>> ocaml.mpi_status()
MPI-status: There are 1 nodes (this is the master, rank=0)
>>> ocaml.mpi_hello()
[Node   0/1] Hello from beta.kk.soton.ac.uk

Assuming this all works, we can now start the actual simulation. To use two CPUs on the local machine to run the bar30_30_100.py program, we can use:

$> mpiexec -n 2 nsim bar30_30_100.py

To run the program again, using 4 CPUs on the local machine:

$> mpiexec -n 4 nsim bar30_30_100.py

Note that mpich2 (and mpich1) will spawn more processes than there are CPUs if necessary. For example, if you are working on a dual-core machine (i.e. 2 CPUs) but request to run your program with 4 processes (via the -n 4 switch given to mpiexec), then you will have 4 processes running on the 2 CPUs.

If you want to stop the mpd daemon, you can use:

$> mpdallexit

For diagnostic purposes, the mpdtrace command can be used to check whether a multi-purpose daemon is running (and which machines are part of the mpi ring).
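For example, on a single machine with one running daemon, the output might look like this (the host name shown is just an illustration; you should see the names of the machines in your own ring):

$> mpdtrace
beta.kk.soton.ac.uk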

Advanced usage of mpich2

To run a job across different machines, one needs to start the multi-purpose daemons on the other machines with the mpdboot command. This will look for a file named mpd.hosts (in the current directory) which should contain a list of the hosts that are to participate (very similar to the machinefile in MPICH1).
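As an illustration (the host names are placeholders, and the exact mpdboot options may vary between MPICH2 versions, so please check the MPICH2 documentation), an mpd.hosts file listing three machines could look like this:

machine1.mydomain
machine2.mydomain
machine3.mydomain

and the daemons could then be started with something like:

$> mpdboot -n 3 -f mpd.hosts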

To trace which process is sending which messages to standard output, one can add the -l switch to the mpiexec command: each line of standard output will then be preceded by the rank of the process that issued the message.
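For example, to run the bar30_30_100.py simulation on two processes with each line of output prefixed by the rank of the process that produced it:

$> mpiexec -l -n 2 nsim bar30_30_100.py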

Please refer to the official MPICH2 documentation for further details.

2.27.2. Using mpich1

Note: Most users will use MPICH2 (if they have compiled Nmag from the tar-ball); see Using mpich2.

Suppose we would like to run Example 2: Computing the time development of a system from the manual with 2 processors, using MPICH1. We need to know the full path to the nsim executable. In a bash environment (this is pretty much the standard on Linux and Mac OS X nowadays), you can find the path using the which command. On a system where nsim was installed from the Debian package:

$> which nsim
/usr/bin/nsim

Let’s assume we have a multi-core machine with more than one CPU. This makes the mpi setup slightly easier, and is also likely to be more efficient than running a job across the network between different machines. In that case, we can run the example on 2 CPUs using:

$> mpirun -np 2 /usr/bin/nsim bar30_30_100.py

where -np is the command line argument for the number of processors.

To check that the code is running on more than one CPU, one of the first few log messages will display (in addition to the runid of the simulation) the number of CPUs used:

$> mpirun -np 2 `which nsim` bar30_30_100.py

    nmag:2008-05-20 12:50:01,177   setup.py  269    INFO Runid (=name simulation) is 'bar30_30_100', using 2 CPUs

To use 4 processors (if we have a quad core machine available), we would use:

$> mpirun -np 4 /usr/bin/nsim bar30_30_100.py

Assuming that the nsim executable is in the path, and that we are using a bash-shell, we could shortcut the step of finding the nsim executable by writing:

$> mpirun -np 4 `which nsim` bar30_30_100.py

To run the job across the network on different machines simultaneously, we need to create a file with the names of the hosts that should be used for the parallel execution of the program. If you intend to use nmag on a cluster, your cluster administrator should explain where to find this machine file.

To distribute a job across machine1.mydomain, machine2.mydomain, and machine3.mydomain, we need to create the file machines.txt with the content:

machine1.mydomain
machine2.mydomain
machine3.mydomain

We then need to pass the name of this file to the mpirun command to run an (mpi-enabled) executable with mpich:

$> mpirun -machinefile machines.txt -np 3 /usr/bin/nsim bar30_30_100.py

For further details, please refer to the MPICH1 documentation.

2.27.3. Visualising the partition of the mesh

We use Metis to partition the mesh. Partitioning means to allocate certain mesh nodes to certain CPUs. Generally, it is good if nodes that are spatially close to each other are assigned to the same CPU.

Here we demonstrate how the chosen partition can be visualised. As an example, we use the Example: demag field in uniformly magnetised sphere, and run it using mpich2 (see Using mpich2):

$> mpd &
$> mpiexec -l -n 3 nsim sphere1.py

The program starts and prints the chosen partition to stdout:

 nfem.ocaml:2008-05-28 15:11:07,757    INFO Calling ParMETIS to partition the mesh among 3 processors
 nfem.ocaml:2008-05-28 15:11:07,765    INFO Processor 0: 177 nodes
 nfem.ocaml:2008-05-28 15:11:07,765    INFO Processor 1: 185 nodes
 nfem.ocaml:2008-05-28 15:11:07,766    INFO Processor 2: 178 nodes

If you can’t find the information on the screen (=stdout), then have a look in sphere1_log.log which contains a copy of the log messages that have been printed to stdout.

If we save any fields spatially resolved (as with the sim.save_data(fields='all') command), then nmag will create a file with name (in this case) sphere1_dat.h5. In addition to the field data that is saved, it also stores the finite element mesh in the order that was used when the file was created. In this example, this is the mesh ordered according to the output from the ParMETIS package. The first 177 nodes of the mesh in this order are assigned to CPU0, the next 185 are assigned to CPU1, and the next 178 are assigned to CPU2.

We can visualise this partition using the nmeshpp command (which we apply here to the mesh that is saved in the sphere1_dat.h5 file):

$> nmeshpp --partitioning=[177,185,178] sphere1_dat.h5 partitioning.vtk

The new file partitioning.vtk contains only one field on the mesh, which assigns to each mesh node the id of the associated CPU. We can visualise this, for example, using:

$> mayavi -d partitioning.vtk -m SurfaceMap
[Figure sphere3partitions.png: the partition of the sphere mesh across three CPUs]

The figure shows that the sphere has been divided into three areas which carry values 0, 1 and 2 (corresponding to the MPI CPU rank which goes from 0 to 2 for 3 CPUs). Actually, in this plot we can only see the surface nodes (but the volume nodes have been partitioned accordingly).

The process described here for visualising the partition is a bit cumbersome. It could in principle be streamlined (so that the partition data is saved into the _dat.h5 data file and the visualisation can be generated without further manual intervention). However, we expect that this is not a show stopper and will dedicate our time to more pressing issues. (User feedback and suggestions for improvements are of course always welcome.)

2.27.4. Performance

Here is some data we have obtained on an IBM x440 system (with eight 1.9 GHz Intel Xeon processors). We use a test simulation (located in tests/devtests/nmag/hyst/hyst.par) which computes a hysteresis loop for a fairly small system (4114 mesh nodes, 1522 surface nodes, BEM size 18 MB). We use overdamped time integration to determine the metastable states.

Neither the setup time nor the time required to write data becomes significantly faster when run on more than one CPU. We provide:

total time: this includes setup time, time for the main simulation loop and time for writing data (measured in seconds)

total speedup: The speed up for the total execution time (i.e. ratio of execution time on one CPU to execution time on n CPUs).

sim time: this is the time spent in the main simulation loop (and this is where we expect a speed up)

sim speedup: the speedup of the main simulation loop

CPUs  total time  total speedup  sim time  sim speedup
1     4165        1.00           3939      1.00
2     2249        1.85           2042      1.93
3     1867        2.23           1659      2.37
4     1605        2.60           1393      2.83
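For example, the total speedup on 2 CPUs follows from 4165 s / 2249 s ≈ 1.85, and the corresponding sim speedup from 3939 s / 2042 s ≈ 1.93.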

The numbers shown here have been obtained using mpich2 (and using the ssm device instead of the default sock device: this is available on Linux and resulted in a 5% reduction of execution time).

Generally, the (network) communication that is required between the nodes will slow down the computation. The smaller the system, the more communication has to happen between the nodes (relative to the amount of time spent on actual calculation). Thus, one expects a better speed up for larger systems. The performance of the network is also crucial: generally, we expect the best speed up on very fast networks and shared memory systems (i.e. multi-CPU / multi-core architectures). We further expect the speed-up to become worse (in comparison to the ideal linear speed-up) with an increasing number of processes.

2.28. Restarting MPI runs

There is one situation that should be avoided when exploiting parallel computation. Usually, a simulation (involving for example a hysteresis loop) can be continued using the --restart switch. This is also true for MPI runs.

However, the number of CPUs used must not change between the initial and any subsequent runs. (The reason for this is that the _dat.h5 file needs to store the mesh as it has been reordered for n CPUs. If we continue the run with another number of CPUs, the mesh data will not be correct anymore which will lead to errors when extracting the data from the _dat.h5 file.)
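For example (using a hypothetical hysteresis script hysteresis.py that was originally started on 2 processes), the continuation must be launched with the same number of processes:

$> mpiexec -n 2 nsim hysteresis.py --restart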

Note also that there is currently no warning issued (in Nmag 0.1) if a user ventures into such a simulation.