

Outdated

This Wiki page is outdated; please go to: https://info.gwdg.de/wiki/doku.php?id=wiki:hpc:mpi4py

Collective Communications

Point to point communication is the basic tool for parallel computing within the message passing programming model. More general communication patterns involving more than two tasks can be realised by combining several of these basic communication operations. For example, synchronising all tasks in a communicator can be achieved by handshakes of one task with all other tasks, followed by informing all tasks about the completion of all these handshakes. This is illustrated with the code for the function synchron.
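
Like all examples in this section, the following snippets presumably rely on the standard mpi4py setup introduced in the earlier parts of this tutorial. A minimal sketch of the assumed preamble, defining the names comm, rank and size used below:

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD     # communicator containing all tasks
size = comm.Get_size()    # number of tasks in the communicator
rank = comm.Get_rank()    # number of the local task

With this preamble in place, the synchron function reads: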

def synchron():
   # handshakes between task 0 and every other task,
   # followed by a release message from task 0 to all tasks
   buf = None
   if rank == 0:
      for ip in range(1,size):
         comm.send(buf,dest=ip)       # contact task ip
         buf = comm.recv(source=ip)   # wait for its answer
      for ip in range(1,size):
         comm.send(buf,dest=ip)       # tell all tasks that everybody has arrived
   else:
      buf = comm.recv(source=0)       # wait to be contacted by task 0
      comm.send(buf,dest=0)           # answer task 0
      buf = comm.recv(source=0)       # wait for the release message


This kind of synchronisation can be thought of as a barrier, which stops every task until all tasks have arrived. A barrier may be used, for example, to force the output of several tasks, which without intervention appears in a runtime dependent order, into a prescribed sequence:

for ip in range(0,size):
   if rank == ip:
      print('rank', rank, 'ready')
   synchron()


Communication involving all tasks of a communicator is called collective within the MPI standard. The MPI standard provides subroutines for barrier synchronisation and for a large number of other frequently used patterns of collective data manipulations.

These patterns can be classified according to the kind of data manipulation involved:

  • Global Synchronisation, no data are involved
  • Global data communication, data are scattered or gathered
  • Global data reduction, data of different tasks are reduced to new data

Global Synchronisation

The only global synchronisation operation in MPI is the barrier. The barrier synchronisation function of the mpi4py Comm class, equivalent to the synchron function above, has the syntax

comm.Barrier()


A task posting a call to Barrier() will halt until all other tasks of the communicator have posted the same call. Therefore in a correct mpi4py program the number of calls to Barrier() must be equal for all tasks.
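
Using Barrier(), the ordered output loop from above can be written without the hand made synchron function. A minimal sketch, using the comm, rank and size names defined earlier:

for ip in range(0,size):
   if rank == ip:
      print('rank', rank, 'ready')   # only one task prints per round
   comm.Barrier()                    # all tasks wait before the next round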

Global Data Communication

Data communication between all tasks of a communicator can have two directions:

  • distributing data coming from one task to all tasks
  • collecting data coming from all tasks to one or all tasks

For distributing data mpi4py provides functions for

  • broadcasting: Bcast, bcast
  • scattering from one task: Scatter, Scatterv, scatter

For collecting data mpi4py provides functions for

  • gathering to one task: Gather, Gatherv, gather
  • gathering to all tasks: Allgather, Allgatherv, allgather

In addition, there are functions which combine collection and distribution of data:

  • Alltoall, Alltoallv, Alltoallw, alltoall

broadcast

The syntax of the broadcast methods is

comm.Bcast(buf, int root=0)
buf = comm.bcast(obj=None, int root=0)


An example for broadcasting a NumPy array is

bcast_array.py:

data = numpy.empty(5,dtype=numpy.float64)
if rank == 0:
    data = numpy.arange(5,dtype=numpy.float64)
comm.Bcast([data,5,MPI.DOUBLE],root=0)
print('on task', rank, 'after recv:    data = ', data)


An example of broadcasting a Python data object is

bcast.py:

rootdata = None
if rank == 0:
    rootdata = (1,'a','z',3.14)
data = comm.bcast(rootdata,root=0)
print('on task', rank, 'after bcast:    data = ', data)


A difference between the two broadcast functions is that in Bcast the data to be broadcast stay in place in the root task, whereas in bcast they are copied into the object returned by the call to bcast.

scatter

Whereas a broadcast operation distributes the same object from the root task to all other tasks, a scattering operation sends a different object from root to every task. The syntax of the methods for scattering is

comm.Scatter(sendbuf, recvbuf, int root=0)
comm.Scatterv(sendbuf, recvbuf, int root=0)
comm.scatter(sendobj=None, recvobj=None, int root=0)


In the upper case function Scatter the sendbuf and recvbuf arguments must be given in terms of a list (as in the point to point function Send):

buf = [data, data_size, data_type]


where data must be a buffer like object of size data_size and of type data_type. The size and type descriptions can be omitted if they are implied by the buffer object, as in the case of a NumPy array. Equally sized sections of the data in sendbuf will be sent from root to the tasks of the communicator, the i-th section to task i, where it will be stored in recvbuf. The data_size in sendbuf therefore must be a multiple of size, where size is the number of tasks in the communicator. The data_size in recvbuf must be at least as large as the size of the data section received.

In the vector variant of this function, Scatterv, the size and the location of the sections of senddata to be distributed may be freely chosen. The sendbuf in this case must include a description of the layout of the sections in the form

sendbuf = [data, counts, displacements, type]


where counts and displacements are integer tuples with as many elements as there are tasks in the communicator. counts[i] designates the size of the i-th section, displacements[i] the index of the element in data at which the i-th section starts.

The calls of Scatter and Scatterv are illustrated in the next example

scatter_array.py:

a_size = 4
recvdata = numpy.empty(a_size,dtype=numpy.float64)
senddata = None
if rank == 0:
   senddata = numpy.arange(size*a_size,dtype=numpy.float64)
comm.Scatter(senddata,recvdata,root=0)
print('on task', rank, 'after Scatter:    data = ', recvdata)

recvdata = numpy.empty(a_size,dtype=numpy.float64)
counts = None
dspls = None
if rank == 0:
   senddata = numpy.arange(100,dtype=numpy.float64)
   counts = (1,2,3)     # one count per task: this example assumes size == 3
   dspls = (4,3,10)     # start index of each section in senddata
comm.Scatterv([senddata,counts,dspls,MPI.DOUBLE],recvdata,root=0)
print('on task', rank, 'after Scatterv:    data = ', recvdata)


In the scatter function the sendobj to be scattered from the root task must be a sequence of exactly size objects, where size is the number of tasks in the communicator. Each element in this sequence can be a data object of any type, and the i-th object of the sequence will be sent from root to the i-th task and will be received in the i-th task as the value returned from scatter. This is shown in the following example program

scatter.py:

rootdata = None
if rank == 0:
    rootdata = [1,2,3,(4,5)]     # one object per task: this example assumes size == 4
data = comm.scatter(rootdata,root=0)
print('on task', rank, 'after scatter:    data = ', data)


gather

The gather operations collect data from all tasks and deliver this collection to the root task or to all tasks. The syntax of the gathering methods is

comm.Gather(sendbuf, recvbuf, int root=0)
comm.Gatherv(sendbuf, recvbuf, int root=0)
comm.Allgather(sendbuf, recvbuf)
comm.Allgatherv(sendbuf, recvbuf)
comm.gather(sendobj=None, recvobj=None, int root=0)
comm.allgather(sendobj=None, recvobj=None)


Compared to the scatter methods, the roles of sendbuf and recvbuf are exchanged for Gather and Gatherv. In Gather sendbuf must contain the same number of elements in all tasks and the recvbuf on root must contain size times that number of elements, where size is the total number of tasks. For Gatherv the integer tuples counts and displacements characterize the layout of recvbuf, as described in the scatter section. Here sendbuf must contain exactly counts[rank] elements in task rank. The following code demonstrates this.

gather_array.py:

a_size = 4
recvdata = None
senddata = (rank+1)*numpy.arange(a_size,dtype=numpy.float64)
if rank == 0:
   recvdata = numpy.arange(size*a_size,dtype=numpy.float64)
comm.Gather(senddata,recvdata,root=0)
print('on task', rank, 'after Gather:    data = ', recvdata)

counts = (2,3,4)     # number of elements gathered from each task: assumes size == 3
dspls = (0,3,8)      # start index of each section in recvdata
if rank == 0:
   recvdata = numpy.empty(12,dtype=numpy.float64)
sendbuf = [senddata,counts[rank]]
recvbuf = [recvdata,counts,dspls,MPI.DOUBLE]
comm.Gatherv(sendbuf,recvbuf,root=0)
print('on task', rank, 'after Gatherv:    data = ', recvdata)


The lower case function gather communicates generic Python data objects, analogously to the scatter function. An example is

gather.py:

senddata = ['rank',rank]
rootdata = comm.gather(senddata,root=0)
print('on task', rank, 'after gather:    data = ', rootdata)


The “all” variants of the gather methods deliver the collected data not only to the root task, but to all tasks in the communicator. These methods therefore lack the root parameter.
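
For example, Allgather delivers to every task the data collected from all tasks. A minimal sketch along the lines of gather_array.py above, using the comm, rank and size names defined earlier:

a_size = 4
senddata = (rank+1)*numpy.arange(a_size,dtype=numpy.float64)
recvdata = numpy.empty(size*a_size,dtype=numpy.float64)   # needed on every task
comm.Allgather(senddata,recvdata)
print('on task', rank, 'after Allgather:    data = ', recvdata)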

alltoall

alltoall operations combine gather and scatter. In a communicator with size tasks every task has a sendbuf and a recvbuf with exactly size data objects. The alltoall operation takes the i-th object from the sendbuf of task j and copies it into the j-th object of the recvbuf of task i. The operation can be thought of as a transpose of the matrix with tasks as columns and data objects as rows. The syntax of the alltoall methods is

comm.Alltoall(sendbuf, recvbuf)
comm.Alltoallv(sendbuf, recvbuf)
comm.alltoall(sendobj=None, recvobj=None)


The data objects in sendbuf and recvbuf have types depending on the method. In Alltoall the data objects are equally sized consecutive sections of buffer like objects. In Alltoallv they are sections of varying sizes and varying displacements of buffer like objects. The layout must be described in the counts and displacements tuples, which have to be defined for sendbuf and recvbuf in every task in a way consistent with the intended transposition of the (task, object) matrix; a sketch for Alltoallv follows the Alltoall example below. Finally, in the lower case alltoall function the data objects can be of any allowed Python type, provided that the data objects in sendbuf conform to those in recvbuf, which they will replace.

An example for Alltoall is

alltoall_array.py:

a_size = 1
senddata = (rank+1)*numpy.arange(size*a_size,dtype=numpy.float64)
recvdata = numpy.empty(size*a_size,dtype=numpy.float64)
comm.Alltoall(senddata,recvdata)   # element i of senddata is sent to task i and stored as element rank of its recvdata
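
A corresponding sketch for Alltoallv is the following. The counts and displacements are chosen under the assumption that every task sends i+1 elements to task i, so that task rank receives rank+1 elements from every task:

# layout of the send buffer: section i (destined for task i) has i+1 elements
sendcounts = [i+1 for i in range(size)]
sdispls = [sum(sendcounts[:i]) for i in range(size)]
senddata = (rank+1)*numpy.arange(sum(sendcounts),dtype=numpy.float64)

# layout of the receive buffer: rank+1 elements arrive from each of the size tasks
recvcounts = [rank+1]*size
rdispls = [(rank+1)*i for i in range(size)]
recvdata = numpy.empty(sum(recvcounts),dtype=numpy.float64)

comm.Alltoallv([senddata,sendcounts,sdispls,MPI.DOUBLE],
               [recvdata,recvcounts,rdispls,MPI.DOUBLE])
print('on task', rank, 'after Alltoallv:    data = ', recvdata)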


Next an example for alltoall, to be started with two tasks.

alltoall.py:

data0 = ['rank',rank]
data1 = [1,'rank',rank]
senddata = (data0,data1)     # one object per task: this example assumes size == 2
print('on task', rank, '    senddata = ', senddata)
recvdata = comm.alltoall(senddata)
print('on task', rank, '    recvdata = ', recvdata)


Global Data Reduction

The data reduction functions of MPI combine data from all tasks with one of a set of predefined operations and deliver the result of this “reduction” to the root task or to all tasks. mpi4py provides the following operations for reduction:

MPI.MIN        minimum
MPI.MAX        maximum
MPI.SUM        sum
MPI.PROD       product
MPI.LAND       logical and
MPI.BAND       bitwise and
MPI.LOR        logical or
MPI.BOR        bitwise or
MPI.LXOR       logical xor
MPI.BXOR       bitwise xor
MPI.MAXLOC     max value and location
MPI.MINLOC     min value and location


The reduction operations need data of the appropriate type: the arithmetic operations act on numerical data, the logical and bitwise operations on boolean and integer data; MAXLOC and MINLOC are discussed separately below.

reduce

The syntax for the reduction methods is

comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0)
comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
comm.reduce(sendobj=None, recvobj=None, op=MPI.SUM, root=0)
comm.allreduce(sendobj=None, recvobj=None, op=MPI.SUM)


The following example shows the use of Reduce and Allreduce. sendbuf and recvbuf must be buffer like data objects with the same number of elements of the same type in all tasks. The reduction operation is performed elementwise using the corresponding elements of sendbuf in all tasks. The result is stored in the corresponding element of recvbuf of the root task by Reduce and of all tasks by Allreduce. If the op parameter is omitted, the default operation MPI.SUM is used.

reduce_array.py:

a_size = 3
recvdata = numpy.zeros(a_size,dtype=numpy.int64)
senddata = (rank+1)*numpy.arange(a_size,dtype=numpy.int64)
comm.Reduce(senddata,recvdata,root=0,op=MPI.PROD)
print('on task', rank, 'after Reduce:    data = ', recvdata)

comm.Allreduce(senddata,recvdata)   # default reduction operation is MPI.SUM
print('on task', rank, 'after Allreduce:    data = ', recvdata)


The use of the lower case methods reduce and allreduce operating on generic Python data objects is limited, because the reduction operations are undefined for most of these data objects (like lists, tuples etc.).
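
For single numerical Python objects, however, they work as expected. A minimal sketch, using the comm and rank names defined earlier:

total = comm.reduce(rank, op=MPI.SUM, root=0)    # sum of all ranks, delivered to root only (None elsewhere)
largest = comm.allreduce(rank, op=MPI.MAX)       # maximum rank, delivered to every task
print('on task', rank, ': total =', total, ', largest =', largest)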

reduce_scatter

The reduce_scatter functions operate elementwise on size sections of the buffer like data object sendbuf. Section i must have the same number of elements in every task. The result of the elementwise reduction of section i is copied to recvbuf in task i, which must have an appropriate length. The syntax for the reduction methods is

comm.Reduce_scatter_block(sendbuf, recvbuf, op=MPI.SUM)
comm.Reduce_scatter(sendbuf, recvbuf, recvcounts=None, op=MPI.SUM)


In Reduce_scatter_block the number of elements in all sections must be equal and the number of elements in sendbuf must be size times that number. An example code is the following

reduce_scatter_block:

a_size = 3
recvdata = numpy.zeros(a_size,dtype=numpy.int64)
senddata = (rank+1)*numpy.arange(size*a_size,dtype=numpy.int64)
print('on task', rank, 'senddata  = ', senddata)
comm.Reduce_scatter_block(senddata,recvdata,op=MPI.SUM)
print('on task', rank, 'recvdata = ', recvdata)


In Reduce_scatter the number of elements in the sections can be different. They must be given in the integer tuple recvcounts. The number of elements in sendbuf must be the sum of the numbers of elements of the sections. On task i, recvbuf must have the length of section i of sendbuf. The following code gives an example for this.

reduce_scatter:

recv_size = list(range(1,size+1))   # section i contains i+1 elements
recvdata = numpy.zeros(recv_size[rank],dtype=numpy.int64)
send_size = 0
for i in range(0,size):
   send_size = send_size + recv_size[i]
senddata = (rank+1)*numpy.arange(send_size,dtype=numpy.int64)
print('on task', rank, 'senddata  = ', senddata)
comm.Reduce_scatter(senddata,recvdata,recv_size,op=MPI.SUM)
print('on task', rank, 'recvdata = ', recvdata)


Reduction with MINLOC and MAXLOC

The reduction operations MINLOC and MAXLOC differ from all others: they return two results, the minimum or maximum of the values in the different tasks and the rank of a task which holds the extreme value. mpi4py provides these two operations only for the lower case reduce and allreduce methods and only for comparing a single numerical data object in every task. An example is given in

reduce_minloc.py:

inp = numpy.random.rand(size)
senddata = inp[rank]
recvdata = comm.reduce(senddata,None,root=0,op=MPI.MINLOC)
print('on task', rank, 'reduce:    ', senddata, recvdata)

recvdata = comm.allreduce(senddata,None,op=MPI.MINLOC)
print('on task', rank, 'allreduce: ', senddata, recvdata)


Code Examples

The Python codes for all examples described in this tutorial are available from http://wwwuser.gwdg.de/~ohaan/mpi4py_examples/
