====== ScaLAPACK ======

===== Parallel Linear Algebra with ScaLAPACK =====

ScaLAPACK is a library of high-performance linear algebra routines for parallel distributed memory machines. On GWDG's scientific computing cluster ScaLAPACK is available from the Intel® Math Kernel Library, currently installed in version 11.1. ScaLAPACK provides often used algorithms for dense and banded matrices: the solution of systems of equations, of least squares problems, eigenvalue problems, and singular value problems.

The distributed memory parallelism of ScaLAPACK is based on the Basic Linear Algebra Communication Subprograms (BLACS), which support the 2-dimensional data and process patterns used in ScaLAPACK for the efficient parallelization of the matrix algorithms of linear algebra. The BLACS are built on top of a message passing library, which is MPI for the Intel MKL implementation of BLACS. The parallel basic matrix operations used in ScaLAPACK are provided by the Parallel Basic Linear Algebra Subprograms (PBLAS), which in turn at the single process level call the Basic Linear Algebra Subprograms (BLAS) for matrix computations. This software hierarchy of ScaLAPACK is displayed in the following figure:

//(figure: the software hierarchy of ScaLAPACK)//

The performance of the ScaLAPACK routines depends on the performance of the underlying implementations of BLAS and the message passing library. With Intel MPI and Intel MKL optimal performance will be achieved on GWDG's clusters.

Documentation for ScaLAPACK and related software can be obtained from the web:
[[http://
[[http://
[[http://
[[http://
[[http://
[[http://
Furthermore the [[https://

Speed up from parallel computing for linear algebra algorithms can easily be realized with the ScaLAPACK procedures. But even for users familiar with parallel applications based on MPI, calling ScaLAPACK procedures needs some explanation. This concerns the use of the additional communication interface of the BLACS and the special block-cyclically distributed data layout for matrices, which is the basis for a good load balance in the parallelized matrix multiplication and matrix factorization operations of linear algebra algorithms.

===== First Steps for Using ScaLAPACK on GWDG Compute Clusters =====

==== A Parallel HELLO_WORLD Program with BLACS ====
The following example code shows how to set up the BLACS environment needed for calling ScaLAPACK procedures:
<code>
      program hello_from_BLACS
      implicit none
! local variables
      integer   myid, nprocs, ctxt_sys, ctxt_all, ctxt,
     :          nprow, npcol, myrow, mycol

! determine rank of calling process and the size of processor set
      call BLACS_PINFO(myid, nprocs)
! get the internal default context
      call BLACS_GET( 0, 0, ctxt_sys )
! Set up a process grid for the process set
      ctxt_all = ctxt_sys
      call BLACS_GRIDINIT( ctxt_all, 'C', nprocs, 1 )
      call BLACS_BARRIER( ctxt_all, 'A' )
! Set up a process grid of size 3*2
      ctxt = ctxt_sys
      call BLACS_GRIDINIT( ctxt, 'C', 3, 2 )
! All processes not belonging to ctxt jump to the end of the program
      if (ctxt.lt.0) go to 1000
! Get the process coordinates in the grid
      call BLACS_GRIDINFO( ctxt, nprow, npcol, myrow, mycol )
      write(6,*) 'hello from process', myid, ', grid coordinates', myrow, mycol
! ...
! now ScaLAPACK or PBLAS procedures can be used
! ...
 1000 continue
! return all BLACS contexts
      call BLACS_EXIT(0)
      stop
      end
</code>

For the MPI based implementation of the BLACS, the initial call to ''BLACS_PINFO'' returns in ''myid'' the rank of the calling process and in ''nprocs'' the total number of started processes. The call to ''BLACS_GET'' with the arguments shown returns the internal default context of the BLACS in ''ctxt_sys''.

The next step is setting up a two-dimensional process grid by calling ''BLACS_GRIDINIT''. On input, its first argument must contain the system context obtained from ''BLACS_GET''; on output it contains the context of the newly created grid. The remaining arguments define the ordering of the processes in the grid ('C' for column major, 'R' for row major) and the numbers of process rows and columns.

The ''BLACS_BARRIER'' call synchronizes the processes belonging to the context passed as its first argument.

As the example shows, it is possible to create different grids with their own contexts from the available processor set, such that some processes have multiple grid coordinates, one pair for every grid they belong to.

Next a grid of size 3x2 is created with context ''ctxt''. Processes not needed for this grid get a negative value returned in ''ctxt'' and jump to the end of the program. For the processes belonging to the grid, ''BLACS_GRIDINFO'' returns the grid size and the coordinates ''myrow'', ''mycol'' of the calling process in the grid.

==== Compiling and Running the Program on GWDG Compute Clusters ====
In order to compile this program for running on the GWDG compute cluster, the necessary modules for using Intel's compiler, MPI and MKL must be loaded by the command

<code>
module load intel/
</code>

The compilation then must include linking the needed libraries, as described in the Intel MKL User's Guide. An example for using the sequential MKL library (one thread per process) is

<code>
mpiifort -o hello hello_from_BLACS.f \
   -lmkl_scalapack_lp64 \
   -lmkl_blacs_intelmpi_lp64 \
   -lmkl_intel_lp64 \
   -lmkl_sequential \
   -lmkl_core
</code>

These libraries are also selected for linking by using the option ''-mkl=cluster'':

<code>
mpiifort -o hello hello_from_BLACS.f -mkl=cluster
</code>

Finally, the program can be started with at least 6 processes interactively with the mpirun command, for example

<code>
mpirun -n 7 hello
</code>

producing the output

<code>
hello from process
hello from process
hello from process
hello from process
hello from process
hello from process
</code>

With the batch system LSF, the program can be submitted for execution on the compute cluster, for example with

<code>
bsub -q mpi-short -o out.%J -n 7 -a intelmpi mpirun.lsf ./hello
</code>

More information on using the queueing system on the GWDG compute cluster is provided on the GWDG Wiki under the header [[https://

===== Block-Cyclic Data Distribution =====
In ScaLAPACK, all arrays are block-cyclically distributed over the available process grid. This distribution will be explained in detail for vector and matrix data in the next two sections.
==== Block-Cyclic Vector Distribution ====

The block-cyclic distribution of a vector over a linearly arranged processor set is characterized by 4 numbers:

<code>
n      size of the vector
nb     size of the blocks
nproc  size of the linear process set
psrc   process containing the first element of the vector
</code>

In the block-cyclic distribution of a vector the blocks are assigned cyclically to the available processes in the process set, starting on process psrc. The blocks are stored contiguously in the local memory of a process. An example with the parameters

<code>
n=9, nb=2, nproc=2, psrc=1
</code>

global vector: //(figure)//

distributed vector: //(figure)//
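
The distribution for this example can also be worked out with a few lines of code. The following minimal sketch (not part of the original example) prints, for every global vector element, the owning process and the local storage index according to the block-cyclic rule described above:

<code>
      program show_vector_distribution
      implicit none
      integer n, nb, nproc, psrc, i, b, ip, il
      n = 9; nb = 2; nproc = 2; psrc = 1
      do i = 1, n
! global block index containing element i
         b  = (i-1)/nb
! owning process: blocks are dealt out cyclically, starting at psrc
         ip = mod(psrc + b, nproc)
! local index: completed local blocks plus offset within the block
         il = (b/nproc)*nb + mod(i-1,nb) + 1
         write(6,*) 'element', i, ' -> process', ip, ', local index', il
      end do
      end
</code>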

In order to determine the individual sizes of the local parts of the vector stored in process ip, the vector size n is split up as

<code>
n = (nblock-1)*nb + nlast,    0 < nlast ≤ nb
</code>

where ''nblock'' is the total number of blocks and ''nlast'' is the size of the last block, which may be smaller than nb.

The ''nblock'' blocks are assigned cyclically to the ''nproc'' processes, starting at ''psrc''. The number of blocks can be split as

<code>
nblock = (lblock-1)*nproc + lfirst + 1,    0 ≤ lfirst < nproc
</code>

such that the first ''lfirst'' processes, counted from ''psrc'' in the cyclic order, hold ''lblock'' complete blocks, the process with distance ''lfirst'' from ''psrc'' holds ''lblock-1'' complete blocks and the last block of size ''nlast'', and all remaining processes hold ''lblock-1'' blocks.

ScaLAPACK provides the utility function NUMROC (number of rows or columns), which implements this calculation to determine the local number of elements on process ip, given the four numbers n, nb, psrc, nproc. A demo version providing the functionality of NUMROC is

<code>
      integer function NUMROC_demo( n, nb, ip, psrc, nproc )
      implicit none
      integer n, nb, ip, psrc, nproc
      integer nblock, nlast, lblock, lfirst, ip0

! ip0 is the distance of ip to psrc
      ip0    = mod(nproc+ip-psrc, nproc)
! number of blocks and size of the last block
      nblock = (n+nb-1)/nb
      nlast  = n - (nblock-1)*nb
! number of local blocks and distance of the process holding the last block
      lblock = (nblock+nproc-1)/nproc
      lfirst = mod(nblock-1, nproc)
      if ( ip0.lt.lfirst ) then
         NUMROC_demo = lblock*nb
      else if ( ip0.eq.lfirst ) then
         NUMROC_demo = (lblock-1)*nb + nlast
      else
         NUMROC_demo = (lblock-1)*nb
      end if
      return
      end
</code>

The maximal number of elements stored on a single process is therefore bounded by

<code>
if (lfirst.eq.0)  NUMROC ≤ (lblock-1)*nb + nlast ≤ lblock*nb
if (lfirst.gt.0)  NUMROC ≤ lblock*nb
</code>

Therefore ''NUMROC'' is bounded by the maximal number of local elements ''LELM'', given by

<code>
LELM(n, nb, nproc)
   = nb*⌈n/(nb*nproc)⌉
   = ( ⌊(n-1)/(nb*nproc)⌋ + 1 )*nb
</code>
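
As a small illustration (a sketch, not part of the original text), the following program evaluates this bound for the example n=9, nb=2, nproc=2 used above (with psrc=0 for simplicity) and compares it with the actual local element counts:

<code>
      program check_lelm
      implicit none
      integer n, nb, nproc, ip, lelm, nloc
      integer nblock, lblock, lfirst, nlast
      n = 9; nb = 2; nproc = 2
! upper bound LELM for the local number of elements
      lelm   = ( (n-1)/(nb*nproc) + 1 )*nb
! block counts as in NUMROC_demo (psrc = 0, so the distance equals ip)
      nblock = (n+nb-1)/nb
      nlast  = n - (nblock-1)*nb
      lblock = (nblock+nproc-1)/nproc
      lfirst = mod(nblock-1, nproc)
      do ip = 0, nproc-1
         if (ip.lt.lfirst) then
            nloc = lblock*nb
         else if (ip.eq.lfirst) then
            nloc = (lblock-1)*nb + nlast
         else
            nloc = (lblock-1)*nb
         end if
         write(6,*) 'process', ip, ':', nloc, ' local elements, bound', lelm
      end do
      end
</code>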

==== Block-Cyclic Matrix Distribution ====

The block-cyclic distribution of a two-dimensional array on a two-dimensional process grid is the direct product of block-cyclic distributions in each dimension. The relevant parameters for the matrix distribution are

<code>
m, n          number of rows and columns of the matrix
mb, nb        sizes of blocks in columns and in rows
nprow, npcol  number of rows and columns of the process grid
rsrc, csrc    row and column index of the process containing the first element of the matrix
</code>

The following figures illustrate the block-cyclic matrix distribution with the parameters

<code>
m=5, n=7, mb=nb=2, nprow=npcol=2, rsrc=csrc=0
</code>

global matrix: //(figure)//

distributed matrix: //(figure)//

==== Mapping between Global and Local Indices ====
In order to transform a global mxn matrix A to a block-cyclically distributed data set, an element A(i,j) of the global matrix must be stored as an element al(il,jl) of the local matrix al in the process with coordinate tuple (ip,jp). This mapping is identical and independent in both dimensions. For one dimension with block size nb, np processes and source process sp, the mapping ''i → (ip, il)'' from the global index to the process number and the local index is

<code>
im = i-1                    (0-based global index)
i1 = ⌊im/nb⌋                (global block index)
i2 = im - nb*i1             (offset within the block)
k  = ⌊i1/np⌋                (local block index)
rp = i1 - np*k              (distance of the owning process from sp)
ip = (sp + rp) modulo np    (owning process)
il = k*nb + i2 + 1          (local index)
</code>

and the reverse mapping ''(ip, il) → i'' from process number and local index to the global index is

<code>
rp  = (ip - sp + np) modulo np
ilm = il-1
k   = ⌊ilm/nb⌋
i2  = ilm - nb*k
i1  = k*np + rp
i   = i1*nb + i2 + 1
</code>
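
For illustration, the two mappings can be written as small Fortran subroutines. This is only a sketch with hypothetical names (''ind_gl_lc_demo'', ''ind_lc_gl_demo''); ScaLAPACK's own tool functions for this purpose are introduced below:

<code>
! demo versions of the mapping formulas above (not the ScaLAPACK originals)
! global index i -> (process ip, local index il)
      subroutine ind_gl_lc_demo( i, nb, np, sp, ip, il )
      implicit none
      integer i, nb, np, sp, ip, il
      integer im, i1, i2, k, rp
      im = i - 1
      i1 = im/nb              ! global block index
      i2 = im - nb*i1         ! offset within the block
      k  = i1/np              ! local block index
      rp = i1 - np*k          ! distance of the owning process from sp
      ip = mod(sp + rp, np)   ! owning process
      il = k*nb + i2 + 1      ! local index
      return
      end

! (process ip, local index il) -> global index i
      subroutine ind_lc_gl_demo( il, ip, nb, np, sp, i )
      implicit none
      integer il, ip, nb, np, sp, i
      integer ilm, i2, k, rp, i1
      rp  = mod(ip - sp + np, np)
      ilm = il - 1
      k   = ilm/nb            ! local block index
      i2  = ilm - nb*k        ! offset within the block
      i1  = k*np + rp         ! global block index
      i   = i1*nb + i2 + 1    ! global index
      return
      end
</code>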

These mappings are provided by the ScaLAPACK tool functions

<code>
i  = INDXL2G(il, nb, ip, sp, np)
il = INDXG2L(i, nb, ip, sp, np)
ip = INDXG2P(i, nb, ip, sp, np)
</code>

where ''nb'' is the block size, ''sp'' is the source process holding the first element and ''np'' is the number of processes in the respective dimension.

Supposing that the value of the element A(i,j) of the global matrix is given by a function ''matval(i,j)'', the local parts ''al'' of the distributed matrix can be initialized by looping over all global indices and testing for ownership:

<code>
! copy the values of the global matrix
! to the local parts of the distributed matrix
      llda = max(1, NUMROC( m, mb, myrow, 0, nprow ))
      do j = 1 , n
         jl  = INDXG2L(j, nb, mycol, 0, npcol)
         ipc = INDXG2P(j, nb, mycol, 0, npcol)
         do i = 1 , m
            il  = INDXG2L(i, mb, myrow, 0, nprow)
            ipr = INDXG2P(i, mb, myrow, 0, nprow)
            if (ipr.eq.myrow.and.ipc.eq.mycol)
     :         al(il+llda*(jl-1)) = matval(i,j)
         end do
      end do
</code>

Alternatively, the loops can run over the local indices only, using the mapping from local to global indices:
<code>
! initialize in parallel the local parts of the distributed matrix
      mll = NUMROC( m, mb, myrow, 0, nprow )
      nll = NUMROC( n, nb, mycol, 0, npcol )
      do jl = 1 , nll
         j = INDXL2G(jl, nb, mycol, 0, npcol)
         do il = 1 , mll
            i = INDXL2G(il, mb, myrow, 0, nprow)
            al(il+mll*(jl-1)) = matval(i,j)
         end do
      end do
</code>

A complete example program for using these mapping routines for initializing the distributed matrix can be copied from [[http://

==== Array Descriptors in ScaLAPACK ====
In ScaLAPACK the information about the block-cyclic data distribution for an array is encoded in an integer vector with 9 elements, called the descriptor. Let ''desc_A'' be the descriptor for a global matrix A; its elements have the following meaning:

<code>
desc_A(1) = 1        only possible value for the descriptor type
desc_A(2) = ctxt_A   BLACS context of the process grid over which A is distributed
desc_A(3) = m_A      number of rows of the global matrix A
desc_A(4) = n_A      number of columns of the global matrix A
desc_A(5) = mb_A     row block size for the distribution of A
desc_A(6) = nb_A     column block size for the distribution of A
desc_A(7) = rsrc_A   process row containing the first row of A
desc_A(8) = csrc_A   process column containing the first column of A
desc_A(9) = llda     leading dimension of the local array holding the local part of A
</code>

The first 8 elements of the descriptor describe global properties of the distributed matrix, they must be equal for all participating processes in the grid context ctxt_A. Only the 9th element describes a local property of the distribution: ''llda'' is the leading dimension of the local array and therefore must be at least the local number of rows, i.e. llda ≥ max(1, NUMROC(m_A, mb_A, myrow, rsrc_A, nprow)).

All calls to ScaLAPACK and PBLAS routines involve the descriptors for the distributed matrices participating in operations performed by the subroutines. The descriptors have to be initialized accordingly before calling the routines. The initialization can be done explicitly by assigning the appropriate values to the descriptor elements or by calling the special initialization routine DESCINIT. The calling sequence for DESCINIT is:

<code>
call DESCINIT( desc_A, m_A, n_A, mb_A, nb_A, rsrc_A, csrc_A, ctxt_A, llda, info )
</code>

Attention must be paid to the order of parameters in the calling sequence, which does not completely follow the numbering order of these values in the descriptor array.

If one of the arguments passed to ''DESCINIT'' has an illegal value, ''DESCINIT'' returns the negative of the position of that argument in ''info''; if all arguments are valid, ''info'' is returned as 0.
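
For comparison, the same descriptor can also be filled element by element. The following sketch (using the variable names from the table above) makes the different ordering of ''ctxt_A'' and ''llda'' in the DESCINIT argument list explicit:

<code>
! explicit initialization equivalent to the DESCINIT call above
      desc_A(1) = 1        ! descriptor type
      desc_A(2) = ctxt_A   ! BLACS context
      desc_A(3) = m_A      ! global number of rows
      desc_A(4) = n_A      ! global number of columns
      desc_A(5) = mb_A     ! row block size
      desc_A(6) = nb_A     ! column block size
      desc_A(7) = rsrc_A   ! process row holding the first row
      desc_A(8) = csrc_A   ! process column holding the first column
      desc_A(9) = llda     ! local leading dimension
</code>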

==== The Matrix Redistribution Routine PDGEMR2D ====

A further possibility to initialize the local parts of a distributed matrix is provided within ScaLAPACK by the matrix redistribution subroutine ''PDGEMR2D''. It copies a submatrix of a distributed matrix A to a submatrix of a distributed matrix B, where A and B may be distributed over different process grids with different block sizes.

The calling sequence for the general case is

<code>
call PDGEMR2D(m, n, al, i_A, j_A, desc_A, bl, i_B, j_B, desc_B, ctxt)

m, n      number of rows and columns of the submatrices to be copied
al        local matrix of size llda * nla, llda = desc_A(9),
          nla is the number of columns of the global submatrix of A stored in the process,
          equal to NUMROC( n_A, nb_A, mycol, csrc_A , npcol );
          on input, al must contain the local elements of the submatrix of the global matrix A
i_A, j_A  row and column index in A at which the submatrix to be copied starts
desc_A    descriptor of the global matrix A;
          for processes not belonging to ctxt_A:
          desc_A(2) must be set to -1
bl        local matrix of size lldb * nlb, lldb = desc_B(9),
          nlb is the number of columns of the global submatrix of B stored in the process,
          equal to NUMROC( n_B, nb_B, mycol, csrc_B , npcol );
          on output, bl will contain the local elements of the submatrix of the global matrix B
i_B, j_B  row and column index in B at which the copied submatrix starts
desc_B    descriptor of the global matrix B;
          for processes not belonging to ctxt_B:
          desc_B(2) must be set to -1
ctxt      valid context handle pointing to a process grid,
          which contains all processes used in ctxt_A and ctxt_B
</code>


For generating a block-cyclically distributed copy of a global matrix A, the global matrix is generated as an ordinary two-dimensional local array ''a_0'' on a single process (here process 0), which constitutes its own 1x1 process grid, and then copied with ''PDGEMR2D'' to the local arrays ''al'' of the processes of the larger grid:

<code>
      call BLACS_PINFO(myid, nprocs)
      call BLACS_GET( 0, 0, ctxt_sys )
! Set up a process grid of size 3*2 with context ctxt
      ctxt = ctxt_sys
      call BLACS_GRIDINIT( ctxt, 'C', 3, 2 )
! Set up a process grid of size 1 with context ctxt_0
      ctxt_0 = ctxt_sys
      call BLACS_GRIDINIT( ctxt_0, 'C', 1, 1 )
! Get the process coordinates in the grid context ctxt
      call BLACS_GRIDINFO( ctxt, nprow, npcol, myrow, mycol )
! initialize the descriptor for the distributed mxn matrix:
      llda = max(1, NUMROC( m, mb, myrow, 0, nprow ))
      call DESCINIT( desca, m, n, mb, nb, 0, 0, ctxt, llda, info )
! initialize the descriptor desca_0 for the global mxn matrix:
      desca_0(2) = -1
      if (myid.eq.0)
     :   call DESCINIT( desca_0, m, n, m, n, 0, 0, ctxt_0, max(1,m), info )
! on proc 0 initialize the global matrix a_0
      if (myid.eq.0) then
         do j = 1 , n
            do i = 1 , m
               a_0(i + m * (j - 1)) = matval(i,j)
            end do
         end do
      end if
! copy a_0 on process 0 block-cyclically to the local arrays al in the process grid
      call PDGEMR2D(m, n, a_0, 1, 1, desca_0, al, 1, 1, desca, ctxt)
</code>

A complete example program for using ''PDGEMR2D'' in this way can be copied from [[http://
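
''PDGEMR2D'' can also be used in the opposite direction, for example to collect a distributed result matrix back into the full array ''a_0'' on process 0. A minimal sketch, reusing the descriptors from the example above (all processes of ''ctxt'' must take part in the call):

<code>
! gather the distributed matrix back into a_0 on process 0 by
! exchanging the roles of source and target in the PDGEMR2D call
      call PDGEMR2D(m, n, al, 1, 1, desca, a_0, 1, 1, desca_0, ctxt)
</code>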

===== Example Programs with PBLAS Routines =====

PBLAS provides basic linear algebra operations on vectors and matrices for parallel execution in a message passing programming environment.
A matrix operand in PBLAS is a submatrix sub(A) with m rows and n columns of a global matrix A with m_A rows and n_A columns, which starts at the element of A with row index i_A and column index j_A:

<code>
sub(A)(1:m, 1:n) = A(i_A:i_A+m-1, j_A:j_A+n-1)
</code>

For a valid submatrix the following restrictions have to be met:

<code>
1 ≤ i_A ≤ i_A+m-1 ≤ m_A,   1 ≤ j_A ≤ j_A+n-1 ≤ n_A
</code>

The local array al of a matrix operand in a call to a PBLAS routine is assumed to contain the elements of the global matrix A which are mapped to the calling process according to the parameters in the descriptor ''desc_A'' of the global matrix.

A vector operand in PBLAS is a special submatrix of a global matrix X with only one column or one row:

<code>
subcol(X)(1:n) = X(i_X:i_X+n-1, j_X)
subrow(X)(1:n) = X(i_X, j_X:j_X+n-1)
</code>

==== Parallel Matrix Vector Multiplication with PDGEMV ====

The parallel matrix vector multiplication provided by the PBLAS routine ''PDGEMV'' performs the operation

<code>
subvec(Y) = alpha sub(A) subvec(X) + beta subvec(Y)       (trans = 'N')
subvec(Y) = alpha sub(A)^T subvec(X) + beta subvec(Y)     (trans = 'T')
</code>

where ''alpha'' and ''beta'' are scalars, sub(A) is an m x n submatrix of a global block-cyclically distributed matrix A, and subvec(X), subvec(Y) are subvectors of global block-cyclically distributed matrices X and Y.

For trans = 'N' the operands are

<code>
sub(A)(1:m, 1:n) = A(i_A:i_A+m-1, j_A:j_A+n-1)
if (incx=1)   subcol(X)(1:n) = X(i_X:i_X+n-1, j_X)
if (incx=m_X) subrow(X)(1:n) = X(i_X, j_X:j_X+n-1)
if (incy=1)   subcol(Y)(1:m) = Y(i_Y:i_Y+m-1, j_Y)
if (incy=m_Y) subrow(Y)(1:m) = Y(i_Y, j_Y:j_Y+m-1)
</code>

For trans = 'T' the operands are

<code>
sub(A)(1:m, 1:n) = A(i_A:i_A+m-1, j_A:j_A+n-1)
if (incx=1)   subcol(X)(1:m) = X(i_X:i_X+m-1, j_X)
if (incx=m_X) subrow(X)(1:m) = X(i_X, j_X:j_X+m-1)
if (incy=1)   subcol(Y)(1:n) = Y(i_Y:i_Y+n-1, j_Y)
if (incy=m_Y) subrow(Y)(1:n) = Y(i_Y, j_Y:j_Y+n-1)
</code>

The calling sequence for ''PDGEMV'' is

<code>
call PDGEMV( trans, m, n,
     :       alpha, al, i_A, j_A, desc_A,
     :       xl, i_X, j_X, desc_X, incx,
     :       beta, yl, i_Y, j_Y, desc_Y, incy )
</code>

The arguments have the following meaning:

<code>
trans:
   if trans = 'N': sub(A) is used in the multiplication
   if trans = 'T': the transpose of sub(A) is used in the multiplication
m :
   number of rows of sub(A);
   if trans = 'N': also the length of subvec(Y)
   if trans = 'T': also the length of subvec(X)
n :
   number of columns of sub(A);
   if trans = 'N': also the length of subvec(X)
   if trans = 'T': also the length of subvec(Y)
alpha:
   scalar factor for the product sub(A) subvec(X)
al:
   local array containing the local values of A
i_A, j_A:
   row and column number in A, where the submatrix sub(A) starts
desc_A:
   descriptor of the global matrix A
xl:
   local array containing the local values of X
i_X, j_X:
   row and column number in X where the subvector subvec(X) starts
desc_X:
   descriptor of the global matrix X
incx:
   if incx=1:   subvec(X) is part of a column of the global matrix X
   if incx=m_X: subvec(X) is part of a row of the global matrix X
beta:
   scalar factor for subvec(Y)
yl:
   local array containing the local values of Y
i_Y, j_Y:
   row and column number in Y where the subvector subvec(Y) starts
desc_Y:
   descriptor of the global matrix Y
incy:
   if incy=1:   subvec(Y) is part of a column of the global matrix Y
   if incy=m_Y: subvec(Y) is part of a row of the global matrix Y
</code>

An example code for executing a matrix vector multiplication using ''PDGEMV'' is shown below.

In the simplest case of the multiplication of an m x n matrix by an n vector yielding an m vector, the global matrix A has size m x n, the global matrix X has size n x 1, the global matrix Y has size m x 1, and the respective parameters are

<code>
m_A = m, n_A = n, mb_A = mb, nb_A = nb, i_A = 1, j_A = 1
m_X = n, n_X = 1, mb_X = nb, nb_X = 1 , i_X = 1, j_X = 1
m_Y = m, n_Y = 1, mb_Y = mb, nb_Y = 1 , i_Y = 1, j_Y = 1
incx = 1, incy = 1, rsrc = 0, csrc = 0
</code>

On a column major process grid of size nprow x npcol the matrix A will be distributed block-cyclically over all processes of the grid, whereas X and Y will be distributed over the processes of the first column in the grid with block sizes nb and mb, respectively. The relevant parts of a code for this case are

<code>
! Parameters for the distributed global matrices
      m = 69; n = 77; mb = 12; nb = 13
! Parameters for processor grid
      nprow = 2; npcol = 3
! set data for A and X; A(i,j) = mati*i + matj*j, X(i,1) = veci*i+vecj
      mati = 1; matj = 1; veci = 1; vecj = 1
! Set up a process grid of size nprow*npcol
      ctxt = ctxt_sys; call BLACS_GRIDINIT( ctxt, 'C', nprow, npcol )
! Get the process coordinates in the grid
      call BLACS_GRIDINFO( ctxt, nprow, npcol, myrow, mycol )
! number of rows and columns of local parts al, xl, yl for A, X, Y
      m_al = NUMROC( m, mb, myrow, 0, nprow )
      n_al = NUMROC( n, nb, mycol, 0, npcol )
      m_xl = NUMROC( n, nb, myrow, 0, nprow ); n_xl = 1
      m_yl = m_al; n_yl = 1
! initializing descriptors for the distributed matrices A, X, Y:
      llda = max(1,m_al); lldx = max(1,m_xl); lldy = max(1,m_yl)
      call DESCINIT( desc_A, m, n, mb, nb, 0, 0, ctxt, llda, info )
      call DESCINIT( desc_X, n, 1, nb, 1, 0, 0, ctxt, lldx, info )
      call DESCINIT( desc_Y, m, 1, mb, 1, 0, 0, ctxt, lldy, info )
! initialize in parallel the local parts al of A
      do jl = 1 , n_al
         j = INDXL2G(jl, nb, mycol, 0, npcol)
         do il = 1 , m_al
            i = INDXL2G(il, mb, myrow, 0, nprow)
            al(il+m_al*(jl-1)) = mati*i + matj*j
         end do
      end do
! initialize in parallel the local parts xl of X
      do il = 1 , m_xl
         i = INDXL2G(il, nb, myrow, 0, nprow)
         xl(il) = veci*i + vecj
      end do
! calculate matrix vector product
      call PDGEMV( 'N', m, n,
     :             1.d0, al, 1, 1, desc_A,
     :             xl, 1, 1, desc_X, 1,
     :             0.d0, yl, 1, 1, desc_Y, 1 )
</code>

The complete program for the use of ''PDGEMV'' can be copied from [[http://

===== Example Programs with ScaLAPACK Driver Routines =====

==== Parallel Equation Solution with PDGESV ====

The ScaLAPACK driver routine ''PDGESV'' computes the solution of a linear system of equations

<code>
a x = b
</code>

where ''a'' is an n x n matrix and ''x'' and ''b'' are vectors of length n or collections of nrhs such vectors.

Similar to the PBLAS routines, ''PDGESV'' operates on submatrices of global block-cyclically distributed matrices. The coefficient matrix ''a'' must be provided as an n x n submatrix sub(A) of a global matrix A

<code>
sub(A)(1:n, 1:n) = A(i_A:i_A+n-1, j_A:j_A+n-1)
</code>

and several (nrhs) right hand side vectors are given as a submatrix sub(B) of B

<code>
sub(B)(1:n, 1:nrhs) = B(i_B:i_B+n-1, j_B:j_B+nrhs-1)
</code>

The nrhs solutions of these equations are stored in sub(B), overwriting the input of the right hand sides.

A pivot vector indicating the exchange of rows in the solution process is stored as a subvector of size n of a global block-cyclically distributed vector V of size m_A and block size mb_A:

<code>
subvec(V)(1:n) = V(i_A:i_A+n-1)
</code>

The calling sequence for ''PDGESV'' is

<code>
call PDGESV( n, nrhs,
     :       al, i_A, j_A, desc_A, ipvtl,
     :       bl, i_B, j_B, desc_B, info )
</code>

The arguments have the following meaning:

<code>
n:
   order of the submatrix sub(A) of coefficients
   and number of rows of the global matrix of right hand sides sub(B)
nrhs:
   number of right hand sides, i.e. number of columns of sub(B)
al:
   local array containing the local values of A
i_A, j_A:
   row and column number in A, where the submatrix sub(A) starts
desc_A:
   descriptor of the global matrix A
ipvtl:
   local integer array of size llda+mb_A.
   This array contains the pivoting information.
   If i is the index of the global row corresponding to the local row index il,
   then ipvtl(il) = index of the global row which was swapped with global row i.
bl:
   local array containing the local values of B,
   on entry the values of bl corresponding to sub(B) contain
   the local parts of the right hand sides,
   on return they contain the local parts of the solution vectors
i_B, j_B:
   row and column number in B where the submatrix sub(B) starts
desc_B:
   descriptor of the global matrix B
info:
   on return, if info=0 the coefficient matrix sub(A) is nonsingular
   and the solutions have been computed;
   if info > 0, the coefficient matrix sub(A) is singular
   and the solutions are not determined.
</code>

As in the case of the PBLAS routines, ''PDGESV'' imposes restrictions on the block sizes and on the positions of the submatrices:

<code>
Blocksizes:
   mb_A = nb_A
   mb_B = nb_A
Positions of submatrices:
   i_A-1 modulo mb_A = 0
   j_A-1 modulo nb_A = 0
   i_B-1 modulo mb_B = 0
   the process row containing the row i_A of A
   must also contain the row i_B of B
</code>

If one of these conditions is violated, the program will abort with an error message.

An example code for solving a system of equations using ''PDGESV'' is shown below.

In the simplest case the global matrix A of size nxn is the coefficient matrix and the global matrix B of size nx1 contains one right hand side on entry and the solution on return, and the root process for A and B has grid coordinates (0,0). The parameters for the sizes and distributions of global matrices and sizes and positions of submatrices in this case are:

<code>
m_A = n, n_A = n, mb_A = nb, nb_A = nb, i_A = 1, j_A = 1
m_B = n, n_B = 1, mb_B = nb, nb_B = 1 , i_B = 1, j_B = 1
rsrc_A = 0, csrc_A = 0, rsrc_B = 0, csrc_B = 0
</code>

On a column major process grid of size nprow x npcol the matrix A will be distributed block-cyclically over all processes of the grid, whereas B will be distributed over the processes of the first column in the grid with block size nb. The relevant parts of a code for this case are

<code>
! Parameters for the distributed global matrices
      n = 9; nb = 2
! Parameters for processor grid
      nprow = 2; npcol = 3
! Set up a process grid of size nprow*npcol
      call BLACS_GET( 0, 0, ctxt ); call BLACS_GRIDINIT( ctxt, 'C', nprow, npcol )
! Get the process coordinates in the grid
      call BLACS_GRIDINFO( ctxt, nprow, npcol, myrow, mycol )
! number of rows and columns of local parts al, bl for A, B
      m_al = NUMROC( n, nb, myrow, 0, nprow )
      n_al = NUMROC( n, nb, mycol, 0, npcol )
      m_bl = NUMROC( n, nb, myrow, 0, nprow )
      n_bl = NUMROC( 1, 1, mycol, 0, npcol )
! initializing descriptors for the distributed matrices A, B:
      llda = max(1,m_al); lldb = max(1,m_bl)
      call DESCINIT( desc_A, n, n, nb, nb, 0, 0, ctxt, llda, info )
      call DESCINIT( desc_B, n, 1, nb, 1, 0, 0, ctxt, lldb, info )

! initialize in parallel the local parts of A in al and in the copy al_s
! (example data, chosen here for illustration: a(i,j) = 1/(i+j-1))
      do jl = 1 , n_al
         j = INDXL2G(jl, nb, mycol, 0, npcol)
         do il = 1 , m_al
            i = INDXL2G(il, nb, myrow, 0, nprow)
            al(il+m_al*(jl-1))   = 1.d0/(i+j-1)
            al_s(il+m_al*(jl-1)) = al(il+m_al*(jl-1))
         end do
      end do
! initialize in parallel the local parts of B in bl
! (example right hand side, chosen here for illustration: b(i) = i)
      do il = 1 , m_bl
         i = INDXL2G(il, nb, myrow, 0, nprow)
         bl(il) = dble(i)
      end do

! solve the systems of equations
      call PDGESV( n, 1, al, 1, 1, desc_A, vl, bl, 1, 1, desc_B, info)
</code>

The complete program for the use of ''PDGESV'' can be copied from [[http://
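
A simple way to check the computed solution is to form the residual r = a·x - b with the PBLAS routine ''PDGEMV'' described above. The following sketch assumes that copies ''al_s'' of the local parts of A (as in the code above) and ''bl_s'' of the local parts of the right hand side were saved before the call to ''PDGESV'', since ''PDGESV'' overwrites its inputs:

<code>
! compute the residual bl_s := A*x - b with PDGEMV;
! bl contains the solution x, bl_s the saved right hand side b
      double precision resmax
      integer il
      call PDGEMV( 'N', n, n, 1.d0, al_s, 1, 1, desc_A,
     :             bl, 1, 1, desc_B, 1,
     :             -1.d0, bl_s, 1, 1, desc_B, 1 )
! print the largest local residual component
! (only the processes of the first grid column hold parts of the vector)
      if (n_bl.gt.0) then
         resmax = 0.d0
         do il = 1, m_bl
            resmax = max(resmax, abs(bl_s(il)))
         end do
         write(6,*) 'process', myrow, mycol, ': local max residual', resmax
      end if
</code>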

==== Parallel Eigenproblem Solution with PDSYEV ====

The ScaLAPACK driver routine ''PDSYEV'' computes all eigenvalues and, optionally, all eigenvectors of a real symmetric matrix ''a''.

''PDSYEV'' proceeds in the usual way by reducing the matrix to tridiagonal form (with the ScaLAPACK routine PDSYTRD), solving the tridiagonal eigenproblem and, if requested, back-transforming the eigenvectors (with the routine PDORMTR).

The symmetric matrix a must be provided as a nxn submatrix sub(A) of a global block-cyclically distributed matrix A

<code>
sub(A)(1:n, 1:n) = A(i_A:i_A+n-1, j_A:j_A+n-1)
</code>

the orthonormal eigenvectors are returned as the columns of an nxn submatrix sub(Z) of a global block-cyclically distributed matrix Z

<code>
sub(Z)(1:n, 1:n) = Z(i_Z:i_Z+n-1, j_Z:j_Z+n-1)
</code>

and all n eigenvalues λ_1 ≤ λ_2 ≤ ... ≤ λ_n are returned in ascending order in an array w of size n, which is available on every process of the grid.

The complete calling sequence for ''PDSYEV'' is

<code>
call PDSYEV ( jobz, uplo, n, al, i_A, j_A, desc_A, w,
     :        zl, i_Z, j_Z, desc_Z,
     :        work, lwork, info )
</code>

where the meaning of the arguments is

<code>
jobz:
   if jobz = 'N': only eigenvalues are computed
   if jobz = 'V': eigenvalues and eigenvectors are computed
uplo:
   if uplo = 'U': the upper triangular part of sub(A) is used
   if uplo = 'L': the lower triangular part of sub(A) is used
n:
   order of sub(A) used in the computation
al:
   local array containing the local values of A,
   on return, the values of al corresponding to the upper or lower triangular part of sub(A)
   are overwritten
i_A, j_A:
   row and column number in A, where the submatrix sub(A) starts
desc_A:
   descriptor of the global matrix A
w:
   array of size n containing on return all eigenvalues in ascending order
zl:
   local array for the local values of Z,
   on return it contains the local parts of the eigenvectors, if jobz = 'V'
i_Z, j_Z:
   row and column number in Z where the submatrix sub(Z) starts
desc_Z:
   descriptor of the global matrix Z
work:
   local workspace array of size ≥ max(1,lwork);
   if lwork = -1, work(1) returns the size of the needed workspace
lwork:
   size of workspace reserved locally. If lwork > 0 on input is less than this size,
   the program aborts.
   If on input globally lwork = -1, PDSYEV returns the needed workspace in work(1)
   without solving the eigenproblem
info:
   on return,
   if info = 0, the execution is successful,
   if info < 0:
      if the i-th argument is an array and the j-entry had an illegal value,
      then info = -(i*100+j),
      if the i-th argument is a scalar and had an illegal value,
      then info = -i,
   if info = n+1, then PDSYEV has detected heterogeneity by finding
      that eigenvalues were not identical across the process grid;
      in this case, the accuracy of the results from PDSYEV cannot be guaranteed
</code>

The [[http://
The reason for possible inconsistencies is that some computations in ''PDSYEV'' are performed redundantly on all processes and their results are assumed to be identical; on heterogeneous processor sets with differing floating point arithmetic this assumption can fail, leading to the return value info = n+1 described above.
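
The workspace query mentioned in the description of ''lwork'' can be used to avoid the explicit workspace formulas given below. A minimal sketch (assuming that the matrices, descriptors and the eigenvalue array ''w'' have been set up as in the example at the end of this section):

<code>
      double precision wquery(1)
      double precision, allocatable :: work(:)
! first call with lwork = -1: PDSYEV only returns the required
! workspace size in wquery(1), the eigenproblem is not solved
      call PDSYEV( 'V', 'U', n, al, 1, 1, desc_A, w,
     :             zl, 1, 1, desc_Z, wquery, -1, info )
      lwork = int( wquery(1) )
      allocate( work(lwork) )
! second call with the allocated workspace solves the eigenproblem
      call PDSYEV( 'V', 'U', n, al, 1, 1, desc_A, w,
     :             zl, 1, 1, desc_Z, work, lwork, info )
</code>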

As in the case of the PBLAS routines, ''PDSYEV'' imposes restrictions on the global matrices, on the block sizes and on the positions of the submatrices:

<code>
Global matrices:
   m_A = m_Z
   n_A = n_Z
   mb_A = nb_A = mb_Z = nb_Z = nb
Positions of submatrices:
   i_A-1 modulo nb = 0
   j_A-1 modulo nb = 0
   i_Z-1 modulo nb = 0
   the process row containing the row i_A of A
   must also contain the row i_Z of Z
</code>

If one of these conditions is violated, the program will abort with an error message.

Furthermore, the distributions of A and Z must use the same source process, i.e. rsrc_A = rsrc_Z and csrc_A = csrc_Z.

Each of the local processes has to provide memory space for the local parts of the global matrices A and Z and for all n eigenvalues. In addition, memory space for the local work array ''work'' of size ''lwork'' has to be provided; the required size depends on the kind of computation:

If ''jobz = 'N''', only eigenvalues are computed, and the needed workspace is

<code>
size_w = n*5 + lel + 1
lel    = workspace requirement for the ScaLAPACK routine PDSYTRD
       ≤ nb * max( LELM(n, nb, nprow) + 1, 3 )
</code>

where ''LELM'' is the maximal number of local elements of a block-cyclically distributed vector, as defined above.
For block sizes larger than 2 the following value for ''lwork'' is therefore sufficient:

<code>
lwork = 5n + nb*( ⌊(n-1)/(nb*nprow)⌋*nb + nb + 1 ) + 1
</code>

If ''jobz = 'V''', eigenvalues and eigenvectors are computed, and the needed workspace is

<code>
size_w = n*5 + max( 2*(n-2), lel1 + lel2 ) + 1
lel1   = workspace requirement for the ScaLAPACK routine PDORMTR when its SIDE argument is 'L'
       ≤ nb*( LELM(n+1, nb, nprow) + 1 )
lel2   = number of rows of a block-cyclically (block size nb)
         distributed vector of size n
       ≤ LELM(n, nb, nprow)
</code>

A sufficiently large value for ''lwork'' in this case is

<code>
lwork = 5n + max( 2n, lel1 + lel2 ) + 1
</code>

An example code for the determination of eigenvalues and eigenvectors using ''PDSYEV'' is shown below.

A demo code for simple values for all parameters is [[http://
<code>
! input parameters
      n = 1000; nb = 24
! Parameters for processor grid
      nprow = 2; npcol = 2
! Set up a process grid of size nprow*npcol
      call BLACS_GET( 0, 0, ctxt ); call BLACS_GRIDINIT( ctxt, 'C', nprow, npcol )
! Get the process coordinates in the grid
      call BLACS_GRIDINFO( ctxt, nprow, npcol, myrow, mycol )
! number of rows and columns of local parts al, zl for A, Z
      m_al = NUMROC( n, nb, myrow, 0, nprow )
      n_al = NUMROC( n, nb, mycol, 0, npcol )
! size of workarray
      lel1  = nb**2*(n/(nb*nprow)+1) + nb
      lel2  = nb*((n-1)/(nb*nprow)+1)
      lwork = n*5 + max(2*n, lel1+lel2) + 1
! initializing descriptors for the distributed matrices:
      llda = max(1,m_al)
      call DESCINIT( desc_A, n, n, nb, nb, 0, 0, ctxt, llda, info )
      call DESCINIT( desc_Z, n, n, nb, nb, 0, 0, ctxt, llda, info )
! initialize in parallel the local parts of A in al
! (example data, chosen here for illustration: the symmetric matrix a(i,j) = 1/(i+j-1))
      do jl = 1 , n_al
         j = INDXL2G(jl, nb, mycol, 0, npcol)
         do il = 1 , m_al
            i = INDXL2G(il, nb, myrow, 0, nprow)
            al(il+m_al*(jl-1)) = 1.d0/(i+j-1)
         end do
      end do
! calculate eigenvalues and eigenvectors
      call PDSYEV( 'V', 'U', n, al, 1, 1, desc_A, w,
     :             zl, 1, 1, desc_Z,
     :             work, lwork, info )
</code>

[[Kategorie: