====== Using GPUs from GWDG's Compute Cluster ======

===== GPGPU - General Purpose Computing on Graphics Processor Units =====

GPGPU generalizes the ability of graphics processing units - to operate in parallel on a large number of pixels of a picture - to perform numerical operations in parallel on a large number of elements of a general array of data.

The GPU has a large number of simple processing units which work on data in the GPU's own main memory. The GPU is attached as a coprocessor to a multicore host processor via the PCI bus, as shown in the following picture.

{{: }}

From the application running on the host, suitable parts are split off and transferred to the GPU for processing. There are special programming environments for specifying the host and coprocessor parts of an application. In particular, NVIDIA provides the CUDA programming environment for GPGPU with their graphics coprocessors.

Compute systems for HPC consist of a large number of nodes connected by a high speed network, each node containing a number of multicore cpus and a number of attached GPUs.

The following notes describe and explain the different ways to use single and multiple GPUs on GWDG's scientific compute cluster.
===== GPUs on GWDG's Scientific Compute Cluster =====

As of July 2017, the following nodes in the cluster are equipped with NVIDIA GPUs:

  * gwdo161-gwdo180,
  * dge001-dge007,
  * dge008-dge014,
  * dge015,
  * dte001-dte010,

The GeForce GPUs are ordinary graphics cards with a focus on single precision operations, whereas the Tesla GPU has a larger main memory and a larger number of cores for double precision operations, as needed for numerically intensive applications.

This is detailed in the following table with properties of the different NVIDIA models.

^ Model ^ Architecture ^ Compute\\ Capability ^ Clock Rate\\ [MHz] ^ Memory\\ [GB] ^ SP Cores ^ DP Cores ^
| GeForce GTX 770 | Kepler | 3.0 | 1110 | 2 | 1536 | 0 |
| GeForce GTX 980 | Maxwell | 5.2 | 1126 | 4 | 2048 | 64 |
| GeForce GTX 1080 | Pascal | 6.1 | 1733 | 8 | 2560 | 80 |
| Tesla K40m | Kepler | 3.5 | 745 | 12 | 2880 | 960 |

===== A simple CUDA Example =====

In the following, a simple application - adding two vectors - will be given as an example showing the basic mechanism for offloading operations from host to graphics device within the CUDA programming environment.

CUDA is based on a standardized programming language and adds new language constructs to specify the operations to be executed on the GPU, to move data between the memories of host and GPU, and to start and synchronize the operations on the GPU. There are CUDA environments for the C, C++ and Fortran languages. In the example the C language will be used.

CUDA programs are stored in files with the suffix **''.cu''**. They consist of functions to be executed on the host and of functions to be executed on the GPU device.

Adding two vectors on a GPU is realized by the following program file:

<code c>
#include <stdio.h>
#include <stdlib.h>

__global__ void add_d( int N, float *a, float *b, float *c ) {
   int i = threadIdx.x + blockIdx.x*blockDim.x;
   if (i < N) c[i] = a[i] + b[i];
}

int main(void) {
   int N = 4, i;
   float *a, *b, *c;
   float *a_d, *b_d, *c_d;

   a = (float *)malloc( sizeof(float)*N );
   b = (float *)malloc( sizeof(float)*N );
   c = (float *)malloc( sizeof(float)*N );

   cudaMalloc( (void **)&a_d, sizeof(float)*N );
   cudaMalloc( (void **)&b_d, sizeof(float)*N );
   cudaMalloc( (void **)&c_d, sizeof(float)*N );

   for (i=0; i<N; i++) {
      a[i] = i;
      b[i] = 2*i;
   }

   cudaMemcpy( a_d, a, sizeof(float)*N, cudaMemcpyHostToDevice );
   cudaMemcpy( b_d, b, sizeof(float)*N, cudaMemcpyHostToDevice );

   add_d<<<(N+255)/256,256>>>( N, a_d, b_d, c_d );

   cudaMemcpy( c, c_d, sizeof(float)*N, cudaMemcpyDeviceToHost );

   for (i=0; i<N; i++) {
      printf( "%f + %f = %f\n", a[i], b[i], c[i] );
   }

   cudaFree( a_d ); cudaFree( b_d ); cudaFree( c_d );
   free( a ); free( b ); free( c );
}
</code>

In the main program, to be executed on the host, memory is allocated for two sets of three vectors:
  * host memory for **''a''**, **''b''**, **''c''** with the C function **''malloc''**,
  * device memory for **''a_d''**, **''b_d''**, **''c_d''** with the CUDA function **''cudaMalloc''**.

After the initialization of the two input vectors **''a''** and **''b''** on the host, their values are copied to the device vectors **''a_d''** and **''b_d''** with the CUDA function **''cudaMemcpy''**.

The offloading of the add operation onto the device is effected by the call **''add_d<<<nblocks,nthreads>>>(N,a_d,b_d,c_d)''**. This call instructs the device to execute the function **''add_d''**, with the numbers of thread blocks and of threads per block given in the triple bracket **''<<<...>>>''**, the execution configuration.

The function **''add_d''** is prefixed with the qualifier **''__global__''**, which declares it as a kernel function to be executed on the device. Within a device function, the predefined thread local variables **''threadIdx''**, **''blockIdx''** and **''blockDim''** are available, which give every thread a unique identity, allowing every thread to work on different data.
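
To see how an execution configuration covers the N vector elements, here is a small host-only C sketch (not part of the original example) that enumerates the threads of a launch with ''nblocks'' blocks of ''nthreads'' threads each and counts how many of them pass the guard ''i < N'':

```c
#include <assert.h>

/* Host-side simulation of the CUDA index scheme: thread (blk,thr) gets the
   global index i = thr + blk*nthreads, and the guard i < N skips the
   surplus threads of the last block. */
int covered(int N, int nblocks, int nthreads) {
    int count = 0;
    for (int blk = 0; blk < nblocks; blk++)
        for (int thr = 0; thr < nthreads; thr++) {
            int i = thr + blk * nthreads;  /* threadIdx.x + blockIdx.x*blockDim.x */
            if (i < N) count++;            /* only these threads do work */
        }
    return count;
}
```

With ''(N+255)/256'' blocks of 256 threads, every element is covered exactly once, whatever the value of N.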

The result of the vector addition is stored in the device vector **''c_d''** and must be transferred to the host vector **''c''** by a further call to **''cudaMemcpy''**.

For a detailed explanation of CUDA's language constructs and of the mechanism mapping CUDA threads onto the cores of a given GPU, consult NVIDIA's CUDA documentation. More information can also be found in the presentations of the corresponding GWDG course.

===== Compiling and Executing CUDA Programs on GWDG's Compute Clusters =====

The frontend nodes of GWDG's cluster provide NVIDIA's CUDA compiler **''nvcc''**, which produces executables from CUDA program files. The environment for CUDA must be prepared by loading the corresponding module file with the command\\ \\ **''module load cuda''**\\ \\
Invoking the compile and link steps with\\ \\
**''nvcc add.cu -o add.exe''**\\ \\
produces the executable **''add.exe''**. By default, code for GPUs with Compute Capability 3.0 and higher will be produced, so this executable will run on all GPU models in GWDG's cluster.

Jobs on GWDG's cluster are managed by IBM Spectrum LSF (Load Sharing Facility), formerly IBM Platform LSF. Jobs are submitted to the various queues with the **''bsub''** command. The queue **''gpu''** contains the nodes with one or more GPUs.

The executable **''add.exe''** can be submitted to this queue with the command\\ \\

**''bsub -q gpu -n 1 -R "rusage[ngpus_shared=1]" -o out.%J ./add.exe''**\\ \\

The option **''-n 1''** requests a single core for executing the host part of the CUDA program.

With the option **''-R "rusage[ngpus_shared=<num>]"''** the job requests **''<num>''** shares of the GPU resources of a node in the queue **''gpu''**. The maximal number of shares on a node is equal to its number of cores; the scheduler limits the sum of the ngpus_shared parameters of the running jobs on a node to this maximal number.

A job permanently utilizing all gpus of a node should be submitted with the maximal value for ngpus_shared. This will grant the job the exclusive use of all the node's gpus. Jobs using the gpu only for a small fraction of their total execution time should use a small value for ngpus_shared, thus allowing other jobs to share the gpu resources of this node. In particular, 24 jobs can run simultaneously on a node with 24 cores, if they all have been submitted with the parameters **''-n 1 -R "rusage[ngpus_shared=1]"''**.

With the option **''-o out.%J''** the output of the job will be stored in a file in the directory from which the job was submitted, where **''%J''** is replaced by the job number. Without this option the output of the job will be sent by email to the submitter's address.

The options for the **''bsub''** command can also be collected in a jobfile, e.g.

<code>
#!/bin/sh

#BSUB -q gpu
#BSUB -n 1
#BSUB -R "rusage[ngpus_shared=1]"
#BSUB -o out.%J

./add.exe
</code>

which can be submitted with the command\\ \\
**''bsub < jobfile''**

More options for the **''bsub''** command and the descriptions of the other LSF commands can be found in GWDG's documentation for the LSF batch system and in the man pages for the LSF commands.

===== Splitting the Program File =====

In order to demonstrate the different ways for using multiple gpus in a unified way, it is convenient to separate the program into two files:\\
a file **''main_add.c''** containing the main program with the code for the host, and a file **''add.cu''** containing the code for executing the addition on the gpu device.

**''main_add.c''**:
<code c>
#include <stdio.h>
#include <stdlib.h>
void add(int, int, float *, float *, float *);

int main(int argc,char **argv)
{
   int dev_nbr = 0, N = 6, i;
   float *a, *b, *c;
   a = (float *)malloc( N*sizeof(float) );
   b = (float *)malloc( N*sizeof(float) );
   c = (float *)malloc( N*sizeof(float) );

   // Initialize a and b
   for (i=0; i<N; i++){
      a[i] = i; b[i] = 2*i;
   }

   // call function add from file add.cu
   add(dev_nbr, N, a, b, c);

   for(i=0; i<N; i++) {
      printf( "%f + %f = %f\n", a[i], b[i], c[i] );
   }
}
</code>
\\ \\
**''add.cu''**:
<code c>
__global__ void add_d( int N, float *a, float *b, float *c ) {
   int i = threadIdx.x + blockIdx.x*blockDim.x;
   if (i < N) c[i] = a[i] + b[i];
}

extern "C" void add(int dev_nbr, int N, float *a, float *b, float *c)
{  float *a_d, *b_d, *c_d;

   cudaSetDevice(dev_nbr);

   cudaMalloc( (void **)&a_d, N*sizeof(float) );
   cudaMalloc( (void **)&b_d, N*sizeof(float) );
   cudaMalloc( (void **)&c_d, N*sizeof(float) );

   cudaMemcpy( a_d, a, N*sizeof(float), cudaMemcpyHostToDevice );
   cudaMemcpy( b_d, b, N*sizeof(float), cudaMemcpyHostToDevice );

   add_d<<<(N+255)/256,256>>>( N, a_d, b_d, c_d );

   cudaMemcpy( c, c_d, N*sizeof(float), cudaMemcpyDeviceToHost );

   cudaFree( a_d ); cudaFree( b_d ); cudaFree( c_d );
}
</code>

In addition to the CUDA commands already present in the single-file version, the function **''add''** in **''add.cu''** contains the call **''cudaSetDevice(dev_nbr)''**. This call sets the number of the device to be used for the execution of the following CUDA code. In the **''main_add.c''** file the value 0 is chosen for **''dev_nbr''**; the following sections will show how calls with different values for **''dev_nbr''** allow several devices to be used. Furthermore, the definition of the function **''add''** is prefixed with the qualifier **''extern "C"''** to make this function callable from the c program in **''main_add.c''**.
Each of the two files now has to be compiled by the appropriate compiler:
<code>
gcc -c main_add.c
nvcc -c add.cu
</code>
For linking the two generated object files, the CUDA runtime library **''cudart''** is needed:
<code>
gcc main_add.o add.o -lcudart -o add.exe
</code>
The submission of the executable **''add.exe''** proceeds as described in the previous section.
===== Using Several GPUs of a Single Node - Multiple Executables =====

Each of the n gpus attached to a single node has a unique device number, running from zero to n-1. By default, a CUDA program running on the node will use device number 0. A specific device can be selected by passing its device number to the function **''cudaSetDevice''** before the first CUDA call involving the device.

The simplest way to use the n gpus of a single node in parallel is therefore to prepare n different executables, each passing a different device number to **''cudaSetDevice''**. Let e.g. **''exe0''** and **''exe1''** be two executables which internally call **''cudaSetDevice(0)''** and **''cudaSetDevice(1)''**, respectively. Then both gpus of a node with two devices can be used simultaneously with the following jobfile:

<code>
#!/bin/sh

#BSUB -q gpu
#BSUB -W 1:00
#BSUB -o out.%J
#BSUB -n 2
#BSUB -R "ngpus=2"
#BSUB -R "rusage[ngpus_shared=24]"

./exe0 > out0 &
./exe1 > out1 &
wait
</code>

This script requests 2 cores on a node belonging to the queue **''gpu''** and, by the resource requirement **''ngpus=2''**, will be submitted to a node with 2 gpus. The request for 24 gpu shares guarantees the job exclusive use of this node (because in GWDG's cluster all nodes with 2 gpus have 24 cores). The **''&''** at the end of the two commands detaches the commands from the shell, so that the two executables run simultaneously; the final **''wait''** keeps the job alive until both have finished.

===== Using Several GPUs of a Single Node - Multiple Threads =====

In a multithreaded execution environment, different threads can select different devices by calling **''cudaSetDevice''** with their own device numbers. In this way multiple threads can be used to distribute work for simultaneous execution on several gpus of a single node. In the following example, the **OpenMP** programming environment will be employed for the management of multiple threads.

Again the program parts for setting up the host environment will be collected into a separate file, **''main_add.c''**. The main program in this case has to start multiple threads, each of which will call the function add from the file **''add.cu''** described in the previous sections. The number of gpus available on a node can be determined with the CUDA function **''cudaGetDeviceCount''**, which therefore is called from a function **''devcount''**, included in the cu-file **''add.cu''**:
<code c>
extern "C" int devcount() {
   int num_gpus; cudaGetDeviceCount(&num_gpus);
   return num_gpus;
}
</code>

The main program in **''main_add.c''** sets the number of threads equal to the number of gpus on the node by the OpenMP function call **''omp_set_num_threads(num_gpus)''**, and this number of threads is started by the OpenMP compiler directive **''#pragma omp parallel''**.
With the **''shared''** clause of this directive, variables are declared to be shared between the threads: all threads use the same address accessing these variables. On the other hand, all variables declared within the parallel region following the **''#pragma omp parallel''** directive are private: every thread has its own copy. Within the parallel region, the call to the OpenMP function **''omp_get_thread_num()''** stores into the local variable **''tid''** a unique thread number, running from 0 to num_gpus-1, and in the call to the function **''add''** the device number is set to the local value of **''tid''**, such that every thread uses its own gpu.
The following code shows how every thread defines its own range of indices, for which the vector addition will be performed.

**''main_add.c''**:
<code c>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
extern void add(int, int, float *, float *, float *);
extern int devcount();

int main(void) {
   int N = 1000, i;
   float *a, *b, *c;

   a = (float *)malloc( sizeof(float)*N );
   b = (float *)malloc( sizeof(float)*N );
   c = (float *)malloc( sizeof(float)*N );

   for (i=0; i<N; i++) {
      a[i] = i;
      b[i] = 2*i;
   }

   // num_gpus: number of gpus on the node
   int num_gpus = devcount();
   int nrct= N/num_gpus; int nrpl = N -nrct*num_gpus;
   // activate num_gpus threads
   omp_set_num_threads(num_gpus);
   #pragma omp parallel shared(a,b,c,nrct,nrpl)
   { int tid = omp_get_thread_num();
     // n_loc: number of elements for this thread
     int n_loc = nrct; if (tid < nrpl) n_loc = nrct+1;
     // offs: offset in global vector
     int offs = tid*(nrct+1); if (tid >= nrpl) offs = tid*nrct + nrpl;
     // gpu with device number tid adds elements with indices offs to offs+n_loc-1
     add(tid, n_loc, &a[offs], &b[offs], &c[offs]);
   }
   for (i=0; i<3; i++) {
      printf( "%f + %f = %f\n", a[i], b[i], c[i] );
   }
   for (i=N-3; i<N; i++) {
      printf( "%f + %f = %f\n", a[i], b[i], c[i] );
   }
}
</code>
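
The distribution of the N elements over the threads can be checked on the host alone. The following sketch isolates the nrct/nrpl partitioning, assuming the usual balanced scheme in which the first nrpl threads get one extra element (the offset line in the original listing is truncated):

```c
#include <assert.h>

/* Balanced partitioning of N elements over num_gpus threads: nrct elements
   per thread, the first nrpl threads get one more. n_loc and offs are the
   element count and starting index for thread tid. */
void partition(int N, int num_gpus, int tid, int *n_loc, int *offs) {
    int nrct = N / num_gpus, nrpl = N - nrct * num_gpus;
    *n_loc = (tid < nrpl) ? nrct + 1 : nrct;
    *offs  = (tid < nrpl) ? tid * (nrct + 1) : tid * nrct + nrpl;
}
```

Successive partitions are contiguous: each offset equals the previous offset plus the previous n_loc, and the n_loc values sum to N.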

The compilation and link commands for this program are
<code>
gcc -fopenmp -c main_add.c
nvcc -c add.cu
gcc -fopenmp main_add.o add.o -lcudart -o add.exe
</code>
A jobfile for submitting this executable to a node with 2 gpus, requesting exclusive use of the gpus, is
<code>
#BSUB -q gpu
#BSUB -W 0:05
#BSUB -n 2
#BSUB -o out.%J
#BSUB -R "ngpus=2"
#BSUB -R "rusage[ngpus_shared=24]"

./add.exe
</code>

===== Using GPUs on Different Nodes - Multiple MPI Tasks =====
GWDG's cluster has a number of nodes containing gpus, which are connected by an Infiniband high speed network. In order to use the gpus on different nodes simultaneously, the message passing framework MPI for distributed computing will be used. The main program file **''main_add_mpi.c''** distributes the global vectors among the MPI tasks; every task then calls the function **''add''** from the CUDA file **''add.cu''**, using the device with device number 0.

**''main_add_mpi.c''**:
<code c>
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
extern void add(int, int, float *, float *, float *);

int main(int argc,char **argv)
{
   int N = 1000, i, np, me;
   float *a, *b, *c, *a_n, *b_n, *c_n;

   MPI_Init(&argc,&argv);
   MPI_Comm_size(MPI_COMM_WORLD,&np);
   MPI_Comm_rank(MPI_COMM_WORLD,&me);
   // allocate and initialize global vectors
   a = b = c = NULL;
   if (me == 0) {
      a = (float *)malloc( sizeof(float)*N );
      b = (float *)malloc( sizeof(float)*N );
      c = (float *)malloc( sizeof(float)*N );
      for (i=0; i<N; i++) {
         a[i] = i;
         b[i] = 2*i;
      }
   }

   int nrct= N/np; int nrpl = N -nrct*np;
   // n_loc: number of elements on this task
   int n_loc = nrct; if (me < nrpl) n_loc = nrct+1;
   // offs: offset in global vector
   int offs = me*(nrct+1); if (me >= nrpl) offs = me*nrct + nrpl;
   // task me adds elements of global vector with indices offs to offs+n_loc-1
   // gather the partitioning information from all tasks
   int n_locs[np], offss[np];
   MPI_Allgather(&n_loc, 1, MPI_INT, n_locs, 1, MPI_INT, MPI_COMM_WORLD);
   MPI_Allgather(&offs, 1, MPI_INT, offss, 1, MPI_INT, MPI_COMM_WORLD);
   // allocate local vectors and distribute the input vectors
   a_n = (float *)malloc( n_loc*sizeof(float) );
   b_n = (float *)malloc( n_loc*sizeof(float) );
   c_n = (float *)malloc( n_loc*sizeof(float) );
   MPI_Scatterv(a, n_locs, offss, MPI_FLOAT, a_n, n_loc, MPI_FLOAT, 0, MPI_COMM_WORLD);
   MPI_Scatterv(b, n_locs, offss, MPI_FLOAT, b_n, n_loc, MPI_FLOAT, 0, MPI_COMM_WORLD);
   int dev_nbr = 0;
   add(dev_nbr, n_loc, a_n, b_n, c_n);
   MPI_Gatherv(c_n, n_loc, MPI_FLOAT, c, n_locs, offss, MPI_FLOAT, 0, MPI_COMM_WORLD);

   if (me == 0) {
      for (i=0; i<3; i++) {
         printf( "%f + %f = %f\n", a[i], b[i], c[i] );
      }
      for(i=N-3; i<N; i++) {
         printf( "%f + %f = %f\n", a[i], b[i], c[i] );
      }
   }
   MPI_Finalize();
}
</code>

The main program starts by allocating memory for the global vectors **''a''**, **''b''**, **''c''** on task 0 and initializing the global input vectors **''a''** and **''b''** there. The first index and the number of indices to be handled by the different tasks are stored in the integer arrays **''offss''** and **''n_locs''**, which are made available on all tasks with the MPI function **''MPI_Allgather''**. Then the pieces of the global input vectors are distributed to the corresponding local vectors **''a_n''** and **''b_n''** of the different tasks using the MPI function **''MPI_Scatterv''**. Every task then performs the addition of its local vector pieces on its own gpu device. Finally the local results in **''c_n''** are collected into the global vector **''c''** on task 0 with the MPI function **''MPI_Gatherv''**.
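
The data flow of this Scatterv/add/Gatherv sequence can be mimicked on a single host with plain ''memcpy'' calls. The following sketch (a simplification, with no MPI involved) processes the task loop sequentially:

```c
#include <assert.h>
#include <string.h>

/* Host-only sketch of the Scatterv/Gatherv pattern: distribute a global
   vector into per-task pieces according to the n_loc/offs partitioning,
   add the pieces, and gather the results back. np plays the role of the
   number of MPI tasks (pieces of at most 16 elements here). */
void scatter_add_gather(int N, int np, const float *a, const float *b, float *c) {
    int nrct = N / np, nrpl = N - nrct * np;
    for (int me = 0; me < np; me++) {
        int n_loc = (me < nrpl) ? nrct + 1 : nrct;
        int offs  = (me < nrpl) ? me * (nrct + 1) : me * nrct + nrpl;
        float a_n[16], b_n[16], c_n[16];              /* local pieces */
        memcpy(a_n, a + offs, n_loc * sizeof(float)); /* MPI_Scatterv */
        memcpy(b_n, b + offs, n_loc * sizeof(float));
        for (int i = 0; i < n_loc; i++) c_n[i] = a_n[i] + b_n[i]; /* add on device */
        memcpy(c + offs, c_n, n_loc * sizeof(float)); /* MPI_Gatherv */
    }
}
```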

For the compilation of the MPI code in **''main_add_mpi.c''** an MPI implementation is needed. On GWDG's cluster Intel-MPI and OpenMPI are available; the OpenMPI implementation based on the gcc compiler will be used here. The corresponding module file for OpenMPI must be loaded in addition to the CUDA module file:
<code>
module load openmpi/gcc
</code>
The compile and link steps are
<code>
mpicc -c main_add_mpi.c
nvcc -c add.cu
mpicc main_add_mpi.o add.o -lcudart -o add.exe
</code>

With the following jobfile this executable will be submitted to run on two nodes:
<code>
#BSUB -q gpu
#BSUB -W 0:05
#BSUB -n 2
#BSUB -o out.%J
#BSUB -m "..."
#BSUB -R "span[ptile=1]"
#BSUB -R "rusage[ngpus_shared=24]"
#BSUB -a openmpi

mpirun.lsf ./add.exe
</code>

The option **''#BSUB -R "span[ptile=1]"''** guarantees that only one task is running on a node, such that the two tasks of this job will run on different nodes, and **''#BSUB -R "rusage[ngpus_shared=24]"''** reserves the exclusive use of the nodes' gpus for this job. The host list for the option **''#BSUB -m''** has to name the gpu nodes to be used.

===== Using GPUs on Different Nodes - More than One GPU per Node =====

The previous section describes the use of multiple gpus from different nodes, with the restriction that on every node only one gpu is used. But most of the nodes in GWDG's cluster equipped with gpus have two or more devices. The MPI parallelization can easily be generalized to allow the use of several devices on each node. With the MPI function **''MPI_Get_processor_name''** every task can determine the name of the node it is running on. The function **''dev_nbr_node''** below uses these names to assign to every task a device number, such that tasks running on the same node get different device numbers: \\ \\
<code c>
#include <string.h>

int dev_nbr_node(int tid) {
   int me, np, ip;
   MPI_Comm_rank( MPI_COMM_WORLD, &me );
   MPI_Comm_size( MPI_COMM_WORLD, &np );
   int namelen = 60, len;
   char names[np][namelen], name[namelen];
   MPI_Get_processor_name( name, &len );
   MPI_Allgather( name, namelen, MPI_CHAR, names, namelen, MPI_CHAR, MPI_COMM_WORLD );
   int ct = -1, dev_nbr = 0;
   for (ip=0; ip<np; ip++) {
      if (strcmp(names[me],names[ip]) == 0) {
         ct = ct + 1;
         if (tid == ip) dev_nbr = ct;
      }
   }
   return dev_nbr;
}
</code>
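
The counting logic can be tested without MPI by supplying the gathered name table directly. The following host-only sketch (with hypothetical node names) reproduces the loop of ''dev_nbr_node'':

```c
#include <assert.h>
#include <string.h>

/* Given the node names of all np tasks, task tid gets as device number the
   count of tasks with rank <= tid running on the same node, starting at 0. */
int dev_nbr_node_sim(int np, const char *names[], int tid) {
    int ct = -1, dev_nbr = -1;
    for (int ip = 0; ip < np; ip++) {
        if (strcmp(names[tid], names[ip]) == 0) {
            ct++;                       /* another task on the same node */
            if (tid == ip) dev_nbr = ct;
        }
    }
    return dev_nbr;
}
```

Two tasks on each of two nodes thus get the device numbers 0 and 1 on their respective nodes.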

The compile and link steps for this generalization of the combined MPI-CUDA program are the same as in the previous section. An example of a jobfile for the submission of a job using a total of 4 gpus on two nodes is the following:

<code>
#BSUB -q gpu
#BSUB -W 0:05
#BSUB -n 4
#BSUB -o out.%J
#BSUB -m "..."
#BSUB -R "span[ptile=2]"
#BSUB -R "rusage[ngpus_shared=24]"
#BSUB -a openmpi

mpirun.lsf ./a.out
</code>

The use of multiple gpus on multiple nodes can also be achieved with a hybrid approach, by starting one MPI task on every node and by spawning in every task as many OpenMP threads as there are gpus to be used on the node.

===== A Real World Application: Diffusion =====

In the vector addition example every gpu just had to add its share of elements of the two vectors; no information had to be exchanged between different gpus. In the following, the simulation of diffusion on a two-dimensional grid will be discussed as an application in which data exchange between different gpus is required. Diffusion describes the change of some property, e.g. temperature, over time in a given domain of space, starting from a given initial distribution of this property and given fixed values for the property on the boundary of the domain.

Diffusion will be simulated by calculating a new value of the property at a grid point by combining the old values at this point and at the four neighbouring points, and iterating this update procedure a number of times.

The following picture shows in its upper part the two dimensional grid of size N x N of values u(i1,i2), where each index runs from 0 to N-1. Only the values of the inner grid points with i1,i2 = 1,..,n=N-2 will be updated; the boundary values with i1 = 0 or N-1 and i2 = 0 or N-1 will stay fixed at their values given by the boundary condition of the problem. In memory the two dimensional array u(i1,i2) will be stored as a one dimensional contiguous sequence of values u(i), i = 0,...,N*N-1, with the choice i = i2+N*i1, corresponding to stringing together the rows of the two dimensional array. Also shown in the picture is the neighbourhood for updating the value at a particular point, both in the two dimensional and the linear setting.
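
The mapping i = i2 + N*i1 and the resulting neighbour offsets can be stated as a one-line helper:

```c
#include <assert.h>

/* Linearization of the N x N grid: u(i1,i2) is stored at i = i2 + N*i1,
   so the four neighbours of a point sit at offsets -1, +1, -N and +N. */
int lin(int N, int i1, int i2) { return i2 + N * i1; }
```

Moving one step in the i2 direction changes the linear index by 1; moving one step in the i1 direction changes it by N.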

{{: }}

The c-function ''update'' performs one update step on a grid with n1 x n2 inner points. It reads the old values of the property from array u and stores the updated values in array v. Notice that all (n1+2)*(n2+2) elements of array u with the old values are needed in order to calculate the inner n1*n2 elements of array v with the updated values.

<code c>
void update (int n1, int n2, float *u, float *v)
{
   int i1, i2, i;
   float r = 0.2f, s = 1.f - 4.f*r;

   for ( i2 = 1; i2 <= n2 ; i2++ ) {
      for ( i1 = 1; i1 <= n1 ; i1++ ) {
         i = i2 +(n2+2)*i1;
         v[i] = s* u[i]
              + r * ( u[i-1] + u[i+1] + u[i-(n2+2)] + u[i+(n2+2)] );
      }
   }
}
</code>
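
A quick host-side sanity check (not part of the original code): with r = 0.2 the stencil weights s and 4r sum to one, so a constant field is left unchanged by an update step. The sketch below repeats the same stencil so that it is self-contained:

```c
#include <assert.h>
#include <math.h>

/* Same 5-point stencil as the update() function above, repeated here so
   the block is self-contained: v = s*u + r*(4 neighbours), r = 0.2. */
static void update_step(int n1, int n2, const float *u, float *v) {
    float r = 0.2f, s = 1.f - 4.f*r;
    for (int i1 = 1; i1 <= n1; i1++)
        for (int i2 = 1; i2 <= n2; i2++) {
            int i = i2 + (n2+2)*i1;
            v[i] = s*u[i] + r*( u[i-1] + u[i+1] + u[i-(n2+2)] + u[i+(n2+2)] );
        }
}
```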

The corresponding cuda-function ''update_d'' for one update step on the gpu is:

<code c>
__global__ void update_d( int n1, int n2, const float * __restrict__ u,
                          float* __restrict__ v) {
   int i2 = 1+threadIdx.x + blockIdx.x*blockDim.x;
   int i1 = 1+threadIdx.y + blockIdx.y*blockDim.y;
   int i = i2 + (n2+2)*i1;
   float r = 0.2f, s = 1 - 4*r;
   if( i1 <= n1 && i2 <= n2 ) v[i] = s* u[i]
              + r * ( u[i-1] + u[i+1] + u[i-(n2+2)] + u[i+(n2+2)] );
}
</code>

As in the vector addition example, the complete code for the simulation of diffusion will be split into two parts: a file main_diff.c containing the main program and all functions with code to be executed on the host, and a file diff_single_gpu.cu containing all functions with code to be executed on a gpu.

Main program:

<code c>
int main (int argc, char **argv)
{
   int n, nt;
   float *u_h;

   /* ------------ input of run parameters ---------------- */
   ...

   /* ------------ allocate global array ------- */
   int size_gl = (n+2)*(n+2)*sizeof(float);
   u_h = (float *)malloc( size_gl );

   /* ------------ initialize initial and boundary values ------- */
   ...

   /* --------- iterate updates -------------- */
   diffusion(n, nt, u_h);

   /* ------------ output of results ------------------------------ */
   ...
}
</code>

The main program invokes the function ''diffusion'' from the file diff_single_gpu.cu:

<code c>
extern "C" void diffusion(int n, int nt, float *u_h)
{
   int il, size = (n+2)*(n+2)*sizeof(float);
   float *u; cudaMalloc((void **) &u, size);
   float *v; cudaMalloc((void **) &v, size);
   cudaMemcpy(u, u_h, size, cudaMemcpyHostToDevice);
   cudaMemcpy(v, u_h, size, cudaMemcpyHostToDevice);
   int blksz_x = 32, blksz_y = 32;
   int grdsz_x = (n+2+blksz_x-1)/blksz_x, grdsz_y = (n+2+blksz_y-1)/blksz_y;
   dim3 blks(blksz_x,blksz_y), grds(grdsz_x,grdsz_y);
   for ( il = 1; il<=nt; il++) {
      update_d<<<grds,blks>>>(n, n, u, v);
      update_d<<<grds,blks>>>(n, n, v, u);
   }
   cudaMemcpy(u_h, u, size, cudaMemcpyDeviceToHost);
   cudaFree(u); cudaFree(v);
}
</code>

As in the vector addition case, the function ''diffusion'' allocates device memory, copies the initial values to the device, launches the update kernel and copies the result back to the host. The execution configuration ''<<<grds,blks>>>'' uses two dimensional grid and block sizes, which allow a simple specification of the indices of the inner points of the property field in the device function ''update_d''. Since a single block of threads in cuda can have at most 1024 threads, the two dimensional set of grid points is partitioned into a two dimensional grid of two dimensional blocks of threads.
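
The grid size used above is the usual ceiling division: the smallest number of blocks of size blksz covering all n+2 indices per dimension:

```c
#include <assert.h>

/* Ceiling division for the grid size: smallest number of blocks of size
   blksz whose threads cover n+2 indices in one dimension. */
int grdsz(int n, int blksz) { return (n + 2 + blksz - 1) / blksz; }
```

The product grdsz*blksz is always at least n+2, which is why the guard ''i1 <= n1 && i2 <= n2'' in ''update_d'' is needed for the surplus threads.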

The complete files main_diff.c and diff_single_gpu.cu are available for download.

===== Parallel Diffusion with Multiple GPUs =====

The obvious way to simulate diffusion on several devices is to distribute the grid over the devices, e.g. in a row-wise manner as shown in the following picture.

{{: }}

In order to calculate the updated values for its own rows, each device not only needs the old values in these rows but also the old values of the boundary rows, which belong to the neighbouring devices. Therefore, after each update step in the diffusion function the devices have to exchange their first and last rows with the neighbouring devices. Because each device memory can communicate only with the host memory, the data exchange between devices proceeds in three steps:
  - each device copies the first and last row of newly calculated values to separate places in host memory
  - in host memory, the rows coming from neighbouring devices are exchanged in an appropriate way
  - each device copies the needed boundary rows from host memory

The main program and the other functions with c code will be collected in one c-file. The main program has the following form:

<code c>
int main (int argc, char **argv)
{
   int n, nt, n_gpus;
   float *u_h;

   /* ------------ input of run parameters ---------------- */
   ...

   /* ------------ allocate global array ------- */
   int size_gl = (n+2)*(n+2)*sizeof(float);
   u_h = (float *)malloc( size_gl );

   /* ------------ initialize initial and boundary values ------- */
   ...

   /* ------------ allocate boundary arrays ------- */
   int size_h = (n+2)*n_gpus*sizeof(float);
   cpf = malloc( size_h ); cpl = malloc( size_h );

   /* --------- set up partitioning according to the number of gpus -- */
   int nrct= n/n_gpus; int nrpl = n -nrct*n_gpus;
   omp_set_num_threads(n_gpus);
   #pragma omp parallel
   { int tid = omp_get_thread_num();
     // n_rows: number of rows to be updated by this thread
     int n_rows = nrct; if (tid < nrpl) n_rows = nrct+1;
     // offs: offset of local array
     int offs = tid*(nrct+1); if (tid >= nrpl) offs = tid*nrct + nrpl;
     offs = offs*(n+2);
     /* --------- iterate -------------- */
     diffusion(tid, n_gpus, n_rows, n, nt, u_h+offs);
   }

   /* ------------ output of results ------------------------------ */
   ...
}
</code>

In addition to the single gpu case, the main program allocates the two boundary arrays ''cpf'' and ''cpl'', each of size ''(n+2)*n_gpus'', which serve as buffers in host memory for the exchange of boundary rows. Then ''n_gpus'' OpenMP threads are started. In each thread the number of rows of its partition and the offset of this partition within the global array are determined, before each thread calls the function ''diffusion'' with its local values as additional parameters.

''cpf'' and ''cpl'' are global pointers to the buffers for the first and last rows of the partitions. The exchange of rows in host memory is performed by the function ''exchange'', which will be invoked from the ''diffusion'' function running in every thread:
<code c>
float *cpf, *cpl;

void exchange(int tid, int n_gpus, int n, float *linef, float *linel)
{
   int sil = n*sizeof(float);
   if (n_gpus>1)
   {
      memcpy(&cpf[tid*n],linef,sil); memcpy(&cpl[tid*n],linel,sil);
      #pragma omp barrier
      if (tid ==0)
      { memcpy(linel,&cpf[(tid+1)*n],sil); }
      else if (tid==n_gpus-1)
      { memcpy(linef,&cpl[(tid-1)*n],sil); }
      else
      { memcpy(linel,&cpf[(tid+1)*n],sil);
        memcpy(linef,&cpl[(tid-1)*n],sil); }
      #pragma omp barrier
   }
}
</code>
Of course, the gpu working on the first partition of the array only exchanges a row with its lower neighbour, and the gpu working on the last partition only exchanges a row with its upper neighbour. The ''omp barrier'' directives ensure that all threads have finished copying their rows to the buffers before any thread reads from them, such that the new values of the first and last rows are available for the exchange.
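
Because all threads share ''cpf'' and ''cpl'', the same data movement can be simulated sequentially on the host. In the following sketch (no OpenMP involved) the two phases separated by the barrier are simply two loops:

```c
#include <assert.h>
#include <string.h>

/* Sequential simulation of the row exchange through the shared buffers:
   after all partitions have deposited their first rows in cpf and their
   last rows in cpl, each partition picks up the last row of its upper
   neighbour and the first row of its lower neighbour. linef/linel hold
   n_gpus consecutive rows of length n, one per partition. */
void exchange_all(int n_gpus, int n, float *cpf, float *cpl,
                  float *linef, float *linel) {
    for (int tid = 0; tid < n_gpus; tid++) {  /* deposit phase */
        memcpy(cpf + tid*n, linef + tid*n, n*sizeof(float));
        memcpy(cpl + tid*n, linel + tid*n, n*sizeof(float));
    }
    for (int tid = 0; tid < n_gpus; tid++) {  /* pick-up phase */
        if (tid > 0)        memcpy(linef + tid*n, cpl + (tid-1)*n, n*sizeof(float));
        if (tid < n_gpus-1) memcpy(linel + tid*n, cpf + (tid+1)*n, n*sizeof(float));
    }
}
```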

The ''diffusion'' function for the multi gpu case, contained in the cu-file, has the following form:

<code c>
extern "C" void diffusion(int tid, int n_gpus, int n1, int n2, int nt, float *u_h)
{
   int il, size = (n1+2)*(n2+2)*sizeof(float), sil = n2*sizeof(float);
   float *linef; linef = (float *)malloc(sil);
   float *linel; linel = (float *)malloc(sil);
   cudaSetDevice(tid);
   float *u; cudaMalloc((void **) &u, size);
   float *v; cudaMalloc((void **) &v, size);
   cudaMemcpy(u, u_h, size, cudaMemcpyHostToDevice);
   cudaMemcpy(v, u_h, size, cudaMemcpyHostToDevice);
   int blksz_x = 1024, blksz_y = 1;
   int grdsz_x = (n2+2+blksz_x-1)/blksz_x, grdsz_y = (n1+2+blksz_y-1)/blksz_y;
   dim3 blks(blksz_x,blksz_y), grds(grdsz_x,grdsz_y);
   for ( il = 1; il<=nt; il++) {
      update_d<<<grds,blks>>>(n1, n2, u, v);
      border(tid, n_gpus, n1, n2, v, linef, linel);
      update_d<<<grds,blks>>>(n1, n2, v, u);
      border(tid, n_gpus, n1, n2, u, linef, linel);
   }
   cudaMemcpy(u_h, u, size, cudaMemcpyDeviceToHost);
   cudaFree(u); cudaFree(v);
   free(linef); free(linel);
}
</code>
The only differences to the analogous function for the single gpu case are the additional parameters ''tid'' and ''n_gpus'', which are needed to differentiate the work on the different gpus, and the invocation of the function ''border'' after every update step. This function is responsible for the exchange of the boundary rows between the gpus and has the following form:

<code c>
extern "C" void border(int tid, int n_gpus, int n1, int n2, float *v, float *linef, float *linel) {
   int sil = n2*sizeof(float);
   cudaMemcpy(linef,&v[(n2+2)+1],sil,cudaMemcpyDeviceToHost);
   cudaMemcpy(linel,&v[n1*(n2+2)+1],sil,cudaMemcpyDeviceToHost);
   exchange(tid,n_gpus,n2,linef,linel);
   if (n_gpus>1)
   {
      if (tid ==0)
      { cudaMemcpy(&v[(n1+1)*(n2+2)+1],linel,sil,cudaMemcpyHostToDevice); }
      else if (tid==n_gpus-1)
      { cudaMemcpy(&v[1],linef,sil,cudaMemcpyHostToDevice); }
      else
      { cudaMemcpy(&v[(n1+1)*(n2+2)+1],linel,sil,cudaMemcpyHostToDevice);
        cudaMemcpy(&v[1],linef,sil,cudaMemcpyHostToDevice); }
   }
}
</code>

The full code for the multi-gpu simulation of diffusion is available for download.
wiki/hpc/using_gpus.txt · Last modified: 2019/02/08 16:02 by 127.0.0.1