====== PyTorch on the HPC Clusters ======
[[https://pytorch.org|PyTorch]] is an open source Machine Learning (ML) framework based on the Torch library. The goal of this document is to help HPC users deploy PyTorch for distributed training on the HPC clusters.

By using the //torch.distributed// package, a PyTorch script can be parallelized over multiple processes and GPUs, on a single node or across several nodes.

In this document, we focus on NCCL and MPI, as they are the only backends that support GPU communication.

=== Contents ===
  - Distributed training with PyTorch
  - Contents of the Pytorch_DL_HPC container
  - Getting started
  - MPI backend for distributed training
  - NCCL backend for distributed training

=== 1- Distributed training with PyTorch ===
The distributed package of PyTorch uses three different backends (MPI, NCCL, and Gloo) for communication between processes. By default, NCCL and Gloo are built and included in the PyTorch framework; the MPI backend can only be added by building PyTorch from source. Moreover, to use the Message Passing Interface (MPI) as a backend for direct GPU communication, PyTorch must be built against a CUDA-aware MPI implementation.

Since Gloo does not support distributed GPU training and is only suitable for an Ethernet interconnect, we do not consider it here. In this tutorial, we explain how to run a simple distributed PyTorch example with the NCCL and MPI backends, which both support GPU communication.

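As a rough, torch-free illustration of what the all-reduce operation used later in this tutorial computes, the following plain-Python sketch (a hypothetical helper, not part of PyTorch) sums one value per rank and hands the result back to every rank:

```python
# Hypothetical plain-Python sketch of all_reduce with the SUM operation:
# every rank contributes a value, and every rank receives the global sum.
def all_reduce_sum(per_rank_values):
    total = sum(per_rank_values)           # reduction step
    return [total] * len(per_rank_values)  # "broadcast" step: every rank gets the sum

# Four ranks (as in the job scripts below), each holding 1.0,
# mirroring torch.ones(1) in the example scripts:
print(all_reduce_sum([1.0, 1.0, 1.0, 1.0]))  # [4.0, 4.0, 4.0, 4.0]
```

In the real scripts, NCCL or MPI performs this reduction over the network between GPUs; the arithmetic outcome is the same.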
It is recommended to use the image of the prepared container to run the distributed PyTorch script. All of the required packages are already installed in this container, so you can deploy your code with either the MPI or the NCCL backend.
=== 2- Contents of the Pytorch_DL_HPC container ===
The image of the Pytorch_DL_HPC container can be downloaded from Singularity Hub.
This container includes the following modules:

  - [[http://
  - [[https://
  - [[https://
  - [[https://
  - [[https://
  - [[https://
  - [[https://
  - [[https://
  - [[https://
  - [[https://
  - [[https://

=== 3- Getting Started ===

Pull the image of the prepared container to your HOME directory:

<code>
module load singularity
singularity pull --name Pytorch_DL_HPC.sif shub://
</code>

Run the following commands the first time, to prepare the environment:

<code>
wget https://
tar -xzf anaconda3.tar.gz -C $HOME/
</code>

=== 4- MPI backend for distributed training (outdated; please use the NCCL backend for synchronization, see Section 5) ===
After executing the aforementioned commands, the environment is ready to deploy your Python script with the following job script:

<file bash Job_script_MPI.sh>
#!/bin/bash
#SBATCH -t 02:00:00
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --gpus-per-node=gtx1080:2
export OMP_NUM_THREADS=4

module load singularity
module load gcc/9.2.0
module load cuda10.1/
module load openmpi/

mpirun --mca btl vader,self singularity exec --nv Pytorch_DL_HPC.sif python -u DistributedLearning_MPI.py
</file>
[[https://

The PyTorch code using the MPI backend should have a structure looking something like:

<file python DistributedLearning_MPI.py>
import os
import socket
import torch
import torch.distributed as dist

def run(rank, size, hostname, gpu, ngpus_per_node):
    print(f"Rank {rank} of {size} on {hostname} is using GPU {gpu}")
    group = dist.new_group(list(range(size)))
    tensor = torch.ones(1).cuda()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    print('Rank ', rank, ' has data ', tensor[0])

def init_processes(rank, size, hostname, gpu, ngpus_per_node, fn, backend='mpi'):
    """Initialize the distributed environment."""
    torch.cuda.set_device(gpu)
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size, hostname, gpu, ngpus_per_node)

if __name__ == "__main__":
    # Open MPI exposes the process layout through these environment variables
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    gpu = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
    ngpus_per_node = torch.cuda.device_count()
    hostname = socket.gethostname()
    init_processes(world_rank, world_size, hostname, gpu, ngpus_per_node, run, backend='mpi')
</file>

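When the script above is launched with mpirun, each process learns its place in the job from Open MPI's environment variables. The following stdlib-only sketch (the `mpi_layout` helper is ours, for illustration) shows how those variables are read, with single-process fallbacks so it can also be dry-run without mpirun:

```python
import os

def mpi_layout(env=None):
    """Read the Open MPI layout variables, defaulting to a single process."""
    env = os.environ if env is None else env
    size = int(env.get('OMPI_COMM_WORLD_SIZE', 1))              # total number of ranks
    rank = int(env.get('OMPI_COMM_WORLD_RANK', 0))              # this process's global rank
    local_rank = int(env.get('OMPI_COMM_WORLD_LOCAL_RANK', 0))  # rank within this node
    return size, rank, local_rank

print(mpi_layout({}))  # (1, 0, 0) -- dry run without mpirun
print(mpi_layout({'OMPI_COMM_WORLD_SIZE': '4',
                  'OMPI_COMM_WORLD_RANK': '3',
                  'OMPI_COMM_WORLD_LOCAL_RANK': '1'}))  # (4, 3, 1)
```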
The job script is submitted with the following command:

<code>
sbatch Job_script_MPI.sh
</code>

=== 5- NCCL backend for distributed training ===
Running a distributed learning application with the NCCL backend needs a job script looking something like:

<file bash Job_script_NCCL.sh>
#!/bin/bash
#SBATCH -t 02:00:00
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --gpus-per-node=gtx1080:2

module load singularity

srun singularity exec --nv Pytorch_DL_HPC.sif python -u DistributedLearning_NCCL.py
</file>

The PyTorch code using the NCCL backend should have a structure that looks something like the following script.

<file python DistributedLearning_NCCL.py>
import os
import socket
import torch
import torch.distributed as dist

def run(rank, size, hostname, gpu, ngpus_per_node):
    print(f"Rank {rank} of {size} on {hostname} is using GPU {gpu}")
    group = dist.new_group(list(range(size)))
    tensor = torch.ones(1).cuda()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    print('Rank ', rank, ' has data ', tensor[0])

def init_processes(Myrank, size, hostname, ngpus_per_node, fn, backend='nccl'):
    """Initialize the distributed environment."""
    # The rendezvous address must point at the node running rank 0;
    # adjust the address and port for your cluster.
    os.environ['MASTER_ADDR'] = os.environ.get('SLURM_LAUNCH_NODE_IPADDR', '127.0.0.1')
    os.environ['MASTER_PORT'] = '29500'  # any free port, identical on all ranks
    dist.init_process_group(backend, rank=Myrank, world_size=size)

    print("Initialized rank", dist.get_rank())
    hostname = socket.gethostname()
    ip_address = socket.gethostbyname(hostname)
    print(ip_address)
    # With two GPUs per node, even global ranks use GPU 0 and odd ranks use GPU 1
    if dist.get_rank() % 2 == 0:
        torch.cuda.set_device(0)
        gpu = 0
    else:
        torch.cuda.set_device(1)
        gpu = 1
    fn(Myrank, size, hostname, gpu, ngpus_per_node)

if __name__ == "__main__":
    # srun exposes the task layout through these SLURM environment variables
    world_size = int(os.environ['SLURM_NTASKS'])
    world_rank = int(os.environ['SLURM_PROCID'])
    ngpus_per_node = torch.cuda.device_count()
    hostname = socket.gethostname()
    init_processes(world_rank, world_size, hostname, ngpus_per_node, run, backend='nccl')
</file>

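The rank-to-GPU mapping in the script above can be checked without any GPUs. A small illustration (the helper name is ours, not PyTorch's) of the even/odd assignment, which is equivalent to `rank % ngpus_per_node` for two GPUs per node:

```python
def gpu_for_rank(rank, ngpus_per_node=2):
    """Map a global rank to a local GPU index, as in DistributedLearning_NCCL.py."""
    return rank % ngpus_per_node

# Four tasks over two nodes with two GPUs each:
for rank in range(4):
    print(f"rank {rank} -> GPU {gpu_for_rank(rank)}")
# rank 0 -> GPU 0, rank 1 -> GPU 1, rank 2 -> GPU 0, rank 3 -> GPU 1
```

This works here because SLURM assigns two tasks per node, so the even/odd pattern lines up with the node-local GPUs; with a different GPU count per node, adjust `ngpus_per_node` accordingly.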
The job script is submitted with the following command:

<code>
sbatch Job_script_NCCL.sh
</code>

wiki/hpc/pytorch_on_the_hpc_clusters.txt · Last modified: 2021/08/06 18:33 by 127.0.0.1