====== PyTorch on the HPC Clusters ======
[[https://pytorch.org|Pytorch]]
 is an open source Machine Learning (ML) framework based on the [[https://research.fb.com/downloads/torch/|Torch]] library that is in essence, developed by Facebook's AI Research lab ([[https://ai.facebook.com/#notable-papers|FAIR]]).

The goal of this document is to help HPC users to deploy the [[https://pytorch.org/docs/stable/distributed.html|distributed package]] which is included in pytorch. This package helps researchers to parallelize their computations across processes and multiple nodes.

By using //torch.distributed// package, you can leverage messaging passing semantics for communication between processes. 

In this document, we focus on the NCCL and MPI as the only backends that support GPU communication.

=== Contents ===
  - Distributed training with pytorch
  - Contents of the Pytorch_DL_HPC container
  - Getting started
  - MPI backend for distributed training
  - NCCL backend for distributed training

=== 1- Distributed training with pytorch ===
Distributed Package of Pytorch uses three different backends (MPI, NCCL, and Gloo) for communication between processes. By default, NCCL and Gloo are built and included in the Pytorch framework. In addition, MPI backend could be added if pytorch is built from source. Moreover, to use Message Passing Interface ( MPI ) as a backend for GPU direct communication, an MPI Cuda-Aware library (like Open MPI) must be installed.

Since Gloo does not support distributed GPU training and its only suitable for Ethernet interconnect network, we do not consider this method. In this tutorial, we will explain how to run a simple distributed example of Pytorch by NCCL and MPI backends which support  inter and intra node GPU communication over InfiniBand.

It is recommended to use the image of the prepared container to run the distributed Pytorch script. All of the required packages were already installed on this container to deploy your code by MPI or NCCL backends.
=== 2- Contents of the Pytorch_DL_HPC container ===
The image of Pytorch_DL_HPC container can be downloaded from the singularity HUB. 
This container includes the following modules:

-[[http://releases.ubuntu.com/16.04/|ubuntu16.04]]

-[[https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#major-components|NVIDIA CUDA 10.1]] including [[https://docs.nvidia.com/cuda/cublas/index.html|cuBLAS]]

-[[https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/#rel_750|NVIDIA cuDNN 7.6.5]]

-[[https://developer.nvidia.com/nccl|NVIDIA NCCL 2.4.0]]

-[[https://www.open-mpi.org/software/ompi/v4.0/|OpenMPI 4.0.3]]

-[[https://www.mellanox.com|MLNX_OFED 4.2-1.0.0.0]]


-[[https://www.anaconda.com|Anaconda]]


-[[https://pytorch.org|Pytorch 5.1]]


-[[https://pytorch.org/docs/stable/torchvision/index.html|torchvision]]

-[[https://github.com/nvidia/apex|APEX]]

-[[https://jupyter-notebook.readthedocs.io/en/stable/index.html|Jupyter]] and [[https://github.com/jupyterlab/jupyterlab|JupyterLab]]

== 3- Getting Started ==

Pull the Image of the prepared container to your HOME directory:

<code>
module load singularity
singularity pull --name Pytorch_DL_HPC.sif shub://masoudrezai/Singularity:13

</code>
 
Running the following commands for the first time to prepare the environment. 
<code>
wget https://owncloud.gwdg.de/index.php/s/X2Qr4lkhJig9IBn/download -O anaconda3.tar.gz
tar -xzf anaconda3.tar.gz -C $HOME/anaconda_env_pytorch_apex_1

</code>

 **Note:** It results in the generation of a directory (anaconda_env_pytorch_apex_{V}) in your $HOME path which should not be changed for the next run. 

=== 4- MPI backend for distributed training (It is outdated. Please using NCCL backend for synchronization (Section 5))===
After executing the before mentioned command the environment is ready to deploy your python script with the following job script:

<file bash Job_script_MPI.sh>

#!/bin/bash
#SBATCH -t 02:00:00
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --gpus-per-node=gtx1080:2
export OMP_NUM_THREADS=4

module load singularity
module load gcc/9.2.0
module load cuda10.1/toolkit/10.1.105
module load openmpi/gcc/64/4.0.3_cuda-10.1

mpirun --mca btl vader,self,tcp -mca mca_btl_openib 1 -mca orte_base_help_aggregate 0 singularity exec --nv Pytorch_DL_HPC.sif python -u DistributedLearning_MPI.py

</file>
[[https://docs.oracle.com/cd/E19708-01/821-1319-10/mca-params.html|This link]] gives you all the required information about MCA parameters.

The pytorch code by MPI backend should have a structure looking something like:


<file python DistributedLearning_MPI.py>
import os
import socket
import torch
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size, hostname, gpu, ngpus_per_node):
    print(f"I am {rank} of {size} in {hostname}")
    group = dist.new_group([0, 1,2,3])
    tensor = torch.ones(1).cuda()
    dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
    print('Rank ', rank, ' has data ', tensor[0])

def init_processes(rank, size, gpu,ngpus_per_node, hostname, fn, backend='mpi'):
    """ Initialize the distributed environment. """
    torch.cuda.set_device(gpu)
    dist.init_process_group(backend, rank=rank, world_size=size)
    
    fn(rank, size, hostname, gpu, ngpus_per_node)


if __name__ == "__main__":
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    gpu= int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
    ngpus_per_node=torch.cuda.device_count()
    hostname = socket.gethostname()
    init_processes(world_rank, world_size, gpu, ngpus_per_node, hostname, run, backend='mpi')


</file>

Job script is submitted by the following command:

<code>
sbatch <jobscript>.sh
</code>

=== 5- NCCL backend for distributed training ===
Running a distributed learning application by NCCL backend needs a job script looking something like:

<file bash Job_script_NCCL.sh>

#!/bin/bash
#SBATCH -t 02:00:00
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --gpus-per-node=gtx1080:2

module load singularity

srun singularity exec --nv Pytorch_DL_HPC.sif python -u DistributedLearning_NCCL.py

</file>

The pytorch code by NCCL backend should have a structure looks something like following script.


<file python DistributedLearning_NCCL.py>
import os
import socket
import torch
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size, hostname, gpu, ngpus_per_node):
    print(f"I am {rank} of {size} in {hostname}")
    group = dist.new_group([0, 1,2,3])
    tensor = torch.ones(1).cuda()
    dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
    print('Rank ', rank, ' has data ', tensor[0])
    
def init_processes(Myrank, size,ngpus_per_node, hostname, fn, backend='nccl'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = os.environ['SLURM_LAUNCH_NODE_IPADDR']
    os.environ['MASTER_PORT'] = '8933'
    dist.init_process_group("nccl",init_method='env://', rank=Myrank, world_size=size)

    print("Initialized Rank:", dist.get_rank())
    hostname = socket.gethostname()
    ip_address = socket.gethostbyname(hostname)
    print(ip_address)
    if dist.get_rank()%2==0:
        torch.cuda.set_device(0)
        gpu=0
    else:
	torch.cuda.set_device(1)
        gpu=1
    fn(Myrank, size, hostname, gpu, ngpus_per_node)


if __name__ == "__main__":
    world_size = int(os.environ['SLURM_NPROCS'])
    world_rank = int(os.environ['SLURM_PROCID'])
    ngpus_per_node=torch.cuda.device_count()
    hostname = socket.gethostname()
    init_processes(world_rank, world_size,ngpus_per_node, hostname, run, backend='nccl')
</file>

Job script is submitted by the following command:

<code>
sbatch <jobscript>.sh
</code>


 --- //[[masoud.rezai@gwdg.de|mrezai]] 2020/06/15 16:05//