PyTorch on the HPC Clusters

Pytorch is an open source Machine Learning (ML) framework based on the Torch library that is in essence, developed by Facebook's AI Research lab (FAIR).

The goal of this document is to help HPC users to deploy the distributed package which is included in pytorch. This package helps researchers to parallelize their computations across processes and multiple nodes.

By using torch.distributed package, you can leverage messaging passing semantics for communication between processes.

In this document, we focus on the NCCL and MPI as the only backends that support GPU communication.

1- Distributed training with pytorch

Distributed Package of Pytorch uses three different backends (MPI, NCCL, and Gloo) for communication between processes. By default, NCCL and Gloo are built and included in the Pytorch framework. In addition, MPI backend could be added if pytorch is built from source. Moreover, to use Message Passing Interface ( MPI ) as a backend for GPU direct communication, an MPI Cuda-Aware library (like Open MPI) must be installed.

Since Gloo does not support distributed GPU training and its only suitable for Ethernet interconnect network, we do not consider this method. In this tutorial, we will explain how to run a simple distributed example of Pytorch by NCCL and MPI backends which support inter and intra node GPU communication over InfiniBand.

It is recommended to use the image of the prepared container to run the distributed Pytorch script. All of the required packages were already installed on this container to deploy your code by MPI or NCCL backends.

2- Contents of the Pytorch_DL_HPC container

The image of Pytorch_DL_HPC container can be downloaded from the singularity HUB. This container includes the following modules:

-ubuntu16.04

-NVIDIA CUDA 10.1 including cuBLAS

-NVIDIA cuDNN 7.6.5

-NVIDIA NCCL 2.4.0

-OpenMPI 4.0.3

-MLNX_OFED 4.2-1.0.0.0

-Jupyter and JupyterLab

3- Getting Started

Pull the Image of the prepared container to your HOME directory:

module load singularity
singularity pull --name Pytorch_DL_HPC.sif shub://masoudrezai/Singularity:13

Running the following commands for the first time to prepare the environment.

wget https://owncloud.gwdg.de/index.php/s/X2Qr4lkhJig9IBn/download -O anaconda3.tar.gz
tar -xzf anaconda3.tar.gz -C $HOME/anaconda_env_pytorch_apex_1

Note: It results in the generation of a directory (anaconda_env_pytorch_apex_{V}) in your $HOME path which should not be changed for the next run.

4- MPI backend for distributed training (It is outdated. Please using NCCL backend for synchronization (Section 5))

After executing the before mentioned command the environment is ready to deploy your python script with the following job script:

Job_script_MPI.sh

#!/bin/bash
#SBATCH -t 02:00:00
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --gpus-per-node=gtx1080:2
export OMP_NUM_THREADS=4
 
module load singularity
module load gcc/9.2.0
module load cuda10.1/toolkit/10.1.105
module load openmpi/gcc/64/4.0.3_cuda-10.1
 
mpirun --mca btl vader,self,tcp -mca mca_btl_openib 1 -mca orte_base_help_aggregate 0 singularity exec --nv Pytorch_DL_HPC.sif python -u DistributedLearning_MPI.py

This link gives you all the required information about MCA parameters.

The pytorch code by MPI backend should have a structure looking something like:

DistributedLearning_MPI.py

import os
import socket
import torch
import torch.distributed as dist
from torch.multiprocessing import Process
 
def run(rank, size, hostname, gpu, ngpus_per_node):
    print(f"I am {rank} of {size} in {hostname}")
    group = dist.new_group([0, 1,2,3])
    tensor = torch.ones(1).cuda()
    dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
    print('Rank ', rank, ' has data ', tensor[0])
 
def init_processes(rank, size, gpu,ngpus_per_node, hostname, fn, backend='mpi'):
    """ Initialize the distributed environment. """
    torch.cuda.set_device(gpu)
    dist.init_process_group(backend, rank=rank, world_size=size)
 
    fn(rank, size, hostname, gpu, ngpus_per_node)
 
 
if __name__ == "__main__":
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    gpu= int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
    ngpus_per_node=torch.cuda.device_count()
    hostname = socket.gethostname()
    init_processes(world_rank, world_size, gpu, ngpus_per_node, hostname, run, backend='mpi')

Job script is submitted by the following command:

sbatch <jobscript>.sh

5- NCCL backend for distributed training

Running a distributed learning application by NCCL backend needs a job script looking something like:

Job_script_NCCL.sh

#!/bin/bash
#SBATCH -t 02:00:00
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --gpus-per-node=gtx1080:2
 
module load singularity
 
srun singularity exec --nv Pytorch_DL_HPC.sif python -u DistributedLearning_NCCL.py

The pytorch code by NCCL backend should have a structure looks something like following script.

DistributedLearning_NCCL.py

import os
import socket
import torch
import torch.distributed as dist
from torch.multiprocessing import Process
 
def run(rank, size, hostname, gpu, ngpus_per_node):
    print(f"I am {rank} of {size} in {hostname}")
    group = dist.new_group([0, 1,2,3])
    tensor = torch.ones(1).cuda()
    dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
    print('Rank ', rank, ' has data ', tensor[0])
 
def init_processes(Myrank, size,ngpus_per_node, hostname, fn, backend='nccl'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = os.environ['SLURM_LAUNCH_NODE_IPADDR']
    os.environ['MASTER_PORT'] = '8933'
    dist.init_process_group("nccl",init_method='env://', rank=Myrank, world_size=size)
 
    print("Initialized Rank:", dist.get_rank())
    hostname = socket.gethostname()
    ip_address = socket.gethostbyname(hostname)
    print(ip_address)
    if dist.get_rank()%2==0:
        torch.cuda.set_device(0)
        gpu=0
    else:
	torch.cuda.set_device(1)
        gpu=1
    fn(Myrank, size, hostname, gpu, ngpus_per_node)
 
 
if __name__ == "__main__":
    world_size = int(os.environ['SLURM_NPROCS'])
    world_rank = int(os.environ['SLURM_PROCID'])
    ngpus_per_node=torch.cuda.device_count()
    hostname = socket.gethostname()
    init_processes(world_rank, world_size,ngpus_per_node, hostname, run, backend='nccl')

Job script is submitted by the following command:

sbatch <jobscript>.sh

— mrezai 2020/06/15 16:05