Using AlphaFold at GWDG

An AI-based solution to one of the grand challenges in modern biology

"AlphaFold" is a name that may or may not sound familiar to you. It is an open-source deep-learning application for predicting protein structures, celebrated as a breakthrough on one of the grand challenges in biology.

Proteins are the workhorses of the cell. They are involved in every biological process in every living organism. Knowing how a protein folds into its three-dimensional structure is fundamental to understanding its exact function. Proteins are built from single amino acids. The sequence of these amino acids defines the protein's structure, which in turn determines its function. Yet, due to the sheer number of possible configurations, knowing a protein's amino acid sequence is not enough to work out its shape. Protein folding therefore remained an open research question for a long time, hampering progress on numerous biological questions.

With AlphaFold, an AI-based solution to protein folding has been developed. It predicts a protein's three-dimensional structure based on its amino acid sequence and a generic protein sequence database. Being low-effort, fast and of comparable accuracy, AlphaFold soon became a valuable addition to the experimental determination of protein structures. It is now an essential tool for scientists around the world and is easily accessible to GWDG users as well. In the following section, we give you a brief introduction to its usage at GWDG.

More detailed information is available here: https://gitlab-ce.gwdg.de/hpc-team-public/science-domains-blog/-/blob/main/20230511_alphafold.md

AlphaFold at GWDG

GWDG offers AlphaFold as a service to its users, meaning there is both a ready-to-use software stack and community support for end users. AlphaFold has been packaged into a Singularity container, ready to run on the HPC system. In addition, albeit obsolete, there is also a conda environment for AlphaFold on the Scientific Compute Cluster (SCC). The full installation requires a generic protein sequence database of more than 2 TB and modern NVIDIA GPUs, both of which are available in our installation. On the SCC, the following GPUs are available: v100, gtx1080 and rtx5000. Any of these can be specified in the sbatch configuration, although we recommend using the newest GPU, v100, for all AlphaFold computations. The protein database and the Singularity container are both stored in an AlphaFold project directory, access to which is available upon request. An example sbatch script is stored in this directory as well and is reproduced here:

Example sbatch script

alphafold_batch.sh

#!/bin/bash
#SBATCH --job-name=alphafold
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --time=2-00:00:00
#SBATCH --mem=100GB
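#SBATCH -G v100:2

# Replace both placeholders with your own input and output paths.
FASTA_PATHS="/your/input/path/here"
OUTPUT_PATH="/your/output/path/here"

# Quote the variables so that paths containing spaces are passed intact.
/scratch/projects/alphafold/scripts/run_alphafold.sh --fasta_paths "$FASTA_PATHS" --output_dir "$OUTPUT_PATH"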

Submit the script as follows:

$ sbatch alphafold_batch.sh
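
Since a job may wait in the queue for a while before it starts, it can be worth checking the input file before submitting. The following is a minimal sketch of such a check, not part of the AlphaFold installation at GWDG: the function names check_fasta and count_fasta_records are hypothetical, and the only assumption about the input is the standard FASTA convention that every record begins with a header line starting with ">".

```shell
#!/bin/bash
# Sketch only: these helpers are hypothetical and not shipped with
# the AlphaFold installation at GWDG.

# Print the number of FASTA records, i.e. lines starting with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Return 0 if the file exists, is non-empty, and its first non-blank
# line is a FASTA header; otherwise print a message and return 1.
check_fasta() {
    local fasta="$1"
    if [ ! -s "$fasta" ]; then
        echo "Input file missing or empty: $fasta" >&2
        return 1
    fi
    local first
    first="$(grep -m 1 -v '^[[:space:]]*$' "$fasta")"
    case "$first" in
        '>'*)
            return 0
            ;;
        *)
            echo "First line of $fasta is not a FASTA header" >&2
            return 1
            ;;
    esac
}

# Possible use before submitting (path is a placeholder):
#   check_fasta /your/input/path/here && sbatch alphafold_batch.sh
```

The check is deliberately shallow: it does not validate the amino acid alphabet or the sequence length, it only catches a missing, empty or obviously malformed file before the job is queued.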

In terms of community support, GWDG provides a platform for community building. There is an AlphaFold user group on RocketChat (https://chat.gwdg.de/channel/alphafold) where users can discuss anything related to AlphaFold at GWDG, including issues. On top of that, GWDG offers its AlphaFold users direct help with resolving issues. To further facilitate exchange among users and between users and administrators, AlphaFold was the focus topic of a GöHPCoffee session in December 2022.

To use AlphaFold on the SCC, simply issue a support request (https://www.gwdg.de/support), asking to be added to the AlphaFold user group. This will give you access to the aforementioned project directory. The SCC is available to all researchers of the Max Planck Society and the University of Göttingen. You might also consider applying for access to Emmy (https://www.hlrn.de/doc/display/PUB/Application+Process). Because AlphaFold relies on GPUs, its computations are a natural fit for the new GPU-based HPC cluster "Grete", which offers NVIDIA A100 GPUs.

Author

Stefanie Mühlhausen