{"id":23632,"date":"2024-02-01T09:22:17","date_gmt":"2024-02-01T08:22:17","guid":{"rendered":"https:\/\/info.gwdg.de\/news\/?p=23632"},"modified":"2024-02-01T09:22:17","modified_gmt":"2024-02-01T08:22:17","slug":"best-practices-for-machine-learning-with-hpc","status":"publish","type":"post","link":"https:\/\/info.gwdg.de\/news\/best-practices-for-machine-learning-with-hpc\/","title":{"rendered":"Best Practices for Machine Learning with HPC"},"content":{"rendered":"<p>The GWDG offers data scientists various services and training courses to support them in their work throughout their entire workflow. As the success of a machine learning project often depends on the available computing resources, the working group &#8222;Computing&#8220; operates various HPC systems with suitable accelerators to train deep learning models. This article explains model training using the highly energy-efficient HPC cluster &#8222;Grete&#8220;. The example is an excerpt from the course &#8222;Deep learning with GPU cores,&#8220; offered every six months.<\/p>\n<h2 dir=\"auto\" data-sourcepos=\"11:1-11:15\">Introduction<\/h2>\n<p dir=\"auto\" data-sourcepos=\"13:1-13:684\">The GWDG offers several services and trainings to support researchers in their machine-learning projects<sup class=\"footnote-ref\"><a id=\"fnref-services-9758\" href=\"#fn-services-9758\" data-footnote-ref=\"\">1<\/a><\/sup>. Deep Learning success is heavily compute-bound<sup class=\"footnote-ref\"><a id=\"fnref-ai-compute-9758\" href=\"#fn-ai-compute-9758\" data-footnote-ref=\"\">2<\/a><\/sup><sup class=\"footnote-ref\"><a id=\"fnref-gwdg-academy-9758\" href=\"#fn-gwdg-academy-9758\" data-footnote-ref=\"\">3<\/a><\/sup>; accelerators integrated into our HPC systems can help overcome this bottleneck. 
The GWDG offers several HPC systems<sup class=\"footnote-ref\"><a id=\"fnref-hpc-9758\" href=\"#fn-hpc-9758\" data-footnote-ref=\"\">4<\/a><\/sup> with modern GPU nodes and other accelerators<sup class=\"footnote-ref\"><a id=\"fnref-accelerators-9758\" href=\"#fn-accelerators-9758\" data-footnote-ref=\"\">5<\/a><\/sup> accessible to different user groups<sup class=\"footnote-ref\"><a id=\"fnref-hpc-access-9758\" href=\"#fn-hpc-access-9758\" data-footnote-ref=\"\">6<\/a><\/sup>. If you are wondering whether you have access to our systems, have a look at our science domains blog article from June 2023 <sup><a id=\"fnref-hpc-access-2\" href=\"#fn-hpc-access\" data-footnote-ref=\"\">6<\/a><\/sup>. Additionally, the KISSKI project <sup class=\"footnote-ref\"><a id=\"fnref-kisski-9758\" href=\"#fn-kisski-9758\" data-footnote-ref=\"\">7<\/a><\/sup> is not limited to researchers; small and medium-sized enterprises are also very welcome.<\/p>\n<p dir=\"auto\" data-sourcepos=\"15:1-15:258\">Further, we aim to support different science domains within the scope of the NHR alliance <sup class=\"footnote-ref\"><a id=\"fnref-nhr-9758\" href=\"#fn-nhr-9758\" data-footnote-ref=\"\">8<\/a><\/sup>. 
Here we focus on researchers in the fields of artificial intelligence <sup class=\"footnote-ref\"><a id=\"fnref-sd-ai-9758\" href=\"#fn-sd-ai-9758\" data-footnote-ref=\"\">9<\/a><\/sup>, digital humanities, bioinformatics <sup class=\"footnote-ref\"><a id=\"fnref-sd-bioinfo-9758\" href=\"#fn-sd-bioinfo-9758\" data-footnote-ref=\"\">10<\/a><\/sup>, and forest science <sup class=\"footnote-ref\"><a id=\"fnref-sd-forestry-9758\" href=\"#fn-sd-forestry-9758\" data-footnote-ref=\"\">11<\/a><\/sup>.<\/p>\n<h2 dir=\"auto\" data-sourcepos=\"17:1-17:28\"><a id=\"user-content-machine-learning-workflow\" class=\"anchor\" href=\"#machine-learning-workflow\" aria-hidden=\"true\"><\/a>Machine learning workflow<\/h2>\n<p dir=\"auto\" data-sourcepos=\"19:1-19:322\">A typical machine-learning workflow can be seen below (see figure 1). However, this article will only focus on model development and evaluation best practices. Regardless, the GWDG offers services to support you in every step of this workflow <sup><a id=\"fnref-services-2\" href=\"#fn-services\" data-footnote-ref=\"\">1<\/a><\/sup>. If you need our assistance with anything, feel free to contact us.<\/p>\n<figure><a href=\"https:\/\/info.gwdg.de\/news\/wp-content\/uploads\/2024\/02\/ml-workflow.png\" class=\"external\" rel=\"nofollow\"><img loading=\"lazy\" decoding=\"async\" style=\"width: 100%; height: 100%;\" src=\"https:\/\/info.gwdg.de\/news\/wp-content\/uploads\/2024\/02\/ml-workflow.png\" alt=\"\" width=\"2215\" height=\"1181\" \/><\/a><figcaption>Figure 1: &#8222;A typical machine learning workflow involves data collection, pre-processing, dataset building, model training and refinement, evaluation, deployment, and continuous monitoring and updating. 
This process is iterative and may require revisiting previous steps based on the results at later stages.&#8220;, as explained by a transformer model deployed by Perplexity.<\/p>\n<p data-sourcepos=\"28:3-28:77\">The figure is adapted from <a href=\"https:\/\/fullstackdeeplearning.com\/course\/2022\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/fullstackdeeplearning.com\/course\/2022\/<\/a>.<\/p>\n<\/figcaption><\/figure>\n<h3 dir=\"auto\" data-sourcepos=\"32:1-32:15\"><a id=\"user-content-hpc-cluster\" class=\"anchor\" href=\"#hpc-cluster\" aria-hidden=\"true\"><\/a>HPC Cluster<\/h3>\n<p dir=\"auto\" data-sourcepos=\"34:1-34:419\">A high-performance computing (HPC) cluster consists of several essential components that one must understand in order to use its resources efficiently. Here, we will briefly explain the basic terminology. If you are interested in a more detailed article on this topic, please read our science domain blog article &#8222;What scientists should know to efficiently use the Scientific Computing Cluster&#8220; <sup class=\"footnote-ref\"><a id=\"fnref-cluster-practical-9758\" href=\"#fn-cluster-practical-9758\" data-footnote-ref=\"\">12<\/a><\/sup>.<\/p>\n<p dir=\"auto\" data-sourcepos=\"36:1-36:122\">First, a user can access the login node of the HPC cluster via an SSH (Secure Shell Protocol) connection (1 in figure 2).<\/p>\n<p dir=\"auto\" data-sourcepos=\"38:1-40:475\">In the example workflow below, you can see that <code>glogin9<\/code> is the login node of our GPU system Grete<sup class=\"footnote-ref\"><a id=\"fnref-hlrn-gpu-9758\" href=\"#fn-hlrn-gpu-9758\" data-footnote-ref=\"\">13<\/a><\/sup>. 
On the login node, jobs are submitted with the job scheduler Slurm<sup class=\"footnote-ref\"><a id=\"fnref-slurm-9758\" href=\"#fn-slurm-9758\" data-footnote-ref=\"\">14<\/a><\/sup> (2 in figure 2). A Slurm job specifies both the program you would like to run and the computing resources (hardware) it needs. Compute nodes carry out the actual computation of your jobs, similar to your personal computer. These compute nodes consist of cores and memory (RAM). Cores are processing units, such as CPU or GPU cores, and each node can have a different number of them. Each node also has its own random-access memory (RAM) used for temporary computations. Various storage systems with different characteristics help to handle and organize data efficiently (3 in figure 2). The <code>shared scratch<\/code> file system performs well for computations, but it is not suitable for long-term storage: this type of file system has no backup! Your home folder, however, is backed up, as indicated in the figure by the connection to the tape archive (right center). These are literal tapes that, after being written, are stored as physical media not connected to any electricity. Lastly, connecting S3 buckets to the whole cluster (bottom right) is also possible.<\/p>\n<figure><a href=\"https:\/\/info.gwdg.de\/news\/wp-content\/uploads\/2024\/02\/hpc.png\" class=\"external\" rel=\"nofollow\"><img loading=\"lazy\" decoding=\"async\" style=\"width: 100%; height: 100%;\" src=\"https:\/\/info.gwdg.de\/news\/wp-content\/uploads\/2024\/02\/hpc.png\" alt=\"\" width=\"1920\" height=\"1080\" \/><\/a><figcaption>Figure 2: Schematic visualization of an HPC system, similar to the system Grete. On the left are the user and the front nodes that are used for login. In the center, the compute nodes handle the computation of the submitted jobs. 
These are connected to different storage systems on the right.<\/p>\n<p data-sourcepos=\"49:3-49:94\">The figure was taken from our deep learning with GPU course offered every six months.<sup class=\"footnote-ref\"><a id=\"fnref-dlgpu-9758\" href=\"#fn-dlgpu-9758\" data-footnote-ref=\"\">15<\/a><\/sup><\/p>\n<\/figcaption><\/figure>\n<h2 dir=\"auto\" data-sourcepos=\"53:1-53:9\"><a id=\"user-content-coding\" class=\"anchor\" href=\"#coding\" aria-hidden=\"true\"><\/a>Coding<\/h2>\n<p dir=\"auto\" data-sourcepos=\"55:1-55:358\">In order to use the HPC resources, you will need to write code and develop programs. The programming language depends on the target application. For machine learning and deep learning, Python is often the best choice due to its moderate learning curve and availability of comprehensive packages and frameworks, such as PyTorch, JAX (with Flax) or TensorFlow.<\/p>\n<h3 dir=\"auto\" data-sourcepos=\"57:1-57:24\"><a id=\"user-content-tools-and-frameworks\" class=\"anchor\" href=\"#tools-and-frameworks\" aria-hidden=\"true\"><\/a>Tools and Frameworks<\/h3>\n<p dir=\"auto\" data-sourcepos=\"59:1-59:307\">For development, we recommend VSCode<sup class=\"footnote-ref\"><a id=\"fnref-vscode-9758\" href=\"#fn-vscode-9758\" data-footnote-ref=\"\">16<\/a><\/sup> as it supports many features and programming languages, including Python, through its vast selection of extensions<sup class=\"footnote-ref\"><a id=\"fnref-vsextensions-9758\" href=\"#fn-vsextensions-9758\" data-footnote-ref=\"\">17<\/a><\/sup>. 
VSCode can also be configured to connect to the cluster directly, which makes development simple and straightforward<sup class=\"footnote-ref\"><a id=\"fnref-vscodearticle-9758\" href=\"#fn-vscodearticle-9758\" data-footnote-ref=\"\">18<\/a><\/sup>.<\/p>\n<p dir=\"auto\" data-sourcepos=\"61:1-61:292\">When working on a Python project, it is best to use a virtual environment or container with a tool such as <code>conda<\/code>, <code>singularity<\/code>, or <code>apptainer<\/code><sup class=\"footnote-ref\"><a id=\"fnref-apptainer-9758\" href=\"#fn-apptainer-9758\" data-footnote-ref=\"\">19<\/a><\/sup>, to ensure reproducibility and robustness of the code, as packages are constantly updated and your code may require specific versions.<\/p>\n<p dir=\"auto\" data-sourcepos=\"63:1-63:497\">There are several widely used frameworks for deep learning in Python. The three most common are PyTorch, TensorFlow, and JAX. TensorFlow, developed by Google in 2015, was the first major deep learning framework of the three. PyTorch was developed by Facebook in 2016 and is currently the dominant framework in academia, winning 77% of machine-learning competitions. JAX<sup class=\"footnote-ref\"><a id=\"fnref-JAX-9758\" href=\"#fn-JAX-9758\" data-footnote-ref=\"\">20<\/a><\/sup> is a relatively new framework developed by Google that aims to be a more general automatic differentiation and vectorisation framework.<\/p>\n<p dir=\"auto\" data-sourcepos=\"65:1-65:43\">For most cases, we recommend using PyTorch.<\/p>\n<h3 dir=\"auto\" data-sourcepos=\"68:1-68:35\"><a id=\"user-content-documentation-and-code-handling\" class=\"anchor\" href=\"#documentation-and-code-handling\" aria-hidden=\"true\"><\/a>Documentation and Code Handling<\/h3>\n<p dir=\"auto\" data-sourcepos=\"70:1-70:431\">Code should be well-documented and clean; comments and README files should be included when necessary, and repetitive code should be converted into functions and packages. 
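<\/p>
<p dir=\"auto\">As a toy illustration of the point about repetitive code, the copy-pasted scaling logic below is collected into a single reusable function. This is a minimal, hypothetical sketch; the function name and data are invented for illustration.<\/p>

```python
# Instead of repeating min-max scaling for every dataset split,
# define it once and reuse it (hypothetical helper, not GWDG code).
def normalize(values):
    # Scale a list of numbers to the range [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

train = normalize([2.0, 4.0, 6.0])    # -> [0.0, 0.5, 1.0]
test = normalize([1.0, 3.0])          # -> [0.0, 1.0]
```

<p dir=\"auto\">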
It is also recommended to treat raw data as immutable; build the code around the data instead of editing the raw data itself. Cookiecutter is a tool that can get you started with a decent project structure and has some good recommendations as well<sup class=\"footnote-ref\"><a id=\"fnref-cookiecutter-9758\" href=\"#fn-cookiecutter-9758\" data-footnote-ref=\"\">21<\/a><\/sup>.<\/p>\n<p dir=\"auto\" data-sourcepos=\"72:1-72:343\">It is crucial to use a Version Control System (VCS) such as Git to avoid accidents and disasters when altering code and, in general, to gain better control over the code history. GWDG offers a GitLab server for this purpose. A good example is the deep learning with GPU workshop repository hosted on GWDG&#8217;s own GitLab server<sup><a id=\"fnref-dlgpu-2\" href=\"#fn-dlgpu\" data-footnote-ref=\"\">15<\/a><\/sup>.<\/p>\n<h3 dir=\"auto\" data-sourcepos=\"74:1-74:29\"><a id=\"user-content-job-scheduling-with-slurm\" class=\"anchor\" href=\"#job-scheduling-with-slurm\" aria-hidden=\"true\"><\/a>Job Scheduling with Slurm<\/h3>\n<p dir=\"auto\" data-sourcepos=\"76:1-76:846\">Running computationally intensive programs on the login nodes is forbidden. Complex tasks, e.g. data processing, model training, and testing, must run only on the compute nodes; the login nodes are strictly for low-cost tasks and job submission. You must use our job scheduler Slurm to run jobs on the various nodes in the cluster. When submitting a job via Slurm, you must set the job configuration, e.g. requested partition\/node and GPUs, and you may also add additional settings such as the amount of RAM, time limit, email address to be notified when the job is complete, and a location to store the job&#8217;s log file produced by Slurm. It is recommended to specify a location to store the log file and regularly check it for any issues. 
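<\/p>
<p dir=\"auto\">A batch script combining the settings mentioned above could look roughly like the following. This is a hedged sketch: the partition, module, and script names are placeholders to adapt to your project, and the valid options for our systems are listed in the Slurm documentation.<\/p>

```bash
#!/bin/bash
#SBATCH --job-name=train-model     # name shown in the queue
#SBATCH --partition=grete          # GPU partition (placeholder, check availability)
#SBATCH -G 1                       # request one GPU
#SBATCH --time=02:00:00            # time limit; keep it low for test runs
#SBATCH --output=slurm-%j.out      # log file location, %j is the job ID
#SBATCH --mail-type=END            # e-mail notification when the job finishes

module load anaconda3              # placeholder environment module
python train.py                    # your training script
```

<p dir=\"auto\">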
An overview of the Slurm configuration options can be found in our documentation <sup class=\"footnote-ref\"><a id=\"fnref-slurm-docu-9758\" href=\"#fn-slurm-docu-9758\" data-footnote-ref=\"\">22<\/a><\/sup>.<\/p>\n<p dir=\"auto\" data-sourcepos=\"78:1-78:390\">During model development, you may want to run interactive jobs in the terminal with <code>srun<\/code>, e.g., for testing purposes. In this case, it is better to use nodes in partitions designated for this purpose, such as <code>grete-interactive<\/code>. Finally, you may train the model by writing a Slurm batch script and running it with <code>sbatch<\/code> on a dedicated GPU node, e.g. on partition <code>grete<\/code>. <sup class=\"footnote-ref\"><a id=\"fnref-gpu-usage-9758\" href=\"#fn-gpu-usage-9758\" data-footnote-ref=\"\">23<\/a><\/sup><\/p>\n<p dir=\"auto\" data-sourcepos=\"80:1-80:579\">You must also make sure that the requested node configuration is available and suitable for the job. For example, during development, when a job is more likely to fail, it is better to set a low time limit and run the job on a test node if applicable. Similarly, the number of nodes and resources must be sufficient for your job without being excessive. It is better to split a long job into smaller jobs to avoid time-limit issues and wasted resources. 
To chain consecutive jobs with automated start and easier checkpointing, we recommend Snakemake<sup class=\"footnote-ref\"><a id=\"fnref-snakemake-9758\" href=\"#fn-snakemake-9758\" data-footnote-ref=\"\">24<\/a><\/sup>.<\/p>\n<h2 dir=\"auto\" data-sourcepos=\"82:1-82:13\"><a id=\"user-content-monitoring\" class=\"anchor\" href=\"#monitoring\" aria-hidden=\"true\"><\/a>Monitoring<\/h2>\n<p dir=\"auto\" data-sourcepos=\"84:1-84:155\">While your program is running, it is important to monitor the resource utilization on the nodes to identify bottlenecks, errors, and possible improvements.<\/p>\n<h3 dir=\"auto\" data-sourcepos=\"86:1-86:14\"><a id=\"user-content-cpu-vs-gpu\" class=\"anchor\" href=\"#cpu-vs-gpu\" aria-hidden=\"true\"><\/a>CPU vs GPU<\/h3>\n<p dir=\"auto\" data-sourcepos=\"88:1-88:369\">CPUs and GPUs are both powerful processing hardware that can perform complex calculations. Figure 3 depicts the inner structure of CPUs and GPUs. While CPUs are capable of handling a larger variety of instructions, GPUs are more efficient at executing specific, simpler instructions. It is important to know when using a GPU improves performance and when it does not.<\/p>\n<figure><a href=\"https:\/\/info.gwdg.de\/news\/wp-content\/uploads\/2024\/02\/cpu-gpu.png\" class=\"external\" rel=\"nofollow\"><img loading=\"lazy\" decoding=\"async\" style=\"width: 100%; height: 100%;\" src=\"https:\/\/info.gwdg.de\/news\/wp-content\/uploads\/2024\/02\/cpu-gpu.png\" alt=\"\" width=\"442\" height=\"161\" \/><\/a><figcaption>Figure 3. Internal comparison between a CPU and a GPU. This figure was taken from <a href=\"https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/<\/a><\/p>\n<\/figcaption><\/figure>\n<p dir=\"auto\" data-sourcepos=\"99:1-99:307\">A thread is a series of sequential operations. 
A CPU is capable of processing multiple threads performing different tasks simultaneously. GPUs, on the other hand, are generally suitable for tasks that run the same operation on a large number of elements, i.e., running many similar threads at the same time.<\/p>\n<figure><a href=\"https:\/\/info.gwdg.de\/news\/wp-content\/uploads\/2024\/02\/role-cpu-gpu.png\" class=\"external\" rel=\"nofollow\"><img loading=\"lazy\" decoding=\"async\" style=\"width: 100%; height: 100%;\" src=\"https:\/\/info.gwdg.de\/news\/wp-content\/uploads\/2024\/02\/role-cpu-gpu.png\" alt=\"\" width=\"2016\" height=\"565\" \/><\/a><\/figure>\n<p>Figure 4. Role of the CPU and GPU in a typical deep learning task.<\/p>\n<p dir=\"auto\" data-sourcepos=\"110:1-110:515\">Figure 4 depicts a typical deep learning task running on a GPU node. The CPU is responsible for starting and running the program, beginning by initializing the model and storing it in the RAM. We would now like to take advantage of the GPU&#8217;s fast processing power in order to train the model; however, the GPU cannot directly access data stored in the RAM. Instead, it has its own memory (usually called VRAM, short for Video-RAM), which can vary in size depending on the GPU, but is typically smaller than the RAM.<\/p>\n<p dir=\"auto\" data-sourcepos=\"112:1-112:265\">During training, the model parameters and data must be copied from the RAM into the GPU&#8217;s VRAM. The GPU can then feed the input data into the model, compute the result, and update the model. The result may then be transferred back to the CPU for further processing.<\/p>\n<p dir=\"auto\" data-sourcepos=\"114:1-114:416\">Transferring data between the CPU and GPU is not immediate, and the time it takes should be taken into account. If the processing task is too simple, running it on a GPU would have little advantage over a CPU. 
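<\/p>
<p dir=\"auto\">The transfer pattern described above can be sketched in a few lines of PyTorch. This is a minimal sketch, assuming PyTorch is installed; the toy model and batch are invented, and the code falls back to the CPU when no GPU is available.<\/p>

```python
import torch

# Pick the GPU if one is visible, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(8, 2)   # toy model, initialized in RAM by the CPU
batch = torch.randn(4, 8)       # toy input batch, also held in RAM

model = model.to(device)        # copy the parameters into the device memory (VRAM)
batch = batch.to(device)        # copy the input data as well

output = model(batch)           # the forward pass runs on the selected device
result = output.detach().cpu()  # transfer the result back for further processing
```

<p dir=\"auto\">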
To make full use of a GPU, you must ensure that a large amount of computation is performed on the GPU while the amount of data transfer stays comparatively low. This can be verified using our recommended monitoring tools.<\/p>\n<h3 dir=\"auto\" data-sourcepos=\"116:1-116:21\"><a id=\"user-content-cpu-and-ram-usage\" class=\"anchor\" href=\"#cpu-and-ram-usage\" aria-hidden=\"true\"><\/a>CPU and RAM Usage<\/h3>\n<p dir=\"auto\" data-sourcepos=\"118:1-118:191\">It is possible to view the hardware resource usage of a node by simply logging into it and running <code>top<\/code> (for CPU nodes) or <code>module load nvitop<\/code> then <code>nvitop<\/code> (for GPU nodes) in the terminal.<\/p>\n<figure><a href=\"https:\/\/info.gwdg.de\/news\/wp-content\/uploads\/2024\/02\/nvitop.png\" class=\"external\" rel=\"nofollow\"><img loading=\"lazy\" decoding=\"async\" style=\"width: 100%; height: 100%;\" src=\"https:\/\/info.gwdg.de\/news\/wp-content\/uploads\/2024\/02\/nvitop.png\" alt=\"\" width=\"1694\" height=\"935\" \/><\/a><figcaption>Figure 5. Example output from <code>nvitop<\/code>.<\/p>\n<p data-sourcepos=\"127:3-127:233\">A typical output of <code>nvitop<\/code> is shown in Figure 5. MEM depicts the amount of allocated GPU memory, and UTL shows the GPU utilization. A constantly low utilization of the GPU means that the GPU is mostly idle and waiting for more data.<\/p>\n<\/figcaption><\/figure>\n<p dir=\"auto\" data-sourcepos=\"131:1-131:419\">The GPU memory should be large enough to hold the model and a batch of data simultaneously. There are different types of GPUs available on the cluster, which can be selected depending on the use case. For newer GPUs, you may run your job on a GPU slice using NVIDIA-MIG when the full GPU power is not required. 
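<\/p>
<p dir=\"auto\">To judge whether a model fits into a given GPU&#8217;s memory, a rough lower bound can be computed from the parameter count. This is a hedged back-of-the-envelope sketch: real usage also depends on activations, batch size, and optimizer state.<\/p>

```python
# Back-of-the-envelope estimate (an assumption-laden sketch, not an exact
# rule): float32 weights take 4 bytes each, and plain training needs at
# least the parameters plus their gradients, before activations and data.
def min_vram_gib(n_params, bytes_per_param=4, copies=2):
    # Rough lower bound on GPU memory for weights + gradients, in GiB.
    return n_params * bytes_per_param * copies / 2**30

print(round(min_vram_gib(1_000_000_000), 1))  # 1e9 parameters -> about 7.5 GiB
```

<p dir=\"auto\">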
An overview of the available GPUs and usage can be found here: <a href=\"https:\/\/www.hlrn.de\/doc\/display\/PUB\/GPU+Usage\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/www.hlrn.de\/doc\/display\/PUB\/GPU+Usage<\/a><\/p>\n<h3 dir=\"auto\" data-sourcepos=\"133:1-133:11\"><a id=\"user-content-example\" class=\"anchor\" href=\"#example\" aria-hidden=\"true\"><\/a>Example<\/h3>\n<p dir=\"auto\" data-sourcepos=\"135:1-135:113\">An example of deep learning with GPU on the cluster and a recorded video series is available. <sup><a id=\"fnref-dlgpu-3\" href=\"#fn-dlgpu\" data-footnote-ref=\"\">15<\/a><\/sup> <sup class=\"footnote-ref\"><a id=\"fnref-youtube-9758\" href=\"#fn-youtube-9758\" data-footnote-ref=\"\">25<\/a><\/sup><\/p>\n<section class=\"footnotes\" data-footnotes=\"\">\n<ol>\n<li id=\"fn-services-9758\">\n<p data-sourcepos=\"139:14-139:41\"><a href=\"https:\/\/gwdg.de\/en\/services\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/gwdg.de\/en\/services\/<\/a> <a class=\"footnote-backref\" href=\"#fnref-services-9758\" aria-label=\"Back to reference 1\" data-footnote-backref-idx=\"1\" data-footnote-backref=\"\">\u21a9<\/a> <a href=\"#fnref-services-2\" aria-label=\"Back to reference 1-2\" data-footnote-backref-idx=\"1-2\" data-footnote-backref=\"\">\u21a9<sup>2<\/sup><\/a><\/p>\n<\/li>\n<li id=\"fn-ai-compute-9758\">\n<p data-sourcepos=\"143:16-143:66\"><a href=\"https:\/\/openai.com\/research\/ai-and-compute#addendum\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/openai.com\/research\/ai-and-compute#addendum<\/a> <a class=\"footnote-backref\" href=\"#fnref-ai-compute-9758\" aria-label=\"Back to reference 2\" data-footnote-backref-idx=\"2\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-gwdg-academy-9758\">\n<p data-sourcepos=\"169:18-169:48\"><a href=\"https:\/\/academy.gwdg.de\/academy\" target=\"_blank\" rel=\"nofollow 
noreferrer noopener\" class=\"external\">https:\/\/academy.gwdg.de\/academy<\/a> <a class=\"footnote-backref\" href=\"#fnref-gwdg-academy-9758\" aria-label=\"Back to reference 3\" data-footnote-backref-idx=\"3\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-hpc-9758\">\n<p data-sourcepos=\"141:9-141:36\"><a href=\"https:\/\/gwdg.de\/hpc\/systems\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/gwdg.de\/hpc\/systems\/<\/a> <a class=\"footnote-backref\" href=\"#fnref-hpc-9758\" aria-label=\"Back to reference 4\" data-footnote-backref-idx=\"4\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-accelerators-9758\">\n<p data-sourcepos=\"144:18-144:112\">Both graphcore systems and neuromorphic chips are being acquired as part of the KISSKI project. <a class=\"footnote-backref\" href=\"#fnref-accelerators-9758\" aria-label=\"Back to reference 5\" data-footnote-backref-idx=\"5\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-hpc-access-9758\">\n<p data-sourcepos=\"145:16-145:127\"><a href=\"https:\/\/gitlab-ce.gwdg.de\/hpc-team-public\/science-domains-blog\/-\/blob\/main\/20230623_hpc-access.md?ref_type=heads\" class=\"external\" rel=\"nofollow\">https:\/\/gitlab-ce.gwdg.de\/hpc-team-public\/science-domains-blog\/-\/blob\/main\/20230623_hpc-access.md?ref_type=heads<\/a> <a class=\"footnote-backref\" href=\"#fnref-hpc-access-9758\" aria-label=\"Back to reference 6\" data-footnote-backref-idx=\"6\" data-footnote-backref=\"\">\u21a9<\/a> <a href=\"#fnref-hpc-access-2\" aria-label=\"Back to reference 6-2\" data-footnote-backref-idx=\"6-2\" data-footnote-backref=\"\">\u21a9<sup>2<\/sup><\/a><\/p>\n<\/li>\n<li id=\"fn-kisski-9758\">\n<p data-sourcepos=\"146:12-146:34\"><a href=\"https:\/\/kisski.gwdg.de\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/kisski.gwdg.de\/<\/a> <a class=\"footnote-backref\" href=\"#fnref-kisski-9758\" aria-label=\"Back to 
reference 7\" data-footnote-backref-idx=\"7\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-nhr-9758\">\n<p data-sourcepos=\"147:9-147:36\"><a href=\"https:\/\/www.nhr-verein.de\/en\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/www.nhr-verein.de\/en<\/a> <a class=\"footnote-backref\" href=\"#fnref-nhr-9758\" aria-label=\"Back to reference 8\" data-footnote-backref-idx=\"8\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-sd-ai-9758\">\n<p data-sourcepos=\"148:11-148:54\"><a href=\"https:\/\/gwdg.de\/en\/community-pages\/ai-intro\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/gwdg.de\/en\/community-pages\/ai-intro\/<\/a> <a class=\"footnote-backref\" href=\"#fnref-sd-ai-9758\" aria-label=\"Back to reference 9\" data-footnote-backref-idx=\"9\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-sd-bioinfo-9758\">\n<p data-sourcepos=\"149:16-149:64\"><a href=\"https:\/\/gwdg.de\/en\/community-pages\/bioinfo-intro\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/gwdg.de\/en\/community-pages\/bioinfo-intro\/<\/a> <a class=\"footnote-backref\" href=\"#fnref-sd-bioinfo-9758\" aria-label=\"Back to reference 10\" data-footnote-backref-idx=\"10\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-sd-forestry-9758\">\n<p data-sourcepos=\"151:17-151:66\"><a href=\"https:\/\/gwdg.de\/en\/community-pages\/forestry-intro\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/gwdg.de\/en\/community-pages\/forestry-intro\/<\/a> <a class=\"footnote-backref\" href=\"#fnref-sd-forestry-9758\" aria-label=\"Back to reference 11\" data-footnote-backref-idx=\"11\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-cluster-practical-9758\">\n<p data-sourcepos=\"157:23-157:126\"><a 
href=\"https:\/\/gitlab-ce.gwdg.de\/hpc-team-public\/science-domains-blog\/-\/blob\/main\/20230417_cluster-practical.md\" class=\"external\" rel=\"nofollow\">https:\/\/gitlab-ce.gwdg.de\/hpc-team-public\/science-domains-blog\/-\/blob\/main\/20230417_cluster-practical.md<\/a> <a class=\"footnote-backref\" href=\"#fnref-cluster-practical-9758\" aria-label=\"Back to reference 12\" data-footnote-backref-idx=\"12\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-hlrn-gpu-9758\">\n<p data-sourcepos=\"158:14-158:58\"><a href=\"https:\/\/www.hlrn.de\/doc\/display\/PUB\/GPU+Usage\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/www.hlrn.de\/doc\/display\/PUB\/GPU+Usage<\/a> <a class=\"footnote-backref\" href=\"#fnref-hlrn-gpu-9758\" aria-label=\"Back to reference 13\" data-footnote-backref-idx=\"13\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-slurm-9758\">\n<p data-sourcepos=\"159:11-159:35\"><a href=\"https:\/\/slurm.schedmd.com\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/slurm.schedmd.com<\/a> <a class=\"footnote-backref\" href=\"#fnref-slurm-9758\" aria-label=\"Back to reference 14\" data-footnote-backref-idx=\"14\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-dlgpu-9758\">\n<p data-sourcepos=\"160:11-160:73\"><a href=\"https:\/\/gitlab-ce.gwdg.de\/dmuelle3\/deep-learning-with-gpu-cores\" class=\"external\" rel=\"nofollow\">https:\/\/gitlab-ce.gwdg.de\/dmuelle3\/deep-learning-with-gpu-cores<\/a> <a class=\"footnote-backref\" href=\"#fnref-dlgpu-9758\" aria-label=\"Back to reference 15\" data-footnote-backref-idx=\"15\" data-footnote-backref=\"\">\u21a9<\/a> <a href=\"#fnref-dlgpu-2\" aria-label=\"Back to reference 15-2\" data-footnote-backref-idx=\"15-2\" data-footnote-backref=\"\">\u21a9<sup>2<\/sup><\/a> <a href=\"#fnref-dlgpu-3\" aria-label=\"Back to reference 15-3\" data-footnote-backref-idx=\"15-3\" 
data-footnote-backref=\"\">\u21a9<sup>3<\/sup><\/a><\/p>\n<\/li>\n<li id=\"fn-vscode-9758\">\n<p data-sourcepos=\"162:12-162:41\"><a href=\"https:\/\/code.visualstudio.com\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/code.visualstudio.com\/<\/a> <a class=\"footnote-backref\" href=\"#fnref-vscode-9758\" aria-label=\"Back to reference 16\" data-footnote-backref-idx=\"16\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-vsextensions-9758\">\n<p data-sourcepos=\"163:18-163:60\"><a href=\"https:\/\/marketplace.visualstudio.com\/VSCode\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/marketplace.visualstudio.com\/VSCode<\/a> <a class=\"footnote-backref\" href=\"#fnref-vsextensions-9758\" aria-label=\"Back to reference 17\" data-footnote-backref-idx=\"17\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-vscodearticle-9758\">\n<p data-sourcepos=\"164:19-164:94\"><a href=\"https:\/\/info.gwdg.de\/news\/en\/configuring-vscode-to-access-gwdgs-hpc-cluster\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/info.gwdg.de\/news\/en\/configuring-vscode-to-access-gwdgs-hpc-cluster\/<\/a> <a class=\"footnote-backref\" href=\"#fnref-vscodearticle-9758\" aria-label=\"Back to reference 18\" data-footnote-backref-idx=\"18\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-apptainer-9758\">\n<p data-sourcepos=\"166:15-166:132\"><a href=\"https:\/\/gitlab-ce.gwdg.de\/hpc-team-public\/science-domains-blog\/-\/blob\/main\/20230907_python-apptainer.md?ref_type=heads\" class=\"external\" rel=\"nofollow\">https:\/\/gitlab-ce.gwdg.de\/hpc-team-public\/science-domains-blog\/-\/blob\/main\/20230907_python-apptainer.md?ref_type=heads<\/a> <a class=\"footnote-backref\" href=\"#fnref-apptainer-9758\" aria-label=\"Back to reference 19\" data-footnote-backref-idx=\"19\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li 
id=\"fn-JAX-9758\">\n<p data-sourcepos=\"170:9-170:55\"><a href=\"https:\/\/jax.readthedocs.io\/en\/latest\/index.html\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/jax.readthedocs.io\/en\/latest\/index.html<\/a> <a class=\"footnote-backref\" href=\"#fnref-JAX-9758\" aria-label=\"Back to reference 20\" data-footnote-backref-idx=\"20\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-cookiecutter-9758\">\n<p data-sourcepos=\"168:18-168:72\"><a href=\"https:\/\/drivendata.github.io\/cookiecutter-data-science\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/drivendata.github.io\/cookiecutter-data-science\/<\/a> <a class=\"footnote-backref\" href=\"#fnref-cookiecutter-9758\" aria-label=\"Back to reference 21\" data-footnote-backref-idx=\"21\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-slurm-docu-9758\">\n<p data-sourcepos=\"167:16-167:56\"><a href=\"https:\/\/www.hlrn.de\/doc\/display\/PUB\/Slurm\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/www.hlrn.de\/doc\/display\/PUB\/Slurm<\/a> <a class=\"footnote-backref\" href=\"#fnref-slurm-docu-9758\" aria-label=\"Back to reference 22\" data-footnote-backref-idx=\"22\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-gpu-usage-9758\">\n<p data-sourcepos=\"172:15-172:59\"><a href=\"https:\/\/www.hlrn.de\/doc\/display\/PUB\/GPU+Usage\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/www.hlrn.de\/doc\/display\/PUB\/GPU+Usage<\/a> <a class=\"footnote-backref\" href=\"#fnref-gpu-usage-9758\" aria-label=\"Back to reference 23\" data-footnote-backref-idx=\"23\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-snakemake-9758\">\n<p data-sourcepos=\"171:15-171:42\"><a href=\"https:\/\/snakemake.github.io\/\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" 
class=\"external\">https:\/\/snakemake.github.io\/<\/a> <a class=\"footnote-backref\" href=\"#fnref-snakemake-9758\" aria-label=\"Back to reference 24\" data-footnote-backref-idx=\"24\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn-youtube-9758\">\n<p data-sourcepos=\"165:13-165:84\"><a href=\"https:\/\/www.youtube.com\/playlist?list=PLvcoSsXFNRblM4AG5PZwY1AfYEW3EbD9O\" target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external\">https:\/\/www.youtube.com\/playlist?list=PLvcoSsXFNRblM4AG5PZwY1AfYEW3EbD9O<\/a> <a class=\"footnote-backref\" href=\"#fnref-youtube-9758\" aria-label=\"Back to reference 25\" data-footnote-backref-idx=\"25\" data-footnote-backref=\"\">\u21a9<\/a><\/p>\n<\/li>\n<\/ol>\n<\/section>\n<h3>Author<\/h3>\n<p><a href=\"mailto:ali.doost-hosseini@gwdg.de\">Ali Doost Hosseini<\/a> | <a href=\"mailto:hauke.kirchner@gwdg.de\">Hauke Kirchner<\/a> | Dorothea Sommer<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The GWDG offers data scientists various services and training courses to support them in their work throughout their entire workflow. As the success of a machine learning project often depends on the available computing resources, the working group &#8222;Computing&#8220; operates various HPC systems with suitable accelerators to train deep learning models. 
This article explains model &#8230; <a title=\"Best Practices for Machine Learning with HPC\" class=\"read-more\" href=\"https:\/\/info.gwdg.de\/news\/best-practices-for-machine-learning-with-hpc\/\" aria-label=\"Mehr Informationen \u00fcber Best Practices for Machine Learning with HPC\">Weiterlesen<\/a><\/p>\n","protected":false},"author":166,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1,133,129],"tags":[],"class_list":["post-23632","post","type-post","status-publish","format-standard","hentry","category-alle","category-kuenstliche-intelligenz","category-wissenschaftliche-domaenen"],"_links":{"self":[{"href":"https:\/\/info.gwdg.de\/news\/wp-json\/wp\/v2\/posts\/23632","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/info.gwdg.de\/news\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/info.gwdg.de\/news\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/info.gwdg.de\/news\/wp-json\/wp\/v2\/users\/166"}],"replies":[{"embeddable":true,"href":"https:\/\/info.gwdg.de\/news\/wp-json\/wp\/v2\/comments?post=23632"}],"version-history":[{"count":4,"href":"https:\/\/info.gwdg.de\/news\/wp-json\/wp\/v2\/posts\/23632\/revisions"}],"predecessor-version":[{"id":23647,"href":"https:\/\/info.gwdg.de\/news\/wp-json\/wp\/v2\/posts\/23632\/revisions\/23647"}],"wp:attachment":[{"href":"https:\/\/info.gwdg.de\/news\/wp-json\/wp\/v2\/media?parent=23632"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/info.gwdg.de\/news\/wp-json\/wp\/v2\/categories?post=23632"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/info.gwdg.de\/news\/wp-json\/wp\/v2\/tags?post=23632"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}