# Deep Learning Containers in HPC

*Published 2024-09-18*

Reliable software environments are often a problem for data scientists. This becomes an issue especially when an analysis needs to run on different machines: a colleague's laptop, a cloud VM, or an HPC cluster. Here we introduce the idea of containerised software environments in the use case of a deep learning workflow. A more general view on this has already been described in our blog post [Declutter your Python environments](https://gitlab-ce.gwdg.de/hpc-team-public/science-domains-blog/-/blob/main/20230907_python-apptainer.md).

A more permanent version of this article, with updates and fixes, can be found in our [GitLab repository](https://gitlab-ce.gwdg.de/hpc-team-public/science-domains-blog/-/blob/main/20240918_deep-learning-containers-in-hpc.md).

![Overview of a deep learning container](https://info.gwdg.de/news/wp-content/uploads/2024/09/dlc.png)

*Overview of a deep learning container. This example may be a bit overloaded, but it shows the flexibility and versatility you gain when using containers to wrap your software environment.*

### Why? Let's ask [Chat AI](https://chat-ai.academiccloud.de/)

What are the five most important benefits of using containers in my machine learning workflow?

1. **Consistency and Reproducibility**: Containers ensure that your ML environment is consistent across different machines and environments, making it easier to reproduce results and collaborate with others.
2. **Dependency Management**: Containers simplify dependency management by packaging all required libraries and dependencies within the container, eliminating version conflicts and ensuring that your ML code runs smoothly.
3. **Isolation and Security**: Containers provide a sandboxed environment for your ML code, isolating it from the host system and preventing potential security risks, such as data breaches or malware infections.
4. **Portability and Flexibility**: Containers are highly portable, allowing you to move your ML workloads between different environments, such as development, testing, and production, without worrying about compatibility issues.
5. **Efficient Resource Utilization**: Containers can run multiple ML workloads on a single machine, maximizing resource utilization and reducing the need for additional hardware.

Credits: the self-hosted model LLaMA 3 70B Instruct was used to generate this answer; only an excerpt of the original answer is shown here.

## Building the container 🛠️

```shell
me@local-machine~ % ssh grete
[me@glogin9 ~]$ cd /path/to/deep-learning-with-gpu-cores/code/container
[me@glogin9 ~]$ bash build_dlc-dlwgpu.sh
```

The container can be built on our various HPC clusters using the `build_dlc-*.sh` scripts. Details on how to use Apptainer and the process of building containers are described in our [documentation](https://docs.hpc.gwdg.de/software/apptainer/) and our blog article [Decluttering your Python environments](https://gitlab-ce.gwdg.de/hpc-team-public/science-domains-blog/-/blob/main/20230907_python-apptainer.md). Running `build_dlc-dlwgpu.sh` will build an image with the software used for our workshop example, as defined in `dlc-dlwgpu.def`. In contrast to the traditional approach of using conda to install all packages defined in a `requirements.txt` file, pip is used here to reduce the number of software packages involved.
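To illustrate what a pip-based Apptainer definition file can look like, here is a hypothetical minimal sketch (written out via a heredoc so it can be inspected; the workshop's actual `dlc-dlwgpu.def` is more elaborate, and the file name and package list below are placeholders):

```shell
# Hypothetical minimal Apptainer definition file, written to disk via a
# heredoc; the workshop's dlc-dlwgpu.def differs and is more elaborate.
cat > dlc-minimal.def <<'EOF'
Bootstrap: docker
From: python:3.11-slim

%post
    # install the Python packages with pip instead of conda
    pip install --no-cache-dir numpy pandas

%runscript
    exec python "$@"
EOF

# build the image from the definition file (run this on the HPC,
# where Apptainer is available):
# apptainer build dlc-minimal.sif dlc-minimal.def
```

The `%post` section runs once at build time inside the container, while `%runscript` defines what `apptainer run` executes later.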
However, there are good reasons to use conda, so `build_dlc-conda-example.sh` shows a minimal example of installing Python packages in a container using conda (see `dlc-conda-example.def`).

If you encounter errors while building the container, have a look at `build_dlc-dlwgpu.log` or `build_dlc-conda-example.log`. You can also use `cat` (in a different terminal) to follow the progress of the image build.

The image can now be used on our JupyterHub service or via the command line on our HPC systems. We recommend using JupyterHub in the initial phase of a project to set up the workflow and for data exploration. Compute-intensive workloads, such as model training, should be run on the HPC cluster using SLURM. This allows users to execute computational jobs in a very flexible and customisable way, making the best use of the available computing resources.

### Using the container on HPC JupyterHub

Below are two examples of how to use HPC JupyterHub. The first example shows the use of a Jupyter notebook for data exploration. The second example is about testing the software environment to ensure that everything is set up correctly, e.g.
CUDA for GPU acceleration.

![Screenshot of the HPC JupyterHub with examples](https://info.gwdg.de/news/wp-content/uploads/2024/09/hpc-jupyterhub-screenshot-1.png)

*Screenshot of the HPC JupyterHub with examples*

1. Visit https://jupyter.hpc.gwdg.de
2. Click on the "Sign in with AcademicCloud" button
3. Start your server with the following options (everything else is default):
   - Select a job profile: `GWDG HPC with own Container`
   - Set your own Apptainer container location: `$WORK/dlc-dlwgpu.sif`
   - Jupyter Notebook's Home directory: `/path/to/deep-learning-with-gpu-cores/`
4. Run the code:
   - Notebook: open `/path/to/deep-learning-with-gpu-cores/code/notebooks/00.1-inital-data-exploration.ipynb`
   - Python file:
     - Open `/path/to/deep-learning-with-gpu-cores/test_env.py`.
     - [Create a console for the editor](https://jupyterlab.readthedocs.io/en/stable/user/documents_kernels.html) and run the code in the console.
5. Stop the JupyterHub server:
   - File -> Hub Control Panel -> Stop My Server

### Using the container via CLI on the HPC

The full potential of an HPC cluster can only be utilised via the command line interface (CLI). Workflows can be optimised for the available hardware, such as different accelerators (e.g. GPUs, NPUs, ...) and highly parallel workloads. Here, a simple workflow using our workshop's example is shown, based on the same container that was used in JupyterHub. For more details, please have a look at our [documentation](https://docs.hpc.gwdg.de/index.html), e.g. on [Slurm/GPU usage](https://docs.hpc.gwdg.de/usage_guide/slurm/gpu_usage/index.html) and [GPU partitions](https://docs.hpc.gwdg.de/compute_partitions/gpu_partitions/index.html).

#### Interactive HPC usage

Using an HPC cluster interactively gives you direct access to the cluster's compute resources. This can be particularly useful for optimising your workflow on the HPC. For interactive usage, the compute resources are first requested with `salloc`. On the allocated compute node, a shell is started within the container, allowing code to be executed with the pre-built software environment.
In this example, the software environment is tested to make sure that everything works as intended, such as the correct use of CUDA for GPU acceleration (`test_env.py`).

```shell
cd path/to/deep-learning-with-gpu-cores/code/

# list the available GPUs (see docs.hpc.gwdg.de/usage_guide/slurm/gpu_usage/)
# sinfo -o "%25N  %5c  %10m  %32f  %10G %18P " | grep gpu

salloc -t 01:00:00 -p grete:interactive -N1 -G 3g.20gb

module load apptainer

apptainer exec --nv --bind /scratch dlc.sif bash

python ../test_env.py
```

### HPC usage via batch scripts

Finally, the defined workflow can be submitted as a job to the workload manager [SLURM](https://docs.hpc.gwdg.de/usage_guide/slurm/index.html). To do this, a job script needs to be defined and submitted via `sbatch`.
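As a rough, hypothetical sketch of what such a job script might contain (the workshop's actual `submit_train_dlc.sh` differs in detail; the resource values, partition, and `train.py` entry point below are placeholders, not taken from the workshop repository):

```shell
# Hypothetical SLURM job script sketch, written to a file via a heredoc;
# resource values and paths are placeholders -- the workshop's
# submit_train_dlc.sh differs in detail.
cat > submit_train_sketch.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=dlc-train
#SBATCH --partition=grete
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH -G 1
#SBATCH --output=train_%j.log

module load apptainer

# run the (hypothetical) training script inside the container,
# with GPU support enabled via --nv
apptainer exec --nv --bind /scratch dlc.sif python train.py
EOF

# submit on the cluster with:
# sbatch submit_train_sketch.sh
```

The `#SBATCH` directives request the resources at submission time, so the same container image runs unchanged whether it is launched interactively via `salloc` or in batch mode via `sbatch`.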
Because you have already gathered a lot of experience with the containerised software environment, starting on your local machine and continuing on JupyterHub, a major point of failure when scaling up your analysis and moving to the HPC systems has become unlikely.

```shell
sbatch path/to/deep-learning-with-gpu-cores/code/submit_train_dlc.sh
```

### Author

[Hauke Kirchner](mailto:hauke.kirchner@gwdg.de)

This article was first published in the scope of the workshop [Deep learning with GPU cores](https://gitlab-ce.gwdg.de/hpc-team-public/deep-learning-with-gpu-cores).