--- en:services:application_services:high_performance_computing:software:hail [2021/04/22 15:02] – created mboden
+++ en:services:application_services:high_performance_computing:software:hail [2021/12/06 11:40] (current) – mboden
@@ Line 1: / Line 1: @@
+====== Hail ======
+===== Introduction =====
+//Hail is an open-source, scalable framework for exploring and analyzing genomic data.// ([[https://hail.is/|hail.is]])
+The HPC system runs version ''0.2 beta'', which can be obtained from [[https://github.com/hail-is/hail|GitHub]]. The cluster installation mostly follows the instructions for [[https://hail.is/docs/devel/installation.html#running-on-a-spark-cluster|Running on a cluster]].
+===== Preparing a Spark Cluster =====
+Hail runs on top of an [[https://spark.apache.org/docs/latest/cluster-overview.html|Apache Spark]] cluster. Before starting an interactive Hail session, you need to prepare a standalone Spark cluster consisting of a master and several workers.
+==== Environment Variables ====
+Start by loading the modules for the ''Oracle JDK 1.8.0'' and ''Spark 2.3.1'':
+<code>
+module load JAVA/jdk1.8.0_31 spark/2.3.1
+</code>
+Spark will attempt to write logs into the global installation directory, which is read-only, so please specify a log directory via the environment variable ''SPARK_LOG_DIR''. For example, to use the directory ''spark-logs'' in your home directory, enter (or add to ''~/.bashrc'') the following:
+<code>
+export SPARK_LOG_DIR=$HOME/spark-logs
+</code>
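+A minimal sketch of the same setup that also creates the directory up front and checks that it is writable. The check is an addition for illustration only and not part of the documented cluster procedure:

```shell
# Use the same log directory in the home directory as in the example above.
export SPARK_LOG_DIR="$HOME/spark-logs"

# Create the directory if it does not exist yet; -p makes this safe to repeat.
mkdir -p "$SPARK_LOG_DIR"

# Illustrative check: report early if the directory is not writable.
if [ -w "$SPARK_LOG_DIR" ]; then
    echo "Spark logs will be written to $SPARK_LOG_DIR"
else
    echo "Cannot write to $SPARK_LOG_DIR" >&2
fi
```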
+==== Submitting Spark Applications ====
+<WRAP center round info 60%>
+If you're just interested in running Hail, you can safely [[en:services:application_services:high_performance_computing:hail#running_hail|skip ahead]].
+</WRAP>
+Applications can be submitted almost as described in the [[https://spark.apache.org/docs/latest/submitting-applications.html#submitting-applications|Spark documentation]], but the submission has to be wrapped inside a Slurm batch job such as the following script:
+<code>
+#!/bin/bash
+#SBATCH -p medium
+#SBATCH -N 4
+#SBATCH --ntasks-per-node=1
+#SBATCH -t 01:00:00
+lsf-spark-submit.sh $SPARK_ARGS
+</code>
+where ''spark-submit'' has been replaced by ''lsf-spark-submit.sh'' and ''$SPARK_ARGS'' stands for the submit arguments without the ''<nowiki>--master</nowiki>'' argument, which is added automatically depending on which cluster node the master has been launched on. Because of ''-N 4'' the job uses 4 nodes in total, and ''<nowiki>--ntasks-per-node=1</nowiki>'' ensures that one worker is started per node.
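+As an illustration, ''$SPARK_ARGS'' could be set in the shell before the batch script is submitted. The sketch below rests on several assumptions that are not confirmed for the cluster installation: that the ''spark'' module sets ''SPARK_HOME'' (the fallback path is only a placeholder), that the SparkPi example shipped with Spark 2.3.1 is available under ''$SPARK_HOME/examples/jars'', and that the script above has been saved as ''spark-job.sh'' (a hypothetical file name).

```shell
# Placeholder fallback; on the cluster the spark module is assumed to set SPARK_HOME.
SPARK_HOME="${SPARK_HOME:-/path/to/spark}"

# Submit arguments for the SparkPi demo, deliberately without the master option:
# the wrapper script adds it once the master node is known.
export SPARK_ARGS="--class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.1.jar 100"

echo "Submit arguments: $SPARK_ARGS"

# Submit the batch script (hypothetical file name). By default, sbatch
# passes the exported environment, including SPARK_ARGS, on to the job.
# sbatch spark-job.sh
```

+The variable is left unquoted in the batch script, so the shell splits it into separate arguments; this only works as long as no single argument contains spaces.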
+==== Interactive Sessions ====
+A Spark cluster to be used with Scala from the [[https://spark.apache.org/docs/latest/quick-start.html|interactive console]] can be spawned in a similar fashion, except that we start an interactive Slurm job and use the wrapper script ''lsf-spark-shell.sh'' instead:
+<code>
+srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-spark-shell.sh
+</code>
+===== Running Hail =====
+The Hail user interface requires at least ''Python 3.6'', so we load the corresponding module as well as the one for the application itself:
+<code>
+module load python/3.6.3 HAIL/0.2
+</code>
+Currently, the following Python packages are loaded by ''HAIL/0.2'' as well:
+<code>
+Package         Version
+--------------- -------
+bokeh           0.13.0
+Jinja2          2.10
+MarkupSafe      1.0
+numpy           1.15.0
+packaging       17.1
+pandas          0.23.3
+parsimonious    0.8.1
+pip             18.0
+pyparsing       2.2.0
+pyspark         2.3.1
+python-dateutil 2.7.3
+pytz            2018.5
+PyYAML          3.13
+scipy           1.1.0
+setuptools      28.8.0
+six             1.11.0
+tornado         5.1
+wheel           0.29.0
+</code>
+<WRAP center round help 60%>
+Do you need additional Python packages for your Hail workflow that might also be of interest to other users? In that case, please create an [[mailto:hpc-support@gwdg.de|HPC support ticket]]. Alternatively, you can use ''HAIL/0.2_novenv'' instead. This module relies on user-provided virtual environments, so you can manage the environment yourself. However, at least the following Python packages are required for Hail to function correctly: ''bokeh pandas parsimonious scipy''
+</WRAP>
+A Slurm job running the ''pyspark''-based console for Hail can then be submitted as follows:
+<code>
+srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-pyspark-hail.sh
+</code>
+Once the console is running, initialize Hail with the global Spark context ''sc'' as follows:
+<code>
+import hail as hl
+hl.init(sc)
+</code>
+ --- //[[christian.koehler@gwdg.de|ckoehle2]] 2018/08/03 15:21//