en:services:application_services:high_performance_computing:software:hail [2021/04/22 15:02] (current)
mboden created
 +====== Hail ======
 +===== Introduction =====
 +//Hail is an open-source, scalable framework for exploring and analyzing genomic data.// ([[https://hail.is/|hail.is]])
 +The HPC system runs version ''0.2 beta'', which can be obtained from [[https://github.com/hail-is/hail|GitHub]]. The cluster installation largely follows the instructions for [[https://hail.is/docs/devel/installation.html#running-on-a-spark-cluster|Running on a cluster]].
 +===== Preparing a Spark Cluster =====
 +Hail runs on top of an [[https://spark.apache.org/docs/latest/cluster-overview.html|Apache Spark]] cluster. Before starting an interactive Hail session, you need to prepare a standalone Spark cluster consisting of a master and several workers.
 +==== Environment Variables ====
 +Start by loading the modules for ''Oracle JDK 1.8.0'' and ''Spark 2.3.1'':
 +module load JAVA/jdk1.8.0_31 spark/2.3.1
 +Spark will attempt to write logs into the global installation directory, which is read-only, so please specify a log directory via the environment variable ''SPARK_LOG_DIR''. For example, to use the directory ''spark-logs'' in your home directory, enter the following (or add it to ''~/.bashrc''):
 +export SPARK_LOG_DIR=$HOME/spark-logs
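As a minimal sketch of the step above, the following lines set the variable only if it is not already defined and create the directory up front. Spark's own scripts may create the directory themselves, so the ''mkdir'' is a harmless precaution rather than a documented requirement of this installation; the writability check and the printed messages are additions for illustration.

```shell
# Fall back to ~/spark-logs when SPARK_LOG_DIR is not already set.
export SPARK_LOG_DIR="${SPARK_LOG_DIR:-$HOME/spark-logs}"

# Create the directory; -p makes this a no-op if it already exists.
mkdir -p "$SPARK_LOG_DIR"

# Report whether the directory can be written to.
if [ -w "$SPARK_LOG_DIR" ]; then
    echo "Spark logs will be written to $SPARK_LOG_DIR"
else
    echo "Warning: $SPARK_LOG_DIR is not writable" >&2
fi
```

Placing these lines in ''~/.bashrc'' would make the setting available in every new shell.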
 +==== Submitting Spark Applications ====
 +<WRAP center round info 60%>
 +If you're just interested in running Hail, you can safely [[en:services:application_services:high_performance_computing:hail#running_hail|skip ahead]].
 +</WRAP>
 +Applications can be submitted almost as described in the [[https://spark.apache.org/docs/latest/submitting-applications.html#submitting-applications|Spark documentation]], but the submission has to be wrapped inside a Slurm job like the one given by the following script:
 +#!/bin/bash
 +#SBATCH -p medium
 +#SBATCH -N 4
 +#SBATCH --ntasks-per-node=1
 +#SBATCH -t 01:00:00
 +lsf-spark-submit.sh $SPARK_ARGS
 +where ''spark-submit'' has been replaced by ''lsf-spark-submit.sh'' and ''$SPARK_ARGS'' are the submit arguments without the ''<nowiki>--master</nowiki>'' argument. That argument is added automatically, depending on which cluster node the master has been launched on. Because of ''-N 4'' the job spans 4 nodes in total, and ''<nowiki>--ntasks-per-node=1</nowiki>'' ensures that one worker per node is started.
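To illustrate what ''$SPARK_ARGS'' might contain, here is a sketch that could be placed in the job script before the call to the wrapper. The class name, jar path and input file are hypothetical placeholders, not files that exist on the cluster; ''<nowiki>--class</nowiki>'' and ''<nowiki>--executor-memory</nowiki>'' are ordinary ''spark-submit'' options.

```shell
# Hypothetical application: class name, jar path and input file are placeholders.
APP_CLASS="org.example.WordCount"
APP_JAR="$HOME/apps/wordcount.jar"
APP_INPUT="$HOME/data/input.txt"

# Ordinary spark-submit options, deliberately without --master:
# the wrapper script adds it once the master node is known.
SPARK_ARGS="--class $APP_CLASS --executor-memory 4g $APP_JAR $APP_INPUT"

# Show the command line the job script would run.
echo "lsf-spark-submit.sh $SPARK_ARGS"
```

The finished job script would then be handed to the scheduler with ''sbatch''.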
 +==== Interactive Sessions ====
 +A Spark cluster to be used with Scala from the [[https://spark.apache.org/docs/latest/quick-start.html|interactive console]] can be spawned in a similar fashion, except that we start an interactive Slurm job and use the wrapper script ''lsf-spark-shell.sh'' instead:
 +srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-spark-shell.sh
 +===== Running Hail =====
 +The Hail user interface requires at least ''Python 3.6'', so we load the corresponding module as well as the one for the application itself:
 +module load python/3.6.3 HAIL/0.2
 +Currently, the following Python packages are loaded by ''HAIL/0.2'' as well:
 +Package         Version
 +--------------- -------
 +bokeh           0.13.0
 +Jinja2          2.10
 +MarkupSafe      1.0
 +numpy           1.15.0
 +packaging       17.1
 +pandas          0.23.3
 +parsimonious    0.8.1
 +pip             18.0
 +pyparsing       2.2.0
 +pyspark         2.3.1
 +python-dateutil 2.7.3
 +pytz            2018.5
 +PyYAML          3.13
 +scipy           1.1.0
 +setuptools      28.8.0
 +six             1.11.0
 +tornado         5.1
 +wheel           0.29.0
 +<WRAP center round help 60%>
 +Do you need additional Python packages for your Hail workflow that might also be of interest to other users? In that case, please create an [[mailto:hpc@gwdg.de|HPC support ticket]]. Alternatively, you can use ''HAIL/0.2_novenv'' instead. This module relies on user-provided virtual environments, so you can manage the environment yourself. However, at least the following Python packages are required for Hail to function correctly: ''bokeh pandas parsimonious scipy''
 +</WRAP>
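When using a self-managed environment, a quick check along the following lines can confirm that the four required packages are importable. This is only a sketch: the function name ''check_hail_deps'' is made up for illustration, and it assumes that ''python3'' on your ''PATH'' is the interpreter of the virtual environment you intend to use with ''HAIL/0.2_novenv''.

```shell
# Report which of the required Python packages can be imported
# by the python3 interpreter that is currently first in PATH.
# check_hail_deps is a hypothetical helper, not part of the Hail module.
check_hail_deps() {
    missing=0
    for pkg in bokeh pandas parsimonious scipy; do
        if python3 -c "import $pkg" 2>/dev/null; then
            echo "$pkg: ok"
        else
            echo "$pkg: MISSING"
            missing=1
        fi
    done
    # Non-zero return value signals that at least one package is missing.
    return "$missing"
}

check_hail_deps || echo "Install the missing packages into your virtual environment with pip."
```

Run it after activating the virtual environment and before submitting the Hail job.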
 +A Slurm job running the ''pyspark''-based console for Hail can then be submitted as follows:
 +srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-pyspark-hail.sh
 +Once the console is running, initialize Hail with the global Spark context ''sc'' in the following way:
 +import hail as hl
 +hl.init(sc)
 + --- //[[christian.koehler@gwdg.de|ckoehle2]] 2018/08/03 15:21//