====== Hail ======
===== Introduction =====
//Hail is an open-source, scalable framework for exploring and analyzing genomic data.// ([[https://hail.is/|hail.is]])

The HPC system runs version ''0.2 beta'', which can be obtained from [[https://github.com/hail-is/hail|GitHub]]. The cluster installation largely follows the instructions for [[https://hail.is/docs/devel/installation.html#running-on-a-spark-cluster|Running on a cluster]].
===== Preparing a Spark Cluster =====
Hail runs on top of an [[https://spark.apache.org/docs/latest/cluster-overview.html|Apache Spark]] cluster. Before starting an interactive Hail session, a standalone Spark cluster, consisting of a master and several workers, needs to be prepared.
==== Environment Variables ====
Start by loading the modules for the ''Oracle JDK 1.8.0'' and ''Spark 2.3.1'':
<code>
module load JAVA/jdk1.8.0_31 spark/2.3.1
</code>
Spark will attempt to write logs into the global installation directory, which is read-only, so please specify a log directory via the environment variable ''SPARK_LOG_DIR''. For example, to use the directory ''spark-logs'' in your home directory, enter (or add to ''~/.bashrc'') the following:
<code>
export SPARK_LOG_DIR=$HOME/spark-logs
</code>
==== Submitting Spark Applications ====
<WRAP center round info 60%>
If you're just interested in running Hail, you can safely [[en:services:application_services:high_performance_computing:hail#running_hail|skip ahead]].
</WRAP>

Applications can be submitted almost as described in the [[https://spark.apache.org/docs/latest/submitting-applications.html#submitting-applications|Spark documentation]], but the submission has to be wrapped inside a Slurm batch job like the following script:
<code>
#!/bin/bash
#SBATCH -p medium
#SBATCH -N 4
#SBATCH --ntasks-per-node=1
#SBATCH -t 01:00:00

lsf-spark-submit.sh $SPARK_ARGS
</code>
Here ''spark-submit'' has been replaced by ''lsf-spark-submit.sh'', and ''$SPARK_ARGS'' holds the usual submit arguments without the ''--master'' argument, which is added automatically depending on the cluster node the master has been launched on. ''-N 4'' requests 4 nodes in total, and ''<nowiki>--ntasks-per-node=1</nowiki>'' ensures that one worker is started per node.

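For illustration, a minimal PySpark application that could be submitted this way might look like the sketch below. The file name ''pi.py'' and its contents are only an assumed example; ''SPARK_ARGS'' would then be set accordingly (for instance ''export SPARK_ARGS=$HOME/pi.py'') before submitting the batch script above.
<code>
# pi.py - hypothetical example application, not part of the cluster installation.
# Note that no --master is configured here: the wrapper script supplies it.
import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PiEstimate").getOrCreate()

def inside(_):
    # Sample a random point in the unit square and test whether it
    # falls inside the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0

n = 1000000
count = spark.sparkContext.parallelize(range(n)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()
</code>
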
==== Interactive Sessions ====
A Spark cluster to be used with Scala from the [[https://spark.apache.org/docs/latest/quick-start.html|interactive console]] can be spawned in a similar fashion, except that we start an interactive Slurm job and use the wrapper script ''lsf-spark-shell.sh'' instead:
<code>
srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-spark-shell.sh
</code>
===== Running Hail =====
The Hail user interface requires at least ''Python 3.6'', so we load the corresponding module as well as the one for the application itself:
<code>
module load python/3.6.3 HAIL/0.2
</code>
Currently, the following Python packages are loaded by ''HAIL/0.2'' as well:
<code>
Package         Version
--------------- -------
bokeh           0.13.0
Jinja2          2.10
MarkupSafe      1.0
numpy           1.15.0
packaging       17.1
pandas          0.23.3
parsimonious    0.8.1
pip             18.0
pyparsing       2.2.0
pyspark         2.3.1
python-dateutil 2.7.3
pytz            2018.5
PyYAML          3.13
scipy           1.1.0
setuptools      28.8.0
six             1.11.0
tornado         5.1
wheel           0.29.0
</code>
<WRAP center round help 60%>
Do you need additional Python packages for your Hail workflow that might also be of interest to other users? In that case, please create an [[mailto:hpc@gwdg.de|HPC support ticket]]. Alternatively, you can use ''HAIL/0.2_novenv'' instead. This module relies on user-provided virtual environments, so you can manage the environment yourself. However, at least the following packages are required for Hail to function correctly: ''bokeh pandas parsimonious scipy''
</WRAP>

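If you manage the environment yourself via ''HAIL/0.2_novenv'', a simple way to verify that these packages are available in your virtual environment is to try importing them, for example:
<code>
# Illustrative sanity check only: a missing package raises an ImportError naming it.
import bokeh
import pandas
import parsimonious
import scipy
print("All packages required by Hail were found.")
</code>
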
A Slurm job running the ''pyspark''-based console for Hail can then be submitted as follows:
<code>
srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-pyspark-hail.sh
</code>
Once the console is running, initialize Hail with the global Spark context ''sc'' in the following way:
<code>
import hail as hl
hl.init(sc)
</code>

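As a quick check that the session works, a minimal example on randomly generated data could look like the following sketch; for real analyses you would instead load your own data, e.g. with ''hl.import_vcf()''.
<code>
# Generate a small random dataset purely for illustration
# (3 populations, 100 samples, 1000 variants).
mt = hl.balding_nichols_model(n_populations=3, n_samples=100, n_variants=1000)

# Count variants (rows) and samples (columns) in the matrix table.
n_variants, n_samples = mt.count()
print(n_variants, n_samples)
</code>
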
- --- //​[[christian.koehler@gwdg.de|ckoehle2]] 2018/08/03 15:21//