en:services:application_services:high_performance_computing:spark [2019/09/02 16:39]
ckoehle2 add introduction
====== Apache Spark ======
<WRAP center round important 60%>
under construction
</WRAP>
===== Introduction =====
Apache Spark is a general-purpose distributed cluster computing system.
Instead of the classic MapReduce pipeline, Spark's central concept is the resilient distributed dataset (RDD), which is operated on with the help of a central driver program that uses the parallel operations, scheduling and I/O facilities Spark provides. Transformations on the RDD are executed by the worker nodes in the Spark cluster. The dataset is resilient because Spark automatically handles failures of worker nodes by [[https://spark.apache.org/docs/latest/rdd-programming-guide.html|redistributing]] the work to other nodes.
In the following sections, we give a short introduction on how to prepare a Spark cluster on the Scientific Compute Cluster (SCC) and run applications on it.
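The driver/transformation model can be illustrated with a small plain-Python sketch. This is not the Spark API — the class, its methods and names are invented here purely to show the idea that transformations (''map'', ''filter'') are recorded lazily and only evaluated when an action (''collect'') asks for a result:

```python
# Plain-Python illustration of the RDD programming model (NOT the Spark API):
# transformations are recorded lazily; an action triggers evaluation.
class MiniRDD:
    def __init__(self, data):
        self._data = list(data)
        self._ops = []          # recorded (lazy) transformations

    def map(self, f):           # transformation: returns a new MiniRDD
        rdd = MiniRDD(self._data)
        rdd._ops = self._ops + [("map", f)]
        return rdd

    def filter(self, p):        # transformation: returns a new MiniRDD
        rdd = MiniRDD(self._data)
        rdd._ops = self._ops + [("filter", p)]
        return rdd

    def collect(self):          # action: replays the recorded pipeline
        out = self._data
        for kind, f in self._ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

# Nothing is computed until collect() is called.
squares = MiniRDD(range(5)).map(lambda x: x * x).filter(lambda x: x > 4)
print(squares.collect())  # [9, 16]
```

In real Spark, the driver builds exactly such a lineage of transformations, and the scheduler ships the work to the worker nodes when an action runs.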
===== Creating a Spark Cluster on the SCC =====
===== Access and Monitoring =====
===== Example: Approximating PI =====
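Spark's classic Pi example uses Monte Carlo sampling: points are drawn uniformly from the unit square, and the fraction landing inside the quarter circle approximates pi/4. The sketch below computes this locally in plain Python; in Spark, the sampling would be distributed over the worker nodes (e.g. by parallelizing the sample indices and counting the hits), but the arithmetic is the same.

```python
import random

def inside(_):
    """Return True if a random point in the unit square falls
    inside the quarter circle of radius 1."""
    x, y = random.random(), random.random()
    return x * x + y * y <= 1.0

def approximate_pi(samples=100_000, seed=42):
    # fraction of hits approximates pi/4, so multiply by 4
    random.seed(seed)
    hits = sum(1 for i in range(samples) if inside(i))
    return 4.0 * hits / samples

print(approximate_pi())  # roughly 3.14
```

The estimate's error shrinks with the number of samples, which is exactly why distributing the sampling across a cluster is attractive.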