[[https://spark.apache.org/|Apache Spark]] is a distributed general-purpose cluster computing system.
  
Instead of the classic MapReduce pipeline, Spark’s central concept is the resilient distributed dataset (RDD), which is operated on with the help of a central driver program that makes use of the parallel operations and the scheduling and I/O facilities Spark provides. Transformations on the RDD are executed by the worker nodes in the Spark cluster. The dataset is resilient because Spark automatically handles failures in the worker nodes by [[https://spark.apache.org/docs/latest/rdd-programming-guide.html|redistributing]] the work to other nodes.
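
As a minimal sketch of this model, the following lines could be entered into a running Spark shell (assuming ''sc'' is the shell’s preconfigured SparkContext): a local collection is distributed, transformed in parallel and collected back to the driver.

<code scala>
// Distribute a local collection as an RDD; `sc` is the SparkContext
// provided by the Spark shell.
val rdd = sc.parallelize(1 to 10)

// Transformations are recorded lazily and executed on the worker nodes.
val squares = rdd.map(x => x * x)

// Actions such as collect() trigger the computation and return
// the result to the driver program.
val result = squares.collect()
</code>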
  
In the following sections, we give a short introduction to setting up a Spark cluster on the Scientific Compute Cluster and running applications on it.
  
===== Example: Approximating Pi =====
  
To showcase the capabilities of the Spark cluster set up thus far, we enter a short [[https://spark.apache.org/examples.html|Scala program]] into the shell we’ve started before.
{{ :en:services:application_services:high_performance_computing:spark:shell_example.png?nolink&800 |}}
  
The local dataset containing the integers from //1// to //1E9// is distributed across the executors using the parallelize function and filtered according to the rule that a random point //(x,y)// with //0 < x, y < 1//, sampled from a uniform distribution, lies inside the unit circle. Consequently, the ratio of the points conforming to this rule to the total number of points approximates the area of one quarter of the unit circle, i.e. //Pi///4, so multiplying by four yields the estimate for //Pi// in the last line.
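
For reference, the program in the screenshot has roughly the following shape. This is a sketch along the lines of the official [[https://spark.apache.org/examples.html|Spark examples]], again assuming the Spark shell’s preconfigured ''sc''; the variable name ''numSamples'' is ours.

<code scala>
// Monte Carlo estimate of Pi entered into the Spark shell;
// the sample count corresponds to the 1E9 points mentioned above.
val numSamples = 1000000000

val count = sc.parallelize(1 to numSamples).filter { _ =>
  val x = math.random            // x uniform in (0, 1)
  val y = math.random            // y uniform in (0, 1)
  x * x + y * y < 1              // keep points inside the unit circle
}.count()

// The fraction of points inside the quarter circle approximates Pi/4,
// hence the factor of 4 in the final estimate.
println(s"Pi is roughly ${4.0 * count / numSamples}")
</code>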

===== Further reading =====
You can find a more in-depth tour of the Spark architecture, its features and examples (based on Scala) in the [[https://info.gwdg.de/wiki/doku.php?id=wiki:hpc:parallel_processing_with_spark_on_gwdg_s_scientific_compute_cluster|HPC wiki]].
  
 --- //[[christian.koehler@gwdg.de|ckoehle2]] 2019/09/02 19:50//