This is an old revision of the document!
Table of Contents
Apache Spark is a distributed general-purpose cluster computing system. Instead of the classic Map Reduce Pipeline, Spark’s central concept is a resilient distributed dataset (RDD) which is operated on with the help a central driver program making use of the parallel operations and the scheduling and I/O facilities which Spark provides. Transformations on the RDD are executed by the worker nodes in the Spark cluster. The dataset is resilient because Spark automatically handles failures in the Worker nodes by redistributing the work to other nodes. In the following sections, we give a short introduction on how to prepare a Spark cluster and run applications on it in the Scientific Compute Cluster.