Apache Spark: Resilient Distributed Dataset (RDD)
The core concept of Spark is the resilient distributed dataset: Spark abstracts data into a Resilient Distributed Dataset (RDD), a data structure distributed across the main memory of a cluster's machines for parallel processing.
The features of RDDs (decomposing the name):
Resilient, i.e. fault-tolerant with the help of the RDD lineage graph, and so able to recompute partitions that are missing or damaged due to node failures.
Distributed, with data residing on multiple nodes in a cluster.
Dataset, a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects (that represent records of the data you work with).
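The "distributed, partitioned dataset" idea can be illustrated with a plain-Python sketch (this is an analogy, not Spark code; the `partition` helper and the partition count are invented for illustration): a collection is split into partitions, each of which can be processed independently — on a real cluster these would live on different machines and be processed in parallel.

```python
# Plain-Python analogy: an RDD is a collection split into partitions
# that can be processed independently (on a cluster, in parallel).

def partition(data, num_partitions):
    """Split `data` into roughly equal chunks (one chunk per 'node')."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

records = list(range(10))
partitions = partition(records, 3)
print(partitions)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Each partition is processed on its own; partial results are combined.
partial_sums = [sum(p) for p in partitions]
print(sum(partial_sums))  # 45
```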
From the scaladoc of org.apache.spark.rdd.RDD:
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.
From the original paper on RDDs, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing:
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
RDD operations: Transformations and Actions
An RDD can contain data of any type and can be created either by loading an external dataset or from an existing RDD. An RDD supports two types of operations:
- transformations, lazy operations that produce a new RDD from an existing one by applying functions such as map, filter, or join; they return another RDD rather than a value
- actions, operations such as count or reduce that trigger computation on the RDD and return a value to the driver program
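The two operation types can be sketched in plain Python (the `MiniRDD` class is invented for illustration and is not Spark's API): transformations record themselves in a lineage and return a new dataset without computing anything, while actions replay the recorded operations and produce a value.

```python
class MiniRDD:
    """Toy stand-in for an RDD: records transformations lazily."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops  # recorded lineage of transformations

    # --- transformations: return a new MiniRDD, compute nothing ---
    def map(self, f):
        return MiniRDD(self._data, self._ops + (("map", f),))

    def filter(self, pred):
        return MiniRDD(self._data, self._ops + (("filter", pred),))

    # --- actions: replay the lineage and return a value ---
    def collect(self):
        items = iter(self._data)
        for kind, f in self._ops:
            items = map(f, items) if kind == "map" else filter(f, items)
        return list(items)

    def count(self):
        return len(self.collect())

rdd = MiniRDD(range(1, 6))
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.collect())  # [4, 16]
print(squares_of_evens.count())    # 2
```

Because the lineage is kept rather than the computed result, a lost partition could in principle be rebuilt by replaying the same operations — the same idea behind RDD fault tolerance.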
Transformations in Spark are evaluated lazily: the result is not computed immediately; instead, Spark simply remembers the operation and the data it applies to. The computation takes place only when an action is invoked and a result has to be returned to the driver program.
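Lazy evaluation itself can be demonstrated with plain-Python generators (again an analogy, not Spark; the `expensive` function and call counter are invented for illustration): building the pipeline does no work, and the computation runs only when a result is actually demanded.

```python
calls = {"n": 0}

def expensive(x):
    calls["n"] += 1  # track how many times real work actually happens
    return x * 10

data = range(5)
# Building the lazy pipeline performs no work yet (like a transformation).
pipeline = (expensive(x) for x in data)
print(calls["n"])  # 0 -- nothing computed so far

# Consuming the generator forces the computation (like an action).
result = list(pipeline)
print(calls["n"])  # 5 -- work happened only when the result was demanded
print(result)      # [0, 10, 20, 30, 40]
```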