
Resilient Distributed Datasets (RDDs)

Updated 4 February 2026
  • An RDD is an immutable, partitioned collection of records whose lineage graph records the deterministic transformations that produced it, enabling robust fault tolerance.
  • They enable efficient, high-throughput in-memory processing for analytics, iterative machine learning, and large-scale data mining workloads.
  • RDDs support lazy evaluation through transformations and actions, optimizing task scheduling and resource management in distributed systems.

Resilient Distributed Datasets (RDDs) are the foundational abstraction for distributed, in-memory data processing in Apache Spark. RDDs provide an immutable, partitioned collection of records distributed across a cluster, equipped with a lineage graph that tracks the sequence of deterministic transformations applied to produce them. The abstraction enables expressive fault-tolerant computation, efficient scheduling, and flexible memory management for high-throughput analytics, iterative machine learning, and large-scale data mining workloads (Morrelli, 2018, Tang et al., 2018, Yang et al., 2018).

1. Formal Definition and Core Properties

An RDD is defined as an immutable, partitioned collection of elements residing across the nodes of a cluster, along with a lineage graph—a Directed Acyclic Graph (DAG) that encodes the deterministic transformations used to derive each RDD from its predecessors or input sources. The key properties are:

  • Immutability: The contents of an RDD cannot be altered post-creation; all mutations result in the construction of a new RDD (Morrelli, 2018, Tang et al., 2018, Yang et al., 2018).
  • Partitioning: RDD data is automatically split into partitions, each typically mapped to a specific executor or node for parallelization. Formally, partitions are assigned via a function p: RDD → {p_1, p_2, ..., p_n} (Tang et al., 2018).
  • Lineage: RDDs maintain a record of the DAG of transformations (lineage graph), crucial for fault tolerance as it provides recipes for re-deriving lost partitions (Morrelli, 2018, Tang et al., 2018, Yang et al., 2018).
  • Fault Tolerance: Upon partition loss (e.g., node failure), Spark reconstructs only the affected partitions by replaying the relevant transformations from the lineage, without full data replication (Morrelli, 2018, Tang et al., 2018, Yang et al., 2018).

This abstraction permits efficient in-memory computation, fault-tolerant analytics, and workload scalability.
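The interplay of these properties can be made concrete with a small sketch. The following is an illustrative toy in plain Python (the class and method names are invented for this example, not Spark's API): an RDD-like object holds partitions plus a lineage entry, transformations build new objects rather than mutating, and a lost partition is re-derived by replaying the lineage for that partition alone.

```python
# Illustrative sketch (NOT Spark's implementation): an RDD as an immutable
# partitioned collection plus a lineage entry (parent + deterministic function).
class ToyRDD:
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists (one list per partition)
        self.parent = parent          # lineage: predecessor RDD, if any
        self.fn = fn                  # deterministic per-element transformation

    def map(self, fn):
        # Immutability: mapping builds a NEW RDD and extends the lineage chain.
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return ToyRDD(new_parts, parent=self, fn=fn)

    def recompute_partition(self, i):
        # Fault recovery: re-derive ONE lost partition by replaying the
        # lineage from the parent, leaving the other partitions untouched.
        if self.parent is None:
            raise ValueError("source partition must be reloaded from input")
        return [self.fn(x) for x in self.parent.partitions[i]]

source = ToyRDD([[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2)
doubled.partitions[1] = None               # simulate losing partition 1
restored = doubled.recompute_partition(1)  # replay lineage for that partition only
print(restored)                            # [6, 8]
```

Because the transformation is deterministic, the recomputed partition is byte-identical to the lost one, which is exactly what makes lineage a substitute for replication.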

2. RDD Programming Model: Transformations, Actions, and Lazy Evaluation

The RDD API supports two categories of operations:

  • Transformations: These are lazy, create a new RDD, and do not trigger computation until an action is invoked. Examples include map, filter, flatMap, reduceByKey, groupByKey, and join (Morrelli, 2018, Tang et al., 2018). Formally, a transformation T applied to RDDs R_1, ..., R_k yields a new RDD R = T(R_1, ..., R_k), adding edges to the lineage DAG.
  • Actions: These trigger physical execution and return results or perform I/O (e.g., collect, count, reduce, saveAsTextFile, first, take) (Morrelli, 2018, Tang et al., 2018).

Example in Scala (from Tang et al., 2018):

val lines: RDD[String] = sc.textFile("hdfs://…")
val words: RDD[String] = lines.flatMap(_.split(" "))
val pairs: RDD[(String,Int)] = words.map(w => (w,1))
val counts: RDD[(String,Int)] = pairs.reduceByKey(_+_)
val output: Array[(String,Int)] = counts.collect()

Each transformation appends to the lineage DAG; actions traverse the DAG, submitting pipeline-executable stages to the cluster.
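Lazy evaluation itself fits in a few lines. The following hedged sketch in plain Python (again invented names, not Spark's engine) shows transformations merely recording a plan, and the action replaying the whole plan at once:

```python
# Minimal lazy-evaluation sketch (illustrative only): transformations append
# to a recorded plan; only an action actually executes it.
class LazyRDD:
    def __init__(self, data, plan=()):
        self.data = data  # source records
        self.plan = plan  # recorded transformations (the lineage)

    def map(self, fn):
        return LazyRDD(self.data, self.plan + (("map", fn),))

    def filter(self, pred):
        return LazyRDD(self.data, self.plan + (("filter", pred),))

    def collect(self):  # the ACTION: only now is any work performed
        out = self.data
        for kind, f in self.plan:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

r = LazyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(r.plan))    # 2 transformations recorded, nothing computed yet
print(r.collect())    # [20, 30, 40]
```

Deferring execution this way is what lets the scheduler see the whole DAG before choosing stage boundaries and task placement.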

3. Execution Engine, Scheduling, and Fault Recovery

Spark orchestrates the distributed evaluation of RDDs as follows:

  • Partition Scheduling: Every RDD is divided into logical partitions, each assigned by the driver to an executor as a separate task (Morrelli, 2018, Tang et al., 2018).
  • Lineage-Based Task Graph: Transformations extend the lineage DAG, and the scheduler decomposes this into pipeline-parallel stages. Operations within a stage (e.g., map, filter) are pipelined; shuffle operations introduce stage boundaries (Tang et al., 2018).
  • Fault Recovery: If a partition is lost (e.g., due to executor failure), Spark uses the lineage DAG to recompute only the affected partitions from source data, applying the corresponding sequence of transformations (Morrelli, 2018, Tang et al., 2018, Yang et al., 2018).

The scheduling model is explicitly DAG-based, optimizing resource usage and minimizing recomputation.
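The stage-decomposition rule described above (narrow operations pipelined, shuffles closing a stage) can be sketched as a simple partitioning of an operator list. This is a toy illustration; the operator names and the `SHUFFLE_OPS` set are assumptions for the example, not Spark's internal representation.

```python
# Hedged sketch of DAG-stage decomposition: narrow ops (map, filter, ...) are
# pipelined into one stage; a wide (shuffle) op closes the current stage.
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "join", "repartition"}

def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in SHUFFLE_OPS:  # shuffle => stage boundary
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

plan = ["textFile", "flatMap", "map", "reduceByKey", "map", "collect"]
print(split_into_stages(plan))
# [['textFile', 'flatMap', 'map', 'reduceByKey'], ['map', 'collect']]
```

The word-count pipeline from Section 2 thus becomes two stages: everything up to and including the reduceByKey shuffle, then the downstream consumption.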

4. Persistence, Caching, and Storage Levels

RDDs can be materialized (cached or persisted) in memory or on disk to accelerate iterative and interactive workloads:

  • Persistence API: Users mark RDDs for caching using rdd.persist(StorageLevel) or rdd.cache(). Storage levels include MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, OFF_HEAP, and combinations with replication (Tang et al., 2018, Yang et al., 2018).
  • Eviction Policy: When cache memory is exhausted, Spark's default is Least Recently Used (LRU) block eviction, although this may be suboptimal with complex DAG workloads (Yang et al., 2018).
  • Optimization Algorithms: The formal RDD caching problem can be cast as a knapsack optimization: given a recompute cost c(v), size s(v), and access frequency f(v) for each RDD partition node v, maximize the total recompute savings subject to a cache budget M. Advanced approaches provide (1 − 1/e)-approximation guarantees using submodular maximization and adaptive, gradient-based online algorithms, outperforming LRU by 12–40% in work reduction and makespan (Yang et al., 2018).
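To make the knapsack view concrete, here is a minimal sketch of a greedy cache-selection heuristic ranking candidates by savings density f(v)·c(v)/s(v). This is a standard knapsack heuristic chosen for illustration, not the submodular/online algorithm of Yang et al. (2018); the candidate data is invented.

```python
# Toy caching-as-knapsack sketch: each candidate partition v has recompute
# cost c(v), size s(v), and access frequency f(v); pick a cache set within
# budget M that maximizes saved recompute work f(v) * c(v).
def greedy_cache(candidates, M):
    # candidates: dict name -> (cost, size, freq)
    ranked = sorted(candidates.items(),
                    key=lambda kv: kv[1][0] * kv[1][2] / kv[1][1],  # density
                    reverse=True)
    chosen, used, saved = [], 0, 0
    for name, (c, s, f) in ranked:
        if used + s <= M:  # greedily take the densest items that still fit
            chosen.append(name)
            used += s
            saved += f * c
    return chosen, saved

parts = {"A": (10, 4, 5), "B": (3, 1, 2), "C": (8, 5, 1)}
chosen, saved = greedy_cache(parts, M=5)
print(chosen, saved)  # ['A', 'B'] 56
```

Unlike LRU, which looks only at recency, this formulation weighs how expensive a partition is to rebuild and how often it is reused, which is why DAG-aware policies can beat LRU on iterative workloads.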

5. Performance Analysis and Quantitative Experiments

Quantitative comparisons of RDDs vs. Datasets in Spark 2.3.0 (Java 1.8; see Morrelli, 2018) highlight:

  • Small Collections: Performance is nearly identical (e.g., for "Dictionary" on the PAISÀ corpus, JavaRDD and Dataset both execute in 1m58s).
  • Larger Collections: Datasets yield increasing speedups (e.g., up to 14% throughput gain on the Wikipedia ThreeGrams task; 33% on large JSON dictionaries).
  • Decision Criteria: RDDs are preferred for custom, low-level manipulations or nested/non-tabular JSON data; Datasets offer higher-level, optimized APIs for tabular or structured data and enable further improvements when Catalyst optimizations can be leveraged (Morrelli, 2018).

Table: Representative Execution Times (NC_Spark_Reduce; Morrelli, 2018)

Task         JavaRDD (s)   Dataset (s)   Speedup (RDD/DS)
Dictionary   183           168           1.089
TwoGrams     361           316           1.142
ThreeGrams   746           640           1.166
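The speedup column is simply the ratio of the two execution times, which a quick check reproduces:

```python
# Speedup (RDD/DS) = JavaRDD time / Dataset time, for the table above.
times = {"Dictionary": (183, 168), "TwoGrams": (361, 316), "ThreeGrams": (746, 640)}
speedups = {task: round(rdd_s / ds_s, 3) for task, (rdd_s, ds_s) in times.items()}
print(speedups)  # {'Dictionary': 1.089, 'TwoGrams': 1.142, 'ThreeGrams': 1.166}
```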

A plausible implication is that RDDs are not inherently slower than Datasets for small inputs, but their lack of query-level optimizations becomes pronounced at scale.

6. Applications and Ecosystem Integration

RDDs provide the substrate for advanced Spark components:

  • Spark SQL/DataFrames: Implement a relational API atop RDDs, exploiting Catalyst optimizations and code generation (Tang et al., 2018).
  • GraphX: Encodes vertices and edges as RDDs for distributed graph analytics (e.g., Pregel-style algorithms).
  • MLlib: Uses RDDs for scalable, iterative linear algebra and gradient-based learning algorithms.
  • Spark Streaming: Treats each micro-batch as an RDD for consistent scheduling and recovery semantics.
  • Caching Optimization: Heuristics and formal algorithms for RDD cache management, such as EWMA-based scores, improve iterative ML workload throughput over baseline LRU (Yang et al., 2018).

7. Critical Analysis and Comparative Perspective

Fundamental strengths of RDDs include their expressiveness for arbitrary, non-tabular workflows, explicit lineage-driven fault tolerance, and direct encoding of custom partitioning or low-level data manipulation. Datasets and DataFrames, with higher-level APIs and automatic optimization, have gradually superseded RDDs for many batch and SQL-style workflows, particularly where tabular schemas and relational patterns prevail. However, RDDs remain essential where full control over partition management, schema freedom, or representation of arbitrarily nested data (e.g., complex JSON) is required (Morrelli, 2018, Tang et al., 2018). Experimental results indicate that the choice between RDDs and the structured APIs should be guided by input complexity, API maturity, and performance targets at large data volumes.

In total, the RDD abstraction—rooted in immutability, partitioning, lineage, and distributed memory computing—underpins scalable, fault-tolerant analytics in Spark, sets the stage for advanced optimizations and higher-level APIs, and remains foundational for domains requiring low-level control (Morrelli, 2018, Tang et al., 2018, Yang et al., 2018).
