RDD

At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster, that can be operated on in parallel with a low-level API offering transformations and actions.
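As a minimal sketch of what that low-level API looks like (assuming a local Spark installation; the app name and sample data are illustrative), the following word count builds an RDD, chains lazy transformations, and materializes the result with an action:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RDDExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Build an RDD from a local collection; Spark partitions it across the cluster.
    val lines = sc.parallelize(Seq("spark makes rdds", "rdds are immutable"))

    // Transformations (lazy): each returns a new RDD.
    val words  = lines.flatMap(_.split(" "))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Action (eager): triggers execution and returns results to the driver.
    counts.collect().foreach { case (w, n) => println(s"$w -> $n") }

    sc.stop()
  }
}
```

Note the functional style: you control exactly how each element is transformed, with no schema imposed on the data.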

When to use RDDs?

Consider using RDDs in these scenarios or common use cases:

  • you want low-level transformations and actions and control over your dataset;
  • your data is unstructured, such as media streams or streams of text;
  • you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
  • you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column; and
  • you can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.

DataFrames and Datasets are built on top of RDDs.

DataFrame

Unlike an RDD, data in a DataFrame is organized into named columns, like a table in a relational database. Designed to make processing large datasets even easier, a DataFrame allows developers to impose a structure onto a distributed collection of data, enabling higher-level abstraction; it provides a domain-specific language API to manipulate your distributed data; and it makes Spark accessible to a wider audience beyond specialized data engineers.
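The sketch below, assuming Spark 2.x or later with a local master (the app name, column names, and sample rows are illustrative), shows that domain-specific API: data is addressed by named columns rather than by lambdas over raw elements:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-example").master("local[*]").getOrCreate()
    import spark.implicits._

    // Impose a structure (named columns) on a distributed collection of rows.
    val people = Seq(("Alice", 34), ("Bob", 29), ("Cathy", 41)).toDF("name", "age")

    // Domain-specific expressions: refer to data by column name, as with a SQL table.
    people.filter($"age" > 30).select($"name", $"age").show()

    // Aggregations also operate on named columns.
    people.agg(avg($"age").as("avg_age")).show()

    spark.stop()
  }
}
```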

Dataset

Conceptually, a DataFrame can be viewed as an alias for a collection of generic objects, Dataset[Row].

A Dataset, by contrast, is a collection of strongly typed JVM objects, dictated by a case class you define in Scala or a class in Java.
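A minimal sketch in Scala, assuming Spark 2.x or later with a local master (the Person case class and sample rows are hypothetical), illustrates how the case class dictates both the schema and the compile-time element type:

```scala
import org.apache.spark.sql.SparkSession

// The case class dictates the schema and the compile-time type of each element.
case class Person(name: String, age: Long)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ds-example").master("local[*]").getOrCreate()
    import spark.implicits._

    // Dataset[Person]: strongly typed, unlike the generic Dataset[Row] of a DataFrame.
    val people = Seq(Person("Alice", 34), Person("Bob", 29)).toDS()

    // Typed lambda: p.age is checked at compile time; a typo here would not compile.
    people.filter(p => p.age > 30).show()

    spark.stop()
  }
}
```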