RDD
At its core, an RDD is an immutable, distributed collection of data elements, partitioned across the nodes of your cluster, that can be operated on in parallel through a low-level API offering transformations and actions.
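To make the split between transformations and actions concrete, here is a minimal sketch of a word count on an RDD. It assumes a local Spark installation; the object name RddSketch, the helper wordCounts, and the sample sentences are made up for illustration and are not part of Spark.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  // Counts words with the low-level RDD API (hypothetical helper).
  def wordCounts(spark: SparkSession, lines: Seq[String]): Map[String, Int] = {
    // Distribute a local collection across the cluster as an RDD.
    val rdd = spark.sparkContext.parallelize(lines)

    // Transformations are lazy: each one returns a new, immutable RDD.
    val counts = rdd
      .flatMap(_.split("\\s+"))   // lines -> words
      .filter(_.nonEmpty)         // drop empty tokens
      .map(word => (word, 1))     // word -> (word, 1)
      .reduceByKey(_ + _)         // sum the counts per word

    // collect() is an action: it triggers the computation and
    // brings the results back to the driver.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with all available cores.
    val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
    try {
      wordCounts(spark, Seq("spark makes rdds", "rdds are immutable")).foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

Nothing runs until collect() is called; the chain before it only describes the work to be done.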
When to use RDDs?
Consider using RDDs when:
- you want low-level transformations and actions, and fine-grained control over your dataset;
- your data is unstructured, such as media streams or streams of text;
- you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
- you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column; and
- you can forgo some of the optimization and performance benefits that DataFrames and Datasets offer for structured and semi-structured data.
DataFrames and Datasets are built on top of RDDs.
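One way to see this relationship, sketched under the assumption of a local Spark session: a DataFrame exposes its contents as an RDD of Row objects through its rdd member. The object name UnderlyingRddSketch and the sample data are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object UnderlyingRddSketch {
  // Reads the "name" column through the RDD view of a DataFrame.
  def namesViaRdd(spark: SparkSession): Seq[String] = {
    import spark.implicits._   // enables toDF on local collections

    val df = Seq(("Ann", 34), ("Bob", 45)).toDF("name", "age")

    // The same data, viewed as an RDD of generic Row objects.
    val rows: RDD[Row] = df.rdd

    // From here on, only the low-level RDD API is used.
    rows.map(_.getString(0)).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-view-sketch").master("local[*]").getOrCreate()
    try {
      namesViaRdd(spark).foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```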
DataFrame
Unlike an RDD, a DataFrame organizes data into named columns, like a table in a relational database. Designed to make processing large data sets even easier, a DataFrame lets developers impose a structure onto a distributed collection of data, allowing a higher-level abstraction; it provides a domain-specific language API to manipulate your distributed data; and it makes Spark accessible to a wider audience, beyond specialized data engineers.
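A small sketch of what named columns and the domain-specific API look like in practice. The column names, sample rows, object name DataFrameSketch, and helper avgAgeByDept are all assumptions chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

object DataFrameSketch {
  // Average age per department, counting only people older than minAge.
  def avgAgeByDept(spark: SparkSession, minAge: Int): DataFrame = {
    import spark.implicits._   // enables toDF on local collections

    // Impose a structure (named columns) on a collection of tuples.
    val people = Seq(
      ("Ann", "eng", 34),
      ("Bob", "eng", 45),
      ("Cid", "ops", 29)
    ).toDF("name", "dept", "age")

    // Columns are referred to by name, much like in a SQL query.
    people
      .filter(col("age") > minAge)
      .groupBy("dept")
      .agg(avg("age").as("avg_age"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
    try {
      avgAgeByDept(spark, 30).show()
    } finally {
      spark.stop()
    }
  }
}
```

Compared with the RDD style, the query says what to compute rather than how, which is what leaves room for Spark to optimize it.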
Dataset
Conceptually, a DataFrame is an alias for Dataset[Row], a collection of generic Row objects.
A Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.
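The difference between the two views can be sketched as follows, assuming a local Spark session. The case class Person, the object name DatasetSketch, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hypothetical record type; its fields dictate the Dataset's schema.
case class Person(name: String, age: Int)

object DatasetSketch {
  // Typed view: elements are Person objects, checked at compile time.
  def adults(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._   // provides the encoder for Person

    val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 15)).toDS()

    // Fields are accessed as ordinary Scala members; a misspelled
    // field name here would be a compile error, not a runtime one.
    people.filter(_.age >= 18)
  }

  // Untyped view: the same data as generic Row objects.
  // This compiles because DataFrame is a type alias for Dataset[Row].
  def asRows(people: Dataset[Person]): Dataset[Row] = people.toDF()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ds-sketch").master("local[*]").getOrCreate()
    try {
      val typed = adults(spark)
      typed.show()
      asRows(typed).printSchema()
    } finally {
      spark.stop()
    }
  }
}
```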