Overview

Datasets

DataFrames

SQL tables and views

DataFrames and Datasets

Definition

  • Table-like collections with well-defined rows and columns

  • Each column must have the same number of rows as all the other columns

  • Immutable, lazily evaluated plans that specify what operations to apply to data residing at some location to generate some output (see the sketch below)
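
A minimal sketch of these properties, assuming a local SparkSession; the column name `number` and the variable names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()

val df    = spark.range(500).toDF("number")  // a one-column, 500-row DataFrame
val evens = df.where("number % 2 = 0")       // returns a NEW plan; df itself is unchanged
evens.show(5)                                // action: only now is the lazy plan executed
```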

Schemas

A schema defines the column names and types of a DataFrame
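
A sketch of declaring a schema by hand; the field names here are illustrative:

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// A schema is a StructType: an ordered list of named, typed, nullable fields
val schema = StructType(Seq(
  StructField("name",  StringType, nullable = true),
  StructField("count", LongType,   nullable = false)
))
schema.printTreeString()  // prints the column names and types, like df.printSchema()
```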

Overview of Structured Spark Types

Definition

Spark types map directly to the different language APIs through a lookup table
Spark uses an engine called Catalyst that maintains its own type information through the planning and processing of work

Example

(Figure 1: an addition expression converted to Catalyst's representation)
This addition works because Spark converts an expression written in an input language to its internal Catalyst representation of that same type information, and operates on that representation
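
A sketch of that kind of expression (variable names are illustrative): the `+ 10` below is written in Scala but is never executed as Scala arithmetic; it becomes a Catalyst expression over a LongType column.

```scala
val nums    = spark.range(500).toDF("number")
val plusTen = nums.select(nums.col("number") + 10)  // Column + Int builds a Catalyst Add expression
```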

DataFrames Versus Datasets

When types are checked

  • Compile time (Datasets)

  • Runtime (DataFrames)

How types are represented

  • JVM types (Datasets)

  • Datasets of type Row, Spark's optimized internal format (DataFrames)

Both benefit from the Catalyst engine (see the sketch below)
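
A sketch of the difference, assuming a SparkSession named `spark`; the `Flight` case class and the Parquet path are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._  // encoders for .as[Flight]

case class Flight(origin: String, dest: String, count: Long)

// Dataset: the element type is checked at compile time
val ds: Dataset[Flight] = spark.read.parquet("/tmp/flights").as[Flight]  // hypothetical path
ds.filter(f => f.count > 10)   // a typo such as f.cnt would fail to compile

// DataFrame = Dataset[Row]: column references are checked only at runtime
val df: DataFrame = ds.toDF()
df.filter("count > 10")        // "cnt > 10" would fail only when the job runs
```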

Columns

  • a simple type, like an integer or string

  • a complex type, like an array or map

  • a null value
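
A sketch of each kind of column expression; the column names are illustrative:

```scala
import org.apache.spark.sql.functions.{col, expr}

val simple  = col("count") + 5            // simple type: integer arithmetic
val mapKeys = expr("map_keys(someMap)")   // complex type: the keys of a map column
val noNulls = expr("coalesce(count, 0)")  // replacing null values with a default
```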

Rows

Each record in a DataFrame must be of type Row
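
A sketch of constructing and reading a Row by hand; access is positional, and the caller asserts the type:

```scala
import org.apache.spark.sql.Row

val myRow = Row("Hello", null, 1, false)  // fields of mixed types, including null
myRow.getString(0)  // "Hello"
myRow.get(1)        // null
myRow.getInt(2)     // 1
```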

Spark Types

Scala type reference

(Figure 2: Scala type reference table)
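
The type objects live in `org.apache.spark.sql.types`; a sketch of referring to a few of them directly:

```scala
import org.apache.spark.sql.types._

val b = ByteType     // corresponds to Scala's Byte
val i = IntegerType  // corresponds to Scala's Int
val s = StringType   // corresponds to Scala's String
```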

Overview of Structured API Execution

Steps

1. Write Dataframe/Dataset/SQL Code

2. If the code is valid, Spark converts it to a Logical Plan

(Figure 3: the logical planning process)

  • Unresolved Logical Plan

  • Resolved Logical Plan

    • Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer

  • Optimized Logical Plan

    • for example, pushing down predicates or selections (see the sketch after this list)
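
One way to watch these stages, assuming the `nums` DataFrame from the earlier sketch: `explain(true)` prints the parsed (unresolved), analyzed (resolved), and optimized logical plans, followed by the physical plan.

```scala
val query = nums.where("number > 100").select("number")
query.explain(true)  // parsed, analyzed, and optimized logical plans + physical plan
```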

3. Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way

  • generates different physical plans

  • compares them through a cost model

  • chooses the best physical plan

An example of the cost comparison is choosing how to perform a given join by looking at the physical attributes of a given table (how big the table is, or how big its partitions are); see the broadcast-hint sketch below.

  • results in a series of RDDs and transformations
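
A sketch of nudging that join decision (the DataFrames `largeDf` and `smallDf` are illustrative): the broadcast hint tells the planner a table is small enough for a broadcast hash join, the kind of choice the cost model otherwise makes from table and partition statistics.

```scala
import org.apache.spark.sql.functions.broadcast

val joined = largeDf.join(broadcast(smallDf), "id")  // hint: smallDf fits in memory
joined.explain()  // the physical plan should show a BroadcastHashJoin
```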

4. Spark then executes this Physical Plan (RDD manipulations) on the cluster

  • Runs all of this code over RDDs, the lower-level programming interface of Spark

  • Spark performs further optimizations at runtime, generating native Java bytecode that can remove entire tasks or stages during execution
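
A sketch of peeking at that lower level, reusing `query` from the earlier sketch; `.rdd` exposes the underlying RDD the plan compiles down to.

```scala
val rdd = query.rdd          // the query's underlying RDD of Rows
println(rdd.toDebugString)   // lineage of the generated RDD stages
```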
