Apache Spark: Reading Data

Preparing for Databricks Certified Associate Developer for Apache Spark 2.4 with Python 3

Reading Data - CSV Files

Technical Accomplishments:

  • Start working with the API documentation
  • Introduce the class SparkSession and other entry points
  • Introduce the class DataFrameReader
  • Read data from:

    • CSV without a Schema.
    • CSV with a Schema.

Spark API

Spark API Home Page

  1. Search the web for "Spark API Latest", or for "Spark API x.x.x" to find a specific version.
  2. Select the result titled Spark API Documentation - Spark x.x.x Documentation - Apache Spark.
  3. Choose the documentation set that matches the language you are using.

Other Documentation:

  • Programming Guides for DataFrames, SQL, Graphs, Machine Learning, Streaming…
  • Deployment Guides for Spark Standalone, Mesos, Yarn…
  • Configuration, Monitoring, Tuning, Security…

Here are some shortcuts

Spark API (Python)

  1. Select Spark Python API (Sphinx).
  2. Look up the documentation for pyspark.sql.SparkSession.
  3. In the lower-left corner, type SparkSession into the search field.
  4. Hit [Enter].
  5. The search results should appear in the right-hand pane.
  6. Click on pyspark.sql.SparkSession (Python class, in pyspark.sql module)
  7. The documentation should open in the right-hand pane.

SparkSession

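Quick function review (the list follows the Scala API; differences in PySpark are noted):

  • createDataset(..) (Scala/Java API only)
  • createDataFrame(..)
  • emptyDataset(..) (Scala/Java API only)
  • emptyDataFrame (Scala/Java API only)
  • range(..)
  • read (a property in PySpark, used without parentheses)
  • readStream (a property in PySpark)
  • sparkContext (a property in PySpark)
  • sqlContext (Scala/Java API only)
  • sql(..)
  • streams (a property in PySpark)
  • table(..)
  • udf (a property in PySpark)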
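
The entry points listed above can be tried out with a short script. The following is a minimal sketch, assuming a local PySpark installation; the helper names build_session and tour, the application name, and the sample rows are made up for illustration. Note that in PySpark, read is a property, so it is accessed without parentheses.

```python
def build_session(app_name="reading-data-demo"):
    """Create (or reuse) a local SparkSession. Requires PySpark to be installed."""
    # Imported here so the rest of the file can be loaded without Spark present.
    from pyspark.sql import SparkSession

    return (SparkSession.builder
            .master("local[*]")      # run locally, using all available cores
            .appName(app_name)
            .getOrCreate())


def tour(spark):
    """Exercise a few SparkSession members and return the resulting objects."""
    # range(..): a DataFrame with a single `id` column holding 0..4
    numbers = spark.range(5)

    # createDataFrame(..): build a DataFrame from local Python data
    people = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["id", "name"])

    # Register a temporary view so that sql(..) and table(..) can refer to it
    people.createOrReplaceTempView("people")
    by_sql = spark.sql("SELECT name FROM people WHERE id = 1")
    by_table = spark.table("people")

    # read: a property that returns a DataFrameReader
    reader = spark.read

    return numbers, people, by_sql, by_table, reader
```

With PySpark available, calling tour(build_session()) returns the five objects, and calling show() on any of the returned DataFrames displays its rows.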

DataFrameReader

Look up the documentation for DataFrameReader.

Quick function review:

  • csv(path)
  • jdbc(url, table, ..., connectionProperties)
  • json(path)
  • format(source)
  • load(path)
  • orc(path)
  • parquet(path)
  • table(tableName)
  • text(path)
  • textFile(path) (Scala/Java API only; in PySpark use text(path))
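
To illustrate reading a CSV file without a schema, here is a minimal sketch, assuming a local PySpark installation. The file name people.csv, its contents, and all helper names are hypothetical. It shows csv(path) next to the equivalent format(source) plus load(path) form.

```python
import csv
import os
import tempfile


def write_sample_csv(directory):
    """Write a tiny CSV file with a header row and return its path."""
    path = os.path.join(directory, "people.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "name", "score"])
        writer.writerows([[1, "Ada", 9.5], [2, "Grace", 8.75]])
    return path


def read_csv_plain(spark, path):
    """csv(path) with default settings and no schema supplied."""
    return spark.read.csv(path)


def read_csv_via_format(spark, path):
    """The long form of the same read: format(source) followed by load(path)."""
    return spark.read.format("csv").load(path)


def demo():
    """End-to-end run; call this only where PySpark is installed."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("csv-demo").getOrCreate()
    with tempfile.TemporaryDirectory() as directory:
        df = read_csv_plain(spark, write_sample_csv(directory))
        df.printSchema()
        df.show()
    spark.stop()
```

With the default settings the header line is treated as ordinary data, the columns are named _c0, _c1, and so on, and every column is read as a string.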

Configuration methods:

  • option(key, value)
  • options(map) (in PySpark, pass keyword arguments instead: options(**options))
  • schema(schema)
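
The configuration methods are chained on the reader before the call that loads the data. The sketch below assumes a local PySpark installation and a CSV file with a header row and the columns id, name, and score; the column names, the constant PEOPLE_DDL, and the helper names are hypothetical. The schema is given as a DDL-formatted string, which schema(..) accepts in place of a StructType.

```python
# Hypothetical schema for a file with the columns id, name, score.
PEOPLE_DDL = "id INT, name STRING, score DOUBLE"


def read_csv_inferred(spark, path):
    """Let Spark work out the column types; this costs an extra pass over the data."""
    return (spark.read
            .option("header", "true")        # first line holds the column names
            .option("inferSchema", "true")   # sample the data to guess the types
            .csv(path))


def read_csv_with_schema(spark, path, schema=PEOPLE_DDL):
    """Supply the schema up front so Spark does not need to infer it."""
    return (spark.read
            .options(header="true", sep=",")  # several options set in one call
            .schema(schema)
            .csv(path))
```

Given a SparkSession named spark and a path to such a file, read_csv_with_schema(spark, path).printSchema() prints the declared column types. Supplying the schema avoids the extra pass over the file that inferSchema needs.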
