Apache Spark: Reading Data
Preparing for Databricks Certified Associate Developer for Apache Spark 2.4 with Python 3
Reading Data - CSV Files
Technical Accomplishments:
- Start working with the API documentation.
- Introduce the class SparkSession and other entry points.
- Introduce the class DataFrameReader.
- Read data from CSV, both without and with a schema (see the sketch after this list).
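Here is a minimal sketch of both reads up front. The file path, column names, and types are assumptions invented for illustration, so substitute your own; the details of each call are covered in the sections that follow.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("reading-csv").getOrCreate()

csv_path = "/mnt/training/pageviews.csv"      # hypothetical file with a header row

# CSV without a schema: columns arrive as strings unless inferSchema is
# enabled, which costs an extra pass over the file.
df_inferred = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(csv_path))

# CSV with a schema: no inference pass; names and types are explicit.
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("site", StringType(), True),
    StructField("requests", IntegerType(), True),
])
df_explicit = (spark.read
    .option("header", "true")
    .schema(schema)
    .csv(csv_path))

df_explicit.printSchema()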
Spark API
Spark API Home Page
- Google for Spark API Latest or Spark API x.x.x for a specific version.
- Select Spark API Documentation - Spark x.x.x Documentation - Apache Spark.
- The set of documentation you use depends on the language you are working in.
Other Documentation:
- Programming Guides for DataFrames, SQL, Graphs, Machine Learning, Streaming…
- Deployment Guides for Spark Standalone, Mesos, Yarn…
- Configuration, Monitoring, Tuning, Security…
Here are some shortcuts:
- Spark API Documentation - Latest
- Spark API Documentation - 2.4.0
- Spark API Documentation - 2.2.0
- Spark API Documentation - 2.1.1
- Spark API Documentation - 2.0.2
- Spark API Documentation - 1.6.3
Spark API (Python)
- Select Spark Python API (Sphinx).
- Look up the documentation for pyspark.sql.SparkSession.
- In the lower-left-hand corner, type SparkSession into the search field.
- Hit [Enter].
- The search results should appear in the right-hand pane.
- Click on pyspark.sql.SparkSession (Python class, in pyspark.sql module)
- The documentation should open in the right-hand pane.
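The same documentation is also available from the docstrings, so you can check it without leaving a notebook or Python shell; for example:

# Pull the class and method docstrings up directly in Python.
from pyspark.sql import SparkSession

help(SparkSession)                  # full class documentation
print(SparkSession.sql.__doc__)     # docstring for a single method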
SparkSession
Quick function review:
- createDataSet(..)
- createDataFrame(..)
- emptyDataSet(..)
- emptyDataFrame(..)
- range(..)
- read(..)
- readStream(..)
- sparkContext(..)
- sqlContext(..)
- sql(..)
- streams(..)
- table(..)
- udf(..)
Note that in the Python API, read, readStream, sparkContext, streams, and udf are properties rather than methods (access them without parentheses), and the Dataset-related members such as createDataset and emptyDataset exist only in the Scala API.
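As a rough sketch of a few of these members in use (the "people" view and its rows are made up purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-session-review").getOrCreate()

df_ids = spark.range(5)                       # single "id" column with values 0-4
df_people = spark.createDataFrame(            # DataFrame from local Python data
    [("Alice", 34), ("Bob", 29)],
    ["name", "age"])

df_people.createOrReplaceTempView("people")   # make it visible to sql() and table()
spark.sql("SELECT name FROM people WHERE age > 30").show()
spark.table("people").printSchema()

print(spark.sparkContext.appName)             # sparkContext is a property, not a call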
DataFrameReader
Look up the documentation for pyspark.sql.DataFrameReader.
Quick function review:
- csv(path)
- jdbc(url, table, ..., connectionProperties)
- json(path)
- format(source)
- load(path)
- orc(path)
- parquet(path)
- table(tableName)
- text(path)
- textFile(path)
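Each of the format-specific methods above is a shortcut that both selects the source format and triggers the read. A hedged sketch with made-up paths, assuming an active SparkSession bound to spark (as in a Databricks notebook):

# Format-specific shortcuts; every path below is a placeholder.
df_csv     = spark.read.csv("/data/events.csv", header=True)   # same as format("csv")...load(...)
df_json    = spark.read.json("/data/events.json")
df_parquet = spark.read.parquet("/data/events.parquet")
df_events  = spark.read.table("events")                        # reads a registered table or view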
Configuration methods:
- option(key, value)
- options(**options)
- schema(schema)
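These configuration methods combine with format(..) and load(..) into the general reading pattern; here is a sketch equivalent to the earlier CSV read (the path and columns are still hypothetical):

# Generic pattern: configure, then load. A DDL-style string is accepted
# by schema(..) as an alternative to a StructType.
df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("sep", ",")
    .schema("timestamp STRING, site STRING, requests INT")
    .load("/data/events.csv"))

The same option(..)/schema(..)/load(..) chain works for any source named in format(..), so switching to JSON or Parquet only changes the format string and the path.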