Apache Spark: Reading Data - Parquet Files

Preparing for Databricks Certified Associate Developer for Apache Spark 2.4 with Python 3

See also: Apache Parquet, a column-oriented file format - https://parquet.apache.org

Technical Accomplishments:

  • Read data from:

    • Parquet files without a schema.
    • Parquet files with a schema.

  parquetFile = "/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/"

  (spark.read                  # The DataFrameReader
    .parquet(parquetFile)      # Creates a DataFrame from Parquet after reading in the file
    .printSchema()             # Print the DataFrame's schema
  )
  • We do not need to specify the schema - the column names and data types are stored in the Parquet files themselves.
  • Only one job is required to read that schema from the Parquet file's metadata.
  • Unlike the CSV or JSON readers, which have to load the entire file and then infer the schema, the Parquet reader can "read" the schema very quickly because it reads it from the metadata, as the sketch below illustrates.
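
To see the cost difference in practice, compare the metadata-only Parquet read with a delimited-file read that infers its schema. The sketch below is illustrative only: the TSV path and its options are assumptions, not part of this lesson, so substitute any delimited file you have mounted.

  # Illustrative comparison - the TSV path below is an assumption, not part of this dataset's docs
  csvFile = "/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv"

  # CSV/TSV: inferring the schema forces Spark to scan the data before the schema is known
  csvDF = (spark.read
    .option("header", "true")
    .option("sep", "\t")
    .option("inferSchema", "true")   # triggers an extra job to scan the file
    .csv(csvFile)
  )
  csvDF.printSchema()

  # Parquet: the schema is read from the file metadata, so only one small job runs
  parquetDF = spark.read.parquet(parquetFile)
  parquetDF.printSchema()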

Read in the Parquet Files w/Schema

To avoid even that one job, we can, again, specify the schema for Parquet files:

WARNING Providing a schema avoids the one-time hit of determining the DataFrame's schema.
However, if you specify the wrong schema, it will conflict with the true schema and result in an analysis exception at runtime (see the sketch after the example below).

  # Required for StructField, StringType, IntegerType, etc.
  from pyspark.sql.types import *

  parquetSchema = StructType(
    [
      StructField("timestamp", StringType(), False),
      StructField("site", StringType(), False),
      StructField("requests", IntegerType(), False)
    ]
  )

  (spark.read                  # The DataFrameReader
    .schema(parquetSchema)     # Use the specified schema
    .parquet(parquetFile)      # Creates a DataFrame from Parquet after reading in the file
    .printSchema()             # Print the DataFrame's schema
  )
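
The warning above can be made concrete with a short, hedged sketch that is not part of the original lesson. Below, the requests column is deliberately declared as a string even though the Parquet files store it as an integer; because the reader is lazy, the mismatch typically surfaces only when an action forces Spark to decode the data, and the exact exception varies with the Spark version and the nature of the mismatch.

  # A hedged illustration of the warning above - not part of the original lesson.
  # "requests" is deliberately declared as a string even though the Parquet
  # files store it as an integer column.
  from pyspark.sql.types import StructType, StructField, StringType

  badSchema = StructType(
    [
      StructField("timestamp", StringType(), False),
      StructField("site", StringType(), False),
      StructField("requests", StringType(), False)   # wrong type on purpose
    ]
  )

  badDF = (spark.read
    .schema(badSchema)         # The user-supplied schema is trusted as-is
    .parquet(parquetFile)      # Defining the DataFrame does not read the data yet
  )

  # The mismatch typically surfaces only when an action decodes the data, e.g.:
  # badDF.count()              # expected to fail with a schema/type conversion error

Whatever the exact error, the takeaway is the same: a user-supplied schema is trusted as-is, so it must match what is actually stored in the Parquet metadata.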