Apache Spark Overview | CRT020 preparation

Preparing for Databricks Certified Associate Developer for Apache Spark 2.4 with Python 3

Notice: This is a personal learning note for DB105.

What is Apache Spark?

[Figure 1]

  • At its core is the Spark Engine.
  • The DataFrames API provides an abstraction above RDDs while simultaneously improving performance 5-20x over traditional RDDs with its Catalyst Optimizer (see the sketch after this list).
  • Spark ML provides high quality and finely tuned machine learning algorithms for processing big data.
  • The Graph processing API gives us an easily approachable API for modeling pairwise relationships between people, objects, or nodes in a network.
  • The Streaming APIs give us End-to-End Fault Tolerance, with Exactly-Once semantics, and the possibility of sub-millisecond latency.
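As a quick illustration of the DataFrames API and the Catalyst Optimizer mentioned above, here is a minimal PySpark sketch (the data and column names are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("overview-demo").getOrCreate()

# A tiny in-memory DataFrame (hypothetical data).
people = spark.createDataFrame(
    [("Ada", 36), ("Grace", 45), ("Linus", 29)],
    ["name", "age"],
)

# Transformations are lazy; Catalyst optimizes the whole plan before execution.
adults = people.where(col("age") >= 30).select("name")

adults.show()     # an action: triggers execution
adults.explain()  # prints the optimized physical plan
```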

A Unifying Engine

[Figure 2]

  • Built upon the Spark Core
  • Apache Spark is data and environment agnostic.
  • Languages: Scala, Java, Python, R, SQL
  • Environments: YARN, Docker, EC2, Mesos, OpenStack, Databricks (our favorite), Digital Ocean, and much more…
  • Data Sources: Hadoop HDFS, Cassandra, Kafka, Apache Hive, HBase, JDBC (PostgreSQL, MySQL, etc.), CSV, JSON, Azure Blob, Amazon S3, Elasticsearch, Parquet, and much, much more…
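A rough sketch of how several of these sources are read through the same DataFrameReader API, reusing the `spark` session from the previous sketch (the paths and JDBC connection details below are placeholders, not real endpoints):

```python
# Each call returns a DataFrame; only the format and options differ.
csv_df     = spark.read.option("header", "true").csv("/data/events.csv")   # placeholder path
json_df    = spark.read.json("/data/events.json")                          # placeholder path
parquet_df = spark.read.parquet("/data/events.parquet")                    # placeholder path

jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://dbhost:5432/shop")  # placeholder URL
           .option("dbtable", "public.orders")                   # placeholder table
           .option("user", "reader")
           .option("password", "secret")
           .load())
```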

RDDs

  • The primary data abstraction of the Spark engine is the RDD: the Resilient Distributed Dataset.
  • Resilient, i.e., fault-tolerant: with the help of the RDD lineage graph, Spark can recompute partitions that are missing or damaged due to node failures.
  • Distributed, with data residing on multiple nodes in a cluster.
  • Dataset: a collection of partitioned data holding primitive values or composite values, e.g., tuples or other objects.
  • The original paper that gave birth to the concept of RDD is Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia et al.
  • Today, with Spark 2.x, we treat RDDs as the assembly language of the Spark ecosystem.
  • DataFrames, Datasets & SQL provide higher-level abstractions over RDDs.
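A minimal sketch contrasting the RDD "assembly language" with the higher-level DataFrame abstraction (the values are made up for illustration):

```python
# Low-level RDD API: explicit functions applied to partitioned data.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5], numSlices=2)
squares = rdd.map(lambda x: x * x)
print(squares.reduce(lambda a, b: a + b))   # 55

# The same computation as a DataFrame: declarative, optimized by Catalyst.
df = spark.createDataFrame([(x,) for x in range(1, 6)], ["n"])
df.selectExpr("sum(n * n) AS sum_of_squares").show()
```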

Scala, Python, Java, R & SQL

[Figure 3]

  • Besides being able to run in many environments, Apache Spark makes the platform even more approachable by supporting multiple languages:
  • Scala - Apache Spark’s primary language.
  • Python - More commonly referred to as PySpark
  • R - SparkR (R on Spark)
  • Java
  • SQL - Closer to ANSI SQL:2003 compliance (see the SQL sketch after this list):

    • Now running all 99 TPC-DS queries
    • New standards-compliant parser (with good error messages!)
    • Subqueries (correlated & uncorrelated)
    • Approximate aggregate stats
  • With the older RDD API, there are significant differences between each language’s implementation, notably in performance.
  • With the newer DataFrames API, the performance differences between languages are nearly nonexistent (especially for Scala, Java & Python).
  • That said, not all languages get the same amount of love - even so, the API gap between languages is closing rapidly, especially between Spark 1.x and 2.x.
[Figure 4]
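A small sketch of the SQL side, including an uncorrelated subquery (the `people` DataFrame is the hypothetical one from the first sketch):

```python
people.createOrReplaceTempView("people")

spark.sql("""
    SELECT name
    FROM people
    WHERE age > (SELECT avg(age) FROM people)   -- uncorrelated subquery
""").show()
```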

The Cluster: Drivers, Executors, Slots & Tasks

[Figure 5]

  • The Driver is the JVM in which our application runs.
  • The secret to Spark’s awesome performance is parallelism.
  • Scaling vertically is limited to a finite amount of RAM, Threads and CPU speeds.
  • Scaling horizontally means we can simply add new “nodes” to the cluster almost endlessly.
  • We parallelize at two levels:
  • The first level of parallelization is the Executor - a Java virtual machine running on a node, typically, one instance per node.
  • The second level of parallelization is the Slot - the number of which is determined by the number of cores and CPUs of each node.
  • Each Executor has a number of Slots to which parallelized Tasks can be assigned by the Driver.

[Figure 6]

  • The JVM is naturally multithreaded, but a single JVM, such as our Driver, has a finite upper limit.
  • By creating Tasks, the Driver can assign units of work to Slots for parallel execution.
  • Additionally, the Driver must also decide how to partition the data so that it can be distributed for parallel processing (not shown here).
  • Consequently, the Driver assigns a Partition of data to each Task - in this way each Task knows which piece of data it is to process.
  • Once started, each Task will fetch from the original data source the Partition of data assigned to it.
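A sketch of how to see and control the partition count that determines how many Tasks an action will launch (the file path is a placeholder):

```python
events = spark.read.parquet("/data/events.parquet")   # placeholder path

# One Task per partition: this is how many Tasks a full scan would launch.
print(events.rdd.getNumPartitions())

# Having at least as many partitions as Slots keeps every Slot busy;
# repartition() redistributes the data (at the cost of a shuffle).
events = events.repartition(8)
print(events.rdd.getNumPartitions())   # 8
```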

Quick Note on Jobs & Stages

  • Each parallelized action is referred to as a Job.
  • The results of each Job (parallelized/distributed action) are returned to the Driver.
  • Depending on the work, multiple Jobs may be required.
  • Each Job is broken down into Stages.
  • This is analogous to building a house (the Job):
  • The first stage would be to lay the foundation.
  • The second stage would be to erect the walls.
  • The third stage would be to add the roof.
  • Attempting to do any of these steps out of order simply won’t make sense, if it isn’t outright impossible.
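A sketch of how an action becomes a Job that is split into Stages - the shuffle introduced by `groupBy` is the boundary between them:

```python
df = spark.range(1_000_000)                                # narrow, single-stage source
counts = df.groupBy((df.id % 10).alias("bucket")).count()  # requires a shuffle

counts.explain()   # the Exchange in the plan marks the stage boundary
counts.collect()   # the action: triggers one Job, split into (at least) two Stages
```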

Quick Note on Cluster Management

[Figure 7]

  • At a much lower level, Spark Core employs a Cluster Manager that is responsible for provisioning nodes in our cluster.
  • Databricks provides a robust, high-performing Cluster Manager as part of its overall offerings.
  • Additional Cluster Managers are available for Mesos, YARN, and from other third parties.
  • In addition to this, Spark has a Standalone mode in which you manually configure each node.
  • In each of these scenarios, the Driver is [presumably] running on one node, with the Executors running on N different nodes.
  • From a developer’s and a student’s perspective, my primary focus is on…
  • The number of Partitions my data is divided into.
  • The number of Slots I have for parallel execution.
  • How many Jobs am I triggering?
  • And lastly the Stages those jobs are divided into.
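As a closing sketch, the first two of those numbers can be read directly from a running session, while Jobs and Stages are easiest to inspect in the Spark UI (`events` here is the hypothetical DataFrame from the earlier sketch):

```python
# 1. How many Partitions my data is divided into (one Task per partition).
print(events.rdd.getNumPartitions())

# 2. How many Slots are available - typically the total number of Executor cores.
print(spark.sparkContext.defaultParallelism)

# After a shuffle, the next stage defaults to spark.sql.shuffle.partitions partitions.
print(spark.conf.get("spark.sql.shuffle.partitions"))   # "200" unless changed
```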