Apache Spark Overview | CRT020 preparation
Preparing for Databricks Certified Associate Developer for Apache Spark 2.4 with Python 3
Notice: This is a personal learning note of DB105.
What is Apache Spark?
- At its core is the Spark Engine.
- The DataFrames API provides an abstraction above RDDs while simultaneously improving performance 5-20x over traditional RDDs with its Catalyst Optimizer.
- Spark ML provides high quality and finely tuned machine learning algorithms for processing big data.
- The Graph processing API gives us an easily approachable API for modeling pairwise relationships between people, objects, or nodes in a network.
- The Streaming APIs give us End-to-End Fault Tolerance, with Exactly-Once semantics, and the possibility of millisecond-level latency.
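To make the DataFrame bullet above concrete, here is a minimal PySpark sketch; the data and column names are invented purely for illustration. Transformations build a logical plan that the Catalyst Optimizer rewrites before anything executes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overview-demo").getOrCreate()

# Invented sample data for illustration only.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# Transformations are lazy; Catalyst optimizes the plan before execution.
adults = df.filter(df["age"] > 30).select("name")
adults.explain()   # prints the optimized physical plan
adults.show()      # the action that triggers execution
```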
A Unifying Engine
- Built upon the Spark Core
- Apache Spark is data and environment agnostic.
- Languages: Scala, Java, Python, R, SQL
- Environments: YARN, Docker, EC2, Mesos, OpenStack, Databricks (our favorite), DigitalOcean, and much more…
- Data Sources: Hadoop HDFS, Cassandra, Kafka, Apache Hive, HBase, JDBC (PostgreSQL, MySQL, etc.), CSV, JSON, Azure Blob, Amazon S3, Elasticsearch, Parquet, and much, much more… (a few readers are sketched below)
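As a taste of that data-source agnosticism, the sketch below reads a few of the formats listed above through the one unified reader API. Every path, host, table name, and credential is a placeholder, and it assumes a SparkSession named `spark` is already available (as it is in a Databricks notebook).

```python
# Placeholders only: the paths, hosts, and credentials below are not real.
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/path/to/data.csv"))

json_df = spark.read.json("/path/to/data.json")
parquet_df = spark.read.parquet("/path/to/data.parquet")

# JDBC sources (PostgreSQL, MySQL, etc.) go through the same unified API.
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://host:5432/db")   # placeholder URL
           .option("dbtable", "public.some_table")            # placeholder table
           .option("user", "user")
           .option("password", "password")
           .load())
```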
RDDs
- The primary data abstraction of the Spark engine is the RDD: the Resilient Distributed Dataset.
- Resilient, i.e., fault-tolerant with the help of RDD lineage graph and so able to recompute missing or damaged partitions due to node failures.
- Distributed with data residing on multiple nodes in a cluster.
- Dataset, i.e., a collection of partitioned data holding primitive values or compound values, e.g., tuples or other objects.
- The original paper that gave birth to the concept of RDD is Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia et al.
- Today, with Spark 2.x, we treat RDDs as the assembly language of the Spark ecosystem.
- DataFrames, Datasets & SQL provide a higher-level abstraction over RDDs (see the sketch after this list).
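A small sketch contrasting the two levels of abstraction; it assumes a SparkSession named `spark` is already available, and the numbers are arbitrary.

```python
sc = spark.sparkContext

# Low-level RDD API: explicit functions over partitioned objects.
rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)
squared = rdd.map(lambda x: x * x)
print(squared.toDebugString())   # the lineage graph used to recompute lost partitions
print(squared.collect())

# Higher-level DataFrame API over the same engine; a DataFrame is still
# backed by an RDD of rows underneath.
df = spark.range(1, 6)
df.selectExpr("id * id AS id_squared").show()
```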
Scala, Python, Java, R & SQL
- Besides being able to run in many environments…
- Apache Spark makes the platform even more approachable by supporting multiple languages:
- Scala - Apache Spark’s primary language.
- Python - More commonly referred to as PySpark
- R - SparkR (R on Spark)
- Java
- SQL - Closer to ANSI SQL 2003 compliance (see the SQL sketch after this list):
- Now running all 99 TPC-DS queries
- New standards-compliant parser (with good error messages!)
- Subqueries (correlated & uncorrelated)
- Approximate aggregate stats
- With the older RDD API, there are significant differences between each language's implementation, most notably in performance.
- With the newer DataFrames API, the performance differences between languages are nearly nonexistent (especially for Scala, Java & Python).
- That said, not all languages get the same amount of attention; even so, the API gap between languages is closing rapidly, especially between Spark 1.x and 2.x.
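As a taste of the SQL support called out above, this hedged sketch registers a tiny invented table and runs an uncorrelated subquery plus an approximate aggregate; it assumes a SparkSession named `spark` is already available.

```python
# Invented data and names, for illustration only.
spark.createDataFrame(
    [("US", 100), ("US", 250), ("DE", 75), ("FR", 310)],
    ["country", "amount"],
).createOrReplaceTempView("orders")

# An uncorrelated subquery and an approximate aggregate, both part of the
# standards-oriented parser's feature set in Spark 2.x.
spark.sql("""
    SELECT country,
           approx_count_distinct(amount) AS approx_distinct_amounts
    FROM orders
    WHERE amount > (SELECT avg(amount) FROM orders)
    GROUP BY country
""").show()
```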
The Cluster: Drivers, Executors, Slots & Tasks
- The Driver is the JVM in which our application runs.
- The secret to Spark’s awesome performance is parallelism.
- Scaling vertically is limited by a finite amount of RAM, threads, and CPU speed.
- Scaling horizontally means we can simply add new “nodes” to the cluster almost endlessly.
- We parallelize at two levels:
- The first level of parallelization is the Executor - a Java virtual machine running on a node, typically, one instance per node.
- The second level of parallelization is the Slot - the number of which is determined by the number of cores and CPUs of each node.
- Each Executor has a number of Slots to which parallelized Tasks can be assigned by the Driver.
- The JVM is naturally multithreaded, but a single JVM, such as our Driver, has a finite upper limit.
- By creating Tasks, the Driver can assign units of work to Slots for parallel execution.
- Additionally, the Driver decides how to partition the data so that it can be distributed for parallel processing.
- Consequently, the Driver assigns a Partition of data to each Task; in this way each Task knows which piece of data it is to process.
- Once started, each Task will fetch from the original data source the Partition of data assigned to it.
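The sketch below peeks at these same building blocks from code: Slots via defaultParallelism, Partitions via getNumPartitions, and the Tasks an action schedules. It assumes a SparkSession named `spark` is already available; the exact numbers depend on your cluster.

```python
sc = spark.sparkContext

# Roughly the total number of Slots (cores) available to this application.
print("Default parallelism (slots):", sc.defaultParallelism)

# Each partition of this DataFrame becomes one Task when an action runs.
df = spark.range(0, 1000000, numPartitions=8)
print("Partitions:", df.rdd.getNumPartitions())

df.count()   # an action: the Driver schedules one Task per Partition onto Slots
```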
Quick Note on Jobs & Stages
- Each parallelized action is referred to as a Job.
- The results of each Job (parallelized/distributed action) are returned to the Driver.
- Depending on the work required, multiple Jobs may be needed.
- Each Job is broken down into Stages.
- This would be analogous to building a house (the job)
- The first stage would be to lay the foundation.
- The second stage would be to erect the walls.
- The third stage would be to add the roof.
- Attempting to do any of these steps out of order simply doesn't make sense, if it isn't outright impossible.
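To see Jobs and Stages from code, the hedged sketch below groups by an invented bucket column; the groupBy forces a shuffle, so the triggering action runs as one Job split into two Stages at the exchange. It assumes a SparkSession named `spark` is already available.

```python
df = spark.range(0, 100000)

# groupBy forces a shuffle, so the collect() below runs as a single Job
# broken into two Stages: one before the Exchange and one after it.
buckets = df.groupBy((df["id"] % 10).alias("bucket")).count()

buckets.explain()   # the physical plan shows the Exchange (shuffle) boundary
buckets.collect()   # the action that actually triggers the Job
```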
Quick Note on Cluster Management
- At a much lower level, Spark Core employs a Cluster Manager that is responsible for provisioning nodes in our cluster.
- Databricks provides a robust, high-performing Cluster Manager as part of its overall offerings.
- Additional Cluster Managers are available for Mesos, YARN, and from other third parties.
- In addition to this, Spark has a Standalone mode in which you manually configure each node.
- In each of these scenarios, the Driver is [presumably] running on one node, with the Executors running on N different nodes.
- From a developer's and student's perspective, my primary focus is on…
- The number of Partitions my data is divided into.
- The number of Slots I have for parallel execution.
- How many Jobs am I triggering?
- And lastly the Stages those jobs are divided into.
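For completeness, here is a hedged sketch of how an application is pointed at a cluster manager when you build the SparkSession yourself; the master URLs and settings are placeholders, and on Databricks this wiring is handled for you.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("crt020-notes")
         # .master("yarn")                    # YARN-managed cluster
         # .master("mesos://host:5050")       # Mesos (placeholder host)
         # .master("spark://host:7077")       # Standalone mode (placeholder host)
         .master("local[4]")                  # local mode: 4 slots, no cluster manager
         .config("spark.sql.shuffle.partitions", "8")  # partitions produced by a shuffle
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)  # slots available for Tasks
```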