Apache Spark Overview | CRT020 preparation
Preparing for Databricks Certified Associate Developer for Apache Spark 2.4 with Python 3
Notice: This is a personal learning note of DB105.
What is Apache Spark?
- At its core is the Spark Engine.
- The DataFrames API provides an abstraction above RDDs while simultaneously improving performance 5-20x over traditional RDDs with its Catalyst Optimizer.
- Spark ML provides high quality and finely tuned machine learning algorithms for processing big data.
- The Graph processing API gives us an easily approachable API for modeling pairwise relationships between people, objects, or nodes in a network.
- The Streaming APIs give us End-to-End Fault Tolerance, with Exactly-Once semantics, and the possibility of millisecond-level latency.
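To make the DataFrame bullet above concrete, here is a minimal PySpark sketch; the data and column names are invented purely for illustration. Transformations build a logical plan that the Catalyst Optimizer rewrites before anything executes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overview-demo").getOrCreate()

# Invented sample data for illustration only.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# Transformations are lazy; Catalyst optimizes the plan before execution.
adults = df.filter(df["age"] > 30).select("name")
adults.explain()   # prints the optimized physical plan
adults.show()      # the action that triggers execution
```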
A Unifying Engine
- Built upon the Spark Core
- Apache Spark is data and environment agnostic.
- Languages: Scala, Java, Python, R, SQL
- Environments: YARN, Docker, EC2, Mesos, OpenStack, Databricks (our favorite), DigitalOcean, and much more…
- Data Sources: Hadoop HDFS, Cassandra, Kafka, Apache Hive, HBase, JDBC (PostgreSQL, MySQL, etc.), CSV, JSON, Azure Blob, Amazon S3, Elasticsearch, Parquet, and much, much more… (a few readers are sketched below)
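As a taste of that data-source agnosticism, the sketch below reads a few of the formats listed above through the one unified reader API. Every path, host, table name, and credential is a placeholder, and it assumes a SparkSession named `spark` is already available (as it is in a Databricks notebook).

```python
# Placeholders only: the paths, hosts, and credentials below are not real.
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/path/to/data.csv"))

json_df = spark.read.json("/path/to/data.json")
parquet_df = spark.read.parquet("/path/to/data.parquet")

# JDBC sources (PostgreSQL, MySQL, etc.) go through the same unified API.
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://host:5432/db")   # placeholder URL
           .option("dbtable", "public.some_table")            # placeholder table
           .option("user", "user")
           .option("password", "password")
           .load())
```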
RDDs
- The primary data abstraction of the Spark engine is the RDD: the Resilient Distributed Dataset.
- Resilient, i.e., fault-tolerant with the help of RDD lineage graph and so able to recompute missing or damaged partitions due to node failures.
- Distributed with data residing on multiple nodes in a cluster.
- Dataset, i.e., a collection of partitioned data holding primitive values or compound values, e.g., tuples or other objects.
- The original paper that gave birth to the concept of RDD is Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia et al.
- Today, with Spark 2.x, we treat RDDs as the assembly language of the Spark ecosystem.
- DataFrames, Datasets & SQL provide a higher-level abstraction over RDDs (see the sketch after this list).
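A small sketch contrasting the two levels of abstraction; it assumes a SparkSession named `spark` is already available, and the numbers are arbitrary.

```python
sc = spark.sparkContext

# Low-level RDD API: explicit functions over partitioned objects.
rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)
squared = rdd.map(lambda x: x * x)
print(squared.toDebugString())   # the lineage graph used to recompute lost partitions
print(squared.collect())

# Higher-level DataFrame API over the same engine; a DataFrame is still
# backed by an RDD of rows underneath.
df = spark.range(1, 6)
df.selectExpr("id * id AS id_squared").show()
```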
Scala, Python, Java, R & SQL
- Besides being able to run in many environments…
- Apache Spark makes the platform even more approachable by supporting multiple languages:
- Scala - Apache Spark’s primary language.
- Python - More commonly referred to as PySpark
- R - SparkR (R on Spark)
- Java
- SQL - Closer to ANSI SQL 2003 compliance (see the SQL sketch after this list):
- Now running all 99 TPC-DS queries
- New standards-compliant parser (with good error messages!)
- Subqueries (correlated & uncorrelated)
- Approximate aggregate stats
- With the older RDD API, there are significant differences between each language's implementation, most notably in performance.
- With the newer DataFrames API, the performance differences between languages are nearly nonexistent (especially for Scala, Java & Python).
- That said, not all languages get the same amount of attention; even so, the API gap between languages is closing rapidly, especially between Spark 1.x and 2.x.
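As a taste of the SQL support called out above, this hedged sketch registers a tiny invented table and runs an uncorrelated subquery plus an approximate aggregate; it assumes a SparkSession named `spark` is already available.

```python
# Invented data and names, for illustration only.
spark.createDataFrame(
    [("US", 100), ("US", 250), ("DE", 75), ("FR", 310)],
    ["country", "amount"],
).createOrReplaceTempView("orders")

# An uncorrelated subquery and an approximate aggregate, both part of the
# standards-oriented parser's feature set in Spark 2.x.
spark.sql("""
    SELECT country,
           approx_count_distinct(amount) AS approx_distinct_amounts
    FROM orders
    WHERE amount > (SELECT avg(amount) FROM orders)
    GROUP BY country
""").show()
```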
The Cluster: Drivers, Executors, Slots & Tasks
- The Driver is the JVM in which our application runs.
- The secret to Spark’s awesome performance is parallelism.
- Scaling vertically is limited by a finite amount of RAM, threads, and CPU speed.
- Scaling horizontally means we can simply add new “nodes” to the cluster almost endlessly.
- We parallelize at two levels:
- The first level of parallelization is the Executor - a Java virtual machine running on a node, typically, one instance per node.
- The second level of parallelization is the Slot - the number of which is determined by the number of cores and CPUs of each node.
- Each Executor has a number of Slots to which parallelized Tasks can be assigned by the Driver.
- The JVM is naturally multithreaded, but a single JVM, such as our Driver, has a finite upper limit.
- By creating Tasks, the Driver can assign units of work to Slots for parallel execution.
- Additionally, the Driver decides how to partition the data so that it can be distributed for parallel processing.
- Consequently, the Driver assigns a Partition of data to each Task; in this way each Task knows which piece of data it is to process.
- Once started, each Task will fetch from the original data source the Partition of data assigned to it.
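The sketch below peeks at these same building blocks from code: Slots via defaultParallelism, Partitions via getNumPartitions, and the Tasks an action schedules. It assumes a SparkSession named `spark` is already available; the exact numbers depend on your cluster.

```python
sc = spark.sparkContext

# Roughly the total number of Slots (cores) available to this application.
print("Default parallelism (slots):", sc.defaultParallelism)

# Each partition of this DataFrame becomes one Task when an action runs.
df = spark.range(0, 1000000, numPartitions=8)
print("Partitions:", df.rdd.getNumPartitions())

df.count()   # an action: the Driver schedules one Task per Partition onto Slots
```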
Quick Note on Jobs & Stages
- Each parallelized action is referred to as a Job.
- The results of each Job (parallelized/distributed action) are returned to the Driver.
- Depending on the work required, multiple Jobs may be needed.
- Each Job is broken down into Stages.
- This would be analogous to building a house (the job)
- The first stage would be to lay the foundation.
- The second stage would be to erect the walls.
- The third stage would be to add the roof.
- Attempting to do any of these steps out of order simply doesn't make sense, if it isn't outright impossible.
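To see Jobs and Stages from code, the hedged sketch below groups by an invented bucket column; the groupBy forces a shuffle, so the triggering action runs as one Job split into two Stages at the exchange. It assumes a SparkSession named `spark` is already available.

```python
df = spark.range(0, 100000)

# groupBy forces a shuffle, so the collect() below runs as a single Job
# broken into two Stages: one before the Exchange and one after it.
buckets = df.groupBy((df["id"] % 10).alias("bucket")).count()

buckets.explain()   # the physical plan shows the Exchange (shuffle) boundary
buckets.collect()   # the action that actually triggers the Job
```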
Quick Note on Cluster Management
- At a much lower level, Spark Core employs a Cluster Manager that is responsible for provisioning nodes in our cluster.
- Databricks provides a robust, high-performing Cluster Manager as part of its overall offerings.
- Additional Cluster Managers are available for Mesos, YARN, and from other third parties.
- In addition to this, Spark has a Standalone mode in which you manually configure each node.
- In each of these scenarios, the Driver is [presumably] running on one node, with the Executors running on N different nodes.
- From a developer's and student's perspective, my primary focus is on…
- The number of Partitions my data is divided into.
- The number of Slots I have for parallel execution.
- How many Jobs am I triggering?
- And lastly the Stages those jobs are divided into.
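For completeness, here is a hedged sketch of how an application is pointed at a cluster manager when you build the SparkSession yourself; the master URLs and settings are placeholders, and on Databricks this wiring is handled for you.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("crt020-notes")
         # .master("yarn")                    # YARN-managed cluster
         # .master("mesos://host:5050")       # Mesos (placeholder host)
         # .master("spark://host:7077")       # Standalone mode (placeholder host)
         .master("local[4]")                  # local mode: 4 slots, no cluster manager
         .config("spark.sql.shuffle.partitions", "8")  # partitions produced by a shuffle
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)  # slots available for Tasks
```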