Big Data processing using Apache Spark
During the training, you will learn how to use the Apache Spark framework to process large amounts of data quickly.
Purpose of the training
This course provides an introduction to the Apache Spark architecture and can be conducted in Scala or Python. It covers the process of creating a Spark application: integration with a data source, data processing, optimization, and saving the results to a database in a cloud environment.
You will become familiar with the Spark API and learn how to write Spark jobs that address both common and specialized problems. We will discuss optimizations, the most common challenges, and ways to overcome them. The training focuses mainly on practical skills.
40% - theory, 60% - practical workshops
The training can be conducted in the Client’s office or online.
Duration:
2-3 days.
Target audience
Developers and business analysts who want to learn Apache Spark. Basic knowledge of Python or Scala is recommended.
Training plan
Module 1 Introduction to data processing using Apache Spark
1.1 Origins - new features in the latest version of Apache Spark, integration with Cloud/Hadoop
1.2 Introduction to the API (example sketch after this list)
- 1.2.1 RDD / Dataset / Dataframe
- 1.2.2 Main characteristics, differences and performance comparison
- 1.2.3 Recommendations, tips & tricks
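To make the API comparison in 1.2 concrete, here is a minimal PySpark sketch that expresses the same aggregation with the RDD API and the DataFrame API (the typed Dataset API is available only in Scala and Java). The sample data and column names are illustrative assumptions, not part of the course materials.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("api-comparison").getOrCreate()

# RDD API: low-level, plain Python objects, no query optimizer involved.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd_sums = rdd.reduceByKey(lambda x, y: x + y).collect()

# DataFrame API: declarative, optimized by Catalyst, executed via Tungsten.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df_sums = df.groupBy("key").agg(F.sum("value").alias("total")).collect()

print(rdd_sums, df_sums)
spark.stop()
```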
1.3 Lazy Evaluation - Transformations and Actions (example sketch after this list)
- 1.3.1 How the transformation execution graph is built
- 1.3.2 How to read and interpret the execution graph
- 1.3.3 How to reuse an RDD so that previously performed transformations are not recomputed
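A minimal sketch of lazy evaluation, assuming a local PySpark session: transformations only extend the execution graph, actions trigger it, and cache() lets a second action reuse the already computed data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()

numbers = spark.sparkContext.parallelize(range(1_000_000))

# Transformations: only recorded in the lineage (execution graph), not executed yet.
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# Caching marks the RDD for reuse, so later actions do not recompute the lineage.
squares.cache()

# Actions: trigger actual execution of the graph built above.
print(squares.count())   # first action computes and caches the partitions
print(squares.take(5))   # second action reuses the cached partitions

spark.stop()
```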
1.4 Shuffling - transferring data between machines (example sketch after this list)
- 1.4.1 Transformations that require shuffling (wide and narrow)
- 1.4.2 reduceByKey vs. groupByKey
- 1.4.3 Minimizing the impact on application performance
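A minimal word-count sketch contrasting reduceByKey and groupByKey; the input words are an illustrative assumption. reduceByKey combines values on each partition before the shuffle, so less data crosses the network.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
words = spark.sparkContext.parallelize(["spark", "big", "data", "spark", "data"])
pairs = words.map(lambda w: (w, 1))

# Preferred: map-side combine reduces the amount of data shuffled between machines.
counts_reduce = pairs.reduceByKey(lambda a, b: a + b).collect()

# Usually discouraged for aggregations: every value for a key crosses the network.
counts_group = pairs.groupByKey().mapValues(lambda vals: sum(vals)).collect()

print(counts_reduce, counts_group)
spark.stop()
```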
1.5 Data partitioning (example sketch after this list)
- 1.5.1 The choice between repartition() and coalesce()
- 1.5.2 Main assumptions of data partitioning
- 1.5.3 Number/size of partitions vs. processing performance
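A minimal sketch of the repartition() vs. coalesce() choice; the partition counts are illustrative assumptions, not recommendations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning").getOrCreate()
df = spark.range(0, 1_000_000)

print(df.rdd.getNumPartitions())          # partition count chosen by Spark

wider = df.repartition(200)               # full shuffle, evenly sized partitions
print(wider.rdd.getNumPartitions())       # 200

narrower = wider.coalesce(10)             # no shuffle, merges local partitions
print(narrower.rdd.getNumPartitions())    # 10

spark.stop()
```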
1.6 Basic configuration of an Apache Spark-based project (example sketch after this list)
- 1.6.1 How to configure the job submission script
- 1.6.2 How to write job code so that it is easy to test
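A minimal sketch of one possible way to structure a job so that the transformation logic can be unit-tested without spark-submit; the function name and configuration values are illustrative assumptions.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F


def totals_per_key(df: DataFrame) -> DataFrame:
    """Pure transformation: easy to call from a test with a small in-memory DataFrame."""
    return df.groupBy("key").agg(F.sum("value").alias("total"))


def main() -> None:
    # Configuration that could also be passed via spark-submit --conf flags.
    spark = (
        SparkSession.builder
        .appName("example-job")
        .config("spark.sql.shuffle.partitions", "200")
        .getOrCreate()
    )
    df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
    totals_per_key(df).show()
    spark.stop()


if __name__ == "__main__":
    main()
```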
1.7 Options for integrating Spark with other solutions (databases, HDFS, Avro, text files, CSV, JSON, ...)
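A minimal sketch of reading from and writing to a few of the sources listed in 1.7; all paths, table names, credentials and options are illustrative assumptions, and the JDBC driver, spark-avro package and S3 connector are assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("integration").getOrCreate()

csv_df = spark.read.option("header", "true").csv("hdfs:///data/input.csv")
json_df = spark.read.json("s3a://bucket/events/")
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/analytics")
    .option("dbtable", "public.orders")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

csv_df.write.mode("overwrite").format("avro").save("hdfs:///data/output_avro")
json_df.write.mode("append").parquet("s3a://bucket/events_parquet/")

spark.stop()
```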
Module 2 Architecture, integration, common problems and optimization of applications based on Apache Spark
2.1 Architecture
- 2.1.1 Spark Driver, Worker and Executor
- 2.1.2 Job vs. Stage vs. Task
- 2.1.3 Processing units and data units
- 2.1.4 Deployment possibilities
2.2 Testing Spark jobs
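A minimal pytest-style sketch of a Spark job test, assuming the job logic is exposed as a pure DataFrame-to-DataFrame function as suggested in 1.6.2; the fixture and function names are illustrative.

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


@pytest.fixture(scope="session")
def spark():
    # Small local session shared by all tests in the run.
    session = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
    yield session
    session.stop()


def totals_per_key(df):
    return df.groupBy("key").agg(F.sum("value").alias("total"))


def test_totals_per_key(spark):
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 5)], ["key", "value"])
    result = {row["key"]: row["total"] for row in totals_per_key(df).collect()}
    assert result == {"a": 3, "b": 5}
```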
2.3 Joins (example sketch after this list)
- 2.3.1 Physical types of joins
- 2.3.2 Best practices
- 2.3.3 Choosing join strategies that minimize data transfer between machines (shuffle reduction)
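A minimal sketch of shuffle reduction with a broadcast join: the small dimension table is shipped to every executor, so the large table can be joined without shuffling it. The table contents are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("joins").getOrCreate()

orders = spark.createDataFrame(
    [(1, "PL", 100.0), (2, "DE", 50.0), (3, "PL", 75.0)],
    ["order_id", "country_code", "amount"],
)
countries = spark.createDataFrame(
    [("PL", "Poland"), ("DE", "Germany")], ["country_code", "country_name"]
)

# Without the hint Spark may pick a sort-merge join and shuffle both sides.
joined = orders.join(broadcast(countries), on="country_code", how="left")
joined.explain()   # the physical plan should show a BroadcastHashJoin
joined.show()

spark.stop()
```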
2.4 UDFs - how to build them and how they impact performance; differences between DataFrame and Dataset
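A minimal PySpark sketch comparing a Python UDF with an equivalent built-in function; the column names are illustrative. A Python UDF is opaque to Catalyst and serializes rows out to a Python worker, which is why built-ins are usually faster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# User-defined function: flexible, but invisible to the Catalyst optimizer.
upper_udf = F.udf(lambda s: s.upper() if s else None, StringType())
df.withColumn("name_udf", upper_udf("name")).show()

# Built-in function: same result, fully optimized execution.
df.withColumn("name_builtin", F.upper("name")).show()

spark.stop()
```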
2.5 Spark job optimizations and common issues (example sketch after this list)
- 2.5.1 Key skew
- 2.5.2 Out-of-memory (OOM) errors
- 2.5.3 Broadcast
- 2.5.4 Cache
- 2.5.5 Serialization
- 2.5.6 How to choose the size of executors
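A minimal sketch of one common remedy for key skew, salting, under illustrative assumptions about the data and the salt factor: the hot key is split across several tasks and then re-aggregated.

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-demo").getOrCreate()
sc = spark.sparkContext

# 90% of records share the key "hot", which would overload a single task.
pairs = sc.parallelize([("hot" if i % 10 else "cold", 1) for i in range(100_000)])

SALT = 8
salted = pairs.map(lambda kv: ((kv[0], random.randrange(SALT)), kv[1]))

# First aggregation runs on the salted key, spreading the hot key across tasks...
partial = salted.reduceByKey(lambda a, b: a + b)
# ...then a second, much smaller aggregation removes the salt.
totals = partial.map(lambda kv: (kv[0][0], kv[1])).reduceByKey(lambda a, b: a + b)

print(totals.collect())
spark.stop()
```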
2.6 Interpretation and optimization of query plans (example sketch after this list)
- 2.6.1 Navigating through Spark UI
- 2.6.2 What to verify and which elements matter most
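A minimal sketch (assuming Spark 3.x) of printing a query plan before inspecting the same job in the Spark UI; the DataFrame is an illustrative assumption.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plans").getOrCreate()

df = spark.range(0, 1_000_000).withColumn("bucket", F.col("id") % 10)
agg = df.groupBy("bucket").agg(F.count("*").alias("rows"))

# Prints the parsed, analyzed and optimized logical plans plus the physical plan;
# the physical plan is what appears as stages and tasks in the Spark UI.
agg.explain(mode="extended")

agg.collect()   # run the job so it shows up under the SQL / Jobs tabs in the UI
spark.stop()
```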
2.7 Spark Catalyst and Tungsten