Introduction
When crafting a new and efficient Spark job, or optimising an existing one, several implementation and design choices can have a significant impact on the job’s performance. One of the most prominent aspects influencing a Spark job’s efficiency is the fundamental difference between actions and transformations. Adding a transformation to the job comes at a nominal cost: it merely adds another node to the execution plan. Of course, this transformation will be computed later, once we trigger our data pipeline with an action. Typical actions include persisting result files to storage, counting rows, or transmitting the entire DataFrame to the Spark driver (via a .collect() call). Actions impose a substantial computational burden because they require processing the entire graph of transformations. Hence, our team's primary focus when optimising Apache Spark jobs is identifying superfluous actions. A recent revelation that caught us off guard was the realisation that a seemingly typical transformation, .pivot(), may harbour an additional action.
Reshaping data with Spark pivot function
Since "pivoting" isn't a routine operation, let's take a brief two-minute interval to explain its function through a simple usage example.
We got data from a shipping company (see Code 1., Table 1.). Each row in the DataFrame represents the number of packages sent to a particular country via different means of transport.
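A minimal sketch of such a DataFrame could look like the snippet below; the column names and numbers are illustrative assumptions, not the client's actual data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative shipping data: country, means of transport, number of packages.
data = [
    ("Poland",  "Truck", 120),
    ("Poland",  "Plane",  30),
    ("Germany", "Truck", 200),
    ("Germany", "Train",  80),
    ("France",  "Plane",  50),
    ("France",  "Truck",  90),
]
df = spark.createDataFrame(data, ["Country", "Transport", "Packages"])
df.show()
```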
Now let’s assume that we want to calculate the total number of packages for each country and transport method in our DataFrame. In Spark we can compute it as follows (see Code 2., Table 2.):
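Assuming the same illustrative DataFrame as above, the aggregation could look roughly like this:

```python
from pyspark.sql import functions as F

# Total number of packages per country and transport method.
totals = (
    df.groupBy("Country", "Transport")
      .agg(F.sum("Packages").alias("TotalPackages"))
)
totals.show()
```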
Nonetheless, there are instances when we may prefer to pivot a single column. This transformation would transpose unique values from that particular column into new columns within an output DataFrame. Fortunately, Spark provides pivot for this purpose (see Code 3., Table 3.):
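Sticking with the illustrative data, pivoting on the Transport column could look roughly like this: each distinct transport method becomes a column holding the summed package counts.

```python
# Each distinct value of Transport becomes its own column in the output.
pivoted = df.groupBy("Country").pivot("Transport").sum("Packages")
pivoted.show()
```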
What happens when we call the Spark pivot function? An insight
After running the Spark pivot function on a DataFrame, we can analyse its execution using SparkUI. When assessing the execution, I suggest starting with the SQL/DataFrame page, where you can view all the queries carried out by the Spark cluster. As a general rule, one query corresponds to one action in the source code. Nevertheless, as depicted in the image below (see Image 1.), there are four queries in total, despite having used the .show() action only three times.
Surprisingly, there's an additional action within the function! This presents a potential issue because a portion of our job could be executed twice: initially by the action within pivot and subsequently by the show action.
Hidden action and how to remove it
While perusing the PySpark documentation for the function, we come across the following passage:
There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
First, note that this function has two parameters: `pivot_col` and an optional `values` parameter. Furthermore, a laconic explanation highlights that the call without `values` is comparatively less optimised. Digging deeper into the implementation of the `pivot` function, it becomes evident that this variant includes a collect action responsible for computing the distinct values of `pivot_col`; it then invokes the second variant of the function, `pivot(pivot_col, values)`.
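Conceptually, the single-argument call behaves roughly like the sketch below: a collect of the distinct values, followed by the two-argument variant. This is an illustration of the behaviour described in the documentation, not the actual Spark source code.

```python
# The hidden action: a Spark job runs here to collect the distinct pivot
# values before the pivot transformation itself is even built.
distinct_transports = [
    row["Transport"] for row in df.select("Transport").distinct().collect()
]
pivoted = (
    df.groupBy("Country")
      .pivot("Transport", distinct_transports)
      .sum("Packages")
)
```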
To fix this, we can improve our job by adding a second parameter to the `pivot` function (see Code 4.). Now it is a pure transformation, and the job runs only one action (see Image 2.).
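With the illustrative data above, that could look like this (the list of transport methods is assumed to be known up front):

```python
# Supplying the expected values removes the hidden collect:
# pivot is now a pure transformation.
transport_values = ["Truck", "Train", "Plane"]
pivoted = (
    df.groupBy("Country")
      .pivot("Transport", transport_values)
      .sum("Packages")
)
pivoted.show()
```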
Nevertheless, on occasion, adding the `values` parameter might not be feasible because we simply do not know the values in advance. While we deem this circumstance to be infrequent, we propose adding a `cache` operation immediately before the `pivot` function (see Code 5.). By doing so, the action within the `pivot` function will materialise the cache, enabling subsequent actions to leverage the pre-computed results of this segment of the job (see Image 3.).
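A sketch of this workaround, again on the illustrative data:

```python
# The hidden action inside pivot materialises the cache; the later show()
# then reuses the cached partitions instead of recomputing the lineage.
df_cached = df.cache()
pivoted = df_cached.groupBy("Country").pivot("Transport").sum("Packages")
pivoted.show()
```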
Why is it a problem?
In our experience, Spark jobs are often complex. They read data from multiple sources and filter and aggregate them in various ways. When we found this tiny difference between the two versions of the pivot function, we were optimising a Spark job that had hundreds of lines and multiple pivot calls. The job was struggling, but we were stumped, as we had already removed all unnecessary actions. Because of this hidden action inside pivot, parts of our job were executed multiple times, and the fact that the pivot calls were spread across the job did not help either. In our case we could easily add the values parameter, and so we gained a significant improvement in the performance of the problematic Spark job.
Conclusion
Do not be a lazy data engineer. Add a second parameter to the pivot function.
BTW, have I mentioned that our client’s job now runs in 16 minutes instead of 85, on a cluster three times smaller?
Happy Spark optimization!