Introduction
Many contemporary data processing frameworks, such as Apache Spark, BigQuery, and Snowflake, effectively conceal their distributed nature. Nonetheless, it's worth remembering that the seamless processing of terabytes of data within seconds is made possible by the collaborative effort of numerous cluster nodes. These servers work in parallel and occasionally need to exchange portions of a dataset between them - an operation commonly referred to as a "data shuffle". When optimizing a Big Data job, developers often focus on reducing the frequency of shuffling, since it is a resource-intensive process involving extensive network communication. In this article, we illustrate how you can structure your BigQuery job to minimize shuffling.
The story behind the scenes
Behind the scenes lies a hidden story - now is the time to unveil it. Our mission was to improve the efficiency of a job responsible for preparing a dataset used to train a machine learning model. Initially, the job looked quite straightforward: it merely merged 20 features generated by 20 distinct pipelines. Every feature table shared the same structure, comprising only three columns:
- item_id - ID of the observed item
- dt - date of the observation
- value - value of the observation
Our goal was to join all these features into one large table with the following schema:
- item_id
- dt
- value_1
- value_2
- …
The main challenge of this task was its scale. Each day, we accumulated around 0.5 terabytes of fresh data across these 20 feature tables, so aggregating the features over a span of 14 days meant churning through roughly 7 terabytes of input - a notably large dataset.
The first attempt
In our initial approach, we proceeded straightforwardly: we took the first table and gradually incorporated features from the others, one by one. A pseudocode SELECT query for this process would resemble the following (see Code 1.):
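For illustration, here is a minimal sketch of such a chained query; the table names `feature_1` ... `feature_20` are hypothetical stand-ins for the real feature tables, following the three-column schema above:

```sql
-- A sketch of the chained approach: start from the first feature table
-- and FULL JOIN the remaining ones, one by one.
SELECT
  COALESCE(f1.item_id, f2.item_id, f3.item_id) AS item_id,
  COALESCE(f1.dt, f2.dt, f3.dt) AS dt,
  f1.value AS value_1,
  f2.value AS value_2,
  f3.value AS value_3
  -- ... one value_N column per feature table, up to value_20
FROM feature_1 AS f1
FULL JOIN feature_2 AS f2
  ON f1.item_id = f2.item_id AND f1.dt = f2.dt
FULL JOIN feature_3 AS f3
  ON COALESCE(f1.item_id, f2.item_id) = f3.item_id
  AND COALESCE(f1.dt, f2.dt) = f3.dt
-- ... and so on: one more FULL JOIN per remaining feature table,
-- with ever longer COALESCE expressions in the join conditions.
```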

BigQuery translated this query into an execution plan that looked like a very long chain (see Image 1.).

This outcome was partly expected and partly surprising. Expected, because the plan's structure closely mirrored our SQL code. Surprising, because SQL is a declarative language, not an imperative one: we specify the desired output, not the step-by-step execution, and most database engines optimize query plans before running them. The plan produced for our query suggested that BigQuery either lacked an extensive query optimization module, or that this specific query proved too intricate to optimize.
Problem identification and an improvement
Since the optimizer proved less than effective here, we began to explore potential solutions ourselves. We observed that the initial join operations performed adequately, but as we introduced additional features, performance steadily declined. The table below (see Image 2.) illustrates the disparity between the first and the last join operation. The `slot_ms` metric is worth noting, as it reflects the overall CPU time expended while processing a particular stage.

The contrast is substantial: roughly 610 slot seconds in the first join versus around 37,000 slot seconds in the last one. Some increase was expected, as every join added one more feature; nevertheless, the magnitude of the escalation was quite surprising.
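As an aside, you can pull the same per-stage statistics for your own jobs from BigQuery's INFORMATION_SCHEMA views; a minimal sketch, where the region qualifier and the job ID are placeholders to substitute:

```sql
-- Per-stage slot time and shuffle volume for a single job.
SELECT
  stage.name,
  stage.slot_ms,
  stage.shuffle_output_bytes
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(job_stages) AS stage
WHERE job_id = 'your_job_id'  -- placeholder: the job to inspect
ORDER BY stage.slot_ms DESC;
```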
The primary challenge of our query can be pictured as building a snowman from a single enormous snowball. It's easy to get a small snowball rolling, but as it grows, moving it becomes increasingly difficult. What if, instead, we started with numerous small snowballs and merged them cleverly? That was our idea: we reimplemented the joining process in a divide-and-conquer manner. First, we paired the features - Table 1 with Table 2, Table 3 with Table 4, and so on. Next, we combined the outcomes of those initial pairings, then joined pairs of datasets from the preceding stage, and so on. Ultimately, we arrived at a single dataset - our desired result.
A crucial point here is that we relied on the commutativity of a FULL JOIN: whether we join Table A with Table B or the other way around - B with A - the result remains identical. It's important to highlight that our method would not work the same way with a LEFT JOIN, for instance, which keeps unmatched rows from only one side.
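A toy illustration, assuming two tables `a` and `b` that share the key columns:

```sql
-- FULL JOIN is symmetric: swapping the sides changes column order at
-- most, never the set of rows.
SELECT * FROM a FULL JOIN b USING (item_id, dt);
SELECT * FROM b FULL JOIN a USING (item_id, dt);  -- same rows

-- LEFT JOIN is not: the first query keeps unmatched rows from a,
-- the second keeps unmatched rows from b.
SELECT * FROM a LEFT JOIN b USING (item_id, dt);
SELECT * FROM b LEFT JOIN a USING (item_id, dt);  -- different rows
```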
Let’s come back to our query. We built it from blocks which look like this (see Code 2.):
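Here is a sketch of such a block, reusing the hypothetical `feature_N` names from earlier; each block FULL JOINs two inputs of similar size and coalesces the key columns so that the next level can join the results:

```sql
-- Level 1: pair up the raw feature tables.
WITH pair_1_2 AS (
  SELECT
    COALESCE(f1.item_id, f2.item_id) AS item_id,
    COALESCE(f1.dt, f2.dt) AS dt,
    f1.value AS value_1,
    f2.value AS value_2
  FROM feature_1 AS f1
  FULL JOIN feature_2 AS f2
    ON f1.item_id = f2.item_id AND f1.dt = f2.dt
),
pair_3_4 AS (
  SELECT
    COALESCE(f3.item_id, f4.item_id) AS item_id,
    COALESCE(f3.dt, f4.dt) AS dt,
    f3.value AS value_3,
    f4.value AS value_4
  FROM feature_3 AS f3
  FULL JOIN feature_4 AS f4
    ON f3.item_id = f4.item_id AND f3.dt = f4.dt
)
-- Level 2: the same block shape, applied to the pairs themselves.
SELECT
  COALESCE(p12.item_id, p34.item_id) AS item_id,
  COALESCE(p12.dt, p34.dt) AS dt,
  p12.value_1, p12.value_2,
  p34.value_3, p34.value_4
FROM pair_1_2 AS p12
FULL JOIN pair_3_4 AS p34
  ON p12.item_id = p34.item_id AND p12.dt = p34.dt
```

Each level halves the number of datasets, so 20 inputs collapse into a single result after five levels of such blocks.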

That adds a lot of boilerplate code, but it turns out that it changes the query plan too!

The new plan appears significantly better (see Image 3.)! In this updated approach, each join operation merges two datasets of roughly equal size, which avoids building one big dataset that then has to pass through multiple rounds of shuffle stages: we combine the smaller datasets first before integrating them into larger ones. Moreover, data from any input table now passes through at most five stages. These changes should decrease the volume of bytes shuffled during the job. Lastly, the entire job plan now fits on a single screen!
Performance comparison
Let’s compare execution statistics when we build joined features for one day (see Image 4.).

The crucial metric to consider is `totalSlotMs`. By that measure, our new solution is roughly twice as efficient as the previous one. In terms of elapsed time, we've managed to cut it by one-third, and we've observed a comparable reduction in the shuffle bytes metric. However, it's worth noting that after changing the query design, we began to experience disk spills, which are typically an indication of suboptimal performance. So even though the new query is an improvement over the initial solution, there is still room for further enhancement!
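Spills, too, can be spotted in the per-stage statistics; a minimal sketch along the same lines as before, using the `shuffle_output_bytes_spilled` field (the job ID is again a placeholder):

```sql
-- Stages whose shuffle data spilled from memory to disk.
SELECT
  stage.name,
  stage.shuffle_output_bytes_spilled
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(job_stages) AS stage
WHERE job_id = 'your_job_id'  -- placeholder: the job to inspect
  AND stage.shuffle_output_bytes_spilled > 0;
```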
Conclusion
First, the structure of SQL code matters in BigQuery: we improved the query's execution plan simply by rewriting it. Second, shuffle matters - as always in a distributed framework. Last but not least, there is always room for improvement, even for seemingly simple queries. Our team of professionals specializes in optimizing BigQuery performance - if your company is facing such challenges, we're the right experts to help.
Happy optimizations!
Contact us to boost your BigQuery efficiency today!