The buzz surrounding cloud migration is everywhere, from tech-centric YouTube channels to the marketing efforts of cloud service providers. Yet despite the enthusiasm surrounding cloud computing, approximately 40% of global computing power still resides in on-premises data centers, with 23% located in colocation data centers, leaving only 37% for hyperscale cloud providers, according to a Synergy Research Group report. Why is there such a disconnect between media portrayals and the actual state of data platforms within companies? Is the move to the cloud merely a passing trend, or is it becoming an unavoidable necessity? Today, we share our firsthand experience of migrating an existing data platform to GCP (Google Cloud Platform). Our narrative highlights not only the advantages of cloud migration but also the challenges encountered along the way.
To distinguish ourselves from conventional aspects of the "cloud-cargo cult," let's offer a succinct overview of our journey. We undertook the extensive task of migrating the entire Big Data department of the e-commerce giant, OnlineMarket(*), from an on-premise Hadoop infrastructure to GCP. This endeavour spanned two years and was marked by numerous triumphs and challenges. Significantly, the scale of OnlineMarket's data warehousing was anything but small, quantified not in terabytes but in petabytes. Additionally, the modernization of thousands of data pipelines was high on our agenda.
Throughout this architectural transformation, we encountered a spectrum of challenges and resolved most of them successfully. Along the way, we developed a deep understanding of the benefits and potential pitfalls of transitioning from the Hadoop ecosystem to GCP. By sharing our insights and experiences, we aspire to offer valuable guidance to organisations grappling with similar transitions.
*The name of this company was invented for the purpose of this blog post. The real name of our client cannot be disclosed due to NDA.
What are the benefits of migrating on-premises Hadoop to Google Cloud?
Moving the infrastructure from an on-prem solution to Google Cloud has brought a multitude of benefits for OnlineMarket. Let's look at them in more detail.
Streamlined infrastructure management
One of the key advantages of migrating to the cloud is the elimination of the need to maintain physical infrastructure and the challenges associated with it. In an on-premises environment, managing and scaling the infrastructure requires significant resources and expertise. It often involves a lengthy process of forecasting, budgeting and purchasing new servers in advance, anticipating future data growth and processing needs. Moreover, the procurement and setup of physical servers can be time-consuming, potentially causing delays in project timelines.
However, by transitioning to a cloud-based solution, OnlineMarket's team has experienced a transformative shift in infrastructure management. Instead of procuring and maintaining physical servers, they now utilise Google's reliable and robust cloud infrastructure. Google Cloud services provide instant access to compute, storage, and network resources, eliminating the need for upfront hardware investments and time-consuming server setup processes. By offloading the burden of infrastructure maintenance to Google's cloud infrastructure, OnlineMarket's team can now focus more on the core aspects of data analysis, pipeline optimisation, and driving meaningful insights.
(Almost) endless scaling capabilities
The cloud solution provides unparalleled scalability for data workloads, addressing a significant limitation faced in the self-hosted environment. This flexibility allows OnlineMarket to seamlessly adjust resources up or down in response to changes in data volume or processing needs. For example, during peak events such as Black Friday week, when there is a surge in data processing requirements, the new solution enables the company to scale its resources to meet the increased demand. This ensures that OnlineMarket's big data jobs receive the necessary resources without any performance degradation or job starvation issues.
In the Hadoop architecture, resources were typically shared among different jobs without clear separation. This lack of resource isolation could lead to situations where certain resource-intensive jobs monopolised the resources, starving other jobs of the necessary computing power. This could result in delays, lower throughput, or even job failures.
Seamless ad hoc analysis with lightning-fast speed
One of the standout features of BigQuery, which sets it apart from traditional self-hosted ecosystems, is its ability to deliver ad hoc analysis results exceptionally fast. In the self-hosted setup, ad hoc analysis typically involved running queries and processing data with tools like Hive or Spark. BigQuery, however, often returns results faster than Spark can even schedule a job on the cluster. In our opinion, BigQuery's execution engine surpasses Spark's and benefits significantly from its internal storage format.
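To make this concrete, an exploratory query of this kind can be validated and sized with a BigQuery dry run before anything is billed, via the `jobs.insert` REST API. The project, dataset, and column names below are hypothetical, not OnlineMarket's real schema; this is a sketch of the request body, not a definitive client:

```python
import json

# Hypothetical ad hoc query -- table and column names are invented
# for illustration and are not OnlineMarket's actual schema.
ADHOC_SQL = """
SELECT category, COUNT(*) AS orders, SUM(total_price) AS revenue
FROM `my-project.sales.orders`
WHERE order_date BETWEEN '2023-11-20' AND '2023-11-27'
GROUP BY category
ORDER BY revenue DESC
"""

def dry_run_request(sql: str) -> dict:
    """Build a jobs.insert body that validates the query and returns
    the scanned-bytes estimate without actually executing it."""
    return {
        "configuration": {
            "query": {
                "query": sql,
                "useLegacySql": False,  # Standard SQL dialect
            },
            "dryRun": True,  # validate + estimate only, nothing is billed
        }
    }

body = dry_run_request(ADHOC_SQL)
print(json.dumps(body, indent=2))
```

Because BigQuery stores data in a columnar format, only the columns referenced in the query are scanned, which is a large part of why ad hoc queries like this one return so quickly.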
The speed and efficiency of BigQuery's ad hoc analysis capabilities have unlocked new possibilities for OnlineMarket. Teams can now rapidly iterate through exploratory analyses, quickly test hypotheses, and gain valuable insights into data without the delays associated with lengthy query execution times. This agility has empowered OnlineMarket to make data-driven decisions more efficiently, identify emerging trends, optimise marketing campaigns, and respond promptly to changing business needs.
Cost-efficient data storage with BigQuery
BigQuery, the data warehousing solution offered by Google, has proven to be a cost-efficient data store for OnlineMarket. Traditional on-premises data storage solutions often involve significant upfront investments and ongoing maintenance costs. In contrast, BigQuery operates on a pay-as-you-go model, allowing OnlineMarket to pay only for the storage and processing resources they utilise. This approach not only reduces upfront costs but also provides the flexibility to adapt to changing data storage requirements without over-provisioning or incurring unnecessary expenses.
Integration with other services
Another significant benefit of transitioning to a cloud-based ecosystem is seamless integration with other cloud services. Integration with services such as Cloud Logging and Cloud Monitoring allowed us to build a comprehensive ecosystem for managing and monitoring the data platform. OnlineMarket can now easily track and analyse system logs, monitor resource utilisation and debug its data workloads. This integrated approach streamlines operations, enhances troubleshooting capabilities, and provides a unified environment for end-to-end data management.
What are the challenges and considerations?
Cloud solutions, with their many benefits, allow us to process more data in less time, but they require us to change our approach to data processing. It is important to consider the potential challenges that may arise during the transition.
Service integration is not as simple as it seems
Cloud providers present a plethora of services that must be integrated to construct a robust data platform. Despite ongoing efforts to enhance connectivity, not every integration proves efficient or feasible. This underscores the continued relevance of companies with expertise in developing cloud-native solutions, such as Datumo. With firsthand experience garnered from diverse projects, our company understands both successful strategies and potential obstacles. To substantiate these claims, let's examine the integration of Spark jobs and BigQuery.
While the benefits of adopting BigQuery are substantial, integrating it into existing data pipelines may require modifications to jobs and workflows. This might not sound like a big deal, but it matters, especially if you want to write tests for your big data pipelines. BigQuery's SQL dialect is similar to Hive's, but there is no such thing as a local BigQuery instance: during tests you have to communicate with the real service. This additional step in the platform change process introduces overhead and may require adjustments to the existing codebase.
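In practice, test suites either run against a dedicated test dataset in a real BigQuery project or stub the client out entirely. A minimal sketch of the stubbing approach is shown below; the pipeline function and table names are hypothetical, and the stub mimics the `google-cloud-bigquery` client's `query(...).result()` call chain:

```python
from unittest.mock import MagicMock

def daily_revenue(client, table: str) -> list:
    """Hypothetical pipeline step: aggregate revenue per day.
    `client` follows the google-cloud-bigquery Client interface."""
    query = f"""
        SELECT order_date, SUM(total_price) AS revenue
        FROM `{table}`
        GROUP BY order_date
    """
    return [(row["order_date"], row["revenue"])
            for row in client.query(query).result()]

def test_daily_revenue():
    # Stub the BigQuery client instead of calling the real service.
    client = MagicMock()
    client.query.return_value.result.return_value = [
        {"order_date": "2023-11-24", "revenue": 1200.0},
    ]
    assert daily_revenue(client, "project.sales.orders") == [
        ("2023-11-24", 1200.0)
    ]

test_daily_revenue()
```

Stubbing keeps tests fast and free, but it does not exercise BigQuery's SQL dialect, so most teams pair it with a small number of integration tests against a real test dataset.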
During the migration of OnlineMarket's data warehouse, we encountered issues with the spark-bigquery-connector library. Specifically, the lack of predicate pushdown for nested columns led to considerable, avoidable expenses for our client. To address this, we implemented a workaround involving additional jobs on BigQuery to preprocess data before reading.
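The shape of that workaround can be sketched as follows: rather than letting the connector read the full table and filter nested fields in Spark (where the predicate is not pushed down), a preprocessing query filters on the nested column inside BigQuery first, and Spark then reads the much smaller staging table. All table and field names here are invented for illustration:

```python
def preprocess_sql(source: str, staging: str) -> str:
    """Build the BigQuery preprocessing job that filters on a nested
    column before Spark reads the data. Without this step the
    spark-bigquery-connector would scan the whole table, because
    predicates on nested fields are not pushed down."""
    return f"""
        CREATE OR REPLACE TABLE `{staging}` AS
        SELECT *
        FROM `{source}`
        WHERE customer.address.country = 'PL'
    """

sql = preprocess_sql("project.dwh.orders", "project.tmp.orders_pl")
# The Spark job then reads the staging table through the connector,
# e.g. spark.read.format("bigquery").option("table", "project.tmp.orders_pl")
print(sql)
```

The extra job costs a little BigQuery compute, but in our client's case that was far cheaper than repeatedly scanning the full table from Spark.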
Migrating data from Hive tables to BigQuery may also require addressing schema differences, such as the lack of multi-column partitioning in BigQuery. While table clustering can help partially mitigate this limitation, schema alignment remains a time-consuming task that demands attention to ensure data integrity and consistency during the whole process.
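For example, a Hive table partitioned by two columns has no direct BigQuery equivalent. One common translation (all names hypothetical) keeps the date as the single partition column and turns the remaining Hive partition keys into clustering columns:

```python
# Hive source, for reference:
#   CREATE TABLE events (...) PARTITIONED BY (country STRING, event_date DATE)
#
# BigQuery allows only one partition column per table, so the remaining
# Hive partition keys become clustering columns instead.
BQ_DDL = """
CREATE TABLE `project.dwh.events` (
  event_id STRING,
  country STRING,
  event_date DATE,
  payload STRING
)
PARTITION BY event_date
CLUSTER BY country
"""
print(BQ_DDL)
```

Clustering prunes scanned blocks for queries filtering on `country`, so it recovers much of the benefit of the second partition key, but unlike a true partition it gives no hard guarantee on bytes scanned.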
Cost management and education
A notable consideration when migrating to Google’s cloud solution is the cost associated with every action. This is particularly significant for big data pipelines, which often require large clusters to process vast amounts of data. Without proper cost awareness and management, these pipelines can incur substantial expenses. It is crucial to prioritise educating the workforce on cost optimization strategies to ensure efficient utilisation of cloud resources.
With the endless scaling capabilities offered by GCP, it is easy to trigger and provision additional computing and storage resources. However, this can lead to overspending if not carefully managed. By educating teams on cost-effective practices and providing them with insights into monitoring and controlling resource usage, companies can avoid unnecessary expenses and optimise their budget allocation. Additionally, Google Cloud Platform includes inherent features designed to curb the risk of overspending. This maximises the benefits of the platform while keeping costs in check and ensuring long-term financial sustainability.
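One of those built-in guardrails is the per-query byte cap: setting `maximumBytesBilled` on a query job makes BigQuery reject the job upfront instead of billing for an accidental full-table scan. A sketch of the `jobs.insert` request body follows; the project name and the 10 GiB limit are illustrative choices, not recommendations:

```python
import json

TEN_GIB = 10 * 1024 ** 3  # illustrative per-query budget

def capped_query(sql: str, max_bytes: int = TEN_GIB) -> dict:
    """jobs.insert body for a query that BigQuery refuses to run
    if it would bill more than `max_bytes`."""
    return {
        "configuration": {
            "query": {
                "query": sql,
                "useLegacySql": False,
                # In the REST API this field is a string-encoded int64.
                "maximumBytesBilled": str(max_bytes),
            }
        }
    }

body = capped_query("SELECT COUNT(*) FROM `project.dwh.events`")
print(json.dumps(body, indent=2))
```

Wrapping query submission in a helper like this lets a platform team enforce a default budget for every ad hoc query while still allowing deliberate overrides for known-heavy jobs.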
From old school to cutting edge technology
The move from the on-prem data management system was not just a shift in infrastructure, but a transformation from an outdated data platform to a state-of-the-art solution. Before the migration, OnlineMarket was operating on an ageing on-premises data platform that posed limitations in terms of scalability, performance, and agility. The decision to embrace cloud solutions brought about a complete overhaul of their data platform, equipping them with modern and advanced tools for managing, processing, and analysing large-scale data.

In retrospect, we are confident that our dedicated team, responsible for nearly 30% of the overall migration project, has laid a solid foundation for future success. With the right training and a focus on cost optimisation, OnlineMarket is well-positioned to capitalise on the new possibilities that Google Cloud Platform provides, propelling their business into a new era of data-driven innovation. If your company is facing a similar challenge, contact us. Datumo is a Google Cloud partner specialised in Data Analytics. Let's work together to streamline your costs and achieve your transformation goals.
Image credits
Generated with AI - Bing Image Creator ∙ 6 October 2023