In the ever-evolving landscape of cloud data warehouses and analytics platforms, three prominent solutions have emerged as leaders: BigQuery, Snowflake, and Databricks. Each platform offers distinct features and capabilities, making it suitable for different use cases and workloads. This article is a comparative analysis of their architectures, pricing models, query performance, and workload and processing capabilities. Drawing on our experience with a variety of cloud solutions, our aim is to give you the insights you need to make a well-informed decision when selecting the right platform for your data requirements.
What are the architectural differences between cloud data warehouses?
BigQuery is a serverless data warehouse that also supports machine learning. It decouples storage and compute, allowing each to scale independently on demand, and is available only on Google Cloud. The storage layer is managed by Colossus, Google's distributed file system, while the compute layer is managed by Borg, Google's cluster management system. Queries are executed by Dremel, a scalable, interactive ad-hoc query system. The platform determines the required number of slots for a query, where slots are virtual CPUs used to execute SQL. BigQuery supports dynamic query execution: the number of slots needed is calculated from the query execution plan for each stage, so it can change from stage to stage. Google's cloud data warehouse also supports stages that can be processed concurrently, as well as speculative execution and dynamic repartitioning to achieve optimal parallelization. In contrast to Spark, BigQuery can initiate the next stage before the preceding one completes, as soon as data begins to flow out of the previous stage.
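As a rough illustration of this stage-based execution, the sketch below uses the google-cloud-bigquery Python client to run a query against a public dataset and print the per-stage execution plan; the project ID is a placeholder, and the exact fields available may vary by client version.

```python
# Illustrative sketch: inspect how BigQuery splits a query into stages executed by slots.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

sql = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

job = client.query(sql)
job.result()  # wait for the query to finish

# Each entry in query_plan corresponds to one execution stage; slot demand is
# recalculated per stage, which is why it can change during a single query.
for stage in job.query_plan:
    print(stage.name, stage.status, stage.records_read, stage.records_written)

print("Total slot-milliseconds consumed:", job.slot_millis)
```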

Databricks is an analytics platform that supports machine learning use cases. It decouples storage and compute in the form of clusters and is supported on AWS, GCP, and Azure. Databricks uses open-source Apache Spark to process data, together with Photon, a high-performance vectorized query engine that runs on the Databricks Lakehouse Platform, delivers very fast query performance at low cost, and is compatible with Apache Spark APIs. Databricks features Adaptive Query Execution, which optimizes performance by dynamically coalescing shuffle partitions, adapting join strategies, handling skewed joins, and pruning partitions dynamically. Customers control their own Spark clusters, which is a benefit for experts but can be a challenge for beginners; clusters can also scale dynamically during jobs. Tables in Databricks use the Delta Lake format by default, which supports performance optimizations such as Z-order clustering keys as well as time travel, a temporal data management feature that lets users query historical versions of their data stored in Delta Lake, typically within a default retention period of around 30 days. Additionally, Databricks can automatically cache data on clusters' local storage, speeding up access to query results. Databricks provides Delta Sharing, which enables datasets to be shared across organizations without requiring the recipient to have Databricks or use the Databricks user interface. Furthermore, it provides Unity Catalog for managing and discovering datasets and for tracking data lineage.
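As a minimal sketch of the Delta Lake features mentioned above, the PySpark snippet below Z-orders a table and reads an earlier version of it via time travel; the table name, path, and version number are placeholders, and the `spark` session is assumed to be provided by a Databricks cluster.

```python
# Z-ordering rewrites the table's data files clustered on the chosen column,
# improving data skipping for selective queries (table name is a placeholder).
spark.sql("OPTIMIZE sales_events ZORDER BY (customer_id)")

# Time travel: read the table as it existed at an earlier version (or timestamp).
previous = (
    spark.read.format("delta")
    .option("versionAsOf", 5)  # or .option("timestampAsOf", "2024-01-01")
    .load("/mnt/delta/sales_events")  # placeholder path
)
previous.show()
```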
Snowflake is a cloud data platform that enables users to store, process, and analyze large volumes of structured and semi-structured data. It decouples storage and compute and allows them to scale independently on demand, so users only pay for the resources they use and can adjust them as needed. Snowflake is supported on AWS, GCP, and Azure, giving users the flexibility to choose their preferred cloud provider. Its architecture consists of three layers: a centralized storage layer that uses the cloud provider's storage to hold data in an optimized columnar format; a multi-cluster compute layer that contains multiple massively parallel processing clusters to execute queries; and a cloud services layer that bundles services such as authentication, infrastructure management, metadata management, query parsing and optimization, and access control. Snowflake users can also work with Apache Iceberg, an open-source table format that provides high-performance data access for big data analytics.

Another thing worth mentioning is how tables are organized into micro-partitions: small chunks of compressed and encrypted data, typically between 50 MB and 500 MB in size, which enable efficient data access and query performance. Snowflake also supports automatic or manual clustering, the process of ordering the rows in a table based on one or more columns, which improves micro-partition pruning and reduces the amount of data scanned by a query. Snowflake offers time travel and fail-safe features, which allow users to access historical data and recover data in case of accidental deletion or modification: time travel lets users query data as it was at any point within a retention period, while fail-safe is a non-queryable layer that stores data for an additional period after it is no longer accessible via time travel. Snowflake's warehouses, similar to Databricks clusters, feature a local cache that speeds up data retrieval by storing results of previously queried data; this warehouse-level caching can provide lower latency and more predictable performance than platforms without explicit result caching. Finally, Snowflake facilitates data sharing among different Snowflake accounts through a secure data sharing feature; consumers without a licensing agreement can access shared data via a reader account, which must be set up by the data provider and still relies on the Snowflake user interface for access.
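The following is a brief sketch of the clustering and time travel features described above, using the snowflake-connector-python package; the connection parameters, warehouse, and table names are placeholders.

```python
import snowflake.connector

# Placeholder connection details for illustration only.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="ANALYTICS_WH", database="SALES", schema="PUBLIC",
)
cur = conn.cursor()

# Define a clustering key so Snowflake keeps micro-partitions ordered by date,
# improving partition pruning for date-filtered queries.
cur.execute("ALTER TABLE orders CLUSTER BY (order_date)")

# Time travel: query the table as it existed one hour ago (within the retention period).
cur.execute("SELECT COUNT(*) FROM orders AT(OFFSET => -3600)")
print(cur.fetchone())
```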

Pricing

BigQuery pricing depends on two factors: compute and storage. Compute is the cost of processing SQL queries, while storage is the cost of storing data loaded into BigQuery. For compute, you can choose between on-demand and capacity pricing. On-demand pricing charges for the bytes processed by each query, with the first 1 TB per month free. Capacity pricing charges for slots (virtual CPUs) over time, which gives you more control and predictability over costs. Reserving slots for extended periods, spanning months or even years, can offer substantial savings, although it is rarely the most cost-effective approach for ad-hoc analysis. Slot reservations are particularly valuable in environments with stable workloads, such as jobs executed hourly, because they ensure consistent utilization of the reserved resources. Where workloads vary significantly, reservations may not reduce costs, as a considerable portion of the reserved slots may sit idle.
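For a concrete sense of on-demand pricing, the sketch below uses a BigQuery dry run to estimate how many bytes a query would scan and multiplies that by an assumed per-TiB rate; the table and the price are placeholders, so check current BigQuery pricing for your region.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A dry run validates the query and reports bytes that would be processed,
# without executing it or incurring charges.
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT * FROM `my-project.my_dataset.events`",  # placeholder table
    job_config=config,
)

price_per_tib = 6.25  # assumed USD per TiB scanned; verify against current pricing
tib_scanned = job.total_bytes_processed / 2**40
print(f"Estimated on-demand cost: ${tib_scanned * price_per_tib:.4f}")
```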
Databricks pricing revolves around two key components: compute cost and storage cost. Compute cost, the expense incurred for processing data, can be divided into two parts: the first goes to the cloud provider for services such as virtual machines (VMs), while the second goes to Databricks for the consumption of Databricks Units (DBUs). Storage costs are billed directly by the cloud provider, for example Amazon S3 or Azure Storage.
A DBU is a standardized unit of processing power on the Databricks Lakehouse Platform, used for both measurement and pricing. The number of DBUs a workload consumes depends on processing metrics such as compute resource utilization and the volume of data processed. DBU prices also vary by region and by cloud provider.
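The sketch below is a back-of-the-envelope view of how a Databricks compute bill is composed: the DBU charge plus the cloud provider's VM charge. All rates in the snippet are hypothetical, since actual DBU rates depend on the workload type (jobs, all-purpose, SQL), cloud, and region.

```python
def databricks_hourly_cost(num_workers: int,
                           dbu_per_node_hour: float = 0.75,   # assumed consumption per node
                           dbu_rate_usd: float = 0.15,        # assumed DBU price, varies by SKU
                           vm_rate_usd: float = 0.50) -> float:  # assumed VM price per node-hour
    """Estimate the hourly cost of a cluster: DBU charge plus cloud VM charge."""
    nodes = num_workers + 1  # workers plus the driver node
    dbu_cost = nodes * dbu_per_node_hour * dbu_rate_usd
    vm_cost = nodes * vm_rate_usd
    return dbu_cost + vm_cost

print(f"Estimated cost per hour: ${databricks_hourly_cost(num_workers=8):.2f}")
```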
Snowflake's pricing model, similar to Databricks, is based on actual usage of storage and compute resources, including virtual warehouses, serverless compute, various cloud services, and data transfer. For data transfers, a per-byte fee applies when data is moved from a Snowflake account to a different region on the same cloud platform or to a different cloud platform altogether.
In a manner akin to Databricks, Snowflake employs a credit-based system to pay for processing time. Credits serve as the unit of measurement and are only consumed when a customer actively uses resources, such as running a virtual warehouse, using the cloud services layer, or invoking serverless features. Notably, Snowflake offers a daily allowance for cloud services: this usage is only billed if it exceeds 10% of the day's warehouse (compute) credit consumption.
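The following sketch estimates warehouse compute cost from Snowflake's credit model: standard warehouse sizes consume a fixed number of credits per hour while running, billed per second with a 60-second minimum each time a warehouse resumes. The credit price is an assumption, as it varies by edition, cloud, and region.

```python
# Credits per hour for standard warehouse sizes (documented values for standard warehouses).
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

def warehouse_cost(size: str, runtime_seconds: int, credit_price_usd: float = 3.0) -> float:
    """Estimate cost of one warehouse run; credit_price_usd is an assumed rate."""
    billable_seconds = max(runtime_seconds, 60)  # 60-second minimum per resume
    credits = CREDITS_PER_HOUR[size] * billable_seconds / 3600
    return credits * credit_price_usd

# Example: a Medium warehouse running for 15 minutes.
print(f"${warehouse_cost('M', runtime_seconds=900):.2f}")
```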
Query Performance
BigQuery, Databricks, and Snowflake share a common underlying technology stack, relying on columnar storage, employing cost-based query planning, executing queries in a pipelined manner, and utilizing just-in-time compilation. A benchmark analysis, accessible at https://www.fivetran.com/blog/warehouse-benchmark, highlights their comparable query performance. Nonetheless, the most substantial distinctions in performance stem from each platform's design choices. Certain platforms prioritize tunability, enabling users to fine-tune configurations for optimal performance, while others emphasize user-friendliness and a seamless experience.
In terms of processing costs, it's worth noting that Databricks comes in as a cost-effective option, being 25% cheaper than Snowflake and BigQuery. However, it's essential to recognize that the authors of the benchmark did not leverage cost-reducing features apart from data compression, such as spot-instance pricing for Databricks, multi-cluster auto-scaling for Snowflake, or BigQuery's on-demand pricing. Moreover, they didn't make use of advanced features available in Snowflake and Databricks.
Both Snowflake and Databricks offer larger warehouse sizes, which can significantly expedite the execution of extensive queries. Snowflake goes a step further with its query acceleration service, which offloads portions of queries to serverless compute resources, and with table clustering for optimized performance. Databricks distinguishes itself with its Photon execution engine, which greatly enhances Spark performance, while Delta introduces further optimizations through caching, data partitioning, and more. As a result, fully tuned workloads that harness all the platform-specific features may exhibit substantial variations in performance.
Workload types and processing
BigQuery is an ideal choice for SQL processing, making it particularly well-suited for migrations from other SQL-based systems. Transitioning from code-based processing, however, may necessitate codebase rewriting to conform to SQL. This versatile tool caters to various data analysis needs, including ad hoc analysis, geospatial analysis, machine learning, and business intelligence. It empowers users to schedule queries and offers compatibility with external orchestrators like Cloud Scheduler or Cloud Composer. BigQuery tasks typically form part of broader workflows, and it excels in supporting near-real-time streaming, delivering data availability within seconds. Notably, BigQuery facilitates machine learning model training through SQL queries, enabling analysts to construct models without the hassle of data migration.
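As a hypothetical illustration of that last point, the sketch below trains a logistic regression model with BigQuery ML entirely in SQL, so no data leaves the warehouse; the dataset, table, and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Define and train the model in SQL; BigQuery handles the training job itself.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, support_tickets, churned
    FROM `my_dataset.customers`
""").result()

# Once trained, the model can be evaluated and used with ML.EVALUATE and ML.PREDICT.
```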
Databricks, on the other hand, stands out as a more adaptable solution, accommodating SQL and a spectrum of programming languages such as Python, Scala, Java, and R. This versatility allows users to migrate seamlessly from both Spark-based and SQL-based codebases. Databricks offers robust support for machine learning libraries and comes with a built-in orchestrator, and it is also compatible with external orchestrators like Apache Airflow. The platform is well-suited for near-real-time streaming with Delta Live Tables and Spark Structured Streaming. While Databricks may present a steeper learning curve compared to BigQuery and Snowflake, it empowers experienced developers with extensive options for configuring clusters, processing, and code, making it a powerful tool in capable hands.
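The near-real-time pattern mentioned above can be sketched with Spark Structured Streaming on Databricks: read new files as they arrive and continuously append them to a Delta table. The paths and schema are placeholders, and the `spark` session is assumed to be provided by the cluster.

```python
# Read newly arriving JSON files as a stream (placeholder path and schema).
events = (
    spark.readStream
    .format("json")
    .schema("user_id STRING, action STRING, ts TIMESTAMP")
    .load("/mnt/raw/events/")
)

# Continuously append the stream to a Delta table, tracking progress via a checkpoint.
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events/")
    .outputMode("append")
    .start("/mnt/curated/events/")
)
```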
Snowflake, similar to Databricks, offers support for SQL and popular programming languages like Python, Scala, and Java. Where Snowflake distinguishes itself is in its focus on user-friendliness. With the Snowpark API, migrating from Spark or SQL-based workloads is designed to be as seamless as possible. While Snowflake also supports machine learning, it's worth noting that, at the time of writing, Snowpark ML has a less mature set of features, making Databricks the more sensible choice for machine learning support. Snowflake includes a built-in orchestrator for executing ad hoc SQL statements within the Snowflake database. Moreover, it facilitates streaming processing in microbatches with minimal latency, ensuring timely data handling.
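Below is a small sketch of the Snowpark API for Python, which translates DataFrame-style operations into SQL executed inside Snowflake; the connection parameters and table are placeholders.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# Placeholder connection configuration for illustration only.
session = Session.builder.configs({
    "account": "my_account", "user": "my_user", "password": "...",
    "warehouse": "ANALYTICS_WH", "database": "SALES", "schema": "PUBLIC",
}).create()

# DataFrame operations are pushed down and executed as SQL in Snowflake.
top_customers = (
    session.table("orders")
    .group_by(col("customer_id"))
    .agg(sum_(col("amount")).alias("total_spend"))
    .sort(col("total_spend").desc())
    .limit(10)
)
top_customers.show()
```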

When will a particular cloud warehouse solution work best?
BigQuery serves as a highly developed, versatile data warehouse with a primary focus on internal business intelligence and reporting. Its serverless nature and seamless integration with the Google Cloud Platform (GCP) make it exceptionally well-suited for ad-hoc analytics and machine learning applications within the GCP ecosystem. However, it's important to note that while BigQuery's automated resource allocation simplifies management, it may not always be the optimal choice for operational use cases and data applications that demand consistent and predictable performance.
Databricks, as a robust Spark-based platform, excels in handling both batch and streaming data processing, setting it apart from other platforms. It offers extensive flexibility, allowing for the incorporation of additional libraries and low-level optimizations. Data scientists frequently use Databricks for machine learning tasks via its integrated notebooks. For low-latency queries, although it offers capabilities like the Delta cache, it lacks specialized indexes for achieving consistently low-latency performance.
Snowflake, in contrast, boasts a broad range of use cases that extend beyond traditional reporting and dashboards. Its decoupling of storage and compute allows diverse workloads to be isolated to meet specific SLAs, while also supporting high user concurrency. However, Snowflake may not be the ideal choice for interactive or ad-hoc queries that demand very low latency, given less efficient data access paths and a more limited range of indexing and query optimization options. Recently, Snowflake introduced Snowpipe Streaming, enabling ingestion latency typically between 1 and 3 seconds, thus substantially improving support for near-real-time operational scenarios.
Conclusion
In conclusion, the choice among BigQuery, Databricks, and Snowflake hinges on careful consideration of your specific requirements and use cases. Opt for BigQuery when you operate within Google Cloud and need a warehouse tailored to SQL-based processing. If you use a different cloud provider and want simplicity with both SQL and code-based processing, Snowflake stands out as an excellent choice. Databricks excels at more intricate workloads, particularly when Apache Spark and robust machine learning capabilities are essential. Thoroughly assess your workload types, performance criteria, and cloud provider preferences to arrive at an informed decision that aligns with your organization's objectives and available resources. Our team has a deep understanding of these platforms and can assist you in finding the perfect fit for your business; contact us to draw on that expertise.