In an attempt to improve business, organizations are processing more and more data. Many valuable data sources contain resources without strict schema, and are represented by semi-structured data such as JSON. This type of data requires a different approach to its storage and analysis. Google BigQuery, a serverless data warehouse, contains the JSON data type designed to store and query semi-structured data in a structured and queryable format, as well as offering advanced capabilities for handling such data efficiently.
This article aims to explore the representation, storage, and processing of semi-structured data in BigQuery, highlighting the benefits of Google Cloud Platform services specifically for data engineers, analysts and data scientists.
What is Semi-Structured Data?
Semi-structured data refers to a type of data that does not have a rigid schema but possesses some organizational properties that make it easier to analyze than unstructured data. This data can contain hierarchical or paired elements that are partially ordered but do not fit neatly into traditional relational database structures. Examples of semi-structured data include JSON (JavaScript Object Notation) and XML, which can feature nested and repeatable structures.
One of the key ways to use such data in BigQuery is with JSON-type columns. This classic data representation in particular allows direct management of information in this commonly used format.
What further distinguishes BigQuery is its advanced capabilities for managing semi-structured data through nested and repeated columns. Taken together, these techniques allow for a more complex presentation of data, where hierarchies and relationships between elements are directly reflected in the table structure. This approach not only simplifies queries, but also increases processing efficiency through better resource utilization.
I will now give a general definition for JSON data representation and the repeated and nested column representations in BigQuery, and what they offer us.
Harnessing Semi-Structured Data in BigQuery: Insights and Innovations for Modern Data Warehouses
JSON Data Formats in BigQuery
The JSON data type in BigQuery is designed to store and manage JSON data directly. By using the JSON data type, users can store JSON data directly in a table, and can easily query and extract items from the JSON field using BigQuery's built-in JSON functions.
In practice, querying JSON data in BigQuery is facilitated by functions such as JSON_QUERY, JSON_VAL. These capabilities make it possible to seamlessly integrate JSON data into standard SQL queries.
The Benefits of JSON Data Type in BigQuery
Flexibility in data handling: JSON allows for the management of flexible and semi-structured data within a BigQuery table. This is especially useful in situations where the data schema might evolve over time or when dealing with data that has a nested or hierarchical structure. This flexibility works well with scenarios that traditional relational databases might struggle to accommodate.
Flexibility in data managing: It simplifies the process of integrating, querying, and managing JSON data, reducing the need for preprocessing or transforming JSON data into a strictly relational format before analysis. Thus, it offers greater agility and efficiency in managing dynamic and semi-structured data sets.
Dynamic schema capabilities: Using the JSON format in BigQuery supports dynamic schemas, which means that fields can be added or modified without needing to restructure the entire database. This adaptability is beneficial in environments with variable and rapidly changing data inputs, helping to keep data management processes efficient and responsive to changes in data structure and content.
Alternative for JSON data type
Without a dedicated JSON data type, it is still possible to store data as either strings or repeated and nested data, however each of which has its limitations.
Storing JSON as strings. This approach introduces several problems:
- increased parsing overhead - each query against a JSON string requires parsing the string back into a JSON format, incurring computational overhead and potentially slowing down query performance,
- loss of structure - when JSON data is stored as a string, the inherent hierarchical structure of JSON is flattened, making it difficult to perform structured queries efficiently,
- limited query capabilities - SQL operations on string-based JSON are cumbersome and limited, lacking the depth of functionality provided by native JSON functions like JSON_QUERY,
- inefficient data processings - storing JSON as strings can result in significant resource consumption, especially with large datasets. This method is not scalable, as increasing data volume exponentially increases the resources needed for data retrieval and manipulation.
Storing JSON as repeated and nested data types. This representation has many strong features, which I will discuss later in the article. However, it also has its own set of limitations:
- fixed schema requirement - unlike JSON's schema-less nature, STRUCTs require a predefined schema, which must be strictly adhered to. This rigidity can negate the benefits of JSON’s flexibility, as any change in data structure necessitates schema modifications.
- complexity in schema evolution - as the data evolves, so must the schema, which can become a complex and maintenance-heavy process, especially in fast-changing data environments.
Nested and Repeated Fields in BigQuery
In Google BigQuery, semi-structured data can be also managed through specific column types, namely repeated columns, nested columns, and a combination of both.
Repeated columns (ARRAY)
Repeated columns, referred to as ARRAYs in BigQuery, enable the storage of multiple values of the same data type within a single row. This setup avoids the need to flatten these data points into separate rows, thus preserving the structure and enabling more efficient queries.
Nested columns (STRUCT)
Nested columns, defined as STRUCTs in BigQuery, allow the incorporation of a set of related fields grouped as a sub-record within a single row. This facilitates the efficient representation and analysis of complex data types, ideal for handling data structures where hierarchical data is integrated seamlessly within a single row of a larger table. Nested columns streamline data manipulation and querying by maintaining a clear, hierarchical structure of data that mirrors its real-world relationships.
Combination of nested and repeated columns (ARRAY of STRUCTs)
Each element of the array is a complex object with multiple fields, supporting intricate data models where each array element, while consistent in structure, can hold varied data content. This configuration is crucial for maintaining data integrity in scenarios involving multiple instances of related data types, such as multiple contact details for a single entity.
The Benefits of These Structures in BigQuery
Utilizing these advanced data structures in BigQuery presents several key advantages:
User-Friendly Data Access: By using nested (STRUCT) and repeated (ARRAY) columns, BigQuery simplifies data access for users. These structures enable a more intuitive layout of data that mirrors real-world hierarchical relationships, such as a product with multiple categories and variants.
This resemblance aids users in understanding and navigating datasets more naturally, thereby reducing the learning curve associated with complex data operations.
Flexibility in data modeling: the ability to dynamically adapt data structures, without extensive schema redesigns offers unmatched flexibility in data modeling, crucial for rapidly evolving data needs. Users can add, modify, or reorganize data fields in a STRUCT without affecting the entire schema, providing a robust foundation for growth and adaptation.
Simplified data management: these structures simplify the management of complex and hierarchical data, making it easier for data professionals to maintain, query, and analyze large datasets.
Enhanced query performance: the structure of data in nested and repeated formats aligns more naturally with the inherent structure of semi-structured data, reducing the need for complex joins and transformations. This results in faster and more efficient query execution.
Maximizing Query Performance with Semi-Structured Data in BigQuery
The following questions have sometimes been asked. Does the organization of semi-structured data negatively affect data performance? Doesn't the unique way of packing and storing information as JSON type or both nested and repeated columns cause problems when retrieving information from BigQuery data storage and during data modification, addition, and deletion operations?
To answer these questions, it is necessary to discuss the columnar storage format used by BigQuery and then how this translates into the use of semi-structured data in this format.
Columnar storage format in BigQuery
BigQuery architecture utilizes a columnar storage format called Capacitor to efficiently store, manipulate and retrieve large amounts of data. Unlike traditional row-based storage, which requires reading entire rows to access needed data, columnar storage allows for reading and touching only the specific columns that are relevant to a query.
Details of the columnar storage format can be found here: Inside Capacitor, BigQuery’s next-generation columnar storage format and BigQuery Admin reference guide: Storage internals. However, the following are the most important features:
- Storage Efficiency: Each column is stored independently, allowing for effective compression and encoding. Data of similar types is stored sequentially, enhancing compression and reducing I/O operations, thus lowering storage costs.
- Query Performance: This format enables selective reading of only the columns relevant to a query, minimizing the data read from disk. This is particularly beneficial for analytics involving large datasets.
Handling Semi-Structured Data in Columnar Storage Format
BigQuery manages JSON type data, nested columns (STRUCT), and repeated columns (ARRAY) effectively within its columnar storage system. This method enhances the handling of semi-structured data by maintaining the structural hierarchy of JSON objects and applying advanced compression and encoding schemes.
JSON type columns storage
When JSON data is ingested into BigQuery, it is stored in a STRING column, but the data in the system is shredded into virtual columns to the extent possible. This treatment is crucial for several reasons:
Virtual Column Creation: Upon ingestion, BigQuery does not just store JSON data as a simple string. Instead, it interprets the keys in the JSON document as virtual columns. This means that even though the physical storage is compact and efficient as a single string, the logical view presents these keys as separate columns. This allows users to interact with the data as if it were structured, without the need for manual parsing or extraction in queries.
Query Optimization: By treating JSON keys as virtual columns, BigQuery allows users to directly query these keys using standard SQL syntax, which simplifies data access and increases the speed of query execution. The underlying engine is optimized to access only the necessary portions of the JSON string that correspond to the specified keys, enhancing performance and reducing unnecessary data scans.
More details you can find in Unlocking the power of semi-structured data with the JSON Type in BigQuery.
Nested columns (STRUCT) storage
Each nested field within a STRUCT is treated as an independent entity but stored within the same column. This organization preserves the original data structure and benefits from columnar storage efficiencies.
The figure 1 shows the main idea: all the values of a nested field, such as A.B.C, utilize contiguous storage. Therefore, the values of A.B.C can be retrieved without reading A.E, A.B.D and so on.
Repeated columns (ARRAY) storage
Repeated columns are stored as ARRAY data types, managed as a single column that contains multiple values, allowing for efficient storage and retrieval. The ARRAY structure facilitates operations that need to access specific elements within the list, preserving the order and allowing for the storage of duplicate values.
More details can be found in Dremel: Interactive Analysis of Web-Scale Datasets.
Data Modification and Efficiency
When it comes to modifying semi-structured data, BigQuery's approach offers substantial flexibility and efficiency. Adding, deleting, or changing the structure of JSON data does not require a restructuring of the entire database. Instead, modifications are handled at the column level. For example, adding a new field to a JSON object involves adding a new column to the storage, which can be done without impacting any existing data structures. Similarly, deleting a field removes the corresponding column. These operations are efficient because they avoid the costly process of data rewriting that traditional row-based storages might necessitate.
Query performance and cost reduction
Furthermore, the columnar storage model supports efficient compression and encoding schemes. Because each column contains only one type of data (e.g., integers, timestamps), it can be encoded more efficiently than a traditional row-based format.
This not only improves query performance but also reduces storage costs and speeds up data retrieval operations, making it ideal for managing the large and diverse datasets typically used in machine learning and data analysis.
Dremel Execution Engine
Google BigQuery uses the Dremel execution engine to optimize the way data is stored and queried. Dremel is capable of handling semi-structured data by breaking down a complex query on nested data into simpler queries, which are then executed across an immense computational infrastructure. The results are then aggregated back to form the final answer. Further details can be found in Dremel: Interactive Analysis of Web-Scale Datasets.
Performance Impact
Rather than slowing down performance, the nested and repeated fields can actually enhance query performance for certain types of queries. For instance, when queries need to filter or aggregate data based on conditions within these nested arrays or records, having direct access to this structured format can significantly speed up these operations as compared to flattening these structures into a relational format.
I should, of course, add that BigQuery offers many other additional performance-relevant features in the context of semi-structured data, but the limitations of a reasonable article size compels me to give only a few of the more important ones:
- Smart Caching
- Streaming Data Ingestion
- Dynamic Load Balancing
- and others
Practical Applications of Semi-Structured Data in Machine Learning
Finally, I would like to give some examples of the use of semi-structured data when a data scientist is working on a model and the data taken into the model learning process exhibits the characteristics of semi-structured data or has a complex structure, e.g. with an extended hierarchy..
This chapter explores how these data formats can significantly enhance the machine learning workflow, especially for data scientists who may not be proficient in advanced SQL.
Iterative Modeling and Experimentation
Machine learning involves iterative processes where rigorous testing is conducted to refine algorithms and improve model accuracy. Semi-structured data formats in BigQuery provide flexible and dynamic access to data, essential for running these multiple trials. Data scientists can experiment with different feature sets and model configurations without the constraints commonly imposed by more rigid data structures.
Advanced Aggregation and Sampling Techniques
Semi-structured data facilitates complex aggregation and sampling operations that are commonplace in machine learning. For instance, features for a model might need to be aggregated at various granularities or sampled differently across various segments of the data. The query capabilities of BigQuery allow for sophisticated data manipulation like calculating maxima, averages, or conditional selections directly within the database, streamlining the data preparation phase.
Integration of Diverse Data Types
The challenge of integrating heterogeneous data sources is streamlined by the use of semi-structured data. Whether dealing with timestamps, text, or numerical data as well as BigQuery's flexible data structures allow these varied types to be housed within the same dataset or even the same table. This capability is particularly beneficial when data scientists are working with datasets that incorporate a wide range of data types, enhancing the richness of the data available for model training.
Preserving Data Relationships
The nested and repeated structures in semi-structured data naturally maintain the inherent relationships within the data. For example, in a retail scenario, a transaction record might include a list of products purchased, each with its own set of attributes. These relationships are preserved without flattening the structure into separate tables, thus maintaining contextual integrity which is crucial for building predictive models that accurately reflect real-world scenarios.
Flexibility in data exploration and transformation
Data scientists often experiment with different ways of processing data before building the final model. Thanks to nested and repeated structures, they can easily manipulate, filter, and transform data at different levels of detail without the need for repeated processing or storing in multiple tables. This enables more dynamic and less costly experimentation.
Enhanced Data Management and Query Optimization
Handling large datasets becomes more manageable with semi-structured data due to the reduced need for additional tables or complex joins. BigQuery optimizes query performance for these data types using advanced techniques such as columnar storage and smart caching. For data scientists, this means faster access to data and more efficient computational operations, facilitating quicker insights into models.
Flexible Data Exploration and Transformation
The versatility of semi-structured data allows data scientists to explore and manipulate data in innovative ways. Nested and repeated structures can be easily expanded or restructured as needed during the exploratory analysis phase, enabling more creative and effective feature engineering and preprocessing strategies.
Conclusion
Google BigQuery offers robust solutions for managing semi-structured data, such as JSON, that transcend the capabilities of traditional relational databases. Its innovative use of JSON-type columns, alongside nested and repeated data structures, allows for more intuitive and efficient data querying and storage. This is further enhanced by BigQuery's columnar storage architecture, which maximizes data retrieval efficiency and minimizes operational costs. The platform's flexibility in handling dynamic schemas and its ability to integrate seamlessly with machine learning workflows make it an invaluable tool for data engineers, analysts, and scientists. By harnessing the power of BigQuery, organizations can optimize their data management practices and unlock new insights from their semi-structured data assets.