
Normalization vs. Denormalization in BigQuery

Ben Makansi
6 minute read

When structuring a database, one of the first decisions you need to make is whether to normalize or denormalize your data. This decision shapes how you model, query, and maintain your database and has far-reaching implications for performance, integrity, and cost. In this guide, we'll walk through both normalization and denormalization, compare their trade-offs, and explain how BigQuery uniquely fits into this landscape.

What Is Normalization?

Normalization is the process of structuring a database to reduce redundancy and improve data integrity. You do this by dividing the data into smaller, related tables. Each distinct piece of data is stored only once and is referenced elsewhere using identifiers like customer_id or product_id.

Let’s consider a simple example to illustrate why normalization matters.

Imagine a table that records customer orders. It has four columns: order_id, customer_name, product_name, and price. Right away, you can spot inefficiencies. "Mike Smith" shows up more than once, and so does the product "Laptop." This redundancy means updates (e.g., correcting Mike's name or changing the laptop's price) have to be made in multiple places, increasing the risk of inconsistency.

To fix this, we break the data into several normalized tables:

  • Customers: Contains unique entries for each customer with their customer_id, name, and email.

  • Products: Stores product_id, product_name, and price, again ensuring no repetition.

  • Orders: Links everything together using order_id, customer_id, and order_date.

By linking tables through IDs, we avoid duplicating information. This improves integrity, makes updates easier, and keeps storage efficient.
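
As a rough sketch, that normalized schema could look like the following BigQuery DDL (the dataset name shop and the column types are assumptions for illustration):

CREATE TABLE shop.Customers (
  customer_id INT64,
  customer_name STRING,
  email STRING
);

CREATE TABLE shop.Products (
  product_id INT64,
  product_name STRING,
  price NUMERIC
);

CREATE TABLE shop.Orders (
  order_id INT64,
  customer_id INT64,  -- references Customers
  product_id INT64,   -- references Products
  order_date DATE
);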

Slide titled "Normalization at a glance" explaining that normalization reduces redundancy and improves data integrity by dividing data into related tables, ensuring each piece of data is stored in one place. The slide also highlights challenges: normalization can introduce performance overhead, require expensive JOIN operations, and slow down performance for large datasets. Includes icons representing databases and a warning symbol to reinforce key ideas visually.

The Drawbacks of Normalization

While normalization improves consistency, it introduces performance overhead during queries. Pulling a single view often requires JOINs across multiple tables. This can become expensive and slow, especially as datasets grow.

For example, retrieving order details with customer names and product prices might require a multi-table JOIN:

SELECT o.order_id, c.customer_name, p.product_name, p.price
FROM Orders o
JOIN Customers c ON o.customer_id = c.customer_id
JOIN Products p ON o.product_id = p.product_id;

With small datasets, this isn't a problem. But with millions of records, JOINs can become a bottleneck.

Enter Denormalization

Slide titled "Denormalization" explaining that denormalization involves reversing some normalization by combining related tables into one to reduce complex JOINs. The goal is to improve query performance by reducing the number of tables involved. Includes an abstract icon representing combining multiple tables into one.

Denormalization is the process of combining related tables into one. Instead of splitting data across several tables, you keep related information together, even at the cost of some redundancy.

The main advantage? Simpler and faster queries. You reduce or eliminate JOINs, which are often the most performance-intensive part of SQL queries.

But there’s a trade-off. Denormalized tables often duplicate data, which can make updates harder and increase storage usage. If a store name changes, for example, you now have to update every row where that name appears.
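
One common way to denormalize, sketched below, is to materialize a single wide table from the normalized ones with a one-time JOIN (CREATE TABLE AS SELECT). The table and dataset names follow the earlier example and are assumptions:

-- Build a denormalized orders table once, so later reads need no JOINs.
CREATE TABLE shop.OrdersDenormalized AS
SELECT
  o.order_id,
  o.order_date,
  c.customer_name,
  p.product_name,
  p.price
FROM shop.Orders o
JOIN shop.Customers c ON o.customer_id = c.customer_id
JOIN shop.Products p ON o.product_id = p.product_id;

Queries against this table read from a single place, at the cost of repeating customer and product details on every row.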

Denormalization in BigQuery

BigQuery flips the traditional conversation. Unlike many relational databases that strongly favor normalization, BigQuery encourages denormalization to improve both cost and performance.

Why? Several reasons:

  • Columnar Storage: BigQuery stores data by column, not row. This means it can scan only the columns needed for a query, making reads more efficient, even if the table is large and wide.

  • Minimized Need for JOINs: JOINs in BigQuery are expensive because they involve scanning and shuffling large datasets. Denormalization helps avoid this.

  • Nested and Repeated Fields: BigQuery supports complex types like STRUCT (nested records) and ARRAY (repeated fields), letting you pack related data into a single row. This is especially powerful for hierarchical data.

    Slide titled "Denormalization in BigQuery" explaining that BigQuery encourages denormalization to improve cost and performance. It highlights features like columnar storage, reduced need for JOINs, and support for nested tables and repeated fields using RECORD/STRUCT types and the UNNEST function. The slide includes icons for BigQuery, nested tables (represented by a Russian doll), cost savings (cheaper), and performance (faster), reinforcing the advantages of denormalization in BigQuery environments.

For example, consider this public weather dataset where each row represents a geographical point. Instead of storing hourly forecasts in a separate table, the forecast data is nested within each row. Each row contains an array of forecasted values for that point, one per hour.

Slide titled "Denormalized table example" showing a denormalized public weather dataset from NOAA in BigQuery. Each row represents a geographic point with nested forecast data, including hourly forecast times and temperatures at 2 meters above ground. Columns include creation_time, geography, forecast.hours, forecast.time, and forecasted temperature values. The nested structure illustrates how multiple forecast records are stored within a single row, eliminating the need for joins and improving query efficiency.

This structure allows queries like:

SELECT geography, forecast.time AS forecast_time, forecast.temperature_2m_above_ground AS temperature
FROM `bigquery-public-data.noaa_global_forecast_system.NOAA_GFS0P25`,
     UNNEST(forecast) AS forecast
WHERE forecast.hours = 10;

Using UNNEST, you flatten the nested data to extract temperature at a specific hour, without requiring joins. This structure keeps related data together while enabling flexible queries.

A Real-World Example: Forecast Data

Let’s say you're building a system to analyze temperature forecasts for different locations. You can either:

  • Normalize the data: One table for locations, another for hourly forecasts, linked by a location_id.

  • Denormalize the data: Store all hourly forecasts as a nested array inside a single table row for each location.

In BigQuery, the second approach is usually better. Columnar storage and nested fields allow you to scan just the temperature data for the required hour and location, without needing to shuffle data across tables. This results in cheaper and faster queries.
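
For contrast, here is a minimal sketch of the normalized alternative (table and column names are assumptions); every temperature lookup would then need a JOIN between the two tables:

CREATE TABLE weather.locations (
  location_id INT64,
  geography GEOGRAPHY
);

CREATE TABLE weather.hourly_forecasts (
  location_id INT64,  -- links back to weather.locations
  forecast_time TIMESTAMP,
  hours INT64,
  temperature_2m_above_ground FLOAT64
);

-- Retrieving a forecast now requires a JOIN:
SELECT l.geography, f.forecast_time, f.temperature_2m_above_ground
FROM weather.hourly_forecasts f
JOIN weather.locations l ON f.location_id = l.location_id
WHERE f.hours = 10;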

So When Do You Normalize?

Slide titled "When to choose: Normalization vs. Denormalization" comparing the two approaches. On the left, normalization is recommended when data integrity and update efficiency matter more than query speed, especially in systems where consistent, accurate data is a priority. On the right, denormalization is preferred when query performance is the main concern, such as in analytics use cases or read-heavy systems with large datasets where JOINs slow things down. Each approach is illustrated with its own abstract icon.

Despite BigQuery’s tilt toward denormalization, there are situations where normalization still makes sense, even in BigQuery.

Use normalization when:

  • Data integrity is critical

  • You frequently update reference data (e.g., store locations or customer contact info)

  • You need to avoid redundant updates across massive tables

Consider a retail company with millions of transactions. A store's address or contact details might change occasionally. If you've denormalized store info into every transaction record, each update requires rewriting millions of rows. That’s expensive and error-prone.

Instead, you can keep a separate Stores table and reference it in the TransactionLogs table using a store_id. This structure supports efficient updates with minimal risk of inconsistency.
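
A sketch of how that might look (the dataset and column names are hypothetical): updating a store touches exactly one row in Stores, while reads pick up store details through a JOIN on store_id.

-- Store details change in one place only.
UPDATE retail.Stores
SET address = '123 New Street'
WHERE store_id = 42;

-- Transactions reference store details through the store_id link.
SELECT t.transaction_id, t.transaction_date, s.store_name, s.address
FROM retail.TransactionLogs t
JOIN retail.Stores s ON t.store_id = s.store_id;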

Relevance to the GCP Professional Data Engineer Exam

This kind of scenario might appear on a GCP certification exam:

You're designing a BigQuery data model for a retail company. There’s one table for store details and another for transaction logs. The company frequently updates store details and wants to maintain data integrity. What’s the best design?

The answer: Normalize the data. Keep store information in a separate table and link it using a store_id. This ensures updates are efficient and avoids having to touch millions of transaction rows.

Even though BigQuery supports denormalization, normalization is more appropriate when:

  • Updates are frequent

  • Redundant writes are costly

  • You need to enforce consistent data


Summary: Choosing the Right Approach

Criteria           | Normalize                    | Denormalize
Data integrity     | High priority                | Less enforced
Update frequency   | Frequent updates             | Rare updates
Query performance  | Moderate to poor with joins  | Excellent with fewer joins
Dataset size       | Small to large               | Large, read-heavy datasets
Tool preference    | Traditional RDBMS            | BigQuery, analytics systems
Schema complexity  | More tables, more joins      | Fewer tables, nested structures

Ultimately, it's not a binary decision. Good data engineering means knowing when to use each strategy. BigQuery gives you the flexibility to choose, based on your specific needs.

The key is to understand the trade-offs and align your schema design with the real-world behavior of your system: how often the data changes, how it’s queried, and how much scale you expect.

In BigQuery, denormalization is often the default. But with the right use case, normalization still has a valuable role to play.
