BigQuery Denormalized Tables vs Star Schema: A Design Guide



If you're designing a data warehouse in BigQuery, one of your first architectural decisions involves choosing between wide denormalized tables and a star schema. This choice affects query performance, storage costs, and how easily your team can work with the data. Unlike traditional databases where normalization reduces redundancy and star schemas optimize for specific query patterns, BigQuery's columnar storage and distributed architecture change the calculation entirely.

The question of BigQuery denormalized tables vs star schema isn't about following database theory from textbooks. It's about understanding how BigQuery actually processes data and charges for it.

Understanding Wide Denormalized Tables

A denormalized table in BigQuery contains all related information in a single, wide structure. Instead of splitting customer orders across multiple tables (orders, customers, products, shipping addresses), you store everything together. Each row repeats information that would traditionally live in dimension tables.

Here's what a denormalized order table looks like:

CREATE TABLE ecommerce.orders_denormalized (
  order_id STRING,
  order_date DATE,
  customer_id STRING,
  customer_name STRING,
  customer_email STRING,
  customer_segment STRING,
  product_id STRING,
  product_name STRING,
  product_category STRING,
  product_subcategory STRING,
  quantity INT64,
  unit_price FLOAT64,
  total_amount FLOAT64,
  shipping_address STRING,
  shipping_city STRING,
  shipping_country STRING
);

The strength of this approach in BigQuery centers on simplicity and query performance. Users write straightforward SELECT statements without joins. Analysts don't need to understand complex table relationships. More importantly, BigQuery scans only the columns you reference in your query, so the width of the table matters less than you'd expect.

When you run a query asking for total revenue by product category, BigQuery reads just the product_category and total_amount columns. It doesn't touch customer information, shipping details, or any other fields. This columnar scanning makes wide tables surprisingly efficient.
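
If you want to confirm this behavior on your own tables, the INFORMATION_SCHEMA jobs views report how many bytes each query actually processed. A quick check might look like the following sketch, assuming your project runs in the US multi-region (swap the region qualifier for your own):

-- Inspect recent queries and the bytes each one actually processed
SELECT
  creation_time,
  query,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;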

The Cost of Denormalization

Wide denormalized tables have clear drawbacks. Storage redundancy increases costs because customer information repeats with every order. A customer who places 100 orders has their name, email, and segment duplicated 100 times.

More critically, updates become expensive and complicated. When a customer changes their email address, you need to update potentially thousands of rows. In traditional databases, you'd update one row in a customer dimension table. In a denormalized structure, you're modifying every order record for that customer.

Here's the problem in practice:

-- Updating customer email in denormalized table
UPDATE ecommerce.orders_denormalized
SET customer_email = '[email protected]'
WHERE customer_id = 'CUST_12345';
-- This could update thousands of rows
-- BigQuery charges for bytes scanned AND modified

BigQuery bills for the amount of data processed. UPDATE statements scan the entire table to find matching rows, then modify them. For large denormalized tables with frequent updates, this becomes prohibitively expensive.

Data inconsistency also emerges as a risk. If an update fails partway through or multiple processes update the same customer simultaneously, you end up with conflicting information across rows.

Star Schema: The Dimensional Approach

A star schema separates data into fact tables (containing metrics and foreign keys) and dimension tables (containing descriptive attributes). The fact table stays narrow and focused on measurements. Dimension tables hold the context.

-- Fact table
CREATE TABLE ecommerce.fact_orders (
  order_id STRING,
  order_date DATE,
  customer_key INT64,
  product_key INT64,
  quantity INT64,
  unit_price FLOAT64,
  total_amount FLOAT64
);

-- Dimension tables
CREATE TABLE ecommerce.dim_customers (
  customer_key INT64,
  customer_id STRING,
  customer_name STRING,
  customer_email STRING,
  customer_segment STRING
);

CREATE TABLE ecommerce.dim_products (
  product_key INT64,
  product_id STRING,
  product_name STRING,
  product_category STRING,
  product_subcategory STRING
);

This design eliminates redundancy. Customer information exists once. Updates happen in one place. The data model reflects business entities clearly, making it easier to maintain data quality and consistency.

Star schemas excel when dimension tables change frequently or when you need to maintain historical accuracy through slowly changing dimensions. You can track how customer segments evolve over time without modifying fact records.
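
A Type 2 slowly changing dimension is the usual way to capture that history. A minimal sketch, extending the customer dimension shown above with validity columns (the column names are illustrative):

-- Type 2 slowly changing dimension: each attribute change adds a new row
-- with its own surrogate key and validity window
CREATE TABLE ecommerce.dim_customers_scd2 (
  customer_key INT64,        -- surrogate key, unique per row version
  customer_id STRING,        -- durable business key
  customer_name STRING,
  customer_email STRING,
  customer_segment STRING,
  valid_from DATE,
  valid_to DATE,             -- NULL while this row is the current version
  is_current BOOL
);

Each fact row references the customer_key that was current when the order was placed, so historical reports reflect the segment the customer belonged to at that time.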

How BigQuery Changes the Star Schema Equation

Traditional data warehouses built star schemas primarily for query performance. Joins were expensive, so denormalization helped certain access patterns. But in BigQuery, the performance characteristics flip.

BigQuery processes joins differently than row-based databases. It distributes both fact and dimension tables across thousands of machines. When you join tables, BigQuery can broadcast smaller dimension tables to every node processing the fact table. For reasonably sized dimensions (under 10 GB), this broadcast join performs exceptionally well.

The real consideration becomes storage cost versus query cost. Star schemas store less data overall but require joins. Those joins scan additional tables, increasing the bytes processed per query. Denormalized tables store more data but eliminate join overhead.

BigQuery's columnar storage also means that wide tables don't slow down queries the way they would in traditional systems. You only pay for columns you actually use.

Real-World Scenario: E-commerce Analytics

Consider an e-commerce company with 50 million orders per year. The orders connect to 2 million customers and 100,000 products. Let's compare both approaches.

Denormalized approach: Each order row contains customer details (name, email, segment, registration date, lifetime value) and product details (name, category, brand, supplier). The table has 40 columns and stores 50 million rows annually.

Average row size: 1 KB
Annual storage: 50 GB
Five years of data: 250 GB

Star schema approach:

Fact table: 10 columns, 50 million rows, 300 bytes per row = 15 GB annually
Customer dimension: 2 million rows, 500 bytes = 1 GB
Product dimension: 100,000 rows, 400 bytes = 40 MB
Five years: 75 GB facts + 1 GB customers + 40 MB products = 76 GB total

The star schema uses roughly 30% of the storage. At BigQuery's storage pricing of $0.02 per GB per month, the denormalized approach costs about $5/month while the star schema costs $1.52/month. The difference is small in absolute terms but scales with data volume.

Now consider queries. An analyst runs this query 100 times per day:

-- Star schema version
SELECT 
  p.product_category,
  SUM(f.total_amount) as revenue
FROM ecommerce.fact_orders f
JOIN ecommerce.dim_products p ON f.product_key = p.product_key
WHERE f.order_date >= '2025-01-01'
GROUP BY p.product_category;

This scans the order_date, total_amount, and product_key columns from the fact table (three columns out of ten, processing about 5 GB for one year of data) plus the entire small products dimension table (40 MB). Total: approximately 5.04 GB per query.

The denormalized version:

-- Denormalized version
SELECT 
  product_category,
  SUM(total_amount) as revenue
FROM ecommerce.orders_denormalized
WHERE order_date >= '2025-01-01'
GROUP BY product_category;

This scans only product_category, total_amount, and order_date (three columns from a 40-column table). For one year: approximately 3.75 GB per query.
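
Both per-query figures assume the tables are partitioned on order_date, so the date filter prunes older data and only one year of those columns is read; without partitioning, the scan would cover all five years. A minimal sketch of the partitioned version (the table name is illustrative):

-- Partitioning on order_date lets the date filter prune older partitions
-- instead of scanning all five years of data
CREATE TABLE ecommerce.orders_denormalized_partitioned
PARTITION BY order_date
AS
SELECT * FROM ecommerce.orders_denormalized;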

Over 100 daily queries, the denormalized approach scans 375 GB per day versus 504 GB for the star schema, or roughly 11.25 TB versus 15.12 TB over a 30-day month. At $5 per TB scanned, that works out to about $56 per month versus $76. The denormalized table actually reduces query costs by roughly 25% for this access pattern.

Decision Framework for BigQuery Denormalized Tables vs Star Schema

Choose denormalized tables when:

- Query patterns focus on a core set of metrics with various dimensional filters.
- Users primarily slice and aggregate fact data rather than deeply analyzing dimension attributes.
- Dimension data changes infrequently, or updates can happen through scheduled batch rebuilds (see the sketch after this list).
- Storage costs are acceptable given the data volume.
- Query simplicity matters for analyst productivity and reduces the risk of incorrect joins.
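
For the scheduled batch rebuild case, one common pattern is to keep small normalized source tables as the system of record and periodically regenerate the wide table from them instead of issuing row-level UPDATEs. A rough sketch, assuming hypothetical ecommerce.raw_orders, ecommerce.customers, and ecommerce.products source tables:

-- Nightly rebuild: regenerate the wide table from normalized sources so
-- customer and product changes land in one pass instead of per-row UPDATEs
CREATE OR REPLACE TABLE ecommerce.orders_denormalized AS
SELECT
  o.order_id,
  o.order_date,
  c.customer_id,
  c.customer_name,
  c.customer_email,
  c.customer_segment,
  p.product_id,
  p.product_name,
  p.product_category,
  p.product_subcategory,
  o.quantity,
  o.unit_price,
  o.total_amount,
  o.shipping_address,
  o.shipping_city,
  o.shipping_country
FROM ecommerce.raw_orders o
JOIN ecommerce.customers c ON o.customer_id = c.customer_id
JOIN ecommerce.products p ON o.product_id = p.product_id;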

Choose star schema when:

- Dimension tables update frequently and you need those updates reflected immediately.
- You maintain slowly changing dimensions to track historical attribute changes.
- Multiple fact tables share the same dimensions, making storage efficiency critical.
- Governance requires clear separation between transactional facts and master data.
- Complex dimension hierarchies and attributes benefit from dedicated tables.
- Dimension tables are large enough that denormalizing them would dramatically increase storage costs.

Use a hybrid approach when:

- Some dimensions are small and rarely change (denormalize these), while others are large or volatile (keep these separate).
- You can denormalize high-use dimensions into fact tables while maintaining separate dimension tables for detailed analysis.
- Nested and repeated fields in BigQuery allow you to embed related records without full denormalization (see the sketch after this list).
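
Here is a minimal sketch of that nested pattern (the table and field names are illustrative): customer attributes are stored once per order, while line items live in a repeated STRUCT, and queries that UNNEST the line items still scan only the leaf columns they reference.

-- One row per order; line items embedded as a repeated STRUCT
CREATE TABLE ecommerce.orders_nested (
  order_id STRING,
  order_date DATE,
  customer STRUCT<
    customer_id STRING,
    customer_name STRING,
    customer_segment STRING
  >,
  line_items ARRAY<STRUCT<
    product_id STRING,
    product_name STRING,
    product_category STRING,
    quantity INT64,
    unit_price FLOAT64
  >>
);

-- Revenue by category, flattening the repeated line items
SELECT
  li.product_category AS product_category,
  SUM(li.quantity * li.unit_price) AS revenue
FROM ecommerce.orders_nested, UNNEST(line_items) AS li
WHERE order_date >= '2025-01-01'
GROUP BY product_category;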

Google Cloud Certification Exam Relevance

This topic appears frequently in the Professional Data Engineer and Professional Cloud Architect certifications. Both exams test your ability to design cost-effective, performant data solutions.

A typical exam scenario might read:

"Your company processes 100 million daily transactions. Analysts query this data to generate reports on customer behavior, requiring joins between transactions, customers, and products. Customer records update daily with new preferences and segments. Which design minimizes costs while maintaining data freshness?"

The correct answer would be a star schema. The daily customer updates make denormalization impractical because you'd need to update 100 million transaction records daily. The star schema allows you to update the customer dimension once while leaving transaction facts unchanged.
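
Using the dimension table defined earlier, that daily refresh touches one row per changed customer rather than millions of fact rows (the key value below is illustrative):

-- One row in the dimension changes; the fact table is untouched
UPDATE ecommerce.dim_customers
SET customer_segment = 'premium'
WHERE customer_key = 12345;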

Conversely, if the scenario stated that customer and product data rarely changes, and analysts primarily run aggregation queries filtered by product category and customer segment, the denormalized approach would be more cost-effective. It reduces query complexity and lowers the bytes scanned per query.

The Associate Cloud Engineer exam covers this at a higher level, focusing on understanding when to use BigQuery versus other storage options. The Machine Learning Engineer certification touches on this when designing feature stores and training data pipelines, where denormalized tables often simplify feature extraction.

Conclusion

The choice between BigQuery denormalized tables and star schema depends entirely on your specific use case. BigQuery's architecture makes both approaches viable, unlike traditional databases where normalization was almost always preferred.

Denormalized tables offer query simplicity and can reduce scanning costs when your queries focus on core metrics with dimensional filters. Star schemas provide storage efficiency, easier updates, and clearer data governance when dimensions change frequently or multiple fact tables share common dimensions.

The best data engineers don't pick one approach dogmatically. They analyze query patterns, update frequencies, data volumes, and cost projections. They understand that BigQuery's columnar storage weakens the old performance case against wide tables, just as its distributed joins weaken the old performance case against star schemas. Sometimes they even blend approaches, denormalizing frequently queried dimensions while keeping others separate.

What matters most is making an informed decision based on how your data actually behaves and how your users actually work with it. That's the difference between following best practices blindly and engineering thoughtfully for the platform you're using.
