BigQuery Nested Fields vs Flat Tables: When to Use Each


Every data engineer working with BigQuery eventually faces this question: should you normalize your data into multiple flat tables joined with foreign keys, or denormalize it into a single table with nested STRUCT and ARRAY fields? Choosing between BigQuery nested fields and flat tables affects query performance, storage costs, and how easily your team can extract insights from data.

Traditional relational databases trained us to normalize everything. Split orders from customers. Put line items in their own table. Create junction tables for many-to-many relationships. This approach reduces redundancy and maintains referential integrity, but it comes with a cost that BigQuery's columnar architecture handles differently than row-based systems.

Understanding when to flatten and when to nest requires looking beyond database theory and examining how BigQuery actually executes queries and stores data.

Flat Tables: The Traditional Relational Approach

Flat tables organize data in normalized form, where each entity gets its own table connected through foreign key relationships. A typical ecommerce schema might have separate tables for customers, orders, and order_line_items, each containing scalar values in every column.

Here's what that looks like:

-- customers table
customer_id | name           | email                      | state
1           | Alice Chen     | alice.chen@example.com     | CA
2           | Bob Martinez   | bob.martinez@example.com   | NY

-- orders table
order_id | customer_id | order_date
101      | 1           | 2025-09-15
102      | 1           | 2025-09-20
103      | 2           | 2025-09-18

-- order_line_items table
line_id | order_id | product_name | quantity | price
1       | 101      | Widget A     | 2        | 29.99
2       | 101      | Widget B     | 1        | 49.99
3       | 102      | Widget C     | 3        | 19.99

This structure offers clear benefits. Each fact appears exactly once, making updates simple and preventing inconsistencies. Adding a new order line item requires inserting one row. Analytics teams familiar with SQL can join tables using patterns they already know. The schema clearly documents relationships between entities.

Flat tables work well when your queries mostly filter and aggregate on a single table, when different teams need access to different entities independently, or when you're migrating from a traditional RDBMS and want to preserve existing query patterns.

Drawbacks of Flat Tables in BigQuery

The normalized approach shows its limitations when you need to reconstruct full business objects. Consider finding all products purchased by customers in California:

SELECT 
  c.name,
  oli.product_name,
  oli.quantity
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_line_items oli ON o.order_id = oli.order_id
WHERE c.state = 'CA'

This query requires BigQuery to perform two joins, reading data from three separate tables. Each join produces intermediate results that must be shuffled between slots in BigQuery's distributed execution engine. With billions of rows, those shuffles consume significant resources and increase query costs.

BigQuery's on-demand pricing charges by bytes processed. When tables are normalized, you often scan more data than necessary because joins must read the join-key columns even when you only need a few fields in the final result. The query planner has to coordinate reads across multiple tables, and the storage layer fetches columns from different physical locations.
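
You can verify this empirically: the INFORMATION_SCHEMA jobs views report bytes processed per query, which makes it easy to compare a joined query against a single-table equivalent. A minimal sketch, assuming your data lives in the US multi-region (adjust the region qualifier otherwise):

-- Most expensive queries over the last day, ranked by bytes processed
SELECT
  query,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10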

Flat tables also create schema management challenges. If your application needs order data with line items, you must maintain synchronized writes across multiple tables. This requires careful transaction handling or accepting eventual consistency, neither of which fits naturally into BigQuery's batch-oriented architecture.

Nested Fields: BigQuery's Denormalized Structures

BigQuery supports STRUCT and ARRAY types that let you embed related data directly within parent records. A STRUCT groups multiple fields together like a JSON object, while an ARRAY holds repeated values of the same type. Combining them creates hierarchical data structures.
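
A quick illustration of both types using literal values (the data here is made up):

-- A STRUCT groups named fields; an ARRAY holds repeated values of one type
SELECT
  STRUCT('Alice Chen' AS name, 'CA' AS state) AS customer,
  [29.99, 49.99, 19.99] AS item_prices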

The same ecommerce data using nested fields looks like this:

CREATE TABLE orders_nested (
  order_id INT64,
  order_date DATE,
  customer STRUCT<
    customer_id INT64,
    name STRING,
    email STRING,
    state STRING
  >,
  line_items ARRAY<STRUCT<
    product_name STRING,
    quantity INT64,
    price FLOAT64
  >>
)
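
Rows are written with STRUCT constructors and array literals. Here's a hedged sample insert using made-up values:

-- One order row carrying its customer details and two line items
INSERT INTO orders_nested (order_id, order_date, customer, line_items)
VALUES (
  101,
  DATE '2025-09-15',
  STRUCT(1 AS customer_id, 'Alice Chen' AS name, 'alice.chen@example.com' AS email, 'CA' AS state),
  [
    STRUCT('Widget A' AS product_name, 2 AS quantity, 29.99 AS price),
    STRUCT('Widget B' AS product_name, 1 AS quantity, 49.99 AS price)
  ]
)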

Now each order row contains complete customer information and all associated line items. Querying becomes dramatically simpler:

SELECT 
  customer.name,
  line_item.product_name,
  line_item.quantity
FROM orders_nested,
UNNEST(line_items) as line_item
WHERE customer.state = 'CA'

This query reads from a single table. No joins required. BigQuery's columnar storage means it only scans the specific nested fields you reference, ignoring unneeded columns entirely. The UNNEST operation flattens the array for processing but happens locally within each slot without network shuffles.

The performance difference becomes stark with large datasets. Nested structures keep related data physically close on disk, reducing I/O and improving cache efficiency. Queries execute faster and cost less because you process fewer bytes.

How BigQuery Changes the Calculus

Traditional databases optimized for transactional workloads punish denormalization. Updates become expensive when the same customer data repeats across thousands of order rows. Storage costs multiply. Maintaining consistency requires complex application logic.

BigQuery flips these assumptions. As an append-only columnar analytics database, it handles denormalization elegantly. You rarely update historical orders. Instead, you insert new records or rewrite entire partitions. Repeated customer data within a STRUCT compresses efficiently because BigQuery's columnar format stores each field separately, and compression algorithms excel at repeated values.
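
For example, correcting one day of data typically means replacing that day's partition in bulk rather than touching individual rows. A minimal sketch, assuming the corrected rows sit in a hypothetical staging table called orders_nested_staging:

-- Drop the affected day, then reload it in one pass from staging
DELETE FROM orders_nested
WHERE order_date = '2025-09-15';

INSERT INTO orders_nested (order_id, order_date, customer, line_items)
SELECT order_id, order_date, customer, line_items
FROM orders_nested_staging
WHERE order_date = '2025-09-15';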

The columnar architecture means BigQuery only reads columns your query actually uses. With nested fields, those columns stay together physically. The query optimizer can push predicates into nested structures, filtering arrays before unnesting them. This minimizes data movement and processing.
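
For example, a condition on array contents can be written as a subquery over the array, so rows are filtered out before anything is unnested in the outer query. A sketch against the orders_nested table above:

-- Keep only orders containing at least one line item priced over $100
SELECT
  order_id,
  customer.name
FROM orders_nested
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(line_items) AS li
  WHERE li.price > 100
)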

BigQuery also provides features that make nested structures practical at scale. You can partition tables containing nested fields just like flat tables, limiting scans to relevant date ranges. Clustering organizes data around frequently filtered columns, though clustering columns must be top-level and non-repeated, so values you filter on constantly are usually kept outside the structs. Marking struct fields as REQUIRED enforces NOT NULL constraints, recovering some of the schema rigor a normalized design provides.
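
A hedged sketch combining partitioning and clustering with nested fields; the table name is hypothetical, and customer_id is kept at the top level so it can serve as a clustering column:

-- Partitioned by day, clustered on a top-level customer_id;
-- everything else stays nested inside the row
CREATE TABLE orders_nested_clustered (
  order_id INT64,
  order_date DATE,
  customer_id INT64,
  customer STRUCT<name STRING, email STRING, state STRING>,
  line_items ARRAY<STRUCT<product_name STRING, quantity INT64, price FLOAT64>>
)
PARTITION BY order_date
CLUSTER BY customer_id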

Real-World Scenario: Event Analytics Platform

Imagine building an analytics platform for a mobile gaming company. The platform tracks player events: logins, level completions, item purchases, and social interactions. Each event has common metadata like timestamp and player ID, plus event-specific attributes.

With flat tables, you might create separate tables for each event type:

-- Four separate tables
player_logins (timestamp, player_id, device_type, session_id)
level_completions (timestamp, player_id, level_id, score, duration_seconds)
item_purchases (timestamp, player_id, item_id, price, currency)
social_interactions (timestamp, player_id, friend_id, interaction_type)

Analyzing player behavior requires complex queries unioning across all event types, then joining back to player dimensions. Finding all actions by players who made a purchase last week means scanning the purchases table, extracting player IDs, then querying each event type with those IDs.
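
With the four-table layout, that analysis ends up looking something like the sketch below (column names follow the layout above; treat it as illustrative rather than production SQL):

-- Flat-table version: find last week's purchasers, then union every
-- event table to count their actions by type
WITH recent_purchasers AS (
  SELECT DISTINCT player_id
  FROM item_purchases ip
  WHERE ip.timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
),
all_events AS (
  SELECT player_id, 'login' AS event_type FROM player_logins
  UNION ALL
  SELECT player_id, 'level_completion' FROM level_completions
  UNION ALL
  SELECT player_id, 'purchase' FROM item_purchases
  UNION ALL
  SELECT player_id, 'social' FROM social_interactions
)
SELECT
  e.event_type,
  COUNT(*) AS event_count
FROM all_events e
JOIN recent_purchasers p ON e.player_id = p.player_id
GROUP BY e.event_type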

Using nested fields, you create a unified events table:

CREATE TABLE player_events (
  event_timestamp TIMESTAMP,
  player_id INT64,
  event_type STRING,
  event_data STRUCT<
    login STRUCT<device_type STRING, session_id STRING>,
    level_completion STRUCT<level_id INT64, score INT64, duration_seconds INT64>,
    purchase STRUCT<item_id STRING, price FLOAT64, currency STRING>,
    social STRUCT<friend_id INT64, interaction_type STRING>
  >
)

Now a single scan answers complex behavioral questions:

WITH recent_purchasers AS (
  SELECT DISTINCT player_id
  FROM player_events
  WHERE event_type = 'purchase'
    AND event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
SELECT 
  e.event_type,
  COUNT(*) as event_count
FROM player_events e
JOIN recent_purchasers p ON e.player_id = p.player_id
GROUP BY e.event_type

In this scenario, the query processes roughly 30% fewer bytes than the multi-table approach because it avoids scanning several tables and repeatedly joining on player_id. At that rate, monthly analysis costs drop from $450 to $315 for typical query patterns.

Decision Framework: When to Choose Each Approach

Use flat tables when:

  • Different teams query distinct entities independently and access controls matter
  • You need to update specific records frequently rather than appending new data
  • Your organization has strong SQL expertise but limited BigQuery-specific knowledge
  • Data relationships are truly many-to-many without clear parent-child hierarchies
  • You're migrating from a traditional database and want to minimize query rewrites

Use nested fields when:

  • Data has natural hierarchies (orders contain line items, events contain properties)
  • Queries typically need complete business objects rather than slices across entities
  • You primarily append data rather than updating existing records
  • Query performance and cost matter more than schema familiarity
  • You want to reduce join operations and shuffle costs

Consider a hybrid approach when:

  • Some entities are clearly hierarchical while others have independent lifecycles
  • Core dimensions like customers or products need independent updates
  • Fact tables capturing events or transactions can embed dimension snapshots as structs (see the sketch after this list)
  • You partition large tables and can accept dimension data duplicated within partition boundaries
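
A minimal sketch of that hybrid pattern, assuming a customers dimension maintained independently and an append-only orders fact table that embeds a point-in-time customer snapshot:

-- Dimension table: updated in place as customer details change
CREATE TABLE customers_dim (
  customer_id INT64,
  name STRING,
  email STRING,
  state STRING
);

-- Fact table: append-only, carrying a snapshot of the customer as it
-- looked at order time, plus nested line items
CREATE TABLE orders_fact (
  order_id INT64,
  order_date DATE,
  customer_id INT64,  -- still joinable back to customers_dim when needed
  customer_snapshot STRUCT<name STRING, email STRING, state STRING>,
  line_items ARRAY<STRUCT<product_name STRING, quantity INT64, price FLOAT64>>
)
PARTITION BY order_date;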

The table below summarizes key trade-offs:

Factor           | Flat Tables                | Nested Fields
Query Complexity | Higher (joins required)    | Lower (single-table scans)
Bytes Processed  | More (join overhead)       | Fewer (targeted column reads)
Storage          | Lower (no duplication)     | Higher (data repeats in structs)
Update Pattern   | Individual row updates     | Append-only or partition rewrites
Schema Clarity   | Explicit relationships     | Embedded hierarchies
Learning Curve   | Familiar SQL               | Requires STRUCT/ARRAY knowledge

Relevance to Google Cloud Certification Exams

This design choice appears frequently on the Professional Data Engineer certification and sometimes on the Professional Cloud Architect exam. Google emphasizes understanding BigQuery's unique characteristics rather than applying traditional database patterns blindly.

A typical exam scenario might present this situation:

"A retail company loads transaction data into BigQuery. Each transaction contains customer information, multiple purchased items, and payment details. The analytics team runs daily reports showing customer purchase patterns. Which schema design optimizes query performance and cost?"

Correct answer: Use a single table with customer data stored in a STRUCT and purchased items in an ARRAY of STRUCTs. This design eliminates joins, reduces bytes scanned, and keeps related data together for common query patterns.

Why other options are wrong:

  • Separate tables for customers, transactions, and line items require expensive joins for typical reports
  • Flattening items into individual rows duplicates transaction and customer data unnecessarily
  • Storing all data as JSON strings prevents predicate pushdown and column pruning

The exam tests whether you understand that BigQuery's columnar storage and nested type support make denormalization practical and often preferable for analytics workloads, contrary to traditional RDBMS design principles.

Associate Cloud Engineer candidates should recognize when nested fields improve BigQuery performance. Machine Learning Engineer certification scenarios sometimes include feature engineering questions where denormalized structures simplify data preparation for model training.

Conclusion

The choice between BigQuery nested fields and flat tables comes down to matching data structure to query patterns and to BigQuery's architectural strengths. Flat tables preserve familiar relational patterns and work well when entities have independent lifecycles or need frequent updates. Nested fields leverage BigQuery's columnar storage to eliminate joins, reduce costs, and improve performance when data has natural hierarchies.

Neither approach is universally correct. Thoughtful data engineering means evaluating how your team queries data, understanding BigQuery's pricing model, and recognizing when denormalization serves analytics better than normalization. The best designs often combine both approaches, using flat tables for core dimensions that change independently and nested structures for event data or transactional facts that capture complete business moments.

Master this trade-off and you'll build BigQuery schemas that serve both current reporting needs and future analytical complexity while controlling costs and maintaining query performance at scale.
