Structured, Unstructured, and Semi-Structured Data: A GCP ACE Exam Primer

Ben Makansi
November 19, 2025

One of the first things the Associate Cloud Engineer exam tests is whether you can match a data type to the right storage service. Structured data, unstructured data, and semi-structured data each have different characteristics, and Google Cloud has specific services designed for each. Get the categorization wrong and you end up recommending a database for image files or an object store for transactional records.

Structured Data

Structured data is highly organized. It lives in tables with rows and columns, follows a predefined schema, and is typically managed by a relational database. Every record conforms to the same format. A customer record always has a name, an email address, and a customer ID. A transaction always has an amount, a timestamp, and an account number.

Structured data supports SQL queries directly because the schema is known in advance. It is also the category most associated with ACID compliance, meaning transactions are atomic, consistent, isolated, and durable. Financial records, CRM data, order management systems, and inventory databases are all structured data.
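
To make the schema and atomicity ideas concrete, here is a minimal sketch using Python's built-in sqlite3 module as a local stand-in for a relational service such as Cloud SQL. The accounts table, its columns, and the transfer helper are illustrative assumptions, not anything the exam prescribes.

```python
import sqlite3

# In-memory SQLite stands in for a managed relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        owner      TEXT NOT NULL,
        balance    INTEGER NOT NULL CHECK (balance >= 0)
    )
    """
)
conn.executemany(
    "INSERT INTO accounts (account_id, owner, balance) VALUES (?, ?, ?)",
    [(1, "alice", 100), (2, "bob", 50)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was undone too.
        return False


def balances(conn):
    """Return {owner: balance} using a plain SQL query against the known schema."""
    rows = conn.execute("SELECT owner, balance FROM accounts ORDER BY account_id")
    return dict(rows)
```

Calling transfer(conn, 1, 2, 500) returns False and leaves both balances untouched, even though the credit to the destination account ran first. That all-or-nothing behavior is the "atomic" part of ACID.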

On Google Cloud, the services for structured data are Cloud SQL, Cloud Spanner, and BigQuery. Cloud SQL handles traditional relational workloads in a single region. Cloud Spanner handles relational workloads that need global scale and strong consistency. BigQuery handles structured analytical data at scale, optimized for queries rather than transactions.

Unstructured Data

Unstructured data has no predefined schema or fixed format. Images, videos, audio files, PDFs, raw log files, emails, and social media content are all unstructured. A relational database can hold an image as an opaque binary blob, but it cannot query or index the image's contents in any meaningful way. The content is the data, and it does not conform to a schema you define ahead of time.

On Google Cloud, Cloud Storage is the primary home for unstructured data. It stores objects of any type and size without caring about their internal structure. A video file, a trained machine learning model, a backup archive, and a raw CSV export all go into Cloud Storage buckets. The service organizes by bucket and object name, not by schema.

Cloud Storage is also the landing zone for data pipelines. Raw data frequently arrives in Cloud Storage first, gets processed by Dataflow or Dataproc, and the structured output lands in BigQuery. Understanding this pipeline pattern is useful for the Associate Cloud Engineer exam.

Semi-Structured Data

Semi-structured data sits between the other two. It has some organization, often through metadata tags or a flexible schema, but it does not conform to a rigid table structure. JSON, XML, and YAML are the most common formats. A JSON document might have consistent top-level keys but variable nested structures. Different records can have different fields.
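
A short sketch shows what "different records can have different fields" looks like in practice. The user records below are made up for illustration.

```python
import json

# Three hypothetical user records: same top-level identity keys, different extras.
raw_records = [
    '{"user_id": 1, "name": "Ada", "email": "ada@example.com"}',
    '{"user_id": 2, "name": "Lin", "preferences": {"theme": "dark", "languages": ["en", "zh"]}}',
    '{"user_id": 3, "name": "Sam", "email": "sam@example.com", "devices": [{"os": "android"}]}',
]
records = [json.loads(line) for line in raw_records]

# Keys present in every record versus keys that appear only in some.
key_sets = [set(record) for record in records]
common_keys = set.intersection(*key_sets)
optional_keys = set.union(*key_sets) - common_keys

print(sorted(common_keys))
print(sorted(optional_keys))
```

Forcing these records into one rigid table would mean a column for every optional key, mostly filled with nulls, plus some way to flatten the nested values. A document store keeps each record as it is.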

NoSQL databases are built to handle semi-structured data. On Google Cloud, Cloud Firestore stores documents as JSON-like objects, making it a natural fit for semi-structured application data. Cloud Bigtable uses a column-family model, which allows each row to have different columns, accommodating semi-structured patterns at very high throughput.

Even Cloud Storage can hold semi-structured data. JSON files sitting in a bucket are technically unstructured from Cloud Storage's perspective because it just sees bytes, but the content itself is semi-structured. BigQuery can query newline-delimited JSON files in Cloud Storage directly through external tables, treating the semi-structured content as if it were tabular.

The Exam Mapping

The Associate Cloud Engineer exam presents scenarios and asks you to choose the right storage service. The data type is almost always a clue. When the scenario describes images, videos, or raw files, Cloud Storage is the answer. When it describes customer records, transactions, or anything requiring SQL and consistency, Cloud SQL or Cloud Spanner applies. When it describes user profiles with variable attributes or mobile app data, Firestore is likely the right choice.

BigQuery is a special case. It primarily stores structured analytical data, but it can query both structured and semi-structured data from external sources. The exam tests whether you know that BigQuery is an analytics engine designed for large-scale queries, not a replacement for an operational database.

A useful way to remember the mapping: structured data belongs in databases that enforce schema, unstructured data belongs in object storage, and semi-structured data belongs in NoSQL databases or can live in object storage with schema applied at query time.
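
As a study aid, the mapping described in this article can be written as a small lookup function. This is a simplification of the rules of thumb above, not official guidance; the function name and its flags are my own invention.

```python
def suggest_service(category, *, analytical=False, global_scale=False, high_throughput=False):
    """Return the service this article associates with a data category.

    A study mnemonic only: real scenarios weigh more factors than these flags.
    """
    category = category.lower()
    if category == "unstructured":
        return "Cloud Storage"
    if category == "structured":
        if analytical:
            return "BigQuery"
        return "Cloud Spanner" if global_scale else "Cloud SQL"
    if category == "semi-structured":
        if analytical:
            return "BigQuery external table over Cloud Storage"
        return "Cloud Bigtable" if high_throughput else "Firestore"
    raise ValueError(f"unknown data category: {category!r}")


print(suggest_service("unstructured"))
print(suggest_service("structured", global_scale=True))
print(suggest_service("semi-structured"))
```

Working through a practice question this way, by first naming the category and then asking about scale and access pattern, tends to narrow the options quickly.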

Why the ACE Exam Cares About This

An Associate Cloud Engineer needs to design storage architectures that fit the data being stored. Choosing the wrong service means either wasting money on capabilities you do not need or running into technical limitations when your data does not fit the service's model.

The exam tests this through architecture scenario questions: a startup is building a photo-sharing app, a retailer wants to analyze sales data, a healthcare company stores patient notes in varying formats. Each scenario has a correct answer that depends on recognizing the data type and matching it to the appropriate GCP service.

My Associate Cloud Engineer course walks through these scenario types in detail, including the less obvious cases where semi-structured data spans multiple services depending on the access pattern.

External Tables and Schema-on-Read

BigQuery supports a useful pattern for semi-structured data called schema-on-read. You can store newline-delimited JSON files in Cloud Storage without loading them into a table first, then query them through BigQuery external tables as if they were structured. The schema is applied when the query runs rather than when the data is stored. This is different from loading data into a native BigQuery table, which enforces the schema at write time.

Schema-on-read is useful for exploratory analysis of data whose structure you have not fully defined yet, or for querying logs and event data that have varying fields. The trade-off is performance. External table queries read from Cloud Storage at query time, which is typically slower than querying native BigQuery tables. For data you query frequently, loading into native BigQuery tables is usually the better choice.
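
The difference between the two approaches can be illustrated in plain Python. This is a toy model of the idea, not how BigQuery is implemented: the schema, the sample events, and both helper functions are assumptions made for the example, and the strictness of the write path is deliberately exaggerated.

```python
import json

# Hypothetical schema: column name -> expected Python type.
SCHEMA = {"event": str, "user_id": int, "duration_ms": int}

# Raw newline-delimited JSON, as it might sit in a bucket. Fields vary per line.
RAW_LINES = [
    '{"event": "login", "user_id": 7}',
    '{"event": "play", "user_id": 7, "duration_ms": 5300, "device": "tv"}',
]


def read_with_schema(lines, schema):
    """Schema-on-read: project each raw record onto the schema at query time."""
    for line in lines:
        record = json.loads(line)
        # Missing fields become None; fields outside the schema are ignored here.
        yield {column: record.get(column) for column in schema}


def write_with_schema(table, record, schema):
    """Schema-on-write (toy version): validate before storing, reject mismatches."""
    if set(record) != set(schema):
        raise ValueError(f"fields {sorted(record)} do not match schema {sorted(schema)}")
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise TypeError(f"{column} should be {expected_type.__name__}")
    table.append(record)


rows = list(read_with_schema(RAW_LINES, SCHEMA))
```

The read path accepts both raw lines and shapes them on the way out, while the toy write path would reject both of them as stored. The raw files stay flexible, and the cost of interpreting them is paid on every query instead of once at load time.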

Understanding this distinction helps with exam scenarios that describe large volumes of semi-structured data and ask you to balance storage cost against query performance.
