AWS Glue Data Catalog

The AWS Glue Data Catalog is a centralized, Hive-compatible metadata repository that stores information about data sources (databases, tables, columns, partitions, and schemas). Each AWS account has one catalog per Region. It is the metadata backbone for Athena, Redshift Spectrum, EMR, Glue ETL, and Lake Formation, and the usual foundation of an S3-based data lake on AWS.


Key Features:


Common Use Cases:


Service Limits & Quotas:


Pricing Model:


Code Example:


import boto3

glue = boto3.client("glue", region_name="us-west-2")

# Register a Parquet table over an S3 prefix
glue.create_table(
    DatabaseName="logs",
    TableInput={
        "Name": "app_events",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [
            {"Name": "year",  "Type": "string"},
            {"Name": "month", "Type": "string"},
        ],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "event_time", "Type": "timestamp"},
                {"Name": "user_id",    "Type": "bigint"},
                {"Name": "status",     "Type": "string"},
            ],
            "Location": "s3://my-lake/logs/app_events/",
            "InputFormat":  "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo":    {"SerializationLibrary":
                             "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"},
        },
    },
)

# Add a partition index to keep partition pruning fast as the table grows
glue.create_partition_index(
    DatabaseName="logs",
    TableName="app_events",
    PartitionIndex={"IndexName": "year_month_idx", "Keys": ["year", "month"]},
)
  


Common Interview Questions:

What does the Glue Data Catalog actually store?

Metadata only: database, table, column, partition, and serialization (SerDe) details. The data itself stays in S3 (or whichever underlying store). Athena, Redshift Spectrum, EMR, and Glue all read this metadata to plan queries.
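
To make "metadata only" concrete, the sketch below pulls out of a table definition the fields a query engine typically needs. The `summarize_table` helper is hypothetical, and `sample_table` is a hand-written stand-in for the `Table` element of a `get_table` response, assumed to mirror the `app_events` table from the code example above.

```python
def summarize_table(table: dict) -> dict:
    """Pick out schema, partition keys, and data location from a Glue table definition."""
    sd = table.get("StorageDescriptor", {})
    return {
        "name": table["Name"],
        "columns": [(c["Name"], c["Type"]) for c in sd.get("Columns", [])],
        "partition_keys": [k["Name"] for k in table.get("PartitionKeys", [])],
        "location": sd.get("Location"),
        "serde": sd.get("SerdeInfo", {}).get("SerializationLibrary"),
    }


# Hand-written stand-in for a catalog entry (assumption: same shape as get_table's "Table").
sample_table = {
    "Name": "app_events",
    "PartitionKeys": [
        {"Name": "year", "Type": "string"},
        {"Name": "month", "Type": "string"},
    ],
    "StorageDescriptor": {
        "Columns": [
            {"Name": "event_time", "Type": "timestamp"},
            {"Name": "user_id", "Type": "bigint"},
        ],
        "Location": "s3://my-lake/logs/app_events/",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
}

summary = summarize_table(sample_table)
```

Nothing in the summary is row data: it holds only names, types, and a pointer to where the files live. Against a real catalog, the input would come from `glue.get_table(DatabaseName="logs", Name="app_events")["Table"]`.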

How do crawlers differ from manually creating tables?

Crawlers scan a source on a schedule or on demand, infer schema and partition layout, and update the catalog. Manual table creation gives you precise control over column types, SerDe properties, and partition projection. Production lakes typically combine both: crawlers for raw ingest, manual definitions for curated tables.
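
For the crawler half of that answer, a minimal sketch of a scheduled crawler definition might look like the following. The `crawler_request` helper, the crawler name, the role ARN, the `logs_raw` database, and the S3 path are all illustrative assumptions. The dictionary keys are the parameters of the Glue `create_crawler` API.

```python
def crawler_request(name, role_arn, database, s3_path, schedule="cron(0 2 * * ? *)"):
    """Build keyword arguments for glue.create_crawler (hypothetical helper)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Glue cron syntax; this default runs daily at 02:00 UTC.
        "Schedule": schedule,
        "SchemaChangePolicy": {
            # Apply detected schema changes to the catalog table.
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            # Only log objects that disappear; do not drop catalog entries.
            "DeleteBehavior": "LOG",
        },
    }


# Placeholder values; none of these resources are taken from the text above.
request = crawler_request(
    name="raw-logs-crawler",
    role_arn="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    database="logs_raw",
    s3_path="s3://my-lake/raw/logs/",
)
```

With credentials configured, `boto3.client("glue").create_crawler(**request)` would register the crawler. Curated tables would still be defined by hand, as in the `create_table` example above.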

When would you use partition projection instead of registered partitions?

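Partition projection is an Athena feature: Athena computes partition values and locations from table properties instead of looking them up in the catalog. It works well when partition values follow a predictable pattern (dates, integer ranges, enums). It removes the need to register every partition and avoids slow GetPartitions calls on tables with millions of partitions. Other engines that read the catalog ignore the projection properties, so tables they query still need registered partitions.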
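
As a rough illustration, partition projection is configured through table properties. The sketch below builds those properties for the `year`/`month` keys used in the code example above. It assumes Hive-style prefixes (`year=2024/month=06/`) with zero-padded months; the `projection_parameters` helper and the year range are hypothetical.

```python
def projection_parameters(location, first_year, last_year):
    """Build Athena partition projection properties for year/month partitions."""
    base = location.rstrip("/")
    return {
        "projection.enabled": "true",
        "projection.year.type": "integer",
        "projection.year.range": f"{first_year},{last_year}",
        "projection.month.type": "integer",
        "projection.month.range": "1,12",
        # Assumption: months are zero-padded in the S3 prefix (01..12).
        "projection.month.digits": "2",
        # ${year} and ${month} are placeholders that Athena fills in at query time.
        "storage.location.template": f"{base}/year=${{year}}/month=${{month}}/",
    }


params = projection_parameters("s3://my-lake/logs/app_events/", 2020, 2030)
```

These entries would go into the table's `Parameters` map alongside `classification`. With projection enabled, Athena derives partition locations from the template rather than from partitions registered in the catalog.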

How does Lake Formation relate to the Glue Data Catalog?

Lake Formation layers fine-grained access control (database, table, column, row, cell, and tag-based) on top of catalog objects. Without Lake Formation, access is governed by IAM policies on the Glue API and on the underlying S3 data.
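
A column-level grant is one way to picture what "fine-grained" means here. The sketch below builds the arguments for the Lake Formation `grant_permissions` API; the `column_grant` helper, the analyst role ARN, and the choice of columns are assumptions made for illustration.

```python
def column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for lakeformation.grant_permissions (hypothetical helper)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns become readable by the principal.
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Placeholder principal; the grant exposes event_time and status but not user_id.
grant = column_grant(
    principal_arn="arn:aws:iam::111122223333:role/AnalystRole",
    database="logs",
    table="app_events",
    columns=["event_time", "status"],
)
```

With credentials configured, `boto3.client("lakeformation").grant_permissions(**grant)` would apply it. A plain IAM policy on the Glue API cannot express this kind of per-column restriction.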

What is the cost trap with high-cardinality partitioning?

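Each partition is a catalog object, and chatty GetPartitions calls accumulate request charges and slow query planning. Partition indexes let the catalog look up matching partitions without scanning every partition record, and Iceberg tables track partitions in their own metadata files rather than as catalog partitions.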
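
One way to keep GetPartitions calls cheap is to pass a filter expression on the partition keys instead of listing everything and filtering on the client. The sketch below assumes the `logs.app_events` table and its `year`/`month` keys from the code example above. The `partitions_for_month` helper is hypothetical and takes the Glue client as a parameter.

```python
def partitions_for_month(glue, database, table, year, month):
    """Return S3 locations of the partitions matching one year/month."""
    paginator = glue.get_paginator("get_partitions")
    pages = paginator.paginate(
        DatabaseName=database,
        TableName=table,
        # Filtering on the partition keys lets the catalog narrow the lookup;
        # an index on (year, month) is assumed to exist for large tables.
        Expression=f"year = '{year}' AND month = '{month}'",
    )
    locations = []
    for page in pages:
        for partition in page["Partitions"]:
            locations.append(partition["StorageDescriptor"]["Location"])
    return locations
```

A call such as `partitions_for_month(boto3.client("glue"), "logs", "app_events", "2024", "06")` returns only the matching partitions, so the number of records transferred stays proportional to the query rather than to the size of the table.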

How do you share a Glue table across AWS accounts?

Use Lake Formation or AWS RAM to grant the consumer account access. The consumer creates a resource link in their own catalog that points to the producer's database/table, and queries the data in place without copying.
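
On the consumer side, a resource link is a catalog table whose definition carries only a name and a pointer to the shared table. A minimal sketch follows, assuming the producer has already granted access. The `resource_link_input` helper, the account ID, and the link name are hypothetical.

```python
def resource_link_input(link_name, producer_account_id, producer_database, producer_table):
    """Build a TableInput that links to a table in another account's catalog."""
    return {
        "Name": link_name,
        "TargetTable": {
            # The producer's catalog ID is its AWS account ID.
            "CatalogId": producer_account_id,
            "DatabaseName": producer_database,
            "Name": producer_table,
        },
    }


# Placeholder account ID and link name, for illustration only.
link = resource_link_input(
    link_name="app_events_link",
    producer_account_id="111122223333",
    producer_database="logs",
    producer_table="app_events",
)
```

The consumer would then run `glue.create_table(DatabaseName="shared_logs", TableInput=link)` in their own account, where `shared_logs` is an assumed local database. Queries against `app_events_link` read the producer's data in place.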


The AWS Glue Data Catalog is the metadata foundation for most analytics workloads on AWS. It centralizes schema, drives query planning across services, and integrates with Lake Formation to govern access, which makes it a core component of a modern data lake.