AWS Glue Data Catalog

The AWS Glue Data Catalog is a centralized, Hive-compatible metadata repository that stores information about data sources (databases, tables, columns, partitions, and schemas) across your AWS environment. It is the metadata backbone for Athena, Redshift Spectrum, EMR, Glue ETL, and Lake Formation, and the usual foundation of an S3-based data lake on AWS.

Key Features:

Centralized Metadata Storage: One catalog per account per Region, shared by Athena, Redshift Spectrum, EMR, Glue ETL, and Lake Formation.
Automatic Schema Discovery: Glue Crawlers scan S3, JDBC, DynamoDB, Delta, and Iceberg sources to infer schemas and register tables and partitions.
Iceberg, Hudi, and Delta Support: Native support for open table formats with ACID semantics, time travel, and schema evolution.
Partition Indexes & Projection: Speed up planning over highly partitioned tables either by indexing partition keys in the catalog or, in Athena, by computing partition values and locations from table properties instead of looking them up.
Schema Versioning: Each table update creates a new table version, so you can inspect and compare earlier schemas or re-apply one.
Lake Formation Integration: Apply database-, table-, column-, row-, and tag-based permissions on top of catalog objects.
Resource Linking & Cross-Account Sharing: Share catalog entries between accounts via Lake Formation or RAM without copying data.
Hive Metastore Compatibility: Engines that target the Hive metastore interface, such as Hive, Spark, and Presto/Trino on EMR, can be configured to use the Glue Data Catalog as their metastore.
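
As a rough illustration of schema versioning, the sketch below diffs the column lists of a table's two most recent versions. It assumes a boto3 Glue client is passed in as glue, that VersionId values are numeric strings, and that the first page of get_table_versions results is enough (pagination is omitted for brevity). The helper names are hypothetical.

```python
def diff_columns(old_columns, new_columns):
    """Compare two Glue column lists ({"Name", "Type"} dicts) by column name."""
    old = {c["Name"]: c["Type"] for c in old_columns}
    new = {c["Name"]: c["Type"] for c in new_columns}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


def diff_latest_versions(glue, database, table):
    """Diff the two newest versions of a table; `glue` is a boto3 Glue client."""
    versions = glue.get_table_versions(
        DatabaseName=database, TableName=table
    )["TableVersions"]
    # Assumption: VersionId is a numeric string, so sort explicitly
    # instead of relying on the order of the response.
    versions.sort(key=lambda v: int(v["VersionId"]), reverse=True)
    if len(versions) < 2:
        return {"added": [], "removed": [], "retyped": []}
    newest, previous = versions[0], versions[1]
    return diff_columns(
        previous["Table"]["StorageDescriptor"]["Columns"],
        newest["Table"]["StorageDescriptor"]["Columns"],
    )
```

A call such as diff_latest_versions(boto3.client("glue"), "logs", "app_events") would report which columns were added, removed, or changed type in the latest update.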

Common Use Cases:

Data Lake Management: Catalog all S3 datasets so Athena, Redshift Spectrum, and EMR can query them without per-tool schema definitions.
ETL Job Configuration: Glue ETL jobs read source schemas from the catalog and can write updated schemas back after transforms.
Ad Hoc Queries: Athena and Redshift Spectrum resolve table names against the catalog.
Data Governance: Lake Formation uses catalog objects as the unit of permission grants.
Cross-Account Data Mesh: Producer accounts publish tables to the catalog; consumer accounts subscribe via resource links.
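
To make the ad hoc query case concrete, here is a minimal sketch of the request an Athena client would send so that table names resolve against the Glue Data Catalog. The helper name and the S3 output location are hypothetical; "AwsDataCatalog" is the name Athena uses for the account's Glue catalog.

```python
def athena_query_request(sql, database, output_location):
    """Build keyword arguments for Athena's start_query_execution."""
    return {
        "QueryString": sql,
        # Table names in the SQL are resolved against this catalog database.
        "QueryExecutionContext": {
            "Catalog": "AwsDataCatalog",
            "Database": database,
        },
        # Athena writes result files under this S3 prefix.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = athena_query_request(
    "SELECT status, count(*) FROM app_events WHERE year = '2024' GROUP BY status",
    database="logs",
    output_location="s3://my-athena-results/adhoc/",  # hypothetical bucket
)
```

The request can then be submitted with boto3.client("athena").start_query_execution(**request).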

Service Limits & Quotas:

Databases per account per Region: default soft limit 10,000.
Tables per database: default soft limit 200,000.
Partitions per table: default soft limit 10,000,000.
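Table version history: default soft limit 100,000 versions per table. Old versions are not pruned automatically, so delete them on tables that are updated frequently.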
Crawler concurrency: default soft limit 50 concurrent crawlers per account. Crawlers are a managed resource in their own right, separate from Glue ETL jobs.
API throttling: requests like GetPartitions have per-account TPS caps — use partition indexes to reduce calls.
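
One way to keep GetPartitions traffic down is to push a filter expression to the catalog rather than listing every partition and filtering client-side. The sketch below assumes a boto3 Glue client is passed in as glue; the helper names are hypothetical, and the filter builder does no escaping, so it is only suitable for trusted values.

```python
def partition_filter(**equals):
    """Build a GetPartitions Expression such as "year = '2024' AND month = '01'"."""
    return " AND ".join(f"{key} = '{value}'" for key, value in equals.items())


def list_partition_locations(glue, database, table, expression):
    """Return the S3 locations of partitions matching `expression`."""
    paginator = glue.get_paginator("get_partitions")
    locations = []
    # The paginator follows NextToken, so each page is one GetPartitions call.
    for page in paginator.paginate(
        DatabaseName=database, TableName=table, Expression=expression
    ):
        for partition in page["Partitions"]:
            locations.append(partition["StorageDescriptor"]["Location"])
    return locations
```

For example, list_partition_locations(glue, "logs", "app_events", partition_filter(year="2024", month="01")) fetches a single month instead of the whole table.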

Pricing Model:

Storage: first 1 million objects (databases, tables, partitions) per month are free; charges per 100,000 objects above that.
Requests: first 1 million requests per month are free; charges per million requests beyond.
Crawlers: billed per DPU-hour, metered by the second, with a 10-minute minimum per run.
Common cost surprises: chatty GetPartitions calls from queries that cannot prune partitions; runaway crawlers that re-scan unchanged data on a frequent schedule.
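
The free-tier arithmetic can be sketched as a small estimator. The two rates below are placeholder assumptions (1.00 USD per unit), not quoted prices; substitute the current figures for your Region. The function also prorates partial units, which may differ from how AWS rounds on the bill.

```python
def estimate_catalog_cost(
    objects,
    requests,
    usd_per_100k_objects=1.00,      # assumption: placeholder rate
    usd_per_million_requests=1.00,  # assumption: placeholder rate
    free_objects=1_000_000,
    free_requests=1_000_000,
):
    """Rough monthly catalog cost in USD for stored objects and API requests."""
    billable_objects = max(0, objects - free_objects)
    billable_requests = max(0, requests - free_requests)
    storage_cost = billable_objects / 100_000 * usd_per_100k_objects
    request_cost = billable_requests / 1_000_000 * usd_per_million_requests
    return round(storage_cost + request_cost, 2)


# 1.5M objects and 3M requests: 500k billable objects, 2M billable requests.
monthly = estimate_catalog_cost(objects=1_500_000, requests=3_000_000)
```

With the placeholder rates, 500,000 billable objects cost 5.00 and 2,000,000 billable requests cost 2.00, so monthly is 7.0.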

Code Example:


import boto3

glue = boto3.client("glue", region_name="us-west-2")

# Register a Parquet table over an S3 prefix.
# The "logs" database must already exist (see glue.create_database).
glue.create_table(
    DatabaseName="logs",
    TableInput={
        "Name": "app_events",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [
            {"Name": "year",  "Type": "string"},
            {"Name": "month", "Type": "string"},
        ],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "event_time", "Type": "timestamp"},
                {"Name": "user_id",    "Type": "bigint"},
                {"Name": "status",     "Type": "string"},
            ],
            "Location": "s3://my-lake/logs/app_events/",
            "InputFormat":  "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo":    {"SerializationLibrary":
                             "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"},
        },
    },
)

# Add a partition index to keep partition pruning fast as the table grows
glue.create_partition_index(
    DatabaseName="logs",
    TableName="app_events",
    PartitionIndex={"IndexName": "year_month_idx", "Keys": ["year", "month"]},
)
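
Creating the table does not register any partitions. As a sketch of one way to add them without a crawler, the helper below builds a PartitionInput by copying the table's StorageDescriptor and pointing it at a Hive-style path (year=2024/month=01/). The helper name is hypothetical, and the Hive-style layout is an assumption about how the data is written.

```python
import copy


def partition_input(table, values):
    """Build a PartitionInput for `values`, ordered like the table's PartitionKeys."""
    keys = [key["Name"] for key in table["PartitionKeys"]]
    if len(values) != len(keys):
        raise ValueError(f"expected {len(keys)} partition values, got {len(values)}")
    # Copy so the table definition passed in is left untouched.
    descriptor = copy.deepcopy(table["StorageDescriptor"])
    suffix = "/".join(f"{key}={value}" for key, value in zip(keys, values))
    descriptor["Location"] = descriptor["Location"].rstrip("/") + "/" + suffix + "/"
    return {"Values": list(values), "StorageDescriptor": descriptor}
```

Given table = glue.get_table(DatabaseName="logs", Name="app_events")["Table"], the partition can be registered with glue.batch_create_partition(DatabaseName="logs", TableName="app_events", PartitionInputList=[partition_input(table, ["2024", "01"])]).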

Common Interview Questions:

What does the Glue Data Catalog actually store?

Metadata only — database, table, column, partition, and serialization details. Data continues to live in S3 (or whichever underlying store). Athena, Redshift Spectrum, EMR, and Glue all read this metadata to plan queries.
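
To see what "metadata only" means in practice, the sketch below condenses the Table structure returned by get_table into the fields a query planner cares about. The helper name is hypothetical; the dictionary keys follow the shape used in the create_table example above.

```python
def summarize_table(table):
    """Pick out the planner-relevant metadata from a Glue Table structure."""
    descriptor = table.get("StorageDescriptor", {})
    return {
        "name": table["Name"],
        "location": descriptor.get("Location"),  # where the data actually lives
        "columns": [(c["Name"], c["Type"]) for c in descriptor.get("Columns", [])],
        "partition_keys": [k["Name"] for k in table.get("PartitionKeys", [])],
        "serde": descriptor.get("SerdeInfo", {}).get("SerializationLibrary"),
    }
```

For example, summarize_table(glue.get_table(DatabaseName="logs", Name="app_events")["Table"]) returns the location, schema, partition keys, and serde, but no rows of data.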

How do crawlers differ from manually creating tables?

Crawlers scan a source on a schedule or on demand, infer schema and partition layout, and update the catalog. Manual table creation gives you precise control over column types, serde properties, and partition projection. Production lakes typically combine both: crawlers for raw ingest, manual definitions for curated tables.
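
For the crawler side, a sketch of a create_crawler request that only crawls new S3 folders, which is one way to avoid re-scanning unchanged data. The helper name, crawler name, role ARN, and schedule are hypothetical. As far as I know, the incremental recrawl behavior requires both schema change behaviors to be LOG; treat that pairing as an assumption to confirm against the API reference.

```python
def incremental_crawler_request(name, role_arn, database, s3_path, schedule):
    """Build keyword arguments for glue.create_crawler with incremental crawls."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": schedule,
        # Skip folders that were already crawled on earlier runs.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # Assumption: incremental crawls only allow LOG for both behaviors.
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    }


crawler = incremental_crawler_request(
    name="raw-app-events",                                  # hypothetical
    role_arn="arn:aws:iam::111122223333:role/GlueCrawler",  # hypothetical
    database="logs",
    s3_path="s3://my-lake/logs/app_events/",
    schedule="cron(0 2 * * ? *)",                           # daily at 02:00 UTC
)
```

The request can be submitted with glue.create_crawler(**crawler).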

When would you use partition projection instead of registered partitions?

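Partition projection is an Athena feature that works well when partition values follow a predictable pattern (dates, integer ranges, enums). Athena computes the partitions from table properties, which removes the need to register every partition and avoids slow GetPartitions calls on tables with millions of partitions. Other engines that read the catalog ignore the projection properties and still need registered partitions.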
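
As a sketch, projection is configured through table parameters. The helper below builds them for the year/month layout used in the earlier example; the helper name and the year range are assumptions, and it assumes Athena's integer projection type is acceptable for these string-typed partition keys. The parameters would be merged into the table's Parameters map.

```python
def projection_parameters(location, first_year=2020, last_year=2030):
    """Build Athena partition projection properties for year/month partitions."""
    root = location.rstrip("/")
    return {
        "projection.enabled": "true",
        "projection.year.type": "integer",
        "projection.year.range": f"{first_year},{last_year}",
        "projection.month.type": "integer",
        "projection.month.range": "1,12",
        "projection.month.digits": "2",  # zero-pad months: 01..12
        # ${year} and ${month} are placeholders Athena fills in per partition.
        "storage.location.template": f"{root}/year=${{year}}/month=${{month}}/",
    }


params = projection_parameters("s3://my-lake/logs/app_events/")
```

With these properties set, Athena derives candidate partitions from the ranges and the location template instead of calling GetPartitions.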

How does Lake Formation relate to the Glue Data Catalog?

Lake Formation layers fine-grained access control (database, table, column, row, cell, and tag-based) on top of catalog objects. Without Lake Formation, access is governed by IAM policies on the Glue API together with S3 permissions on the underlying data.
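
A column-level grant might look like the following sketch, which builds the arguments for Lake Formation's grant_permissions call. The helper name and the principal ARN are hypothetical.

```python
def column_select_grant(principal_arn, database, table, columns):
    """Build keyword arguments granting SELECT on specific columns of a table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


grant = column_select_grant(
    "arn:aws:iam::111122223333:role/Analyst",  # hypothetical principal
    database="logs",
    table="app_events",
    columns=["event_time", "status"],          # user_id stays hidden
)
```

The grant can be applied with boto3.client("lakeformation").grant_permissions(**grant); the analyst role can then select event_time and status but not user_id.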

What is the cost trap with high-cardinality partitioning?

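Each partition is a billable catalog object, and chatty GetPartitions calls accumulate request charges and slow query planning. Partition indexes let the catalog return only the matching partitions, and Iceberg tracks partitions in its own metadata files, so neither requires scanning every partition record on every query.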

How do you share a Glue table across AWS accounts?

Use Lake Formation or AWS RAM to grant the consumer account access. The consumer creates a resource link in their own catalog that points to the producer's database/table, and queries the data in place without copying.
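
On the consumer side, a resource link is a catalog table whose definition points at the producer's table. A minimal sketch of the TableInput follows; the helper name, link name, account ID, and the consumer database are hypothetical, and the producer must already have shared the table.

```python
def resource_link_input(link_name, producer_account_id, producer_database, producer_table):
    """Build a TableInput that links to a table in another account's catalog."""
    return {
        "Name": link_name,
        "TargetTable": {
            "CatalogId": producer_account_id,  # the producer's AWS account ID
            "DatabaseName": producer_database,
            "Name": producer_table,
        },
    }


link = resource_link_input(
    "app_events_link",
    producer_account_id="111122223333",  # hypothetical producer account
    producer_database="logs",
    producer_table="app_events",
)
```

The consumer would then run glue.create_table(DatabaseName="shared", TableInput=link) and query shared.app_events_link as if it were a local table.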

The AWS Glue Data Catalog is the metadata foundation for most analytics workloads on AWS. It centralizes schema, feeds query planners across services, and integrates with Lake Formation to govern access, which makes it a core component of a modern data lake.