Database schema design is the process of translating logical data models (conceptual ERDs) into physical database structures optimized for performance, integrity, scalability, and maintainability. It bridges the gap between what data you need and how it's actually stored and accessed in the database.
Schema design involves critical decisions:
OLTP (Online Transaction Processing) schemas prioritize write performance and normalization:
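For example, an OLTP order system splits entities into separate normalized tables so each fact is stored exactly once. A minimal sketch (table and column names are illustrative, not from the schemas later in this section):

```sql
-- Hypothetical normalized (3NF) OLTP tables: no repeated customer or price data
CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(100) UNIQUE
);

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers(customer_id),
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE order_items (
    order_id INT NOT NULL REFERENCES orders(order_id),
    product_id INT NOT NULL,
    quantity INT NOT NULL CHECK (quantity > 0),
    unit_price DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
```

Writes touch small, independent rows, and updating a customer's email changes one row rather than every historical order.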
OLAP (Online Analytical Processing) schemas prioritize query performance via denormalization:
The most common OLAP pattern. A central fact table (containing measures and foreign keys) surrounded by dimension tables (describing attributes).
                  ┌─────────────────┐
                  │  dim_customer   │
                  │  - cust_id (PK) │
                  └────────┬────────┘
                           │ FK
┌─────────────┐   ┌────────┴─────────────┐   ┌───────────┐
│ dim_product ├───┤ fact_sales (FACT)    ├───┤ dim_store │
└─────────────┘   │ - sale_id (PK)       │   └───────────┘
                  │ - cust_id (FK)       │
                  │ - prod_id (FK)       │
                  │ - store_id (FK)      │
                  │ - date_id (FK)       │
                  │ - quantity (measure) │
                  │ - revenue (measure)  │
                  └────────┬─────────────┘
                           │ FK
                     ┌─────┴─────┐
                     │ dim_date  │
                     └───────────┘
Dimensions are normalized into multiple related tables, reducing redundancy but increasing query complexity.
fact_sales ──┬──> dim_product ──> dim_category
             ├──> dim_customer ──> dim_region
             ├──> dim_store
             └──> dim_date
Each dimension table references other dimension tables,
eliminating attribute duplication.
Multiple fact tables sharing common dimensions. Used for complex analytical domains with different granularities.
fact_sales ──────┬──> dim_product
                 ├──> dim_customer
                 └──> dim_date

fact_returns ────┬──> dim_product
                 ├──> dim_customer
                 └──> dim_date
Shared dimensions across multiple fact tables.
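Conformed dimensions let you "drill across" fact tables, combining measures from each. A sketch against the diagram above, assuming fact_returns carries a quantity measure and both fact tables share the prod_id key:

```sql
-- Drill-across: aggregate each fact table separately, then join on the
-- conformed dimension (never join fact-to-fact row by row)
WITH sales AS (
    SELECT prod_id, SUM(quantity) AS units_sold
    FROM fact_sales
    GROUP BY prod_id
),
returns AS (
    SELECT prod_id, SUM(quantity) AS units_returned
    FROM fact_returns
    GROUP BY prod_id
)
SELECT p.prod_name,
       s.units_sold,
       COALESCE(r.units_returned, 0) AS units_returned
FROM sales s
JOIN dim_product p ON s.prod_id = p.prod_id
LEFT JOIN returns r ON s.prod_id = r.prod_id;
```

Aggregating before the join avoids the fan-out that a direct fact-to-fact join would produce.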
Star schemas excel in query performance: every query joins the fact table to each dimension in a single hop, dimension tables are small enough to cache, and optimizers can apply star-join strategies such as hash or bitmap joins. An example implementation:
-- Dimension: Customer
CREATE TABLE dim_customer (
cust_key SERIAL PRIMARY KEY, -- Surrogate key
    cust_id VARCHAR(50) NOT NULL,             -- Business key (not UNIQUE: SCD Type 2 keeps multiple versions)
cust_name VARCHAR(100) NOT NULL,
email VARCHAR(100),
country VARCHAR(50),
city VARCHAR(50),
is_active BOOLEAN DEFAULT TRUE,
scd_version INT DEFAULT 1, -- For SCD Type 2
effective_date DATE DEFAULT CURRENT_DATE,
end_date DATE DEFAULT '9999-12-31',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_dim_customer_cust_id ON dim_customer(cust_id);
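The scd_version / effective_date / end_date columns support Slowly Changing Dimension Type 2 history (this assumes cust_id is not UNIQUE, since each business key keeps multiple versions). A sketch of handling a change, with 'C-1001' and the new city/country values purely illustrative:

```sql
-- SCD Type 2 change: close the current version, insert a new one
BEGIN;

UPDATE dim_customer
SET end_date = CURRENT_DATE, is_active = FALSE
WHERE cust_id = 'C-1001' AND end_date = '9999-12-31';

INSERT INTO dim_customer (cust_id, cust_name, email, country, city, scd_version)
SELECT cust_id, cust_name, email, 'Canada', 'Toronto', scd_version + 1
FROM dim_customer
WHERE cust_id = 'C-1001'
ORDER BY scd_version DESC
LIMIT 1;

COMMIT;
```

Fact rows keep pointing at the surrogate key of whichever version was current when they were loaded, preserving history.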
-- Dimension: Product
CREATE TABLE dim_product (
prod_key SERIAL PRIMARY KEY,
    prod_id VARCHAR(50) NOT NULL,             -- Business key (not UNIQUE: SCD Type 2 keeps multiple versions)
prod_name VARCHAR(100) NOT NULL,
category VARCHAR(50) NOT NULL,
subcategory VARCHAR(50),
brand VARCHAR(50),
list_price DECIMAL(10, 2),
is_active BOOLEAN DEFAULT TRUE,
scd_version INT DEFAULT 1,
effective_date DATE DEFAULT CURRENT_DATE,
end_date DATE DEFAULT '9999-12-31'
);
CREATE INDEX idx_dim_product_category ON dim_product(category);
-- Dimension: Store
CREATE TABLE dim_store (
store_key SERIAL PRIMARY KEY,
store_id VARCHAR(50) NOT NULL UNIQUE,
store_name VARCHAR(100) NOT NULL,
region VARCHAR(50) NOT NULL,
district VARCHAR(50),
country VARCHAR(50),
city VARCHAR(50),
state VARCHAR(10),
postal_code VARCHAR(10),
store_type VARCHAR(20),
opened_date DATE,
is_active BOOLEAN DEFAULT TRUE
);
CREATE INDEX idx_dim_store_region ON dim_store(region);
-- Dimension: Date (conformed dimension)
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- YYYYMMDD format
full_date DATE NOT NULL UNIQUE,
year INT,
quarter INT,
month INT,
day_of_month INT,
day_of_week INT,
week_of_year INT,
is_weekday BOOLEAN,
is_holiday BOOLEAN DEFAULT FALSE,
holiday_name VARCHAR(50)
);
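The date dimension is typically pre-populated for the full range of dates you expect. A PostgreSQL sketch using generate_series (the date range is illustrative; is_holiday and holiday_name are left to their defaults):

```sql
-- Populate dim_date for 2020-2030
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day_of_month,
                      day_of_week, week_of_year, is_weekday)
SELECT
    TO_CHAR(d, 'YYYYMMDD')::INT,            -- surrogate key in YYYYMMDD form
    d::DATE,
    EXTRACT(YEAR FROM d)::INT,
    EXTRACT(QUARTER FROM d)::INT,
    EXTRACT(MONTH FROM d)::INT,
    EXTRACT(DAY FROM d)::INT,
    EXTRACT(ISODOW FROM d)::INT,            -- 1 = Monday ... 7 = Sunday
    EXTRACT(WEEK FROM d)::INT,
    EXTRACT(ISODOW FROM d) < 6              -- Monday-Friday
FROM generate_series('2020-01-01'::DATE, '2030-12-31'::DATE, '1 day') AS d;
```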
-- Fact Table: Sales
CREATE TABLE fact_sales (
sale_id BIGSERIAL PRIMARY KEY,
cust_key INT NOT NULL REFERENCES dim_customer(cust_key),
prod_key INT NOT NULL REFERENCES dim_product(prod_key),
store_key INT NOT NULL REFERENCES dim_store(store_key),
date_key INT NOT NULL REFERENCES dim_date(date_key),
quantity INT NOT NULL CHECK (quantity > 0),
unit_price DECIMAL(10, 2) NOT NULL,
discount_pct DECIMAL(5, 4) DEFAULT 0,
    revenue DECIMAL(12, 2) GENERATED ALWAYS AS
        (quantity * unit_price * (1 - discount_pct)) STORED,
    cost DECIMAL(12, 2),
    -- A generated column cannot reference another generated column,
    -- so profit repeats the revenue expression
    profit DECIMAL(12, 2) GENERATED ALWAYS AS
        (quantity * unit_price * (1 - discount_pct) - cost) STORED,
tax DECIMAL(10, 2),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Indexing strategy for fact table
CREATE INDEX idx_fact_sales_cust_key ON fact_sales(cust_key);
CREATE INDEX idx_fact_sales_prod_key ON fact_sales(prod_key);
CREATE INDEX idx_fact_sales_store_key ON fact_sales(store_key);
CREATE INDEX idx_fact_sales_date_key ON fact_sales(date_key);
CREATE INDEX idx_fact_sales_composite ON fact_sales(date_key, store_key, prod_key);
-- Analytical Query Example
SELECT
dc.cust_name,
dp.prod_name,
EXTRACT(YEAR FROM dd.full_date) AS year,
EXTRACT(MONTH FROM dd.full_date) AS month,
SUM(fs.quantity) AS total_units,
SUM(fs.revenue) AS total_revenue,
ROUND(AVG(fs.profit), 2) AS avg_profit
FROM fact_sales fs
JOIN dim_customer dc ON fs.cust_key = dc.cust_key
JOIN dim_product dp ON fs.prod_key = dp.prod_key
JOIN dim_store ds ON fs.store_key = ds.store_key
JOIN dim_date dd ON fs.date_key = dd.date_key
WHERE ds.region = 'North America'
AND EXTRACT(YEAR FROM dd.full_date) = 2024
AND dp.category = 'Electronics'
GROUP BY dc.cust_name, dp.prod_name, year, month
HAVING SUM(fs.revenue) > 10000
ORDER BY total_revenue DESC;
Normalizes dimension tables to eliminate redundancy. Pros: smaller storage, easier maintenance. Cons: more joins, slower queries.
-- Snowflake Dimension: Product hierarchy
CREATE TABLE dim_brand (
brand_key SERIAL PRIMARY KEY,
brand_id VARCHAR(50) NOT NULL UNIQUE,
brand_name VARCHAR(100) NOT NULL,
country_of_origin VARCHAR(50)
);
CREATE TABLE dim_category (
category_key SERIAL PRIMARY KEY,
category_id VARCHAR(50) NOT NULL UNIQUE,
category_name VARCHAR(100) NOT NULL,
category_desc TEXT
);
CREATE TABLE dim_product_snowflake (
prod_key SERIAL PRIMARY KEY,
prod_id VARCHAR(50) NOT NULL UNIQUE,
prod_name VARCHAR(100) NOT NULL,
category_key INT NOT NULL REFERENCES dim_category(category_key),
brand_key INT NOT NULL REFERENCES dim_brand(brand_key),
list_price DECIMAL(10, 2)
);
-- Snowflake query (more joins, but smaller tables)
SELECT
dc.category_name,
db.brand_name,
COUNT(*) AS num_products,
AVG(dps.list_price) AS avg_price
FROM dim_product_snowflake dps
JOIN dim_category dc ON dps.category_key = dc.category_key
JOIN dim_brand db ON dps.brand_key = db.brand_key
GROUP BY dc.category_name, db.brand_name
ORDER BY num_products DESC;
Indexes accelerate data retrieval but slow writes. Choose carefully based on query patterns.
Balanced tree structure, excellent for range queries and sorting.
-- Simple B-tree on column
CREATE INDEX idx_employee_salary ON employees(salary);
-- Range query benefits from B-tree
SELECT * FROM employees
WHERE salary BETWEEN 50000 AND 100000
AND hire_date > '2020-01-01'
ORDER BY salary;
-- Composite (multi-column) B-tree
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);
-- Composite query
SELECT customer_id, order_date, total
FROM orders
WHERE customer_id = 123 AND order_date >= '2024-01-01'
ORDER BY order_date DESC;
Perfect for equality lookups, not range queries.
-- Hash index (PostgreSQL)
CREATE INDEX idx_user_email ON users USING HASH (email);
-- Excellent for exact match
SELECT * FROM users WHERE email = 'john@example.com';
-- Poor for range queries
SELECT * FROM users WHERE email LIKE '%@example.com'; -- Slow with hash index
GIN (Generalized Inverted Index) and GiST (Generalized Search Tree) indexes support full-text search, JSON documents, arrays, and geometric data; the examples below use GIN.
-- Full-text search with GIN
CREATE INDEX idx_article_content_gin
ON articles USING GIN(to_tsvector('english', content));
SELECT id, title FROM articles
WHERE to_tsvector('english', content) @@ to_tsquery('database & optimization');
-- JSON document search
CREATE INDEX idx_metadata_gin ON events USING GIN(metadata);
SELECT * FROM events WHERE metadata @> '{"event_type": "purchase"}';
Efficient for low-cardinality columns (few distinct values).
-- Bitmap ideal for status, gender, region fields
-- Bitmap indexes are typically created implicitly in columnar stores
-- Example query pattern (in Redshift, Snowflake, Databricks):
SELECT region, status, COUNT(*) AS cnt
FROM orders
WHERE is_active = TRUE
AND region IN ('North', 'South')
GROUP BY region, status;
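Engines that expose bitmap indexes directly, such as Oracle, declare them explicitly. A minimal sketch:

```sql
-- Oracle: explicit bitmap index on a low-cardinality column
CREATE BITMAP INDEX idx_orders_status ON orders(status);
-- Avoid bitmap indexes on heavily updated OLTP tables:
-- their locking granularity is coarse and causes write contention
```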
Includes non-key columns so query can be satisfied without accessing the table.
-- Covering index (PostgreSQL, SQL Server)
CREATE INDEX idx_customer_name_email
ON customers(customer_id) INCLUDE (name, email);
-- Query satisfied entirely from index (index-only scan)
SELECT name, email FROM customers WHERE customer_id = 456;
Index only rows matching a condition, saving space.
-- Partial index for active customers only
CREATE INDEX idx_active_customers
ON customers(email) WHERE status = 'active';
-- Query uses partial index
SELECT * FROM customers
WHERE status = 'active' AND email = 'john@example.com';
-- Partial index for recent orders
CREATE INDEX idx_pending_orders
ON orders(order_date DESC)
WHERE status = 'pending';
Splitting large tables into smaller physical pieces for performance, maintainability, and parallelism.
Most common in data warehouses and time-series systems.
-- PostgreSQL Range Partitioning
CREATE TABLE fact_events (
event_id BIGSERIAL,
user_id INT,
event_type VARCHAR(50),
event_timestamp TIMESTAMP,
properties JSONB,
    PRIMARY KEY (event_id, event_timestamp)   -- must include the partition key
) PARTITION BY RANGE (event_timestamp);
-- Note: partitioning on an expression like DATE_TRUNC('month', ...) would
-- prevent declaring this primary key; range-partitioning the raw timestamp
-- yields the same monthly layout
-- Create partitions for each month
CREATE TABLE fact_events_2024_01 PARTITION OF fact_events
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE fact_events_2024_02 PARTITION OF fact_events
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
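With a predicate on the partition key, the planner scans only the matching partitions (partition pruning). You can confirm this from the plan:

```sql
-- Only fact_events_2024_01 should appear in the plan
EXPLAIN SELECT COUNT(*)
FROM fact_events
WHERE event_timestamp >= '2024-01-01'
  AND event_timestamp < '2024-02-01';
```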
-- Databricks SQL (Delta Lake)
CREATE TABLE fact_events (
event_id BIGINT,
user_id INT,
event_type STRING,
event_timestamp TIMESTAMP,
properties MAP<STRING, STRING>
)
USING DELTA
PARTITIONED BY (year INT, month INT);
-- Note: Delta liquid clustering (CLUSTER BY) cannot be combined with
-- PARTITIONED BY; to co-locate rows by user_id within partitions,
-- use OPTIMIZE ... ZORDER BY instead
-- Insert automatically partitions
INSERT INTO fact_events VALUES
(1, 100, 'click', '2024-03-15 10:30:00', map(), 2024, 3);
Partition by discrete values (region, country, tenant_id).
-- PostgreSQL List Partitioning
CREATE TABLE orders (
order_id BIGSERIAL,
customer_id INT,
region VARCHAR(50),
order_date DATE,
amount DECIMAL(10, 2),
PRIMARY KEY (order_id, region)
) PARTITION BY LIST (region);
CREATE TABLE orders_north PARTITION OF orders
FOR VALUES IN ('North', 'Northeast', 'Northwest');
CREATE TABLE orders_south PARTITION OF orders
FOR VALUES IN ('South', 'Southeast', 'Southwest');
CREATE TABLE orders_central PARTITION OF orders
FOR VALUES IN ('Central', 'Midwest');
Distribute rows evenly across partitions using a hash function.
-- PostgreSQL Hash Partitioning
CREATE TABLE user_events (
event_id BIGSERIAL,
user_id INT,
event_type VARCHAR(50),
event_timestamp TIMESTAMP,
PRIMARY KEY (event_id, user_id)
) PARTITION BY HASH (user_id);
-- 4 hash partitions
CREATE TABLE user_events_0 PARTITION OF user_events
FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE user_events_1 PARTITION OF user_events
FOR VALUES WITH (modulus 4, remainder 1);
-- ...create 2 and 3 similarly
Combine multiple strategies, e.g. range partitioning by date plus clustering on a high-cardinality key.
-- Databricks: partition by date, then Z-order within each partition
CREATE TABLE fact_sales_composite (
    sale_id BIGINT,
    customer_id INT,
    product_id INT,
    sale_date DATE,
    amount DECIMAL(12, 2)
)
USING DELTA
PARTITIONED BY (year INT, month INT);

-- Co-locate rows by customer_id inside each partition (CLUSTER BY cannot
-- be combined with PARTITIONED BY, so use Z-ordering)
OPTIMIZE fact_sales_composite ZORDER BY (customer_id);
-- Query prunes partitions first, then uses clustering
SELECT SUM(amount) FROM fact_sales_composite
WHERE year = 2024 AND month = 3 AND customer_id = 100;
Constraints enforce data quality at the database level, preventing invalid states.
CREATE TABLE organizations (
org_id SERIAL PRIMARY KEY, -- PRIMARY KEY
org_name VARCHAR(100) NOT NULL, -- NOT NULL
email VARCHAR(100) UNIQUE, -- UNIQUE (allows NULL)
industry VARCHAR(50),
founded_year INT CHECK (founded_year >= 1900), -- CHECK
status VARCHAR(20) DEFAULT 'active', -- DEFAULT
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE subscriptions (
subscription_id SERIAL PRIMARY KEY,
org_id INT NOT NULL,
plan VARCHAR(20) DEFAULT 'basic',
monthly_seats INT CHECK (monthly_seats > 0),
billing_email VARCHAR(100) NOT NULL,
-- FOREIGN KEY with CASCADE delete
CONSTRAINT fk_org FOREIGN KEY (org_id)
REFERENCES organizations(org_id) ON DELETE CASCADE,
-- Unique constraint on composite columns
UNIQUE(org_id, plan),
-- CHECK constraint across columns
CHECK (monthly_seats <= 1000)
);
CREATE TABLE invoices (
invoice_id SERIAL PRIMARY KEY,
subscription_id INT,
invoice_date DATE NOT NULL,
amount DECIMAL(10, 2) NOT NULL CHECK (amount > 0),
status VARCHAR(20) DEFAULT 'pending',
paid_date DATE,
CONSTRAINT fk_subscription FOREIGN KEY (subscription_id)
REFERENCES subscriptions(subscription_id)
ON DELETE SET NULL, -- SET NULL on parent delete
CHECK (status IN ('pending', 'paid', 'cancelled')),
CHECK (paid_date IS NULL OR paid_date >= invoice_date)
);
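These constraints reject bad data at write time rather than letting it accumulate. For example:

```sql
-- Rejected: violates the CHECK (amount > 0) constraint
INSERT INTO invoices (subscription_id, invoice_date, amount)
VALUES (1, '2024-03-01', -50.00);
```

The database raises a check-constraint violation and the row is never stored, so downstream reports cannot silently include negative amounts.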
-- Referential integrity examples
-- ON DELETE options:
-- CASCADE: delete child rows
-- SET NULL: set FK to NULL
-- RESTRICT: prevent deletion if children exist
-- SET DEFAULT: set FK to default value
CREATE TABLE audit_log (
log_id SERIAL PRIMARY KEY,
org_id INT,
action VARCHAR(50) NOT NULL,
changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT fk_audit_org FOREIGN KEY (org_id)
REFERENCES organizations(org_id)
ON DELETE RESTRICT -- Prevent org deletion if audit log exists
);
Design pattern for Software-as-a-Service with organization isolation and partitioning on org_id.
-- Organizations (tenants)
CREATE TABLE organizations (
org_id SERIAL PRIMARY KEY,
org_name VARCHAR(100) NOT NULL,
country VARCHAR(50),
industry VARCHAR(50),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_org_name ON organizations(org_name);
-- Users within organizations
CREATE TABLE users (
user_id SERIAL PRIMARY KEY,
org_id INT NOT NULL REFERENCES organizations(org_id) ON DELETE CASCADE,
email VARCHAR(100) NOT NULL,
password_hash VARCHAR(255),
full_name VARCHAR(100),
role VARCHAR(20) DEFAULT 'member',
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE (org_id, email) -- Email unique per org
);
CREATE INDEX idx_users_org_id ON users(org_id);
-- Subscriptions / Plans
CREATE TABLE subscriptions (
subscription_id SERIAL PRIMARY KEY,
org_id INT NOT NULL REFERENCES organizations(org_id) ON DELETE CASCADE,
plan_type VARCHAR(50) DEFAULT 'starter',
billing_cycle VARCHAR(20) DEFAULT 'monthly',
monthly_seats INT DEFAULT 10,
price_per_month DECIMAL(10, 2),
subscription_date DATE NOT NULL,
next_billing_date DATE,
status VARCHAR(20) DEFAULT 'active',
UNIQUE (org_id) -- One sub per org
);
-- Invoices
CREATE TABLE invoices (
invoice_id SERIAL PRIMARY KEY,
org_id INT NOT NULL REFERENCES organizations(org_id) ON DELETE CASCADE,
subscription_id INT NOT NULL REFERENCES subscriptions(subscription_id),
invoice_date DATE NOT NULL,
amount DECIMAL(10, 2) NOT NULL,
status VARCHAR(20) DEFAULT 'pending',
paid_date DATE,
due_date DATE,
payment_method VARCHAR(50)
)
PARTITION BY LIST (org_id); -- Partition by tenant for isolation
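Per-tenant partitions are then created explicitly (the org_id values below are illustrative):

```sql
-- Dedicated partitions for large tenants
CREATE TABLE invoices_org_1 PARTITION OF invoices FOR VALUES IN (1);
CREATE TABLE invoices_org_2 PARTITION OF invoices FOR VALUES IN (2);
-- Catch-all for tenants without a dedicated partition (PostgreSQL 11+)
CREATE TABLE invoices_default PARTITION OF invoices DEFAULT;
```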
-- Usage events (time-series, high cardinality)
CREATE TABLE usage_events (
event_id BIGSERIAL,
org_id INT NOT NULL,
user_id INT,
event_type VARCHAR(50),
event_timestamp TIMESTAMP NOT NULL,
properties JSONB,
    PRIMARY KEY (event_id, org_id, event_timestamp)  -- includes the partition key
)
PARTITION BY RANGE (event_timestamp);  -- create one partition per month
-- Audit log with org isolation
CREATE TABLE audit_log (
    log_id BIGSERIAL,
    org_id INT NOT NULL,
    user_id INT,
    action VARCHAR(100),
    resource_type VARCHAR(50),
    resource_id INT,
    old_values JSONB,
    new_values JSONB,
    changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (log_id, changed_at)   -- PK must include the partition key
)
PARTITION BY RANGE (changed_at);
-- Row-level security policy (PostgreSQL example)
ALTER TABLE subscriptions ENABLE ROW LEVEL SECURITY;
CREATE POLICY org_isolation ON subscriptions
USING (org_id = current_setting('app.current_org_id')::INT);
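The policy reads the tenant id from a session setting that the application sets on each connection. A hypothetical session flow (the setting name app.current_org_id matches the policy above; the value 42 is illustrative):

```sql
-- The application identifies the tenant for this connection
SET app.current_org_id = '42';

-- For roles subject to RLS, queries now see only org 42's rows
SELECT * FROM subscriptions;   -- filtered by the org_isolation policy
```

Note that table owners and superusers bypass RLS by default; application queries should run under a dedicated non-owner role.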
Complete dimensional model for web analytics with multiple fact tables (Galaxy schema).
-- Conformed Dimension: Date
CREATE TABLE dim_date (
date_key INT PRIMARY KEY,
full_date DATE NOT NULL UNIQUE,
year INT, quarter INT, month INT, day_of_month INT,
day_of_week INT, week_of_year INT,
is_weekend BOOLEAN, is_holiday BOOLEAN
);
-- Dimension: User / Visitor
CREATE TABLE dim_user (
user_key SERIAL PRIMARY KEY,
user_id VARCHAR(100) NOT NULL UNIQUE,
first_visit_date DATE,
country VARCHAR(50),
state VARCHAR(50),
city VARCHAR(50),
device_type VARCHAR(20),
os VARCHAR(50),
browser VARCHAR(50),
cohort_month DATE,
lifetime_value DECIMAL(12, 2),
is_active BOOLEAN DEFAULT TRUE
);
-- Dimension: Page
CREATE TABLE dim_page (
page_key SERIAL PRIMARY KEY,
page_id VARCHAR(200) NOT NULL UNIQUE,
page_url TEXT NOT NULL,
page_title VARCHAR(255),
section VARCHAR(50),
subsection VARCHAR(50),
page_type VARCHAR(50),
is_indexed BOOLEAN DEFAULT TRUE
);
-- Dimension: Campaign
CREATE TABLE dim_campaign (
campaign_key SERIAL PRIMARY KEY,
campaign_id VARCHAR(100) NOT NULL UNIQUE,
campaign_name VARCHAR(255),
channel VARCHAR(50),
source VARCHAR(100),
medium VARCHAR(50),
campaign_start_date DATE,
campaign_end_date DATE,
budget DECIMAL(10, 2)
);
-- Fact Table: Page Views (grain = one row per page view)
CREATE TABLE fact_page_views (
    pageview_id BIGSERIAL,
    user_key INT NOT NULL REFERENCES dim_user(user_key),
    page_key INT NOT NULL REFERENCES dim_page(page_key),
    campaign_key INT REFERENCES dim_campaign(campaign_key),
    date_key INT NOT NULL REFERENCES dim_date(date_key),
    time_of_day TIME,
    session_id VARCHAR(100),
    time_on_page INT, -- seconds
    scroll_depth DECIMAL(5, 2), -- percent
    bounce INT CHECK (bounce IN (0, 1)),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (pageview_id, date_key)  -- PK must include the partition key
)
PARTITION BY RANGE (date_key);
-- Indexes on fact_page_views
CREATE INDEX idx_fpv_user ON fact_page_views(user_key);
CREATE INDEX idx_fpv_page ON fact_page_views(page_key);
CREATE INDEX idx_fpv_date ON fact_page_views(date_key);
CREATE INDEX idx_fpv_campaign ON fact_page_views(campaign_key);
-- Fact Table: Events (grain = one row per user interaction)
CREATE TABLE fact_events (
    event_id BIGSERIAL,
    user_key INT NOT NULL REFERENCES dim_user(user_key),
    page_key INT NOT NULL REFERENCES dim_page(page_key),
    date_key INT NOT NULL REFERENCES dim_date(date_key),
    event_type VARCHAR(50) NOT NULL,
    event_name VARCHAR(100),
    event_value DECIMAL(10, 2),
    event_category VARCHAR(50),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (event_id, date_key)  -- PK must include the partition key
)
PARTITION BY RANGE (date_key);
CREATE INDEX idx_fe_user ON fact_events(user_key);
CREATE INDEX idx_fe_event_type ON fact_events(event_type);
CREATE INDEX idx_fe_date ON fact_events(date_key);
-- Analytical Queries
-- Page performance analysis
SELECT
dp.section,
dp.page_title,
COUNT(DISTINCT fpv.pageview_id) AS page_views,
COUNT(DISTINCT fpv.user_key) AS unique_visitors,
ROUND(100.0 * SUM(fpv.bounce) / COUNT(*), 2) AS bounce_rate,
ROUND(AVG(fpv.time_on_page), 0) AS avg_time_seconds,
ROUND(AVG(fpv.scroll_depth), 2) AS avg_scroll_depth
FROM fact_page_views fpv
JOIN dim_page dp ON fpv.page_key = dp.page_key
JOIN dim_date dd ON fpv.date_key = dd.date_key
WHERE dd.full_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY dp.section, dp.page_title
ORDER BY page_views DESC
LIMIT 20;
-- Campaign effectiveness
SELECT
dc.campaign_name,
dc.channel,
COUNT(DISTINCT fpv.user_key) AS users_reached,
COUNT(DISTINCT fe.event_id) AS total_events,
SUM(CASE WHEN fe.event_type = 'purchase' THEN fe.event_value ELSE 0 END)
AS revenue,
ROUND(100.0 * COUNT(CASE WHEN fe.event_type = 'purchase' THEN 1 END)
/ NULLIF(COUNT(DISTINCT fpv.user_key), 0), 2) AS conversion_rate
FROM fact_page_views fpv
LEFT JOIN fact_events fe ON fpv.user_key = fe.user_key
AND fpv.date_key = fe.date_key
JOIN dim_campaign dc ON fpv.campaign_key = dc.campaign_key
JOIN dim_date dd ON fpv.date_key = dd.date_key
WHERE dd.full_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY dc.campaign_name, dc.channel
ORDER BY revenue DESC;
-- User cohort analysis
SELECT
du.cohort_month,
dd.full_date - du.first_visit_date AS days_since_first_visit,
COUNT(DISTINCT fpv.user_key) AS active_users,
COUNT(DISTINCT fe.event_id) AS events,
ROUND(AVG(du.lifetime_value), 2) AS avg_ltv
FROM fact_page_views fpv
JOIN fact_events fe ON fpv.user_key = fe.user_key AND fpv.date_key = fe.date_key
JOIN dim_user du ON fpv.user_key = du.user_key
JOIN dim_date dd ON fpv.date_key = dd.date_key
WHERE du.cohort_month IS NOT NULL
AND dd.full_date >= CURRENT_DATE - INTERVAL '180 days'
GROUP BY du.cohort_month, 2
ORDER BY du.cohort_month, 2;
Schema design is only part of optimization; query execution matters equally.
Understand how the database executes your query.
-- PostgreSQL EXPLAIN ANALYZE
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
dc.cust_name,
SUM(fs.revenue) AS total_revenue
FROM fact_sales fs
JOIN dim_customer dc ON fs.cust_key = dc.cust_key
WHERE EXTRACT(YEAR FROM fs.created_at) = 2024
GROUP BY dc.cust_name
HAVING SUM(fs.revenue) > 5000
ORDER BY total_revenue DESC;
/*
Output (example):
Finalize GroupAggregate  (cost=5000.00..6000.00 rows=100)
  ->  Gather Merge  (cost=4500.00..5500.00 rows=200)
        ->  Sort  (cost=4500.00..4600.00 rows=200)
              ->  Partial GroupAggregate  (cost=3500.00..4500.00 rows=200)
                    ->  Hash Join  (cost=1500.00..3000.00 rows=50000)   <-- JOIN
                          ->  Seq Scan on fact_sales  (cost=0.00..1000.00 rows=100000)
                                Filter: (EXTRACT(YEAR ...) = 2024)      <-- FILTER
                          ->  Hash  (cost=500.00..500.00 rows=5000)
                                ->  Seq Scan on dim_customer  (cost=0.00..500.00)
Key observations:
- Seq Scan on fact_sales: might benefit from date index
- Hash Join: good for large joins
- Seq Scan on small dim_customer: expected, dimension table is small
*/
-- Add an index, and make the predicate sargable:
-- EXTRACT(YEAR FROM created_at) wraps the column in a function, so a plain
-- B-tree index on created_at cannot be used until the filter is rewritten
-- as a range.
CREATE INDEX idx_fact_sales_created_at ON fact_sales(created_at);
-- Re-run EXPLAIN with the range form of the filter
EXPLAIN (ANALYZE, BUFFERS)
SELECT ... -- same query as above, but with
-- WHERE fs.created_at >= '2024-01-01' AND fs.created_at < '2025-01-01'
/*
With the index and the range predicate, the planner can replace the Seq Scan
with an Index Scan or Bitmap Heap Scan, reducing I/O significantly on
large tables.
*/
Pre-compute expensive aggregations and refresh them on a schedule. (In PostgreSQL, REFRESH MATERIALIZED VIEW recomputes the whole view; the CONCURRENTLY option avoids blocking readers but is not incremental.)
-- Materialized view for daily sales summary
CREATE MATERIALIZED VIEW mv_daily_sales_summary AS
SELECT
dd.full_date,
ds.region,
COUNT(*) AS num_transactions,
SUM(fs.quantity) AS total_units,
SUM(fs.revenue) AS total_revenue,
ROUND(AVG(fs.profit), 2) AS avg_profit
FROM fact_sales fs
JOIN dim_date dd ON fs.date_key = dd.date_key
JOIN dim_store ds ON fs.store_key = ds.store_key
GROUP BY dd.full_date, ds.region;
-- CONCURRENTLY requires a unique index on the materialized view
CREATE UNIQUE INDEX idx_mv_daily_sales_key
    ON mv_daily_sales_summary(full_date, region);
-- Refresh periodically without blocking readers (full recomputation)
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_sales_summary;
-- Fast queries against pre-aggregated data
SELECT full_date, region, total_revenue
FROM mv_daily_sales_summary
WHERE full_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY full_date DESC;
Strategically duplicate data to avoid joins.
-- Normalized (joins required)
SELECT
fs.sale_id,
dc.cust_name,
dc.email,
dp.prod_name,
fs.quantity,
fs.revenue
FROM fact_sales fs
JOIN dim_customer dc ON fs.cust_key = dc.cust_key
JOIN dim_product dp ON fs.prod_key = dp.prod_key
WHERE fs.date_key = 20240315;
-- Denormalized (no joins, but redundant data)
CREATE TABLE fact_sales_denorm AS
SELECT
fs.sale_id,
dc.cust_name,
dc.email,
dp.prod_name,
dp.category,
fs.quantity,
fs.revenue,
fs.date_key
FROM fact_sales fs
JOIN dim_customer dc ON fs.cust_key = dc.cust_key
JOIN dim_product dp ON fs.prod_key = dp.prod_key;
-- Much faster for OLAP queries, but harder to maintain
-- Use only if join cost is prohibitive and data refresh is infrequent
Manage database connections efficiently.
; PgBouncer configuration example (pgbouncer.ini)
[databases]
analytics_db = host=db.internal port=5432 dbname=analytics

[pgbouncer]
; transaction pooling: the connection returns to the pool after each transaction
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
-- Typical production setup:
-- - Read replicas for analytical queries
-- - Write to primary, replicate to replicas
-- - Use connection pooler between app and DB
-- - Enable query caching layer (Redis, Memcached)
Keep statistics fresh for optimal query planning.
-- Analyze tables to update statistics
ANALYZE fact_sales;
ANALYZE dim_customer;
-- Vacuum to reclaim space and remove dead tuples
VACUUM ANALYZE fact_sales;
-- Schedule regular maintenance (cron runs shell commands, so wrap in psql):
-- 0 2 * * * psql -d analytics -c "VACUUM ANALYZE fact_sales;"
-- 0 3 * * * psql -d analytics -c "REINDEX INDEX idx_fact_sales_date_key;"
Effective database schema design is the foundation of data system performance. Key principles: