Normalization is the process of organizing columns and tables in a relational database to reduce redundancy and eliminate undesirable characteristics such as insertion, update, and deletion anomalies. It was introduced by Edgar F. Codd in a series of papers from 1970 onward, starting with the seminal "A Relational Model of Data for Large Shared Data Banks." Codd defined the first three normal forms; later researchers (Raymond Boyce, Ronald Fagin, and others) added stricter forms. Normalization is primarily applied to OLTP systems where data integrity and write performance matter. Analytical systems often deliberately denormalize for read speed.
At its core, normalization is a set of guidelines — actually, a set of mathematical definitions — that tell you how to decompose tables so that each fact is stored in exactly one place. When a fact appears in only one row, it cannot become inconsistent with itself. That is the whole motivation.
Codd's relational model gave the database community three things at once: a simple, uniform data structure (the relation), a declarative way to manipulate it (the relational algebra and calculus), and a theory of integrity constraints (keys and dependencies) on which normalization is built.
Normalization matters because without it, the same fact is stored in many places. When facts diverge, queries return contradictory answers depending on where they look. Normalization is not about saving bytes — modern disks are cheap. It is about preventing contradictions and making correctness a structural property of the schema rather than a discipline the application layer has to enforce.
The normal forms form a hierarchy. Each higher form includes the constraints of those below it:
6NF (temporal / irreducible)
|
5NF (project-join)
|
4NF (multi-valued dependency)
|
BCNF (every determinant is a key)
|
3NF (no transitive dependency)
|
2NF (no partial dependency)
|
1NF (atomic values)
|
Unnormalized Form (UNF)
DKNF (Domain-Key Normal Form) sits to the side — it is a theoretical ideal rather than a step in the same chain.
Before discussing the normal forms themselves, look at the problems they exist to solve. Consider a single denormalized table tracking employees, their departments, and the projects they work on:
employee_projects
+--------+----------+---------+-----------+-----------+--------------+----------+
| emp_id | emp_name | dept_id | dept_name | dept_head | project_code | hours |
+--------+----------+---------+-----------+-----------+--------------+----------+
| 101 | Alice | D10 | Research | Dr. Smith | P-ALPHA | 20 |
| 101 | Alice | D10 | Research | Dr. Smith | P-BETA | 15 |
| 102 | Bob | D10 | Research | Dr. Smith | P-ALPHA | 35 |
| 103 | Carol | D20 | Sales | Ms. Jones | P-GAMMA | 40 |
| 104 | Dave | D20 | Sales | Ms. Jones | P-GAMMA | 30 |
+--------+----------+---------+-----------+-----------+--------------+----------+
Three distinct pathologies appear in this single table.
Insertion anomaly. Suppose a new department D30 Marketing is formed but no one has been hired into it yet. There is no way to record the department's existence: the primary key must include emp_id and project_code, and both would be NULL. You cannot add a fact about a department without an employee.
Update anomaly. Department D10's head changes from Dr. Smith to Dr. Patel. Every row where dept_id = D10 must be updated in lockstep. If the UPDATE statement misses even one row (a bug, an abort, a concurrent writer), the table now asserts two different department heads for D10 simultaneously. The same fact stored in many rows can drift apart.
Deletion anomaly. Carol leaves the company, and her only row is deleted. If Dave also leaves at the same time, all knowledge of department D20 and its head Ms. Jones is gone. Deleting an employee should not erase facts about a department.
Normalization solves these by splitting this one table into three:
employees departments assignments
+--------+-------+ +---------+----------+ +--------+-----------+-------+
| emp_id | name | | dept_id | name | | emp_id | project | hours |
+--------+-------+ +---------+----------+ +--------+-----------+-------+
| 101 | Alice | | D10 | Research | | 101 | P-ALPHA | 20 |
| 102 | Bob | | D20 | Sales | | 101 | P-BETA | 15 |
| 103 | Carol | | D30 | Marketing| | 102 | P-ALPHA | 35 |
| 104 | Dave | +---------+----------+ | 103 | P-GAMMA | 40 |
+--------+-------+ | 104 | P-GAMMA | 30 |
+--------+-----------+-------+
Now the Marketing department exists with no employees. The department head is stored once, not four times. Deleting Carol erases nothing about D20.
Normalization theory is built on the concept of a functional dependency (FD). An FD is a constraint between two sets of attributes in a relation.
Notation: X → Y reads as "X functionally determines Y" or "Y is functionally dependent on X." It means: whenever two tuples agree on the values of attributes in X, they must also agree on the values in Y.
For example, in an employees table, emp_id → emp_name says: if two rows have the same employee id, they must have the same employee name. That is a statement about the real world, not about the current data — it holds for every possible instance of the table.
An FD X → Y is trivial if Y is a subset of X. For example, {emp_id, name} → emp_id is trivial — of course two rows that agree on id and name agree on id. Trivial FDs hold in every relation and carry no design information.
A non-trivial FD is one where Y is not a subset of X. These are the FDs that drive normalization.
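Because an FD is a statement about every possible instance, data can only falsify it, never prove it. A violation check is still useful in practice; a minimal sketch in Python, with rows as dicts and the emp_id → emp_name example (the duplicate "Alicia" row is invented to trigger a violation):

```python
def fd_violations(rows, X, Y):
    """Return X-values witnessing a violation of X -> Y in this instance.
    rows: list of dicts; X, Y: lists of attribute names.
    An empty result means the data is consistent with the FD,
    not that the FD holds in general."""
    seen, bad = {}, []
    for row in rows:
        key = tuple(row[a] for a in X)
        val = tuple(row[a] for a in Y)
        if key in seen and seen[key] != val:
            bad.append(key)
        seen.setdefault(key, val)
    return bad

rows = [
    {"emp_id": 101, "emp_name": "Alice"},
    {"emp_id": 101, "emp_name": "Alicia"},   # same id, different name
    {"emp_id": 102, "emp_name": "Bob"},
]
print(fd_violations(rows, ["emp_id"], ["emp_name"]))  # [(101,)]
```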
William W. Armstrong (1974) gave a sound and complete set of inference rules for deriving all FDs that follow from a given set. The axioms are:
- Reflexivity: if Y ⊆ X, then X → Y.
- Augmentation: if X → Y, then XZ → YZ for any Z.
- Transitivity: if X → Y and Y → Z, then X → Z.
From these three, additional rules can be derived:
- Union: if X → Y and X → Z, then X → YZ.
- Decomposition: if X → YZ, then X → Y and X → Z.
- Pseudotransitivity: if X → Y and WY → Z, then WX → Z.
These rules together are sound (they derive only FDs that hold) and complete (they derive every FD that logically follows from the given set).
Given a set of FDs F and an attribute set X, the closure X⁺ is the set of all attributes that X functionally determines. Computing X⁺ is the workhorse algorithm of normalization theory — it answers the questions "is X a superkey?" and "does X → Y follow from F?".
Algorithm (informal):
closure(X, F):
result ← X
repeat:
for each FD A → B in F:
if A ⊆ result:
result ← result ∪ B
until result stops growing
return result
A superkey is any attribute set whose closure contains every attribute. A candidate key is a minimal superkey — no proper subset is itself a superkey. The primary key is a candidate key chosen by the designer.
Example. Consider R(A, B, C, D) with FDs:
F = { A → B, B → C, CD → A }
Compute candidate keys by finding minimal X with X⁺ = {A, B, C, D}.
{A}⁺ = {A, B, C} not a key (missing D)
{D}⁺ = {D} not a key
{A,D}⁺ = {A, D, B, C} key
{B,D}⁺ = {B, D, C, A} key (using B→C, CD→A)
{C,D}⁺ = {C, D, A, B} key
Candidate keys: {A,D}, {B,D}, {C,D}.
Prime attributes: A, B, C, D.
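The closure algorithm above is a few lines of Python; running it over this example confirms the candidate keys:

```python
def closure(X, fds):
    """Compute the attribute closure X+ under a list of FDs.
    Each FD is a pair (lhs, rhs) of attribute sets."""
    result = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# F = { A -> B, B -> C, CD -> A } from the worked example above.
fds = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C", "D"}, {"A"})]

assert closure({"A"}, fds) == {"A", "B", "C"}            # not a key: missing D
assert closure({"A", "D"}, fds) == {"A", "B", "C", "D"}  # candidate key
assert closure({"B", "D"}, fds) == {"A", "B", "C", "D"}  # candidate key
assert closure({"C", "D"}, fds) == {"A", "B", "C", "D"}  # candidate key
```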
A compact way to visualize FDs in a single relation:
orders_denormalized(order_id, product_id, customer_id, customer_name, product_name, qty)
order_id ────► customer_id
│
▼
customer_name
product_id ────► product_name
{order_id, product_id} ────► qty (the PK)
Arrows originating from non-key attributes or from proper subsets of the composite key are red flags — they point at 2NF or 3NF violations.
Rule: every attribute value is atomic; there are no repeating groups, arrays, or nested structures. Every row has the same number of columns, and every cell holds one value from the attribute's domain.
contacts (violates 1NF)
+--------+---------+---------------------------------+
| cid | name | phone_numbers |
+--------+---------+---------------------------------+
| 1 | Alice | 555-1111, 555-2222 |
| 2 | Bob | 555-3333 |
| 3 | Carol | 555-4444, 555-5555, 555-6666 |
+--------+---------+---------------------------------+
The phone_numbers cell holds a list. Querying "who has phone 555-2222?" forces a substring search. Updating a single number is awkward.
CREATE TABLE contacts (
cid INT PRIMARY KEY,
name VARCHAR(100) NOT NULL
);
CREATE TABLE contact_phones (
cid INT NOT NULL REFERENCES contacts(cid) ON DELETE CASCADE,
phone_number VARCHAR(20) NOT NULL,
phone_type VARCHAR(10),
PRIMARY KEY (cid, phone_number)
);
contacts contact_phones
+-----+-------+ +-----+--------------+------------+
| cid | name | | cid | phone_number | phone_type |
+-----+-------+ +-----+--------------+------------+
| 1 | Alice | | 1 | 555-1111 | mobile |
| 2 | Bob | | 1 | 555-2222 | home |
| 3 | Carol | | 2 | 555-3333 | mobile |
+-----+-------+ | 3 | 555-4444 | mobile |
| 3 | 555-5555 | home |
| 3 | 555-6666 | work |
+-----+--------------+------------+
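A one-off backfill from the list-valued column into contact_phones can be sketched as follows (in-memory stand-ins for the tables; the phone_type values shown in the target table are not recoverable from the flat source, so they are left as None here):

```python
# The flat source rows: (cid, name, comma-separated phone list).
flat = [
    (1, "Alice", "555-1111, 555-2222"),
    (2, "Bob",   "555-3333"),
    (3, "Carol", "555-4444, 555-5555, 555-6666"),
]

contacts = [(cid, name) for cid, name, _ in flat]
contact_phones = [
    (cid, number.strip(), None)  # phone_type unknown in the source
    for cid, _, numbers in flat
    for number in numbers.split(",")
]

assert len(contact_phones) == 6                 # one row per phone number
assert (1, "555-2222", None) in contact_phones  # now findable by equality, not substring
```

The point of the exercise: after the split, "who has phone 555-2222?" is an equality lookup on an indexed column instead of a substring search.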
Modern databases (PostgreSQL, MySQL, SQL Server) offer JSON and JSONB column types. Strictly speaking, a JSON column can hold a nested array, which looks like a 1NF violation. Practitioners are split: one camp treats any nested structure as a repeating group that belongs in child tables; the other points out that modern engines can index and query inside JSON (GIN indexes in PostgreSQL, path expressions in SQL Server), so the database treats it as first-class data, not an opaque blob. A useful rule of thumb: if you will filter, join, or aggregate on the nested values, they belong in their own table. If the JSON is a self-contained payload (webhook body, event envelope, extension metadata), a JSON column is fine.
Rule: the relation is in 1NF, and every non-prime attribute is fully functionally dependent on every candidate key — no partial dependency on a proper subset of a composite key.
2NF is only a concern when a table has a composite primary key. If the PK is a single column, the table is automatically in 2NF (there are no proper non-empty subsets of a single attribute).
order_items (1NF but not 2NF)
PK = (order_id, product_id)
+----------+------------+---------------+---------------+----------+
| order_id | product_id | product_name | product_price | quantity |
+----------+------------+---------------+---------------+----------+
| 1001 | P-100 | USB-C Cable | 12.99 | 2 |
| 1001 | P-200 | HDMI Adapter | 19.99 | 1 |
| 1002 | P-100 | USB-C Cable | 12.99 | 3 |
| 1003 | P-200 | HDMI Adapter | 19.99 | 1 |
+----------+------------+---------------+---------------+----------+
FDs:
{order_id, product_id} → quantity
product_id → product_name
product_id → product_price
The last two FDs say product attributes depend on only PART of the PK.
That is a partial dependency — 2NF violation.
Consequences: the product name and price repeat on every order_item. Raising a product's price requires updating every order_items row. A product that has never been ordered cannot be stored here at all.
CREATE TABLE products (
product_id VARCHAR(10) PRIMARY KEY,
product_name VARCHAR(100) NOT NULL,
product_price DECIMAL(10, 2) NOT NULL CHECK (product_price >= 0)
);
CREATE TABLE order_items (
order_id INT NOT NULL,
product_id VARCHAR(10) NOT NULL REFERENCES products(product_id),
quantity INT NOT NULL CHECK (quantity > 0),
PRIMARY KEY (order_id, product_id)
);
products order_items
+------------+---------------+---------+ +----------+------------+----------+
| product_id | product_name | price | | order_id | product_id | quantity |
+------------+---------------+---------+ +----------+------------+----------+
| P-100 | USB-C Cable | 12.99 | | 1001 | P-100 | 2 |
| P-200 | HDMI Adapter | 19.99 | | 1001 | P-200 | 1 |
+------------+---------------+---------+ | 1002 | P-100 | 3 |
| 1003 | P-200 | 1 |
+----------+------------+----------+
Price is now stored once per product. A new product can exist without any order. Raising a price is one UPDATE.
Rule: the relation is in 2NF, and no non-prime attribute is transitively dependent on a candidate key. Equivalently (Zaniolo's definition): for every non-trivial FD X → A, either X is a superkey or A is a prime attribute.
A transitive dependency is when K → X and X → A where X is not a superkey and A is not prime. The non-key attribute A depends on the key only "through" another non-key attribute.
The classic rhyme, attributed to the database-design folklore of the 1980s:
"The key, the whole key, and nothing but the key — so help me Codd."
1NF enforces the key (atomic values keyed by the PK). 2NF enforces the whole key (no partial dependencies). 3NF enforces nothing but the key (no transitive dependencies).
employees (2NF but not 3NF)
PK = emp_id
+--------+----------+---------+-----------+---------------+
| emp_id | name | dept_id | dept_name | dept_location |
+--------+----------+---------+-----------+---------------+
| 101 | Alice | D10 | Research | Building A |
| 102 | Bob | D10 | Research | Building A |
| 103 | Carol | D20 | Sales | Building C |
| 104 | Dave | D20 | Sales | Building C |
+--------+----------+---------+-----------+---------------+
FDs:
emp_id → name, dept_id
dept_id → dept_name, dept_location
dept_name depends on emp_id only via dept_id — transitive dependency.
CREATE TABLE departments (
dept_id VARCHAR(10) PRIMARY KEY,
dept_name VARCHAR(100) NOT NULL,
dept_location VARCHAR(100)
);
CREATE TABLE employees (
emp_id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
dept_id VARCHAR(10) REFERENCES departments(dept_id)
);
departments employees
+---------+-----------+---------------+ +--------+-------+---------+
| dept_id | dept_name | dept_location | | emp_id | name | dept_id |
+---------+-----------+---------------+ +--------+-------+---------+
| D10 | Research | Building A | | 101 | Alice | D10 |
| D20 | Sales | Building C | | 102 | Bob | D10 |
+---------+-----------+---------------+ | 103 | Carol | D20 |
| 104 | Dave | D20 |
+--------+-------+---------+
Moving a department to a new building is one UPDATE. Inserting D30 Marketing with no employees yet is a simple INSERT into departments. Deleting the last Sales employee does not erase facts about Sales.
Rule: for every non-trivial FD X → Y, X must be a superkey. BCNF is stricter than 3NF because it does not allow the "A is prime" escape hatch.
Raymond F. Boyce and Edgar F. Codd proposed BCNF in 1974 as a refinement of 3NF. In practice, most well-designed 3NF schemas are already in BCNF. The difference shows up only with overlapping candidate keys.
Suppose a consulting firm tracks which consultant advises which client, and each client has one "primary consultant per specialty":
consulting (in 3NF, not BCNF)
Candidate keys: (client_id, specialty) and (client_id, consultant_id)
+-----------+------------+---------------+
| client_id | specialty | consultant_id |
+-----------+------------+---------------+
| C-1 | Tax | CON-100 |
| C-1 | Audit | CON-200 |
| C-2 | Tax | CON-100 |
| C-2 | IT | CON-300 |
+-----------+------------+---------------+
FDs:
{client_id, specialty} → consultant_id
consultant_id → specialty   (each consultant has one specialty)
Every non-prime attribute? There aren't any — every attribute is prime.
Therefore this satisfies 3NF trivially.
But consultant_id → specialty has a non-superkey determinant.
consultant_id alone is not a superkey (you can't derive client_id).
BCNF violated.
The anomaly: if consultant CON-100's specialty shifts from Tax to Advisory, every row mentioning CON-100 must be updated in lockstep, or the table will claim CON-100 has two specialties.
CREATE TABLE consultants (
consultant_id VARCHAR(10) PRIMARY KEY,
specialty VARCHAR(50) NOT NULL
);
CREATE TABLE client_consultants (
client_id VARCHAR(10) NOT NULL,
consultant_id VARCHAR(10) NOT NULL REFERENCES consultants(consultant_id),
PRIMARY KEY (client_id, consultant_id)
);
The decomposition is lossless (joining the two back reconstructs the original), but it is not dependency-preserving. The FD {client_id, specialty} → consultant_id can no longer be enforced by a single-table key constraint — you would need an application-level or trigger-based check. This trade-off is the reason textbooks teach "decompose to 3NF if dependency preservation matters, to BCNF if anomaly-freedom matters."
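What such a trigger-based or application-level check has to do can be sketched in Python (dicts and sets standing in for the two tables; CON-400 is an invented consultant for the demo):

```python
# In-memory stand-ins for the two BCNF tables.
consultants = {"CON-100": "Tax", "CON-200": "Audit", "CON-300": "IT"}
client_consultants = {("C-1", "CON-100"), ("C-1", "CON-200"),
                      ("C-2", "CON-100"), ("C-2", "CON-300")}

def violates_lost_fd(client, consultant):
    """Check the FD {client_id, specialty} -> consultant_id across both
    tables: would this assignment give the client a second consultant
    with the same specialty?"""
    specialty = consultants[consultant]
    return any(c == client and consultants[con] == specialty and con != consultant
               for c, con in client_consultants)

consultants["CON-400"] = "Tax"             # hypothetical new Tax consultant
print(violates_lost_fd("C-1", "CON-400"))  # True: C-1 already has a Tax consultant
print(violates_lost_fd("C-2", "CON-200"))  # False: C-2 has no Audit consultant
```

Note that the check must read both tables, which is exactly why no single-table key constraint can express it.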
Ronald Fagin introduced 4NF in 1977 to handle multi-valued dependencies (MVDs), which FDs cannot describe.
Notation: X ↠ Y (a double arrow) reads "X multidetermines Y." An MVD says: the set of Y-values associated with a given X-value is independent of any other attributes in the relation.
Rule: a relation is in 4NF if it is in BCNF and every non-trivial MVD X ↠ Y has X as a superkey.
An employee can know several programming languages and several spoken languages, and the two sets are independent of each other. Stuffing them into one table creates a spurious cross-product:
employee_skills (BCNF, not 4NF)
PK = (emp_id, prog_lang, spoken_lang) -- the whole row
+--------+-----------+-------------+
| emp_id | prog_lang | spoken_lang |
+--------+-----------+-------------+
| 101 | Python | English |
| 101 | Python | Spanish |
| 101 | Go | English |
| 101 | Go | Spanish |
+--------+-----------+-------------+
MVDs:
emp_id ↠ prog_lang
emp_id ↠ spoken_lang
Adding a third spoken language forces 2 new rows (one per prog_lang).
Adding a third prog_lang forces 2 new rows (one per spoken_lang).
CREATE TABLE employee_prog_langs (
emp_id INT NOT NULL,
prog_lang VARCHAR(30) NOT NULL,
PRIMARY KEY (emp_id, prog_lang)
);
CREATE TABLE employee_spoken_langs (
emp_id INT NOT NULL,
spoken_lang VARCHAR(30) NOT NULL,
PRIMARY KEY (emp_id, spoken_lang)
);
employee_prog_langs employee_spoken_langs
+--------+-----------+ +--------+-------------+
| emp_id | prog_lang | | emp_id | spoken_lang |
+--------+-----------+ +--------+-------------+
| 101 | Python | | 101 | English |
| 101 | Go | | 101 | Spanish |
+--------+-----------+ +--------+-------------+
Now adding a third spoken language is one row. The Cartesian explosion is gone.
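The cross-product behavior is easy to verify: under the two MVDs, the wide table equals the per-employee Cartesian product of the two narrow tables, so one new narrow row fans out into several wide rows. A sketch over the sample data:

```python
from itertools import product

# Narrow, 4NF tables: one row per independent fact.
prog_langs   = {101: {"Python", "Go"}}
spoken_langs = {101: {"English", "Spanish"}}

# The wide employee_skills table is forced to be their cross product.
def wide(emp):
    return {(emp, p, s) for p, s in product(prog_langs[emp], spoken_langs[emp])}

assert wide(101) == {(101, "Python", "English"), (101, "Python", "Spanish"),
                     (101, "Go", "English"), (101, "Go", "Spanish")}

# One new narrow row...
spoken_langs[101].add("French")
# ...fans out into len(prog_langs[101]) == 2 extra wide rows.
assert len(wide(101)) == 6
```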
Rule: a relation is in 5NF if every non-trivial join dependency is implied by the candidate keys. Informally: the table cannot be losslessly decomposed into smaller projections except via its keys.
5NF eliminates certain three-way (or higher) relationships that cannot be captured by binary FDs or MVDs. In practice, 5NF violations are rare and often contrived, but they do exist.
Suppose agents sell products from companies, with the business rule: if agent A represents company C, and company C makes product P, and agent A sells product P (to any company), then agent A must sell product P for company C as well. This is a three-way symmetric constraint.
agent_company_product (not 5NF)
+-------+---------+---------+
| agent | company | product |
+-------+---------+---------+
| A1 | Acme | Widget |
| A1 | Acme | Gadget |
| A1 | Beta | Widget |
| A2 | Acme | Gadget |
+-------+---------+---------+
No FD or MVD violation. But under the three-way rule, the table can be
losslessly reconstructed from three binary projections:
agent_company(agent, company)
company_product(company, product)
agent_product(agent, product)
That reconstruction-via-three-projections is a JOIN DEPENDENCY.
It is not implied by any key. Hence not 5NF.
Decomposition: store the three binary tables separately. The original is recovered by the natural three-way join.
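The join dependency can be checked mechanically: rebuilding the table from all three binary projections is lossless, while any pair of projections invents a spurious row. A sketch over the sample data:

```python
# The sample table from above.
acp = {("A1", "Acme", "Widget"), ("A1", "Acme", "Gadget"),
       ("A1", "Beta", "Widget"), ("A2", "Acme", "Gadget")}

# The three binary projections.
ac = {(a, c) for a, c, _ in acp}   # agent_company
cp = {(c, p) for _, c, p in acp}   # company_product
ap = {(a, p) for a, _, p in acp}   # agent_product

# Three-way natural join: a row survives only if all three projections agree.
rejoined = {(a, c, p)
            for (a, c) in ac
            for (c2, p) in cp if c2 == c and (a, p) in ap}
assert rejoined == acp  # lossless: this is the join dependency

# Any two projections alone are lossy: their join invents a spurious row.
two_way = {(a, c, p) for (a, c) in ac for (c2, p) in cp if c2 == c}
print(two_way - acp)  # {('A2', 'Acme', 'Widget')}
```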
In real-world design, most developers never encounter a 5NF violation. It is worth knowing the form exists mostly to reason about data-vault-style EAV designs and certain many-to-many-to-many junction tables.
Ronald Fagin introduced DKNF in 1981. Rule: every constraint on the relation is a logical consequence of the domain constraints (valid values for each attribute) and the key constraints.
If a schema is in DKNF, no insert, update, or delete anomaly can occur by definition. In principle, achieving DKNF guarantees correctness at the schema level.
In practice, DKNF is mostly theoretical: there is no general algorithm for bringing a schema into it, not every relation can be put into DKNF at all, and many real business rules (cross-row and cross-table constraints) cannot be expressed as domain or key constraints in the first place.
DKNF is worth studying for the insight it gives into why the lower forms exist — every lower form is a step toward (but not reaching) this ideal.
6NF was introduced by C.J. Date, Hugh Darwen, and Nikos Lorentzos in the 2000s, primarily for temporal databases. Rule: a relation satisfies 6NF if it admits no non-trivial join dependencies at all — equivalently, it cannot be decomposed further without losing information.
In practice, 6NF means each table has a key plus at most one non-key attribute. Every fact is stored in its own tiny table.
Anchor Modeling / 6NF style:
employees emp_name emp_salary
+--------+ +--------+-------+ +--------+-------+----------+
| emp_id | | emp_id | name | | emp_id | amt | valid_to |
+--------+ +--------+-------+ +--------+-------+----------+
| 101 | | 101 | Alice | | 101 | 80000 |2024-06-30|
| 102 | | 102 | Bob | | 101 | 90000 |9999-12-31|
+--------+ +--------+-------+ | 102 | 75000 |9999-12-31|
+--------+-------+----------+
Usage: each attribute's history lives in its own table, so attributes can change at independent rates. The current state of an employee is reconstructed by joining the attribute tables on emp_id and filtering for the open-ended valid_to (9999-12-31 in the sample data).
Trade-off: extreme decomposition produces huge join counts. A simple "show this employee" query may need 15 joins. Columnar engines (Snowflake, BigQuery, Databricks) handle this better than row stores, which is why 6NF is associated with modern data warehousing rather than traditional OLTP.
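What "reconstructing current state" costs in a 6NF layout can be sketched over the sample data (dicts and lists standing in for the attribute tables; valid_to = 9999-12-31 marks the current row):

```python
# One structure per 6NF attribute table.
emp_name = {101: "Alice", 102: "Bob"}
emp_salary = [(101, 80000, "2024-06-30"),
              (101, 90000, "9999-12-31"),
              (102, 75000, "9999-12-31")]

def current_view(emp_id):
    """One lookup ('join') per attribute table -- the join-count cost
    the text describes, here with only two attributes."""
    salary = next(amt for eid, amt, valid_to in emp_salary
                  if eid == emp_id and valid_to == "9999-12-31")
    return {"emp_id": emp_id, "name": emp_name[emp_id], "salary": salary}

print(current_view(101))  # {'emp_id': 101, 'name': 'Alice', 'salary': 90000}
```

With two attributes this is two lookups; a real entity with fifteen attributes needs fifteen, which is where the "huge join counts" come from.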
A normalization exercise typically proceeds through these steps:
1. List the attributes and gather the functional dependencies from the business rules.
2. Compute attribute closures to find the candidate keys.
3. Check each normal form in turn (1NF, 2NF, 3NF, BCNF) and decompose losslessly wherever a violation appears.
4. Verify that the decomposition preserves the dependencies you need to enforce.
Normalization is a rule set; denormalization is a deliberate exception. The question is not "normalized or denormalized?" but "which parts of the schema benefit from which approach?"
Common denormalization moves:
- Duplicating frequently read attributes (e.g., cust_name, prod_category) onto a wide table avoids joins on every query.
- Storing a derived aggregate such as order_total alongside order_items rather than recomputing it per query.
Every denormalized copy of a fact is a potential inconsistency waiting to happen. The cost is paid in extra write-path logic: every code path that updates the source of truth must also update the copies, and any path that forgets reintroduces exactly the update anomaly normalization removed.
Materialized views and columnar caches give the query-speed benefit of denormalization while keeping the system of record normalized. The warehouse pattern is:
OLTP (3NF/BCNF) ───► ELT ───► Warehouse (Star / Snowflake) ───► BI
system of record pipeline query-optimized dashboards
Two schools of thought govern enterprise warehouse design:
Ralph Kimball's The Data Warehouse Toolkit advocates dimensional modeling from day one. The warehouse is a collection of conformed star schemas, each serving a business process. Normalization is relegated to the source systems; the warehouse is denormalized on purpose.
Bill Inmon's Corporate Information Factory advocates a single, enterprise-wide 3NF warehouse as the integrated source of truth, with downstream dependent data marts in star form for specific analytical use cases.
Most real architectures combine both: a normalized, Inmon-style integration layer serves as the system of record, feeding Kimball-style star-schema marts for BI. The layered bronze/silver/gold pattern in modern lakehouse platforms is the same idea under a new name.
Start with a single denormalized "sales report" table — the kind you might export from a spreadsheet.
sales_flat (0NF / UNF)
+--------+------------+--------------+--------------------+---------+----------------+--------+
| sale_id| sale_date | customer | customer_email | region | items | total |
+--------+------------+--------------+--------------------+---------+----------------+--------+
| 1 | 2026-01-15 | Alice Chen | alice@example.com | West | P1:2, P2:1 | 45.97 |
| 2 | 2026-01-16 | Bob Ortiz | bob@example.com | East | P1:1 | 12.99 |
| 3 | 2026-01-17 | Alice Chen | alice@example.com | West | P3:3, P1:1 | 57.96 |
+--------+------------+--------------+--------------------+---------+----------------+--------+
The items column is a repeating group — violates 1NF.
CREATE TABLE sales_1nf (
sale_id INT NOT NULL,
sale_date DATE NOT NULL,
customer VARCHAR(100) NOT NULL,
customer_email VARCHAR(100),
region VARCHAR(50),
product_id VARCHAR(10) NOT NULL,
product_name VARCHAR(100) NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL,
quantity INT NOT NULL,
line_total DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (sale_id, product_id)
);
+---------+------------+------------+-------------------+--------+------------+--------------+-------+-----+----------+
| sale_id | sale_date | customer | customer_email | region | product_id | product_name | price | qty | line_tot |
+---------+------------+------------+-------------------+--------+------------+--------------+-------+-----+----------+
| 1 | 2026-01-15 | Alice Chen | alice@example.com | West | P1 | Cable | 12.99 | 2 | 25.98 |
| 1 | 2026-01-15 | Alice Chen | alice@example.com | West | P2 | Adapter | 19.99 | 1 | 19.99 |
| 2 | 2026-01-16 | Bob Ortiz | bob@example.com | East | P1 | Cable | 12.99 | 1 | 12.99 |
| 3 | 2026-01-17 | Alice Chen | alice@example.com | West | P3 | Hub | 14.99 | 3 | 44.97 |
| 3 | 2026-01-17 | Alice Chen | alice@example.com | West | P1 | Cable | 12.99 | 1 | 12.99 |
+---------+------------+------------+-------------------+--------+------------+--------------+-------+-----+----------+
Now atomic, but still a mess. The FDs reveal why:
sale_id → sale_date, customer, customer_email, region
product_id → product_name, unit_price
{sale_id, product_id} → quantity, line_total
The PK is (sale_id, product_id). Attributes that depend only on sale_id go to a sales table; attributes that depend only on product_id go to products.
CREATE TABLE sales_2nf (
sale_id INT PRIMARY KEY,
sale_date DATE NOT NULL,
customer VARCHAR(100) NOT NULL,
customer_email VARCHAR(100),
region VARCHAR(50)
);
CREATE TABLE products_2nf (
product_id VARCHAR(10) PRIMARY KEY,
product_name VARCHAR(100) NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL
);
CREATE TABLE sale_items_2nf (
sale_id INT NOT NULL REFERENCES sales_2nf(sale_id),
product_id VARCHAR(10) NOT NULL REFERENCES products_2nf(product_id),
quantity INT NOT NULL,
line_total DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (sale_id, product_id)
);
sales_2nf
+---------+------------+------------+-------------------+--------+
| sale_id | sale_date | customer | customer_email | region |
+---------+------------+------------+-------------------+--------+
| 1 | 2026-01-15 | Alice Chen | alice@example.com | West |
| 2 | 2026-01-16 | Bob Ortiz | bob@example.com | East |
| 3 | 2026-01-17 | Alice Chen | alice@example.com | West |
+---------+------------+------------+-------------------+--------+
products_2nf
+------------+--------------+------------+
| product_id | product_name | unit_price |
+------------+--------------+------------+
| P1 | Cable | 12.99 |
| P2 | Adapter | 19.99 |
| P3 | Hub | 14.99 |
+------------+--------------+------------+
sale_items_2nf
+---------+------------+-----+-----------+
| sale_id | product_id | qty | line_tot |
+---------+------------+-----+-----------+
| 1 | P1 | 2 | 25.98 |
| 1 | P2 | 1 | 19.99 |
| 2 | P1 | 1 | 12.99 |
| 3 | P3 | 3 | 44.97 |
| 3 | P1 | 1 | 12.99 |
+---------+------------+-----+-----------+
In sales_2nf, customer_email actually depends on the customer identity, not on sale_id directly. If Alice appears on three sales rows, her email is stored three times. Same for the fact that Alice is in the West region. Promote customers to their own table.
Similarly, line_total is a derived attribute — it equals quantity * unit_price. Strict 3NF design drops derived attributes (or makes them GENERATED columns).
CREATE TABLE customers_3nf (
customer_id SERIAL PRIMARY KEY,
customer_name VARCHAR(100) NOT NULL,
customer_email VARCHAR(100) UNIQUE,
region VARCHAR(50)
);
CREATE TABLE products_3nf (
product_id VARCHAR(10) PRIMARY KEY,
product_name VARCHAR(100) NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL
);
CREATE TABLE sales_3nf (
sale_id SERIAL PRIMARY KEY,
sale_date DATE NOT NULL,
customer_id INT NOT NULL REFERENCES customers_3nf(customer_id)
);
CREATE TABLE sale_items_3nf (
sale_id INT NOT NULL REFERENCES sales_3nf(sale_id) ON DELETE CASCADE,
product_id VARCHAR(10) NOT NULL REFERENCES products_3nf(product_id),
quantity INT NOT NULL CHECK (quantity > 0),
-- line_total is derived. A generated column can only reference columns in the
-- same table, so the sale-time price is captured here as well (prices change
-- over time; an order line needs the price it was actually sold at):
unit_price DECIMAL(10, 2) NOT NULL,
line_total DECIMAL(10, 2) GENERATED ALWAYS AS (quantity * unit_price) STORED,
PRIMARY KEY (sale_id, product_id)
);
customers_3nf
+----+------------+-------------------+--------+
| id | name | email | region |
+----+------------+-------------------+--------+
| 1 | Alice Chen | alice@example.com | West |
| 2 | Bob Ortiz | bob@example.com | East |
+----+------------+-------------------+--------+
sales_3nf
+---------+------------+-------------+
| sale_id | sale_date | customer_id |
+---------+------------+-------------+
| 1 | 2026-01-15 | 1 |
| 2 | 2026-01-16 | 2 |
| 3 | 2026-01-17 | 1 |
+---------+------------+-------------+
Now Alice's email and region are stored once. A customer moves from
West to East: one UPDATE. A product price changes: one UPDATE.
Is every determinant a superkey?
- customer_id → customer_name, customer_email, region: customer_id is the PK of customers_3nf. OK.
- product_id → product_name, unit_price: product_id is the PK of products_3nf. OK.
- sale_id → sale_date, customer_id: sale_id is the PK of sales_3nf. OK.
- {sale_id, product_id} → quantity: that is the PK of sale_items_3nf. OK.
Every non-trivial FD has a superkey determinant. The schema is in BCNF.
For the BI dashboard that runs "sales by region by month," joining four tables on every query is wasteful. Build a gold-layer star schema on top:
CREATE TABLE dim_customer (
cust_key SERIAL PRIMARY KEY,
customer_id INT NOT NULL,
customer_name VARCHAR(100) NOT NULL,
region VARCHAR(50)
);
CREATE TABLE dim_product (
prod_key SERIAL PRIMARY KEY,
product_id VARCHAR(10) NOT NULL,
product_name VARCHAR(100) NOT NULL
);
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- YYYYMMDD
full_date DATE NOT NULL,
year INT, month INT, quarter INT
);
CREATE TABLE fact_sales (
sale_id BIGSERIAL PRIMARY KEY,
cust_key INT NOT NULL REFERENCES dim_customer(cust_key),
prod_key INT NOT NULL REFERENCES dim_product(prod_key),
date_key INT NOT NULL REFERENCES dim_date(date_key),
quantity INT NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL,
revenue DECIMAL(12, 2) GENERATED ALWAYS AS (quantity * unit_price) STORED
);
OLTP stays in BCNF. OLAP stays in star. ELT pipes data from one to the other on a schedule. Each world is normalized (or denormalized) for its own access pattern.
No tool replaces human reasoning about functional dependencies — FDs are business rules, and business rules come from people. But tools help validate and maintain a normalized schema. Chief among them is the query planner: execution-plan output (e.g., PostgreSQL's EXPLAIN (ANALYZE, BUFFERS)) reveals whether denormalization is paying off or whether a well-placed index makes the normalized form fast enough.
Every join has a cost. A transactional workload that decomposes its schema into 20 tables when 6 would do is paying for joins on every user request. If a 3NF decomposition produces tables that are always joined together in every query, consider whether the decomposition is serving a real business constraint or an academic one.
The opposite mistake: assuming "it's OLAP, it's denormalized, anything goes." Even star schemas have rules. Conformed dimensions must be consistent across fact tables. A customer attribute duplicated across three fact tables that drift apart is the same update anomaly that 3NF was designed to prevent. Dimensional modeling is a different kind of normalization, not a lack of it.
1NF says values are atomic from the perspective of the data model. A JSON column holding a self-contained document that the database treats as a single opaque value is defensible. A JSON column that applications routinely pick apart to filter and join on is a repeating group by another name — move those fields to proper columns or child tables.
An FD is a statement about every possible instance of the relation, not just the rows you have today. Deriving FDs by inspecting current data will miss real constraints that sample data happens to satisfy by accident, and will infer fake constraints that the business never promised. FDs come from domain conversation, not from SELECT COUNT(DISTINCT ...).
The normal forms are diagnostic tools, not a certification. A schema that passes every form on paper but is painful to query or evolve is not a success. The goal is a schema that accurately models the business and supports the workload. Normalization is a method, not the destination.
When decomposing to BCNF, it is possible to end up with a schema where some original FD cannot be enforced by any single-table key. The FD must then be enforced by triggers or application code, which is error-prone. If dependency preservation matters more than strict anomaly-freedom, stop at 3NF.
Normalization is Codd's prescription for storing each fact in exactly one place. Functional dependencies express which attributes determine which others; normal forms translate those dependencies into progressively stricter structural rules. The practical workflow is: gather the FDs from the business, find the candidate keys via closures, decompose to 3NF (or to BCNF when anomaly-freedom outweighs dependency preservation), and denormalize only deliberately, where a measured workload justifies it.
The normalized OLTP store and the denormalized OLAP mart are not competitors — they are two sides of the same well-designed data architecture.