Author: C.J. Date
Audience: SQL Developers
Reviewer: Joe Celko
This book on “Normal Forms and All That Jazz” has more good stuff in fewer pages than anything else you will find in the bookstore. This highly informative review is also a good read.
You ought to know who Chris Date is. Return with me now to those thrilling days of yesteryear (the late 1970's, early 1980's) when RDBMS has just been created by Dr. E. F. Codd. The problem was that Dr. Codd was a mathematician whose earlier work was with self-reproducing cellular automata. He wrote and thought like a mathematician, not a programmer. His notation was abstract and mathematical. He used standard set operators for Union, Intersect, Set Difference, membership and so forth. Projections (SELECT in SQL) was shown with a letter pi (π) with subscript parameters, the selection (FROM in SQL) was shown with a letter sigma (σ) with subscript parameters and he invented the butterfly or bow ties for joins. In short, nobody could read it unless they were a math major. We did a lot of work with this notation and if you like curling up with a glass of sherry and warm calculus book, the best mathematical book on RDBMS is still Theory of Relational Databases by David Maier, 1983, see side panel. ISBN: 978-0914894421).
But the real problem was not that the early papers were academic. When the first SQL products came out, RDBMS was like pre-teen sex. Everyone claimed that they knew what it was and that they were good at it. Yeah. Right. Chris Date and Dr. Codd formed a consultancy to educate the world. Dr. Codd was the brains and the big name; Chris Date was the “Great Explainer” who wrote magazine articles and gave lectures. People could understand Chris Date! His INTRODUCTION TO DATABASES was a standard college textbook in the early days of RDBMS. His collections of columns in DBMS and DATABASE PROGRAMMING & DESIGN should be part of any RDBMS library.
Date has since written a lot of books on databases for many different publishers. But even today, his sample database of Suppliers who provide Parts for Jobs is referred to by the name “SPJ” in the literature. Chris Date does not like SQL and writes his books in a language called Tutorial D. This language is directly based on Relational Algebra and Relational Calculus. You can find more about it at http://www.dcs.warwick.ac.uk/~hugh/TTM/CHAP05.pdf. But you do not need to know this language to read the book; the code used is obvious and can quickly map into SQL when an equivalent SQL code is not given.
If you think you already know enough about Normal Forms and all that jazz, then you are wrong. I was. Date is back to being “The Great Explainer” again. He has a running example thru the whole book that uses simple, small data base schemas. He does not assume you are a math major, but just a working programmer who needs to be grounded in real work to get to the theory.
The chapters come in pairs; the first is an informal look at the topic, the second is more formal explanation. For example: Chapter 4 is “FDS and BCNF (Informal)” then Chapter 5 is “FDs and BCNF (Formal)” and that pattern continues in the rest of the book. When you get bogged down in the formal stuff, go back to the previous section. Each chapter ends with exercises; do not worry, there are answers in the back of the book. Chris has written too many textbooks, and a lot of his exercises are really discussion questions.
Keys are discussed in detail and he demolishes the surrogate key concept. Then he spends all of chapter 8 debunking the myths of denormalization. Yes, people still do it after all this time.
Do you know the two purposes Normalization serves? I think of it as a method (not the only one) to reduce redundancy in a schema. But it can also correct a bad design. These are two distinct problems, but people get them confused. I am now thinking of which normal form I need from the hierarchy before I design a schema instead of a clean up after the first design.
Seeing the current Normal Form Hierarchy made me feel like I had not kept up with my reading. The few SQL programmers that even know what a Normal Form is at all, think that it is “First Normal Form (1NF) is flat files”, they have no idea about Second Normal Form (2NF), think that “Third Normal Form (3NF) is when I declare some column as PRIMARY KEY, and I am done” and they have no idea about other normal forms at all.
Date gives a list of the following nine Normal Forms. Then he shows how something can be in one Normal Form and is by implication in all the lesser Normal Forms, but not in any of the higher Normal Forms.
- 1NF = First Normal Form
- 2NF = Second Normal Form
- 3NF = Third Normal Form
- BCNF = Boyce-Codd Normal Form
- 4NF = Fourth Normal Form
- RFNF = Redundancy Free Normal Form
- SKNF = Superkey Normal Form
- 5NF = Fifth Normal Form
- 6NF = Sixth Normal Form
But this is not all of them! We also have Elementary Key Normal Form (EKNF), Complete Key Normal Form (CKNF) and Domain Key Normal form (DKNF) as well.
You get a simple introduction to Functional Dependencies (FD) and Multi-Valued Dependencies (MVD) and the algebra that goes with them. This how we can safely get rid of redundant tables in a schema and know with mathematical certainty we have not lost data.
There are also a lot of discussion of practical considerations. Let me throw out one example for your to play with, re-written in SQL:
CREATE TABLE Payments
(cust_nbr INTEGER NOT NULL
REFERENCES Customer_Accounts (cust_nbr),
payment_date DATE NOT NULL,
PRIMARY KEY (cust_nbr, payment_date),
payment_amt DECIMAL (10,2) NOT NULL
CHECK (payment_amt > 0.00))
CREATE TABLE Customer_Accounts
(cust_nbr INTEGER NOT NULL PRIMARY KEY,
payment_amt_tot DECIMAL (10,2) NOT NULL
CHECK (payment_amt_tot > 0.00));
The business rule is that account balance agrees with the sum of the payments:CREATE ASSERTION Correct_Balances
FROM (SELECT A.cust_nbr, A.payment_amt_tot, SUM(P.payment_amt) OVER (PARTITION BY P.cust_nbr)
FROM Customer_Accounts AS A, Payments AS P
WHERE P.cust_nbr = A.cust_nbr)
WHERE P_amt_tot <> A.payment_amt_tot));
The Customer_Accounts table is clearly redundant, since it is all derived data. There are four possible ways to handle this.
- Ignore the problem and do not write the constraint. You see this “solution” all too often. And then someone tries to fix it in the application layers of the system.
- Declare the constraint. This happens to be harder in T-SQL because we do not have a CREATE ASSERTION statement. You might try using Triggers, but what happens when someone monkeys with the Customer_Accounts directly without waiting for a trigger to fire in the Payments table?
- Create a VIEW. But the user has to know that he cannot touch the Customer_Accounts
- Create a snapshot. Now the user can touch the Customer_Accounts, but that this is dangerous for data integrity.
So, how would you fix it? And why?
Let me leave you with a more complicated problem. You have Shipments and Invoices that have this DDL and a one-to-one relationship:CREATE TABLE Invoices
(invoice_nbr INTEGER NOT NULL PRIMARY KEY,
shipment_nbr INTEGER NOT NULL UNIQUE -– foreign key
REFERENCES Shipments(shipment_nbr), Invoice_stuff VARCHAR(50) NOT NULL);
CREATE TABLE Shipments
(shipment_nbr INTEGER NOT NULL PRIMARY KEY,
invoice_nbr INTEGER NOT NULL UNIQUE -–
shipment_stuff VARCHAR(50) NOT NULL);
If you diagram this out, you will see a cycle. This is much like the old maxim about “You cannot get a job without experience, but you cannot get experience without a job” dilemma. The obvious work-around is to turn off the DRI constraints and do two insertion statements in a single transaction.
INSERT INTO Invoices (invoice_nbr,
VALUES (10, 100,..), (20, 200,..);
INSERT INTO Shipments (shipment_nbr,
VALUES (100, 20,..), (200, 10,..);
Oops! This puts the wrong shipments and invoices together. What if we declare (invoice_nbr, shipment_nbr) as compound key in both tables? Try it, and you get a many-to-many relationship. Maybe this ought to be one table, but why?
This book will be a slow read, but very useful. And do the exercises! And read the appendices! This has more good stuff in fewer pages than anything else you will find in the bookstore.
Author: James Forshaw
Publisher: No Starch Press
Audience: developers interested in security
Reviewer: Kay Ewbank Despite the rather worrying subtitle of 'a hacker's guide to capture, analysis, and exploitation', this is in fact a useful book.
Author: Russell Jurney
Audience: Data scientists
Reviewer: Kay Ewbank This practical book works through the tools and techniques for modern data analysis of structured and unstructured data, showing how to create data analysis applications using Spark and Kafka.