Database Design & Relational Theory

Author: C.J. Date
Publisher: O'Reilly
Pages: 278
ISBN: 978-1449328016
Audience: SQL Developers
Rating: 5
Reviewer: Joe Celko

This book on “Normal Forms and All That Jazz” has more good stuff in fewer pages than anything else you will find in the bookstore. This highly informative review is also a good read.

You ought to know who Chris Date is. Return with me now to those thrilling days of yesteryear (the late 1970's, early 1980's) when RDBMS has just been created by Dr. E. F. Codd. The problem was that Dr. Codd was a mathematician whose earlier work was with self-reproducing cellular automata. He wrote and thought like a mathematician, not a programmer. His notation was abstract and mathematical. He used standard set operators for Union, Intersect, Set Difference, membership and so forth. Projections (SELECT in SQL) was shown with a letter pi (π) with subscript parameters, the selection (FROM in SQL) was shown with a letter sigma (σ) with subscript parameters and he invented the butterfly or bow ties for joins. In short, nobody could read it unless they were a math major. We did a lot of work with this notation and if you like curling up with a glass of sherry and warm calculus book, the best mathematical book on RDBMS is still Theory of Relational Databases by David Maier, 1983, see side panel. ISBN: 978-0914894421).

But the real problem was not that the early papers were academic. When the first SQL products came out, RDBMS was like pre-teen sex. Everyone claimed that they knew what it was and that they were good at it. Yeah. Right. Chris Date and Dr. Codd formed a consultancy to educate the world. Dr. Codd was the brains and the big name; Chris Date was the “Great Explainer” who wrote magazine articles and gave lectures. People could understand Chris Date! His INTRODUCTION TO DATABASES was a standard college textbook in the early days of RDBMS. His collections of columns in DBMS and DATABASE PROGRAMMING & DESIGN should be part of any RDBMS library.

Date has since written a lot of books on databases for many different publishers. But even today, his sample database of Suppliers who provide Parts for Jobs is referred to by the name “SPJ” in the literature. Chris Date does not like SQL and writes his books in a language called Tutorial D. This language is directly based on Relational Algebra and Relational Calculus. You can find more about it at http://www.dcs.warwick.ac.uk/~hugh/TTM/CHAP05.pdf. But you do not need to know this language to read the book; the code used is obvious and can quickly map into SQL when an equivalent SQL code is not given.

databasedesign

If you think you already know enough about Normal Forms and all that jazz, then you are wrong. I was. Date is back to being “The Great Explainer” again. He has a running example thru the whole book that uses simple, small data base schemas. He does not assume you are a math major, but just a working programmer who needs to be grounded in real work to get to the theory.

The chapters come in pairs; the first is an informal look at the topic, the second is more formal explanation. For example: Chapter 4 is “FDS and BCNF (Informal)” then Chapter 5 is “FDs and BCNF (Formal)” and that pattern continues in the rest of the book. When you get bogged down in the formal stuff, go back to the previous section. Each chapter ends with exercises; do not worry, there are answers in the back of the book. Chris has written too many textbooks, and a lot of his exercises are really discussion questions.

Keys are discussed in detail and he demolishes the surrogate key concept. Then he spends all of chapter 8 debunking the myths of denormalization. Yes, people still do it after all this time.

Do you know the two purposes Normalization serves? I think of it as a method (not the only one) to reduce redundancy in a schema. But it can also correct a bad design. These are two distinct problems, but people get them confused. I am now thinking of which normal form I need from the hierarchy before I design a schema instead of a clean up after the first design.

Seeing the current Normal Form Hierarchy made me feel like I had not kept up with my reading. The few SQL programmers that even know what a Normal Form is at all, think that it is “First Normal Form (1NF) is flat files”, they have no idea about Second Normal Form (2NF), think that “Third Normal Form (3NF) is when I declare some column as PRIMARY KEY, and I am done” and they have no idea about other normal forms at all.

Date gives a list of the following nine Normal Forms. Then he shows how something can be in one Normal Form and is by implication in all the lesser Normal Forms, but not in any of the higher Normal Forms.

1NF = First Normal Form
2NF = Second Normal Form
3NF = Third Normal Form
BCNF = Boyce-Codd Normal Form
4NF = Fourth Normal Form
RFNF = Redundancy Free Normal Form
SKNF = Superkey Normal Form
5NF = Fifth Normal Form
6NF = Sixth Normal Form

But this is not all of them! We also have Elementary Key Normal Form (EKNF), Complete Key Normal Form (CKNF) and Domain Key Normal form (DKNF) as well.

You get a simple introduction to Functional Dependencies (FD) and Multi-Valued Dependencies (MVD) and the algebra that goes with them. This how we can safely get rid of redundant tables in a schema and know with mathematical certainty we have not lost data.

There are also a lot of discussion of practical considerations. Let me throw out one example for your to play with, re-written in SQL:

CREATE TABLE Payments(cust_nbr INTEGER NOT NULL REFERENCES Customer_Accounts (cust_nbr), payment_date DATE NOT NULL, PRIMARY KEY (cust_nbr, payment_date), payment_amt DECIMAL (10,2) NOT NULL CHECK (payment_amt > 0.00))

CREATE TABLE Customer_Accounts(cust_nbr INTEGER NOT NULL PRIMARY KEY, payment_amt_tot DECIMAL (10,2) NOT NULL CHECK (payment_amt_tot > 0.00));

The business rule is that account balance agrees with the sum of the payments:

CREATE ASSERTION Correct_BalancesCHECK(NOT EXISTS(SELECT * FROM (SELECT A.cust_nbr, A.payment_amt_tot, SUM(P.payment_amt) OVER (PARTITION BY P.cust_nbr) AS P_amt_tot FROM Customer_Accounts AS A, Payments AS P WHERE P.cust_nbr = A.cust_nbr) WHERE P_amt_tot <> A.payment_amt_tot));

The Customer_Accounts table is clearly redundant, since it is all derived data. There are four possible ways to handle this.

Ignore the problem and do not write the constraint. You see this “solution” all too often. And then someone tries to fix it in the application layers of the system.
Declare the constraint. This happens to be harder in T-SQL because we do not have a CREATE ASSERTION statement. You might try using Triggers, but what happens when someone monkeys with the Customer_Accounts directly without waiting for a trigger to fire in the Payments table?
Create a VIEW. But the user has to know that he cannot touch the Customer_Accounts
Create a snapshot. Now the user can touch the Customer_Accounts, but that this is dangerous for data integrity.

So, how would you fix it? And why?

Let me leave you with a more complicated problem. You have Shipments and Invoices that have this DDL and a one-to-one relationship:

CREATE TABLE Invoices(invoice_nbr INTEGER NOT NULL PRIMARY KEY, shipment_nbr INTEGER NOT NULL UNIQUE -– foreign key REFERENCES Shipments(shipment_nbr), Invoice_stuff VARCHAR(50) NOT NULL);CREATE TABLE Shipments(shipment_nbr INTEGER NOT NULL PRIMARY KEY, invoice_nbr INTEGER NOT NULL UNIQUE -– foreign key REFERENCES Invoices(invoice_nbr), shipment_stuff VARCHAR(50) NOT NULL);

If you diagram this out, you will see a cycle. This is much like the old maxim about “You cannot get a job without experience, but you cannot get experience without a job” dilemma. The obvious work-around is to turn off the DRI constraints and do two insertion statements in a single transaction.

BEGIN ATOMICINSERT INTO Invoices (invoice_nbr, shipment_nbr, ..)VALUES (10, 100,..), (20, 200,..);INSERT INTO Shipments (shipment_nbr, invoice_nbr, ..)VALUES (100, 20,..), (200, 10,..);END;

Oops! This puts the wrong shipments and invoices together. What if we declare (invoice_nbr, shipment_nbr) as compound key in both tables? Try it, and you get a many-to-many relationship. Maybe this ought to be one table, but why?

This book will be a slow read, but very useful. And do the exercises! And read the appendices! This has more good stuff in fewer pages than anything else you will find in the bookstore.

Driving Value With Sprint Goals

Author: Maarten Dalmijn
Publisher: Addison-Wesley
Pages: 256
ISBN: 9780137381920
Print: 0137381921
Kindle:B0C7ZJR7N2
Audience: Scrum developers
Rating: 5
Reviewer: Kay Ewbank

Over the years I've read a lot of books about agile development and Scrum, and most concentrate on the methodology rather tha [ ... ]

+ Full Review

Machines Like Me

Author: Ian McEwan
Publisher: Vintage, 2019
Pages: 304
ISBN: 978-1529111255
Print: 1529111250
Kindle: B07HR6SGQ9
Audience: General
Rating: 4.5
Reviewer: Mike James
A novel about a synthetic human has become so much more relevant recently and guess what - it features Alan Turing.

+ Full Review

More Reviews

<ASIN:0914894420>

<ASIN:0321197844>

Last Updated ( Tuesday, 11 December 2012 )