Codd and his Rules |

Written by Mike James | |||

Thursday, 05 October 2017 | |||

Page 1 of 2 Theories of how we should organize databases are thin on the ground. The one exception is the work of E.F. Codd, the originator of the commandment-like “Codd’s Rules”. This approach to database has been codified into SQL - Structured Query Language - and so into most of the databases on the planet, despite what the NoSQL movement might want you to think. So what are Codd's Rules and what is a relational database? Database is the strangest of computer applications. - It is vital - every computer manages some sort of database,
- it is difficult - real world database quickly exceeds any reasonable complexity
and - it is lucrative - more programmers are occupied with database work than any other.
Strangely, though, in the early days theories of how we should organize databases were very thin on the ground. For such an important subject it was been relatively neglected. It is almost as if the subject was so obvious that it didn't need a theory as such so most people simply got on with it and ignored the need for organizing principles. The one exception is perhaps the work of E.F. Codd. Immortalized to a generation or two of database creators, he is revered as the originator of the commandment-like “Codd’s Rules”.
Codd was a mathematician and this is an important fact that explains much about his work. The theory of relational databases is one that has its roots in a very abstract mathematical idea. He was born in the UK and studied math at Oxford before moving to the USA in 1949 to do some work at the University of Tennessee. Before very long he was seduced by the new subject of computing and was working for IBM as a mathematician and programmer. The first machine that he worked on was the SSEC - one of the early valve machines. He moved on to help with the logic design of the IBM 701, one of the first modern computers, and then the 702. At this point (1953) he took a four-year leave and went to study for his PhD in Canada, at the University of Michigan. Codd’s work at Michigan was on a relatively new area, Cellular Automata. A cellular automaton is a collection of relatively simple computing units, all of which run the same program and interact with their neighbors on a rectangular grid. If you know Conway's game of Life then you know at least one cellular automata. This was highly theoretical stuff at the time but it was connected with areas of AI and Artificial Life that are now very important. His thesis was on self reproduction of cellular automata. Later the work was published as a book, “Cellular Automata”, in 1968. Many theoretical computer scientists know this book but often never suspect that the Codd who wrote it is the same one who did the eminently practical database work! ## Database relationshipsHowever Codd’s main work began in 1969 when he became interested in database software. Being a mathematician he saw database in a very different way to the average programmer. A programmer tends to see the physical way that data is coded and stored. He sees records composed of fields of different types and the whole lot is stored as a file with some type of structure. The operations that are performed on a database are seen as a general programming problem. To work with a database you write a program that reads the file and works with it in any way that the user/customer considers necessary. If you need to access two records from different parts of the file at that same time it is up to the programmer to get inside and dig around the file in anyway they like. This is a recipe for complexity to get out of hand! You can well imagine that the database file could be come altered in ways that are incorrect - even to the extent of it not being a valid database file any longer! What Codd did was to see that the simplest example of a database was nothing more than a table of values. The rows of the table are records and the columns fields and the table itself can be implemented as a file. More recently we have data stored in spreadsheets as a simple table. This change in viewpoint is, however, more than just a relabeling of the parts. A table, or set of multiple values in mathematical terms, is known as a “relationship”. For example, suppose I ask you to tabulate the relationship x<y for the integers 1 to 5. What you end up with is a table of pairs - or in general “tuples”: x y Any relationship, not just the nice neat regular ones like =, <>, < etc., can be represented by a set of tuples and this is exactly the way that Codd saw a database table and hence the name of the new approach “relational database”. ## Relational not relatedIt is amusing to note that over the years lots of proponents of Codd’s methods have misunderstood the use of the term “relational”, thinking that it comes from the way separate tables are “related” in some way. If you know some math then the definition
will say it all. If you are not a mathematician then this definition is "dense" to say the least. However like many mathematical definitions it is easy to understand if you take it a step at a time. The Cartesian product, AXB of two sets A and B say is just the set of all possible pairs like (a,b) where a is taken from set A and b is taken from set b. A relation is a simply a subset of this complete set of pairs or tuples. So for example, if A={1,2,3} and B={1,2,3} then AXB is the set {(1,1),(1,2),(1,3),(2,1),(2,2),(2,3),(3,1),(3,2),(3,3)} Any subset of this set is a relation. For example: {(1,1),(2,2),(3,3)} is the relation a=b. We may not give names like less than or greater than to all of the subsets but they are no less relations to the mathematician. ## Relation -> Relational DatabaseOf course just calling a database a table or relation isn’t enough to change the way programmers work. What Codd did next is to work out a theory of how traditional database operations could be implemented using an algebra of sets rather than programs. The first thing is that becomes apparent is that, as a relation is a set, you can’t have duplicates. If you have a set A that contains an item then adding the item again doesn't make any difference - if it is already in the set then it is all ready in the set, there are no duplicates, This is already different from the way a programmer thinks of a database where you can quite easily have the same information - a user name say stored multiple times.. However the changes had to go both ways. The idea of a pure relation had to be modified to make it more suitable to describe a database. For example, to make relations more like real databases you need to add the idea of key fields, i.e. a column in the table that identifies each record uniquely - that is a given key value belongs to one and only one record. According to Codd a relational database consists of more than one relation or table and our familiar notion of “relating” tables comes from the idea that one table has a key field that is a field in another table - this is known as a “foreign key”. For example, Name might be a field in an invoice database and a key field in an address database. Records in each table are loosely speaking “connected” by the foreign key. ## Database OperationsAs well as modifying definitions it was also necessary to work out what operations are allowed on database tables. Although relations don't have special operations they are just sets and so database operations tend to follow the way set operations work. As an example of a database operation, consider the union of two tables. A union of two tables is just the table that results if you write out the rows of the second table after the rows of the first. Notice that you can form the union between two tables with different fields. Similarly the intersection between two tables is just the table that consists of the rows that they have in common. It is clear that these definitions are modelled on the same operation for sets but they are not quite the same. You can see that you could carry on like this and define operations that combine one table with another to give you a third. It is this way of thinking that changes the “For first record To last Do xxx” way of processing a database. Now we are thinking the non-procedural A+B=C and this is what the relational model and algebra is all about. <ASIN:0201141922> <ASIN:0596523068> <ASIN:1425122906> |
|||

Last Updated ( Thursday, 05 October 2017 ) |