Page 1 of 4
Classic data structures produce classic tutorials. In this edition of Babbage's Bag we investigate the advanced ecology of trees - perfectly balanced trees, AVL trees and B-Trees.
Trees and indexes
The tree is one of the most powerful of the advanced data structures and it often pops up in even more advanced subjects such as AI and compiler design. Surprisingly though the tree is important in a much more basic application - namely the keeping of an efficient index.
Whenever you use a database there is a 99% chance that an index is involved somewhere. The simplest type of index is a sorted listing of the key field. This provides a fast lookup because you can use a binary search to locate any item without having to look at each one in turn.
The trouble with a simple ordered list only becomes apparent once you start adding new items and have to keep the list sorted - it can be done reasonably efficiently but it takes some advanced juggling. A more important defect in these days of networking and multi-user systems is related to the file locking properties of such an index. Basically if you want to share a linear index and allow more than one user to update it then you have to lock the entire index during each update. In other words a linear index isn't easy to share and this is where trees come in - I suppose you could say that trees are shareable.
A tree is a data structure consisting of nodes organised as a hierarchy - see Figure 1.
Figure 1: Some tree jargon
There is some obvious jargon that relates to trees and some not so obvious both are summarised in the glossary and selected examples are shown in Figure 1.
I will try to avoid overly academic definitions or descriptions in what follows but if you need a quick definition of any term then look it up in the glossary.
A worthwhile simplification is to consider only binary trees. A binary tree is one in which each node has at most two descendants - a node can have just one but it can't have more than two.
Clearly each node in a binary tree can have a left and/or a right descendant. The importance of a binary tree is that it can create a data structure that mimics a "yes/no" decision making process.
For example, if you construct a binary tree to store numeric values such that each left sub-tree contains larger values and each right sub-tree contains smaller values then it is easy to search the tree for any particular value. The algorithm is simply a tree search equivalent of a binary search:
start at the root
REPEAT until you reach a terminal node
IF value at the node = search value
IF value at node < search value
THEN move to left descendant
ELSE move to right descendant
Of course if the loop terminates because it reaches a terminal node then the search value isn't in the tree, but the fine detail only obscures the basic principles.
The next question is how the shape of the tree affects the efficiency of the search. We all have a tendency to imagine complete binary trees like the one in Figure 2a and in this case it isn't difficult to see that in the worst case a search would have to go down the to the full depth of the tree. If you are happy with maths you will know that if the tree in Figure 2a contains n items then its depth is log2 n and so at best a tree search is as fast as a binary search.
Figure 2a: The "perfect" binary tree .
The worst possible performance is produced by a tree like that in Figure 2b. In this case all of the items are lined up on a single branch making a tree with a depth of n. The worst case search of such a tree would take n compares which is the same as searching an unsorted linear list.
So depending on the shape of the tree search efficiency varies from a binary search of a sorted list to a linear search of an unsorted list. Clearly if it is going to be worth using a tree we have to ensure that it is going to be closer in shape to the tree in Figure 2a than that in 2b.
Figure 2b: This may be an extreme binary tree but it still IS a binary tree