Storage Mapping Functions - a lost art?
Back in the bad old days, when languages were primitive and programmers had to punch cards, most knew about the academic-sounding Storage Mapping Functions, or SMFs. In these days of data abstraction and encapsulation, SMFs are less well known but just as useful.
A storage mapping function is a function that gives the location in memory where something is stored.
For example, an array of two byte integers can be implemented using the simple SMF:
location = base + 2*index
where base is the starting address of the array.
In general a one-dimensional array composed of elements that each occupy n bytes can be implemented using the SMF:
location = base + n*index
Obviously, to find the location of a(i), assuming that you have decided to call the array a, all you have to do is work out base + n*i and then retrieve or store the value associated with a(i) there.
SMFs for one-dimensional arrays are a bit too simple to be interesting but at least they make good examples.
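As a concrete sketch, here is the one-dimensional SMF at work in Python (the article doesn't prescribe a language, and the names are mine). A bytearray stands in for old-fashioned linear memory and two-byte integers are packed into it at the addresses the SMF gives:

```python
import struct

# Treat a bytearray as a chunk of "raw memory" and use the SMF
# location = base + n*index to place two-byte integers in it.
memory = bytearray(64)   # our pretend linear memory
base = 10                # starting address of the array
n = 2                    # each element occupies two bytes

def store(index, value):
    location = base + n * index          # the storage mapping function
    struct.pack_into("<h", memory, location, value)

def fetch(index):
    location = base + n * index
    return struct.unpack_from("<h", memory, location)[0]

store(0, 123)
store(3, -42)
print(fetch(0), fetch(3))   # -> 123 -42
```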
The same principle applies to two-dimensional arrays. A two-dimensional array a(i,j) of n byte elements can be implemented using the SMF:
location = base + n*i + m*n*j
where m is the size of the first dimension, i.e. i runs from 0 to m-1. (It is also assumed that the array indices all start at 0; if not then simply change i to i-1, j to j-1 and so on.)
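The two-dimensional SMF is easy to try out. In this sketch (my own names, with n = 1 so one list element stands in for each stored value) m is taken to be the range of the index i, which is what makes distinct (i, j) pairs land in distinct locations:

```python
# Map a two-dimensional array a(i, j) onto linear storage using
# location = base + n*i + m*n*j, with i running from 0 to m-1.
base, n, m = 0, 1, 4          # n = 1 element per value, i takes values 0..3
storage = [None] * 32         # pretend linear memory

def location(i, j):
    return base + n * i + m * n * j    # the storage mapping function

storage[location(2, 1)] = "value of a(2,1)"
print(location(2, 1))   # -> 6: the four elements with j=0, then i=2 in the j=1 run
```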
Key mapping and trees
The two-dimensional array SMF is beginning to look a little more interesting, but it is still nothing more than an application of the idea of using a function f(key) to locate the data corresponding to a key.
In the case of the one-dimensional array the key is simply the index i and the location of a(i) is given by f(i) and in the two-dimensional case the keys are i and j and the location of a(i,j) is given by f(i,j). Higher dimensional arrays can be treated in the same way.
At this point my guess is that many a high-level language programmer will be thinking: how could an SMF be at all useful to me?
After all, high-level languages keep programmers well away from real memory locations. The answer is that while most languages provide simple data structures such as arrays, they often don't provide the rarer data structures such as trees. The trick is to use an array as if it was a chunk of old-fashioned linear memory and a suitable SMF to map the new data structure onto the array.
Well, let's try a simple binary tree implementation. A binary tree is a tree structure in which each node, apart from the final or terminal nodes, has exactly two children - hence the binary in binary tree. If each node is associated with a value that takes n bytes to store then a suitable SMF is:
location = base + n*i + n*(2^j-1)
which gives the location of the value stored at the ith node on the jth level.
To see why this works consider the following small binary tree:
(j numbers the levels, i numbers the nodes within each level)

level 0          0
               /   \
level 1      0       1
            / \     / \
level 2    0   1   2   3

and so on.
Notice that going from one level to the next doubles the number of nodes and you should be able to work out that the number of nodes at level j is simply 2^j.
With this insight you should now also be able to see how the SMF works. The tree is stored as lines of nodes one level at a time. That is:
location   0 | 1   2 | 3   4   5   6
node i     0 | 0   1 | 0   1   2   3
level j    0 |   1   |       2
To make use of this SMF in a high-level language all you have to do is declare an array of elements that will hold the node values and use the SMF with n = 1, i.e. one array element per node, to determine which element node i at level j is stored in.
For example, if the array is a(N) then node i at level j is stored in a(i+2^j-1).
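The tree-in-an-array idea looks like this in Python (my own names; the child-numbering rule in the comment follows from the level-by-level layout, since node i on one level sits above nodes 2i and 2i+1 on the next):

```python
# Store a binary tree in a flat list using the SMF index = i + 2**j - 1,
# which places node i of level j after the 2**j - 1 nodes of earlier levels.
tree = [None] * 7   # enough room for levels 0, 1 and 2

def node_index(i, j):
    return i + 2**j - 1

def children(i, j):
    # node i at level j has children 2*i and 2*i + 1 at level j + 1
    return node_index(2 * i, j + 1), node_index(2 * i + 1, j + 1)

tree[node_index(0, 0)] = "root"
tree[node_index(1, 1)] = "right child of root"
tree[node_index(2, 2)] = "third node on level 2"
print(children(0, 0))   # -> (1, 2): where the root's two children live
```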
You can see the general idea. If you have a data structure which has a regular pattern then you need to find a function that maps the regular pattern to a linear sequence of storage locations. Basically you need a function which enumerates the elements of the data structure one after another. Inventing a storage mapping function can be tricky but it's always possible - unless the data structure can't be enumerated.
A practical application
If binary trees sound too unlikely to gain your interest in SMFs then perhaps it is worth pointing out their use with disk files. For example, suppose you need a 2D array bigger than RAM will allow. Then, as long as you have a fast hard disk available, what could be easier than implementing a virtual array using a random access file?
Simply set up a random access file such that each record can hold exactly one value and then, when you want to access the value of a(i,j), seek record number i + m*j, where m is the size of the first dimension, i.e. the range of i.
You can probably work out for yourself various ways of making the process more efficient - for example by defining a record large enough to hold a complete row of the array etc. The same principles apply to creating any virtual data structure on disk - store each element of the data structure in a record and use an SMF to find which record corresponds to each element.
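The disk scheme above can be sketched in Python (my own names and file layout; one eight-byte floating point value per record, with m again taken as the range of the index i):

```python
import os
import struct
import tempfile

# A "virtual" 2D array kept in a random access file. The record number
# is given by the same SMF as before: record = i + m*j.
m = 100                            # range of the first index i
RECORD = struct.calcsize("<d")     # 8 bytes: one double per record

path = os.path.join(tempfile.mkdtemp(), "varray.bin")
f = open(path, "w+b")

def put(i, j, value):
    f.seek((i + m * j) * RECORD)   # the SMF gives the record number
    f.write(struct.pack("<d", value))

def get(i, j):
    f.seek((i + m * j) * RECORD)
    return struct.unpack("<d", f.read(RECORD))[0]

put(5, 2, 3.14)
print(get(5, 2))   # -> 3.14
```

A real version would, as the article suggests, read and write a whole row per record to cut down on seeks.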
If you are still unimpressed by the idea of an SMF then perhaps my last example will please you. The whole idea of an SMF can be generalised to include a function that maps some elements into the same storage location.
This may seem like a crazy idea but you might have come across it before under the names "scatter storage" or "hash functions". Whatever you call it, it has to be one of the nicest ideas in the whole of computing. The principle of an SMF is that given one data value, the key, you can find the location of an associated data value using f(key).
All of the SMFs we have looked at so far have been very regular. They all make use of the layout of the data structure, but this doesn't have to be the case. Suppose you can find a function, any old function, f(key) that gives you a location for all possible values of the key and in most, but not all, cases gives you different locations for different keys - then why not use it?
For example, suppose you want to store words in an array. You could use the SMF given by adding together the ASCII codes of the first two letters minus 128. For example, f(CAT) would be 67+65-128 (ASCII codes of C and A minus 128) or 4. This means that you could store CAT in location 4. In the same way f(DOG) is 68+79-128 = 19 and so DOG would be stored in location 19. The only problem with this scheme is that as only the first two letters are used f(CAR) is the same as f(CAT) and we would attempt to store the two at the same place.
This is called a collision and different scatter storage or hashing schemes deal with the problem in different ways. The easiest thing to do is to check to see if the location given by f(key) has been used and if it has check f(key)+1, f(key)+2 and so on until you find a free location. There are lots of variations on collision management but this linear search is simple and fairly efficient.
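Here is the toy scheme with linear probing in Python (my code; I have added a wraparound so that probing never runs off the end of the table, which the article leaves unstated):

```python
# The article's toy hash: ASCII codes of the first two letters minus 128,
# with linear probing (f(key)+1, f(key)+2, ...) to resolve collisions.
SIZE = 64
table = [None] * SIZE

def f(key):
    return (ord(key[0]) + ord(key[1]) - 128) % SIZE

def insert(key):
    loc = f(key)
    while table[loc] is not None:   # collision: try the next slot
        loc = (loc + 1) % SIZE
    table[loc] = key

def find(key):
    loc = f(key)
    while table[loc] is not None:   # an empty slot means the key is absent
        if table[loc] == key:
            return loc
        loc = (loc + 1) % SIZE
    return None

insert("CAT")        # f = 67 + 65 - 128 = 4
insert("DOG")        # f = 68 + 79 - 128 = 19
insert("CAR")        # collides with CAT at 4, lands in 5
print(find("CAR"))   # -> 5
```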
If you don't see the point of hashing functions then try the problem of storing and subsequently finding names in an array. Without hash functions you either have to perform a sequential search of the array or sort the array and perform a binary search. The former is inefficient and the latter more complex, with the overhead of a complete sort. A hashing function gives you the location of any word in one evaluation and even if there is a collision you should find the word after a short linear search.
There is a lot more to be said about hash functions but the main thing is that you see them as nothing more than slightly odd SMFs.
Finally, I can't resist pointing out that the index to a database is nothing more than an SMF implemented as a lookup table! In other words, if you want to find the record containing the data corresponding to "key" you first look "key" up in the index which gives the record number. The only realistic alternative to using an index is to use a hash function and this is just an SMF implemented without the use of a lookup table.
So it's not a question of whether or not to use an SMF to find records in a database ... just how you implement it! So it looks as if Storage Mapping Functions are alive and well after all. Perhaps the sub-title of this article should have been the "hidden" rather than "lost" art.