There is another old software saying which goes “efficiency is mostly about hardware”.
While it is true that choosing the wrong algorithm can make a task take longer than the lifetime of the universe, efficiency isn’t the main aim of software.
Software design is about sophistication and reliability. First it should do the job well and only then should we worry about how long the job takes.
Of course in the real world we do have to worry about how long the job takes, but this should be treated as a separate, orthogonal design factor.
For example, what do we get if we adopt the seemingly inefficient global/local solution?
The first thing is sophistication at little extra cost. If the document moves, the garbage collection condition follows it.
Consider how you would arrange for the database to be updated to reflect a change in a document’s location. How does the database even “know” that a document has been moved unless there is a contract with every other piece of software in the system to keep the database up-to-date? This really is horrible and very fragile.
Equally the user can query and change the condition without the need for access to a central database. If the user has access to the file they have access to the garbage collection condition.
What really matters here is that the file is the object that the user regards as the document. For example, if the user deletes the file and creates a new file with the same name – does it have the same garbage collection condition?
Clearly, if the condition is stored in the document’s metadata it is automatically destroyed along with the file – if it isn’t, then the metadata mechanism is incorrectly implemented. In the same way the metadata automatically follows the document as it is copied, moved and generally manipulated in ways that the user finds easy to understand.
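On Linux, one way to sketch this arrangement is with extended attributes, which the filesystem stores alongside the file’s data. The attribute name user.gc_condition and the helper functions below are illustrative assumptions, not part of any existing tool, and whether attributes survive a copy depends on the tool doing the copying:

```python
import os

GC_ATTR = b"user.gc_condition"  # hypothetical attribute name

def set_gc_condition(path: str, condition: str) -> None:
    # Store the condition in the file's own extended attributes
    # (Linux-only os.setxattr), so it is moved and deleted along
    # with the file itself - no central database involved.
    os.setxattr(path, GC_ATTR, condition.encode("utf-8"))

def get_gc_condition(path: str) -> str | None:
    # Anyone with access to the file can read the condition back.
    try:
        return os.getxattr(path, GC_ATTR).decode("utf-8")
    except OSError:
        return None  # no condition set
```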
Compare this to a reference to the file stored in a separate database, possibly not even on the same machine as the file. There is no such “natural” association between the file reference and its garbage collection condition, and the reference certainly doesn’t automatically track any changes to the file.
Now consider how much complexity you have to add to the database mechanism to make it track the state of the file.
If you want to reproduce the close relationship between file and metadata you can’t even use a relaxed “fix the reference when it fails” pattern to keep the two in sync. If you store the references in the database and then wait for them to fail, i.e. generate a file not found error when you try to garbage-collect the file, you can’t simply search for the file because you haven’t tracked its history. It might now represent an entirely different document.
As I said earlier, this approach is truly horrible!
At this point, you should be inventing lots of ways of keeping the object and the reference in sync without sacrificing efficiency – but all the ways that you can think of are essentially brittle.
For example, if you use a “file watcher” pattern and ask the OS to pass each file change to your application you can attempt to keep the two in sync – but think of all the things that can go wrong!
If your “watcher” task fails to load, or fails to interact with the OS correctly, you have an almost unrecoverable loss of sync – you cannot reconstruct the lost history of the document if there is any interruption of the update mechanism.
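To make the fragility concrete, here is a minimal sketch of such a watcher using the third-party watchdog package. The PathDatabase class is a hypothetical stand-in for the central reference database, and any move that happens while the watcher isn’t running is simply lost:

```python
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class PathDatabase:
    """Hypothetical stand-in for the central reference database."""
    def __init__(self):
        self.conditions = {}  # path -> garbage collection condition

    def update_path(self, old: str, new: str) -> None:
        if old in self.conditions:
            self.conditions[new] = self.conditions.pop(old)

class MoveTracker(FileSystemEventHandler):
    def __init__(self, db: PathDatabase):
        self.db = db

    def on_moved(self, event):
        # Re-key the database entry when the OS reports a move.
        # If this process isn't running when the move happens,
        # the database silently falls out of sync.
        self.db.update_path(event.src_path, event.dest_path)

db = PathDatabase()
observer = Observer()
observer.schedule(MoveTracker(db), "/documents", recursive=True)
observer.start()
```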
Also consider what happens if there is some sort of glitch in the storage of the data. In the metadata case the problem might affect a document or two but in the case of the database a logical inconsistency could destroy all of the information. The central approach has one very big single point of failure – the database.
In short, the metadata forms a distributed database, complete with all the advantages and disadvantages that this implies. However, it’s a good choice of architecture if you can solve the efficiency problems.
But can you solve them?
The spreadsheet paradigm
Currently it has to be admitted that implementing distributed local action in the sense described above is a real problem.
To discover exactly what sort of challenge is posed by a distributed implementation it is a good idea to conduct a thought experiment where hardware is in abundance.
What sort of hardware would you need to make a distributed architecture really work?
You might be thinking in terms of ultra-fast processors and ultra-fast disk drives. All of these help, but there is another way. Imagine that all of the files are represented by people – one person per file. Now imagine that all of the people are gathered together on a football pitch and you simply ask everyone to put up their hand if their file’s “condition” evaluates to true – and then to leave the pitch.
This is an example of the “spreadsheet” paradigm, perhaps the simplest approach to parallel processing ever invented. One processor is allocated to each item of data and each processor has a, usually simple, set of instructions to obey.
It is also an object-oriented approach in the sense that data and process are tied together. In an ideal computing environment all data would be encapsulated within an object that provided its first level of processing.
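As a minimal sketch of the idea, each document can be wrapped in an object that carries its own condition, and the “everyone raise your hand at once” step simulated with a parallel map. The Document class and its holds() method are purely illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Document:
    """One 'person on the pitch': data plus its own test."""
    path: str
    condition: str   # e.g. "age_days > 365"
    age_days: int

    def holds(self) -> bool:
        # Each object evaluates its own condition - the first level
        # of processing lives with the data. (eval keeps the sketch
        # short; a real system would parse the condition safely.)
        return bool(eval(self.condition, {}, {"age_days": self.age_days}))

docs = [Document("/docs/a.txt", "age_days > 365", 400),
        Document("/docs/b.txt", "age_days > 365", 10)]

# Ask every document "at the same time" whether its time is up.
with ThreadPoolExecutor() as pool:
    hands_up = [d.path for d, up in zip(docs, pool.map(Document.holds, docs)) if up]

print(hands_up)  # -> ['/docs/a.txt']
```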
In an even more ideal world every object would not only have its own thread of execution but its own processor taking care of that thread. There are some cases where this is indeed a possible architecture.
Of course in practice the spreadsheet usually isn’t implemented as a single processor per data item; instead whatever processing power is available is used to simulate the parallelism. This said, it is worth mentioning that machines have been built that implement the spreadsheet approach to parallel processing with thousands of processors.
But back to the real world - in practice the importance of the spreadsheet paradigm is that it provides us with good ways of thinking about distributed processing. The point is that each of the files “knows” if its time is up but we have to use serial methods to simulate the ideal parallel implementation.
In this case things are simple enough not to pose any real problems.
The best solution is to use a lazy approach to garbage collection and run a file scanner in processor-idle time. This is just a simulated implementation of the spreadsheet algorithm visiting each data cell in turn and computing the result – but in the background. Of course as you allocate more and more threads to the process the simulation becomes increasingly parallel.
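A sketch of such a lazy scanner, reusing the xattr-stored condition from the earlier sketch and using a deliberately crude sleep to stay out of the way of foreground work; the condition evaluator is again a hypothetical one that only understands the file’s age:

```python
import os
import time

def should_collect(path: str, condition: str) -> bool:
    # Hypothetical evaluator: the only variable a condition may
    # use here is the file's age in days.
    age_days = (time.time() - os.path.getmtime(path)) / 86400
    return bool(eval(condition, {}, {"age_days": age_days}))

def gc_scan(root: str, idle_delay: float = 0.1) -> None:
    # Visit each "cell" (file) in turn, in the background - a
    # serial simulation of the parallel spreadsheet algorithm.
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                condition = os.getxattr(path, b"user.gc_condition").decode()
            except OSError:
                continue  # no condition set on this file
            if should_collect(path, condition):
                os.remove(path)
            time.sleep(idle_delay)  # crude way of yielding to real work
```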
The result is not a sophisticated solution but it’s a fairly standard one for distributed systems where the outcome isn’t time critical. In this case it doesn’t matter exactly when a document is garbage collected as long as it happens sometime.
The same paradigm applies to web crawlers, disk defragmenters, indexing software and so on. Only when the outcome is time critical or the results could depend on the outcome of multiple data items – such as a seat booking database or solving a mathematical problem – do we have to confront the challenge of real-time distributed systems and true parallel processing.
What can you do when a lazy implementation isn't good enough?
In this case you can still use the spreadsheet paradigm, but now you need a central database that is in charge not of what happens but of when it happens and what it happens to. You need to implement the activation record pattern.
In the case of document garbage collection you would keep a central list of files that have relevant active metadata associated with them, and this would be used to dispatch an agent to deal with the task of checking each file.
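A minimal sketch of the pattern, under the same assumptions as before: a central registry holds only the paths that currently have an active condition, and a dispatcher agent works through that list instead of scanning everything. The names and the re-queueing policy are illustrative, not an established API:

```python
import os
import queue
import threading
import time

# Central "activation record": only files with an active
# condition are registered, so nothing else is ever visited.
active_files: "queue.Queue[str]" = queue.Queue()

def register(path: str) -> None:
    # Called whenever a condition is attached to a document.
    active_files.put(path)

def dispatcher() -> None:
    # Agent loop: take the next activated file and check it,
    # rather than visiting every file "in case".
    while True:
        path = active_files.get()
        try:
            condition = os.getxattr(path, b"user.gc_condition").decode()
        except OSError:
            continue  # file gone or condition removed - deactivate
        if should_collect(path, condition):  # evaluator from the earlier sketch
            os.remove(path)
        else:
            active_files.put(path)  # still active; re-queue for a later pass
            time.sleep(0.1)         # don't spin when nothing is ready

threading.Thread(target=dispatcher, daemon=True).start()
```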
This is, of course, how spreadsheets actually do the job. They store a central list of active cells to be recomputed rather than visiting every single cell in case it needs to be updated.