The big issue in architecture is usually the choice between global and local implementation. However, this is so all-pervasive a choice that we tend to miss that there is more to the idea than you might think.
“Think global, act local” may be a bit overused, or not used enough, depending on your point of view, but it encapsulates one of the biggest issues in software architecture – do you go for a local or global architecture?
At this point you may be a little vague as to what local or global architecture is, but don’t worry – everyone is in the same position.
Most of the time you can tell which direction a decision moves your design – more local or more global – but the point at which it deserves the exact label “local” or “global” remains ill-defined.
A good way to think about it is to reserve the term local for things that don't have a full view of the overall situation and global for things that have a commanding overview. There is also the point that there are many local views and only one global view.
Let’s try to be a little more concrete.
As projects become more complicated, there are increasingly many decision points where you can select between a local and a global design.
We can ignore any simple programs that are self-contained and just take in some data, do some sums and terminate. After all, such programs are so simple that they don’t really need a grand design - they hardly need a design at all!
The sort of projects that are interesting are the ones that work with multiple, possibly complex, data sources interacting with multiple entities and presenting fresh data to yet more entities.
To be even more concrete let’s consider a simple case study.
Collecting the garbage
Suppose you need to implement a document garbage collection scheme.
Each document can be either ready to be deleted or not and the condition that determines it can either change with time or be fixed.
For example, some documents might be marked for deletion by the user; others might need to satisfy a complex condition such as five new versions exist and the document is six months old.
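The two kinds of condition described above can be sketched as simple predicates. This is only an illustration: the `DocState` class, its field names and the 182-day approximation of "six months" are assumptions, not part of any particular system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DocState:
    """Hypothetical per-document state; the field names are illustrative."""
    marked_by_user: bool = False
    newer_versions: int = 0
    created: datetime = datetime(2000, 1, 1)


def user_marked(doc: DocState, now: datetime) -> bool:
    # Fixed condition: it only changes if the user changes the flag.
    return doc.marked_by_user


def superseded_and_old(doc: DocState, now: datetime) -> bool:
    # Time-dependent condition: at least five newer versions exist and the
    # document is at least six months old (approximated here as 182 days).
    return doc.newer_versions >= 5 and now - doc.created >= timedelta(days=182)
```

Both predicates take the current time, so the same document can give a different answer on a later run without anything about it having been edited.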
Think about how you might implement the garbage collector.
I hope you can see at once that your first major architectural decision is one of those critical decision points that select for either a local or global design. This is a critical decision because once made, even if made by accident, the decision is very difficult to un-make. Unless you explore all of the possibilities before you commit to a design, you could well end up with a global or a local implementation simply because it was the first one you thought of, and what you have to do to implement your solution depends very much on which it is.
In this sense the decision determines the overall architecture and the rest comes down to decoration and furnishings!
I hope you noticed, before you started to refine your design, that you really do have a choice.
- You can use a central database to record the details of the documents.
- You can record the details along with the documents as metadata.
At the current state of technology it is worth observing that local versus global often does reduce to database or central "silo" versus distributed metadata.
To be clear, you could set up a database that records the file’s location and an indicator of its state or you could store the state information about the file as part of its metadata, i.e. store the state either in or “alongside” the file in some way. Exactly how this is done is important because the association between the metadata and file should make them effectively a single entity. This is essential because otherwise the metadata and its object could get separated. For example, what happens to the metadata when a file is copied?
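As a rough sketch of the two options, the following stores the same state either in a central SQLite table or in a "sidecar" file next to the document. The table name `gc_state`, the `.gcmeta` suffix and all function names are assumptions made for illustration. A sidecar file is also only one possible way of storing state "alongside" a file.

```python
import json
import sqlite3
from pathlib import Path


def record_in_database(db: sqlite3.Connection, path: Path, state: dict) -> None:
    # Global option: one central table keyed by the file's location.
    db.execute("CREATE TABLE IF NOT EXISTS gc_state (path TEXT PRIMARY KEY, state TEXT)")
    db.execute("INSERT OR REPLACE INTO gc_state VALUES (?, ?)",
               (str(path), json.dumps(state)))


def sidecar_for(path: Path) -> Path:
    # Assumed naming convention: "report.txt" -> "report.txt.gcmeta".
    return path.with_name(path.name + ".gcmeta")


def record_as_metadata(path: Path, state: dict) -> None:
    # Local option: the state lives next to the file it describes.
    sidecar_for(path).write_text(json.dumps(state))


def read_metadata(path: Path) -> dict:
    return json.loads(sidecar_for(path).read_text())
```

Note that a sidecar file is a fairly weak association: an ordinary copy of the document leaves the sidecar behind, which is exactly the separation problem raised above.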
Interestingly, the Windows NT file system, NTFS, had a hard time implementing metadata in such a way that copying the file to a FAT file system would preserve the data. This is the reason Microsoft had to invent the idea of multiple files associated with a single file name: Alternate Data Streams (ADS). It worked as long as you stuck with the NTFS file system, but it never really caught on as the best way to implement metadata.
Think And Act!
What is local and what is global does depend on where you draw the system boundaries, and this is one of the reasons why the problem doesn’t arise if you consider only small projects. For example, in the storing of the state of a single file the global/local issue hardly arises - you simply get on with the job and do it.
In this case, however, the example serves to highlight a slightly different point and a better usage of the terminology.
You really shouldn’t think of a solution as being just global or local as applied to the static data.
You should, returning to the well worn expression, always think global and, wherever possible, act local.
The key words here are "think" and "act".
You need to implement a database of global garbage collection status, i.e. you want to delete all of the files that are no longer required in order to, say, reduce total storage or perhaps the time taken to back up, but you should store the data locally in each file rather than collect it together into a single database.
This is a good example of global thinking and local action.
In this sense the metadata solution is a local action implementing a global objective and the database is a global action doing the same job.
The internet is another good example of this principle in that the entire system is controlled for global aims by implementing local actions.
Pros and Cons
Let's examine the particular example of database versus metadata.
So what are the advantages and disadvantages of each approach?
The database storage solution has the apparently huge advantage that it is efficient. You don’t have to search for the documents that are potential candidates for garbage collection.
All you have to do is scan the list of documents, evaluate their associated conditions and, if appropriate, delete the document.
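A minimal sketch of that scan, assuming a hypothetical `gc_state` table that maps each file path to a JSON-encoded state, and a caller-supplied `should_delete` predicate:

```python
import json
import sqlite3
from pathlib import Path
from typing import Callable


def collect_from_database(db: sqlite3.Connection,
                          should_delete: Callable[[dict], bool]) -> list[str]:
    """Delete every listed file whose stored state satisfies should_delete."""
    deleted = []
    # Only the rows in the table are visited, never the file system itself.
    rows = db.execute("SELECT path, state FROM gc_state").fetchall()
    for path, state in rows:
        if should_delete(json.loads(state)):
            Path(path).unlink(missing_ok=True)  # the file may already be gone
            db.execute("DELETE FROM gc_state WHERE path = ?", (path,))
            deleted.append(path)
    return deleted
```

The work done is proportional to the number of tracked documents, not to the size of the file system.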
Compare this to the global-thinking/local-action approach. In this case you have to search the entire file system to locate documents that have garbage collection conditions in their metadata, evaluate the conditions and delete the document if appropriate. As each delete status indicator is stored along with the documents, the only way of finding out which documents need to be deleted is to scan through all eligible documents.
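For comparison, here is a sketch of the metadata version of the scan. It assumes the state is held in a JSON sidecar file with a hypothetical `.gcmeta` suffix. A real system might use extended attributes or an embedded header instead.

```python
import json
import os
from pathlib import Path
from typing import Callable

SUFFIX = ".gcmeta"  # assumed sidecar naming convention


def collect_from_metadata(root: Path,
                          should_delete: Callable[[dict], bool]) -> list[Path]:
    """Walk the whole tree, deleting documents whose sidecar says so."""
    deleted = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(SUFFIX):
                continue  # not metadata, nothing to evaluate
            sidecar = Path(folder) / name
            document = sidecar.with_name(name[:-len(SUFFIX)])
            if should_delete(json.loads(sidecar.read_text())):
                document.unlink(missing_ok=True)
                sidecar.unlink()  # the metadata is removed with its document
                deleted.append(document)
    return deleted
```

Every directory under `root` is visited whether or not it contains anything to collect, which is the cost being described.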
This sounds terrible!
The central database sounds like a much more practical idea.
Efficiency And Sophistication
There is another old software saying which goes
“efficiency is mostly about hardware”.
While it is true that choosing the wrong algorithm can make a task take a time longer than the lifetime of the universe, efficiency isn’t the main aim of software.
Software design is about sophistication and reliability. First it should do the job well and only then should we worry about how long the job takes.
Of course, in the real world we do have to worry about how long the job takes, but this should be treated as a separate, orthogonal, design factor - that is, you should be aware of which aspects of your chosen architecture are there because of efficiency.
For example, what do we get if we adopt the seemingly inefficient metadata solution?
The first thing is sophistication at little extra cost. If the document moves the garbage collection condition follows it.
Consider how you would arrange for the database to be updated to reflect the change in a document’s location.
How does the database even "know" that a document has been moved unless there is a contract with every other piece of software in the system to keep the database up-to-date?
Now this option starts to look horrible and very fragile.
Equally the user can query and change the condition without the need for access to a central database. If the user has access to the file they have access to the garbage collection condition.
What really matters here is that the file is the object that the user regards as the document. For example, if the user deletes the file and creates a new file with the same name – does it have the same garbage collection condition?
Clearly if the condition is stored in the document’s metadata it is automatically destroyed along with the file – if it isn’t then the metadata mechanism is incorrectly implemented. In the same way the metadata automatically follows the document as it is copied, moved and generally manipulated in ways that the user finds easy to understand.
In short a property of an object should be stored along with the object.
Compare this to the reference to the file stored in a separate database, possibly not even on the same machine as the file. There is no such “natural” association between the file reference and its garbage collection condition, and the reference certainly doesn’t automatically track any changes in the file.
Now consider how much complexity you have to add to the database mechanism to make it track the state of the file.
If you want to reproduce the close relationship between file and metadata you can’t even use a relaxed “fix the reference when it fails” pattern to keep the two in sync. If you store the references in the database and then wait for them to fail, i.e. generate a file not found error when you try to garbage-collect the file, you can’t simply search for the file because you haven’t tracked its history. It might now represent an entirely different document.
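The failure mode can be shown in a few lines. In this sketch a plain dictionary keyed by path stands in for the central database. The file names and the `check_reference` helper are purely illustrative.

```python
import tempfile
from pathlib import Path


def check_reference(index: dict[str, dict], path: Path) -> str:
    """Classify a path-keyed record; the index knows nothing of file history."""
    if str(path) not in index:
        return "untracked"
    return "resolves" if path.exists() else "dangling"


with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "plan.txt"
    old.write_text("original document")
    index = {str(old): {"delete": True}}  # stand-in for the central database

    moved = Path(tmp) / "plan-2023.txt"
    old.rename(moved)  # moved behind the database's back
    after_move = check_reference(index, old)
    moved_status = check_reference(index, moved)

    old.write_text("an unrelated new document")  # same name, different document
    after_reuse = check_reference(index, old)

print(after_move, moved_status, after_reuse)  # dangling untracked resolves
```

After the move the record dangles and the real document is untracked. Once the name is reused the record "resolves" again, but to a document that the delete flag was never meant for.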
At this point, you should be inventing lots of ways of keeping the object and the reference in sync without sacrificing efficiency – but all the ways that you can think of are essentially brittle.
For example, if you use a “file watcher” pattern and ask the OS to pass each file change to your application you can attempt to keep in sync – but think of all the things that can go wrong!
If your “watcher” task fails to load or to interact with the OS correctly, you have an almost unrecoverable loss of sync – you cannot reconstruct the lost history of the document if there is any interruption of the update mechanism.
Also consider what happens if there is some sort of glitch in the storage of the data. In the metadata case the problem might affect a document or two, but in the case of the database a logical inconsistency could destroy all of the information. The central approach has one very big single point of failure - the database and its commitment to stay up-to-date and reflect the state of the entire system.
In short, the metadata forms a distributed database complete with all the advantages and disadvantages of the same. However, it’s a good choice of architecture if you can solve the efficiency problems.
But can you solve them?