The Spreadsheet/Reactive Paradigm
It has to be admitted that implementing distributed local action in the sense described above is a real challenge.
To discover exactly what sort of challenge is posed by a distributed implementation it is a good idea to conduct a thought experiment where hardware is in abundance.
What sort of hardware would you need to make a distributed architecture really work?
You might be thinking in terms of ultra-fast processors and ultra-fast disk drives.
All of these help but there is another way. Imagine that all of the files are represented by people – one person per file. Now imagine that all of the people are gathered together on a football pitch and you simply ask everyone to put up their hand if their file’s “condition” evaluates to true – and then please leave the pitch.
This is an example of the “spreadsheet” paradigm, perhaps the simplest approach to parallel processing ever invented. One processor is allocated to each item of data and each processor has a set of instructions, usually simple, to obey. It is an example of the “think global, act local” idea because the entire spreadsheet computes something but each cell only computes its local result.
It is also an object-oriented approach in the sense that data and process are tied together. In an ideal computing environment all data would be encapsulated within an object that provided its first level of processing.
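The football pitch experiment can be sketched in a few lines of Python. This is only an illustration: the FileCell class, the expired rule and its 30-day limit are hypothetical names and values chosen for the example, not part of any real system. Each object carries its own data and its own rule, and the global answer is just the collection of local answers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FileCell:
    """One 'person on the pitch': a file bundled with its own local rule."""
    name: str
    age_days: int
    condition: Callable[["FileCell"], bool]

    def hand_up(self) -> bool:
        # Each cell consults only its own data; no global knowledge is needed.
        return self.condition(self)


def expired(cell: FileCell) -> bool:
    # Hypothetical rule: a file's time is up after 30 days.
    return cell.age_days > 30


pitch = [
    FileCell("report.doc", 45, expired),
    FileCell("notes.txt", 3, expired),
    FileCell("draft.doc", 90, expired),
]

# The "global" computation is nothing more than every cell answering locally.
leaving = [cell.name for cell in pitch if cell.hand_up()]
remaining = [cell.name for cell in pitch if not cell.hand_up()]
print("leaving the pitch:", leaving)
print("staying:", remaining)
```

Here the list comprehension visits the cells one after another, which is exactly the serial simulation of a process that is parallel in principle.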
In an even more ideal world every object would not only have its own thread of execution but its own processor taking care of that thread. There are some cases where this is indeed a possible architecture.
Of course, in practice the spreadsheet usually isn’t implemented as a single processor per data item; instead, whatever processing power is available is used to simulate the parallelism. That said, it is worth mentioning that machines have been built that implement the spreadsheet approach to parallel processing with thousands of processors.
But back to the real world - in practice the importance of the spreadsheet paradigm is that it provides us with good ways of thinking about distributed processing. The point is that each of the files “knows” if its time is up but we have to use serial methods to simulate the ideal parallel implementation.
In this case things are simple enough not to pose any real problems.
The best solution is to use a lazy approach to document garbage collection and run a file scanner in processor-idle time. This is just a simulated implementation of the spreadsheet algorithm visiting each data cell in turn and computing the result – but in the background. Of course as you allocate more and more threads to the process the simulation becomes increasingly parallel.
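As a rough sketch of the scanner, assume each file's expiry time is available in a simple in-memory table (the expiry_by_file dictionary and the time_is_up rule below are hypothetical stand-ins for real file metadata). The scheduling in idle time is not modelled here; the point is only that the same scan can run with one worker or with several.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical per-file metadata: file name -> expiry time (seconds since epoch).
now = time.time()
expiry_by_file = {
    "a.doc": now - 10,    # already expired
    "b.doc": now + 3600,  # still live
    "c.doc": now - 1,     # already expired
}


def time_is_up(name: str) -> bool:
    """The local rule each file 'knows': has its expiry time passed?"""
    return expiry_by_file[name] <= now


def scan(names, workers=1):
    """Visit every file and return those ready for garbage collection.

    workers=1 is the serial simulation; larger values make the
    simulation increasingly parallel.
    """
    names = list(names)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input names.
        flags = list(pool.map(time_is_up, names))
    return [name for name, flag in zip(names, flags) if flag]


serial = scan(expiry_by_file, workers=1)
parallel = scan(expiry_by_file, workers=3)
print(serial, parallel)
```

Both calls give the same answer; only the degree of simulated parallelism changes.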
The result is not a sophisticated solution but it’s a fairly standard one for distributed systems where the outcome isn’t time critical. In this case it doesn’t matter exactly when a document is garbage collected as long as it happens sometime.
The same paradigm applies to web crawlers, disk defragmenters, indexing software and so on. Only when the outcome is time critical or the results could depend on the outcome of multiple data items - such as a seat booking database or solving a mathematical problem - do we have to confront the challenge of real-time distributed systems and true parallel processing.
What can you do when a lazy implementation isn't good enough?
In this case you can still use the spreadsheet paradigm, but you now need to implement a central database that is in charge not of what happens but of when it happens and what it happens to. You need to implement the activation record pattern.
In the case of the document garbage collection you would have a central list of files that had active relevant metadata associated with them and this would be used to dispatch an agent to deal with the task of checking each file.
This is, of course, how spreadsheets actually do the job. They store a central list of active cells to be computed rather than visiting every single cell in case it needs to be updated.
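A minimal sketch of the idea, assuming a hypothetical ActivationList class and a toy in-memory file table: the central list records only which files are worth visiting, while the decision about what to do stays with the agent looking at the file itself.

```python
class ActivationList:
    """Central record of *which* files need checking, not of what to do."""

    def __init__(self):
        self._active = set()

    def register(self, name):
        # Called when a file gains relevant metadata.
        self._active.add(name)

    def dispatch(self, agent):
        """Send the agent to each active file and return the names visited.

        A file is dropped from the list once the agent reports it is done.
        """
        visited = []
        for name in sorted(self._active):  # sorted() copies, so removal is safe
            visited.append(name)
            if agent(name):
                self._active.discard(name)
        return visited


# Toy file store (hypothetical): file name -> "has this file expired?"
files = {"a.doc": True, "b.doc": False, "c.doc": False, "d.doc": False}
deleted = []


def agent(name):
    """Check one file locally and delete it if its time is up."""
    if files.get(name):
        del files[name]
        deleted.append(name)
        return True
    return False


active = ActivationList()
active.register("a.doc")
active.register("b.doc")

first_pass = active.dispatch(agent)   # only the two registered files are visited
second_pass = active.dispatch(agent)  # a.doc is finished, b.doc is still active
print(first_pass, second_pass, deleted)
```

Files that were never registered ("c.doc" and "d.doc") are never visited, which is the saving over a scan of every file.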
Redundancy Is Good
Moving on to more general situations we quickly find it difficult to generalise the solutions.
There are two distinct aspects to any global solution corresponding to time and space, or more accurately processing and data.
A local approach can involve parallel processing but more likely simulated parallel processing. It also involves distributed data which, as already discussed, in the absence of true parallel processing is often perceived as an inefficiency and hence a problem.
It could be that this perception is causing us to miss some good approaches to well-known problems.
For example, currently we regard data redundancy as a serious problem. It wastes space and it risks inconsistencies. This is the rationale behind the normalisation of relational databases. Don’t store anything more than once. If you need something in more than a single logical place then use a reference not a copy.
The same argument is currently being played out at a higher level in the use of “de-duping” software that removes multiple copies of the same document fragment from large data stores. Reducing redundancy must be good because it reduces storage needs and enforces consistency – as one copy cannot contradict itself.
However, redundancy has its good points. It’s the basis of all error-correcting codes, for example. Put simply, having multiple copies of data means that you still have it even if you lose it repeatedly. Redundancy can also make data easier to process. Putting the data into a more amenable format can be worth doing even if it wastes storage.
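The simplest error-correcting code makes the point: store every bit three times and take a majority vote when reading it back. The sketch below is a plain repetition code chosen for illustration, with hypothetical function names; practical codes are far more storage-efficient but rest on the same use of redundancy.

```python
from collections import Counter


def encode(bits, copies=3):
    """Store each bit several times in a row."""
    return [bit for bit in bits for _ in range(copies)]


def decode(coded, copies=3):
    """Recover each bit by majority vote over its block of copies."""
    decoded = []
    for start in range(0, len(coded), copies):
        block = coded[start:start + copies]
        decoded.append(Counter(block).most_common(1)[0][0])
    return decoded


message = [1, 0, 1, 1]
sent = encode(message)

# Damage one copy in each of the first two blocks.
received = sent.copy()
received[1] ^= 1
received[5] ^= 1

recovered = decode(received)
print(recovered == message)
```

The stored data is three times larger than it needs to be, and that waste is exactly what allows one damaged copy per block to be repaired.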
Many procedures in artificial intelligence make use of distributed coding, i.e. spreading the information over more bits than strictly necessary. This makes the detection and extraction of patterns easier.
Certainly it is the case that the need for storage efficiency often works against a sophisticated, flexible and distributed design.
SOA – Is It Global/Local?
Popular approaches to distributed architecture are SOA and web services.
Essentially the “services” idea distributes a system across servers in such a way that the solution is loosely coupled and can scale without a software redesign.
However, services can only provide a robust distributed system if there isn’t a single choke point – either due to data or due to process. The problem is that simply splitting a system up into services doesn’t necessarily provide a solution that puts enough thought into the global.
It is too easy to decompose a system into atomic services that look distributed but are in fact simply concentrating the data and processing into one place.
Imagine a document garbage collection service consisting of a database behind a service interface. A client could register the collection status of a file simply by connecting. The garbage collector itself could connect to the service to discover if a file needed attention. It’s a nice design but … it suffers from the global database problem even if it looks like a client-server solution.
Implementing services encourages you to think in terms of provisioning the service with exactly what the service needs to do its job and this in turn tends to emphasise the use of a central database that gathers in all the information ready to be used as soon as a client requests service.
Consider for a moment how you might build a service that provides a client with garbage collection data.
Actually, once you have considered the proposal, it’s not that difficult.
All you have to do is shift the lazy scanner algorithm to a service. Allow the scanner to work as and when appropriate and allow it to build up a database of garbage collection conditions in its own private database. Now a client can query the service to discover the state of any file and can either go and update the file or perform the deletion. The service database isn’t guaranteed to be up-to-date in any way and the client might find that the file isn’t where it is supposed to be and there may be files ready for collection that are not yet listed – but it doesn’t matter in this case.
You might recognise this as being related to the "eventual consistency" property of some NoSQL databases.
The real data is still managed at a local level along with each file but the service provides a cache snapshot that allows clients to process most of the outstanding task.
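The arrangement might be sketched as follows. The CollectionService class, the collect function and the dictionary standing in for the file store are all hypothetical; the sketch assumes only what is described above, namely that the service keeps a private and possibly stale snapshot and that the client always confirms against the file itself before acting.

```python
class CollectionService:
    """Keeps a possibly stale snapshot of files that look ready for collection."""

    def __init__(self, store):
        self._store = store     # the real, locally managed data (toy dict here)
        self._snapshot = set()  # the service's own private database

    def scan(self):
        # The lazy scanner: runs as and when appropriate, never on demand.
        self._snapshot = {name for name, expired in self._store.items() if expired}

    def candidates(self):
        return sorted(self._snapshot)


def collect(store, service):
    """Client side: treat the snapshot as a hint and confirm with the file."""
    deleted = []
    for name in service.candidates():
        # The file may have moved or changed since the scan, so re-check it.
        if store.get(name):
            del store[name]
            deleted.append(name)
    return deleted


# Toy file store: file name -> "is this file ready for collection?"
store = {"a.doc": True, "b.doc": True, "c.doc": False}
service = CollectionService(store)
service.scan()

del store["b.doc"]     # listed by the service, but no longer where it was
store["c.doc"] = True  # ready for collection, but not yet listed

first = collect(store, service)   # stale snapshot: b.doc skipped, c.doc missed
service.scan()
second = collect(store, service)  # the next scan catches up
print(first, second)
```

The first pass handles most of the outstanding work and the next scan picks up the rest, which is all that is needed when the outcome isn't time critical.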
This is a difficult argument to sum up but if challenged to express it in ten words or fewer I would conclude:
Think global, act local, find ways of making it efficient.