Redundancy is good
Moving on to more general situations, we quickly find that the solutions are difficult to generalise.
There are two distinct aspects to any global solution corresponding to time and space, or more accurately processing and data.
A global/local approach can involve parallel processing, but more likely simulated parallel processing. It also involves distributed data which, as already discussed, is often perceived as an inefficiency, and hence a problem, in the absence of true parallel processing.
It could be that this perception is causing us to miss some good approaches to well-known problems.
For example, currently we regard data redundancy as a serious problem. It wastes space and it risks inconsistencies. This is the rationale behind the normalisation of relational databases. Don’t store anything more than once. If you need something in more than a single logical place then use a reference not a copy.
The same argument is being played out at a higher level in the current vogue for “de-duping” software, which removes multiple copies of the same document fragment from large data stores. Reducing redundancy must be good because it reduces storage needs and enforces consistency, as a single copy cannot contradict itself.
However, redundancy has its good points. It’s the basis of all error-correcting codes, for example. Put simply, having multiple copies of data means that you still have the data even if some of the copies are lost or damaged. Redundancy can also make data easier to process. Putting the data into a more amenable format can be worth doing even if it wastes storage.
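To make the error-correcting point concrete, here is a minimal sketch of the simplest redundant code there is, a repetition code with majority voting. The function names and the choice of three copies are assumptions made for illustration; practical codes are far more storage-efficient, but the principle is the same.

```python
def encode(bits, copies=3):
    """Store every bit `copies` times (a simple repetition code)."""
    return [b for b in bits for _ in range(copies)]


def decode(coded, copies=3):
    """Recover each bit by majority vote over its group of copies."""
    recovered = []
    for i in range(0, len(coded), copies):
        group = coded[i:i + copies]
        recovered.append(1 if sum(group) * 2 > len(group) else 0)
    return recovered


message = [1, 0, 1, 1]
stored = encode(message)          # 12 bits used to hold 4 bits of data

damaged = stored.copy()
damaged[1] ^= 1                   # flip one copy of the first bit
damaged[9] ^= 1                   # flip one copy of the fourth bit

print(decode(damaged) == message)  # True: the damage is voted away
```

The cost is obvious, three times the storage, and so is the benefit: any single damaged copy within a group is silently repaired.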
Many procedures in artificial intelligence make use of distributed coding, i.e. spreading the information over more bits than strictly necessary. This makes the detection and extraction of patterns easier.
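As a small illustration of spending more bits than strictly necessary, the sketch below compares a compact 2-bit code for four categories with a 4-bit one-hot code. One-hot encoding is an assumption chosen here because it is the simplest redundant code to show; it is not necessarily the scheme a particular AI procedure would use, and the colour names are made up.

```python
COLOURS = ["red", "green", "blue", "yellow"]


def compact_code(colour):
    """Minimal encoding: 2 bits are enough to distinguish 4 colours."""
    index = COLOURS.index(colour)
    return [(index >> 1) & 1, index & 1]


def one_hot(colour):
    """Redundant encoding: one bit per colour, 4 bits in total."""
    return [1 if c == colour else 0 for c in COLOURS]


observations = ["red", "blue", "blue", "yellow", "blue"]

# Summing each bit position of the one-hot codes gives a histogram directly.
one_hot_sums = [sum(col) for col in zip(*(one_hot(c) for c in observations))]
histogram = dict(zip(COLOURS, one_hot_sums))

# The same sum over the compact codes mixes the colours together.
compact_sums = [sum(col) for col in zip(*(compact_code(c) for c in observations))]

print(histogram)     # {'red': 1, 'green': 0, 'blue': 3, 'yellow': 1}
print(compact_sums)  # [4, 1]
```

With the redundant code each bit means something on its own, so the pattern "blue is the most common" falls out of a simple column sum. With the compact code the same information is present but has to be decoded before it can be seen.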
Certainly the need for storage efficiency often works against a sophisticated, flexible and distributed design.
SOA – is global/local?
Currently the most popular approaches to distributed architecture are SOA and web services. Essentially the “services” idea promises to distribute a system across servers in such a way that the solution is loosely coupled and can scale without a software redesign.
Services promise to end the “silo” mentality where data is piled high in a single, all-encompassing, but relatively inaccessible, database.
However, services can only provide a robust distributed system if there isn’t a single choke point – either due to data or due to process. The problem is that simply splitting a system up into services doesn’t necessarily provide a solution that puts enough thought into the global.
It is too easy to decompose a system into atomic services that look distributed but in fact simply concentrate the data and processing in one place. Imagine a garbage collection service consisting of a database behind a service interface. A client could register the collection status of a file simply by connecting. The garbage collector itself could connect to the service to discover if a file needed attention. It’s a nice design but … it suffers from the global database problem even if it looks like a client-server solution.
Implementing services encourages you to think in terms of provisioning the service with exactly what it needs to do its job. This in turn tends to emphasise the use of a central database that gathers in all the information, ready to be used as soon as a client requests service.
Consider for a moment how you might build a service that provides a client with garbage collection data.
Actually, once you have considered the proposal, it’s not that difficult. All you have to do is shift the lazy scanner algorithm to a service. Allow the scanner to work as and when appropriate, and allow it to build up a record of garbage collection conditions in its own private database. Now a client can query the service to discover the state of any file and can either go and update the file or perform the collection. The service database isn’t guaranteed to be up to date in any way. The client might find that a file isn’t where it is supposed to be, and there may be files ready for collection that are not yet listed, but it doesn’t matter in this case.
The real data is still managed at a local level along with each file, but the service provides a cached snapshot that allows clients to process most of the outstanding work.
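The idea can be sketched in a few lines. Everything here is hypothetical: the `LocalFile` record, the `ScannerService` class, the `budget` parameter and the in-memory dictionaries stand in for whatever file metadata and service interface a real system would use. What matters is the shape: the scanner fills a private snapshot a little at a time, and the client treats that snapshot as a hint and re-checks the local data before acting.

```python
from dataclasses import dataclass, field


@dataclass
class LocalFile:
    """Hypothetical file record; the authoritative status lives with the file."""
    name: str
    needs_collection: bool


@dataclass
class ScannerService:
    """Lazy scanner that keeps a possibly stale snapshot in a private cache."""
    store: dict                                   # name -> LocalFile (real data)
    cache: dict = field(default_factory=dict)     # name -> last status seen
    _pending: list = field(default_factory=list)  # names still to visit

    def scan_some(self, budget=2):
        """Visit at most `budget` files, whenever it is convenient to do so."""
        if not self._pending:
            # Revisit cached names too, so entries for vanished files get dropped.
            self._pending = list(dict.fromkeys([*self.store, *self.cache]))
        for _ in range(min(budget, len(self._pending))):
            name = self._pending.pop()
            f = self.store.get(name)
            if f is None:
                self.cache.pop(name, None)
            else:
                self.cache[name] = f.needs_collection

    def candidates(self):
        """Names the snapshot believes are ready for collection."""
        return [name for name, ready in self.cache.items() if ready]


def collect(store, service):
    """Client: use the snapshot as a hint, but trust only the local data."""
    collected = []
    for name in service.candidates():
        f = store.get(name)
        if f is not None and f.needs_collection:  # re-check at the file itself
            del store[name]
            collected.append(name)
    return collected


store = {n: LocalFile(n, ready)
         for n, ready in [("a", True), ("b", False), ("c", True)]}
service = ScannerService(store)

service.scan_some(budget=2)        # snapshot covers only part of the store
first = collect(store, service)    # "a" is ready but not yet listed

service.scan_some(budget=2)        # the scanner catches up later
second = collect(store, service)   # stale entry for "c" is harmlessly skipped

print(first, second)               # ['c'] ['a']
```

Notice that nothing goes wrong when the snapshot is incomplete or out of date. A file that is missed is simply picked up on a later pass, and a stale entry costs the client one wasted lookup.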
This is a difficult argument to sum up but if challenged to express it in ten words or fewer I would conclude:
Think global, act local, find ways of making it efficient.