Researchers have developed a mathematical tool that reduces the amount of space needed to store data in data centers.
The Microsoft Research team worked alongside members of the Windows Azure Storage group to develop the tool that reduces the amount of space stored data requires and so slashes the cost of storing that data.
The work started with the conundrum posed by the desire of customers storing data in the cloud to be sure that their data is safe. The easiest way to provide integrity for data is to duplicate it, with the classic three full copies typically used to keep the data safe even in the event of server failures. The overhead of having three copies is obviously high in storage costs. The Microsoft team looked at the alternative of coding the data to create a shortened description that could be reassembled and delivered to a user.
Coding data to remove duplicates is hardly ground-breaking, but the Microsoft team have taken an innovative approach. At the moment Windows Azure Storage condenses stored data with a technique called “lazy erasure coding” that happens in the background. When a data chunk (called an extent) is opened and filled, three copies of it are stored. When it is sealed and the data center has spare capacity, the erasure coding is launched in the background.
The coding splits the extent into equally sized data fragments that are coded to generate a number of parity fragments. Each of the data fragments and parity fragments are stored in a different physical unit. The choice of location is made so that the failure of any single module in a data center such as a power unit, switch, computer, or disk will affect only one data or parity fragment. Once the data is erasure-coded and all data and parity fragments are distributed, all three original copies can be deleted. The effect of the coding on the performance of the data center is minimized by the work being carried out in the background.
One method used for the erasure-coding is called Reed-Solomon coding. This was used in the U.S. space program to reduce communications errors, and has also been used in coding on compact discs. Reed-Solomon code is described in terms of the number of data and parity fragments, so a 6+3 Reed Solomon code, converts three copies of data to nine fragments—six data and three parity, each a sixth the size of the original data. This results in halving the data footprint, so saving half the servers.
The downside of coding is the fact it takes time for servers to reassemble data from code, particularly if data has to be reconstructed from fragments if hardware failures occur. The goal of the new approach is to reduce the time and cost of performing data retrieval, especially during hardware failures. The team also wanted to improve the data compression, and while they considered more radical Reed-Solomon configurations, they decided to work on a new approach called Local Reconstruction Codes (LRCs). This enables data to be reconstructed more quickly than with Reed-Solomon codes, because only half the number of data fragments have to be read to re-create the original data in the majority of the failure patterns.
The “local” in the coding technique’s name refers to the concept that, in the event of a fragment being offline because of an event such as a server failure, the code needed to reconstruct data is available rather than being spread across the data center’s servers.
LRCs is good on data durability - a data chunk can have three failures and still be rebuilt with 100 percent accuracy, and is faster on reconstructions. It also uses less space than Reed-Solomon, providing a 14 percent reduction.
The work by the researchers was awarded the Best Paper award during the 2012 USENIX Annual Technical Conference for Erasure Coding in Windows Azure Storage. Looking outside the data center, the team says one use for the technique would be in flash appliances, devices made by combining several flash-memory drives. LRC could improve the efficiency of flash appliances while cleaning up old or unused data.