Zip for the Genome - G-SQueeZ
Written by Mike James   
Monday, 19 July 2010

The G-SQueeZ algorithm is a new way of compressing genome data that could save a lot of money - but does it deserve to be patented?


The Zip compression algorithm is so ubiquitous that even dumb users understand the verb "to zip". It is as much as any algorithm inventor can hope for. Now we have the G-SQueeZ compression algorithm, aimed at making genomic sequencing data take up a lot fewer bytes.


It is claimed that the new algorithm can achieve compression rates of 80%, which is not at all surprising given the high degree of regularity, and hence redundancy, in the data - four bases fit into two bits each, against eight bits per character in a plain text file, and that alone is a 75% saving. What makes the method so suitable for the data is that it preserves the order of the records and allows access to portions of the data without decompressing the entire file.
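
How the random access works isn't spelled out in the freely available abstract, but the standard trick is blockwise compression with an offset index: compress the records in fixed-size chunks and remember where each chunk starts, so reading one record means decompressing one chunk rather than the whole file. A minimal sketch of the idea in Python, using zlib as a stand-in for the real coder and with an invented block size:

    import zlib

    BLOCK = 1000  # records per chunk - an illustrative choice, not from the paper

    def compress_indexed(records):
        """Compress a list of record strings chunk by chunk, keeping an
        index of where each compressed chunk starts in the output."""
        blob, offsets = b"", []
        for i in range(0, len(records), BLOCK):
            offsets.append(len(blob))
            chunk = "\n".join(records[i:i + BLOCK]).encode()
            blob += zlib.compress(chunk)
        return blob, offsets

    def fetch(blob, offsets, n):
        """Recover record n by decompressing only the chunk it lives in."""
        b = n // BLOCK
        end = offsets[b + 1] if b + 1 < len(offsets) else len(blob)
        lines = zlib.decompress(blob[offsets[b]:end]).decode().split("\n")
        return lines[n % BLOCK]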

The new algorithm is a direct application of Huffman encoding - a fundamental algorithm from information theory. A standard Huffman code is computed for each base plus its additional data - the higher the frequency of occurrence of the base+data combination, the shorter the code used to represent it. This is the pure Huffman coding procedure.
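
For anyone who hasn't met it, building a Huffman code takes only a few lines of Python. The sketch below treats each symbol as a (base, quality) pair, which is a guess at what "base plus additional data" means here - the details are behind the paywall:

    import heapq
    from collections import Counter

    def huffman_codes(symbols):
        """Map each distinct symbol to a bit string; the more frequent
        the symbol, the shorter its code."""
        freq = Counter(symbols)
        # Heap entries are (frequency, tiebreak, node); a leaf node is
        # the symbol itself, an internal node is a two-element list.
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, a = heapq.heappop(heap)   # merge the two rarest nodes
            f2, _, b = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, tiebreak, [a, b]))
            tiebreak += 1
        codes = {}
        def walk(node, prefix):
            if isinstance(node, list):       # internal node: recurse
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                            # leaf: assign the code
                codes[node] = prefix or "0"  # lone-symbol edge case
        walk(heap[0][2], "")
        return codes

    # A toy read: (base, quality score) pairs
    read = [("A", 30), ("C", 30), ("A", 30), ("T", 12), ("A", 30), ("G", 28)]
    print(huffman_codes(read))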

The data to be encoded is first scanned to create a frequency table, from which a Huffman coding tree is built. The data is then scanned a second time and encoded using the Huffman code. The result is written out, complete with a header giving the Huffman code dictionary. To most programmers this would seem to be a common-sense application of an existing algorithm.
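
In code the whole pipeline is equally unremarkable. Here is a sketch using huffman_codes from above - the header layout is invented for illustration, since G-SQueeZ's actual file format isn't public:

    import json

    def squeeze(symbols):
        """Two-pass encode: pass one builds the frequency table and the
        code (inside huffman_codes), pass two emits the bit stream."""
        codes = huffman_codes(symbols)
        bits = "".join(codes[s] for s in symbols)
        pad = (-len(bits)) % 8                  # round up to whole bytes
        bits += "0" * pad
        payload = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
        # Header: the code dictionary, which any decoder needs.
        header = json.dumps({"pad": pad,
                             "codes": {repr(k): v for k, v in codes.items()}})
        return header.encode() + b"\n" + payload

    packed = squeeze(read)
    print(len(packed), "bytes")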

If you are looking for the novel idea that makes this algorithm different, you probably aren't going to find it. This is a worthwhile implementation of the standard Huffman algorithm - but it isn't a new idea and it isn't a new algorithm.

If you want to use G-SQueeZ for academic or non-commercial purposes then fine, you can even have the source code, but if you want to investigate or use it from within a company that makes money, no matter how small, then you have to pay for the privilege. Which is a bit strange considering that the organization concerned, TGen, claims to be a non-profit biomedical research institute.

The creators are also applying for a patent on the algorithm. If they get it then presumably it will cover any compression method that is an implementation of Huffman coding - which is arguably all such methods.


What is worse is that if you want to read the details of the algorithm you will have to pay - typically $25 for a day's access to the journal in question, Bioinformatics.

Link to the paper on the Bioinformatics Journal

Of course, if you have an academic affiliation the chances are that you will be allowed to access the paper for "free" - which means your institution pays for you. Programmers often don't have an academic affiliation, and the combination of a patent and an academic paywall is oppressive in every sense.

The final thought has to be - what if Huffman had patented his much more groundbreaking and fundamental algorithm?

Further reading

Coding theory

Data compression the dictionary way

Claude Shannon

Information theory

 
