Page 2 of 2
The original purpose of computing a hash was to find a way to store and retrieve data, but hashes have many other uses. A hash serves as a label as well as an address for an item of data.
For example, suppose I send you a file of important data and I also tell you that
is 1234 where 1234 is standing in for a very large number in practice. Now suppose you received the file and stored it as data'. Then if you compute the same hash function:
and discover that it is 2342 you know that the data has been changed after it was sent.
So a hash can be used to detect changes in the data. Now consider for a moment what you can conclude if the hash of the data had worked out to 1234?
You can only conclude that there is a high probability that nothing has changed in the data. The reason is that the hash value acts as a digest of the data and more than one set of data will give you the same digest - i.e. there will be collisions. So getting the same digest simply means that the data is OK apart from any changes that might give you the same digest value. The key issue here is what is the probability that changing the data is going to result in the same digest value and the answer is usually that it is very small.
In other words, you are very likely to detect any change to the data.
This is the reason you often see an MD5 hash value quoted for downloads. MD5 is a cryptographic, i.e. very high quality, hash function which is used to compute a digest which you can use to check for errors.
Interestingly MD5 is not a good cryptographic hash function. There are ways of finding collisions fairly quickly and this could be used to make changes that leave the digest unchanged.
With a little modification, the idea of a digest hash can be made secure so that both the data and the digest can be securely sent to someone, who can in turn use the digest to verify the data and who the data was sent by. This is the basis of a digital signature.
The Password Mechanism
The fact that a hash can be used to label data is also the basis of most password systems in use. The basic idea is that the user invents a password and the system stores a hash of the password. When the user logs on the system applies the hash function to the password that the user has supplied and if it matches the stored hash then the user is allowed to proceed.
Notice that we have the usual problem with collisions - if the user gives the wrong password that just happens to give the same hash then they get in anyway.
The big advantage of this scheme is that even if an attacker gets the hash value they cannot use it to log in. They need the password that generates the hash and this is difficult to find. The reason is that while it is easy to generate the hash from a password, it is hard to generate a password from the hash. In other words, hash(password) is easy to compute, but hash-1(value) is hard.
Just because it is hard doesn't mean it is impossible and so called password crackers use big computers and GPUs to compute the inverse.
Proof Of Work
One of the more unusual uses of the hash idea is in Bitcoin's proof of work algorithm. In this case a block of data is published and before a server can claim to have validated the transaction it has to find an item of data that when added to the block gives a hash with a given number of zeros. As computing the inverse hash function isn't easy the only way to do this is to add trial data and compute the hash until you find the result with the required number of zeros. This takes time and means that there is a high probability that just one server will find the answer first - and so avoid the problem of multiple servers validating the block.
A nice touch it that the algorithm anticipated the fact that hardware would get better. It includes feedback by measuring the average time taken to solve the hash problem and adds additional zeros to the requirement. You can prove that the time taken to find some data to add to the block to give n zeros in the hash goes up exponentially with n. In other words the difficulty of the problem can be varied to keep the time to solve about the same.
Once the problem is solved and the block declared valid the data is appended to the block so that the data it contains cannot be changed from that point on without invalidating the hash making the transaction tamper proof.
Hash Functions In Algorithms
There are also lots of uses for hash functions to speed up algorithms. For example, if you need to weed out duplicate records in a set of size N you might end up comparing every possible pair, i.e. roughly N2/2 comparisons, or better sorting the table. An alternative is to compute hash values for all the records and store pointers to each record in a hash table. Duplicate records are then just collisions in the table.
You can also use hash functions with additional properties to implement matching algorithms. If you can invent a hash function that maps data that you consider to be similar to hash values that are close, you can use hashing to search not just for duplicates, i.e. identical values, but values that are close. This approach is important in string matching, acoustic matching and even pattern recognition.
There are lots of very clever algorithms that make use of the quick and easy way you can check for inequality between data. The basic principle is always that if two objects have different hash values then they can't be the same thing.
Make sure you check out the Bloom filter for example.
The Bloom Filter
The Invertible Bloom Filter
Storage Mapping Functions
Inside Bitcoin - virtual currency
Assemblers and assembly language