Hash

From Organic Design wiki
Revision as of 18:26, 10 August 2014 by Nad (talk | contribs) (intro)
Glossary.svg This page describes a concept which is part of our glossary

Hashing is the name given to a function that chops up, scrambles and compresses or expands its argument until it is unrecognisable. Encryption performs the same function as hashing, with hashing one gets a fixed number of random bits, where as encryption produces random bits that are the same length as the input data. The same data will always produce the same hash, and different data a different hash.

In an ideal hashing function there is no predictable correspondence between the initial data and the output data such as two similar inputs resulting in closer outputs, which means that a hash of a piece of data can act as a kind if finger print that identifies it specifically from all other data, and also there is no way of obtaining the original data or anything about it from its hash. In a typical 32 byte hash, the chance of two different pieces of data having the same hash value (called a hash collision) is astronomically small, like picking atoms completely at random out of the entire universe and happening to pick the same one twice.

In reality though cryptographic weaknesses have been found in different hashing algorithms as available processing power increases which allow attackers to predict certain things about the original content from a hash value, or to deliberately create collisions. For this reason hash functions are continually improving, for example MD5 is now only used for checksums or non-critical applications, and the SHA group are used for more security-conscious applications.

Verification

Hashes allow data to be referred to by it's hash value which is much more economical on resources than transferring and storing the entire content. When a client needs the content they can request it and then hash it themselves to ensure they get the same result so they can be sure that it's exactly the same data as they requested. This is why download sites often supply the MD5 hash along with each downloadable item so the client can easily get the hash of their file locally and ensure it matches the one specified on the original site so they know that the file hasn't been corrupted or truncated in transit.

Hash lists and trees

In many systems, the content changes over time, and in this case the hash verification process can be a hash of not only the current state of the content, but also of the hash of the previous state, which in turn contains the hash of the state before, and so on. In this way it's impossible to modify any of the previous states without creating a totally different hash in the current state. The same applies to trees (called Merkle trees) where each node contains a hash of the child node's hashes so that any change of data of any node in the tree will cause different hash values to propagate up the tree all the way to the root node.

Proof of work

Hashing can be used to prove that a certain amount of computation has been done that corresponds to some piece of data which is called "proof of work". This is simply done by appending the data with an arbitrary number (called a "nonce") that satisfies a particular constraint on the resulting hash. For example the "Hashcash" system (described below) requires that an email message header appended with its nonce has twenty zeros at the start of its binary hash result. Since a hash with many zeros at the beginning is very rare, hundreds of millions different nonce values will need to be tried before finding such a hash result.

In this way a piece of data can be accompanied by a proof of work in the form of a nonce, so that a client that wishes to verify that this work has really been done can simply hash the data appended with the nonce value and check if the resulting hash value satisfies the criteria, such as having twenty leading zeros in the case of Hashcash, or being less than a certain numeric value determined by the current difficulty in a crypto-currency network.

Hashcash

Hashcash is a proof-of-work system designed to limit email spam and denial-of-service attacks. Hashcash was proposed in May 1997 by Adam Back. This was originally an idea for how to stop spam (or more generally denial-of-service attacks). The idea was that tokens could be made by requesters of a service which require significant processing power to make (as described above), and can be validated by the recipient to ensure that the work has indeed been done and it matches the headers of the email it's within. This would make spamming impractical because every message requires at least a second of work to process.

It takes an average of about one second for 1GHz hardware to find a hash with twenty leading zeros, but this criteria of twenty needs to increase over time to maker it harder to find a valid token because processing power always increases and becomes cheaper (Moore's law).

These valid proof-of-work hashes contain inherent value within them, because they've taken time to build and prove that they belong to (i.e. are owned by) the message header they've been constructed for. The Hashcash algorithm could be used in a more general context than email by allowing them to be reused and transferred between owners. A central provider could "mint" them and keep track of the current owner of each one.

In the Bitcoin network, the coins and who owns each (and all historical transfers) are held in a distributed database called the blockchain. The bitcoin owner is in the form of a bitcoin address rather than an email address. The bitcoin address is actually a public key and only the person with the corresponding private key has the abiltity to transfer it to another address.

Bitcoin also distributes the minting process by allowing anyone to do it, and has a difficulty level (how numerically low the hash result has to be to be considered valid) that auto-adjusts to ensure that the issuance rate matches a specific pre-determined curve no matter how many people are mining, or how fast the computer can process the hashes. The difficulty automatically adjusts to keep the average time it takes to find a valid block about ten minutes. The Bitcoin Whitepaper

MD5

In cryptography, MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value. As an Internet standard (RFC 1321), MD5 has been employed in a wide variety of security applications, and is also commonly used to check the integrity of files. An MD5 hash is typically a 32-character hexadecimal number. Recently, a number of projects have created MD5 "rainbow tables" which are easily accessible online, and can be used to reverse many MD5 strings into their original meanings.

MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function, MD4. In 1996, a flaw was found with the design of MD5; while it was not a clearly fatal weakness, cryptographers began to recommend using other algorithms, such as SHA-1. In 2004, more serious flaws were discovered making further use of the algorithm for security purposes questionable, but it's still very useful for simple data integrity purposes.

On any Unix-like OS you can simply use the following command

md5sum filename

Digest::MD5::Perl is a perl implementation of Ronald Rivests md5 hash algorithm. It is written in perl only and because of this it is slow but it works without C-Code. You should use Digest::MD5 instead of this module if it is available.

See also