Hash

From Organic Design wiki
Revision as of 19:37, 2 December 2021 by Nad (talk | contribs) (See also: Verkle trees)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Glossary.svg This page describes a concept which is part of our glossary

Hash (not to be confused with hashtags or hashish which are other things entirely) is the name given to a function that chops up, scrambles and compresses or expands its argument until it is unrecognisable. Hashing is a very similar process to encryption, but with hashing one gets a fixed number of random binary bits (hashes are usually shown in base 16 rather than binary so they're a little easier to read) that cannot ever be returned to the original data again. For example hashing the sentence "the quick brown fox jumps over the lazy dog" with the popular SHA256 hash function yields the result of 1153a4080f1fcb04425aa0b841c2b14606fe6df25d9076d2a1face2d5af57129, but if we make just a tiny change to the input, such as using a capital "T" instead of lower-case, the result becomes c03905fcdab297513a620ec81ed46ca44ddb62d41cbbd83eb4a5a3592be26a69 - almost completely different, but always exactly the same size.

The same data should always produce the same hash, and different data a different hash, but because the hash value is usually very short compared to the data being hashed, it's possible for different data to yield the same hash value. This problem is called a hash collision. However, in a typical hash of over a hundred bits in length, the chance of a collision is astronomically small, like picking atoms completely at random out of the entire universe and happening to pick the same one twice.

In an ideal hashing function there is no predictable correspondence between the initial data and the output data such as two similar inputs resulting in closer outputs, which means that a hash of a piece of data can act as a kind of finger print that identifies it specifically from all other data, and there is no way of obtaining the original data or anything about it from its hash value.

In reality though cryptographic weaknesses have been found in different hashing algorithms as available processing power increases which allow attackers to predict certain things about the original content from a hash value, or to deliberately create collisions. For this reason hash functions are continually improving, for example the one very popular MD5 algorithm is now only used for checksums or non-critical applications, and now the SHA group are used for more security-conscious applications.

Hashing is used in practically every computing application and plays especially important roles in the domain of crypto-currencies which are described below. If you're learning about Bitcoin or other crypto-currencies but are not familiar with hashing, hash-based verification methods or proof-of-work systems, you should read this before starting the Bitcoin Whitepaper.

Verification

Hashes allow data to be referred to by it's hash value which is much more economical on resources than transferring and storing the entire content. When a client needs the content they can request it and then hash it themselves to ensure they get the same result so they can be sure that it's exactly the same data as they requested. This is why download sites usually supply a hash along with each downloadable item so the client can easily get the hash of their file locally and ensure it matches the one specified on the original site so they know that the file hasn't been corrupted or truncated in transit.

Passwords

Another very simple but useful way of using hashes is for storing user passwords, if somebody hacks the system and obtains all the user data, all they have is hash values which they can't obtain the original password from without trying every possible combination. Actually trying every possible combination isn't that hard if you have a simple password, which is why sites always recommend you use upper and lower case, numbers and symbols in your passwords. When a user logs in, the server simply hashes the password they enter and checks if the hash matches the stored hash, if it does the server can be sure the user has entered the same value. If a user forgets their password, they'll need to create a new one, because the server cannot tell them what their password is since it only has hash values stored.

Public-key cryptography

This is a very over-simplified explanation but is sufficient to yield a basic understanding of the mechanism. The idea with public-key cryptography is that a user holds a pair of keys, one is to be kept completely secret and backed up securely, the other is able to be shared with the public. The idea is that anyone who has the public key can use it to encrypt information that they know only the person with the corresponding private key can decrypt.

There are many different variations on this theme, such as PGP keys are used for encrypted email which means that people can publish their public PGP key allowing anyone to send encrypted email to them that only they can decrypt with their private PGP key. There are also RSA key-pairs which allow people to give access to services using a persons public RSA key knowing that only the holder of the corresponding private RSA key will be able to access. A bitcoin address is a public key that anyone can send coins to knowing that only the holder of the corresponding private bitcoin key can spend.

The way that public and private keys are related is very similar to how a piece of content is related to its hash - you can think of a public key as a hash of the corresponding private key. If you have the public key there's no way that you can find out anything at all about the private key, but if you have the private key you can easily create the corresponding public key.

Digital signatures

People can use keys and hashes to digitally sign content. The holder of the private key can create a unique signature that corresponds to the hash of some content such as a statement, and then anyone with that person's public key can verify that the signature was signed by the holder of the private key. All different types of keys, no matter what their main purpose offer this signing mechanism, because the signature process is fundamental to the performing of the encryption or access mechanism.

Hash lists and trees

In many systems, the content changes over time, and in this case the hash verification process can be a hash of not only the current state of the content, but also of the hash of the previous state, which in turn contains the hash of the state before, and so on. In this way it's impossible to modify any of the previous states without creating a totally different hash in the current state. The same applies to trees (called Merkle trees, see also Vitalik's post on Merkle trees) where each node contains a hash of the child node's hashes so that any change of data of any node in the tree will cause different hash values to propagate up the tree all the way to the root node.

Another really useful kind of hash-tree is used in deterministic wallets for cryto-currencies. These are wallets that allow many different addresses, even across multiple different types of coins, to be accessed by a single seed (usually composed of a twelve word phrase). This is a protocol called BIP-32 which uses the seed as the root of a tree of addresses. It's basically the idea of appending the seed with path information just like the folder tree on your hard-drive except all the folders are numbers, e.g. "seed/0/1/1" or "seed/2/5/3/3". Taking the hash of the seed and path gives a unique seed from which to create an address for a particular asset type, and you'll always get the same address from the same seed and path - just like you always get the same hash result from the same content.

Proof of work

Hashing can be used to prove that a certain amount of computation has been done that corresponds to some piece of data which is called "proof of work". This is simply done by appending the data with an arbitrary number (called a "nonce") that satisfies a particular constraint on the resulting hash. For example the "Hashcash" system (described below) requires that an email message header appended with its nonce has twenty zeros at the start of its binary hash result. Since a hash with many zeros at the beginning is very rare, hundreds of millions of different nonce values will need to be tried before finding such a hash result.

As a concrete example, lets take the sentence above, "the quick brown fox jumps over the lazy dog", and find the first number on the end of the sentence that yields a hash starting with four zeros in the standard base sixteen result (which is like saying sixteen zeros at the start of the result if it were shown in binary). I wrote a few lines of code to quickly loop through trying each hash starting with "the quick brown fox jumps over the lazy dog1" revealing an SHA256 hash value of 0c5c721f918280786d93fddd1e520892d5946140620efbbccffa51df782807c6 which doesn't start with any zeros, so it tried 2,3,4 and so on... It didn't find a hash starting with four leading zeros until it got all the way up to the number 84730 on the end of the sentence which has a hash value of 0000448bf6a27135258b9ad0de7b8f2fff84a12e9688b5b349ffe3c48900ea04. The only way I could find a number (my nonce) that gave me a hash with four leading zeros like this was to keep trying either randomly or counting in some way, there's no short cut, I had to do the work. But for you to prove that my number does give these zeros (i.e. my nonce is valid), you just need to get the hash of "the quick brown fox jumps over the lazy dog84730", which on the command line in Linux or Mac you can do as follows:

echo "the quick brown fox jumps over the lazy dog84730" | shasum -a 256


Which gives you the following output:

0000448bf6a27135258b9ad0de7b8f2fff84a12e9688b5b349ffe3c48900ea04

And there's the four leading zeros showing that my nonce of 84730 is valid, and that it really is proof that I must have done this work, and done it specifically for this exact sentence and no other. I don't need to show you my script, tell you what language I used, how long it took or anything else, the nonce stands independently as proof of the work.

In this same way a piece of data can be accompanied by a proof of work in the form of a nonce, so that a client that wishes to verify that this work has really been done can simply hash the data appended with the nonce value and check if the resulting hash value satisfies the criteria, such as having twenty leading zeros in the case of Hashcash or crypto-currency networks.

Just in case you're curious, here's the one-liner I used to find my nonce which will work on any Unix-like system. You can modify this to experiment with your own nonce-searches by changing the message or the four zeros. In plain English this code says to "keep increasing the value of a variable called $n while our sentence appended with $n and sent through the sha256 hash function does not start with 0000, then after that (i.e. it does start with 0000) print the value of $n".

perl -e '$n++ while`echo "the quick brown fox jumps over the lazy dog$n"|shasum -a 256`!~/^0000/;print$n'

The above code is very inefficient and took me twenty minutes to find my nonce, but is good for a quick example that doesn't require anything from the environment. Here's another version which will find the nonce in a fraction of a second, but does require that your version of Perl has hash support, which probably most do by default these days. This version is fast enough to find the nonce for six leading zeros (twenty four binary zeros) in about ten seconds (on my 3GHz core i7) which is 16769096.

perl -e 'use Digest::SHA qw(sha256_hex);$n++ while sha256_hex("the quick brown fox jumps over the lazy dog$n\n")!~/^000000/;print $n'

By printing the time taken along with the nonce, you can use it as a quick one-liner for a rough cpu benchmark your hardware :-)

perl -e '$t=time();$n++ while`echo $n|shasum -a 256`!~/^123/;print"$n:".(time()-$t)'

Proof of work in crypto-currencies

this section is still in-progress

In crypto-currency networks like Bitcoin, the mining process collects together unconfirmed transactions (ones that are not part of any block yet) into blocks and accompanies each block with a proof-of-work nonce.

The reason this is confusing is because the phrase "working on" makes it sounds as if when a new transaction is received the whole block header would change and therefore all the work done so far to find the nonce for the current block will have been wasted.

This however is not the case because the chances of finding a valid nonce are always the same no matter how much work you've already done, to think otherwise is a logical flaw known as "the gamblers fallacy". It's like thinking that if you've previously tossed a coin and it's come up heads many times in a row, that your chances of getting heads this time are much lower than 50%, but of course it's always 50/50.

So to sum that up, it's not wasting any time when new unconfirmed transactions keep arriving while trying to find a valid nonce. When one is finally found it will contain all the unconfirmed transactions that miner has seen up to that point, it doesn't have to queue any up for the next block while it's figuring this one out.

Another thing is that I wondered why everybody counts from zero and increments when trying to find a nonce since I thought this would mean everybody is racing through the same set of hash results (at least, those that had the same set of unconfirmed transactions in the same order, would be), but actually everybody's block header is guaranteed to be different because they all contain at least a different address to which the block reward should go.

Hashcash

Hashcash is a proof-of-work system designed to limit email spam and denial-of-service attacks. Hashcash was proposed in May 1997 by Adam Back. This was originally an idea for how to stop spam (or more generally denial-of-service attacks). The idea was that tokens could be made by requesters of a service which require significant processing power to make (as described above), and can be validated by the recipient to ensure that the work has indeed been done and it matches the headers of the email it's within. This would make spamming impractical because every message requires at least a second of work to process.

It takes an average of about one second for 1GHz hardware to find a hash with twenty leading zeros, but this criteria of twenty needs to increase over time to maker it harder to find a valid token because processing power always increases and becomes cheaper (Moore's law).

These valid proof-of-work hashes contain inherent value within them, because they've taken time to build and prove that they belong to (i.e. are owned by) the message header they've been constructed for. The Hashcash algorithm could be used in a more general context than email by allowing them to be reused and transferred between owners. A central provider could "mint" them and keep track of the current owner of each one.

In the Bitcoin network, the coins and who owns each (and all historical transfers) are held in a distributed database called the blockchain. The bitcoin owner is in the form of a bitcoin address rather than an email address. The bitcoin address is actually a public key and only the person with the corresponding private key has the abiltity to transfer it to another address.

Bitcoin also distributes the minting process by allowing anyone to do it, and has a difficulty level (how numerically low the hash result has to be to be considered valid) that auto-adjusts to ensure that the issuance rate matches a specific pre-determined curve no matter how many people are mining, or how fast the computer can process the hashes. The difficulty automatically adjusts to keep the average time it takes to find a valid block about ten minutes.

MD5

In cryptography, MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value. As an Internet standard (RFC 1321), MD5 has been employed in a wide variety of security applications, and is also commonly used to check the integrity of files. An MD5 hash is typically a 32-character hexadecimal number. Recently, a number of projects have created MD5 "rainbow tables" which are easily accessible online, and can be used to reverse many MD5 strings into their original meanings.

MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function, MD4. In 1996, a flaw was found with the design of MD5; while it was not a clearly fatal weakness, cryptographers began to recommend using other algorithms, such as SHA-1. In 2004, more serious flaws were discovered making further use of the algorithm for security purposes questionable, but it's still very useful for simple data integrity purposes.

On any Unix-like OS you can simply use the following syntax examples, the first gives you the hash of the content of a file, and the second gives you the hash of the given text.

md5sum filename
echo -n "some text to hash instead of a file" | md5sum

SHA1

On 23 Feb 2017, Google and a team from Amsterdam announced the first SHA1 collision created using a practically reproducible technique. It is now urgent that security sensitive applications using SHA1 move to SHA256.

Multihash

Multihash is a Multiformat protocol that follows the TLV (type-length-value) pattern, and is used for differentiating outputs from various well-established hash functions, addressing size and encoding considerations. It is useful to write applications that future-proof their use of hashes, and allow multiple hash functions to coexist.

Multihash is particularly important in systems which depend on cryptographically secure hash functions. Attacks may break the cryptographic properties of secure hash functions. These cryptographic breaks are particularly painful in large tool ecosystems, where tools may have made assumptions about hash values, such as function and digest size. Upgrading becomes a nightmare, as all tools which make those assumptions would have to be upgraded to use the new hash function and new hash digest length. Tools may face serious interoperability problems or error-prone special casing.

  • How many programs out there assume a git hash is a sha1 hash?
  • How many scripts assume the hash value digest is exactly 160 bits?
  • How many tools will break when these values change?
  • How many programs will fail silently when these values change?

This is precisely where Multihash shines. It was designed for upgrading. When using Multihash, a system warns the consumers of its hash values that these may have to be upgraded in case of a break. Even though the system may still only use a single hash function at a time, the use of multihash makes it clear to applications that hash values may use different hash functions or be longer in the future. Tooling, applications, and scripts can avoid making assumptions about the length, and read it from the multihash value instead. This way, the vast majority of tooling – which may not do any checking of hashes – would not have to be upgraded at all. This vastly simplifies the upgrade process, avoiding the waste of hundreds or thousands of software engineering hours, deep frustrations, and high blood pressure.

See also