Basic concepts in hashing and password storage. -- Gregg Ink

Basic concepts in hashing and password storage.

Hashing, like anything else in a computer, is a system and systems have the following structure:

The input of a hashing function could be any binary data (text, sound, images, etc.) of arbitrary length. The output is a number, typically represented in hexadecimal form, of fixed length.

There are many different hashing algorithms. One common example is SHA256. SHA stands for Secure Hashing Algorithm. We will talk in a minute about the significance of the word secure. The number 256 refers to the length of the output: 256 bits or 32 bytes.

We will not concern ourselves with the details of the processing in this post. The SHA256 is an algorithm in the public domain. If you care to know you can easily google it; suffice to say that it is a load of XORs and shifts.

The fixed length number produced as output, also known as a "hash", does not mean anything in particular. It is not some kind of average or anything of any statistical or other significance. A good hash does however need to satisfy three properties which will make sense once we discuss what they are used for.

1. Deterministic

Exactly the same output should always produce exactly the same hash.

2. Irreversible

Once you are given a hash, the output, it should be impossible to infer what the input was. This might seem surprising since the algorithm is in the public domain. Can you not simply do the same thing in reverse? No, you cannot! Let's pretend the processing is a simple addition and the output is 7. You don't have any way of knowing whether the input was 6 and 1 or 4 and 3 or 10 and -3. Hashing algorithms are designed such that they cannot be reversed.

There is kind of a way of reversing it: a brute force attack. You can try many different inputs and see if they produce your target hash. If you don't have any hint about the input, this approach is usually not practical.

3. A slight difference in input produces a completely different output.

This is really a consequence of irreversibility. Imagine this third property didn't exist and you want to figure out the input of a target hash. You could try a completely random input and hash it. Let's say this produces hash A. You could than slightly tweak the input and produce hash B. You could then assess whether Hash B brings you closer of further away from your target output. You could then continuously tweak the input until it produces the target output. The third property ensures this strategy will not work to reverse the hash and there is really no way to do it other than a brute force attack.

About collisions

With an infinite number of possible inputs and a very large but finite number of possible outputs, collisions are inevitable. A collision simply means that two different inputs hash to the same output. This is fine as long as collisions are far and wide apart. Remember the "secure" in SHA256. This means that brute force attacks are not practical with current hardware. This is so because it would take too long for each hash but also because the likelihood of a collision is very small.
One simple way of testing the quality of a hashing algorithm is to see how easy it is to find a collision.

This is fine but now you might be wondering how hashing could be useful. There are in fact many uses.

1. Bitcoin

SHA256 is an essential component which makes bitcoin happen. I have already written a 20 page e-book on how bitcoin works. It is free. You can find it here: http://bitcoin.greggink.com

2. Password storage.

Let's pretend that you want to log into your google account. Google does not know your password. That is a good thing. If it did store your password, it could be stolen by a hacker. Since it doesn't know your password, how can you log in?

When you first create your account, you tell Google which password you want. Google will hash that password and store the hash in its database. When you log in next time, Google will hash the password you are giving it and compare it with the one in its database. If the two hashes match, it logs you in. This works because of the first property; identical passwords should always hash to the same thing.

There is a bit of a twist here. I have oversimplified things. If Google were to store your passwords that way they wouldn't really be all that secure. Hackers could hash entire dictionaries and store it in a database. If they hack Google and get loads of hashed passwords, they could compare everything with their own dictionary of hashes.

To get around that you use a technique known as a salted hash. "Salt" is a nonsense word or number which is added to the password and then this is hashed. This should render a hacker's dictionary of hashes useless. Of course Google probably does do even more things they are not telling anyone about.

When you click on the link "forgot my password", Google does not send you your password but instead gives you a chance to reset your password once you convinced them you are the legitimate owner of the account. Sometimes websites do store plain text passwords. You can tell because when you click on the link, they mail you your password. When this happens, sit down, take a moment and cry. Seriously though, you should stay away from these websites as much as possible. I wonder whether there has ever been a court case over this. I am seriously considering starting a website which lists all websites which do not hash passwords or do not observe basic security standards. I did write the Volcep (Voluntary Code of Ethics for Programmers) in an attempt to encapsulate what counts as good practice. You find it here: http://greggink.com/volcep/

3. Checking file transfer.

As data goes through the internet, it gets corrupted once in a while. How to be sure you have your file arrived the way it should? The website hosting the content could hash the file and publish the hash. In the context the hash is called a checksum. You could then hash the received file and compare the hash to the checksum. They should be identical.

4. Legitimacy of software

What if you download a piece of software from a mirror? You want to make sure that it is the legitimate software and not something which contains malware. The software developer could publish the checksum of the real thing. Anything downloaded could be hashed and compared to the published checksum. The MD5 hashing algorithm was widely for this purpose in the 1990s.

Hackers have found ways of exploiting this though. They took some legitimate software, added malware and added some junk. They kept changing the junk randomly until the entire file hashed to the same checksum as the original, legitimate software. They were able to do this because hardware had gotten a lot faster than when MD5 was designed.

MD5 might have made sense for the hardware of the day but it is not up to the task in the 21st century. It is strongly recommended you do not use MD5 for any security-sensitive application today.

5. Changes in files

You could use hashing just to see if two files are different. Imagine you want to download a large file from a server but only if it has been updated. If it is the same as your local version, you want to pass to save time and bandwidth. You could take a hash of the server version and your local version, then compare the two. If the file is on a trusted server then security is not a concern and speed is the main consideration. For situations like this you could consider Meow hash. Meow hash is optimized for speed and is not suitable for security-sensitive applications. Read more about Meow Hash here: https://mollyrocket.com/meowhash

Trade-offs

Software engeneering is full of trade-offs. When it comes to hashes the most important trade-off is between speed and security. Hashes which need to be secure have to be slow so as to make brute force attacks impractical. When security is less of a concern, you can afford to use a faster hash.

Gregg Ink
twitter: @Gregg_Ink