How do data scientists fail in the real world?

Jorge Castro
Published in Cook php
7 min read · Aug 17, 2019

--

Let’s play, shall we?

We have a hashed value and we want to decrypt it. We will use a weak method to generate the hash and, on top of that, our secret value will be “unsafe”.

This is our magic hash:

61629b4533168c8e5a418347887b5b97

Hypothesis 1: MD5 is an insecure way to store a password because it is too fast.

It is true (about the speed), but it means nothing. I will explain it later. One of the alternatives suggested is to use bcrypt because it is slower.

..Bcrypt on the other hand, with a cost factor of five, only allows 71,000 guesses per second. This is 253,500,000% slower than MD5, 88,700,000% slower than SHA-1, and even 412% slower than SHA-512

Functionally, we don’t want a slow system, so this solution sounds like the opposite of what one usually wants.

Hypothesis 2: A password must be at least eight characters long

Why? Because it is hard to break by brute force. Again, it is about slowness and the number of combinations. Mathematically it makes sense.

However, it is a burden for the end user, who must remember this password (not to mention type a lengthy and complex one).

Challenge

This is our challenge: I will encrypt a text using a so-called unsafe method, and you will try to break it. You won’t need a data center or a Cray supercomputer to try. The text consists of only 4 letters/numbers.

This is my method (it could work with other methods too); it is anything but special or fancy. Here “salt” is a secret code and “value” is our target (the value that must be obtained).

md5(md5(salt + value) + salt)
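A minimal sketch of this construction in Python. I am assuming that `+` means string concatenation, that the inner digest is passed along as lowercase hex text, and that everything is UTF-8 encoded; the function names and the sample inputs are made up for illustration and are not the ones behind the magic hash above.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of `text` as 32 lowercase hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def double_salted_md5(salt: str, value: str) -> str:
    """Compute md5(md5(salt + value) + salt), treating + as concatenation."""
    inner = md5_hex(salt + value)
    return md5_hex(inner + salt)


if __name__ == "__main__":
    # "example-salt" and "Ab3d" are hypothetical inputs, for illustration only.
    print(double_salted_md5("example-salt", "Ab3d"))
```

The result is always 32 hex characters, whatever the length of the salt or the value.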

DATA SCIENTISTS SAY:

MD5 is a no go because it is too fast.

And our system works in this fashion: it allows an incorrect value to be entered 50 times (usually this number is 9 or 10, but we are generous). If we fail 50 times in a row, then the user is locked out and forced to set a new password. It is nothing new. It could also be limited by time, for example, 10 tries per minute.
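The lockout rule could be sketched like this in Python. The class name, its fields, and the in-memory counter are my assumptions for illustration; a real system would keep the counter in persistent storage, one per account.

```python
import hmac


class AttemptLimiter:
    """Lock an account after too many consecutive failed guesses."""

    def __init__(self, max_failures: int = 50) -> None:
        self.max_failures = max_failures
        self.failures = 0
        self.locked = False

    def check(self, guess_hash: str, stored_hash: str) -> bool:
        """Return True on a correct guess; count failures and lock otherwise."""
        if self.locked:
            # A locked account rejects everything, even the right value.
            return False
        if hmac.compare_digest(guess_hash, stored_hash):
            self.failures = 0  # a success resets the consecutive-failure count
            return True
        self.failures += 1
        if self.failures >= self.max_failures:
            self.locked = True
        return False
```

Once `locked` is set, the attacker gets no further feedback, which is the property the challenge relies on.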

Examples:

To simplify our exercise, my value consists of 4 characters taken from A-Z, a-z and 0–9, i.e. it draws on 26+26+10 = 62 different characters.

DATA SCIENTISTS SAY:

It is not correct; our password must consist of letters, numbers, and a random symbol.

Mathematically speaking, it means

62 x 62 x 62 x 62 = 14,776,336 combinations

Some benchmarks say a regular CPU (not a GPU or anything spectacular) can generate 220,047 md5 hashes per second. Other benchmarks promise 1 million md5 hashes per second.

DATA SCIENTISTS SAY:

It could be decrypted in 134 seconds: 67 seconds for a single md5 pass over every combination, doubled because the method calls md5 twice.
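The arithmetic behind that estimate can be checked with a few lines of Python. The two rates are the benchmark figures quoted above; the constant and function names are mine.

```python
import string

ALPHABET = string.ascii_letters + string.digits  # A-Z, a-z, 0-9: 62 symbols
LENGTH = 4
HASHES_PER_CANDIDATE = 2  # the method calls md5 twice per guess


def brute_force_seconds(md5_per_second: float) -> float:
    """Seconds needed to hash every candidate at the given md5 rate."""
    keyspace = len(ALPHABET) ** LENGTH
    return keyspace * HASHES_PER_CANDIDATE / md5_per_second


if __name__ == "__main__":
    for rate in (220_047, 1_000_000):
        print(f"{rate:>9} md5/s: {brute_force_seconds(rate):.1f} seconds")
```

Note that this only measures how long it takes to produce the candidates when the salt is already known; it says nothing about how the attacker learns which candidate is the right one.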

But we could even help a bit further. Our secret value is based on a dictionary. So the combinations are even fewer.

DATA SCIENTISTS SAY:

Our secret value must not be based on a dictionary. Reason: Rainbow table attack.

Now, can you decrypt it?

No, you won’t.

Why? While it is possible (not really, I will explain later) to generate 14 million combinations in a couple of minutes, and one of them would be our value, the “name of the game” is to guess which of those 14 million combinations is the right one. The system does not tell you whether you are right unless you enter the right value, and you have 50 chances out of 14 million.

For example, let’s say we generated 14 million hashes (how?):

5b7167bf6322fa196de747722a341334
aeac4cfe07ee8e2cf3ffc2e455d1773b
6328b84337fb432bffd87667164d8339
0e30e6db58d8215f63672f5bd715ed0b
eac76778c3c3c7caebdaf6bf4045ae88
39cff3dbad8223f16dad866d5fc8c9ef
665135ec9d543e4539d1df1929fea732
dab009ff8e4eac42855fa53dffbd27d8
87aa806c5c2840383a49fde0d46ed490

The generator of hashes will not show this message:

7a112d629698f3f4455b4463a4a395bf (YOU FOUND IT, YAY!)

Even with 3 letters, it is really hard (unless you are really lucky).

For 3 characters, there are 62 x 62 x 62 = 238,328 combinations.
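To put a number on “really lucky”, here is a small sketch, assuming the attacker makes 50 distinct guesses against a uniformly random value before the lockout; the function name is mine.

```python
def guess_chance(length: int, tries: int = 50, alphabet_size: int = 62) -> float:
    """Chance that one of `tries` distinct guesses hits a uniformly random value."""
    keyspace = alphabet_size ** length
    return min(1.0, tries / keyspace)


if __name__ == "__main__":
    for length in (4, 3):
        print(f"{length} characters: {guess_chance(length):.6%}")
```

A dictionary-based value is not uniformly random, so the real chance is higher than this, but the attacker still needs the right guess to land within the 50 tries.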

The complexity of the password is also meaningless.

Let’s say our value contains the text “correct”.

md5("correct") // e5d7cffe25654f7e3a1e334118c71549

It could be decrypted with a rainbow table.

But

md5(md5("correct"))
md5('e5d7cffe25654f7e3a1e334118c71549')

A double-generated hash is harder to find in a rainbow table (but not impossible).

However, we are adding a SALT value, and it converts a simple and short password into a long (and complex) one.

Now, let’s say our salt is “horse battery stable”, so our encryption is

md5("correct" + "horse battery stable")

So the chances of finding this value in a rainbow table are practically zero, and we are not burdening the user with a long (and mostly useless) password.
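A toy illustration of that point in Python. The word list is made up, and a plain dictionary of precomputed hashes stands in for a real rainbow table (which uses hash chains rather than a flat lookup).

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical word list; a flat lookup standing in for a rainbow table.
WORDS = ["password", "letmein", "correct", "horse"]
LOOKUP = {md5_hex(word): word for word in WORDS}

SALT = "horse battery stable"
unsalted = md5_hex("correct")
salted = md5_hex("correct" + SALT)

print(LOOKUP.get(unsalted))  # the unsalted hash is in the table
print(LOOKUP.get(salted))    # the salted hash is not
```

The table would only help again if it were rebuilt with the salt included, and that requires knowing the salt.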

Now, what do we know?

XXXX?????????: we know something about the X part; it is made of alphanumeric characters. But we don’t know the rest. We could generate 14 million values and discard every one that doesn’t match this criterion (that is, whose first 4 characters are not alphanumeric), so we could reduce the number of candidates considerably, let’s say to 1/5 (3 million combinations).

However, we are double generating the hash.

The md5 hash of 123 is 202cb962ac59075b964b07152d234b70

The md5 hash of 124 is c8ffe9a587b126f152ed3d89a146b445

md5 doesn’t divulge much information even for consecutive values.
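This is easy to check. The sketch below hashes the two consecutive values and counts how many of the 32 hex positions differ; the helper names are mine.

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def differing_positions(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


h123 = md5_hex("123")
h124 = md5_hex("124")
print(h123)
print(h124)
print(differing_positions(h123, h124), "of", len(h123), "hex digits differ")
```

A one-character change in the input gives an unrelated-looking digest, so one hash offers no hint about its neighbours.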

So the only evidence (to guess our value) is the length (32 characters): we could reduce the number of possibilities by knowing that the first md5 generates 32 characters.

Without a salt, it looks like this:

md5(md5('123')) // is equal to
md5('202cb962ac59075b964b07152d234b70')

However, we are adding a SALT value, so our weak evidence of 32 characters is now null and void.

md5(md5(SALT+'123')+SALT)

Let’s say our SALT consists of 10 random characters.

It means 256 x 256 x 256 x 256 x 256 x 256… = 256 ^ 10 combinations, i.e. a very big number.
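To get a feeling for how big, here is a quick calculation in Python. The rate of 1 million salts per second is only borrowed from the optimistic md5 benchmark mentioned earlier, and a year is taken as 365 days; the names are mine.

```python
SALT_LENGTH = 10
VALUES_PER_CHARACTER = 256  # every byte value is allowed

salt_space = VALUES_PER_CHARACTER ** SALT_LENGTH
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_enumerate(salts_per_second: float) -> float:
    """Years needed to walk through every possible salt at the given rate."""
    return salt_space / salts_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{salt_space:,} possible salts")
    print(f"{years_to_enumerate(1_000_000):.2e} years at 1,000,000 salts per second")
```

And this is only the cost of enumerating salts; each one would still have to be combined with every candidate value.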

Our value could be found by comparing the single salted md5 with the double salted one. Example:

md5(SALT+VALUE) // XXXXXXXXXX????
md5(md5(SALT+VALUE)+SALT) // ??????????????XXXXXXXXXX

Here XXXXXXXXXX, the salt, is the same in both cases, and ???? is the part we don’t know. And how many values match this condition?

256 ^ 10

So the chances of finding our value are worse, because our text is longer than before.

Let’s make it simple

Let’s say our secret number is a simple number (from 0 to 9).

This is our hash:

md5(md5(SALT+VALUE)+SALT)

56ca388d7f838a970cc2025ab3981d3e

You have a single chance (luck excluded): what is our number?

If you are quick, you have already found the trick.

What is the trick?

  • First, the challenge wasn’t to guess a single number but two: our VALUE and the SALT. You can’t generate hashes if you don’t have the SALT, and you can’t test values without it either. You can try every SALT, but it is futile (even for md5) because, if the SALT is safe (not obvious), nothing tells you whether you have found the right one. Is it cheating? Yup, but it is the real world.
  • Even if we could generate all the hashes, our challenge limits the number of tries to 50. Almost every ATM and operating system does that, so the more candidates we generate, the lower the chance that any single try is our value.
  • Even if we don’t limit the tries, the real world acts as a natural limit. If the attacker doesn’t have access to our server (and our SALT), then he can only try values online. That means he is limited by the bandwidth and slowness of the system, and also by the DDoS barrier. We could also limit the tries to 10 per account per minute. It means 970 years.
  • Bcrypt could generate 71k hashes per second; however, a moderately fast REST API (I am ignoring websites such as a login page because they are slower) could serve 1,000 requests per second (or 10k if you are generous). It means that if we test the combinations online, the speed of the encryption means NOTHING, because access to the service is considerably slower than the hashing itself.
  • But what if the list of hashes is leaked? It means nothing too. The hashes are based on other hashes that were generated with a SALT value (that we don’t know). In our simplified exercise, this is our “leaked” hash: 56ca388d7f838a970cc2025ab3981d3e. What is our value? 🤷
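The bottleneck argument from the list above can be sketched in Python. I am using the four-character keyspace of the challenge and the rates quoted in the text, with a 365-day year; a different keyspace gives different figures, and all names are mine.

```python
KEYSPACE = 62 ** 4  # four characters from A-Z, a-z, 0-9
SECONDS_PER_YEAR = 365 * 24 * 3600

RATES_PER_SECOND = {
    "bcrypt, offline (71,000 per second)": 71_000,
    "REST API, online (1,000 per second)": 1_000,
    "rate limited (10 per minute)": 10 / 60,
}


def effective_rate(hash_rate: float, request_rate: float) -> float:
    """Online guessing can go no faster than the slower of the two stages."""
    return min(hash_rate, request_rate)


def seconds_to_exhaust(guesses_per_second: float) -> float:
    """Seconds needed to try every value in the keyspace."""
    return KEYSPACE / guesses_per_second


if __name__ == "__main__":
    for label, rate in RATES_PER_SECOND.items():
        seconds = seconds_to_exhaust(rate)
        print(f"{label}: {seconds:,.0f} s ({seconds / SECONDS_PER_YEAR:.4f} years)")
```

When the guesses go through the service, `effective_rate` is set by the request rate, so swapping md5 for a slower hash does not change the online figure.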
