PDA

View Full Version : md5sum theory



tejas
11th June 2005, 08:30 PM
Hello, I had some doubts about how md5sum works, which was hard to find answers in google for because everywhere I look is only a link to download an md5sum checker.

I got to thinking. I'f I'm not wrong, I think it must be possible for 2 files to have the same md5sum. Each md5sum has 32 letter/numbers, resulting in a total no of combinations of (26 + 26 + 10) ^ 32 = 2.27 * 10^57 combinations (2272657884496751345355241563627544170162852933518 655225856) to be exact. (yes, I used bc, and cut and pasted). Can the md5sum exceed 32 digits?

As there are probably more files then this, there must be at least two files with the same md5sum.

If not: (md5sum is unique)
and, every file has a unique one, why isn't it possible to reverse the md5sum algorithm, and generate a complete file from the md5sum?

If true: (two files have same md5sum)
what are the conditions for two files to have the same md5sum? Does this mean it is not really as accurate as I believed? Would it be possible, in theory, to generate a file from the md5sum and another parameter, say, the size, and the first n bytes of the file?

Please point me in the wrong direction.

PS: No, I'm not drunk

-Tejas Dinkar

jtang613
11th June 2005, 08:38 PM
Here's the RFC that describes The MD5 Message Digest Algorithm
http://www.faqs.org/rfcs/rfc1321.html

Enjoy,
Jason

tejas
11th June 2005, 08:42 PM
thanks jtang613

I was hoping (but not really expecting) for a shorter answer.

Anyway, I'll have to read this when it is not 1:30am (ie. NOW!)

Looks complicated

Shakes
11th June 2005, 08:47 PM
From the link given



4. Summary

The MD5 message-digest algorithm is simple to implement, and provides
a "fingerprint" or message digest of a message of arbitrary length.
It is conjectured that the difficulty of coming up with two messages
having the same message digest is on the order of 2^64 operations,
and that the difficulty of coming up with any message having a given
message digest is on the order of 2^128 operations. The MD5 algorithm
has been carefully scrutinized for weaknesses. It is, however, a
relatively new algorithm and further security analysis is of course
justified, as is the case with any new proposal of this sort.

keithmo
11th June 2005, 09:11 PM
There has been a lot work recently on MD5 collisions. See http://www.cits.rub.de/MD5Collisions/ for an easy read on the subject.

tejas
12th June 2005, 04:46 AM
Very easy to understand keithmo.

They actually let you get a copy of two files with the same md5sum

Well, I guess the only way to be 100% accurate algorithm is to download the file again. :)

Just out of curiosity, has there ever been a documented case where the incorrect file downloaded has the same md5sum as the original. I know the chances are a:2.27*10^57, but still?

keithmo
12th June 2005, 11:15 AM
The number of possible MD5 hashes is actually 2^128, which is about 3.4*10^38.

(Each MD5 sum has 32 characters, but they're really hexadecimal digits. The string is a hex representation of the 128-bit hash. Since the hash is 128 bits, the number of hash values is 2^128.)

I've never heard of a case of an "accidental" MD5 collision. It's not likely, but it's not impossible either.

markkuk
12th June 2005, 12:15 PM
I got to thinking. I'f I'm not wrong, I think it must be possible for 2 files to have the same md5sum. Each md5sum has 32 letter/numbers, resulting in a total no of combinations of (26 + 26 + 10) ^ 32 = 2.27 * 10^57 combinations (2272657884496751345355241563627544170162852933518 655225856) to be exact. (yes, I used bc, and cut and pasted). Can the md5sum exceed 32 digits?

As there are probably more files then this, there must be at least two files with the same md5sum.

Looks like you don't understand how large that number (2.27E57) really is. Earth has around 10^49 - 10^50 atoms according to Fermilab (http://www.fnal.gov/pub/inquiring/questions/atoms.html) and other sources (http://pages.prodigy.net/jhonig/bignum/qaearth.html). Current technology doesn't allow storing 10 million files on a single atom :)

Even with the correct number of possible MD5 values (3.4E38) is much larger than any reasonable number of files in existence, it would mean about 5*10^28 files for each person on Earth.

keithmo
12th June 2005, 12:49 PM
Accidental collisions are thankfully quite rare.

The more interesting question is: how difficult is it to create a "malicious collision"? This is the subject of the document I linked to earlier. The authors provide two postscript files. They are the same length. They have identical MD5 hashes. However, they print differently (one is a letter of recommendation for Alice, the other grants her a higher security clearance).

The authors claim it took "just a few hours on a customary PC" to find the collision.

MD5 is starting to crack under combined pressure of Moore's Law and modern cryptoanalysis techniques. Other available algorithms (SHA-256, for example) appear to be more resistant to these attacks. Only time will tell.

bitrain
12th June 2005, 12:57 PM
When md5 isn't good enough any more, they just make the 32 bits something like 1024 (or more) and then it will take even on the most modern computers a long time.

tejas
12th June 2005, 01:29 PM
but they're really hexadecimal digits. The string is a hex representation of the 128-bit hash.I just learnt something new. I did an md5sum. As I saw a couple of letters, I assumed that the md5sum could contain A-Z, a-z and 0-9, hense 26 + 26 + 10. I see the error now. I never realised that none of the letters exceeded 'f', corresponding to 15 in hex. Thanks for the clarification.