
Random Ramblings

Monday, 20 July, 2009

RR (17): Breaking The Daily WTF's CAPTCHA

As you should know, TDWTF is one of those sites where awful code gets posted roughly weekly, along with more frequent anecdotes that are pretty much fail too. As it turns out, their own CAPTCHA code is pretty fail, or "WTF" in their community's parlance. The CAPTCHA (iirc) is only used in one place: posting comments on an article while not logged in. (I can't remember whether registration requires solving the CAPTCHA too.)

So yeah. If you go to just about any article and check the comments, you should find at least one spam post. Some of these are posted manually, and some look automated. I present a simple method of solving their CAPTCHA with a probability of at least 5%. (Before you complain that 5% is too low, note that with 50 tries the chance of failing all of them is a mere 7.7%.)

First, preprocessing the CAPTCHA image is helpful, though not strictly necessary. A simple blur-and-threshold lets one particular OCR program read "Trishque" instead of the correct "tristique", which suggests a failure rate of about one in four per letter. (Without preprocessing it gives a messy "Tf‘(5‘{'(qu€ `".) Approximating the average CAPTCHA as having 6 characters, this gives a 17.8% chance of perfect accuracy. Hey, that's not too bad! But what if we get another nine-letter CAPTCHA? That still gives a 7.5% chance of perfect accuracy.
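For anyone who wants to check the arithmetic, here is a minimal Python sketch. The function names are made up for illustration, and it assumes (as the estimates above do) that letters and attempts fail independently of each other.

```python
def all_correct(per_letter_accuracy: float, length: int) -> float:
    """Chance that every letter of a CAPTCHA is read correctly,
    assuming each letter fails independently."""
    return per_letter_accuracy ** length


def all_fail(success_chance: float, tries: int) -> float:
    """Chance that every one of `tries` independent attempts fails."""
    return (1.0 - success_chance) ** tries


if __name__ == "__main__":
    print(f"{all_fail(0.05, 50):.1%}")    # 7.7%  (50 tries at 5% each)
    print(f"{all_correct(0.75, 6):.1%}")  # 17.8% (six letters)
    print(f"{all_correct(0.75, 9):.1%}")  # 7.5%  (nine letters)
```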

There's still a fly in the ointment though. Repeatedly hitting the TDWTF server with CAPTCHA attempts is a bit slow. We could do better. Much better. In particular, each request for the CAPTCHA image itself seems to re-render the same solution as a new image with different distortions. Assuming that the attempts are mutually independent, nine images of a nine-letter CAPTCHA have a 50.5% chance of yielding at least one correct reading. That's still pretty useless information on its own, since we can't tell which reading (if any) is the correct one. But taking dozens of images of the same CAPTCHA lets us solve each individually and then pick the most common answer. I'm not bothering to do a Monte Carlo simulation of this yet, but I believe it is feasible.
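A rough sketch of the voting idea, with a toy Monte Carlo bolted on. The OCR error model here is purely an assumption of mine (each letter is independently swapped for a random different lowercase letter 25% of the time); real OCR mistakes are far less uniform, so treat the simulated rate as a ballpark at best. All names are hypothetical.

```python
import random
import string
from collections import Counter


def at_least_one_correct(per_image_chance: float, images: int) -> float:
    """Chance that at least one of `images` independent readings is right."""
    return 1.0 - (1.0 - per_image_chance) ** images


def most_common_answer(readings):
    """Return the reading that occurs most often (ties go to the first seen)."""
    return Counter(readings).most_common(1)[0][0]


def noisy_read(solution, accuracy, rng):
    """Assumed error model: each letter is kept with probability `accuracy`,
    otherwise replaced by a different random lowercase letter."""
    out = []
    for ch in solution:
        if rng.random() < accuracy:
            out.append(ch)
        else:
            out.append(rng.choice([c for c in string.ascii_lowercase if c != ch]))
    return "".join(out)


def vote_success_rate(solution, images, accuracy=0.75, trials=2000, seed=0):
    """Estimate how often the most common of `images` noisy readings
    equals the true solution."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        readings = [noisy_read(solution, accuracy, rng) for _ in range(images)]
        wins += most_common_answer(readings) == solution
    return wins / trials


if __name__ == "__main__":
    print(f"{at_least_one_correct(0.75 ** 9, 9):.1%}")  # 50.5%
    print(vote_success_rate("tristique", images=50))
```

Under this toy model, wrong readings rarely agree with each other, so the correct answer tends to win the vote as soon as it shows up more than once.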

Guess what? We can do better yet! The letters are all cleanly separated. A straight vertical line might cut through more than one letter, but this doesn't pose a problem. After the blur-and-threshold step described above, seam carving can be used to determine where there are letters and where there are none. Splitting the problem into smaller problems has some advantages. First, the OCR only works on one letter at a time, so there's no risk of multiple letters being confused as one. Second, the voting method from the previous paragraph becomes even more effective, since it can be applied per letter. The fly hasn't been fully eradicated though: letters that are split into multiple parts (just "i" for now) require special handling, otherwise the tittle (or whatever separate part) would be treated as a letter by itself.
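I haven't written the seam-carving version, so the sketch below swaps in a simpler stand-in: connected-component labelling on the thresholded image. It relies on the same observation (letters don't touch each other), copes with letters whose columns overlap, and glues a tiny blob onto the larger blob it sits above or below, which is a crude way of handling the tittle. The bitmap format (`#` for ink), the function names and the `tittle_max_pixels` cutoff are all assumptions for illustration.

```python
from collections import deque


def components(bitmap):
    """Find 8-connected blobs of ink ('#') in a list of equal-length strings.
    Returns a list of blobs, each a list of (row, col) pixels."""
    height, width = len(bitmap), len(bitmap[0])
    seen, blobs = set(), []
    for r in range(height):
        for c in range(width):
            if bitmap[r][c] != "#" or (r, c) in seen:
                continue
            queue, blob = deque([(r, c)]), []
            seen.add((r, c))
            while queue:  # breadth-first flood fill from this pixel
                y, x = queue.popleft()
                blob.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] == "#"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def column_span(blob):
    """Leftmost and rightmost column touched by a blob."""
    cols = [x for _, x in blob]
    return min(cols), max(cols)


def split_letters(bitmap, tittle_max_pixels=2):
    """Return the column span of each letter, left to right. Blobs of at most
    `tittle_max_pixels` pixels are merged into a bigger blob whose span
    contains their centre column (e.g. the dot of an "i")."""
    blobs = components(bitmap)
    letters = [list(b) for b in blobs if len(b) > tittle_max_pixels]
    for dot in (b for b in blobs if len(b) <= tittle_max_pixels):
        lo, hi = column_span(dot)
        centre = (lo + hi) / 2
        for letter in letters:
            left, right = column_span(letter)
            if left <= centre <= right:
                letter.extend(dot)
                break
        else:
            letters.append(list(dot))  # no host found: keep it on its own
    letters.sort(key=lambda b: column_span(b)[0])
    return [column_span(b) for b in letters]


if __name__ == "__main__":
    bitmap = [
        "..#......",
        ".........",
        "..#..###.",
        "..#..#.#.",
        "..#..###.",
    ]
    print(split_letters(bitmap))  # [(2, 2), (5, 7)]: an "i" and an "o"
```

The merge rule would misfire if a small speck of noise survived thresholding above some other letter, so the cutoff would need tuning against real images.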

There is another minor optimisation that can be made. The background of the CAPTCHA is almost a uniform horizontal gradient; by taking the minimum and maximum brightness values at the two ends, a more exact threshold can be computed for each column. Note, however, that this step is mostly useless regardless.
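One way this could look, as a sketch only: estimate the background brightness at the left and right edges, interpolate linearly in between, and threshold each pixel relative to its local background. I'm assuming dark text on a lighter background and taking the brightest pixel in each edge column as the background estimate; the `ratio` cutoff and the function name are likewise made up.

```python
def gradient_threshold(gray, ratio=0.8):
    """gray: rows of brightness values (0..255). Returns rows of booleans,
    True where a pixel is darker than `ratio` times the local background."""
    width = len(gray[0])
    # Assumption: the brightest pixel in an edge column is pure background.
    left_bg = max(row[0] for row in gray)
    right_bg = max(row[-1] for row in gray)
    ink = []
    for row in gray:
        out = []
        for x, value in enumerate(row):
            t = x / (width - 1) if width > 1 else 0.0
            background = left_bg + (right_bg - left_bg) * t  # linear gradient
            out.append(value < ratio * background)
        ink.append(out)
    return ink


if __name__ == "__main__":
    gray = [
        [100, 125, 150, 175, 200],  # background only
        [100, 60, 150, 90, 200],    # two dark "ink" pixels
    ]
    for row in gradient_threshold(gray):
        print(row)
    # [False, False, False, False, False]
    # [False, True, False, True, False]
```

A single global threshold of, say, 120 would have flagged the dark left edge of the background as ink here, which is the problem the per-column threshold avoids.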

I also note the existence of specialised CAPTCHA-solving programs such as PWNTCHA, but I'm lazy. (It's not a Debian package, for one.) From what I've read about PWNTCHA, it seems to exploit weaknesses in the CAPTCHA generation algorithm, even assuming an otherwise secure implementation. One of these weaknesses happens to be the use of a constant font. (I think it's Comic Sans.)
