[Java] Defeating CAPTCHA Images

Disclaimer: Depending upon the country you currently reside in, programmatically defeating CAPTCHA images may technically be illegal. Whether or not there is any merit behind such a law I leave as a matter for you to work out with your representatives or equivalent lawmaking body. Suffice it to say, the information in this post is intended for educational and informative purposes only, and should not be used in any other context. It should also be noted that the CAPTCHA images used in this example are quite old, and were cracked by others long ago.

I’ve always been mildly amused at the continually growing use of CAPTCHA images, or more accurately, at their ever-increasing complexity. It seems that the only truly effective CAPTCHAs are the ones that even human beings can barely decipher. But more interesting to me is the fact that these distorted snippets of letters and numbers have become a sort of de facto Turing test. If you can determine what the characters are, then you are human; otherwise you are not. For whatever reason, these images have become a symbolic line in the sand separating man from machine, and by exploring ways to cross this line we may move ever so slightly closer towards the creation of true artificial intelligence.

So let’s examine a very basic CAPTCHA image, one that was used in a popular online-forum distribution before it was cracked long ago:

PHPBB2 CAPTCHA

This CAPTCHA works on the principle of contrast. Human beings can discern distinct regions in an otherwise noisy image so long as each distinct region meets some minimum contrast level above/below that of the background noise. This kind of image can be difficult to decipher computationally, because pulling out coherent regions from amongst the background noise requires contextual understanding of large portions of the image at once, which is generally a difficult thing to accomplish programmatically. That isn’t to say it can’t be done, however.

A human being looking at this image is able to recognize that the background noise creates some threshold, above which an element is part of the encoded data, and below which an element is simply part of the background noise and should be discarded. Once that threshold is known, discerning the text becomes a simple matter of discarding everything below it and keeping everything above. So let’s see if we can code it. First, we need a way to determine the noise threshold:

        //init, determine the average color intensity of the image
        int average = 0;
        for (int row = 0; row < image.getHeight(); row++) {
                for (int column = 0; column < image.getWidth(); column++) {
                        int color = image.getRGB(column, row) & 0x000000FF;  //only need the last 8 bits
                        average += color;
                }
        }
        average /= image.getWidth() * image.getHeight();

This bit of code determines the average color intensity of the entire image (216 / 255 in this case). Because this CAPTCHA is in grayscale it only needs to look at a single component of the pixel color, but colorized CAPTCHA images could be processed in a similar fashion by computing the intensity using the full RGB value. In any case, now we have a basic threshold that we can use for determining which parts of the image contain valuable data, and which parts contain only noise. We can do that like so:

        //first pass, mark all pixels as WHITE or BLACK
        for (int row = 0; row < image.getHeight(); row++) {
                for (int column = 0; column < image.getWidth(); column++) {
                        int color = image.getRGB(column, row) & 0x000000FF;  //only need the last 8 bits
                        if (color <= average * .70 ) {
                                image.setRGB(column, row, BLACK);
                                darkRegion = true;
                        }
                        else if (color < .85 * average && darkRegion && row < image.getHeight() - 1 
                                && (image.getRGB(column, row + 1) & 0x000000FF) < .85 * average) {
                                image.setRGB(column, row, BLACK);
                        }
                        else if (color < .85 * average && ! darkRegion && row < image.getHeight() - 1 && column > 0 
                                && column < image.getWidth() - 1 
                                &&  (((image.getRGB(column, row + 1) & 0x000000FF) < color) 
                                        || ((image.getRGB(column + 1, row) & 0x000000FF) < color) 
                                        || ((image.getRGB(column - 1, row) & 0x000000FF) < color))) {
                                image.setRGB(column, row, BLACK);
                                darkRegion = true;
                        }
                        else {
                                image.setRGB(column, row, WHITE);
                                darkRegion = false;
                        }
                }
        }

Note that this code assumes that darker pixels are part of the data and lighter pixels are part of the background noise, because that is how the input CAPTCHA is set up. A smarter approach would be to look at the number of pixels falling above the noise threshold and the number of pixels falling below, and then keep whichever group is smaller. For a CAPTCHA like this one to be effective, there must be more noise than data, so it follows that the data that you’re looking for will always be in the smaller group of pixels.
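The two ideas mentioned so far but not shown (computing intensity from the full RGB value, and keeping whichever group of pixels is smaller) could be sketched roughly as follows. This is only an illustration and not part of the original project: the class and method names are made up, the luma weights are the commonly used 0.299/0.587/0.114 approximation, and the plain image average stands in for the noise threshold.

```java
import java.awt.image.BufferedImage;

final class PolaritySketch {

    /** Approximate brightness (0-255) of a packed RGB pixel, using 0.299/0.587/0.114 luma weights. */
    static int intensity(int rgb) {
        int red = (rgb >> 16) & 0xFF;
        int green = (rgb >> 8) & 0xFF;
        int blue = rgb & 0xFF;
        return (299 * red + 587 * green + 114 * blue) / 1000;
    }

    /** Mean intensity of the whole image; a long accumulator avoids overflow on large images. */
    static int averageIntensity(BufferedImage image) {
        long sum = 0;
        for (int row = 0; row < image.getHeight(); row++) {
            for (int column = 0; column < image.getWidth(); column++) {
                sum += intensity(image.getRGB(column, row));
            }
        }
        return (int) (sum / ((long) image.getWidth() * image.getHeight()));
    }

    /** True when pixels darker than the average are the smaller group, so the data is probably dark. */
    static boolean dataIsDark(BufferedImage image) {
        int average = averageIntensity(image);
        long darker = 0;
        long lighter = 0;
        for (int row = 0; row < image.getHeight(); row++) {
            for (int column = 0; column < image.getWidth(); column++) {
                if (intensity(image.getRGB(column, row)) < average) {
                    darker++;
                } else {
                    lighter++;
                }
            }
        }
        return darker <= lighter;
    }

    public static void main(String[] args) {
        // A white 4x4 image with a single black pixel: the dark group is the minority.
        BufferedImage image = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        for (int row = 0; row < 4; row++) {
            for (int column = 0; column < 4; column++) {
                image.setRGB(column, row, 0xFFFFFF);
            }
        }
        image.setRGB(1, 1, 0x000000);
        System.out.println(dataIsDark(image)); // prints "true"
    }
}
```

If `dataIsDark` returned false, the comparisons in the first pass would need to be flipped so that lighter pixels are kept as data instead.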

In any case, the above code traverses the image, turning any pixels that appear to be noise white and any pixels that appear to be data black. Note that it includes some rudimentary region-detection code, owing to the fact that we expect our data pixels to be tightly clustered together in distinct regions. So when the code encounters a pixel that it considers to be part of the data, it also relaxes the selection criteria for the next pixel, because there is a strong possibility that the next pixel will also be data. This helps prevent false negatives from erroneously dropping out valuable pieces of data. Let’s take a peek at what our CAPTCHA image looks like at this point:

PHPBB2 CAPTCHA, after first pass

It’s not perfect, but it is definitely improved. We have successfully removed all of the background noise from the image, but unfortunately we have also removed some pieces of the actual data. The data that is left is all in the right place, however, so perhaps we can amplify and/or reconstruct it:

                //second pass, eliminate horizontal gaps
                for (int row = 0; row < image.getHeight(); row++) {
                        for (int column = 0; column < image.getWidth(); column++) {
                                int color = image.getRGB(column, row) & 0x000000FF;  //only need the last 8 bits
                                if (color == 255) {
                                        consecutiveWhite++;
                                }
                                else {
                                        if (consecutiveWhite < 3 && column > consecutiveWhite) {  
                                                for (int col = column - consecutiveWhite; col < column; col++) {
                                                        image.setRGB(col, row, BLACK);
                                                }
                                        }
                                        consecutiveWhite = 0;
                                }
                        }
                }
                consecutiveWhite = 0;
                
                //third pass, eliminate vertical gaps
                for (int column = 0; column < image.getWidth(); column++) {
                        for (int row = 0; row < image.getHeight(); row++) {
                                int color = image.getRGB(column, row) & 0x000000FF;  //only need the last 8 bits
                                if (color == 255) {
                                        consecutiveWhite++;
                                }
                                else {
                                        if (consecutiveWhite < 2 && row > consecutiveWhite) {
                                                for (int r = row - consecutiveWhite; r < row; r++) {
                                                        image.setRGB(column, r, BLACK);
                                                }
                                        }
                                        consecutiveWhite = 0;
                                }
                        }
                }

This code fills in short runs of white pixels with black pixels (runs of one or two pixels horizontally, and single pixels vertically), the rationale being that any small group of white pixels bounded at both ends by black pixels is very likely to be data that was erroneously discarded. Again we can take a peek at our result:

PHPBB2 CAPTCHA, after third pass

Getting better, but we’re not quite there yet. Our characters are much more distinct, but there is still some missing data. A fair bit of the missing data is now contained in small regions of white pixels that are actually encapsulated within our characters. Filling them in is a relatively simple matter:

                //fourth pass, attempt to fill regions
                for (int row = 0; row < image.getHeight(); row++) {
                        for (int column = 0; column < image.getWidth(); column++) {
                                if (image.getRGB(column, row) == WHITE) {
                                        int height = countVerticalWhite(image, column, row);
                                        int width = countHorizontalWhite(image, column, row);
                                        int area = width * height;
                                        if ((area <= 12) || (width == 1) || (height == 1)){
                                                image.setRGB(column, row, BLACK);
                                        }
                                }
                        }
                }
                
                //fifth pass repeats the fourth
                for (int row = 0; row < image.getHeight(); row++) {
                        for (int column = 0; column < image.getWidth(); column++) {
                                if (image.getRGB(column, row) == WHITE) {
                                        int height = countVerticalWhite(image, column, row);
                                        int width = countHorizontalWhite(image, column, row);
                                        int area = width * height;
                                        if ((area <= 12) || (width == 1) || (height == 1)){
                                                image.setRGB(column, row, BLACK);
                                        }
                                }
                        }
                }

Here we check, for each white pixel, how many adjacent white pixels exist both vertically and horizontally. This gives us a rough estimate of the size of the current region of white pixels. If the size is too small, then the code assumes that the white pixel is actually supposed to be part of the data, and turns it black. Note that the algorithm is methodical in its approach, in that when it detects a small region of white pixels, it toggles only the initial pixel that it tested in that region. This toggling reduces the region size reported for any adjacent white pixels, increasing the likelihood that they will be toggled as well on the next iteration, which is why two passes of the same algorithm are applied. And yes, I know having the same code repeated twice is poor coding style, but for illustrative purposes it gets the job done. Anyway, we now have:

PHPBB2 CAPTCHA, after fifth pass
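The `countVerticalWhite` and `countHorizontalWhite` helpers used in the fourth and fifth passes are not shown in this post. A minimal sketch of what they might look like follows, assuming they count the run of white pixels going down (or right) from the given position, including the starting pixel. That reading is a guess based on how `countVerticalWhite(image, column, 0)` is used later on. The `WHITE` and `BLACK` values and the `fillSmallRegions` wrapper are likewise assumptions for illustration.

```java
import java.awt.image.BufferedImage;

final class WhiteRunSketch {
    static final int WHITE = 0xFFFFFFFF; // assumed value: opaque white as returned by getRGB
    static final int BLACK = 0xFF000000; // assumed value: opaque black

    /** Counts consecutive white pixels going down from (column, row), including the start pixel. */
    static int countVerticalWhite(BufferedImage image, int column, int row) {
        int count = 0;
        while (row < image.getHeight() && image.getRGB(column, row) == WHITE) {
            count++;
            row++;
        }
        return count;
    }

    /** Counts consecutive white pixels going right from (column, row), including the start pixel. */
    static int countHorizontalWhite(BufferedImage image, int column, int row) {
        int count = 0;
        while (column < image.getWidth() && image.getRGB(column, row) == WHITE) {
            count++;
            column++;
        }
        return count;
    }

    /** One fill pass using the same rule as the fourth pass; call it in a loop instead of copying it. */
    static void fillSmallRegions(BufferedImage image) {
        for (int row = 0; row < image.getHeight(); row++) {
            for (int column = 0; column < image.getWidth(); column++) {
                if (image.getRGB(column, row) == WHITE) {
                    int height = countVerticalWhite(image, column, row);
                    int width = countHorizontalWhite(image, column, row);
                    if (width * height <= 12 || width == 1 || height == 1) {
                        image.setRGB(column, row, BLACK);
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(5, 5, BufferedImage.TYPE_INT_RGB);
        for (int row = 0; row < 5; row++) {
            for (int column = 0; column < 5; column++) {
                image.setRGB(column, row, WHITE);
            }
        }
        image.setRGB(0, 3, BLACK);
        System.out.println(countVerticalWhite(image, 0, 0));   // prints "3"
        System.out.println(countHorizontalWhite(image, 0, 0)); // prints "5"
    }
}
```

With a helper like this, the fourth and fifth passes reduce to calling `fillSmallRegions(image)` twice.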

Many of the gaps are now filled in, and the text is starting to look fairly legible. There are now, however, a few spurious black pixels that have cropped up along the edges of the characters. We could go back and refine the previous step, but instead let’s just prune out these outliers:

                //sixth pass, clear any false-positive
                for (int row = 0; row < image.getHeight(); row++) {
                        for (int column = 0; column < image.getWidth(); column++) {
                                if (image.getRGB(column, row) != WHITE) {
                                        if (countBlackNeighbors(image, column, row) < 3) {
                                                image.setRGB(column, row, WHITE);
                                        }
                                }
                        }
                }

This pruning step removes any black pixel that is bordered by fewer than 3 black pixels. This is a fairly strict threshold, and will have the effect of smoothing/rounding out corners (i.e. some legitimate data will be discarded), but it will also clear out any spurious black pixels that exist in the image. Now our image looks like so:

PHPBB2 CAPTCHA, after sixth pass
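The `countBlackNeighbors` helper used in the sixth pass is also not shown in this post. One plausible version is sketched below, assuming that only the four directly adjacent pixels (up, down, left, right) count as neighbors and that anything other than white counts as black. The original helper may well look at diagonals too, so treat the neighborhood, the class name, and the `WHITE` value as assumptions.

```java
import java.awt.image.BufferedImage;

final class NeighborSketch {
    static final int WHITE = 0xFFFFFFFF; // assumed value: opaque white as returned by getRGB

    /** Counts the non-white pixels directly above, below, left and right of (column, row). */
    static int countBlackNeighbors(BufferedImage image, int column, int row) {
        int[][] offsets = { {0, -1}, {0, 1}, {-1, 0}, {1, 0} };
        int count = 0;
        for (int[] offset : offsets) {
            int x = column + offset[0];
            int y = row + offset[1];
            boolean inside = x >= 0 && x < image.getWidth() && y >= 0 && y < image.getHeight();
            // Positions outside the image are treated as white, so they never count.
            if (inside && image.getRGB(x, y) != WHITE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // White 5x5 image with a black 3x3 block in the middle.
        BufferedImage image = new BufferedImage(5, 5, BufferedImage.TYPE_INT_RGB);
        for (int row = 0; row < 5; row++) {
            for (int column = 0; column < 5; column++) {
                boolean inBlock = row >= 1 && row <= 3 && column >= 1 && column <= 3;
                image.setRGB(column, row, inBlock ? 0xFF000000 : WHITE);
            }
        }
        System.out.println(countBlackNeighbors(image, 2, 2)); // prints "4" (center of the block)
        System.out.println(countBlackNeighbors(image, 1, 1)); // prints "2" (corner of the block)
    }
}
```

Under this four-neighbor assumption, a corner pixel of a solid block has only two black neighbors and is cleared by the `< 3` test, while a pixel on a straight edge has three and survives.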

The letters have taken on a softer, more rounded quality. They also happen to look vaguely reminiscent of what you might get if you were to scan a text document using an older scanner. That is worth mentioning, because we will eventually be feeding our cleaned-up CAPTCHA image to an optical-character-recognition program that is designed to process just this sort of data. First, however, our characters are all misaligned. We’ve come this far, so we might as well fix the alignment issue while we’re at it:

                //now find the characters
                List<CharacterBox> characters = new ArrayList<CharacterBox>();
                int totalCharWidth = 10;
                int maxCharHeight = 0;
                for (int column = 0; column < image.getWidth(); column++) {
                        int highestBlack = countVerticalWhite(image, column, 0);
                        if (highestBlack < image.getHeight()) {
                                totalCharWidth += 5; //5 px spacing in between chars
                                CharacterBox box = new CharacterBox();
                                box.setX(column);
                                while (column < image.getWidth() && countVerticalWhite(image, column, 0) < image.getHeight()) {
                                        int currentBlack = countVerticalWhite(image, column, 0);
                                        if (currentBlack < highestBlack) {
                                                highestBlack = currentBlack;
                                        }
                                        column++;
                                }
                                box.setWidth(column - box.getX());
                                int top = Math.max(0, highestBlack - 5); //5 px padding, clamped so it stays inside the image
                                box.setY(top);
                                box.setHeight(image.getHeight() - top); //can trim this later
                                if (box.getHeight() > maxCharHeight) {
                                        maxCharHeight = box.getHeight();
                                }
                                totalCharWidth += box.getWidth();
                                characters.add(box);
                        }
                }

Here we simply compute a bounding box for each distinct region of black pixels (i.e. each character), plus some additional padding so that our output image will draw nicely. Speaking of output image, we can now create it by positioning our characters in correct alignment with each other in a new image, like so:

                //output a new image with aligned characters
                BufferedImage dst = new BufferedImage (totalCharWidth, maxCharHeight,
                                                           BufferedImage.TYPE_INT_BGR);
                for (int column = 0; column < dst.getWidth(); column++) {
                        for (int row = 0; row < dst.getHeight(); row++) {
                                dst.setRGB(column, row, WHITE);
                        }
                }
                int xPos = 5;
                int yPos = 0;
                for (CharacterBox box : characters) {
                        for (int oldY = box.getY(); oldY < box.getY() + box.getHeight(); oldY++) {
                                for (int oldX = box.getX(); oldX < box.getX() + box.getWidth(); oldX++) {
                                        dst.setRGB(xPos + (oldX - box.getX()), yPos + (oldY - box.getY()), image.getRGB(oldX, oldY));
                                }
                        }
                        xPos += box.getWidth() + 5;
                }
                ImageIO.write(dst, "png", new File(OUTPUT));

Now we have the following:

PHPBB2 CAPTCHA, fully processed

The characters are nicely aligned and uniformly spaced. We now have something that is suitable for sending into a character-recognition program. For this example we use tesseract, a free and open-source OCR program that provides a good level of accuracy. We can send our output to tesseract like so:

                Process tesseractProc = Runtime.getRuntime().exec(TESSERACT_BIN + " " + OUTPUT + " " + TESSERACT_OUTPUT);
                tesseractProc.waitFor();

This invokes tesseract on our output image, and it writes its results to a text file at the location given by ‘TESSERACT_OUTPUT’ (tesseract treats that argument as a base name and appends ‘.txt’ to it). In this case, the text file contains the following:

IKEECL

…which is 100% correct.
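For anyone wiring this step up themselves, here is a hedged sketch of running tesseract with `ProcessBuilder` and reading the recognized text back in. The class and method names are hypothetical and not part of the original project. The only things assumed about tesseract are the `tesseract <image> <outputbase>` invocation used above and the ‘.txt’ suffix it adds to the output base name.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

final class TesseractSketch {

    /** Builds the argument list; passing a list avoids problems with spaces in paths. */
    static List<String> buildCommand(String tesseractBin, String imagePath, String outputBase) {
        return Arrays.asList(tesseractBin, imagePath, outputBase);
    }

    /** Reads the recognized text from "<outputBase>.txt" and trims surrounding whitespace. */
    static String readResult(String outputBase) throws IOException {
        Path resultFile = Paths.get(outputBase + ".txt");
        return new String(Files.readAllBytes(resultFile), StandardCharsets.UTF_8).trim();
    }

    /** Runs tesseract, waits for it to finish, and returns the recognized text. */
    static String recognize(String tesseractBin, String imagePath, String outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tesseractBin, imagePath, outputBase))
                .inheritIO() // let tesseract write its messages to our console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return readResult(outputBase);
    }

    public static void main(String[] args) {
        // Only shows the command that would be run; the file names here are placeholders.
        System.out.println(buildCommand("tesseract", "captcha-out.png", "captcha-result"));
    }
}
```

Checking the exit code before reading the file avoids silently picking up a stale result from an earlier run.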

Using a handful of very simple image filtering loops based around a brief examination of how a human being would approach the image, and some existing OCR software, the CAPTCHA has been defeated. Of course, this only works for this one specific style of CAPTCHA, but the basic approach of reducing noise, amplifying data, and isolating characters should be broadly applicable to a wide range of different CAPTCHA styles. The challenge lies not in breaking one particular CAPTCHA, but in devising an algorithm that can attempt to break any number of different CAPTCHA styles dynamically and with a success rate comparable to that of a human being. Such an algorithm would need a way to determine, from the CAPTCHA image itself, what kind of noise exists and how it should best be removed. That is the real challenge, and it’s beyond the scope of this article.

Note that for the sake of brevity I’ve left out the implementation of some minor utility functions, variable declarations and the like. In general, you can assume that a function (or variable) does what its name implies. If, however, you would like a complete copy of the source code used, you can download it using this link (zipped Eclipse project).

Note that in order to get it to run you will also need to install tesseract on your system, and edit the values at the start of the Java code to point at your local tesseract installation.
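For readers working only from the snippets in this post, the `CharacterBox` type used in the character-finding step is one of the omitted pieces. Judging purely from the getters and setters called above, it is probably little more than a bounding-box holder. The following is a guess at its shape and not the class from the downloadable project:

```java
/** Plain holder for a character's bounding box, in pixel coordinates of the source image. */
final class CharacterBox {
    private int x;      // left edge
    private int y;      // top edge
    private int width;
    private int height;

    int getX() { return x; }
    void setX(int x) { this.x = x; }

    int getY() { return y; }
    void setY(int y) { this.y = y; }

    int getWidth() { return width; }
    void setWidth(int width) { this.width = width; }

    int getHeight() { return height; }
    void setHeight(int height) { this.height = height; }

    public static void main(String[] args) {
        CharacterBox box = new CharacterBox();
        box.setX(12);
        box.setWidth(20);
        System.out.println(box.getX() + box.getWidth()); // prints "32"
    }
}
```

The other omitted declarations (`WHITE`, `BLACK`, `darkRegion`, `consecutiveWhite`, `OUTPUT`, `TESSERACT_BIN`, `TESSERACT_OUTPUT`) are presumably simple constants and local variables. Their exact values are in the downloadable project and are not reproduced here.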


30 Responses to [Java] Defeating CAPTCHA Images

  1. kele says:

    please send me full code, I really need it . Thank you very much!

  2. crintevn says:

    Dear,

    I’m need this code full. Please send for me. Thanks you
    email: crintevn@gmail.com

  3. caikeo says:

    it is very good for me,can you sent your full code for me,please?

  4. james_grrr says:

    Brilliant man, absolutely brilliant !!!, I would love to get more detail around this (for educational purposes only of course), (e.g. written descriptions, examples, source code, hints, gotchas, anything really ;) Thank you for what you’ve provided here so far, it’s a God send !!!

    • james_grrr says:

      Oops, didn’t see the link to the eclipse project source, gracias for that. I would absolutely love to see some algorithms that could detect if the text was curvy, straight, thin, fat, two or more words with different styles spaced apart from each other …. ;)

  5. mjoode says:

    I got this error when I try to run it

    Exception in thread “main” java.lang.ArrayIndexOutOfBoundsException: Coordinate out of bounds!
    at sun.awt.image.ByteInterleavedRaster.getDataElements(Unknown Source)
    at java.awt.image.BufferedImage.getRGB(Unknown Source)
    at au.com.suncoastpc.captcha.CaptchaProcessor.main(CaptchaProcessor.java:181)

    can any one help me ?

  6. randi says:

    how to defeating captcha images in here…

    http://cp27.web.id/1.bmp

    i try your code

    I got this error when I try to run it

    Exception in thread “main” java.lang.ArrayIndexOutOfBoundsException: Coordinate out of bounds!
    at sun.awt.image.ByteInterleavedRaster.getDataElements(Unknown Source)
    at java.awt.image.BufferedImage.getRGB(Unknown Source)
    at au.com.suncoastpc.captcha.CaptchaProcessor.main(CaptchaProcessor.java:181)

  7. Azat says:

    Could you please send me full code to azatmar@rambler.ru

  8. Carlos Sarmiento says:

    From Colombia, Could you please send me full code to carlossarmientor@gmai.com

  9. Navyasree says:

    please send the full code of image captcha n thanks in advance..

  10. Arthur says:

    From Brasil. Could you send me full code to arthur.parahyba@gmail.com?

  11. Chandana says:

    thanks for your code but i do got the same exception when tried to execute on other images can you please tell me where to modify the code so that it works for all the images. thank u

  12. Leon says:

    From Taiwan. Could you send me full code to me3540@yahoo.com.tw?

  13. Alan says:

    Can I also have you full code?
    alam@weedstudio.hk

  14. Betoa says:

    How would a picture of a Down ? Base 64

    

  15. Andrey says:

    Please send me full code

  16. Pravin Nawale says:

    This is very helpful.
    Kindly send me the full code to the mentioned email id

  17. Lux Go says:

    Great!!!
    Can I also have your full code?
    luxgo.vn@gmail.com

  18. laxmi says:

    please send me full code it looks good.

  19. sanjay negi says:

    can you send to me full code and please tell what is character box

  20. Mope says:

    Please send me the full code to mmope.lephoto@gmail.com

  21. arun kumar says:

    this algos working fine. can you send me the complete code on my mailId : 04arun1986@gmail.com. Actually I am enable to perform last three steps accuratly.

  22. Gautam says:

    Hi I am very excited to see full code. Kindly share to gautam.kakadiya37@gmail.com

  23. Jonh says:

    This is very helpful.
    Kindly send me the full code to the mentioned email id : rootingspy@gmail.com

  24. Vinicius says:

    Hi,
    Could you send me the full code pls ?
    Email: vinicius647@gmail.com

  25. CROELAN GRANDEZ says:

    Hi friend,
    Could you send me the full code java please…
    My email is croelanjr@gmail.com
    regards

  26. Sreeni says:

    hi,

    There is an urgent requirement for this captcha decode. Can you please share me the code to my email address spaluvuri@gmail.com
