OCR (Optical Character Recognition) with Tesseract
This page is a simple tutorial for the OCR application Tesseract. Tesseract is an open-source project, available at Google Code: Tesseract-OCR.
Tesseract is useful for creating an editable digital version of an article or even a book. In this tutorial, I will use a page from the Dutch book "Leven in mijn tuin", which was written by my grandmother.
The picture, taken with a Canon EOS 1000D on a tripod, using a remote control (original size 3888 x 2592 pixels)
The same picture, now cropped.
Result after using Tesseract (tesseract Crop.tif Output)
Result after using Tesseract, now using the Dutch language files (tesseract Crop.tif Output -l nld)
A bit more accurate, but still completely unreadable. The reason is the presence of colours in the picture, i.e. the picture is not in grayscale yet.
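The two Tesseract invocations used so far can also be scripted. The following is a minimal Python sketch, assuming the tesseract executable is installed and on the PATH. The helper names build_command and run_ocr are made up for this illustration; they only assemble and run the same command lines shown in the captions above.

```python
import subprocess


def build_command(image, output_base, lang=None):
    """Assemble a tesseract command line as a list of arguments."""
    cmd = ["tesseract", image, output_base]
    if lang:
        # -l selects the language data, e.g. "nld" for Dutch
        cmd += ["-l", lang]
    return cmd


def run_ocr(image, output_base, lang=None):
    """Run tesseract; it writes the recognised text to <output_base>.txt."""
    subprocess.run(build_command(image, output_base, lang), check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_ocr(...) to actually run them.
    print(" ".join(build_command("Crop.tif", "Output")))
    print(" ".join(build_command("Crop.tif", "Output", lang="nld")))
```

Calling run_ocr("Crop.tif", "Output", lang="nld") should be equivalent to typing the second command by hand, with the text ending up in Output.txt.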
Same picture, now in grayscale
A much better result! (tesseract Crop.tif Output -l nld)
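For readers who want to see what the grayscale step does to the pixels, here is a minimal sketch in plain Python. It assumes the picture is already available as rows of (R, G, B) tuples with values from 0 to 255, and it uses the common ITU-R BT.601 luma weights. The function name to_grayscale is hypothetical, and this is not necessarily how the pictures in this tutorial were converted; any image editor can do the same job.

```python
def to_grayscale(rows):
    """Convert rows of (R, G, B) pixels to rows of 0-255 gray values."""
    gray_rows = []
    for row in rows:
        # Weighted sum of the three channels (BT.601 luma), integer arithmetic
        gray_rows.append([(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row])
    return gray_rows


if __name__ == "__main__":
    page = [[(255, 255, 255), (200, 30, 30)],   # white paper, reddish pixel
            [(0, 0, 0), (30, 30, 200)]]         # black ink, bluish pixel
    print(to_grayscale(page))
```

After this step every pixel carries a single brightness value, so differences in hue no longer play a role.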
However, there are still quite a few errors. This is because of the multi-column layout: Tesseract doesn't work that well with this kind of layout. After editing the picture into a one-column layout, the results improve once more. The last step is to adjust the brightness and contrast of the picture.
The results from the first version of the one-column layout are fine, except for the bottom-right corner. When the contrast is increased, it becomes clear that the bottom-right corner is a bit darker than the rest of the page. A quick edit gives an almost flawless result!
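The brightness and contrast adjustment can be sketched in plain Python as well. This assumes a grayscale picture stored as rows of values from 0 to 255 and a simple linear formula around mid-gray (128). The function name adjust and its parameter values are illustrative only; they are not the settings used for the pictures above.

```python
def adjust(rows, brightness=0, contrast=1.0):
    """Scale values around mid-gray (128), shift them, and clamp to 0-255."""
    result = []
    for row in rows:
        new_row = []
        for value in row:
            v = (value - 128) * contrast + 128 + brightness
            # Keep the result inside the valid 0-255 range
            new_row.append(max(0, min(255, int(round(v)))))
        result.append(new_row)
    return result


if __name__ == "__main__":
    # A darker corner: grayish background (150) with darker text (90)
    corner = [[150, 90, 150]]
    print(adjust(corner, brightness=30, contrast=1.5))
```

A contrast factor above 1 widens the gap between text and background, while the brightness offset lifts a region that is darker than the rest of the page.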
I hope you learned something from this little tutorial. A quick recap:
Provide a picture of decent quality, i.e. with as high a resolution (dpi/ppi) as possible.
Edit the picture (crop, grayscale, adjust brightness and contrast).
Try different parameters in the Tesseract application (language, batch, nobatch, batch.nochop, etc.).
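To try several parameter combinations in one go, the command lines can be generated in a loop. This is a rough sketch, assuming that config names such as batch.nochop are passed as trailing arguments after the language option and that the listed language and config files are installed. The function name command_variants and the output naming scheme are made up for this example.

```python
import itertools


def command_variants(image, languages, configs):
    """Yield one tesseract command per (language, config) combination."""
    for lang, config in itertools.product(languages, configs):
        # Each variant gets its own output base so the results can be compared
        output_base = "Output_{}_{}".format(lang, config.replace(".", "_"))
        yield ["tesseract", image, output_base, "-l", lang, config]


if __name__ == "__main__":
    configs = ["batch", "nobatch", "batch.nochop"]
    for cmd in command_variants("Crop.tif", ["nld"], configs):
        print(" ".join(cmd))
```

Each generated list can be passed to subprocess.run(cmd, check=True) to run it, after which the output text files can be compared side by side.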
Questions or suggestions? Don't hesitate to send me an e-mail, or, for more information, please visit the website of Tesseract at Google Code.