Tesseract is the most accurate program I have found for converting images to text on Ubuntu/Linux. I've tried several OCR (Optical Character Recognition) applications, and its accuracy was higher than that of any of the others.

Tesseract is a simple, easy-to-use command-line utility. It is cross-platform and, of course, free and open source software. It accepts various input image formats and can recognize text in 60+ languages.

Installing Tesseract in Ubuntu / Linux

sudo apt-get install tesseract-ocr

You can also install additional language packages if required. On Ubuntu they are named tesseract-ocr-<code>, for example tesseract-ocr-hin for Hindi.

Before you start using Tesseract, convert your files (png/jpg) to tif format, the input format that older Tesseract releases require; newer releases can usually read png and jpg directly. Use the following command (you may need to install the imagemagick package):

convert file_name.png out_file_name.tif
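If you have a whole folder of scans, the same conversion can be looped. The following is only a sketch: the function name png_to_tif_all is made up for illustration, and it assumes ImageMagick's convert is on your PATH and that the files end in .png.

```shell
# Hypothetical helper: convert every PNG in a directory to TIFF.
# Assumes ImageMagick's `convert` is installed and on PATH.
# The body is wrapped in ( ... ) so its variables stay out of your shell.
png_to_tif_all() (
    dir="${1:-.}"                      # default to the current directory
    for src in "$dir"/*.png; do
        # Skip the unexpanded pattern when the directory has no PNG files.
        [ -e "$src" ] || continue
        dst="${src%.png}.tif"          # scan.png -> scan.tif
        # Print the name of each TIFF that was written successfully.
        convert "$src" "$dst" && echo "$dst"
    done
)
```

After pasting the function into your shell, calling png_to_tif_all with a directory (or with no argument for the current one) prints the path of each tif file it creates.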

Now, you can try reading the content using Tesseract.

tesseract your_scanned_file.tif output_content
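The conversion and recognition steps can also be chained into one small function. This is a rough sketch rather than anything official: ocr_image is a hypothetical name, it assumes both convert and tesseract are installed, and it relies on Tesseract appending .txt to the output name you give it.

```shell
# Hypothetical wrapper: convert one image to TIFF, OCR it, and print the text.
# Assumes `convert` (ImageMagick) and `tesseract` are on PATH, and that the
# input file name has an extension such as .png or .jpg.
ocr_image() (
    src="$1"
    base="${src%.*}"                   # strip the extension: page.png -> page
    convert "$src" "$base.tif" || exit 1
    # Tesseract appends .txt to the output base name it is given.
    tesseract "$base.tif" "$base" || exit 1
    cat "$base.txt"
)
```

Calling ocr_image with the path of a scanned png leaves a tif and a txt file next to the original and prints the recognized text to the terminal.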

The results will be saved to the file output_content.txt. To OCR text in another language, pass its language code with the additional -l parameter (you first have to install that language pack, of course).

For example, to scan images that contain Hindi or Sanskrit text, you can use this command:

tesseract your_scanned_page.tif output_content -l hin
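Since -l only works when the matching language pack is installed, it can help to check first. The sketch below assumes your Tesseract release supports the --list-langs option; the function names has_lang and ocr_with_lang are invented for this example.

```shell
# Hypothetical helper: succeed only if the given language code is installed.
# Assumes a Tesseract release that supports the --list-langs option.
has_lang() {
    # Some releases print the list on stderr, so merge both streams.
    # -x matches whole lines only, -F treats the code as a literal string.
    tesseract --list-langs 2>&1 | grep -qxF "$1"
}

# Hypothetical helper: OCR a file in the given language, or say what is missing.
# Arguments: input image, output base name, language code.
ocr_with_lang() {
    if has_lang "$3"; then
        tesseract "$1" "$2" -l "$3"
    else
        echo "language pack '$3' is not installed" >&2
        return 1
    fi
}
```

With these defined, calling ocr_with_lang with your_scanned_page.tif, output_content and hin runs the command above only if the Hindi pack is present, and prints a short message otherwise.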

Visit the official page for more details about the project.
