In my experience, Tesseract is the best program for converting images to text on Ubuntu/Linux. I've tried several OCR (Optical Character Recognition) applications, and its accuracy has been noticeably higher than that of the others.
Tesseract is a simple, easy-to-use command-line utility. It's a cross-platform application and, of course, it's free and open source software! It accepts various input image formats and can recognize text in 60+ languages.
Installing Tesseract in Ubuntu / Linux
sudo apt-get install tesseract-ocr
You can also install additional language packs if required. On Ubuntu they are packaged as tesseract-ocr-<language code>, e.g. tesseract-ocr-hin for Hindi.
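As a rough sketch of installing a language pack: the snippet below builds the package name for a language code and prints the install command rather than running it. It assumes Ubuntu's tesseract-ocr-<code> package naming; the variable names lang and pkg are only for illustration.

```shell
# Tesseract language code (hin = Hindi); change as needed.
lang="hin"

# Assumption: Ubuntu names language packs tesseract-ocr-<code>.
pkg="tesseract-ocr-${lang}"

# Print the install command so you can review it before running it yourself.
echo "sudo apt-get install ${pkg}"

# If Tesseract is already installed, show the languages it can currently use.
# (Very old releases may not support --list-langs, hence the "|| true".)
if command -v tesseract >/dev/null 2>&1; then
    tesseract --list-langs || true
fi
```

Once the pack is installed, its code should appear in the list printed by the last command.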
Older versions of Tesseract only accept tif input, so before you start you may need to convert your files (png/jpg) to tif format; recent versions can read png and jpg directly. To convert, use the following command (you may need to install the imagemagick package):
convert file_name.png out_file_name.tif
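If you have many images, the conversion can be wrapped in a small loop. This is only a sketch: convert_dir and the DRY_RUN switch are hypothetical names made up for this example, and it assumes the images sit directly in the directory you pass in.

```shell
# Convert every png/jpg in a directory to tif, keeping the base file name.
# With DRY_RUN=1 the commands are only printed, not executed.
convert_dir() {
    dir=$1
    for img in "$dir"/*.png "$dir"/*.jpg; do
        # An unmatched glob stays as literal text; skip it.
        [ -e "$img" ] || continue
        # Strip the last extension and append .tif
        out="${img%.*}.tif"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "convert $img $out"
        else
            convert "$img" "$out"
        fi
    done
}
```

Running DRY_RUN=1 convert_dir . first lets you preview the commands; convert_dir . then performs the conversion. Note that a.png and a.jpg in the same directory would both map to a.tif.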
Now, you can try reading the content using Tesseract.
tesseract your_scanned_file.tif output_content
The results will be saved to the output_content.txt file. To OCR text in another language, pass the language code as an additional parameter with -l (and of course, you first have to install that language pack).
For example, to scan images that contain Hindi or Sanskrit text, you can use this command:
tesseract your_scanned_page.tif output_content -l hin
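The same command can be repeated over a whole folder of scans. The following is a minimal sketch, not an official recipe: ocr_dir and DRY_RUN are hypothetical names for this example, and the language defaults to eng when none is given.

```shell
# Run Tesseract on every tif in a directory.
# Usage: ocr_dir <directory> [language code]
# With DRY_RUN=1 the commands are only printed, not executed.
ocr_dir() {
    dir=$1
    lang=${2:-eng}
    for tif in "$dir"/*.tif; do
        # An unmatched glob stays as literal text; skip it.
        [ -e "$tif" ] || continue
        # Tesseract appends .txt to the output base name itself.
        base="${tif%.tif}"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "tesseract $tif $base -l $lang"
        else
            tesseract "$tif" "$base" -l "$lang"
        fi
    done
}
```

For instance, ocr_dir scans hin would write a .txt file next to each tif in a folder named scans, using the Hindi language pack.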
Visit the official project page for more details.