Converting Graphics to Text: OCR

Scanners are fundamentally graphics devices: their product is a bitmap graphics stream, which is easily displayed in an X window or saved in a graphics file. Sometimes, though, the purpose of scanning a document is to convert it to text in order to edit it in a word processor, load data into a spreadsheet, or otherwise manipulate it in a nongraphical way. Optical character recognition (OCR) programs exist for this purpose. These programs accept a graphics file as input and generate a text file that corresponds to the characters in the input file; essentially, the OCR package "reads" the characters out of the image. This is an extremely challenging task for a computer program. The software must overcome many obstacles, including streaks and blotches in the input file, the varying sizes and appearances of characters in different fonts, and the presence of nontextual information, such as embedded graphics. As a result, OCR software tends to be imperfect, but it's often good enough to be worth using. Typically, you'll scan a document, run OCR on it, and then proofread the result against the original, making whatever corrections are appropriate. Here are the main Linux OCR packages:

Clara This program, based at http://www.claraocr.org, is intended for large-scale OCR projects, such as converting out-of-print books to digital format. The program includes an X-based GUI, but it doesn't interface directly to scanners. Thus, you must scan your documents into files and then use Clara on them.

GOCR This command-line OCR program is headquartered at http://jocr.sourceforge.net. Because it works from the command line, it can be called by other programs, such as XSane or Kooka, to provide them with OCR capabilities.

OCR Shop This is a line of commercial OCR packages for Linux. It's a much more mature product than the open-source Clara or GOCR packages, but OCR Shop is also a very pricey product, with the entry-level package going for close to $1,500. OCR Shop doesn't use SANE as a back-end, so you must be sure that your scanner is supported before you buy the program. Check http://www.vividata.com for more information.
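Because GOCR runs from the command line, a script can call it in much the same way that XSane or Kooka does. The following Python sketch shows one way this might look. It assumes that the gocr binary is installed and on your PATH, and it relies on GOCR's -i (input image) and -o (output text file) options. The function names and the file names in the usage note are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess


def build_gocr_command(image_path, text_path):
    """Return the argument list for a basic GOCR run (-i input, -o output)."""
    return ["gocr", "-i", image_path, "-o", text_path]


def run_gocr(image_path, text_path):
    """Run GOCR on an already-scanned image and return the recognized text."""
    # Assumption: the executable is named "gocr" and is on PATH.
    if shutil.which("gocr") is None:
        raise FileNotFoundError("gocr is not installed or not on PATH")
    subprocess.run(build_gocr_command(image_path, text_path), check=True)
    # OCR output may contain odd bytes, so decode leniently.
    with open(text_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

With a page already scanned to a file such as page.pnm, calling run_gocr("page.pnm", "page.txt") would write the recognized text to page.txt and return it for proofreading.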

As an example of OCR in action, consider using GOCR from XSane. Follow these steps:

1. If necessary, install the GOCR package from your distribution or from the GOCR web page.

2. Launch XSane. Leave the XSane Mode set to Viewer; you'll acquire an image into the viewer and then have the viewer run GOCR.

3. Be sure that XSane is set to acquire a grayscale or a line-art image.

4. Acquire a preview by clicking the Acquire Preview button in the preview window.

5. Select the portion of the document you want to scan in the preview window.

6. Set the scanning resolution to between 150dpi and 300dpi; this range tends to produce the best OCR results.

7. Click Scan to scan the document. XSane should open a window in which the document is displayed. Chances are this window will be very large.

8. In the scanned document window, select File > OCR - Save as Text. The program displays a file selection dialog box in which you enter a filename.

9. Type in a filename, and click OK in the file selection dialog box. XSane doesn't show any indication that GOCR is working, but it is. Within a few seconds, the file you specified should be created and contain the text equivalent of the scanned file.
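The steps above can also be approximated without the GUI. The Python sketch below is not what XSane does internally; it is a hedged stand-in that assumes the scanimage tool from SANE and the gocr binary are both on your PATH. The --mode and --resolution options and the mode names Gray and Lineart are provided by many SANE back-ends but not all, so check the options your scanner's back-end reports before relying on them. The function names are hypothetical.

```python
import subprocess

# Grayscale or line art, matching step 3. The exact mode names are an
# assumption; they vary between SANE back-ends.
OCR_MODES = {"Gray", "Lineart"}


def build_scan_command(resolution_dpi=300, mode="Gray"):
    """Build a scanimage command line that writes a PNM image to stdout."""
    if mode not in OCR_MODES:
        raise ValueError("use a grayscale or line-art mode for OCR")
    # Step 6 recommends 150 to 300 dpi for the best OCR results.
    if not 150 <= resolution_dpi <= 300:
        raise ValueError("resolution should be between 150 and 300 dpi")
    return ["scanimage", "--format=pnm", "--mode", mode,
            "--resolution", str(resolution_dpi)]


def scan_and_ocr(image_path, text_path, resolution_dpi=300, mode="Gray"):
    """Scan a page to image_path, then have GOCR write text to text_path."""
    with open(image_path, "wb") as image:
        subprocess.run(build_scan_command(resolution_dpi, mode),
                       stdout=image, check=True)
    subprocess.run(["gocr", "-i", image_path, "-o", text_path], check=True)
```

Unlike the XSane procedure, this sketch scans the whole page rather than a selected region, so the preview and selection steps have no equivalent here.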

Unfortunately, GOCR's output isn't always as good as you might hope. As I write, GOCR is at version 0.37—in other words, it's a very early work. Its accuracy is likely to improve as its version number climbs, so check back with the GOCR website frequently if accurate OCR is important to you. You may also be able to improve GOCR's accuracy by adjusting various scanning parameters, such as the resolution, contrast, and brightness.
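When you experiment with resolution, contrast, and brightness, it helps to have a rough number for how close each result is to the original. The Python sketch below is one possible approach: it compares OCR output with a passage you have proofread by hand, using the standard library's difflib. The ratio it returns is a general text-similarity score, not a formal character error rate, and the function names and setting labels are hypothetical.

```python
import difflib


def ocr_similarity(ocr_text, reference_text):
    """Return a similarity ratio from 0.0 to 1.0 between two texts."""
    # autojunk=False keeps long texts from having common characters ignored.
    matcher = difflib.SequenceMatcher(None, ocr_text, reference_text,
                                      autojunk=False)
    return matcher.ratio()


def best_setting(results, reference_text):
    """Pick the label whose OCR output is closest to the reference text.

    results maps a label (for example, a resolution) to the OCR text
    produced with that setting.
    """
    return max(results,
               key=lambda label: ocr_similarity(results[label], reference_text))
```

For example, you might scan the same page at 150 dpi and at 300 dpi, run GOCR on each image, and pass both outputs to best_setting along with the proofread text to see which setting came closer.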
