OCR scanned text

Optical Character Recognition (OCR) is the process of turning printed text into electronic text. Using it under Ubuntu is a breeze, as follows:

1. Start by using Synaptic to search for and install gocr. This is optical character recognition software that integrates into XSane, Ubuntu's scanner program. Once installed, it doesn't create an Applications menu entry.

2. Instead, gocr is accessed through XSane, so start the program (Applications → Graphics → XSane Image Scanner). Before scanning, you must choose settings conducive to good OCR, so, on the main XSane control panel, set the image type dropdown list to Gray and the resolution dropdown to 300. These two dropdowns aren't labelled but can be found roughly in the middle of the XSane configuration window, as shown in Figure 3.55, on the next page.

3. In the XSane Preview window, click the Acquire Preview button. This will run a preview scan. In the resulting image, drag the selection box in from the edges of the image so that it tightly encloses the text area you want to scan. Ensure you crop out as much of the surrounding area as possible; this will help avoid errors in the OCR output.

Figure 3.55: Changing the resolution and color settings of the OCR scan (see Tip 297, on the previous page)

4. Back in the main XSane control panel window, click the Scan button.

5. Once the scan is complete and the image viewer window appears, rotate the image so it's the right way up using the relevant toolbar buttons (if necessary). Then click File → OCR - Save As Text. A dialog box will then pop up asking you for the name of the file you'd like to create. After you click the Save button, the OCR process will start and might take some time to complete, depending on the complexity of the scanned page. Alas, no progress display is provided, although the image viewer window will remain grayed out and unresponsive until the OCR process has completed.
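
If you'd rather not go through the image viewer each time, gocr can also be run on its own against a scan you've already saved. What follows is only a minimal sketch in Python, assuming the scan was saved in PNM format (which gocr reads natively) and that gocr is on your path. The function names and file names are my own inventions for illustration; the only gocr options used are -i (input image) and -o (output text file).

```python
import shutil
import subprocess


def build_gocr_command(image_path, text_path):
    """Return the argument list for a single gocr run.

    -i names the input image and -o names the text file to write.
    """
    return ["gocr", "-i", image_path, "-o", text_path]


def ocr_image(image_path, text_path):
    """Run gocr on a saved scan, failing clearly if gocr isn't installed."""
    if shutil.which("gocr") is None:
        raise FileNotFoundError("gocr not found; install it first")
    # check=True raises CalledProcessError if gocr exits with an error.
    subprocess.run(build_gocr_command(image_path, text_path), check=True)
```

Calling ocr_image("page.pnm", "page.txt") would then do roughly what the OCR - Save As Text menu entry does, minus the wait in front of a grayed-out window.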

Once the OCR process has completed, take a look at the output file. It's unlikely this will be perfect and you should definitely check it against the original source to correct errors. I noticed that apostrophes seem to cause problems with the character recognition. You might even want to try scanning again, this time perhaps altering the brightness and contrast settings in the main XSane control panel window before scanning.
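
To speed up the checking, a few lines of Python can point out the lines most likely to need attention. This is a rough sketch resting on one assumption: that gocr writes an underscore wherever it can't recognize a character, which I believe is its default behavior. If your output uses a different marker, pass that instead. The function name is hypothetical.

```python
def suspect_lines(text, marker="_"):
    """Return (line number, line) pairs for lines containing the marker.

    The default marker "_" is an assumption about how gocr flags
    characters it could not recognize.
    """
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        if marker in line:
            found.append((number, line))
    return found
```

Feed it the contents of the output file and compare each reported line against the original page. It won't catch characters that were recognized wrongly, only those gocr gave up on, so a full read-through is still worthwhile.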

Perhaps it goes without saying that less complex documents tend to OCR better: straight text on a page is likely to produce a better result than complex magazine layouts involving pictures, colored backgrounds, and different fonts and sizes. If you have to scan such documents, it might be worth working through the page piece by piece: select each column or block of text in the image scan preview window, scan it separately, and run an OCR pass on it.
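
Scanning piece by piece leaves you with one text file per column or block. Here is a small sketch of how you might stitch them back together in reading order, assuming you saved each OCR pass under its own file name. The function name and the file names in the usage note are made up for illustration.

```python
from pathlib import Path


def merge_text_parts(part_paths, merged_path):
    """Join per-block OCR text files in the order given.

    Blocks are separated by one blank line; empty files are skipped.
    Returns the merged text (without the final newline written to disk).
    """
    parts = []
    for path in part_paths:
        content = Path(path).read_text(encoding="utf-8", errors="replace").strip()
        if content:
            parts.append(content)
    merged = "\n\n".join(parts)
    Path(merged_path).write_text(merged + "\n", encoding="utf-8")
    return merged
```

For example, merge_text_parts(["column1.txt", "column2.txt"], "page.txt") would produce a single page.txt with the first column's text followed by the second's. The order of the list is the order of the output, so list the blocks in the order you'd read them.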
