Friday, October 22, 2010

Simple Solution: Extract text from Protected PDF document

There are just two stages: first convert the PDF page into a TIFF image, then run OCR (optical character recognition) software over it to get the text. Here is a step-by-step guide using freeware tools.

A common hurdle many users face is: “How do I copy text from a protected PDF file?”

When a PDF file is protected, the feature that lets you copy selected text and paste it elsewhere is disabled.

The text can still be obtained easily from a protected PDF file. To do so, the following two freeware programs are required:

  1. IrfanView from http://www.irfanview.com : Free for non-commercial use, 32-Bit graphic viewer
  2. SimpleOCR from http://www.simpleocr.com : Freeware OCR software

How to do it?

Step 1: Open the PDF document, go to the desired page, and capture the screen. You can use the Print Screen key (Prt Scr) on the keyboard. This key sits in the same section as the Scroll Lock and Pause/Break keys.

Step 2: Open IrfanView and press Ctrl + V to paste the screenshot of the PDF page. Alternatively, click on the "Edit" menu and choose "Paste" from it.

Step 3: This step is optional but recommended. Select the area of the image that contains the desired text by dragging a box with the mouse.

Step 4: Again an optional step. If a box was drawn as described in Step 3, click on the "Edit" menu and choose “Crop selection”. This removes the unwanted parts of the image.

Note: Even if you have not performed Steps 3 and 4, you can go directly to Step 5.

Step 5: This is an important step. Press Ctrl + G or click on the "Image" menu and select “Convert to Grayscale”. This operation removes the color information from the image and makes it grayscale (similar to black and white).
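To make concrete what "removing the color information" means, here is a minimal Python sketch. It assumes the widely used ITU-R BT.601 luma weights (0.299, 0.587, 0.114); the exact formula IrfanView applies is not stated here, so the weights and the function names `to_gray` and `image_to_gray` are illustrative assumptions only.

```python
def to_gray(r, g, b):
    """Collapse one RGB pixel (0-255 per channel) to a single gray level."""
    # Assumed weights: ITU-R BT.601 luma. Green counts most, blue least.
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def image_to_gray(pixels):
    """Convert rows of (r, g, b) tuples into rows of gray levels."""
    return [[to_gray(*px) for px in row] for row in pixels]


if __name__ == "__main__":
    # A tiny made-up 2x2 "page": white paper, black ink, red and blue marks.
    page = [[(255, 255, 255), (0, 0, 0)],
            [(255, 0, 0), (0, 0, 255)]]
    print(image_to_gray(page))
```

Each pixel ends up as one number instead of three, which is why colored highlights or backgrounds stop confusing the OCR step later on.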

Step 6: Now it’s time to save the file. Click on the "File" menu and select “Save as”. Set the “Save as type” to “TIFF” and save the file to the desired path.

At this stage we have saved the image of the entire page of the protected PDF file, or the cropped part of it (from Steps 3 and 4), in grayscale TIFF format, which is the input that SimpleOCR requires.
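For readers who would rather script Steps 3 to 6 than click through IrfanView, the same crop, grayscale, and save-as-TIFF sequence can be sketched in Python. This is not part of the freeware workflow described here: it assumes the third-party Pillow library is installed and that the screenshot was already saved as an image file. The function name `screenshot_to_tiff` and the file names in the usage note are hypothetical.

```python
from PIL import Image  # third-party library: pip install Pillow


def screenshot_to_tiff(src_path, dst_path, box=None):
    """Optionally crop a screenshot, convert it to grayscale, save as TIFF."""
    with Image.open(src_path) as img:
        if box is not None:
            # box = (left, upper, right, lower) in pixels, playing the role
            # of the selection dragged with the mouse in Steps 3 and 4.
            img = img.crop(box)
        gray = img.convert("L")  # "L" is Pillow's 8-bit grayscale mode (Step 5)
        gray.save(dst_path, format="TIFF")  # Step 6
    return dst_path
```

A call such as `screenshot_to_tiff("page.png", "page.tif", box=(100, 150, 900, 1200))` would produce a grayscale TIFF of the selected region; leaving out `box` keeps the whole screenshot.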

How to grab the text?

Step 1: Download and install SimpleOCR from http://www.simpleocr.com

Step 2: Run the SimpleOCR program. During startup, select the "Machine Print" option.

Once the program opens, click on "Add page"; it will ask you to select a source. Choose the File option, navigate to the desired path, and select the TIFF file of the protected PDF.

Step 3: The software will now show a preview screen. Just skip it and click on the Continue button. The TIFF image now appears in the software, along with a new button, “Convert to text”.

Step 4: Here lies the actual magic. Click on the “Convert to text” button. This runs the OCR operation on the TIFF image, tries to recognize the text, and displays it.

(Some options are available here for the captured text; for example, you can correct spelling mistakes in the text or skip them.)

Step 5: Once the text is captured, it can be saved too. Click on the File menu and select the “Save as” option. It lets you save either as an "MS Doc" document or in plain text format. It is better to select plain text format rather than MS Doc and save it to the desired location.
