Extracting content from PDFs to reformat using MS Word

– Ashish Gupta

When was the last time you regretted not having the source file of a PDF file you had to edit? Probably not long ago. We have all lost the source files of PDF files, only to realize later that the amount of editing needed is more than even Acrobat can handle. And then there are times when we have simply inherited or received PDF files from others, with no clue about their source files. While we learn how not to lose our source files (!), this article describes ways of regenerating them, starting from the PDF files.

Acrobat, or for that matter any PDF editor, can only touch up the text. For extensive edits you have to go back to the source files.

Acrobat includes tools to extract content, which can save you a significant number of keystrokes that you would otherwise spend re-typing. Note that, given the way the PDF format works, the best you can do is extract the content with some of its formatting. It is difficult to retain formatting when converting from a PDF file, so you have to re-format the content. Depending on the page layout and formatting, the re-formatting could be extensive.

How to extract content from a PDF file

The universally known, conventional option is to save PDF files in other formats using File > Save As in Acrobat. Use it to save PDF files in Rich Text Format or as an MS Word document. If you want to use the generated document in MS Word, save it as an MS Word document. If you plan to use the content (say, by importing it) in any other application, save it as Rich Text Format. Since we have all used this option often, I shall just give you a familiar-looking list of all the available options.

The File > Export option in Acrobat is just a cousin of the above option. It provides almost all the options available in the Save As dialog (the PDF and Optimized PDF formats are not present).

How to extract data from a PDF file

The best native methods of extracting data from a PDF file are saving as XML and the Open Table in Spreadsheet option, available in the Acrobat context menu.

To bring data directly into a spreadsheet program, select the data, right-click it, and select the Open Table in Spreadsheet option. The context menu is shown below.

Notice the Save As Table option in the context menu above. Choose this option to save the selected data in CSV, XML, tab-delimited, HTML, RTF, or txt format. In one of these formats, the data can be used in practically any application.

Note: This option works best for data that was originally formatted as a table in the source file.
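
Once a table has been saved as CSV, picking it up in another program takes only a few lines. The following is a minimal Python sketch, not part of Acrobat: the function name read_exported_table and the sample data are hypothetical, and the sample only stands in for a file produced with Save As Table.

```python
import csv
import io


def read_exported_table(csv_text):
    """Parse CSV text exported from a PDF table into a list of row dicts."""
    reader = csv.reader(io.StringIO(csv_text))
    # Trim stray whitespace around cells and skip completely empty lines.
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample standing in for a table saved in CSV format.
sample = "Product, Units\nWidget, 12\nGadget, 7\n"
table = read_exported_table(sample)
print(table)
```

For a real export, read the text of the saved file and pass it to the function. As the note above says, expect clean rows only when the data was a true table in the source file.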

How to extract images from a PDF file

Images get automatically extracted and inserted into the doc or RTF files. Also, when converting to HTML, the images are saved in a folder with the same name as the HTML file.

If you want to extract only the images, follow these steps:

  • Go to Advanced > Document Processing > Export All Images.
  • Choose JPEG, JPEG2000, TIFF, or PNG as the file format.
  • (Optional) Change the settings as desired. If you are unsure, export with the default settings.
  • Select (or create) a folder into which all the images should be extracted. Click Save.

Tip: You can prevent the export of small and redundant images (such as thumbnails, company logos, and advertisements in PDF files created from web pages) by setting the Exclude image smaller than option.

Note: Choosing an image format in the File > Save As dialog is not the right way to extract the images from a PDF file. That option saves every page of the PDF file as an image, not the images contained in the pages.

How to extract content from scanned PDF files

You may need to extract content from a document scanned as a PDF file. Or you might need an editable soft copy of a hard copy document you have. Scanned PDF files are called image PDF files. Their text can neither be copied nor edited. If you try selecting text, the whole page gets selected and behaves as if it were an image.

Optical character recognition (OCR) is the electronic conversion of text present in scanned images into machine-encoded text. A limitation of OCR technology is that it can misinterpret characters. For example, i, l, and 1 can get mixed up. Hence, for good results, it is essential to proofread the generated text.

Tip: Scan a hard copy at 72 dpi (at least) and in grayscale. For better performance 300 dpi or higher is recommended.
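
To see what the resolution figures mean in practice, here is a small arithmetic sketch in Python. It assumes a US Letter page (8.5 x 11 inches) and 8-bit grayscale at one byte per pixel. Both are assumptions made for illustration, and the function names are hypothetical.

```python
def scan_pixels(width_in, height_in, dpi):
    """Return the (width, height) in pixels of a page scanned at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def grayscale_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes, assuming 8-bit grayscale (1 byte per pixel)."""
    width_px, height_px = scan_pixels(width_in, height_in, dpi)
    return width_px * height_px


# Assumed page size: US Letter, 8.5 x 11 inches.
for dpi in (72, 300):
    print(dpi, scan_pixels(8.5, 11, dpi), grayscale_bytes(8.5, 11, dpi))
```

Under these assumptions the page grows from 612 x 792 pixels at 72 dpi to 2550 x 3300 pixels at 300 dpi, roughly 17 times as many pixels for the OCR engine to work with, at the cost of a correspondingly larger scan.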

To extract text from an image PDF:

  • Click Document > OCR Text Recognition > Recognize Text Using OCR.
  • Select a page range to convert.
  • Click Edit to open the Recognize Text – Settings dialog box. In PDF Output Style, select Searchable Image or Searchable Image (Exact). Image resolution does not matter. Click OK to run OCR, which creates editable content.
  • Sanitize the generated content by correcting the suspect words.
  • After the process, export into doc, RTF, text, or XML format as you would export from any regular PDF file.
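
If you export the recognized text to a plain text file, a short script can point you to likely OCR slips before you proofread. The sketch below is only an illustration in Python and is not an Acrobat feature: the function name suspect_words is hypothetical, and the rule it applies (a word that mixes letters with the digit 1) is an assumed heuristic based on the i, l, and 1 confusion mentioned earlier. It will also flag legitimate words such as product codes, so treat its output as a reading list, not a verdict.

```python
import re


def suspect_words(text):
    """Return words that mix letters with the digit 1, a common OCR slip."""
    suspects = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        has_letter = any(ch.isalpha() for ch in word)
        has_one = "1" in word
        if has_letter and has_one:
            suspects.append(word)
    return suspects


# Hypothetical OCR output containing two misread words.
sample = "The f1le was scanned at 300 dpi in Apri1 2010"
print(suspect_words(sample))
```

Pure numbers such as 300 and 2010 are left alone because they contain no letters.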

Instructions to batch process multiple PDF files are available here.

Tip: If you have a long document, select only one page to begin with. If the quality is acceptable, you can then extract text from the whole document. Otherwise you may wish to rescan the document at a higher resolution.

Tip: An image of a document, taken with a digital camera, can easily be converted to an image PDF file by printing it to the Adobe PDF printer. By applying the same OCR technique, content can then be extracted from photographs of documents!

How to convert a PDF file to a FrameMaker file

Getting the content from PDF files into FrameMaker files involves two powerful features of FrameMaker and Acrobat software. This is helpful for writers who use FrameMaker as their authoring tool and do not possess the source fm files.

  • Using the methods mentioned above, convert your PDF files to RTF or doc format.
  • Import the generated RTF file into FrameMaker as described here in FrameMaker help.

How to get more out of PDF conversion

For most of the file formats you select, Acrobat offers some customizable settings. For example, the settings available while exporting a PDF file to RTF format are shown.

If you need only the text, do not include images. Images add a large processing overhead: the time taken to process images is significantly more than that taken for text. If images are needed for illustrative purposes only, you can downsample them. Leave the tag quality settings unchanged. Tags are recreated only during the conversion and do not show in the final document.

Similarly, while extracting images from PDF files or exporting PDF files as images, the image settings can be customized to suit your needs for quality and speed. Read more here.

Do you have many PDF files to convert?

In the File > Export list, notice that there is an extra option at the end, and a life-saving one it is. You can set up a batch process using this option. Simply add your PDF files to the dialog box that opens and you are good to go. This is especially useful in an enterprise setup, where there are lots of PDF files to be processed. This feature is available only in Acrobat 9 Professional and Acrobat 9 Professional Extended. Read more about batch sequences here.

Why copy-edit the generated files

Some text strings and handwritten parts that are not very clear to Acrobat come out as images in doc, RTF, and HTML output, and are left out of the text and XML outputs. If they are present in doc or RTF files, selecting them reveals that they cannot be edited. So after any conversion, carefully read through the generated document. Fixing such parts requires typing the text manually or copy-pasting it from the PDF file. Also, all conversions require you to format the re-generated source file, for reasons stated in the next section.

Why some PDF files do not convert properly

PDF is designed to be a final consumable format. All roads lead to PDF, and arriving at any other format starting from PDF is inherently very difficult. The PDF conversions discussed above are not magic and are never 100% accurate. Some PDF files convert gracefully while others don’t.

While the exporting and OCR features work smoothly for simple documents, they work only partially for heavily formatted pages, pages with highly customized layouts, or pages using decorative fonts. Tagged PDF files convert better because they contain information on the structure of the content. In the case of image PDFs, it helps to have a good resolution (about 300 dpi).

Note: If security restrictions do not allow you to open or disassemble a PDF file, then you cannot convert it into any format directly. You need the password (from its author) to enable direct exporting and editing in Acrobat.

How to extract content from secure PDF files?

Caution: Use this only for personal PDF files for which you own the copyright and have misplaced the password. Do ascertain your rights to PDF files received from other sources.

If you have misplaced the source file as well as the password to your secured PDF file, use the following workaround to get your content back:

  • If printing is allowed on your PDF file:
    • Print a hard copy.
    • Scan at the highest possible resolution in grayscale.
    • Extract content from the image PDF as described above.
  • If printing is also not allowed on your PDF file:
    • Unix/Linux users can generate a PS file from the PDF file using the pdf2ps command.
    • Convert back to PDF format using the ps2pdf command. The re-generated image PDF file does not have the same font information and is inflated in size, but is no longer secured.
    • Extract content from this image PDF as described above.
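
The pdf2ps and ps2pdf steps above can be scripted. The following Python sketch only wraps the two commands, which ship with Ghostscript. The function names and the "-roundtrip" output suffix are hypothetical choices, and whether the round trip succeeds on a particular file depends on the Ghostscript version and on how the file was secured, which this sketch does not check. As the caution above says, use it only on PDF files you own.

```python
import shutil
import subprocess
from pathlib import Path


def roundtrip_commands(pdf_path):
    """Build the pdf2ps and ps2pdf command lines for one PDF file."""
    src = Path(pdf_path)
    ps_file = src.with_suffix(".ps")
    # Hypothetical naming choice so the original file is never overwritten.
    out_pdf = src.with_name(src.stem + "-roundtrip.pdf")
    return [
        ["pdf2ps", str(src), str(ps_file)],
        ["ps2pdf", str(ps_file), str(out_pdf)],
    ]


def run_roundtrip(pdf_path):
    """Run both commands in order, stopping at the first failure."""
    for cmd in roundtrip_commands(pdf_path):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(cmd[0] + " not found; it ships with Ghostscript")
        subprocess.run(cmd, check=True)


# Show the commands for a hypothetical file without running them.
commands = roundtrip_commands("report.pdf")
print(commands)
```

Call run_roundtrip("report.pdf") to execute the two commands. The result is written next to the original file.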

Do check out the cool export features in the upcoming Acrobat X here. Acrobat X has very powerful content extraction features, which will make all of the above a cakewalk. To explore more, visit the Acrobat product page here. (This article is based on the current Acrobat 9 family of products.)

Your suggestions, tips, and comments are most welcome.

About the author:

Ashish works as a writer at Adobe India. He likes to trek and follow technology, and is a pro-environment hardliner. He can be reached on Twitter @ashishguptaiitb.

About the illustration:

Used with permission from Nirupama Singh.

All product names, logos, and any trademarks used in the illustrations and elsewhere in this article are for identification purposes only, are the property of their owners, and their rights are acknowledged.

8 Comments

  1. Excellent article Ashish. The best part is the conversion of PDF to FM. I would surely try these methods for converting data from the PDF file.

    Thanks,
    Mrini
