Re: [R] read data from pdf file

From: Thomas Schönhoff <>
Date: Sat 22 Oct 2005 - 05:39:47 EST

2005/10/21, Ted Harding <>:
> On 21-Oct-05 Marco Venanzi wrote:
> > Hi, I'm trying to read data from a PDF file. Is it possible to do it
> > with R? Thanks, Marco
> Basically, No.
> But you may be lucky with "copy&paste" using the mouse, from
> the display generated in Acrobat Reader to a text file.
> The basic procedure here is
> 1. Click on the "Text Select Tool" (a button usually marked with a "T");
> 2. Use the mouse to highlight the block of text you want to copy;
> 3. Depending on your operating system/graphics display: In Windows
> you have (IIRC) to go to "Edit"->"Copy"; in Unix/Linux with
> X Windows do nothing;
> 4. "Paste" it into your text file, again as appropriate for your
> operating system.
> However, you may not be lucky.
> PDF can store its content in strange ways, and what may look on the
> screen like contiguous and consecutive text is stored internally
> in separate "blocks" (what PDF calls "objects"). And this can apply
> even to little bits of text in a paragraph.
> When you paste the marked text, it will go in in the order that
> PDF finds the blocks in the file. As a result, your text file
> may contain bits of text in random order.
> This especially applies to things arranged in tables. But it
> very much depends on the software that generated the PDF in
> the first place.
> Since often the data in a PDF file which you may want to copy
> in this way will be tabular, you are likely to encounter this
> problem!
> You can tell this is going to happen when you use the mouse to
> highlight the text you intend to copy: starting with the mouse
> in say the top LH corner, move it slowly towards the lower
> RH corner of the block. If the highlighting jumps all over the
> screen, and/or outside the area you are trying to highlight,
> then this is what's happening.
> In that case I have sometimes done it by copying lots of little
> blocks, too small to provoke the effect. But this is very tedious.
> There are other things one can try, such as printing from the
> PDF file to a PostScript file, and then using a program like
> ps2ascii (which can deal directly with PDF) or pstotext; but frankly
> no such program is likely to make a good job of this, because of
> the way PS and PDF work.
> Sorry to appear unhelpful! But you may get somewhere.

Hmm, if this doesn't work you should have a look at pdftolpe, which is supposed to convert arbitrary PDF files to a format that LPE can read. LPE is a lightweight programmer's editor, which should be able to save the converted file as plain text.

I have never used it myself, though. If you are running Windows, my reply might not be of much help; sorry about that!
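
If one of the routes above does leave you with a plain text file, getting it into R is the easy part. Here is a minimal sketch, assuming the pasted table ended up as whitespace-separated columns with a header line. The sample data and the names pasted, txt_file, n_fields, bad_rows and dat are made up for illustration; the last row is deliberately short to mimic the kind of damage Ted describes.

```r
# Hypothetical sample standing in for text pasted from a PDF table.
# The last data row is deliberately one field short.
pasted <- c(
  "id height weight",
  "1 170 65",
  "2 182 80",
  "3 175"
)
txt_file <- tempfile(fileext = ".txt")
writeLines(pasted, txt_file)

# count.fields() returns the number of whitespace-separated fields
# on each line, so damaged rows can be spotted before parsing.
n_fields <- count.fields(txt_file)
bad_rows <- which(n_fields != n_fields[1])

# fill = TRUE pads short rows with NA instead of stopping with an error.
dat <- read.table(txt_file, header = TRUE, fill = TRUE)

print(bad_rows)  # line numbers in the file that need checking by hand
print(dat)
```

Checking the field counts first only catches rows that lost or gained fields; it will not notice values that were pasted in the wrong order, so a visual comparison with the PDF is still needed.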

good luck

Thomas

Received on Sat Oct 22 06:36:30 2005
