Re: [R] read data from pdf file

From: Thomas Schönhoff <tschoenhoff_at_gmail.com>
Date: Sat 22 Oct 2005 - 05:39:47 EST

2005/10/21, Ted Harding <Ted.Harding@nessie.mcc.ac.uk>:
> On 21-Oct-05 Marco Venanzi wrote:
> > Hi, I'm trying to read data from a PDF file. Is it possible to do it
> > with R? Thanks, Marco
>
> Basically, No.
>
> But you may be lucky with "copy&paste" using the mouse, from
> the display generated in Acrobat Reader to a text file.
>
> The basic procedure here is
>
> 1. Click on the "Text Select Tool" (a button usually marked with a "T");
>
> 2. Use the mouse to highlight the block of text you want to copy;
>
> 3. Depending on your operating system/graphics display: In Windows
> you have (IIRC) to go to "Edit"->"Copy"; in Unix/Linux with
> X Windows do nothing;
>
> 4. "Paste" it into your text file, again as appropriate for your
> operating system.
>
> However, you may not be lucky.
>
> PDF can store its content in strange ways, and what may look on the
> screen like contiguous and consecutive text is stored internally
> in separate "blocks" (what PDF calls "objects"). And this can apply
> even to little bits of text in a paragraph.
>
> When you paste the marked text, it will go in in the order that
> PDF finds the blocks in the file. As a result, your text file
> may contain bits of text in random order.
>
> This especially applies to things arranged in tables. But it
> very much depends on the software that generated the PDF in
> the first place.
>
> Since often the data in a PDF file which you may want to copy
> in this way will be tabular, you are likely to encounter this
> problem!
>
> You can tell this is going to happen when you use the mouse to
> highlight the text you intend to copy: starting with the mouse
> in, say, the top LH corner, move it slowly towards the lower
> RH corner of the block. If the highlighting jumps all over the
> screen, and/or outside the area you are trying to highlight,
> then this is what's happening.
>
> In that case I have sometimes done it by copying lots of little
> blocks, too small to provoke the effect. But this is very tedious.
>
> There are other things one can try, such as printing from the
> PDF file to a PostScript file, and then using a program like
> ps2ascii (which can deal directly with PDF) or pstotext; but frankly
> no such program is likely to make a good job of this, because of
> the way PS and PDF work.
>
> Sorry to appear unhelpful! But you may get somewhere.
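
The ps2ascii route mentioned above can also be driven from within R. This is only a minimal sketch, assuming Ghostscript's ps2ascii is installed and on the PATH; the function name pdf_to_lines and the file name in the usage line are made up for illustration, and the output may well be scrambled for the reasons Ted gives.

```r
# Hypothetical helper: convert a PDF (or PostScript) file to plain text
# with ps2ascii and return the lines, or NULL if that is not possible.
pdf_to_lines <- function(pdf_file, txt_file = tempfile(fileext = ".txt")) {
  # Sys.which() returns "" when the program is not found on the PATH
  if (!nzchar(Sys.which("ps2ascii")) || !file.exists(pdf_file)) {
    return(NULL)
  }
  # ps2ascii takes an input file and an optional output file
  status <- system2("ps2ascii",
                    args = c(shQuote(pdf_file), shQuote(txt_file)))
  if (status != 0 || !file.exists(txt_file)) {
    return(NULL)
  }
  readLines(txt_file, warn = FALSE)
}

# Usage with a made-up file name; inspect the result before trusting it
txt <- pdf_to_lines("report.pdf")
if (is.null(txt)) message("nothing converted") else print(head(txt))
```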

Hmm, if this doesn't work you should have a look at pdftolpe, which is supposed to convert arbitrary PDF files to a format readable by LPE. LPE is a lightweight programmer's editor that should be able to save the converted file in txt format.

I have never used this myself, though. If you are running Windows, my reply might not be of much help; sorry about that!
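
Once the text has landed in a plain text file by whatever route, getting it into R is the easy part. A minimal sketch, assuming the pasted block came out as a clean whitespace-separated table with a header row; the column names and numbers here are invented, and in practice the lines would come from readLines() on your own file.

```r
# Made-up example of what a cleanly pasted table might look like.
# In practice: pasted <- readLines("pasted.txt")   (hypothetical file name)
pasted <- c(
  "Year  Sales  Cost",
  "2003  12.5   7.1",
  "2004  14.0   8.3",
  "2005  15.2   9.0"
)

# read.table() splits on runs of whitespace by default
dat <- read.table(text = pasted, header = TRUE)

# Check that the columns came through as numbers, not text
str(dat)
```

If the copy scrambled the order of the cells, read.table() will usually complain about the number of fields per line or return character columns, which is a quick hint that the text file needs fixing by hand first.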

good luck

Thomas



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Sat Oct 22 06:36:30 2005