2005/10/21, Ted Harding <Ted.Harding@nessie.mcc.ac.uk>:
> On 21-Oct-05 Marco Venanzi wrote:
> > Hi, I'm trying to read data from a PDF file. Is it possible to do it
> > with R? Thanks, Marco
> Basically, No.
> But you may be lucky with "copy&paste" using the mouse, from
> the display generated in Acrobat Reader to a text file.
> The basic procedure here is
> 1. Click on the "Text Select Tool" (a button usually marked with a "T");
> 2. Use the mouse to highlight the block of text you want to copy;
> 3. Depending on your operating system/graphics display: In Windows
> you have (IIRC) to go to "Edit"->"Copy"; in Unix/Linux with
> X Windows do nothing;
> 4. "Paste" it into your text file, again as appropriate for your
> operating system.
> However, you may not be lucky.
> PDF can store its content in strange ways, and what may look on the
> screen like contiguous and consecutive text is stored internally
> in separate "blocks" (what PDF calls "objects"). And this can apply
> even to little bits of text in a paragraph.
> When you paste the marked text, it will go in in the order that
> PDF finds the blocks in the file. As a result, your text file
> may contain bits of text in random order.
> This especially applies to things arranged in tables. But it
> very much depends on the software that generated the PDF in
> the first place.
> Since often the data in a PDF file which you may want to copy
> in this way will be tabular, you are likely to encounter this problem.
> You can tell this is going to happen when you use the mouse to
> highlight the text you intend to copy: starting with the mouse
> in say the top LH corner, move it slowly towards the lower
> RH corner of the block. If the highlighting jumps all over the
> screen, and/or outside the area you are trying to highlight,
> then this is what's happening.
> In that case I have sometimes done it by copying lots of little
> blocks, too small to provoke the effect. But this is very tedious.
> There are other things one can try, such as printing from the
> PDF file to a PostScript file, and then using a program like
> ps2ascii (which can deal directly with PDF) or pstotext; but frankly
> no such program is likely to make a good job of this, because of
> the way PS and PDF work.
> Sorry to appear unhelpful! But you may get somewhere.
Hmm, if this doesn't work you should have a look at pdftolpe, which is supposed to convert arbitrary PDF files to some LPE-readable format. LPE is a lightweight programmer's editor that should be able to save the converted file in txt format.
I never used this myself, though. In case you are running Windows my reply might not be of much help, sorry for that!
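If you do end up with table cells in scrambled order, as Ted describes, and your extraction tool can also report where each bit of text sat on the page, the table can in principle be put back together by sorting on position. Here is a minimal Python sketch of that idea. The fragment list, the coordinates and the function name rebuild_rows are all made up for illustration, and it assumes PDF-style coordinates with the origin at the bottom left, so the top row has the largest y.

```python
def rebuild_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, then order each row left to right.

    Fragments whose y values lie within y_tolerance of the first fragment
    in a row are treated as belonging to that row.
    """
    rows = []  # each entry is (row_y, [(x, text), ...])
    # Visit fragments from the top of the page downwards (largest y first).
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, sort the cells by their horizontal position.
    return [[text for _, text in sorted(cells, key=lambda c: c[0])]
            for _, cells in rows]


# Hypothetical fragments in the scrambled order they might come out of a file.
fragments = [
    (200.0, 700.0, "4.2"),
    (50.0, 680.0, "B"),
    (50.0, 700.0, "A"),
    (200.0, 680.5, "5.1"),
    (120.0, 700.5, "10"),
    (120.0, 680.0, "12"),
]

for row in rebuild_rows(fragments):
    print("\t".join(row))
```

The tab-separated lines can then be saved to a text file and read into R with read.table(). Whether this works on a real file depends on how the PDF was generated, so treat it as a starting point only.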