Re: [R] Reading a web page in pdf format

From: Marc Schwartz <marc_schwartz_at_comcast.net>
Date: Wed, 09 May 2007 10:55:39 -0500

On Wed, 2007-05-09 at 15:47 +0100, Vittorio wrote:
> Each day the daily balance in the following link
>
> http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf
>
> is updated.
>
> I would like to set up an R procedure to be run daily in a
> server able to read the figures in a couple of lines only
> ("Industriale" and "Termoelettrico", towards the end of the balance)
> and put the data in a table.
>
> Is that possible? If yes, what R-packages
> should I use?
>
> Ciao
> Vittorio

Vittorio,

Keep in mind that a PDF file is largely plain text when its content streams are not compressed, as is the case here. Thus you can read it in using readLines():

PDFFile <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")

# No clean-up is needed: readLines() reads directly from the URL
# connection and leaves no local file behind, so there is nothing to unlink()

> str(PDFFile)

 chr [1:989] "%PDF-1.2" "6 0 obj" "<<" "/Length 7 0 R" ...
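
The readLines() approach relies on the page text being stored uncompressed. A PDF may instead declare its streams with a filter such as /FlateDecode, in which case the labels will not show up as readable text. A minimal sketch of a check for that, assuming a hypothetical helper name (has_compressed_streams) and a small temporary file standing in for the real download:

```r
# Hypothetical helper (not from the original post): TRUE if any line
# declares a Flate-compressed stream.
has_compressed_streams <- function(pdf_lines) {
  # useBytes = TRUE avoids encoding problems with binary content
  any(grepl("/FlateDecode", pdf_lines, fixed = TRUE, useBytes = TRUE))
}

# Offline demonstration on a temporary file holding a few lines
# similar to those shown by str(PDFFile)
tmp <- tempfile(fileext = ".pdf")
writeLines(c("%PDF-1.2", "6 0 obj", "<<", "/Length 7 0 R", ">>", "stream",
             "9 0 0 9 204 283 Tm (Termoelettrico )Tj"), tmp)
pdf_lines <- readLines(tmp, warn = FALSE)
unlink(tmp)  # remove the temporary file

plain  <- has_compressed_streams(pdf_lines)
packed <- has_compressed_streams(c("<<", "/Filter /FlateDecode", ">>"))
```

If the check returns TRUE for the real file, the grep() step is likely to find nothing and a dedicated PDF text extractor would be needed instead.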

# Now find the lines containing the values you wish
# Use grep() with a regex for either term
Lines <- grep("(Industriale|Termoelettrico)", PDFFile)

> Lines

[1] 33 34

> PDFFile[Lines]

[1] "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj"
[2] "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"      


# Now parse the values out of the lines
Vals <- sub(".*\\((.*)\\).*", "\\1", PDFFile[Lines])

> Vals

[1] " 46,6" " 99,3"
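
The sub() call works because its leading ".*" is greedy, so the capture lands on the last parenthesised group of each line. If the label is wanted alongside the value, every "(...)Tj" operand can be pulled out instead. A sketch, assuming a hypothetical helper name (extract_operands) and using the two lines printed above as input:

```r
# The two sample lines are copied from the PDFFile[Lines] output above.
sample_lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"
)

# Hypothetical helper: return every "(...)Tj" text operand found on
# one line, with surrounding whitespace trimmed.
extract_operands <- function(line) {
  m <- regmatches(line, gregexpr("\\(([^)]*)\\)\\s*Tj", line, perl = TRUE))[[1]]
  trimws(sub("^\\((.*)\\)\\s*Tj$", "\\1", m, perl = TRUE))
}

operands <- lapply(sample_lines, extract_operands)
labels <- vapply(operands, `[`, character(1), 1)  # first operand: the label
values <- vapply(operands, `[`, character(1), 2)  # second operand: the figure
```

This assumes each line carries exactly one label followed by one figure, as in the two lines shown; a differently laid out PDF would need a different rule.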

# Now convert them to numeric
# need to change the ',' to a '.' at least in my locale

> as.numeric(gsub(",", ".", Vals, fixed = TRUE))

[1] 46.6 99.3
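
Since the goal is to put the figures in a table, the steps above can be wrapped into one function that returns a data frame. This is only a sketch: the function name (parse_balance_lines) and the column names are made up for illustration, and the input here is a small vector of sample lines rather than the live file.

```r
# Hypothetical helper wrapping the grep/sub/as.numeric steps above.
parse_balance_lines <- function(pdf_lines,
                                items = c("Industriale", "Termoelettrico")) {
  pattern <- paste0("(", paste(items, collapse = "|"), ")")
  hits <- grep(pattern, pdf_lines, value = TRUE)
  # Label: whichever item name appears on the line
  item <- regmatches(hits, regexpr(pattern, hits))
  # Value: contents of the last parenthesised operand on the line
  raw <- sub(".*\\((.*)\\).*", "\\1", hits)
  # Swap the decimal comma for a point before converting
  value <- as.numeric(sub(",", ".", trimws(raw), fixed = TRUE))
  data.frame(item = item, value = value, stringsAsFactors = FALSE)
}

# Sample input: a few header lines plus the two lines shown above
sample_pdf <- c(
  "%PDF-1.2",
  "6 0 obj",
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"
)

balance <- parse_balance_lines(sample_pdf)
print(balance)
```

For a daily run, the same function could be applied to the result of readLines() on the URL, with a date column added from Sys.Date() before appending to a stored table.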

HTH,

Marc Schwartz



R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Wed 09 May 2007 - 16:01:38 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 09 May 2007 - 18:31:29 GMT.
