Re: [R] Reading a web page in pdf format

From: Marc Schwartz <marc_schwartz_at_comcast.net>
Date: Wed, 09 May 2007 12:08:21 -0500

On Wed, 2007-05-09 at 10:55 -0500, Marc Schwartz wrote:
> On Wed, 2007-05-09 at 15:47 +0100, Vittorio wrote:
> > Each day the daily balance in the following link
> >
> > http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf
> >
> > is updated.
> >
> > I would like to set up an R procedure to be run daily in a
> > server able to read the figures in a couple of lines only
> > ("Industriale" and "Termoelettrico", towards the end of the balance)
> > and put the data in a table.
> >
> > Is that possible? If yes, what R packages should I use?
> >
> > Ciao
> > Vittorio
>
> Vittorio,
>
> Keep in mind that PDF files are largely plain text when their content
> streams are not compressed, as is the case here. Thus you can read it
> in using readLines():
>
> PDFFile <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")
>
> # No clean-up is needed: readLines() opens and closes the URL
> # connection itself, and no local file is created
>
>
> > str(PDFFile)
> chr [1:989] "%PDF-1.2" "6 0 obj" "<<" "/Length 7 0 R" ...
>
>
> # Now find the lines containing the values you wish
> # Use grep() with a regex for either term
> Lines <- grep("(Industriale|Termoelettrico)", PDFFile)
>
> > Lines
> [1] 33 34
>
> > PDFFile[Lines]
> [1] "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj"
> [2] "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
>
>
> # Now parse the values out of the lines
> Vals <- sub(".*\\((.*)\\).*", "\\1", PDFFile[Lines])
>
> > Vals
> [1] " 46,6" " 99,3"
>
>
> # Now convert them to numeric
> # need to change the ',' to a '.' at least in my locale
>
> > as.numeric(gsub(",", "\\.", Vals))
> [1] 46.6 99.3

Vittorio,

Just a quick tweak here, given the possibility that the order of the values may be subject to change.

After reading the file and getting the lines, use:

# Use sub() with 2 back references, 1 for each value in the line
Vals <- sub(".*\\((.*)\\).*\\((.*)\\).*", "\\1 \\2", PDFFile[Lines])

> Vals

[1] "Industriale 46,6" "Termoelettrico 99,3"
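
To try the pattern without fetching the PDF, here is a small self-contained sketch that runs it on the two content lines shown earlier. The vector name SampleLines is made up for the illustration, and trimws() (which assumes R 3.2.0 or later) is only there to tidy the captured text for inspection.

```r
# Hypothetical stand-in for PDFFile[Lines]: the two lines shown earlier
SampleLines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)

# Same regex as above: group 1 is the label, group 2 is the value
Pattern <- ".*\\((.*)\\).*\\((.*)\\).*"

# Extract each group separately and strip the surrounding blanks
Labels <- trimws(sub(Pattern, "\\1", SampleLines))
Values <- trimws(sub(Pattern, "\\2", SampleLines))
```

Because each line holds exactly two parenthesised strings, the greedy .* cannot pair the groups any other way, so the label always lands in the first back reference regardless of the order of the lines.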

This gives us the labels and the values. Now convert to a data frame and then coerce the values to numeric:

DF <- read.table(textConnection(Vals))

> DF

              V1   V2
1    Industriale 46,6
2 Termoelettrico 99,3

DF$V2 <- as.numeric(sub(",", "\\.", DF$V2))

> DF

              V1   V2
1    Industriale 46.6
2 Termoelettrico 99.3

> str(DF)

'data.frame': 2 obs. of 2 variables:
 $ V1: Factor w/ 2 levels "Industriale",..: 1 2
 $ V2: num 46.6 99.3
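
Since this is meant to run unattended every day, the steps could be wrapped in a function that fails loudly when the labels are not found, for example if the PDF is ever saved with compressed content streams, in which case the text would no longer be visible to grep(). The following is only a sketch under those assumptions: the names extract_balance, Item and Value are made up, and reading with dec = "," through the text argument of read.table() is a substitution for the textConnection() and sub() steps above, not the approach posted there.

```r
# Hypothetical helper: takes the lines of the PDF, returns a data frame
extract_balance <- function(pdf_lines,
                            labels = c("Industriale", "Termoelettrico")) {
  # Keep only the lines that mention one of the labels
  hits <- grep(paste(labels, collapse = "|"), pdf_lines, value = TRUE)
  if (length(hits) == 0)
    stop("labels not found; the PDF text may be compressed")

  # Pull out "label value" pairs with two back references
  vals <- sub(".*\\((.*)\\).*\\((.*)\\).*", "\\1 \\2", hits)

  # dec = "," converts the decimal comma while reading;
  # stringsAsFactors = FALSE keeps the labels as character
  read.table(text = vals, dec = ",", stringsAsFactors = FALSE,
             col.names = c("Item", "Value"))
}

# Try it on the two lines shown earlier instead of the live URL
sample_lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)
Balance <- extract_balance(sample_lines)
```

For the daily job, the same function could be called on the result of readLines() on the URL, and the returned rows appended to whatever table is being kept.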

HTH, Marc



R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Wed 09 May 2007 - 17:29:04 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 09 May 2007 - 17:31:06 GMT.
