Re: [R] Incremental ReadLines

From: Frederik Lang <frederiklang_at_gmail.com>
Date: Sun, 17 Apr 2011 20:25:00 -0400

Hi again,

Changing my code to define the vectors outside the loop and combine them afterwards helped a lot. The code no longer slows down, and I was able to parse the file in less than 2 hours. Not fantastic, but it works.
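
A minimal sketch of that pattern, for the archive. It is an illustration rather than the actual parser: the temporary file, its contents, and the capacity of 10 are placeholders, and only the Id field is extracted.

```r
# Hypothetical sample input standing in for the real file.
path <- tempfile(fileext = ".txt")
writeLines(c("Id: 1", "Var1: false", "", "Id: 2", "missing", "",
             "Id: 3", "Var2: 7"), path)

capacity <- 10L           # assumed bound; pick something above the expected case count
Id <- numeric(capacity)   # plain vector, allocated once, outside the loop
case <- 0L

input <- file(path, "rt")
repeat {
  line <- readLines(input, n = 1)
  if (length(line) == 0) break
  if (grepl("Id:", line, fixed = TRUE)) {
    case <- case + 1L
    # assign into the vector, not into a data.frame
    Id[case] <- as.numeric(sub("^.*Id:[[:space:]]*", "", line))
  }
}
close(input)

length(Id) <- case            # trim the unused tail
data <- data.frame(Id = Id)   # build the data.frame once, after the loop
```

With more variables, each one would get its own preallocated vector and the data.frame() call would combine them at the end.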

I will try William's last suggestion, parsing the file without a loop, the next time I have to parse a large file.
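
The core of that approach is assigning into a matrix with a two-column subscript. A small sketch of just that step, using made-up records (the names m and vars and the five example values are illustrative, not taken from the real file):

```r
# Hypothetical parsed records: one entry per "Var: value" line.
group   <- c(1, 1, 1, 3, 3)   # which Id (row) each line belongs to
varname <- c("Var1", "Var2", "Var3", "Var1", "Var3")
value   <- c("false", "6", "8", "true", "8")

vars <- unique(varname)       # one column per distinct variable name
col  <- match(varname, vars)  # column index for each line

m <- array(NA_character_, c(3, length(vars)),
           dimnames = list(c("First", "Second", "Last"), vars))

# A two-column matrix used as a subscript addresses one cell per row:
# row k of cbind(group, col) means "put value[k] at m[group[k], col[k]]".
m[cbind(group, col)] <- value
m
```

Cells that never receive a value stay NA, which is how observations with missing variables are handled without any special casing.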

Many thanks for your help!

Frederik

On Thu, Apr 14, 2011 at 4:58 PM, William Dunlap <wdunlap_at_tibco.com> wrote:

> [see below]
>
> From: Frederik Lang [mailto:frederiklang_at_gmail.com]
> Sent: Thursday, April 14, 2011 12:56 PM
> To: William Dunlap
> Cc: r-help_at_r-project.org
> Subject: Re: [R] Incremental ReadLines
>
>
>
> Hi Bill,
>
> Thank you so much for your suggestions. I will try and alter my
> code.
>
>
> Regarding the even shorter solution outside the loop, it looks
> good, but my problem is that not all observations have the same
> variables, so three different observations might look like this:
>
>
> Id: 1
> Var1: false
> Var2: 6
> Var3: 8
>
> Id: 2
> missing
>
> Id: 3
> Var1: true
> 3 4 5
> Var2: 7
> Var3: 3
>
>
> To do it without looping, I thought my data had to be quite
> systematic, which it is not. I might be wrong though.
>
> Doing the simple preallocation that I describe should speed it up
> a lot with very little effort. It is more work to manipulate the
> columns one at a time instead of using data.frame subscripting and
> it may not be worth it if you have lots of columns.
>
> If you have a lot of this sort of file and feel that it will be worth
> the programming time to do something fancier, here is some code that
> reads lines of the form
>
> > cat(lines, sep="\n")
> Id: First
> Var1: false
> Var2: 6
> Var3: 8
>
> Id: Second
> Id: Last
> Var1: true
> Var3: 8
>
> and produces a matrix with the Id's along the rows and the Var's
> along the columns:
>
> > f(lines)
>        Var1    Var2 Var3
> First  "false" "6"  "8"
> Second NA      NA   NA
> Last   "true"  NA   "8"
>
> The function f is:
>
> f <- function (lines)
> {
>     # keep only lines with colons
>     lines <- grep(value = TRUE, "^.+:", lines)
>     lines <- gsub("^[[:space:]]+|[[:space:]]+$", "", lines)
>     isIdLine <- grepl("^Id:", lines)
>     group <- cumsum(isIdLine)
>     rownames <- sub("^Id:[[:space:]]*", "", lines[isIdLine])
>     lines <- lines[!isIdLine]
>     group <- group[!isIdLine]
>     varname <- sub("[[:space:]]*:.*$", "", lines)
>     value <- sub(".*:[[:space:]]*", "", lines)
>     colnames <- unique(varname)
>     col <- match(varname, colnames)
>     retval <- array(NA_character_,
>                     c(length(rownames), length(colnames)),
>                     dimnames = list(rownames, colnames))
>     retval[cbind(group, col)] <- value
>     retval
> }
>
> The main trick is the matrix subscript given to retval on the
> penultimate line.
>
> Thanks again,
>
>
> Frederik
>
>
>
> On Thu, Apr 14, 2011 at 12:56 PM, William Dunlap
> <wdunlap_at_tibco.com> wrote:
>
>
> I have two suggestions to speed up your code, if you must use a loop.
>
> First, don't grow your output dataset at each iteration. Instead of
>     cases <- 0
>     output <- numeric(cases)
>     while(length(line <- readLines(input, n=1))==1) {
>         cases <- cases + 1
>         output[cases] <- as.numeric(line)
>     }
> preallocate the output vector to be about the size of its eventual
> length (slightly bigger is better), replacing
>     output <- numeric(cases)
> with the likes of
>     output <- numeric(500000)
> and when you are done with the loop trim down the length if it is
> too big
>     if (cases < length(output)) length(output) <- cases
> Growing your dataset in a loop can cause quadratic or worse growth
> in time with problem size and the above sort of code should make the
> time grow linearly with problem size.
>
> Second, don't do data.frame subscripting inside your loop.
> Instead of
>     data <- data.frame(Id=numeric(cases))
>     while(...) {
>         data[cases, 1] <- newValue
>     }
> do
>     Id <- numeric(cases)
>     while(...) {
>         Id[cases] <- newValue
>     }
>     data <- data.frame(Id = Id)
> This is just the general principle that you don't want to repeat
> the same operation over and over in a loop. dataFrame[i,j] first
> extracts column j then extracts element i from that column. Since
> the column is the same every iteration you may as well extract the
> column outside of the loop.
>
> Avoiding the loop altogether is the fastest. E.g., the code you
> showed does the same thing as
>     idLines <- grep(value=TRUE, "Id:", readLines(file))
>     data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))
> You can also use an external process (perl or grep) to filter out
> the lines that are not of interest.
>
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>
> > -----Original Message-----
> > From: r-help-bounces_at_r-project.org
> > [mailto:r-help-bounces_at_r-project.org] On Behalf Of Freds
> > Sent: Wednesday, April 13, 2011 10:58 AM
> > To: r-help_at_r-project.org
> > Subject: Re: [R] Incremental ReadLines
> >
>
> > Hi there,
> >
> > I am having a similar problem with reading in a large text file with
> > around 550,000 observations, each with 10 to 100 lines of description.
> > I am trying to parse it in R but I am having trouble with the size of
> > the file. It seems to slow down dramatically at some point. I would be
> > happy for any suggestions. Here is my code, which works fine when I
> > run it on a subsample of my dataset.
> >
> > #Defining datasource
> > file <- "filename.txt"
> >
> > #Creating placeholder for data and assigning column names
> > data <- data.frame(Id=NA)
> >
> > #Starting by case = 0
> > case <- 0
> >
> > #Opening a connection to data
> > input <- file(file, "rt")
> >
> > #Going through cases
> > repeat {
> >     line <- readLines(input, n=1)
> >     if (length(line)==0) break
> >     if (length(grep("Id:",line)) != 0) {
> >         case <- case + 1 ; data[case,] <-NA
> >         split_line <- strsplit(line,"Id:")
> >         data[case,1] <- as.numeric(split_line[[1]][2])
> >     }
> > }
> >
> > #Closing connection
> > close(input)
> >
> > #Saving dataframe
> > write.csv(data,'data.csv')
> >
> >
> > Kind regards,
> >
> >
> > Frederik
> >
> >
> > --
> > View this message in context:
> > http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html
> > Sent from the R help mailing list archive at Nabble.com.
> >
> > ______________________________________________
> > R-help_at_r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
>
>
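
The external-filter idea mentioned in the quoted message (letting grep drop the uninteresting lines before R sees them) can be sketched with pipe(). This assumes a grep executable is on the PATH, which is typical on Unix-alikes but not guaranteed on Windows; the temporary file and its contents are placeholders.

```r
# Hypothetical sample input standing in for the real file.
path <- tempfile(fileext = ".txt")
writeLines(c("Id: 1", "Var1: false", "", "Id: 2", "missing", "",
             "Id: 3"), path)

# Let grep select the Id lines; R only reads what grep prints.
con <- pipe(paste("grep 'Id:'", shQuote(path)))
idLines <- readLines(con)   # opens the pipe, reads everything, closes it
close(con)                  # release the connection object

data <- data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))
```

For a file where most lines are not of interest, this keeps the bulk of the text out of R's memory entirely.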




Received on Mon 18 Apr 2011 - 00:31:42 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 18 Apr 2011 - 00:40:34 GMT.

