Re: [R] Incremental ReadLines

From: Frederik Lang <frederiklang_at_gmail.com>
Date: Thu, 14 Apr 2011 11:57:40 -0400

Hi Mike,

Thanks for your comment.

I must admit that I am very new to R, and although what you write sounds interesting, I have no idea where to start. Can you point me to some functions or examples that show how it can be done?

I was under the impression that I had to do a loop since my blocks of observations are of varying length.

Thanks again,

Frederik

On Thu, Apr 14, 2011 at 6:19 AM, Mike Marchywka <marchywka_at_hotmail.com> wrote:

>
> ----------------------------------------
> > Date: Wed, 13 Apr 2011 10:57:58 -0700
> > From: frederiklang_at_gmail.com
> > To: r-help_at_r-project.org
> > Subject: Re: [R] Incremental ReadLines
> >
> > Hi there,
> >
> > I am having a similar problem with reading in a large text file with around
> > 550.000 observations, each with 10 to 100 lines of description. I am trying
> > to parse it in R but I am having trouble with the size of the file. It seems
> > to slow down dramatically at some point. I would be happy for any
>
> This probably occurs when you run out of physical memory; you can
> verify that by watching memory use in Task Manager. A "readline()"-style
> method doesn't fit well with R, where you try to hand over blocks of data
> so that the inner loops, implemented largely in native code, can operate
> efficiently. What you want is a data structure that uses disk more
> effectively and hides those details from you and from the algorithm.
> This works best if the algorithm cooperates with the data structure to
> avoid lots of disk thrashing. You could imagine a "read" that does
> nothing until each item is needed, but often people want the whole
> file validated before processing, and lots of details come up with
> exception handling as you get fancy here. Note, of course, that your parse
> output could be stored in a hash or something representing a DOM, and this
> could get arbitrarily large. Since such a structure is designed for random
> access, it may cause lots of thrashing if it is partially on disk. Anything
> you can do to make access patterns more regular, for example sorting your
> data, would help.
>
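As a rough illustration of the block-wise reading described above, the following is a minimal sketch rather than a definitive implementation. It assumes that the only lines of interest contain "Id:" and that a chunk of 10000 lines fits comfortably in memory; read_ids_chunked and chunk_size are hypothetical names that do not appear in the thread. Each chunk is scanned with vectorized grepl() and sub(), and the results are collected in a list instead of growing a data frame one row at a time.

```r
# Sketch only: read a large file in fixed-size chunks and extract "Id:" values.
# read_ids_chunked and chunk_size are hypothetical names, not from the thread.
read_ids_chunked <- function(path, chunk_size = 10000L) {
  con <- file(path, "rt")
  on.exit(close(con))          # close the connection even if an error occurs
  pieces <- list()             # one numeric vector per chunk
  i <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break            # end of file reached
    # Vectorized search over the whole chunk
    hits <- lines[grepl("Id:", lines, fixed = TRUE)]
    i <- i + 1L
    # Drop everything up to and including the first "Id:", then convert
    pieces[[i]] <- as.numeric(sub("^.*?Id:", "", hits, perl = TRUE))
  }
  # as.numeric() turns the NULL from an empty list into numeric(0)
  data.frame(Id = as.numeric(unlist(pieces)))
}
```

Calling read_ids_chunked("filename.txt") would return a data frame with a single Id column, which could then be passed to write.csv() as in the original code.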
>
> > suggestions. Here is my code, which works fine when I am doing a subsample
> > of my dataset.
> >
> > #Defining datasource
> > file <- "filename.txt"
> >
> > #Creating placeholder for data and assigning column names
> > data <- data.frame(Id=NA)
> >
> > #Starting by case = 0
> > case <- 0
> >
> > #Opening a connection to data
> > input <- file(file, "rt")
> >
> > #Going through cases
> > repeat {
> >   line <- readLines(input, n=1)
> >   if (length(line)==0) break
> >   if (length(grep("Id:",line)) != 0) {
> >     case <- case + 1 ; data[case,] <-NA
> >     split_line <- strsplit(line,"Id:")
> >     data[case,1] <- as.numeric(split_line[[1]][2])
> >   }
> > }
> >
> > #Closing connection
> > close(input)
> >
> > #Saving dataframe
> > write.csv(data,'data.csv')
> >
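Blocks of varying length do not necessarily require a line-by-line loop. The following is a minimal sketch, assuming (this is not stated in the thread) that every observation begins with a line containing "Id:" and that the lines of interest are already in a character vector; the sample lines and variable names are made up for illustration. cumsum() over the start-of-record flags gives every line a record number, and split() then groups the lines per record.

```r
# Sketch only: group lines into variable-length records without a per-line loop.
# Assumption (not from the thread): each record starts with an "Id:" line.
lines <- c("Id: 1", "first description line", "second description line",
           "Id: 2", "only one description line")

is_start  <- grepl("Id:", lines, fixed = TRUE)  # TRUE on the first line of each record
record_no <- cumsum(is_start)                   # running record index for every line

# One character vector per record, whatever its length
records <- split(lines, record_no)

# Numeric Id of each record, taken from its first line
ids <- as.numeric(sub("^.*?Id:", "", lines[is_start], perl = TRUE))

# Number of description lines in each record (record length minus the Id line)
n_desc <- lengths(records) - 1L
```

The same grouping could be applied to each chunk of a larger file, provided records that straddle a chunk boundary are carried over to the next chunk.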
> >
> > Kind regards,
> >
> >
> > Frederik
> >
> >
> > --
> > View this message in context: http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html
> > Sent from the R help mailing list archive at Nabble.com.
> >
> > ______________________________________________
> > R-help_at_r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>


Received on Thu 14 Apr 2011 - 20:15:12 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 14 Apr 2011 - 20:30:31 GMT.
