Re: [R] Incremental ReadLines

From: Mike Marchywka <marchywka_at_hotmail.com>
Date: Thu, 14 Apr 2011 16:08:30 -0400



> Date: Thu, 14 Apr 2011 11:57:40 -0400
> Subject: Re: [R] Incremental ReadLines
> From: frederiklang_at_gmail.com
> To: marchywka_at_hotmail.com
> CC: r-help_at_r-project.org
>
> Hi Mike,
>
> Thanks for your comment.
>
> I must admit that I am very new to R, and although what you write
> sounds interesting, I have no idea where to start. Can you give some
> functions or examples where I can see how it can be done?

I'm not sure I have a good R answer; I'm simply pointing out the likely issue, and maybe the rest belongs on the r-developer list or something. If you can determine that you are running out of physical memory, then you either need to partition something or make accesses more regular. My favorite example from personal experience is sorting a data set prior to piping it into a C++ program, which changed the execution time substantially by avoiding VM thrashing. R either needs a swapping buffer or has an equivalent that someone else could mention.
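
As a rough illustration of the "partition something" idea, here is a minimal sketch that reads the file in fixed-size chunks rather than one line at a time and accumulates results in a list, so no data frame is grown row by row. The function name read_ids_chunked and the default chunk size are assumptions made for illustration; only the "Id:" marker comes from the posted code. Treat it as one possible approach under those assumptions, not a tuned solution.

```r
# Hypothetical helper (not from the original post): read `path` in chunks of
# `chunk_size` lines and collect the numeric value that follows "Id:".
read_ids_chunked <- function(path, chunk_size = 10000L) {
  con <- file(path, "rt")
  on.exit(close(con))                  # always release the connection
  pieces <- list()                     # one numeric vector per chunk
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break     # end of file
    # Keep only the lines that contain the marker
    hits <- grep("Id:", lines, fixed = TRUE, value = TRUE)
    # Drop everything up to the first "Id:" and convert in one vectorized call
    ids <- as.numeric(sub("^.*?Id:", "", hits, perl = TRUE))
    pieces[[length(pieces) + 1L]] <- ids
  }
  # Build the data frame once, at the end
  data.frame(Id = as.numeric(unlist(pieces)))
}
```

Calling read_ids_chunked("filename.txt") should then give a one-column data frame. Only one chunk of raw text is held at a time, and each chunk is handled by vectorized calls instead of an R-level loop over lines.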

>
> I was under the impression that I had to do a loop since my blocks of
> observations are of varying length.
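
Varying block length does not by itself force a line-by-line loop. A minimal sketch, assuming every record starts with a line containing "Id:" (as in the posted code) and using a small made-up sample in place of the real file: flag the header lines, then take a running count of them so that every line is labelled with the record it belongs to.

```r
# Made-up sample standing in for lines already read from the file
lines <- c("Id: 1", "first description line", "second line",
           "Id: 2", "only one line",
           "Id: 3", "a", "b", "c")

is_header <- grepl("Id:", lines, fixed = TRUE)  # TRUE where a record starts
block <- cumsum(is_header)                      # record number for every line

# One Id per record, converted in a single vectorized call
ids <- as.numeric(sub("^.*?Id:", "", lines[is_header], perl = TRUE))

# Number of description lines per record, whatever the block length
n_desc <- tabulate(block[!is_header], nbins = length(ids))

# Description lines grouped by record, if the text itself is needed
descriptions <- split(lines[!is_header], block[!is_header])

records <- data.frame(Id = ids, n_desc = n_desc)
```

If the file is read in pieces rather than all at once, a record can straddle two pieces, so the trailing lines of one piece would have to be carried over to the next; that detail is left out here.
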
>
> Thanks again,
>
> Frederik
>
> On Thu, Apr 14, 2011 at 6:19 AM, Mike Marchywka wrote:
>
>
>
>
>
> ----------------------------------------
> > Date: Wed, 13 Apr 2011 10:57:58 -0700
> > From: frederiklang_at_gmail.com
> > To: r-help_at_r-project.org
> > Subject: Re: [R] Incremental ReadLines
> >
> > Hi there,
> >
> > I am having a similar problem with reading in a large text file with around
> > 550.000 observations with each 10 to 100 lines of description. I am trying
> > to parse it in R but I have troubles with the size of the file. It seems
> > like it is slowing down dramatically at some point. I would be happy
> > for any
>
> This probably occurs when you run out of physical memory, but you can
> probably verify that by looking at Task Manager. A "readline()" method
> wouldn't fit really well with R, since you want to hand it blocks of
> data so that inner loops, implemented largely in native code, can
> operate efficiently. The thing you want is a data structure that can
> use disk more effectively and hide these details from you and the
> algorithm. This works best if the algorithm works with the data
> structure to avoid lots of disk thrashing. You could imagine that your
> "read" would do nothing until each item is needed, but often people
> want the whole file validated before processing, and lots of details
> come up with exception handling as you get fancy here. Note of course
> that your parse output could be stored in a hash or something
> representing a DOM, and this could get arbitrarily large. Since it is
> designed for random access, this may cause lots of thrashing if it is
> partially on disk. Anything you can do to make access patterns more
> regular, for example sorting your data, would help.
>
>
> > suggestions. Here is my code, which works fine when I am doing a subsample
> > of my dataset.
> >
> > #Defining datasource
> > file <- "filename.txt"
> >
> > #Creating placeholder for data and assigning column names
> > data <- data.frame(Id=NA)
> >
> > #Starting by case = 0
> > case <- 0
> >
> > #Opening a connection to data
> > input <- file(file, "rt")
> >
> > #Going through cases
> > repeat {
> >   line <- readLines(input, n=1)
> >   if (length(line)==0) break
> >   if (length(grep("Id:",line)) != 0) {
> >     case <- case + 1
> >     data[case,] <- NA
> >     split_line <- strsplit(line,"Id:")
> >     data[case,1] <- as.numeric(split_line[[1]][2])
> >   }
> > }
> >
> > #Closing connection
> > close(input)
> >
> > #Saving dataframe
> > write.csv(data,'data.csv')
> >
> >
> > Kind regards,
> >
> >
> > Frederik
> >
> >
> > --
> > View this message in context:
> http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html
> > Sent from the R help mailing list archive at Nabble.com.
> >
> > ______________________________________________
> > R-help_at_r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
                                               



Received on Thu 14 Apr 2011 - 20:23:12 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 14 Apr 2011 - 20:30:31 GMT.
