From: Sean O'Riordain <seanpor_at_acm.org>

Date: Tue 23 May 2006 - 01:50:09 EST

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Tue May 23 02:27:25 2006

Thank you very much indeed Bogdan!

> a2[duplicated(a2$mdate),]
    value2      mdate
318      0 2006-05-10
322      0 2006-05-13
324      0 2006-05-14
326      0 2006-05-15
328      0 2006-05-16
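
The duplicate check above can be sketched on made-up data as follows. This is only an illustration: the contents of a2 here are hypothetical, and keeping the first row for each date is an assumption about which row is wanted, not something stated in the thread.

```r
# Hypothetical stand-in for one of the per-file data frames.
a2 <- data.frame(
  mdate  = as.Date(c("2006-05-09", "2006-05-10", "2006-05-10", "2006-05-11")),
  value2 = c(5, 0, 0, 7)
)

# Rows whose date has already appeared earlier in the frame.
dups <- a2[duplicated(a2$mdate), ]
print(dups)

# Keep only the first row for each date (an assumption, see above).
a2_unique <- a2[!duplicated(a2$mdate), ]

# anyDuplicated() returns 0 when no duplicates remain.
stopifnot(anyDuplicated(a2_unique$mdate) == 0)
```

Running the same check on every frame before merging shows which files carry the repeated dates.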

What a relief to know what is causing this problem... now to sort out the root cause!

cheers and thanks again!

Sean

On 22/05/06, bogdan romocea <br44114@gmail.com> wrote:

> Repeated merge()-ing does not always increase the space requirements
> linearly. Keep in mind that a join between two tables where the same
> value appears M and N times will produce M*N rows for that particular
> value. My guess is that the number of rows in atot explodes because
> you have some duplicate values in your files (having the same
> duplicate date in each data frame would cause atot to contain 4, then
> 8, 16, 32, 64... rows for that date).
>
>
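
A minimal sketch of the M*N effect described in the reply, using made-up frames in which the same date appears twice. The helper make_frame() and the values are hypothetical and exist only for this illustration.

```r
# Build a small frame in which the same date appears twice (hypothetical data).
make_frame <- function(i) {
  df <- data.frame(mdate = as.Date(c("2006-05-10", "2006-05-10")))
  df[[paste0("value", i)]] <- c(0, 0)
  df
}

atot <- make_frame(1)
row_counts <- nrow(atot)

# Each merge pairs every existing row for the date with both new rows,
# so the row count doubles on every pass.
for (i in 2:6) {
  atot <- merge(atot, make_frame(i), all = TRUE)
  row_counts <- c(row_counts, nrow(atot))
}
print(row_counts)
```

With a single duplicated date in every frame the row count doubles per merge, which fits the doubling of object.size(atot) reported in the original message.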
> > -----Original Message-----
> > From: r-help-bounces@stat.math.ethz.ch
> > [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Sean O'Riordain
> > Sent: Monday, May 22, 2006 10:12 AM
> > To: r-help
> > Subject: [R] win2k memory problem with merge()'ing repeatedly
> > (long email)
> >
> > Good afternoon,
> >
> > I have 63 small .csv files which I process daily, and until two
> > weeks ago they processed just fine, took only a matter of moments,
> > and showed no noticeable memory problems. Two weeks ago they
> > reached 318 lines and my script "broke". There are some
> > missing values in some of the files. I have tried hard many times
> > over the last two weeks to create a "small" repeatable example to give
> > you but I've failed - unless I use my data it works fine... :-(
> >
> > Am I missing something obvious? (again)
> >
> > A typical file has lines which look like:
> > 01/06/2005,1372
> >
> > Though there are three files which have two values (files 3,32,33) and
> > these have lines which look like...
> > 01/06/2005,1766,
> > or
> > 15/05/2006,289,114
> >
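
As an illustration of the file layout just described, the sketch below reads two lines of the three-column form from an inline string, so it runs without the real files. The column names first.value and second.value are hypothetical, and read.csv(text = ...) needs a reasonably recent R.

```r
# Hypothetical file contents, inlined so the sketch runs without the real files.
csv_text <- "01/06/2005,1766,\n15/05/2006,289,114"

# header = FALSE because the files have no header row; the trailing empty
# field on the first line is read as NA.
a3 <- read.csv(text = csv_text, header = FALSE,
               col.names = c("mdate", "first.value", "second.value"))

# Dates are day/month/year, so the format has to be given explicitly.
a3$mdate <- as.Date(a3$mdate, format = "%d/%m/%Y")
str(a3)
```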
> > a1 <- read.csv("file1.csv",header=F)
> > etc...
> > a63 <- read.csv("file63.csv",header=F)
> > names(a1) <- c("mdate","file1.column.description")
> >
> > atot <- merge(a1,a2,all=T)
> >
> > followed by repeatedly doing...
> > atot <- merge(atot, a3,all=T)
> > atot <- merge(atot, a4,all=T)
> > etc...
> >
> > I normally start R with --vanilla.
> >
> > What appears to happen is that atot doubles in size each iteration and
> > just falls over due to lack of memory at about i=17... even though the
> > total memory required for all of these individual a1...a63 is only
> > 1001384 bytes (doing an object.size() on a1..a63).
> > At this point I've been trying to pin down this problem for two weeks
> > and I just gave up...
> >
> > The following works fine as I'd expect with minimal memory usage...
> >
> > for (i in 3:67) {
> >   datelist <- as.Date(start.date)+0:(count-1)
> >   #remove a couple of elements...
> >   datelist <- datelist[-(floor(runif(nacount)*count))]
> >   a2 <- as.data.frame(datelist)
> >   names(a2) <- "mdate"
> >   vname <- paste("value", i, sep="")
> >   a2[vname] <- runif(length(datelist))
> >   #a2[floor(runif(nacount)*count), vname] <- NA
> >
> >   # atot <- merge(atot,a2,all=T)
> >   i <- 2
> >   a.eval.text <- paste("merge(atot, a", i, ", all=T)", sep="")
> >   cat("a.eval.text is: -", a.eval.text, "-\n", sep="")
> >   atot <- eval(parse(text=a.eval.text))
> >
> >   cat("i:", i, " ", gc(), "\n")
> > }
> >
> > this works fine... but on my files (as per attached 'lastsave.txt'
> > file) it just gobbles memory.
> > Am I doing something wrong? I (wrongly?) expected that repeatedly
> > doing merge(atot,aN) would only increase the memory requirement linearly
> > (with jumps perhaps as we go through a 2^n boundary)... which is what
> > happens when merging simulated data.frames as above... no problem at
> > all and it's really fast...
> >
> > The attached text file shows a (slightly edited) session where the
> > memory required by the merge() operation just doubles with each use...
> > and I can only allow it to run until i=17!!!
> >
> > I've even run it with gctorture() set on... with similar, but
> > excruciatingly slow results...
> >
> > Is there any relevant info that I'm missing? Unfortunately I am not
> > able to post the contents of the files to a public list like this...
> >
> > As per a previous thread, I know that I can use a list to handle these
> > dataframes - but I had difficulty with the syntax of a list of
> > dataframes...
> >
> > I'd like to know why the memory requirements for this merge
> > just explode...
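
One way the list-of-data-frames approach mentioned above might look, as a hedged sketch rather than the poster's actual script. The frames are simulated stand-ins for a1..a63, and dropping repeated dates by keeping the first occurrence is an assumption about how duplicates should be handled.

```r
# Hypothetical stand-ins for a1..a3; with real files the list could instead be
# built with lapply() over the file names, followed by the date conversion.
dates <- as.Date("2006-05-01") + 0:4
frames <- lapply(1:3, function(i) {
  df <- data.frame(mdate = dates)
  df[[paste0("value", i)]] <- i * seq_along(dates)
  df
})

# Drop repeated dates in each frame, keeping the first occurrence (assumption).
frames <- lapply(frames, function(df) df[!duplicated(df$mdate), ])

# Fold merge() over the list instead of building calls with eval(parse()).
atot <- Reduce(function(x, y) merge(x, y, by = "mdate", all = TRUE), frames)
dim(atot)
```

With unique dates in every frame, atot gains one column per merge while its row count stays bounded by the number of distinct dates.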
> >
> > cheers, (and thanks in advance!)
> > Sean O'Riordain
> >
> > ==============================
> > > version
> >                _
> > platform       i386-pc-mingw32
> > arch           i386
> > os             mingw32
> > system         i386, mingw32
> > status         Patched
> > major          2
> > minor          3.0
> > year           2006
> > month          05
> > day            09
> > svn rev        38014
> > language       R
> > version.string Version 2.3.0 Patched (2006-05-09 r38014)
> > >
> > Running on Win2k with 1Gb ram.
> >
> > I also tried it (with the same results) on 2.2.1 and 2.3.0.
> >
> > ========================================================
**> >
> > R : Copyright 2006, The R Foundation for Statistical Computing
> > Version 2.3.0 Patched (2006-05-09 r38014)
> > ISBN 3-900051-07-0
> >
> > R is free software and comes with ABSOLUTELY NO WARRANTY.
> > You are welcome to redistribute it under certain conditions.
> > Type 'license()' or 'licence()' for distribution details.
> >
> > Natural language support but running in an English locale
> >
> > R is a collaborative project with many contributors.
> > Type 'contributors()' for more information and
> > 'citation()' on how to cite R or R packages in publications.
> >
> > Type 'demo()' for some demos, 'help()' for on-line help, or
> > 'help.start()' for an HTML browser interface to help.
> > Type 'q()' to quit R.
> >
> > > gc()
> >          used (Mb) gc trigger (Mb) max used (Mb)
> > Ncells 178186  4.8     407500 10.9   350000  9.4
> > Vcells  73112  0.6     786432  6.0   333585  2.6
> > > # take the information in the .csv files created from the emails
> > > setwd("C:/Documents and Settings/c_oriordain_s/My Documents/pasip/mms/mms_emails")
> > >
> > > # the input file from Amdocs (as supplied by revenue assurance)
> > > amdocs_csv_filename <- "amdocs_volumes_revised4.csv"
> > > # where shall we put the output plot file
> > > copypath <- "\\\\ient1dfs001\\general\\Process Improvement Projects\\Process Improvement Projects Repository\\Active Projects\\MMS\\01 Measure\\"
> > >
> > > # set to F (false) instead of T (true) if you're just tricking around and you don't
> > > # want to be copying over files to the network drive all the time!
> > > do.copy <- F
> > >
> > > # HOPEFULLY you shouldn't have to trick around with stuff below here!
> > > #
> >
> > # EDIT file names changed to protect the innocent... :-)
> >
> > > a1 <-read.csv("file1.csv",header=F)
> > #EDIT etc... all the way to
> > > a63 <-read.csv("file63.csv", header=F)
> > >
> > > # now delete the now irrelevant initial date column for all 63 of these temporary objects...
> > > for (i in 1:63) {
> > + # e.g. should look like a63$mdate <- as.Date(a63$V1,format="%d/%m/%Y")
> > + anum <- paste("a",i,sep="")
> > + eval(parse(text= paste(anum, "$mdate <- as.Date(" ,anum, "$V1,format=\"%d/%m/%Y\")",sep="") ))
> > + }
> > >
> > >
> > > # three files have three columns...
> >
> > #EDIT here again... to protect the innocent...
> >
> > > names(a3)[3] <- "2nd.column.name.in.file.3"
> > > names(a32)[3] <- "2nd.column.name.in.file.32"
> > > names(a33)[3] <- "2nd.column.name.in.file.33"
> > >
> > > # the rest only have two columns...
> > >
> > > names(a1)[2] <- "title.1"
> > #EDIT
> > > names(a63)[2] <- "title.63"
> > >
> > > for (i in 1:63) {
> > + # now delete the now irrelevant initial date column for all 63 of these temporary objects...
> > + # e.g. should look like a33[1] <- NULL
> > + eval(parse(text=paste("a",i,"[1] <- NULL",sep="")))
> > + }
> > >
> > > a.object.sizes <- vector()
> > > for (i in 1:63) {
> > + # now delete these 63 temporary objects...
> > + # e.g. should look like rm(a33)
> > + a.name <- paste("a", i, sep="")
> > + # a.object.sizes[i] <- object.size(a.name)
> > + a.object.sizes[i] <- eval(parse(text=paste("object.size(",a.name,")", sep="")))
> > + }
> > >
> > > a.object.sizes
> >  [1] 17988 17996 19524 17996 17996 18004 17996 18028 17988 17988 17996 17996 17996 18012 18012 17988 17980 18004 18004
> > [20] 18012 19348 19316 19340 17996 18004 18004 18012 18004 19228 19228 18012 19436 19436 19244 19220 17996 17900 17900
> > [39] 17884 17884 17884 17884 17884 17884 17876 17988 17900 17892  8808 17988  8792  8800  8800  8792  8800  8784 17980
> > [58] 17988 17980  9832  9728  9728  9728
> > >
> > > # merge these tables into one big dataframe...
> > > atot <- merge(a1, a2, all=T)
> > > for (i in 3:17) {
> > + # construct the text to be evaluated...
> > + #atot <- merge(atot, a3, all=T)
> > + cat("The size of object a", i, " is ", a.object.sizes[i], "\n", sep="")
> > + cat("The current size of atot is ", object.size(atot), "\n")
> > + a.eval.text <- paste("merge(atot, a", i, ", all=T)", sep="")
> > + cat("a.eval.text is: -", a.eval.text, "-\n", sep="")
> > + atot <- eval(parse(text=a.eval.text))
> > + cat("i is:", i, gc(), "\n\n")
> > + }
> > The size of object a3 is 19524
> > The current size of atot is 19988
> > a.eval.text is: -merge(atot, a3, all=T)-
> > i is: 3 206289 137020 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a4 is 17996
> > The current size of atot is 24300
> > a.eval.text is: -merge(atot, a4, all=T)-
> > i is: 4 206330 137402 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a5 is 17996
> > The current size of atot is 28564
> > a.eval.text is: -merge(atot, a5, all=T)-
> > i is: 5 206411 138044 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a6 is 18004
> > The current size of atot is 36044
> > a.eval.text is: -merge(atot, a6, all=T)-
> > i is: 6 206572 139246 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a7 is 17996
> > The current size of atot is 50236
> > a.eval.text is: -merge(atot, a7, all=T)-
> > i is: 7 206893 141652 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a8 is 18028
> > The current size of atot is 78516
> > a.eval.text is: -merge(atot, a8, all=T)-
> > i is: 8 207534 146614 5.6 1.2 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a9 is 17988
> > The current size of atot is 136252
> > a.eval.text is: -merge(atot, a9, all=T)-
> > i is: 9 208815 157016 5.6 1.2 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a10 is 17988
> > The current size of atot is 255404
> > a.eval.text is: -merge(atot, a10, all=T)-
> > i is: 10 211376 178938 5.7 1.4 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a11 is 17996
> > The current size of atot is 502540
> > a.eval.text is: -merge(atot, a11, all=T)-
> > i is: 11 216497 225184 5.8 1.8 467875 889825 12.5 6.8 362507 888747 9.7 6.8
> >
> > The size of object a12 is 17996
> > The current size of atot is 1015940
> > a.eval.text is: -merge(atot, a12, all=T)-
> > i is: 12 226738 322626 6.1 2.5 531268 1577138 14.2 12.1 362507 1569929 9.7 12
> >
> > The size of object a13 is 17996
> > The current size of atot is 2082284
> > a.eval.text is: -merge(atot, a13, all=T)-
> > i is: 13 247219 527588 6.7 4.1 597831 2209110 16 16.9 362507 2749247 9.7 21
> >
> > The size of object a14 is 18012
> > The current size of atot is 4295524
> > a.eval.text is: -merge(atot, a14, all=T)-
> > i is: 14 288180 957830 7.7 7.4 741108 4242831 19.8 32.4 494389 5296330 13.3 40.5
> >
> > The size of object a15 is 18012
> > The current size of atot is 8884444
> > a.eval.text is: -merge(atot, a15, all=T)-
> > i is: 15 370101 1859128 9.9 14.2 1073225 8314706 28.7 63.5 781279 10388430 20.9 79.3
> >
> > The size of object a16 is 17988
> > The current size of atot is 18388580
> > a.eval.text is: -merge(atot, a16, all=T)-
> > i is: 16 533942 3743450 14.3 28.6 1590760 17263040 42.5 131.8 1354559 21430459 36.2 163.6
> >
> > The size of object a17 is 17980
> > The current size of atot is 38050756
> > a.eval.text is: -merge(atot, a17, all=T)-
> > i is: 17 861623 7675772 23.1 58.6 3094291 35309607 82.7 269.4 2501382 44137010 66.8 336.8
