Re: [Rd] Severe memory problem using split()

From: cstrato <cstrato_at_aon.at>
Date: Tue, 13 Jul 2010 21:05:27 +0200

Dear Martin,

Thank you for this explanation.

Best regards
Christian

On 7/13/10 12:31 AM, Martin Morgan wrote:
> On 07/12/2010 03:00 PM, cstrato wrote:
>> Dear Martin,
>>
>> Thank you, you are right, now I get:
>>
>>> ann<- read.delim("Hu6800_ann.txt", stringsAsFactors=FALSE)
>>> object.size(ann)
>> 2035952 bytes
>>> u2p<- split(ann[,"ProbesetID"],ann[,"UNIT_ID"])
>>> object.size(u2p)
>> 1207368 bytes
>>> object.size(unlist(u2p))
>> 865176 bytes
>>
>> Nevertheless, a size of 1.2MB for a list representing 2 of 11 columns of
>
> but it's a list of length(unique(ann[["UNIT_ID"]])) elements, each of
> which has a pointer to the element, a pointer to the name of the
> element, and the element data itself. I'd guess it adds up in a
> non-mysterious way. For a sense of it (and maybe only understandable if
> you have a working understanding of how R represents data) see, e.g.,
>
>> .Internal(inspect(list(x=1,y=2)))
> @1a4c538 19 VECSXP g0c2 [ATT] (len=2, tl=0)
>   @191cad8 14 REALSXP g0c1 [] (len=1, tl=0) 1
>   @191caa8 14 REALSXP g0c1 [] (len=1, tl=0) 2
> ATTRIB:
>   @16fc8d8 02 LISTSXP g0c0 []
>     TAG: @60cf18 01 SYMSXP g0c0 [MARK,NAM(2),gp=0x4000] "names"
>     @1a4c500 16 STRSXP g0c2 [] (len=2, tl=0)
>       @674e88 09 CHARSXP g0c1 [MARK,gp=0x21] "x"
>       @728c38 09 CHARSXP g0c1 [MARK,gp=0x21] "y"
>
> Martin
>
>> a table of size 754KB seems still to be pretty large?
>>
>> Best regards
>> Christian
>>
>>
>> On 7/12/10 11:44 PM, Martin Morgan wrote:
>>> On 07/12/2010 01:45 PM, cstrato wrote:
>>>> Dear all,
>>>>
>>>> With great interest I followed the discussion:
>>>> https://stat.ethz.ch/pipermail/r-devel/2010-July/057901.html
>>>> since I currently have a similar problem:
>>>>
>>>> In a new R session (using xterm) I am importing a simple table
>>>> "Hu6800_ann.txt" which has a size of 754KB only:
>>>>
>>>>> ann<- read.delim("Hu6800_ann.txt")
>>>>> dim(ann)
>>>> [1] 7129 11
>>>>
>>>>
>>>> When I call "object.size(ann)" the estimated memory used to store "ann"
>>>> is already 2MB:
>>>>
>>>>> object.size(ann)
>>>> 2034784 bytes
>>>>
>>>>
>>>> Now I call "split()" and check the estimated memory used which turns out
>>>> to be 3.3GB:
>>>>
>>>>> u2p<- split(ann[,"ProbesetID"],ann[,"UNIT_ID"])
>>>>> object.size(u2p)
>>>> 3323768120 bytes
>>>
>>> I guess things improve with stringsAsFactors=FALSE in read.delim?
>>>
>>> Martin
>>>
>>>>
>>>> During the R session I am running "top" in another xterm and can see
>>>> that the memory usage of R increases to about 550MB RSIZE.
>>>>
>>>>
>>>> Now I do:
>>>>
>>>>> object.size(unlist(u2p))
>>>> 894056 bytes
>>>>
>>>> It takes about 3 minutes to complete this call and the memory usage of R
>>>> increases to about 1.3GB RSIZE. Furthermore, during evaluation of this
>>>> function the free RAM of my Mac decreases to less than 8MB free PhysMem,
>>>> until it needs to swap memory. When finished, free PhysMem is 734MB but
>>>> the size of R increased to 577MB RSIZE.
>>>>
>>>> Doing "split(ann[,"ProbesetID"],ann[,"UNIT_ID"],drop=TRUE)" did not
>>>> change the object.size; the call only ran faster and used less
>>>> memory on my Mac.
>>>>
>>>> Do you have any idea what the reason for this behavior is?
>>>> Why is the size of list "u2p" so large?
>>>> Am I making a mistake?
>>>>
>>>>
>>>> Here is my sessionInfo on a MacBook Pro with 2GB RAM:
>>>>
>>>>> sessionInfo()
>>>> R version 2.11.1 (2010-05-31)
>>>> x86_64-apple-darwin9.8.0
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> Best regards
>>>> Christian
>>>> _._._._._._._._._._._._._._._._._._
>>>> C.h.r.i.s.t.i.a.n S.t.r.a.t.o.w.a
>>>> V.i.e.n.n.a A.u.s.t.r.i.a
>>>> e.m.a.i.l: cstrato at aon.at
>>>> _._._._._._._._._._._._._._._._._._
>>>>
>>>> ______________________________________________
>>>> R-devel_at_r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>>
>
>
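
A possible reading of the numbers in this thread, sketched with synthetic data. The IDs, the value of n, and all variable names below are made up for illustration and are not taken from Hu6800_ann.txt. The assumption is that, with the stringsAsFactors=TRUE default of R 2.11.1, ProbesetID was read as a factor: split() then returns a list of factors, each of which keeps the full set of levels, and object.size() counts those levels once per list element.

```r
# Synthetic stand-in for the annotation table (NOT the real Hu6800 data):
# n unique probe-set IDs, one per unit.
n <- 1000
probes <- sprintf("probe_%05d_at", seq_len(n))
units  <- seq_len(n)

# Case 1: IDs held as a factor (what read.delim() produced by default
# in R versions before 4.0.0). Each list element is a length-1 factor
# that still carries all n levels.
by_unit_factor <- split(factor(probes), units)
nlevels(by_unit_factor[[1]])

# Case 2: IDs held as plain character strings (stringsAsFactors = FALSE).
by_unit_char <- split(probes, units)

size_factor <- as.numeric(object.size(by_unit_factor))
size_char   <- as.numeric(object.size(by_unit_char))
size_flat   <- as.numeric(object.size(unlist(by_unit_char)))

# The factor version is reported as far larger than the character version;
# the character list is still somewhat larger than the flat vector because
# every element is a separate R vector with its own header.
print(c(factor = size_factor, character = size_char, flat = size_flat))
```

Note that ?object.size states that the function does not detect when elements of a list are shared, so the figure reported for the list of factors may overstate the memory actually allocated. The sketch only shows why the reported size grows roughly with the square of the number of IDs in the factor case.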



Received on Tue 13 Jul 2010 - 19:07:32 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 14 Jul 2010 - 06:00:14 GMT.

