Re: [Rd] Severe memory problem using split()

From: Martin Morgan <mtmorgan_at_fhcrc.org>
Date: Mon, 12 Jul 2010 15:31:46 -0700

On 07/12/2010 03:00 PM, cstrato wrote:

> Dear Martin,
> 
> Thank you, you are right, now I get:
> 
>> ann <- read.delim("Hu6800_ann.txt", stringsAsFactors=FALSE)
>> object.size(ann)
> 2035952 bytes
>> u2p  <- split(ann[,"ProbesetID"],ann[,"UNIT_ID"])
>> object.size(u2p)
> 1207368 bytes
>> object.size(unlist(u2p))
> 865176 bytes
> 
> Nevertheless, a size of 1.2MB for a list representing 2 of 11 columns of

but it's a list of length(unique(ann[["UNIT_ID"]])) elements, and each element costs a pointer to the element, a pointer to the name of the element, and the element data itself. I'd guess it adds up in a non-mysterious way. For a sense of it (probably only understandable if you have a working knowledge of how R represents data), see, e.g.,

> .Internal(inspect(list(x=1,y=2)))

@1a4c538 19 VECSXP g0c2 [ATT] (len=2, tl=0)
  @191cad8 14 REALSXP g0c1 [] (len=1, tl=0) 1
  @191caa8 14 REALSXP g0c1 [] (len=1, tl=0) 2
ATTRIB:
  @16fc8d8 02 LISTSXP g0c0 []
    TAG: @60cf18 01 SYMSXP g0c0 [MARK,NAM(2),gp=0x4000] "names"
    @1a4c500 16 STRSXP g0c2 [] (len=2, tl=0)
      @674e88 09 CHARSXP g0c1 [MARK,gp=0x21] "x"
      @728c38 09 CHARSXP g0c1 [MARK,gp=0x21] "y"
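
To put a rough number on that per-element cost, here is a minimal sketch. The IDs are made up for illustration (they are not the real Hu6800 annotation), and object.size() is only an estimate, so treat the result as indicative rather than exact:

```r
# Illustrative data: n made-up probe set IDs, one per unit.
n   <- 5000
ids <- sprintf("probe_%05d_at", seq_len(n))

# A named list of n one-element character vectors, as split() produces it.
as_list <- split(ids, seq_len(n))

# The same strings held in a single flat character vector.
as_vec <- unlist(as_list, use.names = FALSE)

size_list <- as.numeric(object.size(as_list))
size_vec  <- as.numeric(object.size(as_vec))

# Estimated extra bytes per element for the list representation:
# one vector header per element plus one name per element.
(size_list - size_vec) / n
```

The difference comes from every list element being a full R vector with its own header, plus the names attribute that inspect() shows above.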

Martin

> a table of size 754KB seems still to be pretty large?
> 
> Best regards
> Christian
> 
> 
> On 7/12/10 11:44 PM, Martin Morgan wrote:
>> On 07/12/2010 01:45 PM, cstrato wrote:

>>> Dear all,
>>>

>>> With great interest I followed the discussion:
>>> https://stat.ethz.ch/pipermail/r-devel/2010-July/057901.html
>>> since I currently have a similar problem:
>>>

>>> In a new R session (using xterm) I am importing a simple table
>>> "Hu6800_ann.txt" which has a size of 754KB only:
>>>
>>>> ann<- read.delim("Hu6800_ann.txt")
>>>> dim(ann)

>>> [1] 7129 11
>>>
>>>

>>> When I call "object.size(ann)" the estimated memory used to store "ann"
>>> is already 2MB:
>>>
>>>> object.size(ann)

>>> 2034784 bytes
>>>
>>>

>>> Now I call "split()" and check the estimated memory used which turns out
>>> to be 3.3GB:
>>>
>>>> u2p<- split(ann[,"ProbesetID"],ann[,"UNIT_ID"])
>>>> object.size(u2p)

>>> 3323768120 bytes
>>
>> I guess things improve with stringsAsFactors=FALSE in read.delim?
>>
>> Martin
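
A likely reason, sketched under assumptions: with the default stringsAsFactors=TRUE the ProbesetID column is a factor, split() keeps the complete set of levels on every element of the result, and object.size() does not detect that those levels are shared, so it counts them once per element. The data below are hypothetical stand-ins, not the real table:

```r
# Hypothetical stand-in for the annotation table: n unique IDs, one per unit.
n     <- 1000
ids   <- sprintf("probe_%05d_at", seq_len(n))
units <- seq_len(n)

by_factor <- split(factor(ids), units)  # column read with stringsAsFactors=TRUE
by_char   <- split(ids, units)          # column read with stringsAsFactors=FALSE

# Each one-element factor in the result still carries all n levels.
length(levels(by_factor[[1]]))

size_factor <- as.numeric(object.size(by_factor))
size_char   <- as.numeric(object.size(by_char))

# Ratio of the two estimates; the factor version is many times larger.
size_factor / size_char
```

Because the reported size grows roughly with the square of the number of levels, a few thousand unique IDs are enough to reach the gigabyte range in the object.size() estimate.
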
>>
>>>

>>> During the R session I am running "top" in another xterm and can see
>>> that the memory usage of R increases to about 550MB RSIZE.
>>>
>>>

>>> Now I do:
>>>
>>>> object.size(unlist(u2p))

>>> 894056 bytes
>>>

>>> It takes about 3 minutes to complete this call and the memory usage of R
>>> increases to about 1.3GB RSIZE. Furthermore, during evaluation of this
>>> function the free RAM of my Mac decreases to less than 8MB free PhysMem,
>>> until it needs to swap memory. When finished, free PhysMem is 734MB but
>>> the size of R increased to 577MB RSIZE.
>>>

>>> Doing "split(ann[,"ProbesetID"],ann[,"UNIT_ID"],drop=TRUE)" did not
>>> change the object.size; it only made processing faster and used less
>>> memory on my Mac.
>>>

>>> Do you have any idea what the reason for this behavior is?
>>> Why is the size of list "u2p" so large?
>>> Am I making a mistake?
>>>
>>>

>>> Here is my sessionInfo on a MacBook Pro with 2GB RAM:
>>>
>>>> sessionInfo()

>>> R version 2.11.1 (2010-05-31)
>>> x86_64-apple-darwin9.8.0
>>>

>>> locale:
>>> [1] C
>>>

>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>

>>> Best regards
>>> Christian
>>> _._._._._._._._._._._._._._._._._._
>>> C.h.r.i.s.t.i.a.n S.t.r.a.t.o.w.a
>>> V.i.e.n.n.a A.u.s.t.r.i.a
>>> e.m.a.i.l: cstrato at aon.at
>>> _._._._._._._._._._._._._._._._._._
>>>

>>> ______________________________________________
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

Received on Mon 12 Jul 2010 - 22:38:04 GMT

