Re: [Rd] [R] data.frame() size

From: Hin-Tak Leung <hin-tak.leung_at_cimr.cam.ac.uk>
Date: Fri 09 Dec 2005 - 18:40:33 GMT

Gabor Grothendieck wrote:
> There was nothing attached in the copy that came through
> to me.

I would like to see that patch too.

> By the way, there was some discussion earlier this year
> on a light-weight data.frame class but I don't think anyone
> ever posted any code.

It may have been me. I am working on a bit-packed data.frame which uses only 2 bits per unit of data, so 4 units fit in each byte of a RAWSXP (work in progress, nothing to show yet).
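To make the packing idea concrete, here is a minimal sketch in plain C of storing four 2-bit values per byte. It is not the work-in-progress code mentioned above; the names packed_len, pack2_set and pack2_get are made up for illustration, and the buffer is an ordinary byte array standing in for the data area of a raw vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to hold n 2-bit values: 4 values per byte, rounded up. */
size_t packed_len(size_t n)
{
    return (n + 3) / 4;
}

/* Store a value in the range 0..3 at position i of the packed buffer. */
void pack2_set(uint8_t *buf, size_t i, unsigned value)
{
    assert(value < 4);                        /* only 2 bits are available */
    unsigned shift = (unsigned)(i % 4) * 2;   /* bit offset within the byte */
    buf[i / 4] = (uint8_t)((buf[i / 4] & ~(3u << shift)) | (value << shift));
}

/* Read back the 2-bit value stored at position i. */
unsigned pack2_get(const uint8_t *buf, size_t i)
{
    unsigned shift = (unsigned)(i % 4) * 2;
    return (buf[i / 4] >> shift) & 3u;
}
```

A caller would allocate packed_len(n) zeroed bytes and then go through pack2_set and pack2_get by index. The low-bits-first ordering inside each byte is an arbitrary choice made for this sketch.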

So I am very interested to see the patch.

Yes, I spent a couple of weeks reading and learning where all the memory goes in a data.frame. The row name/column name allocation is a bit stupid. Each row name and each column name (each element of the STRSXP) is a full R object, so there is a 32 (or 28) byte overhead just from managing that, before the actual string data, which is another X bytes. So for a 1 x N data.frame with integers for content, the content is 4 bytes * N, but the row names/column names cost roughly 32 bytes * N, which makes the total about 9 times the size of the content alone. A word is 32 bits on most people's machines, and I am counting one extra word because the address of each SEXPREC has to be kept somewhere, so it is 7 + 1 = 8 words (32 bytes), if I understand it correctly.
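The arithmetic in the paragraph above can be written out as a small C sketch. The constants are the figures quoted in this message for a 32-bit build (a 7-word SEXPREC plus 1 word for the pointer to it, and 4-byte integers); they are assumptions rather than measurements, and the function names are hypothetical.

```c
#include <stddef.h>

/* Figures taken from the message; assumed values for a 32-bit build. */
enum {
    WORD_BYTES    = 4,  /* one 32-bit word */
    HEADER_WORDS  = 7,  /* size of a standard SEXPREC in words */
    POINTER_WORDS = 1,  /* the slot that holds the SEXPREC's address */
    INT_BYTES     = 4   /* one integer cell of content */
};

/* Bookkeeping bytes for n names, not counting the string characters. */
size_t name_overhead_bytes(size_t n)
{
    return n * (HEADER_WORDS + POINTER_WORDS) * WORD_BYTES;
}

/* Bytes of integer content for n cells. */
size_t content_bytes(size_t n)
{
    return n * INT_BYTES;
}

/* Total size relative to the content alone; n must be greater than 0. */
size_t size_ratio(size_t n)
{
    return (content_bytes(n) + name_overhead_bytes(n)) / content_bytes(n);
}
```

Under these assumptions the ratio does not depend on N, because both the content and the name overhead grow linearly with the number of names. The string characters themselves (the "X bytes") are deliberately left out, so this is a lower bound on the overhead.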

Here is the relevant comment, quoted verbatim from around line 225 of "src/include/Rinternals.h":

/* The generational collector uses a reduced version of SEXPREC as a
   header in vector nodes.  The layout MUST be kept consistent with
   the SEXPREC definition.  The standard SEXPREC takes up 7 words on
   most hardware; this reduced version should take up only 6 words.
   In addition to slightly reducing memory use, this can lead to more
   favorable data alignment on 32-bit architectures like the Intel
   Pentium III where odd word alignment of doubles is allowed but much
   less efficient than even word alignment. */

Hin-Tak Leung

> On 12/9/05, Matthew Dowle <mdowle@concordiafunds.com> wrote:
> 

>>Hi,
>>
>>Please see below for post on r-help regarding data.frame() and the
>>possibility of dropping rownames, for space and time reasons.
>>I've made some changes, attached, and it seems to be working well. I see the
>>expected space (90% saved) and time (10 times faster) savings. There are no
>>doubt some bugs, and it needs more work and testing, but I thought I would post
>>first at this stage.
>>
>>Could some changes along these lines be made to R ? I'm happy to help with
>>testing and further work if required. In the meantime I can work with
>>overloaded functions, which fix the problems in my case.
>>
>>Functions affected:
>>
>> dim.data.frame
>> format.data.frame
>> print.data.frame
>> data.frame
>> [.data.frame
>> as.matrix.data.frame
>>
>>Modified source code attached.
>>
>>Regards,
>>Matthew
>>
>>
>>-----Original Message-----
>>From: Matthew Dowle
>>Sent: 09 December 2005 09:44
>>To: 'Peter Dalgaard'
>>Cc: 'r-help@stat.math.ethz.ch'
>>Subject: RE: [R] data.frame() size
>>
>>
>>
>>That explains it. Thanks. I don't need rownames though, as I'll only ever
>>use integer subscripts. Is there any way to drop them, or even better not
>>create them in the first place? The memory saved (90%) by not having them
>>and 10 times speed up would be very useful. I think I need a data.frame
>>rather than a matrix because I have columns of different types in real life.
>>
>>
>>>rownames(d) = NULL
>>
>>Error in "dimnames<-.data.frame"(`*tmp*`, value = list(NULL, c("a", "b" :
>> invalid 'dimnames' given for data frame
>>
>>
>>-----Original Message-----
>>From: pd@pubhealth.ku.dk [mailto:pd@pubhealth.ku.dk] On Behalf Of Peter
>>Dalgaard
>>Sent: 08 December 2005 18:57
>>To: Matthew Dowle
>>Cc: 'r-help@stat.math.ethz.ch'
>>Subject: Re: [R] data.frame() size
>>
>>
>>Matthew Dowle <mdowle@concordiafunds.com> writes:
>>
>>
>>>Hi,
>>>
>>>In the example below why is d 10 times bigger than m, according to
>>>object.size ? It also takes around 10 times as long to create, which
>>>fits with object.size() being truthful. gcinfo(TRUE) also indicates a
>>>great deal more garbage collector activity caused by data.frame() than
>>>matrix().
>>>
>>>$ R --vanilla
>>>....
>>>
>>>>nr = 1000000
>>>>system.time(m<<-matrix(integer(1), nrow=nr, ncol=2))
>>>
>>>[1] 0.22 0.01 0.23 0.00 0.00
>>>
>>>>system.time(d<<-data.frame(a=integer(nr), b=integer(nr)))
>>>
>>>[1] 2.81 0.20 3.01 0.00 0.00 # 10 times longer
>>>
>>>
>>>>dim(m)
>>>
>>>[1] 1000000 2
>>>
>>>>dim(d)
>>>
>>>[1] 1000000 2 # same dimensions
>>>
>>>
>>>>storage.mode(m)
>>>
>>>[1] "integer"
>>>
>>>>sapply(d, storage.mode)
>>>
>>> a b
>>>"integer" "integer" # same storage.mode
>>>
>>>
>>>>object.size(m)/1024^2
>>>
>>>[1] 7.629616
>>>
>>>>object.size(d)/1024^2
>>>
>>>[1] 76.29482 # but 10 times bigger
>>>
>>>
>>>>sum(sapply(d, object.size))/1024^2
>>>
>>>[1] 7.629501 # or is it? If it's not
>>>really 10 times bigger, why 10 times longer above?
>>
>>Row names!!
>>
>>
>>
>>>r <- as.character(1:1e6)
>>>object.size(r)
>>
>>[1] 72000056
>>
>>>object.size(r)/1024^2
>>
>>[1] 68.6646
>>
>>'nuff said?
>>
>>--
>>   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>>  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>> (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
>>~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)                  FAX: (+45) 35327907
>>
>>
>>
>>
>>______________________________________________
>>R-devel@r-project.org mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
>>

Received on Sat Dec 10 05:49:48 2005

This archive was generated by hypermail 2.1.8 : Fri 09 Dec 2005 - 21:21:08 GMT