Re: [Rd] [R] data.frame() size

From: Liaw, Andy <andy_liaw_at_merck.com>
Date: Fri 09 Dec 2005 - 19:13:49 GMT


I believe Gabor was referring to this:

http://tolstoy.newcastle.edu.au/R/devel/05/05/0837.html

Andy

From: Hin-Tak Leung
>
> Gabor Grothendieck wrote:
> > There was nothing attached in the copy that came through
> > to me.

>
> I would like to see that patch also.
>
> > By the way, there was some discussion earlier this year
> > on a light-weight data.frame class but I don't think anyone
> > ever posted any code.

>
> It may have been me. I am working on a bit-packed data.frame
> which uses only 2 bits per unit of data, so 4 units fit in each
> byte of a RAWSXP (work in progress, nothing to show).
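
To make the 2-bit idea concrete, here is a minimal sketch in plain R. It is only an illustration and not Hin-Tak's code, which was never posted: the helper names pack2bit() and unpack2bit() are made up, and the values are assumed to be integers in the range 0..3.

```r
# Hypothetical helpers, for illustration only: store values in 0..3
# using 2 bits each, i.e. 4 values per byte of a raw vector.
pack2bit <- function(x) {
  stopifnot(all(x >= 0L & x <= 3L))
  n <- length(x)
  # pad with zeros so the length is a multiple of 4
  padded <- c(as.integer(x), integer((4L - n %% 4L) %% 4L))
  m <- matrix(padded, nrow = 4L)          # one column per output byte
  # value k of each group goes into bits 2*(k-1) and 2*(k-1)+1
  bytes <- m[1L, ] + 4L * m[2L, ] + 16L * m[3L, ] + 64L * m[4L, ]
  structure(as.raw(bytes), n = n)         # remember the unpadded length
}

unpack2bit <- function(p) {
  b <- as.integer(p)                      # bytes as integers 0..255
  # pull out the four 2-bit fields of every byte, lowest bits first
  m <- rbind(b %% 4L, (b %/% 4L) %% 4L, (b %/% 16L) %% 4L, b %/% 64L)
  as.vector(m)[seq_len(attr(p, "n"))]
}

x <- c(0L, 3L, 1L, 2L, 2L, 1L)
p <- pack2bit(x)
identical(unpack2bit(p), x)   # round trip recovers the input
length(p)                     # 6 values fit in 2 bytes
```

A real implementation would do this in C on the RAWSXP directly; the R version only shows the layout.
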
>
> So I am very interested to see the patch.
>
> Yes, I spent a couple of weeks reading and learning where all the
> memory goes in a data.frame. The rowname/column name allocation is
> a bit stupid. Each rowname and each column name is a full
> R object, so there is a 32 (or 28) byte overhead just for managing
> it, before the character data of the string itself (a CHARSXP),
> which is another X bytes.
> So for a 1 x N data.frame with integer content, the content is
> 4 bytes * N, but the rownames/column names are 32 * N -ish, so the
> total is roughly 36 * N (a 9x increase). A word is 32 bits on most
> people's machines, and I am counting the extra word needed to keep
> the address of each SEXPREC somewhere, so it is 7+1 = 8 words, if I
> understand it correctly.
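
The per-element arithmetic above can be checked empirically. The sketch below is a rough one: the exact figures depend on the R version and on whether the build is 32-bit or 64-bit, so no expected numbers are shown, and bytes_per_element() is a made-up helper rather than anything in R.

```r
# Rough check of per-element cost: integers versus distinct strings.
# bytes_per_element() is a hypothetical helper, not part of R.
bytes_per_element <- function(x) as.numeric(object.size(x)) / length(x)

n <- 100000L
ints <- integer(n)                  # stands in for the data.frame content
labels <- as.character(seq_len(n))  # what character row names look like

bytes_per_element(ints)    # about 4 bytes each, plus a small fixed header
bytes_per_element(labels)  # pointer + per-string header + characters

# ratio of name storage to content storage
bytes_per_element(labels) / bytes_per_element(ints)
```
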
>
> Here is the relevant comment, quoted verbatim from around line 225 of
> "src/include/Rinternals.h":
>
> /* The generational collector uses a reduced version of SEXPREC as a
> header in vector nodes. The layout MUST be kept consistent with
> the SEXPREC definition. The standard SEXPREC takes up 7 words on
> most hardware; this reduced version should take up only 6 words.
> In addition to slightly reducing memory use, this can lead to more
> favorable data alignment on 32-bit architectures like the Intel
> Pentium III where odd word alignment of doubles is allowed but much
> less efficient than even word alignment. */
>
> Hin-Tak Leung
>
> > On 12/9/05, Matthew Dowle <mdowle@concordiafunds.com> wrote:
> >
> >>Hi,
> >>
> >>Please see below for a post on r-help regarding data.frame() and the
> >>possibility of dropping rownames, for space and time reasons.
> >>I've made some changes, attached, and they seem to be working well.
> >>I see the expected space (90% saved) and time (10 times faster)
> >>savings. There are no doubt some bugs, and it needs more work and
> >>testing, but I thought I would post first at this stage.
> >>
> >>Could some changes along these lines be made to R? I'm happy to help
> >>with testing and further work if required. In the meantime I can work
> >>with overloaded functions, which fix the problems in my case.
> >>
> >>Functions affected:
> >>
> >> dim.data.frame
> >> format.data.frame
> >> print.data.frame
> >> data.frame
> >> [.data.frame
> >> as.matrix.data.frame
> >>
> >>Modified source code attached.
> >>
> >>Regards,
> >>Matthew
> >>
> >>
> >>-----Original Message-----
> >>From: Matthew Dowle
> >>Sent: 09 December 2005 09:44
> >>To: 'Peter Dalgaard'
> >>Cc: 'r-help@stat.math.ethz.ch'
> >>Subject: RE: [R] data.frame() size
> >>
> >>
> >>
> >>That explains it. Thanks. I don't need rownames though, as I'll only
> >>ever use integer subscripts. Is there any way to drop them, or even
> >>better not create them in the first place? The memory saved (90%) by
> >>not having them and the 10 times speed-up would be very useful. I think
> >>I need a data.frame rather than a matrix because I have columns of
> >>different types in real life.
> >>
> >>
> >>>rownames(d) = NULL
> >>
> >>Error in "dimnames<-.data.frame"(`*tmp*`, value = list(NULL, c("a", "b" :
> >> invalid 'dimnames' given for data frame
> >>
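
For readers on a current version of R: automatic row names are now stored in a compact internal form rather than as one string per row, and assigning NULL resets them to that form instead of raising the error above. A small sketch, assuming a reasonably recent R; .row_names_info() is in base R and returns a negative count when the row names are automatic.

```r
n <- 100000L
d <- data.frame(a = integer(n), b = integer(n))

.row_names_info(d)                  # negative: automatic (compact) row names
size_auto <- object.size(d)

rownames(d) <- paste0("r", seq_len(n))   # explicit character row names
.row_names_info(d)                  # positive: real row names are stored
size_named <- object.size(d)

rownames(d) <- NULL                 # back to automatic row names
.row_names_info(d)

# how many times larger the data.frame is with character row names
as.numeric(size_named) / as.numeric(size_auto)
```
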
> >>
> >>-----Original Message-----
> >>From: pd@pubhealth.ku.dk [mailto:pd@pubhealth.ku.dk] On Behalf Of
> >>Peter Dalgaard
> >>Sent: 08 December 2005 18:57
> >>To: Matthew Dowle
> >>Cc: 'r-help@stat.math.ethz.ch'
> >>Subject: Re: [R] data.frame() size
> >>
> >>
> >>Matthew Dowle <mdowle@concordiafunds.com> writes:
> >>
> >>
> >>>Hi,
> >>>
> >>>In the example below why is d 10 times bigger than m, according to
> >>>object.size ? It also takes around 10 times as long to create, which
> >>>fits with object.size() being truthful. gcinfo(TRUE) also indicates a
> >>>great deal more garbage collector activity caused by data.frame() than
> >>>matrix().
> >>>
> >>>$ R --vanilla
> >>>....
> >>>
> >>>>nr = 1000000
> >>>>system.time(m<<-matrix(integer(1), nrow=nr, ncol=2))
> >>>
> >>>[1] 0.22 0.01 0.23 0.00 0.00
> >>>
> >>>>system.time(d<<-data.frame(a=integer(nr), b=integer(nr)))
> >>>
> >>>[1] 2.81 0.20 3.01 0.00 0.00 # 10 times longer
> >>>
> >>>
> >>>>dim(m)
> >>>
> >>>[1] 1000000 2
> >>>
> >>>>dim(d)
> >>>
> >>>[1] 1000000 2 # same dimensions
> >>>
> >>>
> >>>>storage.mode(m)
> >>>
> >>>[1] "integer"
> >>>
> >>>>sapply(d, storage.mode)
> >>>
> >>> a b
> >>>"integer" "integer" # same storage.mode
> >>>
> >>>
> >>>>object.size(m)/1024^2
> >>>
> >>>[1] 7.629616
> >>>
> >>>>object.size(d)/1024^2
> >>>
> >>>[1] 76.29482 # but 10 times bigger
> >>>
> >>>
> >>>>sum(sapply(d, object.size))/1024^2
> >>>
> >>>[1] 7.629501 # or is it ?
> >>>
> >>>If it's not really 10 times bigger, why 10 times longer above ?
> >>
> >>Row names!!
> >>
> >>
> >>
> >>>r <- as.character(1:1e6)
> >>>object.size(r)
> >>
> >>[1] 72000056
> >>
> >>>object.size(r)/1024^2
> >>
> >>[1] 68.6646
> >>
> >>'nuff said?
> >>
> >>--
> >> O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
> >> c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
> >> (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
> >>~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk) FAX: (+45) 35327907
> >>
> >>
> >>
> >>
> >>______________________________________________
> >>R-devel@r-project.org mailing list
> >>https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >>
> >>
> >
> >
>
>



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Sat Dec 10 06:18:46 2005

This archive was generated by hypermail 2.1.8 : Fri 09 Dec 2005 - 21:21:08 GMT