Re: [R] Comparison of aggregate in R and group by in mysql

From: jim holtman <jholtman_at_gmail.com>
Date: Sat, 26 Jan 2008 20:40:51 -0500

I think with your data you will be computing a matrix that is 7049 x 11704. This will require about 700MB of memory. What size system do you have (how much memory)? How big is the dataframe? (do 'str' and report what it says). This will require a lot more resources and given that you have about 80M possible combinations, I would assume that a lot of them are probably empty. It is having to 'split' the data into the groups and then summarize. Maybe you should use a database with this combination of data.

2008/1/26 zhihuali <lzhtom_at_hotmail.com>:
>
> I repeated your experiment:
> > n <- 1000000
> > x <- data.frame(A=sample(LETTERS,n,TRUE), B=sample(letters[1:4],n,TRUE),C=sample(LETTERS[1:4], n, TRUE), data=runif(n))
> > system.time(x.agg <- aggregate(x$data, list(x$A, x$B, x$C), mean))
> user system elapsed
> 1.824 0.212 2.038
>
>
> Now I use my own data:
> > length(levels(group))
> [1] 7049
> > length(levels(type))
> [1] 11704
> > y<-data.frame(group,type,signal)
> > system.time(y.agg <- aggregate(y$signal, list(y$group,y$type), mean))
> (I killed it after 30 minutes)
>
>
>
> > Date: Sat, 26 Jan 2008 19:55:51 -0500
> > From: jholtman_at_gmail.com
> > To: lzhtom_at_hotmail.com
> > Subject: Re: [R] Comparison of aggregate in R and group by in mysql
> > CC: r-help_at_stat.math.ethz.ch
>
> >
> > How large is your dataframe? How much memory do you have on your
> > system? Are you paging? Here is a test I ran with a data frame with
> > 1,000,000 entries and it seems to be fast:
> >
> > > n <- 1000000
> > > x <- data.frame(A=sample(LETTERS,n,TRUE), B=sample(letters[1:4],n,TRUE),
> > + C=sample(LETTERS[1:4], n, TRUE), data=runif(n))
> > > system.time(x.agg <- aggregate(x$data, list(x$A, x$B, x$C), mean))
> > user system elapsed
> > 2.65 0.34 3.00
> > >
> >
> > On Jan 26, 2008 6:45 PM, zhihuali <lzhtom_at_hotmail.com> wrote:
> > >
> > > Hi, netters,
> > >
> > > First of all, thanks a lot for all the prompt replies to my earlier question about "merging" data frames in R.
> > > Actually that's an equivalence to the "join" clause in mysql.
> > >
> > > Now I have another question. Suppose I have a data frame X with lots of columns/variables:
> > > Name, Age,Group, Type, Salary.
> > > I wanna do a subtotal of salaries:
> > > aggregate(X$Salary, by=list(X$Group,X$Age,X$Type),Fun=mean)
> > >
> > > When the levels of Group and Type are huge, it took R forever to finish the aggregation.
> > > And I used gc to find that the memory usage was big too.
> > >
> > > However, in mysql, it took seconds to finish a similar job:
> > > select Group,Age,Type ,avg(Salary) from X group by Group,Age,Type
> > >
> > > Is it because mysql is superior in doing such kind of things? Or my R command is not efficient enough? Why did R have to consume huge memories to do the aggregation?
> > >
> > > Thanks again!
> > >
> > > Zhihua Li
> > >
> > > _________________________________________________________________
> > > 天凉了,添衣了,心动了,"七件"了
> > > http://get.live.cn
> > > [[alternative HTML version deleted]]
> > >
> > >
> > > ______________________________________________
> > > R-help_at_r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> > >
> >
> >
> >
> > --
> > Jim Holtman
> > Cincinnati, OH
> > +1 513 646 9390
> >
> > What is the problem you are trying to solve?
>
> _________________________________________________________________
> MSN圣诞礼物火热登场,免费发放中,快来领取吧!
> http://im.live.cn/emoticons/?ID=18
> [[alternative HTML version deleted]]
>
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Sun 27 Jan 2008 - 02:07:45 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 27 Jan 2008 - 03:30:08 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive