Re: [R] aggregate slow with many rows - alternative?

From: jim holtman <jholtman_at_gmail.com>
Date: Fri 14 Oct 2005 - 19:23:12 EST

Here is the way that I would do it. Using 'lapply' to process the list and create a matrix; take less than 1 second:  > dat <- data.frame(D=sample(32000:33000, 33000, T), + Fid=sample(1:10,33000,T), A=sample(1:5,33000,T))
> system.time({

+ result <- lapply(split(seq(nrow(dat)), dat$D), function(.d){ # split by first level

+ lapply(split(.d, dat$Fid[.d]), function(.f){ # now by the second
+ # create the sum and count
+ c(D=dat$D[.f[1]], Fid=dat$Fid[.f[1]], sum=sum(dat$A[.f]), cnt=length(.f))
+ })
+ })
+ mat <- do.call('rbind',lapply(result, function(x) do.call('rbind',x)))
+ })

[1] 0.66 0.00 0.73 NA NA
>
> mat[1:20,]

D Fid sum cnt
1 32000 1 8 3
2 32000 2 11 4
3 32000 3 11 3
4 32000 4 2 1
5 32000 5 8 2
6 32000 6 4 2
7 32000 7 21 6
8 32000 8 13 3
9 32000 9 12 4
10 32000 10 10 3
1 32001 1 12 4
2 32001 2 2 1
3 32001 3 10 4
4 32001 4 12 3
5 32001 5 10 3
6 32001 6 8 2
7 32001 7 22 7
8 32001 8 3 2
9 32001 9 7 3
10 32001 10 3 2
>

 On 10/14/05, TEMPL Matthias <Matthias.Templ@statistik.gv.at> wrote:
>
> Hi,
>
> Yesterday, I have analysed data with 160000 rows and 10 columns.
> Aggregation would be impossible with a data frame format, but when
> converting it to a matrix with *numeric* entries (check, if the variables
> are of class numeric!) the computation needs only 7 seconds on a Pentium
> III. Im sadly to say, that this is also slow in comparsion with the proc
> summary in SAS (less than one second), but the code is much more elegant in
> R!
>
> Best,
> Matthias
>
>
> > Hi,
> >
> > I use the code below to aggregate / cnt my test data. It
> > works fine, but the problem is with my real data (33'000
> > rows) where the function is really slow (nothing happened in
> > half an hour).
> >
> > Does anybody know of other functions that I could use?
> >
> > Thanks,
> > Hans-Peter
> >
> > --------------
> > dat <- data.frame( Datum = c( 32586, 32587, 32587, 32625,
> > 32656, 32656, 32656, 32672, 32672, 32699 ),
> > FischerID = c( 58395, 58395, 58395, 88434,
> > 89953, 89953, 89953, 64395, 62896, 62870 ),
> > Anzahl = c( 2, 2, 1, 1, 2, 1, 7, 1, 1, 2 ) )
> > f <- function(x) data.frame( Datum = x[1,1], FischerID =
> > x[1,2], Anzahl = sum( x[,3] ), Cnt = dim( x )[1] )
> > t.a <- do.call("rbind", by(dat, dat[,1:2], f)) # slow for
> > 33'000 rows
> > t.a <- t.a[order( t.a[,1], t.a[,2] ),]
> >
> > # show data
> > dat
> > t.a
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read
> > the posting guide! http://www.R-project.org/posting-guide.html
> >
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
>

--
Jim Holtman
Cincinnati, OH
+1 513 247 0281

What the problem you are trying to solve?

	[[alternative HTML version deleted]]


______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Received on Fri Oct 14 19:33:27 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:40:44 EST