Re: [R] data summarization etc...

From: Daniel Malter <daniel_at_umd.edu>
Date: Fri, 11 Jul 2008 19:53:04 -0400


The problem is that you do not really have categories. You draw three vectors of 70,000 random normal values and then try to subset one by another. Since none of the values will coincide exactly with another, grouping by fid creates roughly 70,000 categories, one per observation. No wonder you are running out of memory. What you are doing only makes sense if you really have groups/categories that cluster your data and that each contain a substantial number of observations (see example below).
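
A quick check of the first point, before the grouped example. This is only a sketch: the seed is arbitrary and the names n_groups and has_duplicates are made up for illustration. It counts how many distinct values the fid vector from the original post contains, which is the number of groups aggregate() would have to build.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
fid <- rnorm(70000, 100)          # same call as in the original post

# Number of distinct values, i.e. the number of groups when grouping by fid
n_groups <- length(unique(fid))
n_groups

# TRUE only if at least one value occurs more than once
has_duplicates <- anyDuplicated(fid) > 0
has_duplicates
```

If n_groups comes out at (or very near) the length of fid, every "group" holds a single observation and there is nothing to summarize.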

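# Two numeric variables with 30,000 observations each
x1 <- rnorm(30000, 0, 1)
x2 <- rnorm(30000, 10, 5)

# Two grouping variables with 3 levels each, giving 3 x 3 = 9 cells
group1 <- rep(1:3, each = 10000)
group2 <- rep(1:3, 10000)

# Mean of x1 and x2 within each of the 9 group1/group2 cells
aggregate(cbind(x1, x2), list(group1, group2), FUN = mean)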
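
For the SQL-style question, here is a rough base R sketch of what "select fid, sum(pti) ... group by fid" could look like once fid is a real grouping variable. The data frame dat and its three-level fid are assumptions made up for illustration; they are not the data from the original post.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility

# Hypothetical data: fid is now a group id with 3 levels, not a random normal
dat <- data.frame(
  fid  = rep(1:3, each = 10000),
  pti  = rnorm(30000, 10),
  finc = rnorm(30000, 1000)
)

# select fid, sum(pti) from dat group by fid
sum_pti <- tapply(dat$pti, dat$fid, sum)

# select fid, avg(finc) from dat group by fid
mean_finc <- tapply(dat$finc, dat$fid, mean)

# Both columns summed in one call, returned as a data frame with one row per fid
both <- aggregate(dat[c("pti", "finc")], by = list(fid = dat$fid), FUN = sum)

sum_pti
mean_finc
both
```

tapply() returns a named vector with one element per group, which is usually enough for a quick pivot-table style summary. If you would rather write the SQL itself, the sqldf package on CRAN is intended for running SQL statements against data frames; I have left it out of the sketch so that it runs in base R alone.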

Best,
Daniel



cuncta stricte discussurus

-----Original Message-----
From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On Behalf Of sj
Sent: Friday, July 11, 2008 6:47 PM
To: r-help
Subject: [R] data summarization etc...

Hello,

I am trying to do some fairly straightforward data summarization, i.e., the kind you would do with a pivot table in Excel or by using SQL queries. I have a moderately sized data set of ~70,000 records and I am trying to compute some group averages and sum values within groups. The code example below shows how I am trying to go about doing this:

pti <- rnorm(70000, 10)
fid <- rnorm(70000, 100)
finc <- rnorm(70000, 1000)

### compute the sums of pti within fid groups
sum_pinc <- aggregate(cbind(fid, pti), list(fid), FUN = sum)

#### compute mean finc within fid groups
tot_finc <- aggregate(cbind(fid, finc), list(fid), FUN = mean)

When I try to do it this way I get an error message telling me that enough memory cannot be allocated (I am using R 2.7.1 on Windows XP with 2 GB of memory). I figure that there must be a more efficient way to go about doing this. Please suggest.

I would typically do this kind of task in a database and use SQL to push the data around. I know RODBC allows you to write SQL to query external DBs. Is there any mechanism that allows you to write SQL queries against datasets internal to R? For example, in the case above I could do something like

set <- cbind(fid,pti,finc)

select fid, sum(pti)
from set
group by fid

That would be handy!

Thanks,

Spencer




R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Sat 12 Jul 2008 - 00:06:27 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 12 Jul 2008 - 01:31:45 GMT.
