Re: [R] Does SQL group by have a heavy duty equivalent in R

From: Charles C. Berry <>
Date: Mon 01 Jan 2007 - 02:02:55 GMT

On Sun, 31 Dec 2006, Farrel Buchinsky wrote:

> I have hundreds of humans who have undergone SNP genotyping at hundreds of
> loci. Some have even undergone the procedure twice or thrice (kind of an
> internal control).
> So obviously I need to find those replications, and confirm that the results
> are the same. If there is discordance then I need to address it.

Why not use duplicated() ?

For a data.frame with 200 rows of which about 50 are duplicates and 201 columns finding the (non) duplicates takes little time on my year old AMD 64 running Windows XP:

> my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T)))
> snp.dat <- lapply(1:200,function(x) 0:1 )
> snp.frame <-,snp.dat))
> my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,])
> system.time( table(duplicated(my.dat)) )
[1] 0.03 0.00 0.03 NA NA

Finding the non-duplicated rows for which there is at least one replication:

> system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in% names(which(table(my.dat$ID)>1)) ) ))
[1] 0.05 0.00 0.05 NA NA

> I tried to use the aggregate function
> nr.attempts
> <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length)
> This was simply to figure out how many times the same piece of information
> had been obtained. I ran out of patience. It took beyond forever and tapply
> did not perform much better. The reshape package did not help - it implied
> one was out of luck if the data was not numeric. All of my data is character
> or factor.
> Instead I used RODBC
> sqlSave(channel,RawSeq)
> to push the table into a Microsoft Access database
> Then a sql query, courtesy of the Microsoft Access Query Wizard a la design
> mode.
> RawSeq.rownames) AS CountOfrownames
> FROM RawSeq
> WHERE (((RawSeq.GENOTYPE_ID)<>""))
> ORDER BY Count(RawSeq.rownames) DESC;
> This way I could easily use the minimum and maximum values to see if they
> were discordant.
> Microsoft Access handled it with aplomb. I plan to use RODBC to bring the
> result of the SQL query back into R.
> This is the first time I have seen Microsoft Access outpace R.
> Is my observation correct or am I missing something. I would much rather
> perform all data manipulation and analyses in R.
> --
> Farrel Buchinsky
> [[alternative HTML version deleted]]
> ______________________________________________
> mailing list
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.

Charles C. Berry                        (858) 534-2098
                                          Dept of Family/Preventive Medicine
E	         UC San Diego         La Jolla, San Diego 92093-0717

______________________________________________ mailing list PLEASE do read the posting guide and provide commented, minimal, self-contained, reproducible code. Received on Mon Jan 01 13:07:39 2007

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Mon 01 Jan 2007 - 04:30:24 GMT.

Mailing list information is available at Please read the posting guide before posting to the list.