[R] possible bug in merge with duplicate blank names in 'by' field.

From: Frank Gibbons <fgibbons_at_hms.harvard.edu>
Date: Fri 17 Jun 2005 - 07:59:56 EST


Run this:

>p <- c('a', 'c', '', ''); a <- c(10, 20, 30, 40)
>d1 <- data.frame(Promoter=p, ip=a) # Note duplicate empty names in p.
>p <- c('b', 'c', 'd', ''); a <- c(15, 20, 30, 40)
>d2 <- data.frame(Promoter=p, ip=a)
>all <- merge(x=d2, y=d1, by="Promoter", all=TRUE)
>all <- merge(x=all, y=d1, by="Promoter", all=TRUE)
>all

The data look like this:

>d1
>  Promoter ip
>1        a 10
>2        c 20
>3          30
>4          40
>
>d2
>  Promoter ip
>1        b 15
>2        c 20
>3        d 30
>4          40

Output looks like this:

>  Promoter ip.x ip.y ip
>1            40   30 30
>2            40   40 30
>3            40   30 40
>4            40   40 40
>5        b   15   NA NA
>6        c   20   20 20
>7        d   30   NA NA
>8        a   NA   10 10

The weird thing (in my view) is that each instance of '' seems to be considered unique, so with each successive merge every combination of the blank rows is generated, like a Cartesian product in a SQL outer join. For non-empty names, an inner join is performed.
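One way to test whether blanks really are special is to merge on a key that is duplicated on both sides, once with '' and once with an ordinary label, and compare the row counts. This is only a sketch: the helper rows_after_merge and the labels "p", "q" and "z" are made up for illustration. If the two counts differ, blanks are being treated differently; if they agree, the growth is what merge does for any duplicated key.

```r
# Hypothetical helper: build two small frames that each contain `key`
# twice, merge them, and count the rows returned for that key.
rows_after_merge <- function(key) {
  x <- data.frame(Promoter = c(key, key, "p"), ip = c(1, 2, 3))
  y <- data.frame(Promoter = c(key, key, "q"), ip = c(4, 5, 6))
  m <- merge(x, y, by = "Promoter", all = TRUE)
  sum(m$Promoter == key)
}

n_blank <- rows_after_merge("")   # duplicated empty name
n_named <- rows_after_merge("z")  # duplicated non-empty name
print(c(blank = n_blank, named = n_named))
```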

When dealing with genomic data (10^4 data points), it's easy to have a couple of blanks buried in the middle of things, and to combine several replicates with successive merges. I couldn't understand why my three replicates of 6000 points, in which I expected substantial overlap in the labels, were taking so long to merge and ultimately generating 57000 labels. The culprit turned out to be a few hundred blank names scattered through the data.
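A quick check like the following, run on each replicate before merging, would have flagged the problem early. It is a sketch only: key_report is a hypothetical helper, and dropping the blank rows at the end is just one possible policy, not necessarily the right one for every data set.

```r
# Hypothetical helper: summarise the key column of one replicate.
key_report <- function(df, key = "Promoter") {
  k <- as.character(df[[key]])
  c(rows = length(k),
    blank = sum(k == ""),              # empty names
    duplicated = sum(duplicated(k)))   # repeats of an earlier name
}

d1 <- data.frame(Promoter = c("a", "c", "", ""), ip = c(10, 20, 30, 40))
report <- key_report(d1)
print(report)

# Assumption: rows without a name are not wanted in the merged result.
d1_clean <- d1[as.character(d1$Promoter) != "", ]
```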

Why does the empty ("null") name merit special treatment? Perhaps I'm missing something. I hesitate to submit this as a bug, since technically I guess you could say that blank names, especially duplicates, are not kosher. But on the other hand, this combinatorial behaviour seems to occur only for blanks.

-Frank

PhD, Computational Biologist,
Harvard Medical School BCMP/SGM-322, 250 Longwood Ave, Boston MA 02115, USA.

Tel: 617-432-3555       Fax: 617-432-3557
http://llama.med.harvard.edu/~fgibbons

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Received on Fri Jun 17 07:59:47 2005
