Re: [R] possible bug in merge with duplicate blank names in 'by' field.

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Fri 17 Jun 2005 - 17:26:15 EST

What version of R is this (please do see the posting guide)?

In both 2.1.0 and 2.1.1 beta I get

> all
  Promoter ip.x ip.y ip
1            30   40 40
2            40   40 40
3        a   10   NA NA
4        c   20   20 20
5        b   NA   15 15
6        d   NA   30 30

so I cannot reproduce your result. Are you sure that the `blanks' really are empty, and not some character that prints as empty on your unstated OS?
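
One way to check is sketched below. The vector `p` is made up for illustration (it is not the poster's data): it holds an ordinary label, a truly empty string, a no-break space that may print as empty on some systems, and a single space.

```r
# Hypothetical data, not from the original post:
#   element 2 is a truly empty string,
#   element 3 is a no-break space (U+00A0),
#   element 4 is an ordinary blank (space).
p <- c("a", "", "\u00a0", " ")

lens     <- nchar(p)             # a truly empty string has 0 characters
is_empty <- p == ""              # TRUE only for the truly empty element
codes    <- lapply(p, utf8ToInt) # code points show what is actually stored

print(lens)
print(is_empty)
print(codes)
```

Any element with a non-zero length that still prints as nothing is a candidate for the kind of invisible character asked about above.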

BTW ' ' is what is normally called `blank'.

BTW, these are not `names' but character strings: `names' has other meanings in R.

On Thu, 16 Jun 2005, Frank Gibbons wrote:

> Run this:
>
>> p <- c('a', 'c', '', ''); a <- c(10, 20, 30, 40)
>> d1 <- data.frame(Promoter=p, ip=a) # Note duplicate empty names in p.
>> p <- c('b', 'c', 'd', ''); a <- c(15, 20, 30, 40)
>> d2 <- data.frame(Promoter=p, ip=a)
>> all <- merge(x=d1, y=d2, by="Promoter", all=T)
>> all <- merge(x=all, y=d2, by="Promoter", all=T)
>> all
>
> Data is this:
>
>> d1
>>   Promoter ip
>> 1        a 10
>> 2        c 20
>> 3          30
>> 4          40
>>
>> d2
>>   Promoter ip
>> 1        b 15
>> 2        c 20
>> 3        d 30
>> 4          40
>
> Output looks like this:
>
>>   Promoter ip.x ip.y ip
>> 1            40   30 30
>> 2            40   40 30
>> 3            40   30 40
>> 4            40   40 40
>> 5        b   15   NA NA
>> 6        c   20   20 20
>> 7        d   30   NA NA
>> 8        a   NA   10 10
>
> The weird thing about this is (in my view) that each instance of '' is
> considered unique, so with each successive merge, all combinatorial
> possibilities are explored, like a SQL outer join (Cartesian product). For
> non-empty names, an inner join is performed.
>
> Dealing with genomic data (10^4 datapoints), it's easy to have a couple of
> blanks buried in the middle of things, and to combine several replicates
> with successive merges. I couldn't understand how my three replicates of
> 6000 points, in which I expected substantial overlap in the labels, were
> taking so long to merge and ultimately generating 57000 labels. The culprit
> turned out to be a few hundred blanks buried in the middle.
>
> Why does the empty ("null") name merit special treatment? Perhaps I'm
> missing something. I hesitate to submit this as a bug, since technically I
> guess you could say that blank names, especially duplicates, are not
> kosher. But on the other hand, this combinatorial behaviour seems to occur
> only for blanks.
>
> -Frank
>
> PhD, Computational Biologist,
> Harvard Medical School BCMP/SGM-322, 250 Longwood Ave, Boston MA 02115, USA.
> Tel: 617-432-3555 Fax: 617-432-3557
> http://llama.med.harvard.edu/~fgibbons
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
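
On the row growth described in the quoted message: merge() pairs every matching row in x with every matching row in y for any duplicated key value, whether or not that value is ''. A minimal sketch, using made-up frames `x` and `y` (hypothetical data, not the poster's):

```r
# Hypothetical frames: key "k" occurs twice in x and twice in y;
# the empty key "" occurs twice in x and once in y.
x <- data.frame(key = c("k", "k", "", ""), vx = 1:4)
y <- data.frame(key = c("k", "k", ""),     vy = 5:7)

m <- merge(x, y, by = "key", all = TRUE)

# Each key shared by both frames contributes n_x * n_y rows.
n_k     <- sum(m$key == "k")   # 2 * 2 rows
n_empty <- sum(m$key == "")    # 2 * 1 rows

print(m)
print(c(k = n_k, empty = n_empty, total = nrow(m)))
```

Repeating the merge against further frames with duplicated keys multiplies the counts again, which is how a few hundred duplicated values could inflate a result well beyond the input sizes.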

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Received on Fri Jun 17 18:14:44 2005
