Re: [Rd] duplicates() function

From: Petr Savicky <savicky_at_cs.cas.cz>
Date: Sat, 09 Apr 2011 20:09:34 +0200

On Fri, Apr 08, 2011 at 10:59:10AM -0400, Duncan Murdoch wrote:
> I need a function which is similar to duplicated(), but instead of
> returning TRUE/FALSE, returns indices of which element was duplicated.
> That is,
>
> > x <- c(9,7,9,3,7)
> > duplicated(x)
> [1] FALSE FALSE TRUE FALSE TRUE
>
> > duplicates(x)
> [1] NA NA 1 NA 2
>
> (so that I know that element 3 is a duplicate of element 1, and element
> 5 is a duplicate of element 2, whereas the others were not duplicated
> according to our definition.)
>
> Is there a simple way to write this function?

A possible strategy is to use sorting. In a sorted matrix or data frame, rows that are duplicates of one another form consecutive blocks. The first row of each block can be identified using !duplicated(). Since the sort is stable, mapping these blocks back to the original order sends the first row of each block to the first occurrence of that row in the original data.
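As a small illustration of the block structure described above, the following sketch sorts the rows of the sample matrix used later in this message and marks the first row of each block. The variable names are the same as in the function below; the values in the comments are for this sample only.

```r
# Sample data: a constant column plus the vector from the question.
x <- cbind(1, c(9, 7, 9, 3, 7))

# Permutation that sorts the rows; tied rows keep their original relative order.
s <- do.call("order", as.data.frame(x))
s                                  # 4 2 5 1 3

# Sorted rows: second column is 3 7 7 9 9, so equal rows are adjacent.
x[s, ]

# TRUE marks the first row of each block of equal rows.
non.dup <- !duplicated(x[s, ])
non.dup                            # TRUE TRUE FALSE TRUE FALSE
```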

One possible implementation is as follows.

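  duplicates <- function(dat)
  {
      # permutation that sorts the rows; ties keep their original order
      s <- do.call("order", as.data.frame(dat))
      # TRUE for the first row of each block of equal rows (sorted order)
      non.dup <- !duplicated(dat[s, ])
      # original index of the first occurrence of each distinct row
      orig.ind <- s[non.dup]
      # give every row of a block the index of the block's first row
      first.occ <- orig.ind[cumsum(non.dup)]
      # first occurrences are not duplicates of anything
      first.occ[non.dup] <- NA
      # return to the original row order
      first.occ[order(s)]
  }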

  x <- cbind(1, c(9,7,9,3,7) )
  duplicates(x)
  [1] NA NA 1 NA 2
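The question used a plain vector, while the function above indexes rows with dat[s, ], so it expects a matrix or data frame. The sketch below repeats the function from this message, calls it on the vector wrapped with as.matrix(), and compares the result with a cross-check based on match(). The match() idiom is a different technique from the sorting approach and is included here only as an independent check on a vector; the names v, res and chk are illustrative.

```r
# Function as given in this message.
duplicates <- function(dat)
{
    s <- do.call("order", as.data.frame(dat))
    non.dup <- !duplicated(dat[s, ])
    orig.ind <- s[non.dup]
    first.occ <- orig.ind[cumsum(non.dup)]
    first.occ[non.dup] <- NA
    first.occ[order(s)]
}

v <- c(9, 7, 9, 3, 7)

# Wrap the vector as a one-column matrix so that dat[s, ] is valid.
res <- duplicates(as.matrix(v))

# Cross-check: match(v, v) gives the index of the first occurrence of each
# element; elements that are their own first occurrence are set to NA.
chk <- match(v, v)
chk[chk == seq_along(v)] <- NA

identical(res, chk)
```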

The line

      orig.ind <- s[non.dup]

creates a vector whose length is the number of distinct rows in the sorted "dat". Its components are the indices, in the original order, of the first occurrences of these rows. This relies on the stability of the sort.
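For the sample matrix, the content of orig.ind can be inspected directly. This is only a sketch on that one example; the final sapply() line is an illustrative check, not part of the function.

```r
x <- cbind(1, c(9, 7, 9, 3, 7))
s <- do.call("order", as.data.frame(x))
non.dup <- !duplicated(x[s, ])

# One entry per distinct row, in sorted order of the rows.
orig.ind <- s[non.dup]
orig.ind                # 4 2 1

# The distinct values of the second column, in sorted order.
x[orig.ind, 2]          # 3 7 9

# Illustrative check: position of the first occurrence of each distinct value.
first.pos <- sapply(x[orig.ind, 2], function(val) min(which(x[, 2] == val)))
first.pos               # 4 2 1
```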

The lines

      first.occ <- orig.ind[cumsum(non.dup)]
      first.occ[non.dup] <- NA

expand orig.ind to a vector with the following property: if the i-th row of the sorted "dat" is a duplicate, then first.occ[i] is the index of the first row in the original "dat" that is equal to it; otherwise first.occ[i] is NA. So first.occ contains the values required for the output of duplicates(), but in the order of the sorted "dat". The last line

      first.occ[order(s)]

reorders the vector to the original order of the rows.
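The last three steps can be traced on the sample data. In this sketch, s and non.dup are entered by hand with the values they take for x <- cbind(1, c(9,7,9,3,7)); the name block is introduced only for illustration.

```r
# Values obtained for the sample matrix.
s <- c(4L, 2L, 5L, 1L, 3L)
non.dup <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

orig.ind <- s[non.dup]           # 4 2 1

# cumsum() numbers the blocks: each row gets the number of its block.
block <- cumsum(non.dup)         # 1 2 2 3 3

# Every row receives the original index of the first row of its block.
first.occ <- orig.ind[block]     # 4 2 2 1 1

# First occurrences are not duplicates.
first.occ[non.dup] <- NA         # NA NA 2 NA 1

# order(s) is the inverse permutation of s, so this undoes the sorting.
result <- first.occ[order(s)]
result                           # NA NA 1 NA 2
```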

Petr Savicky.



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Sat 09 Apr 2011 - 18:13:16 GMT


