Re: [Rd] unique.matrix issue [Was: Anomaly with unique and match]

From: Petr Savicky <savicky_at_cs.cas.cz>
Date: Sat, 12 Mar 2011 09:09:02 +0100

On Thu, Mar 10, 2011 at 01:19:48AM -0800, Henrik Bengtsson wrote:
> It should be possible to run unique()/duplicated() column by column
> and incrementally update the set of unique/duplicated rows. This
> would avoid any coercing. The benefit should be even greater for
> data.frame():s.

This is a good point. An implementation of this using sorting can be done as follows

  Sort the data frame using function order().   Determine the groups of consecutive equal rows in the sorted df.   Map the first row of each group to the original order of the rows.   Since sorting by the function order() is stable, we obtain the first   in each group of equal rows also in the original order.

The coercion approach uses hashing for string comparison, but the efficiency of hashing seems to be overweighted by the inefficiency of the coercion. So, we get the following comparison.

  a <- matrix(sample(c(1234, 5678), 12*10000, replace=TRUE), ncol=12)   df <- data.frame(a)   

  do.unique.sort <- function(df)
  {

  	i <- do.call(order, df)
  	n <- nrow(df)
  	u <- c(TRUE, rowSums(df[i[2:n], ] == df[i[1:(n-1)], ]) < ncol(df))
  	df[u[order(i)], ]

  }

  system.time(out1 <- do.unique.sort(df))   system.time(out2 <- unique(df))
  identical(out1, out2)

The result may be, for example

     user system elapsed
    0.279 0.000 0.273
     user system elapsed
    0.514 0.000 0.468
  [1] TRUE On another computer

     user system elapsed
    0.058 0.000 0.058
     user system elapsed
    0.187 0.000 0.188
  [1] TRUE On Thu, Mar 10, 2011 at 01:39:56PM -0600, Terry Therneau wrote:
> Simon pointed out that the issue I observed was due to internal
> behaviour of unique.matrix.
>
> I had looked carefully at the manual pages before posting the question
> and this was not mentioned. Perhaps an addition could be made?

According to the description of unique(), the user may expect that if b is obtained using

  b <- unique(a)

then for every "i" there is "j", such that

  all(a[i, ] == b[j, ])

This is usually true, but not always, because among several numbers in "a" with the same as.character() only one remains in "b". If this is intended, then i support the suggestion to include a note in the documentation.

Let me add an argument against using as.character() to determine, whether two numbers are close. The maximum relative difference between the numbers, which have the same 15 digit decimal representation, varies by a factor up to 10 in different ranges. Due to this, we have

  x <- 1 + c(1.1, 1.3, 1.7, 1.9)*1e-14

  unique(as.character(x))
  [1] "1.00000000000001" "1.00000000000002"

  unique(as.character(9*x))
  [1] "9.0000000000001" "9.00000000000012" "9.00000000000015" "9.00000000000017"

The relative differences between components of 9*x are the same as the relative differences in x, but if the mantissa begins with 9, then a smaller relative difference is sufficient to change 15-th digit.

In terms of unique(), this implies

  nrow(unique(cbind(x)))
  [1] 2

  nrow(unique(cbind(9*x)))
  [1] 4   

Petr Savicky.



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sat 12 Mar 2011 - 08:15:56 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 12 Mar 2011 - 13:50:29 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive