[R] duplicated() and unique() problems

From: christiaan pauw <cjpauw_at_gmail.com>
Date: Tue, 08 Jun 2010 08:44:39 +0200

Hi everybody

I have found something that seems strange (to me at least) about duplicated(). I will first give a reproducible example of behaviour that I find odd, and then a sample of unexpected results from my own data. I hope someone can help me understand this.

Consider the following

# this works as expected

ex=sample(1:20, replace=TRUE)






# but why does duplicated() not work after order()?

ex=sample(1:20, replace=TRUE)





Why does duplicated() not work after order() has been applied, when it works fine after sort()? Is this an error, or is there something I don't understand?
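
The commands for the two cases above did not survive in this archive copy. As a minimal sketch only, assuming the comparison was duplicated() applied to the output of sort() versus the output of order(), the difference can be seen as follows. The names sorted_values and positions and the seed are made up for illustration and are not from the original session. The point to look at is that order() returns index positions rather than the values themselves:

```r
# Assumed reconstruction: the original commands are missing from the archive.
set.seed(1)                      # arbitrary seed, only for reproducibility
ex <- sample(1:20, replace = TRUE)

sorted_values <- sort(ex)        # the values themselves, in ascending order
positions     <- order(ex)       # a permutation of 1:20 (indices, not values)

any(duplicated(sorted_values))   # repeated values in ex are flagged here
any(duplicated(positions))       # FALSE: every index occurs exactly once

# Applying the permutation to ex recovers the sorted values
identical(ex[positions], sorted_values)
```

If that is what happened, duplicated(ex[order(ex)]) rather than duplicated(order(ex)) would be the comparable call.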

I have been getting very strange results from duplicated() and unique() in a dataset I am analysing. Here is a small sample of my real-life problem:

> str(Masechaba$PROPDESC)

 Factor w/ 24545 levels " 06"," 71Hemilton str",..: 14527 8043 16113 16054 13875 15780 12522 7771 14824 12314 ...
> # Create an indicator of whether PROPDESC is unique. Default is FALSE
> Masechaba$unique=FALSE

> Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
> # Check if something happened
> length(which(Masechaba$unique==TRUE))
[1] 2174
> length(which(Masechaba$unique==FALSE))

[1] 476
> Masechaba$duplicate=FALSE
> Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE
> length(which(Masechaba$duplicate==TRUE))
[1] 476
> length(which(Masechaba$duplicate==FALSE))
[1] 2174
> # Looks OK so far
> # Test on a known duplicate. I expect one to be true and one to be false
> Masechaba[which(Masechaba$PROPDESC==2363),10:12]

      PROPDESC unique duplicate
24874     2363   TRUE     FALSE
31280     2363   TRUE      TRUE

> # This is strange. I expected unique() and duplicated() to give consistent results: PROPDESC is clearly not unique in either row.
> # The totals are the same, but not the individual results.
> table(Masechaba$unique,Masechaba$duplicate)

        FALSE TRUE
  FALSE   342  134
  TRUE   1832  342

I don't understand this. Is there something I am missing?
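
To make the mismatch easier to reproduce without the real data, here is a small sketch on a made-up factor (x is a hypothetical stand-in for Masechaba$PROPDESC, and the flag names are invented for illustration). It assumes the cause is that unique() returns only the distinct values, so its result is shorter than the column and its positions do not line up with the rows:

```r
# Hypothetical stand-in for Masechaba$PROPDESC
x <- factor(c("a", "b", "a", "c", "b", "d"))

length(unique(x))   # number of distinct values, not one entry per row
length(x)           # number of rows

# Pattern used in the session above: which() indexes positions within
# unique(x), so it marks the first length(unique(x)) rows, whatever they hold
flag_by_unique <- rep(FALSE, length(x))
flag_by_unique[which(!is.na(unique(x)))] <- TRUE

# Row-aligned alternative: first occurrences are the non-duplicated rows
first_occurrence <- !duplicated(x)

data.frame(x, flag_by_unique, first_occurrence, duplicate = duplicated(x))
```

Under that assumption the two flags have the same number of TRUE values but disagree row by row, which would match the totals and the cross-table above.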

Best regards

> sessionInfo()

R version 2.11.1 (2010-05-31)

[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines stats graphics grDevices utils datasets methods base

other attached packages:

[1] plyr_0.1.9      maptools_0.7-34 lattice_0.18-8  foreign_0.8-40  Hmisc_3.8-0     survival_2.35-8 rgdal_0.6-26
[8] sp_0.9-64

loaded via a namespace (and not attached):
[1] cluster_1.12.3 grid_2.11.1    tools_2.11.1


R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Tue 08 Jun 2010 - 08:44:03 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 08 Jun 2010 - 12:10:29 GMT.
