[R] Re: duplicated() and unique() problems

From: Petr PIKAL <petr.pikal_at_precheza.cz>
Date: Tue, 08 Jun 2010 12:58:08 +0200

Hi

r-help-bounces_at_r-project.org wrote on 08.06.2010 08:44:39:

> Hi everybody
>
> I have found something (for me at least) strange with duplicated(). I will
> first provide a replicable example of a certain kind of behaviour that I
> find odd and then give a sample of unexpected results from my own data. I
> hope someone can help me understand this.
>
> Consider the following
>
> # this works as expected
>
> ex=sample(1:20, replace=TRUE)
>
> ex
>
> duplicated(ex)
>
> ex=sort(ex)

This is OK, as sort() returns your data in sorted order.

>
> ex
>
> duplicated(ex)
>
>
> # but why does duplicate not work after order() ?
>
> ex=sample(1:20, replace=TRUE)
>
> ex
>
> duplicated(ex)
>
> ex=order(ex)

This is not OK, as order() gives you the positions (indices) of the sorted elements, not your data.

> ex=sample(letters[1:5],20, replace=TRUE)
> ex

 [1] "b" "b" "b" "e" "d" "c" "e" "a" "a" "d" "d" "d" "a" "e" "b" "c" "e" "d" "a"
[20] "a"
> ex<-order(ex)
> ex
 [1] 8 9 13 19 20 1 2 3 15 6 16 5 10 11 12 18 4 7 14 17
>

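ex=ex[order(ex)]

should give you the same result as sort(ex). Ties do not change the sorted values; the one difference is that sort() drops NA values by default, while ex[order(ex)] keeps them at the end.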
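
A small sketch of the point above, using a made-up vector x (not the poster's data). It only illustrates that order() returns positions, which form a permutation and so can never contain duplicates, while indexing with those positions gives back the data itself:

```r
# Illustrative vector, invented for this sketch
x <- c("b", "a", "c", "a", NA, "b")

idx <- order(x)   # positions that would sort x; NA is placed last by default
idx
x[idx]            # the data rearranged into sorted order, NA kept at the end
sort(x)           # the sorted data, NA dropped by default

# order() returns a permutation of seq_along(x), so nothing repeats in it
any(duplicated(idx))

# duplicated() on the reordered data still finds the repeated values
duplicated(x[idx])
```

So duplicated(order(ex)) is FALSE everywhere by construction, whatever ex contains.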

>
> duplicated(ex)
>
> Why does duplicated() not work after order() has been applied but it works
> fine after sort() ? Is this an error or is there something I don't
> understand.
>
> I have been getting very strange results from duplicated() and unique() in a
> dataset I am analysing. Here is a little sample of my real-life problem
>
> > str(Masechaba$PROPDESC)
> Factor w/ 24545 levels " 06"," 71Hemilton str",..: 14527 8043 16113
> 16054 13875 15780 12522 7771 14824 12314 ...
> > # Create an indicator if the PROPDESC is unique. Default false
> > Masechaba$unique=FALSE
> > Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
> > # Check if something happened
> > length(which(Masechaba$unique==TRUE))
> [1] 2174
> > length(which(Masechaba$unique==FALSE))
> [1] 476
> > Masechaba$duplicate=FALSE
> > Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE
> > length(which(Masechaba$duplicate==TRUE))
> [1] 476
> > length(which(Masechaba$duplicate==FALSE))
> [1] 2174
> > # Looks OK so far
> > # Test on a known duplicate. I expect one to be true and one to be false
> > Masechaba[which(Masechaba$PROPDESC==2363),10:12]
>       PROPDESC unique duplicate
> 24874     2363   TRUE     FALSE
> 31280     2363   TRUE      TRUE
>
> # This is strange. I expected that unique() and duplicated() would give the
> same results. The variable PROPDESC is clearly not unique in both cases.

No.

> ex=sample(letters[1:5],10, replace=TRUE)
> ex
 [1] "b" "d" "d" "b" "a" "c" "b" "c" "d" "d"
> unique(ex)
[1] "b" "d" "a" "c"
> duplicated(ex)
 [1] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

Functions give you different answers about your data as you ask different questions.
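
To make the "different questions" concrete, here is a short sketch reusing the ten letters printed above (typed in by hand, since sample() would give new values). unique() answers "which distinct values are there?" and returns one element per distinct value; duplicated() answers "has this element been seen earlier?" and returns one logical per element:

```r
# The vector from the example above, entered literally
x <- c("b", "d", "d", "b", "a", "c", "b", "c", "d", "d")

length(unique(x))       # one entry per distinct value
length(duplicated(x))   # one logical per element of x

# duplicated() marks every repeat after the first occurrence,
# so dropping the marked elements reproduces unique()
identical(x[!duplicated(x)], unique(x))

# number of repeats + number of distinct values = length of x
sum(duplicated(x)) + length(unique(x)) == length(x)
```

Because the two results have different lengths, they cannot be compared position by position.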

> > Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE

                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
This looks strange. At first sight I am puzzled about what result I should expect from such a construction.
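
One possible reading, offered as a sketch and not as a diagnosis of the real data: unique() returns one element per distinct value, so it is shorter than the column. which() applied to it then yields 1, 2, ..., n (n being the number of distinct, non-NA values), and the assignment sets the first n rows to TRUE regardless of what those rows contain. That would agree with the counts reported (2174 TRUE, 476 FALSE). The data frame d and its column id below are invented for illustration:

```r
# Hypothetical stand-in for the real data frame
d <- data.frame(id = c("x", "y", "x", "z", "y", "x"))

# The construction from the original post, applied to d
d$unique <- FALSE
d$unique[which(is.na(unique(d$id)) == FALSE)] <- TRUE
d$unique     # TRUE TRUE TRUE FALSE FALSE FALSE: simply the first 3 rows

# A flag tied to each row's own value: TRUE for first occurrences only
d$first <- !duplicated(d$id)
d$first      # TRUE TRUE FALSE TRUE FALSE FALSE

# A flag for every row whose value occurs more than once
d$repeated <- duplicated(d$id) | duplicated(d$id, fromLast = TRUE)
d$repeated   # TRUE TRUE TRUE FALSE TRUE TRUE
```

If the aim is "one TRUE and one FALSE for a known pair of duplicates", !duplicated() gives that directly; the fromLast variant is for marking both members of the pair.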

Regards
Petr

> # The totals are the same but not the individual results
> > table(Masechaba$unique,Masechaba$duplicate)
>
>         FALSE TRUE
>   FALSE   342  134
>   TRUE   1832  342
>
> I don't understand this. Is there something I am missing?
>
> Best regards
> Christaan
>
>
> P.S
> > sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] splines stats graphics grDevices utils datasets methods
> base
>
> other attached packages:
> [1] plyr_0.1.9 maptools_0.7-34 lattice_0.18-8 foreign_0.8-40
> Hmisc_3.8-0 survival_2.35-8 rgdal_0.6-26
> [8] sp_0.9-64
>
> loaded via a namespace (and not attached):
> [1] cluster_1.12.3 grid_2.11.1 tools_2.11.1
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Received on Tue 08 Jun 2010 - 11:02:03 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 08 Jun 2010 - 12:00:27 GMT.
