Re: [R] Help with matching rows

From: Petr Savicky <savicky_at_praha1.ff.cuni.cz>
Date: Thu, 21 Apr 2011 09:34:29 +0200

On Wed, Apr 20, 2011 at 10:09:26PM -0400, gary engstrom wrote:
> Dear Sir,
>
> Please excuse my awkwardness as I am new to R and computers, but I would kindly
> appreciate help.
> {
> a <- sample (1:10,100,replace=T )
> b <-sample(10:20,100,replace=T)
> c <- sample(20:30,100,replace=T)
> d <- sample(30:40,100,replace=T)
> e <- sample(40:50,100,replace=T)
> }
> d1 <- a
> d2 <- b
> d3 <-c
> d4 <- d
> d5 <- e
>
> data.frame(d1,d2,d3,d4,d5)
> dd <- data.frame(d1,d2,d3,d4,d5)
> dd
> sd(d1)
> summary(d1)
> sd(d2)
> summary(d2)
> sd(d3)
> summary(d3)
> sd(d4)
> summary(d4)
> sd(d5)
> summary(d5)
> I am a beginner to R and am trying to learn statistical
> probability. I have started Dr. Levine and Dr Kerns books.
> So far from the usual sources, I haven't found the answers
> to the following questions and would greatly appreciate
> any assistance that anyone might kindly share.
> If I run this code, how do I look for duplicate rows and how can

See ?duplicated .
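
As a minimal sketch of how duplicated() can be applied to a data frame like yours (the small data frame dd_small below is made up for illustration and is not part of your code):

```r
# Hypothetical toy data frame; rows 1 and 3 are identical.
dd_small <- data.frame(d1 = c(1, 2, 1, 3),
                       d2 = c(10, 11, 10, 12))

# TRUE for every row that repeats an earlier row
dup <- duplicated(dd_small)

# Is there at least one duplicated row?
any(dup)

# The rows that repeat an earlier row
dd_small[dup, ]

# All rows involved in a duplication, including the first occurrence
dd_small[dup | duplicated(dd_small, fromLast = TRUE), ]
```

The same calls should work unchanged on your dd, since duplicated() compares whole rows of a data frame.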

> I adjust the SD of the sample function to make the chances
> of a duplicate row occur more often ?

A simple way to increase the number of duplicated rows is to reduce the space from which the rows are drawn.

The following estimates the probability of having at least one duplicated row using your original code.

  m <- 10000     # number of simulated data frames
  count <- 0     # number of them with at least one duplicated row
  for (i in 1:m) {
      d1 <- sample(1:10,100,replace=TRUE)
      d2 <- sample(10:20,100,replace=TRUE)
      d3 <- sample(20:30,100,replace=TRUE)
      d4 <- sample(30:40,100,replace=TRUE)
      d5 <- sample(40:50,100,replace=TRUE)
      dd <- data.frame(d1,d2,d3,d4,d5)
      if (any(duplicated(dd))) {
          count <- count + 1
      }
  }
  count/m        # estimated probability

I obtained

  [1] 0.035

This probability may also be computed exactly as follows. The number of all possible rows is the product of the sizes of the sets from which the components are chosen. This is 10*11^4, since 1:10 has 10 elements and each of 10:20, 20:30, 30:40 and 40:50 has 11. Using this, the probability of having at least one duplicated row among 100 rows chosen independently from the uniform distribution on all possible rows is

  N <- 10*11^4 # the number of all possible rows
  1 - prod(1 - (0:99)/N)
  [1] 0.03325143
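
The same calculation can be wrapped in a small function, so that other sample-space sizes are easy to try. This is only a sketch: the name p_dup and its arguments are my own choice for this example, not a standard R function. It also compares the exact value with the usual birthday-problem approximation 1 - exp(-n*(n-1)/(2*N)), which is close when n is much smaller than N.

```r
# Hypothetical helper: probability of at least one duplicate among n rows
# drawn independently and uniformly from N equally likely rows.
p_dup <- function(n, N) {
    1 - prod(1 - (0:(n - 1))/N)
}

N <- 10*11^4                        # number of possible rows in the original code
exact  <- p_dup(100, N)             # same expression as above, with n = 100
approx <- 1 - exp(-100*99/(2*N))    # birthday-problem approximation
c(exact = exact, approx = approx)
```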

If the sample space is reduced to 8^5 using

    d1 <- sample(1:8,100,replace=TRUE)
    d2 <- sample(11:18,100,replace=TRUE)
    d3 <- sample(21:28,100,replace=TRUE)
    d4 <- sample(31:38,100,replace=TRUE)
    d5 <- sample(41:48,100,replace=TRUE)

then the probability of having at least one duplicated row increases to

  N <- 8^5
  1 - prod(1 - (0:99)/N)
  [1] 0.1403373
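
If you want to check this value by simulation as well, the loop above can be turned into a function that takes the sets to sample from as an argument. This is a sketch under my own naming: sim_dup, sets, n and m are made up for this example. The estimate varies from run to run and should only be near the exact value.

```r
# Hypothetical helper: estimate the probability of at least one duplicated
# row when each column is sampled uniformly from the corresponding set.
sim_dup <- function(sets, n = 100, m = 2000) {
    count <- 0
    for (i in 1:m) {
        # one column per set, n values each, drawn with replacement
        cols <- lapply(sets, function(s) sample(s, n, replace = TRUE))
        dd <- as.data.frame(cols)
        if (any(duplicated(dd))) {
            count <- count + 1
        }
    }
    count/m
}

set.seed(1)   # only to make the run repeatable
sets <- list(d1 = 1:8, d2 = 11:18, d3 = 21:28, d4 = 31:38, d5 = 41:48)
est <- sim_dup(sets)
est
```

Passing the sets as a list makes it easy to shrink or enlarge them and see how the estimate moves.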

Hope this helps.

Petr Savicky.



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Thu 21 Apr 2011 - 07:37:40 GMT
