Re: [R] counting the occurrences of vectors

From: Marc Schwartz <MSchwartz_at_medanalytics.com>
Date: Tue 06 Jul 2004 - 09:08:34 EST

On Sun, 2004-07-04 at 19:28, Spencer Graves wrote:
> I see a case where "f1" gives the wrong answer:
>
> b <- array(c("a:b", "a", "c", "b:c"), dim=c(2,2))
> a <- b[c(1,1),]
>
> For these two matrices, f1(a,b) == c(2,2), while f2(a,b) ==
> c(2,0). If b does not contain ":", e.g., if it is numeric, then this
> pathology can not occur. However, if "f1" is used with objects of class
> character or string that could contain the "collapse" character, it
> could give an incorrect answer without warning.
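The collision Spencer describes can be reproduced directly. This is only a minimal sketch: "f1" itself is not shown in this message, so the assumption here is that it builds one key per row by paste()ing the row with ":" as the collapse character.

```r
# Assumption: keys are built by collapsing each row with ":"
b <- array(c("a:b", "a", "c", "b:c"), dim = c(2, 2))

# row 1 is ("a:b", "c"), row 2 is ("a", "b:c")
keys <- apply(b, 1, paste, collapse = ":")

# two different rows end up with the same key
keys
keys[1] == keys[2]
```

Both rows collapse to "a:b:c", so any tabulation of these keys cannot tell the two rows apart.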

Greetings,

After seeing Gabor's and Spencer's replies, I realized, of course, that my initial reply was not entirely what Ravi was looking for. :-)

However, after seeing Spencer's example above, I also noted the likely overhead involved in paste()ing the rows together to create keys that can then be tabulated. That overhead is likely to become more of an issue as the matrix size grows.

It came to me that with a modest modification to my initial function, combined with Gabor's approach to tabulation, a new function could be created that avoids the paste()ing overhead:

row.match.count <- function(m1, m2)
{
  if (ncol(m1) != (ncol(m2)))
    stop("Matrices must have the same number of columns")

  if (typeof(m1) != (typeof(m2)))
    stop("Matrices must have the same data type")

  m1.l <- as.character(apply(m1, 1, list))
  m2.l <- as.character(apply(m2, 1, list))

  # return counts for each row in m1.l in m2.l
  table(c(unique(m1.l), m2.l))[m1.l] - 1
}
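The last line of the function is the tabulation step. A small sketch of that idea on plain character keys may help; the vectors keys1 and keys2 are made up for illustration and stand in for m1.l and m2.l.

```r
# Made-up keys standing in for m1.l and m2.l
keys1 <- c("x", "y", "x")
keys2 <- c("x", "x", "z")

# Prepending unique(keys1) guarantees every key of keys1 appears in the
# table at least once, so indexing by keys1 never fails; subtracting 1
# removes that extra occurrence again.
counts <- table(c(unique(keys1), keys2))[keys1] - 1
counts
```

Here "x" occurs twice in keys2 and "y" not at all, so the counts are 2, 0 and 2, reported once per element of keys1.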

Using Gabor's original two matrices:

set.seed(1)
a <- matrix(sample(3,1000,rep=T),nc=5)
b <- matrix(sample(3,100,rep=T),nc=5)

We can then count how often each row of 'b' occurs in 'a':

> gc(); system.time(ans <- row.match.count(b, a))

         used (Mb) gc trigger (Mb)
Ncells 541226 14.5     741108 19.8
Vcells 141364  1.1     786432  6.0

[1] 0.01 0.00 0.00 0.00 0.00

Now...the downside to this approach is that the actual output of the function, due to the coercion, is a wee bit ugly (OK, more than a wee bit...)

For example, using Spencer's two matrices above, we get:

b <- array(c("a:b", "a", "c", "b:c"), dim=c(2,2))
a <- b[c(1,1),]

> row.match.count(b, a)

list(c("a:b", "c")) list(c("a", "b:c")) 
                  2                   0 


Going back to my two matrices:

> m <- matrix(1:20, ncol = 4, byrow = TRUE)
> n <- matrix(1:40, ncol = 4, byrow = TRUE)

> row.match.count(m, n)

    list(as.integer(c(1, 2, 3, 4)))     list(as.integer(c(5, 6, 7, 8))) 
                                  1                                   1 
 list(as.integer(c(9, 10, 11, 12))) list(as.integer(c(13, 14, 15, 16))) 
                                  1                                   1 
list(as.integer(c(17, 18, 19, 20))) 
                                  1 



So, since we have a few extra CPU cycles to use, we could include some sub()s to clean up the names in the resultant table:

row.match.count <- function(m1, m2)
{
  if (ncol(m1) != (ncol(m2)))
    stop("Matrices must have the same number of columns")

  if (typeof(m1) != (typeof(m2)))
    stop("Matrices must have the same data type")

  m1.l <- as.character(apply(m1, 1, list))
  m2.l <- as.character(apply(m2, 1, list))

  # return counts for each m1.l in m2.l
  match.table <- table(c(unique(m1.l), m2.l))[m1.l] - 1

  # clean up table names
  if (typeof(m1) == "integer")
  {
    names(match.table) <- sub("^list\\(as.integer\\(", "",
                              names(match.table))
    names(match.table) <- sub("\\)\\)$", "", names(match.table))
  }
  else if (typeof(m1) == "character")
  {
    names(match.table) <- sub("^list\\(", "", names(match.table))
    names(match.table) <- sub("\\)$", "", names(match.table))
  }

  match.table
}

Somebody with more regex insight than I could probably clean up the latter part of the function, but it seems to work well.
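One possible tidy-up, offered only as a sketch: the two type-specific branches can be folded into a single helper that works on the label strings alone. The name clean.names is hypothetical and not part of the function above; the input strings are the label formats shown earlier in this message.

```r
# Hypothetical helper (not part of the original function): strip the
# list( ... ) wrapper and, if present, the as.integer( ... ) wrapper
# from the deparsed row labels, whatever the matrix type.
clean.names <- function(x)
{
  # greedy (.*) keeps everything up to the final closing parenthesis
  x <- sub("^list\\((.*)\\)$", "\\1", x)
  # labels without an as.integer( ... ) wrapper are left unchanged
  sub("^as\\.integer\\((.*)\\)$", "\\1", x)
}

labels <- c('list(as.integer(c(1, 2, 3, 4)))', 'list(c("a:b", "c"))')
cleaned <- clean.names(labels)
cleaned
```

Because each sub() removes a matching opening and closing pair in one step, no check on typeof(m1) is needed, and labels from other types (for example doubles) would lose their list( ... ) wrapper as well.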

That being said, we now get:

> row.match.count(m, n)

    c(1, 2, 3, 4)     c(5, 6, 7, 8)  c(9, 10, 11, 12) c(13, 14, 15, 16) 
                1                 1                 1                 1 
c(17, 18, 19, 20) 
                1 

and

> row.match.count(b, a)

c("a:b", "c") c("a", "b:c") 
            2             0 

Going back to Gabor's original two matrices, the addition of the names clean up does not seem to add much overhead:

set.seed(1)
a <- matrix(sample(3,1000,rep=T),nc=5)
b <- matrix(sample(3,100,rep=T),nc=5)

> gc(); system.time(ans <- row.match.count(b, a))

         used (Mb) gc trigger (Mb)
Ncells 541243 14.5     818163 21.9
Vcells 140464  1.1     786432  6.0

[1] 0.01 0.00 0.01 0.00 0.00

> ans

c(2, 1, 1, 1, 2) c(3, 3, 1, 3, 2) c(2, 1, 2, 3, 2) c(3, 3, 2, 1, 1) 
               1                1                3                1 
c(1, 1, 1, 2, 3) c(1, 3, 2, 3, 3) c(2, 2, 2, 1, 2) c(2, 1, 1, 1, 1) 
               2                0                0                0 
c(3, 2, 2, 3, 3) c(2, 3, 3, 2, 2) c(3, 2, 1, 1, 2) c(2, 2, 2, 1, 3) 
               2                1                0                2 
c(1, 2, 2, 2, 1) c(3, 3, 3, 2, 1) c(2, 2, 3, 3, 3) c(3, 1, 1, 2, 3) 
               1                0                3                1 
c(3, 2, 3, 3, 1) c(1, 2, 2, 1, 2) c(1, 3, 2, 2, 2) c(1, 1, 1, 2, 3) 
               0                1                0                2 


I'd be curious to get any feedback on this, and to hear whether anyone sees any gotchas with this approach.
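One way to look for gotchas is to compare the results against a deliberately slow reference that never builds string keys. The sketch below is an assumption-laden illustration, not a replacement: row.match.count.slow is a hypothetical name, and it simply compares rows element by element.

```r
# Hypothetical brute-force reference: for each row of m1, count the rows
# of m2 that match it in every column. No keys are pasted or deparsed.
row.match.count.slow <- function(m1, m2)
{
  stopifnot(ncol(m1) == ncol(m2))
  vapply(seq_len(nrow(m1)), function(i) {
    hits <- rep(TRUE, nrow(m2))
    for (j in seq_len(ncol(m1)))
      hits <- hits & (m2[, j] == m1[i, j])
    sum(hits)
  }, integer(1))
}

# Spencer's matrices
b <- array(c("a:b", "a", "c", "b:c"), dim = c(2, 2))
a <- b[c(1, 1), ]
row.match.count.slow(b, a)

# my two matrices
m <- matrix(1:20, ncol = 4, byrow = TRUE)
n <- matrix(1:40, ncol = 4, byrow = TRUE)
row.match.count.slow(m, n)
```

For Spencer's matrices this gives 2 and 0, and for m and n it gives 1 for each of the five rows, which agrees with the output shown above. It is far too slow for large matrices, but handy for spot checks.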

Thanks and I hope that this is of some help.

Marc Schwartz



Received on Tue Jul 06 09:13:54 2004
