Re: [R] counting row repetitions without loop

From: Douglas Bates <bates_at_stat.wisc.edu>
Date: Wed, 6 Feb 2008 13:15:48 -0600

On Feb 6, 2008 8:08 AM, Waterman, DG (David) <david.waterman_at_diamond.ac.uk> wrote:
> Hi,

> I have a data frame consisting of coordinates on a 10*10 grid, i.e.

> > example
>    x y
> 1  4 5
> 2  6 7
> 3  6 6
> 4  7 5
> 5  5 7
> 6  6 7
> 7  4 5
> 8  6 7
> 9  7 6
> 10 5 6

> What I would like to do is return a 10*10 matrix consisting of counts
> at each position, so in the above example I would have a matrix where,
> for example, cell [4,5] contains 2 and [6,7] contains 3. At the moment I
> have implemented this using a for loop over the rows of the data frame,
> however the data frames I want to process are very long so the loop
> takes many minutes to complete. Can I do this in a more efficient way?

What you are describing is essentially a cross-tabulation so you could use

> examp
   x y
1  4 5
2  6 7
3  6 6
4  7 5
5  5 7
6  6 7
7  4 5
8  6 7
9  7 6
10 5 6
> xtabs(~ x + y, examp)
   y
x   5 6 7
  4 2 0 0
  5 0 1 1
  6 0 1 3
  7 1 1 0

This omits the rows and columns which are completely empty but you can work around that.
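One possible workaround, sketched here with base R only, is to declare all ten grid positions as factor levels before tabulating, so that the empty rows and columns are kept. The object names 'lev' and 'counts' are just illustrative, and the data are the ten pairs from the question.

```r
# Example data from the original question.
examp <- data.frame(x = c(4, 6, 6, 7, 5, 6, 4, 6, 7, 5),
                    y = c(5, 7, 6, 5, 7, 7, 5, 7, 6, 6))

# Fixing the levels at 1:10 keeps positions that never occur,
# so the table is always 10 x 10.
lev <- 1:10
counts <- table(x = factor(examp$x, levels = lev),
                y = factor(examp$y, levels = lev))

dim(counts)        # 10 10
counts["4", "5"]   # 2
counts["6", "7"]   # 3
```

The result is an ordinary dense table, which is fine for a 10*10 grid but may not be what you want for a very large one.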

If you have a very large collection of such pairs to summarize you could consider the version of xtabs in the Matrix package that allows for the argument sparse = TRUE. That uses conversion of the "triplet" form of a sparse matrix to the compressed column form to do the counting.
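As a hedged illustration: in current R releases the xtabs() in the stats package accepts sparse = TRUE as well, provided the Matrix package is installed. That is an assumption about your setup and not necessarily the version referred to above. The name 'stab' is only for this example.

```r
library(Matrix)

examp <- data.frame(x = c(4, 6, 6, 7, 5, 6, 4, 6, 7, 5),
                    y = c(5, 7, 6, 5, 7, 7, 5, 7, 6, 6))

# Same formula as before; the result is a sparse matrix instead of a
# dense table. Only the levels that occur are kept (4 rows, 3 columns).
stab <- xtabs(~ x + y, data = examp, sparse = TRUE)

dim(stab)        # 4 3
stab["6", "7"]   # 3
```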

If you want to do this without converting the integers in 'x' and 'y' to factors you can use a distinctly unobvious function like

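library(Matrix)

sparsetab <- function(x, y)
{
    x <- as.integer(x)
    y <- as.integer(y)
    stopifnot(length(x) == length(y), all(x >= 1L), all(y >= 1L))

    mx <- max(x)
    my <- max(y)
    ## Build the triplet form (0-based i and j, one entry of 1 per pair);
    ## conversion to compressed column form sums the duplicated entries.
    as(new("dgTMatrix", i = x - 1L, j = y - 1L,
           x = rep(1, length(x)), Dim = c(mx, my),
           Dimnames = list(as.character(seq_len(mx)),
                           as.character(seq_len(my)))),
       "CsparseMatrix")    # for this input the result is a dgCMatrix
}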

which produces

> with(examp, sparsetab(x, y))
7 x 7 sparse Matrix of class "dgCMatrix"
  1 2 3 4 5 6 7
1 . . . . . . .
2 . . . . . . .
3 . . . . . . .
4 . . . . 2 . .
5 . . . . . 1 1
6 . . . . . 1 3
7 . . . . 1 1 .
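Note that the result is sized by max(x) and max(y), so it is 7 x 7 here rather than the 10*10 asked for. A minimal sketch of one way to fix the dimensions, assuming the sparseMatrix() constructor from the Matrix package, which takes 1-based indices and sums the values of repeated (i, j) pairs. The names 'xs', 'ys', 'grid10' and 'dense10' are made up for this example.

```r
library(Matrix)

xs <- c(4L, 6L, 6L, 7L, 5L, 6L, 4L, 6L, 7L, 5L)
ys <- c(5L, 7L, 6L, 5L, 7L, 7L, 5L, 7L, 6L, 6L)

# One entry of 1 per observation; repeated (i, j) pairs are added up,
# so each cell ends up holding a count. dims is set to the grid size
# instead of being derived from the data.
grid10 <- sparseMatrix(i = xs, j = ys, x = rep(1, length(xs)),
                       dims = c(10L, 10L))

# Convert to an ordinary dense matrix if that is what is needed downstream.
dense10 <- as.matrix(grid10)
dense10[4, 5]   # 2
dense10[6, 7]   # 3
```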

One reason to use such a function instead of xtabs is that xtabs converts 'x' and 'y' to factors, and when the values are stored as character strings the default ordering of the levels is lexicographic, so '11' occurs before '2' (integer and numeric values are ordered numerically). Again, you can get around that, but the function shown above is more direct and should be fast enough for almost any application.
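A quick check of the level ordering, for reference. It is the character case that sorts lexicographically; integer input is sorted numerically before the labels are created. The variable names are only illustrative.

```r
# Character input: levels are sorted as strings, so "11" precedes "2".
chr_levels <- levels(factor(c("2", "11", "1")))
chr_levels   # "1" "11" "2"

# Integer input: values are sorted numerically, then turned into labels.
int_levels <- levels(factor(c(2L, 11L, 1L)))
int_levels   # "1" "2" "11"
```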



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Wed 06 Feb 2008 - 19:21:27 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 06 Feb 2008 - 20:30:12 GMT.
