From: Ben Haller <rhelp_at_sticksoftware.com>

Date: Wed, 20 Apr 2011 09:05:20 -0400


Laurent Jégou wrote:

> Hello list members, I'd like to calculate the Moran I on a large dataset,
> an 8640x3432 grid of values.
>
> When I try to create the neighbours list with the cell2nb function, on
> such a scale, R works for several hours and seems to crash. I'm using the
> latest version (2.13), 64-bit, on a Mac Pro with 4 GB of memory.
>
> I think that I'm doing it wrong :-)
>
> I'd like to use a "moving window" weight matrix instead of a full-scale
> one, but I was not lucky enough to find a solution in the docs.

I confronted this problem recently, and decided to roll my own. For a large dataset, an exhaustive computation of Moran's I is very time-consuming, as you say; but an estimate of it, based on a subset of pairs chosen randomly, seemed (for my data, at least) to converge quite nicely.

Here's my code. It assumes a square grid, so you'll need to adapt it for your non-square grid, but that should be trivial. To determine the right number of pairs for a good estimate in my application, I called this function with various numbers of pairs, plotted the estimate produced as a function of the number of pairs used, and observed good convergence by 200,000 pairs. You will probably want to do this yourself, for your dataset. 200,000 pairs took just a second or two to run, on my machine, so this is much, much faster than an exhaustive calculation. As a bonus, this code also gives you Geary's C.

Oh, and the code below assumes a wraparound lattice (i.e. a torus, i.e. periodic boundaries), which may not be what you want; just get rid of the d_wraparound stuff and the following pmax() call if you want non-wraparound boundaries, and that should work fine, I think. Not sure what you mean by the moving window thing, so I leave that as an exercise for the reader. :->
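The convergence check described above can be sketched as follows. This is an illustrative reconstruction, not the original test script: it uses a small random landscape and a simplified, non-wraparound random-pair Moran estimator (the full function with wraparound handling is given further down), and the grid size and pair counts are arbitrary example values.

```r
# Sketch of the convergence check: estimate Moran's I with increasing
# numbers of random pairs and watch the estimate settle down.
set.seed(1)
p <- matrix(rnorm(128 * 128), nrow=128)   # example landscape (white noise)

# Simplified random-pair Moran estimator (square grid, no wraparound)
moran_est <- function(p, n_pairs) {
  n <- NROW(p)
  pv <- as.vector(p) - mean(p)
  i1 <- sample.int(length(pv), n_pairs, replace=TRUE)
  i2 <- sample.int(length(pv), n_pairs, replace=TRUE)
  x1 <- (i1 - 1) %/% n; y1 <- (i1 - 1) %% n
  x2 <- (i2 - 1) %/% n; y2 <- (i2 - 1) %% n
  w <- 1 / sqrt((x1 - x2)^2 + (y1 - y2)^2)  # inverse-distance weights
  w[is.infinite(w)] <- 0                    # identical points get zero weight
  (n_pairs / sum(w)) * sum(w * pv[i1] * pv[i2]) /
    ((sum(pv[i1]^2) + sum(pv[i2]^2)) / 2)
}

pair_counts <- c(1000, 5000, 20000, 100000)
ests <- sapply(pair_counts, function(np) moran_est(p, np))
# plot(pair_counts, ests, type="b", log="x") to eyeball the convergence;
# for white noise the estimates should hover near 0.
```

For a spatially random landscape like this one, the estimates should fluctuate around zero and the fluctuation should shrink as the pair count grows; on real autocorrelated data you would look for the pair count at which the curve flattens.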

I've never made an R package, and I'm not sure there's much demand for this code (is there?), so I'm presently unlikely to package it up. However, if someone who owns a package related to this problem would like to adopt this code, generalize it to non-square lattices, add a flag to choose periodic or non-periodic boundaries, etc., feel free to do so; I hereby place it in the public domain. I'd just like to hear about it if someone does this, and receive credit somewhere, that's all. I'm not super-good in R, either, so if there's a better way to do some things, like the somewhat hacked random index generation (trying to avoid floating point issues when generating a random integer), please let me know. I'm always interested in learning how to do things better.
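On the random index generation: one built-in alternative to the runif()-based hack in the code below (my suggestion, not part of the original post) is sample.int(), which draws uniform random integers in 1..n directly and sidesteps the floating-point workaround entirely. The grid size here is just an example value.

```r
# Illustrative alternative to floor(runif(...)) index generation:
# sample.int() returns uniform random integers, already 1-based.
p_size <- 512                    # example grid dimension (assumption)
p_length <- p_size * p_size
n_pairs <- 200000

p1ix <- sample.int(p_length, n_pairs, replace=TRUE)
p1x <- ((p1ix - 1) %/% p_size) + 1   # recover grid coordinates
p1y <- ((p1ix - 1) %% p_size) + 1
```

Because sample.int() is 1-based, the `+ 1` adjustment moves into the coordinate recovery rather than the index itself; otherwise the bookkeeping matches the function below.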

And of course this code is provided without warranty, may have bugs, etc.

Enjoy!

Ben Haller

McGill University

correlation_stats <- function(p, n_pairs=200000)
{
    # Compute Moran's I and Geary's C for the landscape p.  This is tricky because the exact
    # calculation would be far too time-consuming, as it involves comparison of every point
    # to every other point.  So I am trying to estimate it here with a small subset of all
    # possible point comparisons.
    p_size <- NROW(p)        # 512
    p_length <- length(p)    # 262144
    mean_p <- mean(p)
    pv <- as.vector(p)

    # select points and look up their values; for speed, points are selected by their vector index
    p1ix <- floor(runif(n_pairs, min=0, max=p_length - 0.001))
    p1x <- (p1ix %/% p_size) + 1
    p1y <- (p1ix %% p_size) + 1
    p1ix <- p1ix + 1

    p2ix <- floor(runif(n_pairs, min=0, max=p_length - 0.001))
    p2x <- (p2ix %/% p_size) + 1
    p2y <- (p2ix %% p_size) + 1
    p2ix <- p2ix + 1

    v1 <- pv[p1ix] - mean_p
    v2 <- pv[p2ix] - mean_p

    # Calculate distances between point pairs, both directly and wrapping around.
    # The average direct distance is much smaller than the average wraparound distance,
    # because points near the center vertically have a maximum direct distance near 256,
    # but a minimum wraparound distance near 256.  The Moran's I estimate from wraparound
    # distances is different, as a result.  Rupert recommends taking the shorter of the
    # two distances, whichever it is, because that keeps us within the meaningful scale
    # of our space, before it just starts repeating due to periodicity.
    d_direct <- 1 / sqrt(((p1x - p2x) ^ 2) + ((p1y - p2y) ^ 2))
    d_direct[is.infinite(d_direct)] <- 0

    d_wraparound <- 1 / sqrt(((p1x - p2x) ^ 2) + ((p_size - abs(p1y - p2y)) ^ 2))
    d_wraparound[is.infinite(d_wraparound)] <- 0

    d <- pmax(d_direct, d_wraparound)    # max because we want the min distance, and these are 1/distance

    # precalculate some shared terms
    sum_d <- sum(d)
    sum_v1_sq <- sum(v1 ^ 2)

    # Moran's I: -1 is perfect dispersion, 1 is perfect correlation, 0 is a random spatial pattern
    M_I <- (n_pairs / sum_d) * (sum(d * v1 * v2) / ((sum_v1_sq + sum(v2 ^ 2)) / 2))
    # direct only:
    #   (n_pairs / sum(d_direct)) * (sum(d_direct * v1 * v2) / ((sum(v1 ^ 2) + sum(v2 ^ 2)) / 2))
    # wraparound only:
    #   (n_pairs / sum(d_wraparound)) * (sum(d_wraparound * v1 * v2) / ((sum(v1 ^ 2) + sum(v2 ^ 2)) / 2))

    # Geary's C: 0 is positive spatial autocorrelation, 2 is negative spatial autocorrelation,
    # 1 is a random spatial pattern
    G_C <- ((n_pairs - 1) * sum(d * ((v1 - v2) ^ 2))) / (2 * sum_d * sum_v1_sq)

    return(list(Moran_I=M_I, Geary_C=G_C))
}

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Wed 20 Apr 2011 - 13:32:11 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Fri 22 Apr 2011 - 08:40:32 GMT.
