Re: [R] Fastest way to do HWE.exact test on 100K SNP data?

From: Martin Morgan <mtmorgan_at_fhcrc.org>
Date: Tue 06 Jun 2006 - 13:34:18 EST

Anna --

This might be a little faster ...

all.geno.df <- data.frame(all.geno)     # exploit shared structure
set.seed(123)                           # once only
for (iter in 1:1000) {

    permut <- sample(affSt)
    control <- all.geno.df[permut==1,]
    case <- all.geno.df[permut==2,]
    pvalControls <- unlist(sapply(control, HWE.exact)["p.value",])     pvalCases <- unlist(sapply(case, HWE.exact)["p.value",])     # presmuably do something with these, before next iteration }

but probably most of the time is being spent in HWE.exact so these cosmetic changes won't help much.

Looking at the code for HWE.exact, it looks like it generates a table based on numbers of alleles, and then uses this table to determine the probability. Any time two SNPs have the same number of alleles, the same table gets generated. It seems like this will happen alot. So one strategy is to only calculate the table once, 'remember' it, and then the next time a value is needed from that table look it up rather than calculate it.

This is a bad strategy if most SNPs have different numbers of alleles (because then the tables get calculated, and a lot of space and some time is used to store results that are never accessed again).

Here is the code (not tested extensively, so please use with care!):

HWE.exact.lookup <- function() {

    HWE.lookup.tbl <- new.env(hash=TRUE)     dhwe2 <- function(n11, n12, n22) {

        ## from body of HWE.exact
        f <- function(x) lgamma(x + 1)
        n <- n11 + n12 + n22
        n1 <- 2 * n11 + n12
        n2 <- 2 * n22 + n12
        exp(log(2) * (n12) + f(n) - f(n11) - f(n12) - f(n22) - 
            f(2 * n) + f(n1) + f(n2))

    }
    function (x) {
        ## adopted from HWE.exact
        if (!is.genotype(x)) 
          stop("x must be of class 'genotype' or 'haplotype'")
        nallele <- length(na.omit(allele.names(x)))
        if (nallele != 2) 
          stop("Exact HWE test can only be computed for 2 markers with 2 alleles")
        allele.tab <- table(factor(allele(x, 1), levels = allele.names(x)), 
                            factor(allele(x, 2), levels = allele.names(x)))
        n11 <- allele.tab[1, 1]
        n12 <- allele.tab[1, 2] + allele.tab[2, 1]
        n22 <- allele.tab[2, 2]
        n1 <- 2 * n11 + n12
        n2 <- 2 * n22 + n12

        nm <- paste(n1,n2,sep=".")
        if (!exists(nm, envir=HWE.lookup.tbl)) { # 'remember'
            x12 <- seq(n1%%2, min(n1, n2), 2)
            x11 <- (n1 - x12)/2
            x22 <- (n2 - x12)/2
            dist <- data.frame(n11 = x11, n12 = x12, n22 = x22,
                               density = dhwe2(x11, x12, x22))
            dist <- dist[order(dist$density),]
            dist$density <- cumsum(dist$density)
            assign(nm,dist, envir=HWE.lookup.tbl)
        }

        dist <- get(nm, HWE.lookup.tbl)
        dist$density[dist$n11 == n11 & dist$n12 == n12 & dist$n22 == n22]
    }
}
calcHWE <- HWE.exact.lookup()           # create a new lookup table & function
all.geno.df <- data.frame(all.geno)     # exploit shared structure
set.seed(123)                           # once only
for (iter in 1:1000) {

    permut <- sample(affSt)
    control <- all.geno.df[permut==1,]
    case <- all.geno.df[permut==2,]
    pvalControls <- sapply(control, calcHWE)     pvalCases <- sapply(case, calcHWE)
}

Let me know how it goes if you adopt this approach!

Martin

Anna Pluzhnikov <apluzhni@bsd.uchicago.edu> writes:

> Hi everyone,
>
> I'm using the function 'HWE.exact' of 'genetics' package to compute p-values of
> the HWE test. My data set consists of ~600 subjects (cases and controls) typed
> at ~ 10K SNP markers; the test is applied separately to cases and controls. The
> genotypes are stored in a list of 'genotype' objects, all.geno, and p-values are
> calculated inside the loop over all SNP markers.
>

> I wish to repeat this procedure multiple times (~1000) permuting the cases and
> controls (affection status). It seems straightforward to implement it like this:
>
> #############################################
>
> for (iter in 1:1000) {
> set.seed(iter)
> # get the permuted affection status
> permut <- sample(affSt)
> for (j in 1:nSNPs) {
> test <- tapply(all.geno[[j]], permut, HWE.exact)
> pvalControls[j] <- test$"1"$p.value
> pvalCases[j] <- test$"2"$p.value
> }
> }
>
> ##############################################
>
> The problem is that it takes ~1 min/iteration (on AMD Opteron 252 processor
> running Linux).
>
> Is there a faster/more efficient way to do this?
>
> Thanks,
> --
> Anna Pluzhnikov, PhD
> Section of Genetic Medicine
> Department of Medicine
> The University of Chicago
>
>
> -------------------------------------------------
> This email is intended only for the use of the individual or...{{dropped}}
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Tue Jun 06 13:37:44 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Tue 06 Jun 2006 - 18:10:34 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.