Re: [R] Rank-based p-value on large dataset

From: Sean Davis <sdavis2_at_mail.nih.gov>
Date: Fri 04 Mar 2005 - 11:15:41 EST


On 3/3/05 19:04, "Adaikalavan Ramasamy" <ramasamy@cancer.org.uk> wrote:

> One solution is to cut() 'x' according to the breaks defined by 'y'.
> Using cut with labels=FALSE is really fast. See a simulation below.
>
> However the accuracy depends on the number of ties you have in your
> "empirical" distribution. I have tried to simulate with the round()
> function below.
>
> # simulate data
> y <- round(rnorm(80000), 5) # empirical distribution with some ties
> x <- round(rnorm(130000), 5) # observed values with some ties
>
> # current way
> system.time({
> p1 <- numeric( length(x) )
> for(i in 1:length(x)){ p1[i] <- sum( x[i] < y )/length(y) }
> })
> [1] 484.67 25.08 512.04 0.00 0.00
>
> # suggested solution
> system.time({
> break.points <- c(-Inf, unique(sort(y)), Inf)
> p2 <- cut( x, breaks=break.points, labels=FALSE )
> p2 <- 1 - p2/length(break.points)
> })
> [1] 0.27 0.01 0.28 0.00 0.00
>
>
> head( cbind( p1, p2 ) )
> p1 p2
> [1,] 0.658375 0.65813482
> [2,] 0.144000 0.14533705
> [3,] 0.815500 0.81436898
> [4,] 0.412500 0.41269640
> [5,] 0.553625 0.55334516
> [6,] 0.044500 0.04510897
> ...
>
> cor(p1, p2)
> [1] 0.9999987
>
>
> The difference arises mainly because I had to use a unique breakpoints
> in cut(). You may want to investigate further if you are interested.
> Please let me know if you find anything good or bad about this
> suggestion as I am using it as part of my codes too. Thank you.
>
> Regards, Adai
>

Adai,

All I have to say is, "WOW!". I'll give it a try tomorrow and let you know.

Thanks,
Sean



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Fri Mar 04 11:21:05 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:30:40 EST