# Re: [R] Rank-based p-value on large dataset

From: Sean Davis <sdavis2_at_mail.nih.gov>
Date: Fri 04 Mar 2005 - 11:15:41 EST

On 3/3/05 19:04, "Adaikalavan Ramasamy" <ramasamy@cancer.org.uk> wrote:

> One solution is to cut() 'x' according to the breaks defined by 'y'.
> Using cut with labels=FALSE is really fast. See a simulation below.
>
> However, the accuracy depends on the number of ties in your
> "empirical" distribution; I have simulated ties with the round()
> function below.
>
> # simulate data
> y <- round(rnorm(80000), 5) # empirical distribution with some ties
> x <- round(rnorm(130000), 5) # observed values with some ties
>
> # current way
> system.time({
> p1 <- numeric( length(x) )
> for(i in 1:length(x)){ p1[i] <- sum( x[i] < y )/length(y) }
> })
>  484.67 25.08 512.04 0.00 0.00
>
> # suggested solution
> system.time({
> break.points <- c(-Inf, unique(sort(y)), Inf)
> p2 <- cut( x, breaks=break.points, labels=FALSE )
> p2 <- 1 - p2/length(break.points)
> })
>  0.27 0.01 0.28 0.00 0.00
>
>
> head( cbind( p1, p2 ) )
>            p1         p2
> [1,] 0.658375 0.65813482
> [2,] 0.144000 0.14533705
> [3,] 0.815500 0.81436898
> [4,] 0.412500 0.41269640
> [5,] 0.553625 0.55334516
> [6,] 0.044500 0.04510897
> ...
>
> cor(p1, p2)
> [1] 0.9999987
>
>
> The difference arises mainly because I had to use unique breakpoints
> in cut(). You may want to investigate further if you are interested.
> Please let me know if you find a problem with this suggestion, as I am
> using it as part of my code too. Thank you.
>
>
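For what it's worth, the approximation introduced by the unique breakpoints in cut() can be avoided entirely: findInterval() returns, for each element of x, the number of values in the sorted y that are less than or equal to it, so 1 minus that count over length(y) reproduces the loop's p-value exactly, in vectorized form. A minimal sketch (smaller vectors than in the simulation above, purely for illustration):

```r
set.seed(1)
y <- round(rnorm(8000), 5)   # empirical distribution with some ties
x <- round(rnorm(1300), 5)   # observed values with some ties

# loop version from the post (exact but slow)
p1 <- vapply(x, function(xi) sum(xi < y) / length(y), numeric(1))

# vectorized exact version: findInterval(x, sort(y)) counts y-values <= x[i],
# so 1 - count/length(y) equals mean(y > x[i]) exactly, ties included
p3 <- 1 - findInterval(x, sort(y)) / length(y)

all.equal(p1, p3)   # TRUE
```

Since findInterval() works on a sorted vector, the one-off cost is a single sort of y, after which the lookup for all of x is done in one call.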