# Re: [R] Rank-based p-value on large dataset

Date: Fri 04 Mar 2005 - 11:04:16 EST

However the accuracy depends on the number of ties you have in your "empirical" distribution. I have tried to simulate with the round() function below.

# current way
system.time({
p1 <- numeric( length(x) )
for(i in 1:length(x)){ p1[i] <- sum( x[i] < y )/length(y) } })
[1] 484.67 25.08 512.04 0.00 0.00

# suggested solution
system.time({
break.points <- c(-Inf, unique(sort(y)), Inf)   p2 <- cut( x, breaks=break.points, labels=FALSE )   p2 <- 1 - p2/length(break.points)
})
[1] 0.27 0.01 0.28 0.00 0.00

head( cbind( p1, p2 ) )

p1 p2

```[1,] 0.658375 0.65813482
[2,] 0.144000 0.14533705
[3,] 0.815500 0.81436898
[4,] 0.412500 0.41269640
[5,] 0.553625 0.55334516
[6,] 0.044500 0.04510897
```

...

cor(p1, p2)
[1] 0.9999987

The difference arises mainly because I had to use a unique breakpoints in cut(). You may want to investigate further if you are interested. Please let me know if you find anything good or bad about this suggestion as I am using it as part of my codes too. Thank you.

On Thu, 2005-03-03 at 17:22 -0500, Sean Davis wrote:
> I have a fairly simple problem--I have about 80,000 values (call them
> y) that I am using as an empirical distribution and I want to find the
> p-value (never mind the multiple testing issues here, for the time
> being) of 130,000 points (call them x) from the empirical distribution.
> I typically do that (for one-sided test) something like
>
> loop over i in x
> p.val[i] = sum(y>x[i])/length(y)
>
> and repeat for all i. However, length(x) is large here as is
> length(y), so this process takes quite a long time. Any suggestions?
>
> Thanks,
> Sean
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help