Re: [R] Find the 50 highest values in a matrix

From: Dennis Murphy <djmuser_at_gmail.com>
Date: Fri, 18 Jun 2010 06:10:01 -0700

Hi:

>From what I can tell, Henrik efficiently finds the 50 largest values without
the matrix
indices and Peter efficiently finds the matrix indices without the corresponding values.
Let's combine the two:

x <- rnorm(8e6)
is.na(x) <- sample(8e6, 1e6)
n <- 50
x1 <- sort(x, decreasing=TRUE)[1:n]
# Find the indices that match the 50 largest values ix <- which(x %in% x1)
# Map these to rows and columns assuming that x is shaped # into a 4000 * 2000 matrix columnwise:

ro <- ix %% 4000
ro <- ifelse(ro == 0L, 4000, ro)  # just in case we get row 4000
co <- 1 + ix %/% 4000

# Find the x values corresponding to the indices in ix, in their proper order
x2 <- x[ix]
d <- data.frame(row = ro, col = co, x = x2) d[rev(order(d$x)), ] # sort in decreasing order

To make sure this works, set up
xm <- matrix(x, nrow = 4000)

In my test,
> head(d) # before sorting

   row col x

1  280  17 4.375632
2 1346 179 4.506947
3 2661 214 4.438979
4 1870 308 4.447505
5 1951 344 4.497253
6   12 365 4.465841

> xm[280, 17]

[1] 4.375632
> xm[1870, 308]

[1] 4.447505

Looks OK. Total time to run the entire block of code:

   user system elapsed
   3.38 0.19 3.57

Now let's try it with Henrik's second method:

system.time({
x <- rnorm(8e6)
is.na(x) <- sample(8e6, 1e6)
n <- 50
x2 <- sort(-x, partial = n)
x2h <- -sort(x2[1:n])

ix <- which(x %in% x2h)
ro <- ix %% 4000
ro <- ifelse(ro == 0L, 4000, ro)
co <- 1 + ix %/% 4000
x3 <- x[ix]

d <- data.frame(row = ro, col = co, x = x3) d[rev(order(d$x)), ]
})

   user system elapsed
   2.24 0.17 2.42

Not bad for 8M observations :)

HTH,
Dennis

On Fri, Jun 18, 2010 at 4:39 AM, Henrik Bengtsson <hb_at_stat.berkeley.edu>wrote:

> You might also want to consider _partial sorting_ by using the
> 'partial' argument of sort(), especially when the number of data
> points is really large.
>
> Since argument 'decreasing=FALSE' is not supported when using
> 'partial', you have to flip it yourself by negating the values, e.g.
>
> x <- rnorm(8e6);
> is.na(x) <- sample(length(x), size=1e6);
>
> n <- 50;
> t1 <- system.time({
> x1 <- sort(x, decreasing=TRUE);
> x1h <- x1[1:n];
> });
>
> t2 <- system.time({
> x2 <- sort(-x, partial=n);
> x2h <- -sort(x2[1:n]);
> });
>
> stopifnot(identical(x2h, x1h));
> print(t2/t1);
> user system elapsed
> 0.3076923 0.7777778 0.3491525
>
> /Henrik
>
> On Fri, Jun 18, 2010 at 1:20 PM, Peter Ehlers <ehlers_at_ucalgary.ca> wrote:
> >
> > m <- matrix(round(rnorm(4000 * 2000), 4), nr = 4000)
> > is.na(m) <- sample(8e6, 1e6)
> >
> > system.time(
> > idx <- which(
> > matrix(m %in% head(sort(m, TRUE), 50),
> > nr = nrow(m)), arr.ind = TRUE))
> >
> > # user system elapsed
> > # 3.12 0.19 3.18
> >
> > -Peter Ehlers
> >
> >
> > On 2010-06-18 5:13, Dennis Murphy wrote:
> >>
> >> Hi:
> >>
> >> Here's a faked up example:
> >>
> >> a<- matrix(rnorm(4000*2000), 4000, 2000)
> >> # Generate some NAs in the matrix
> >> nr<- sample(50, 1:4000)
> >> nc<- sample(50, 1:2000)
> >> a[nr, nc]<- NA
> >>
> >> # convert to data frame:
> >> b<- data.frame(row = rep(1:4000, 2000), col = rep(1:2000, each = 4000),
> >> x = as.vector(a))
> >> # relatively time consuming...about 13.5 s on my machine
> >> bb<- b[rev(order(b$x, na.last = FALSE)), ]
> >>>
> >>> bb[1:10, ]
> >>
> >> row col x
> >> 691269 3269 173 5.103704
> >> 7815076 3076 1954 4.961544
> >> 4999621 3621 1250 4.953265
> >> 500469 469 126 4.937655
> >> 5878224 2224 1470 4.929150
> >> 4287270 3270 1072 4.913791
> >> 4442521 2521 1111 4.896869
> >> 4668867 867 1168 4.863504
> >> 5716575 575 1430 4.760778
> >> 3055274 3274 764 4.758995
> >>
> >> HTH,
> >> Dennis
> >>
> >>
> >> On Thu, Jun 17, 2010 at 10:41 PM,
> >> uschlecht<ulrich.schlecht_at_stanford.edu>wrote:
> >>
> >>>
> >>> Hi,
> >>>
> >>> I have a huge matrix (4000 * 2000 data points) and I would like to
> >>> retrieve
> >>> the coordinates (column and row) for the top 50 (or x) values. Some
> >>> positions in the matrix have NA as a value. These should be discarded.
> >>>
> >>> My current method is to replace all NAs by 0, then rank all the values
> >>> and
> >>> then extract the positions with the 50 highest ranks. It is very
> >>> time-consuming!
> >>>
> >>> Is there a simpler way to do this?
> >>>
> >>> Thank you,
> >>> Ulrich
> >>>
> >>> --
> >>> View this message in context:
> >>>
> >>>
> http://r.789695.n4.nabble.com/Find-the-50-highest-values-in-a-matrix-tp2259721p2259721.html
> >>> Sent from the R help mailing list archive at Nabble.com.
> >>>
> >>> ______________________________________________
> >>> R-help_at_r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>
> >> [[alternative HTML version deleted]]
> >>
> >
> > ______________________________________________
> > R-help_at_r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>

        [[alternative HTML version deleted]]



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Fri 18 Jun 2010 - 13:12:21 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 18 Jun 2010 - 13:50:33 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive