From: Charles C. Berry <cberry_at_tajo.ucsd.edu>

Date: Mon, 7 Jan 2008 16:57:23 -0800

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Tue 08 Jan 2008 - 01:01:08 GMT

On Mon, 7 Jan 2008, Seung Jun wrote:

> I'm looking for a way to improve code that's proven to be inefficient.

Jim was probably right on both counts (use Rprof and expect wtd.quantile to be the place where the time is being used).

If following his advice doesn't get you what you need, try vectorizing the whole lot by stacking the 'index'es and the 'count's. To see how to do this look at these plots:

> library(Hmisc)
> index <- 1:4
> count <- index*10
> plot(wtd.quantile( index, count, seq(0,1,by=0.001)))
> plot(rep(index,count))

and now this one where I 'stack' another table on top of the first one:

index.2 <- c(1,3)
count.2 <- c(30,40)
plot( rep( c( index, index.2 ), c( count, count.2 ) ) )

As you can probably see, for your case wtd.quantile() is in effect doing a lookup, plus an interpolation between adjacent points in those cases where one is needed.
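To make the lookup concrete, here is a minimal sketch for a single table. It is an illustration, not wtd.quantile()'s actual code: the helper kth() and the choice p = 0.3 are made up for the example, and the position rule 1 + (n - 1) * p is the default (type 7) rule of base quantile(), which is assumed here to be the behaviour of interest for integer counts.

```r
index <- 1:4
count <- index * 10               # 10, 20, 30, 40 -> 100 observations
cum   <- cumsum(count)            # running count: 10, 30, 60, 100

# Hypothetical helper: the k-th smallest observation is the first index
# whose running count reaches k.
kth <- function(k) index[findInterval(k, cum, left.open = TRUE) + 1]

n    <- sum(count)
p    <- 0.3
pos  <- 1 + (n - 1) * p           # falls between the 30th and 31st observations
low  <- floor(pos)
frac <- pos - low

# Interpolate between the two neighbouring observations
q <- (1 - frac) * kth(low) + frac * kth(min(low + 1, n))

# Same answer as expanding the table with rep()
all.equal(q, quantile(rep(index, count), p, names = FALSE))
```

No rep() expansion is needed: the lookup touches only the four rows of the table.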

The challenge for you is to figure out how to do the lookup without resorting to approx(), which is what wtd.quantile() uses. Keeping track of the cumulative number of the stacked counts with cumsum(), the number in each table, and the cumulative number of counts for all previous tables should get you there.
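One way the bookkeeping could look is sketched below. This is only a sketch under stated assumptions: stacked_quantiles() is a hypothetical helper (not part of Hmisc), each index vector is assumed sorted ascending, each table is assumed to hold at least one observation, and the interpolation follows the default (type 7) rule of base quantile(), i.e. it is meant to reproduce quantile(rep(index, count), probs) for integer counts. Agreement with wtd.quantile() should be checked on your own data.

```r
# Hypothetical helper: quantiles for many (index, count) tables in one pass.
# index, count: lists holding one numeric vector per table.
stacked_quantiles <- function(index, count, probs = c(0, 0.2, 0.5, 0.8, 1)) {
  n_tab  <- length(index)
  x      <- unlist(index)                 # all index values, stacked
  cum    <- cumsum(unlist(count))         # running count over the whole stack
  total  <- cum[cumsum(lengths(index))]   # running count at the end of each table
  offset <- c(0, total[-n_tab])           # counts in all previous tables
  n      <- total - offset                # number of observations in each table

  # Position of each requested quantile within its own table;
  # rows are tables, columns are probs.
  pos  <- 1 + outer(n - 1, probs)
  low  <- floor(pos)
  high <- pmin(low + 1, n)
  frac <- as.vector(pos - low)

  # k-th observation of a table = first stacked row whose running count
  # reaches k + offset; a single findInterval() call covers every table.
  kth <- function(k) {
    x[findInterval(as.vector(k + offset), cum, left.open = TRUE) + 1]
  }

  q <- (1 - frac) * kth(low) + frac * kth(high)
  matrix(q, nrow = n_tab, dimnames = list(NULL, paste0("q", 100 * probs)))
}

# The two tables from the original question
index <- list(c(0, 1, 7, 30), c(0, 2, 3, 19))
count <- list(c(234, 120, 11, 1), c(199, 110, 87, 9))
q <- stacked_quantiles(index, count)
```

The result has one row per table and one column per probability, so it can be bound onto the data frame in one step instead of being filled in row by row inside a loop.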

HTH,
Chuck

> Suppose that a data source generates the following table every minute:
>
>   Index  Count
>   ------------
>       0    234
>       1    120
>       7     11
>      30      1
>
> I save the tables in the following CSV format:
>
>   time,index,count
>   0,0:1:7:30,234:120:11:1
>   1,0:2:3:19,199:110:87:9
>
> That is, each line represents a table, and I have N lines for N minutes of
> data collection.
>
> Now, I wrote the following code to get quantiles for each time period:
>
>   library(Hmisc)
>   stbl <- read.csv("data.csv")
>   index <- lapply(strsplit(stbl$index, ":", fixed = TRUE), as.numeric)
>   count <- lapply(strsplit(stbl$count, ":", fixed = TRUE), as.numeric)
>   len <- length(index)
>   for (i in 1:len) {
>     v <- wtd.quantile(index[[i]], count[[i]], c(0, 0.2, 0.5, 0.8, 1))
>     stbl$q0[i] <- v[1]
>     stbl$q2[i] <- v[2]
>     stbl$q5[i] <- v[3]
>     stbl$q8[i] <- v[4]
>     stbl$q10[i] <- v[5]
>   }
>
> It works fine for a small N, but it quickly gets inefficient as N grows. The
> for-loop takes too long. How could I improve the code or data
> representation so it can run fast?
>
> Thanks,
> Seung

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry_at_tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

