[R] How to speed up grouping time series, help please

From: Den Alpin <den.alpin_at_gmail.com>
Date: Mon, 04 Apr 2011 14:49:17 +0200


I retrieve for a few hundred times a group of time series (10-15 ts with 10000 values each), on every group I do some calculation, graphs etc. I wonder if there is a faster method than what presented below to get an appropriate timeseries object.

Making a query with RODBC for every group I get a data frame like this:

> X

   ID                DATE     VALUE
14  3 2000-01-01 00:00:03 0.5726334
4   1 2000-01-01 00:00:03 0.8830174
1   1 2000-01-01 00:00:00 0.2875775
15  3 2000-01-01 00:00:04 0.1029247
11  3 2000-01-01 00:00:00 0.9568333
9   2 2000-01-01 00:00:03 0.5514350
7   2 2000-01-01 00:00:01 0.5281055
6   2 2000-01-01 00:00:00 0.0455565
12  3 2000-01-01 00:00:01 0.4533342
8   2 2000-01-01 00:00:02 0.8924190

3 1 2000-01-01 00:00:02 0.4089769
13 3 2000-01-01 00:00:02 0.6775706

And I want to get a timeSeries object or xts object like this:

                            1         2         3
2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
2000-01-01 00:00:01        NA 0.5281055 0.4533342
2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
2000-01-01 00:00:04        NA        NA 0.1029247


Input data can be sorted or unsorted (the most complicated case is in the example, unsorted and missing data) in the sense that I can sort in query if I can take an advantage from this.

Some considerations:
- Xts is generally faster than timeSeries - both accept a matrix so if I can create a matrix like the one represented above and an array of characters representing dates faster than what possible with xts:::cbind, for examole,I will have a faster implementation (package data.table ?).
- create timeseries objects in multithread and then merge (package plyr ?) - faster merge algorithms?

Below some code to generate the test case above:

set.seed(123)
N <- 5 # number of observations
K <- 3 # number of timeseries ID

X <- data.frame(
  ID = rep(1:K, each = N),
  DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT")+ 0:(N-1), K)),   VALUE = runif(N*K), stringsAsFactors = FALSE)

X <- X[sample(1:(N*K), N*K),] # sample observations to get random order (optional)
X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))),] # 20% missing

head(X, 15)

# use explicitly environments to avoid '<<-' buildTimeSeriesFromDataFrame <- function(x, env) {
  {
    if(exists("xx", envir = env)) # if exist variable xx in env cbind

      assign("xx",
        cbind(get("xx", env), timeSeries(x$VALUE, x$DATE,
          format = '%Y-%m-%d %H:%M:%S',
          zone = 'GMT', units = as.character(x$ID[1]))),
        envir = env)
    else  # create xx in env
      assign("xx",
        timeSeries(x$VALUE, x$DATE, format = '%Y-%m-%d %H:%M:%S',
          zone = 'GMT', units = as.character(x$ID[1])),
        envir = env)

    return(TRUE)
  }
}

# use package plyr, faster than 'by' function tsDaply <- function(...)
{

  library(plyr)
  e1 <- new.env(parent = baseenv()) #create a new env   res <- daply(X, "ID", buildTimeSeriesFromDataFrame,

      env = e1)
  return(get("xx", e1)) # return xx from env }

##replicate 100 times
#Time03 <- replicate(100,
#  system.time(tsDaply(X, X$ID))[[1]])
#median(Time03)

# result
tsDaply(X, X$ID)

Thanks in advance for any input, best regards, Den



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Mon 04 Apr 2011 - 13:59:56 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 04 Apr 2011 - 14:40:27 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive