Re: [R] How to speed up grouping time series, help please

From: Den Alpin <den.alpin_at_gmail.com>
Date: Mon, 04 Apr 2011 17:20:50 +0200

I did some tests on Your and Gabor solutions, below my findings:

Any better idea to speed up grouping time series?

Thanks!

Below few line of codes to test (I suggest to grow X size to get better comparison results):
xtsSplit <- function(x)
{

  x <- xts(x[,c("ID","VALUE")], as.POSIXct(x[,"DATE"]))   x <- do.call(merge, split(x$VALUE,x$ID))   return(x)
}

xtsSplitTime <- replicate(100,
  system.time(xtsSplit(X))[[1]])
median(xtsTime)

zooReadTime <- replicate(100,
 system.time(z <- read.zoo(X, split = 1, index = 2, tz = ""))[[1]]) median(zooReadTime)

And my (unreadable) solution:
library(xts)
buildXtsFromDataFrame <- function(x, env)
{

  {
    if(exists("xx", envir = env))
    {

      VALUE <- as.matrix(x$VALUE)
      colnames(VALUE) <- as.character(x$ID[1])
      assign("xx",
        cbind(get("xx", env), xts(VALUE,
          as.POSIXct(x$DATE, tz = "GMT",
            format = '%Y-%m-%d %H:%M:%S'),
          tzone = 'GMT')),
        envir = env)

    } else
    {
      VALUE <- as.matrix(x$VALUE)
      colnames(VALUE) <- as.character(x$ID[1])
      assign("xx",
        xts(VALUE, as.POSIXct(x$DATE, tz = "GMT",
            format = '%Y-%m-%d %H:%M:%S'),
          tzone = 'GMT'),
        envir = env)

    }
    return(TRUE)
  }
}

xtsDaply <- function(...)
{

  e1 <- new.env(parent = baseenv())
  res <- daply(X, "ID", buildXtsFromDataFrame,

      env = e1)
  return(get("xx", e1))
}

Time04 <- replicate(100,
  system.time(xtsDaply(X, X$ID))[[1]])

2011/4/4 Joshua Ulrich <josh.m.ulrich_at_gmail.com>:

> Hi Dan,
>
> On Mon, Apr 4, 2011 at 7:49 AM, Den Alpin <den.alpin_at_gmail.com> wrote:
>> I retrieve for a few hundred times a group of time series (10-15 ts
>> with 10000 values each), on every group I do some calculation, graphs
>> etc. I wonder if there is a faster method than what presented below to
>> get an appropriate timeseries object.
>>
>> Making a query with RODBC for every group I get a data frame like this:
>>
>>> X
>>   ID                DATE     VALUE
>> 14  3 2000-01-01 00:00:03 0.5726334
>> 4   1 2000-01-01 00:00:03 0.8830174
>> 1   1 2000-01-01 00:00:00 0.2875775
>> 15  3 2000-01-01 00:00:04 0.1029247
>> 11  3 2000-01-01 00:00:00 0.9568333
>> 9   2 2000-01-01 00:00:03 0.5514350
>> 7   2 2000-01-01 00:00:01 0.5281055
>> 6   2 2000-01-01 00:00:00 0.0455565
>> 12  3 2000-01-01 00:00:01 0.4533342
>> 8   2 2000-01-01 00:00:02 0.8924190
>> 3   1 2000-01-01 00:00:02 0.4089769
>> 13  3 2000-01-01 00:00:02 0.6775706
>>
>> And I want to get a timeSeries object or xts object like this:
>>
>>                            1         2         3
>> 2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
>> 2000-01-01 00:00:01        NA 0.5281055 0.4533342
>> 2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
>> 2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
>> 2000-01-01 00:00:04        NA        NA 0.1029247
>>
>>
>> Input data can be sorted or unsorted (the most complicated case is in
>> the example, unsorted and missing data) in the sense that I can sort
>> in query if I can take an advantage from this.
>>
>> Some considerations:
>> - Xts is generally faster than timeSeries
>> - both accept a matrix so if I can create a matrix like the one
>> represented above and an array of characters representing dates faster
>> than what possible with xts:::cbind, for examole,I will have a faster
>> implementation (package data.table ?).
>> - create timeseries objects in multithread and then merge (package plyr ?)
>> - faster merge algorithms?
>>
>> Below some code to generate the test case above:
>>
>>
>> set.seed(123)
>> N <- 5 # number of observations
>> K <- 3 # number of timeseries ID
>>
>> X <- data.frame(
>>  ID = rep(1:K, each = N),
>>  DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT")+ 0:(N-1), K)),
>>  VALUE = runif(N*K), stringsAsFactors = FALSE)
>>
>> X <- X[sample(1:(N*K), N*K),] # sample observations to get random
>> order (optional)
>> X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))),] # 20% missing
>>
>> head(X, 15)
>>
>> # use explicitly environments to avoid '<<-'
>> buildTimeSeriesFromDataFrame <- function(x, env)
>> {
>>  {
>>    if(exists("xx", envir = env)) # if exist variable xx in env cbind
>>      assign("xx",
>>        cbind(get("xx", env), timeSeries(x$VALUE, x$DATE,
>>          format = '%Y-%m-%d %H:%M:%S',
>>          zone = 'GMT', units = as.character(x$ID[1]))),
>>        envir = env)
>>    else  # create xx in env
>>      assign("xx",
>>        timeSeries(x$VALUE, x$DATE, format = '%Y-%m-%d %H:%M:%S',
>>          zone = 'GMT', units = as.character(x$ID[1])),
>>        envir = env)
>>
>>    return(TRUE)
>>  }
>> }
>>
>> # use package plyr, faster than 'by' function
>> tsDaply <- function(...)
>> {
>>  library(plyr)
>>  e1 <- new.env(parent = baseenv()) #create a new env
>>  res <- daply(X, "ID", buildTimeSeriesFromDataFrame,
>>      env = e1)
>>  return(get("xx", e1)) # return xx from env
>> }
>>
>> ##replicate 100 times
>> #Time03 <- replicate(100,
>> #  system.time(tsDaply(X, X$ID))[[1]])
>> #median(Time03)
>>
>> # result
>> tsDaply(X, X$ID)
>>
>>
>> Thanks in advance for any input, best regards,
>> Den
>>
>>
>
> Here's how I would do it with xts:
>
> x <- xts(X[,c("ID","VALUE")], as.POSIXct(X[,"DATE"]))
> do.call(merge, split(x$VALUE,x$ID))
>
> My xts solution compares favorably to your solution:
>> Time03 <- replicate(100,
> +   system.time(tsDaply(X, X$ID))[[1]])
>> median(Time03)
> [1] 0.02
>> xtsTime <- replicate(100,
> +   system.time(do.call(merge, split(x$VALUE,x$ID)))[[1]])
>> median(xtsTime)
> [1] 0
>
> Best,
> --
> Joshua Ulrich  |  FOSS Trading: www.fosstrading.com
>

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Mon 04 Apr 2011 - 15:24:07 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 07 Apr 2011 - 11:00:27 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive