Re: [Rd] Reading many large files causes R to crash - Possible Bug in R 2.15.1 64-bit Ubuntu

From: David Terk <david.terk_at_gmail.com>
Date: Mon, 23 Jul 2012 01:25:13 -0400

I've isolated the bug. When the seg fault was produced there was an error that memory had not been mapped. Here is the odd part of the bug. If you comment out certain code and get a full run than comment in the code which is causing the problem it will actually run. So I think it is safe to assume something wrong is taking place with memory allocation. Example. While testing, I have been able to get to a point where the code will run. But if I reboot the machine and try again, the code will not run.

The bug itself is happening somewhere in XTS or ZOO. I will gladly upload the data files. It is happening on the 10th data file which is only 225k lines in size.

Below is the simplified code. The call to either

dat.i <- to.period(dat.i, period=per, k=subper, name=NULL) index(dat.i) <- index(to.period(templateTimes, period=per, k=subper))

is what is causing R to hang or crash. I have been able to replicate this on Windows 7 64 bit and Ubuntu 64 bit. Seems easiest to consistently replicate from R Studio.

The code below will consistently replicate when the appropriate files are used.

parseTickDataFromDir = function(tickerDir, per, subper) {   tickerAbsFilenames = list.files(tickerDir,full.names=T)   tickerNames = list.files(tickerDir,full.names=F)   tickerNames = gsub("_[a-zA-Z0-9].csv","",tickerNames)   pb <- txtProgressBar(min = 0, max = length(tickerAbsFilenames), style = 3)   

  for(i in 1:length(tickerAbsFilenames)) {     dat.i = parseTickData(tickerAbsFilenames[i])     dates <- unique(substr(as.character(index(dat.i)), 1,10))     times <- rep("09:30:00", length(dates))     openDateTimes <- strptime(paste(dates, times), "%F %H:%M:%S")     templateTimes <- NULL     

    for (j in 1:length(openDateTimes)) {

      if (is.null(templateTimes)) {
        templateTimes <- openDateTimes[j] + 0:23400
      } else {
        templateTimes <- c(templateTimes, openDateTimes[j] + 0:23400)
      }

    }     

    templateTimes <- as.xts(templateTimes)     dat.i <- merge(dat.i, templateTimes, all=T)     if (is.na(dat.i[1])) {
      dat.i[1] <- -1
    }
    dat.i <- na.locf(dat.i)

	dat.i <- to.period(dat.i, period=per, k=subper, name=NULL)
	index(dat.i) <- index(to.period(templateTimes, period=per,
k=subper))

    setTxtProgressBar(pb, i)
  }
  close(pb)
}

parseTickData <- function(inputFile) {
  DAT.list <- scan(file=inputFile,
sep=",",skip=1,what=list(Date="",Time="",Close=0,Volume=0),quiet=T)   index <- as.POSIXct(paste(DAT.list$Date,DAT.list$Time),format="%m/%d/%Y %H:%M:%S")
  DAT.xts <- xts(DAT.list$Close,index)
  DAT.xts <- make.index.unique(DAT.xts)
  return(DAT.xts)
}

DATTick <- parseTickDataFromDir(tickerDirSecond, "seconds",10)

-----Original Message-----
From: Duncan Murdoch [mailto:murdoch.duncan_at_gmail.com] Sent: Sunday, July 22, 2012 4:48 PM
To: David Terk
Cc: r-devel_at_r-project.org
Subject: Re: [Rd] Reading many large files causes R to crash - Possible Bug in R 2.15.1 64-bit Ubuntu

On 12-07-22 3:54 PM, David Terk wrote:
> I am reading several hundred files. Anywhere from 50k-400k in size.
> It appears that when I read these files with R 2.15.1 the process will
> hang or seg fault on the scan() call. This does not happen on R 2.14.1.

The code below doesn't do anything other than define a couple of functions. Please simplify it to code that creates a file (or multiple files), reads it or them, and shows a bug.

If you can't do that, then gradually add the rest of the stuff from these functions into the mix until you figure out what is really causing the bug.

If you don't post code that allows us to reproduce the crash, it's really unlikely that we'll be able to fix it.

Duncan Murdoch

>
>
>

> This is happening on the precise build of Ubuntu.
>
>
>

> I have included everything, but the issue appears to be when
> performing the scan in the method parseTickData.
>
>
>

> Below is the code. Hopefully this is the right place to post.
>
>
>

> parseTickDataFromDir = function(tickerDir, per, subper, fun) {
>

> tickerAbsFilenames = list.files(tickerDir,full.names=T)
>

> tickerNames = list.files(tickerDir,full.names=F)
>

> tickerNames = gsub("_[a-zA-Z0-9].csv","",tickerNames)
>

> pb <- txtProgressBar(min = 0, max = length(tickerAbsFilenames),
> style = 3)
>
>
>

> for(i in 1:length(tickerAbsFilenames)) {
>
>
>

> # Grab Raw Tick Data
>
> dat.i = parseTickData(tickerAbsFilenames[i])
>
> #Sys.sleep(1)
>
> # Create Template
>
> dates <- unique(substr(as.character(index(dat.i)), 1,10))
>
> times <- rep("09:30:00", length(dates))
>
> openDateTimes <- strptime(paste(dates, times), "%F %H:%M:%S")
>
> templateTimes <- NULL
> > >
> for (j in 1:length(openDateTimes)) {
>
> if (is.null(templateTimes)) {
>
> templateTimes <- openDateTimes[j] + 0:23400
>
> } else {
>
> templateTimes <- c(templateTimes, openDateTimes[j] + 0:23400)
>
> }
>
> }
> > >
> # Convert templateTimes to XTS, merge with data and convert NA's
>
> templateTimes <- as.xts(templateTimes)
>
> dat.i <- merge(dat.i, templateTimes, all=T)
>
> # If there is no data in the first print, we will have leading
> NA's. So set them to -1.
>

> # Since we do not want these values removed by to.period
>
> if (is.na(dat.i[1])) {
>
> dat.i[1] <- -1
>
> }
>
> # Fix remaining NA's
>
> dat.i <- na.locf(dat.i)
>
> # Convert to desired bucket size
>
> dat.i <- to.period(dat.i, period=per, k=subper, name=NULL)
>
> # Always use templated index, otherwise merge fails with other
> symbols
>

> index(dat.i) <- index(to.period(templateTimes, period=per,
> k=subper))
>

> # If there was missing data at open, set close to NA
>
> valsToChange <- which(dat.i[,"Open"] == -1)
>
> if (length(valsToChange) != 0) {
>
> dat.i[valsToChange, "Close"] <- NA
>
> }
>
> if(i == 1) {
>
> DAT = fun(dat.i)
>
> } else {
>
> DAT = merge(DAT,fun(dat.i))
>
> }
>
> setTxtProgressBar(pb, i)
>

> }
>

> close(pb)
>

> colnames(DAT) = tickerNames
>

> return(DAT)
>

> }
>
>
>

> parseTickData <- function(inputFile) {
>

> DAT.list <- scan(file=inputFile,
> sep=",",skip=1,what=list(Date="",Time="",Close=0,Volume=0),quiet=T)
>

> index <-
> as.POSIXct(paste(DAT.list$Date,DAT.list$Time),format="%m/%d/%Y
> %H:%M:%S")
>

> DAT.xts <- xts(DAT.list$Close,index)
>

> DAT.xts <- make.index.unique(DAT.xts)
>

> return(DAT.xts)
>

> }
>
>
>
>
>
>

> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Mon 23 Jul 2012 - 09:14:03 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

All messages

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 23 Jul 2012 - 12:30:32 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive