Re: [Rd] Reading many large files causes R to crash - Possible Bug in R 2.15.1 64-bit Ubuntu

From: David Terk <david.terk_at_gmail.com>
Date: Mon, 23 Jul 2012 09:14:46 -0400

Where should this be discussed since it is definitely XTS related? I will gladly upload the simplified script + data files to whoever is maintaining this part of the code. Fortunately there is a workaround here.
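
A minimal sketch of the workaround, assuming the xts package is installed: leave the name argument of to.period() unset rather than passing name=NULL, then rename the columns afterwards. The toy series and the object name bars are illustrative and not from the original script; that this avoids the hang is only what is reported in this thread.

```r
library(xts)  # assumption: xts (and its dependency zoo) is installed

# Toy one-minute series standing in for the tick data (illustrative only)
stamps <- as.POSIXct("2012-07-20 09:30:00", tz = "UTC") + 0:59
dat.i  <- xts(100 + seq_along(stamps) / 100, order.by = stamps)

# Omit name= instead of passing name=NULL, as reported to avoid the hang
bars <- to.period(dat.i, period = "seconds", k = 10)

# Restore plain OHLC names so later code such as bars[, "Open"] still works
colnames(bars) <- c("Open", "High", "Low", "Close")
```

Renaming after the call keeps downstream lookups by "Open" and "Close" working without relying on name=NULL.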

-----Original Message-----
From: Joshua Ulrich [mailto:josh.m.ulrich_at_gmail.com]
Sent: Monday, July 23, 2012 8:15 AM
To: David Terk
Cc: Duncan Murdoch; r-devel_at_r-project.org
Subject: Re: [Rd] Reading many large files causes R to crash - Possible Bug in R 2.15.1 64-bit Ubuntu

David,

You still haven't provided a reproducible example. As Duncan already said, "if you don't post code that allows us to reproduce the crash, it's really unlikely that we'll be able to fix it."

And R-devel is not the appropriate venue to discuss this if it's truly an issue with xts/zoo.
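
A sketch of the kind of self-contained example being requested, in base R only: it writes a few synthetic tick files in the Date,Time,Close,Volume layout that parseTickData() expects and reads one back with the same scan() call. The helper writeSyntheticTicks, the file names, and the generated prices are hypothetical stand-ins for the real data files, which were not posted.

```r
# Hypothetical helper: writes one synthetic tick file with a header line
# followed by Date,Time,Close,Volume rows, as parseTickData() expects.
writeSyntheticTicks <- function(path, nTicks = 1000,
                                day = as.Date("2012-07-20")) {
  sessionOpen <- as.POSIXct(paste(format(day), "09:30:00"), tz = "UTC")
  # Random trade times within the 23400-second session, sorted ascending
  stamps <- sessionOpen + sort(sample(0:23400, nTicks, replace = TRUE))
  ticks <- data.frame(
    Date   = format(stamps, "%m/%d/%Y", tz = "UTC"),
    Time   = format(stamps, "%H:%M:%S", tz = "UTC"),
    Close  = round(100 + cumsum(rnorm(nTicks, sd = 0.01)), 2),
    Volume = sample(1:10 * 100, nTicks, replace = TRUE)
  )
  write.csv(ticks, path, row.names = FALSE, quote = FALSE)
  invisible(path)
}

set.seed(1)  # make the synthetic data repeatable
tickDir <- file.path(tempdir(), "ticks")
dir.create(tickDir, showWarnings = FALSE)
for (k in 1:3) {
  writeSyntheticTicks(file.path(tickDir, sprintf("SYM%d_1.csv", k)))
}

# Same scan() call as in parseTickData(), applied to the first generated file
firstFile <- list.files(tickDir, full.names = TRUE)[1]
DAT.list <- scan(file = firstFile, sep = ",", skip = 1,
                 what = list(Date = "", Time = "", Close = 0, Volume = 0),
                 quiet = TRUE)
```

Pointing parseTickDataFromDir() at tickDir would then exercise the full code path without any private data.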

Best,

--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com


On Mon, Jul 23, 2012 at 12:41 AM, David Terk <david.terk_at_gmail.com> wrote:

> Looks like the call to:
>
> dat.i <- to.period(dat.i, period=per, k=subper, name=NULL)
>
> is what is causing the issue. If the name argument is not set, or is
> set to any value other than NULL, then no hang occurs.
>
> -----Original Message-----
> From: David Terk [mailto:david.terk_at_gmail.com]
> Sent: Monday, July 23, 2012 1:25 AM
> To: 'Duncan Murdoch'
> Cc: 'r-devel_at_r-project.org'
> Subject: RE: [Rd] Reading many large files causes R to crash -
> Possible Bug in R 2.15.1 64-bit Ubuntu
>
> I've isolated the bug. When the seg fault was produced, there was an
> error that memory had not been mapped. Here is the odd part of the
> bug: if you comment out certain code and get a full run, then comment
> the code that is causing the problem back in, it will actually run. So
> I think it is safe to assume something is going wrong with memory
> allocation. For example, while testing, I have been able to get to a
> point where the code will run, but if I reboot the machine and try
> again, the code will not run.
>
> The bug itself is happening somewhere in XTS or ZOO. I will gladly
> upload the data files. It is happening on the 10th data file which is
> only 225k lines in size.
>
> Below is the simplified code. The call to either
>
> dat.i <- to.period(dat.i, period=per, k=subper, name=NULL)
> index(dat.i) <- index(to.period(templateTimes, period=per, k=subper))
>
> is what is causing R to hang or crash. I have been able to replicate
> this on Windows 7 64 bit and Ubuntu 64 bit. Seems easiest to
> consistently replicate from R Studio.
>
> The code below will consistently replicate when the appropriate files
> are used.
>
> parseTickDataFromDir = function(tickerDir, per, subper) {
>   tickerAbsFilenames = list.files(tickerDir,full.names=T)
>   tickerNames = list.files(tickerDir,full.names=F)
>   tickerNames = gsub("_[a-zA-Z0-9].csv","",tickerNames)
>   pb <- txtProgressBar(min = 0, max = length(tickerAbsFilenames), style = 3)
>
>   for(i in 1:length(tickerAbsFilenames)) {
>     dat.i = parseTickData(tickerAbsFilenames[i])
>     dates <- unique(substr(as.character(index(dat.i)), 1,10))
>     times <- rep("09:30:00", length(dates))
>     openDateTimes <- strptime(paste(dates, times), "%F %H:%M:%S")
>     templateTimes <- NULL
>
>     for (j in 1:length(openDateTimes)) {
>       if (is.null(templateTimes)) {
>         templateTimes <- openDateTimes[j] + 0:23400
>       } else {
>         templateTimes <- c(templateTimes, openDateTimes[j] + 0:23400)
>       }
>     }
>
>     templateTimes <- as.xts(templateTimes)
>     dat.i <- merge(dat.i, templateTimes, all=T)
>     if (is.na(dat.i[1])) {
>       dat.i[1] <- -1
>     }
>     dat.i <- na.locf(dat.i)
>     dat.i <- to.period(dat.i, period=per, k=subper, name=NULL)
>     index(dat.i) <- index(to.period(templateTimes, period=per, k=subper))
>     setTxtProgressBar(pb, i)
>   }
>   close(pb)
> }
>
> parseTickData <- function(inputFile) {
>   DAT.list <- scan(file=inputFile, sep=",", skip=1,
>                    what=list(Date="",Time="",Close=0,Volume=0), quiet=T)
>   index <- as.POSIXct(paste(DAT.list$Date,DAT.list$Time),
>                       format="%m/%d/%Y %H:%M:%S")
>   DAT.xts <- xts(DAT.list$Close,index)
>   DAT.xts <- make.index.unique(DAT.xts)
>   return(DAT.xts)
> }
>
> DATTick <- parseTickDataFromDir(tickerDirSecond, "seconds",10)
>
> -----Original Message-----
> From: Duncan Murdoch [mailto:murdoch.duncan_at_gmail.com]
> Sent: Sunday, July 22, 2012 4:48 PM
> To: David Terk
> Cc: r-devel_at_r-project.org
> Subject: Re: [Rd] Reading many large files causes R to crash -
> Possible Bug in R 2.15.1 64-bit Ubuntu
>
> On 12-07-22 3:54 PM, David Terk wrote:
>> I am reading several hundred files, anywhere from 50k-400k in size.
>> It appears that when I read these files with R 2.15.1 the process
>> will hang or seg fault on the scan() call. This does not happen on
>> R 2.14.1.
>
> The code below doesn't do anything other than define a couple of
> functions. Please simplify it to code that creates a file (or multiple
> files), reads it or them, and shows a bug.
>
> If you can't do that, then gradually add the rest of the stuff from
> these functions into the mix until you figure out what is really
> causing the bug.
>
> If you don't post code that allows us to reproduce the crash, it's
> really unlikely that we'll be able to fix it.
>
> Duncan Murdoch
>
>>
>> This is happening on the Precise (12.04) build of Ubuntu.
>>
>> I have included everything, but the issue appears to occur when
>> performing the scan in the function parseTickData.
>>
>> Below is the code. Hopefully this is the right place to post.
>>
>> parseTickDataFromDir = function(tickerDir, per, subper, fun) {
>>   tickerAbsFilenames = list.files(tickerDir,full.names=T)
>>   tickerNames = list.files(tickerDir,full.names=F)
>>   tickerNames = gsub("_[a-zA-Z0-9].csv","",tickerNames)
>>   pb <- txtProgressBar(min = 0, max = length(tickerAbsFilenames), style = 3)
>>
>>   for(i in 1:length(tickerAbsFilenames)) {
>>
>>     # Grab Raw Tick Data
>>     dat.i = parseTickData(tickerAbsFilenames[i])
>>     #Sys.sleep(1)
>>
>>     # Create Template
>>     dates <- unique(substr(as.character(index(dat.i)), 1,10))
>>     times <- rep("09:30:00", length(dates))
>>     openDateTimes <- strptime(paste(dates, times), "%F %H:%M:%S")
>>     templateTimes <- NULL
>>
>>     for (j in 1:length(openDateTimes)) {
>>       if (is.null(templateTimes)) {
>>         templateTimes <- openDateTimes[j] + 0:23400
>>       } else {
>>         templateTimes <- c(templateTimes, openDateTimes[j] + 0:23400)
>>       }
>>     }
>>
>>     # Convert templateTimes to XTS, merge with data and convert NA's
>>     templateTimes <- as.xts(templateTimes)
>>     dat.i <- merge(dat.i, templateTimes, all=T)
>>
>>     # If there is no data in the first print, we will have leading
>>     # NA's. So set them to -1.
>>     # Since we do not want these values removed by to.period
>>     if (is.na(dat.i[1])) {
>>       dat.i[1] <- -1
>>     }
>>
>>     # Fix remaining NA's
>>     dat.i <- na.locf(dat.i)
>>
>>     # Convert to desired bucket size
>>     dat.i <- to.period(dat.i, period=per, k=subper, name=NULL)
>>
>>     # Always use templated index, otherwise merge fails with other
>>     # symbols
>>     index(dat.i) <- index(to.period(templateTimes, period=per, k=subper))
>>
>>     # If there was missing data at open, set close to NA
>>     valsToChange <- which(dat.i[,"Open"] == -1)
>>     if (length(valsToChange) != 0) {
>>       dat.i[valsToChange, "Close"] <- NA
>>     }
>>
>>     if(i == 1) {
>>       DAT = fun(dat.i)
>>     } else {
>>       DAT = merge(DAT,fun(dat.i))
>>     }
>>     setTxtProgressBar(pb, i)
>>   }
>>   close(pb)
>>   colnames(DAT) = tickerNames
>>   return(DAT)
>> }
>>
>> parseTickData <- function(inputFile) {
>>   DAT.list <- scan(file=inputFile, sep=",", skip=1,
>>                    what=list(Date="",Time="",Close=0,Volume=0), quiet=T)
>>   index <- as.POSIXct(paste(DAT.list$Date,DAT.list$Time),
>>                       format="%m/%d/%Y %H:%M:%S")
>>   DAT.xts <- xts(DAT.list$Close,index)
>>   DAT.xts <- make.index.unique(DAT.xts)
>>   return(DAT.xts)
>> }
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
Received on Mon 23 Jul 2012 - 14:18:06 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 23 Jul 2012 - 16:30:33 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.
