Re: [Rd] Reading many large files causes R to crash - Possible Bug in R 2.15.1 64-bit Ubuntu

From: Joshua Ulrich <josh.m.ulrich_at_gmail.com>
Date: Mon, 23 Jul 2012 10:59:33 -0500

David,

Thank you for providing something reproducible.

This line:
templateTimes <- as.xts(templateTimes)

creates a zero-width xts object (i.e. the coredata is a zero-length vector, but there is a non-zero-length index). So, the to.period(templateTimes) call returns OHLC values read from arbitrary memory locations. This is the likely cause of the segfaults.
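
For illustration, a zero-width object can be sketched like this. The timestamps are made up, and the object is built with xts() directly rather than with the as.xts() call from the script, so treat it as a sketch of the idea only:

```r
library(xts)

# Ten one-second timestamps (illustrative values, not from the data files)
times <- as.POSIXct("2012-07-23 09:30:00", tz = "UTC") + 0:9

# An xts object with an index but no data columns
template <- xts(order.by = times)

length(index(template))     # number of timestamps in the index
length(coredata(template))  # number of data values
```

The index has one entry per timestamp while the coredata is empty, so there is nothing for to.period to aggregate.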

Since aggregating "no data" doesn't make sense, I have patched to.period to throw an error when run on zero-width/length objects (revision 690 on R-Forge). The attached file works with the CRAN version of xts because it avoids the issue entirely.
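
A caller-side guard along the same lines might look like the sketch below. safe_to_period is a hypothetical wrapper name, not part of xts, and the patched to.period may word its error differently. The second half shows one possible way to get bar-end timestamps for a template without aggregating anything, by calling endpoints() on the timestamps; it is not necessarily what the attached file does.

```r
library(xts)

# Hypothetical helper: refuse to aggregate objects that carry no data
safe_to_period <- function(x, ...) {
  if (length(coredata(x)) == 0L) {
    stop("cannot aggregate a zero-width/zero-length object")
  }
  to.period(x, ...)
}

# One minute of one-second timestamps (illustrative values)
times <- as.POSIXct("2012-07-23 09:30:00", tz = "UTC") + 0:59

# Positions of the last timestamp in each 10-second block;
# endpoints() returns a leading 0, which indexing silently drops
ep <- endpoints(times, on = "seconds", k = 10)
bar_ends <- times[ep]

# The guard stops with an error instead of aggregating an empty object
res <- try(safe_to_period(xts(order.by = times), period = "seconds", k = 10),
           silent = TRUE)
inherits(res, "try-error")
```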

Your script will still "hang" on the BAC_0.csv file because as.character.POSIXt can take a long time. Better to just call format() directly (as I do in the attached file).
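
As a rough sketch of that change (the timestamps here are made up), the unique dates can be pulled out by formatting straight to the date portion, rather than converting the full timestamp with as.character() and then calling substr():

```r
# Hourly timestamps spanning a few days (illustrative values)
idx <- as.POSIXct("2012-07-23 09:30:00", tz = "UTC") + 3600 * (0:47)

# Format only the date part, then de-duplicate
dates <- unique(format(idx, "%Y-%m-%d"))
dates
```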

If you have any follow-up questions, please send them to R-SIG-Finance.

Best,

--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com


On Mon, Jul 23, 2012 at 8:41 AM, David Terk <david.terk_at_gmail.com> wrote:

> I'm attaching a runnable script and corresponding data files. This will
> freeze at 83%.
>
> I'm not sure how much simpler to get than this.
>
> -----Original Message-----
> From: Joshua Ulrich [mailto:josh.m.ulrich_at_gmail.com]
> Sent: Monday, July 23, 2012 9:17 AM
> To: David Terk
> Cc: Duncan Murdoch; r-devel_at_r-project.org
> Subject: Re: [Rd] Reading many large files causes R to crash - Possible Bug
> in R 2.15.1 64-bit Ubuntu
>
> Well, you still haven't convinced anyone but yourself that it's definitely
> an xts problem, since you have not provided any reproducible example...
> --
> Joshua Ulrich | about.me/joshuaulrich
> FOSS Trading | www.fosstrading.com
>
>
> On Mon, Jul 23, 2012 at 8:14 AM, David Terk <david.terk_at_gmail.com> wrote:
>> Where should this be discussed since it is definitely XTS related? I
>> will gladly upload the simplified script + data files to whoever is
>> maintaining this part of the code. Fortunately there is a workaround
>> here.
>>
>> -----Original Message-----
>> From: Joshua Ulrich [mailto:josh.m.ulrich_at_gmail.com]
>> Sent: Monday, July 23, 2012 8:15 AM
>> To: David Terk
>> Cc: Duncan Murdoch; r-devel_at_r-project.org
>> Subject: Re: [Rd] Reading many large files causes R to crash -
>> Possible Bug in R 2.15.1 64-bit Ubuntu
>>
>> David,
>>
>> You still haven't provided a reproducible example. As Duncan already
>> said, "if you don't post code that allows us to reproduce the crash,
>> it's really unlikely that we'll be able to fix it."
>>
>> And R-devel is not the appropriate venue to discuss this if it's truly
>> an issue with xts/zoo.
>>
>> Best,
>> --
>> Joshua Ulrich | about.me/joshuaulrich
>> FOSS Trading | www.fosstrading.com
>>
>>
>> On Mon, Jul 23, 2012 at 12:41 AM, David Terk <david.terk_at_gmail.com> wrote:
>>> Looks like the call to:
>>>
>>> dat.i <- to.period(dat.i, period=per, k=subper, name=NULL)
>>>
>>> is what is causing the issue. If the name argument is not set, or is
>>> set to any value other than NULL, then no hang occurs.
>>>
>>> -----Original Message-----
>>> From: David Terk [mailto:david.terk_at_gmail.com]
>>> Sent: Monday, July 23, 2012 1:25 AM
>>> To: 'Duncan Murdoch'
>>> Cc: 'r-devel_at_r-project.org'
>>> Subject: RE: [Rd] Reading many large files causes R to crash -
>>> Possible Bug in R 2.15.1 64-bit Ubuntu
>>>
>>> I've isolated the bug. When the seg fault was produced there was an
>>> error that memory had not been mapped. Here is the odd part of the
>>> bug. If you comment out certain code and get a full run, then comment
>>> back in the code which is causing the problem, it will actually run.
>>> So I think it is safe to assume something wrong is taking place with
>>> memory allocation. For example, while testing, I have been able to
>>> get to a point where the code will run. But if I reboot the machine
>>> and try again, the code will not run.
>>>
>>> The bug itself is happening somewhere in XTS or ZOO. I will gladly
>>> upload the data files. It is happening on the 10th data file which
>>> is only 225k lines in size.
>>>
>>> Below is the simplified code. The call to either
>>>
>>> dat.i <- to.period(dat.i, period=per, k=subper, name=NULL)
>>> index(dat.i) <- index(to.period(templateTimes, period=per, k=subper))
>>>
>>> is what is causing R to hang or crash. I have been able to replicate
>>> this on Windows 7 64 bit and Ubuntu 64 bit. Seems easiest to
>>> consistently replicate from R Studio.
>>>
>>> The code below will consistently replicate when the appropriate files
>>> are used.
>>>
>>> parseTickDataFromDir = function(tickerDir, per, subper) {
>>>   tickerAbsFilenames = list.files(tickerDir,full.names=T)
>>>   tickerNames = list.files(tickerDir,full.names=F)
>>>   tickerNames = gsub("_[a-zA-Z0-9].csv","",tickerNames)
>>>   pb <- txtProgressBar(min = 0, max = length(tickerAbsFilenames), style = 3)
>>>
>>>   for(i in 1:length(tickerAbsFilenames)) {
>>>     dat.i = parseTickData(tickerAbsFilenames[i])
>>>     dates <- unique(substr(as.character(index(dat.i)), 1,10))
>>>     times <- rep("09:30:00", length(dates))
>>>     openDateTimes <- strptime(paste(dates, times), "%F %H:%M:%S")
>>>     templateTimes <- NULL
>>>
>>>     for (j in 1:length(openDateTimes)) {
>>>       if (is.null(templateTimes)) {
>>>         templateTimes <- openDateTimes[j] + 0:23400
>>>       } else {
>>>         templateTimes <- c(templateTimes, openDateTimes[j] + 0:23400)
>>>       }
>>>     }
>>>
>>>     templateTimes <- as.xts(templateTimes)
>>>     dat.i <- merge(dat.i, templateTimes, all=T)
>>>     if (is.na(dat.i[1])) {
>>>       dat.i[1] <- -1
>>>     }
>>>     dat.i <- na.locf(dat.i)
>>>     dat.i <- to.period(dat.i, period=per, k=subper, name=NULL)
>>>     index(dat.i) <- index(to.period(templateTimes, period=per, k=subper))
>>>     setTxtProgressBar(pb, i)
>>>   }
>>>   close(pb)
>>> }
>>>
>>> parseTickData <- function(inputFile) {
>>>   DAT.list <- scan(file=inputFile, sep=",", skip=1,
>>>                    what=list(Date="",Time="",Close=0,Volume=0), quiet=T)
>>>   index <- as.POSIXct(paste(DAT.list$Date,DAT.list$Time),
>>>                       format="%m/%d/%Y %H:%M:%S")
>>>   DAT.xts <- xts(DAT.list$Close,index)
>>>   DAT.xts <- make.index.unique(DAT.xts)
>>>   return(DAT.xts)
>>> }
>>>
>>> DATTick <- parseTickDataFromDir(tickerDirSecond, "seconds",10)
>>>
>>> -----Original Message-----
>>> From: Duncan Murdoch [mailto:murdoch.duncan_at_gmail.com]
>>> Sent: Sunday, July 22, 2012 4:48 PM
>>> To: David Terk
>>> Cc: r-devel_at_r-project.org
>>> Subject: Re: [Rd] Reading many large files causes R to crash -
>>> Possible Bug in R 2.15.1 64-bit Ubuntu
>>>
>>> On 12-07-22 3:54 PM, David Terk wrote:
>>>> I am reading several hundred files. Anywhere from 50k-400k in size.
>>>> It appears that when I read these files with R 2.15.1 the process
>>>> will hang or seg fault on the scan() call. This does not happen on
>>>> R 2.14.1.
>>>
>>> The code below doesn't do anything other than define a couple of
>>> functions.
>>> Please simplify it to code that creates a file (or multiple files),
>>> reads it or them, and shows a bug.
>>>
>>> If you can't do that, then gradually add the rest of the stuff from
>>> these functions into the mix until you figure out what is really
>>> causing the bug.
>>>
>>> If you don't post code that allows us to reproduce the crash, it's
>>> really unlikely that we'll be able to fix it.
>>>
>>> Duncan Murdoch
>>>
>>>>
>>>> This is happening on the precise build of Ubuntu.
>>>>
>>>> I have included everything, but the issue appears to be when
>>>> performing the scan in the method parseTickData.
>>>>
>>>> Below is the code. Hopefully this is the right place to post.
>>>>
>>>> parseTickDataFromDir = function(tickerDir, per, subper, fun) {
>>>>   tickerAbsFilenames = list.files(tickerDir,full.names=T)
>>>>   tickerNames = list.files(tickerDir,full.names=F)
>>>>   tickerNames = gsub("_[a-zA-Z0-9].csv","",tickerNames)
>>>>   pb <- txtProgressBar(min = 0, max = length(tickerAbsFilenames), style = 3)
>>>>
>>>>   for(i in 1:length(tickerAbsFilenames)) {
>>>>     # Grab Raw Tick Data
>>>>     dat.i = parseTickData(tickerAbsFilenames[i])
>>>>     #Sys.sleep(1)
>>>>
>>>>     # Create Template
>>>>     dates <- unique(substr(as.character(index(dat.i)), 1,10))
>>>>     times <- rep("09:30:00", length(dates))
>>>>     openDateTimes <- strptime(paste(dates, times), "%F %H:%M:%S")
>>>>     templateTimes <- NULL
>>>>
>>>>     for (j in 1:length(openDateTimes)) {
>>>>       if (is.null(templateTimes)) {
>>>>         templateTimes <- openDateTimes[j] + 0:23400
>>>>       } else {
>>>>         templateTimes <- c(templateTimes, openDateTimes[j] + 0:23400)
>>>>       }
>>>>     }
>>>>
>>>>     # Convert templateTimes to XTS, merge with data and convert NA's
>>>>     templateTimes <- as.xts(templateTimes)
>>>>     dat.i <- merge(dat.i, templateTimes, all=T)
>>>>
>>>>     # If there is no data in the first print, we will have leading
>>>>     # NA's. So set them to -1.
>>>>     # Since we do not want these values removed by to.period
>>>>     if (is.na(dat.i[1])) {
>>>>       dat.i[1] <- -1
>>>>     }
>>>>
>>>>     # Fix remaining NA's
>>>>     dat.i <- na.locf(dat.i)
>>>>
>>>>     # Convert to desired bucket size
>>>>     dat.i <- to.period(dat.i, period=per, k=subper, name=NULL)
>>>>
>>>>     # Always use templated index, otherwise merge fails with other
>>>>     # symbols
>>>>     index(dat.i) <- index(to.period(templateTimes, period=per, k=subper))
>>>>
>>>>     # If there was missing data at open, set close to NA
>>>>     valsToChange <- which(dat.i[,"Open"] == -1)
>>>>     if (length(valsToChange) != 0) {
>>>>       dat.i[valsToChange, "Close"] <- NA
>>>>     }
>>>>
>>>>     if(i == 1) {
>>>>       DAT = fun(dat.i)
>>>>     } else {
>>>>       DAT = merge(DAT,fun(dat.i))
>>>>     }
>>>>     setTxtProgressBar(pb, i)
>>>>   }
>>>>   close(pb)
>>>>   colnames(DAT) = tickerNames
>>>>   return(DAT)
>>>> }
>>>>
>>>> parseTickData <- function(inputFile) {
>>>>   DAT.list <- scan(file=inputFile, sep=",", skip=1,
>>>>                    what=list(Date="",Time="",Close=0,Volume=0), quiet=T)
>>>>   index <- as.POSIXct(paste(DAT.list$Date,DAT.list$Time),
>>>>                       format="%m/%d/%Y %H:%M:%S")
>>>>   DAT.xts <- xts(DAT.list$Close,index)
>>>>   DAT.xts <- make.index.unique(DAT.xts)
>>>>   return(DAT.xts)
>>>> }
>>>>
>>>> ______________________________________________
>>>> R-devel_at_r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>

Received on Mon 23 Jul 2012 - 16:25:28 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 23 Jul 2012 - 17:40:35 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.
