Re: [R] string-to-number

From: Mike Nielsen <mr.blacksheep_at_gmail.com>
Date: Tue 22 Aug 2006 - 00:16:06 EST

Marc,

Thanks very much for this. I hadn't really looked at Rprof in the past; now I have a new toy to play with!

I have formulated a hypothesis that the reason parse/eval is quicker lies in the pattern-matching code: strsplit uses regular expressions, whereas parse perhaps uses some cleverer (but possibly less general) matching algorithm. It will be interesting to inspect the source code to get to the bottom of it.
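
A quick way to probe this, offered as a rough sketch rather than a measured result: strsplit() takes a fixed = TRUE argument that switches off regular-expression matching, so timing the same split both ways should show how much of the cost belongs to the pattern matcher. The input size is my own choice (it mirrors the 1:100000 example quoted below), and timings will vary with machine and R version.

```r
# Illustrative check only: regex matching vs. fixed-string matching.
x <- paste(1:100000, collapse = ",")

t_regex <- system.time(r1 <- strsplit(x, ",")[[1]])
t_fixed <- system.time(r2 <- strsplit(x, ",", fixed = TRUE)[[1]])

# Both calls must yield the same pieces; only the matching engine differs.
stopifnot(identical(r1, r2))

cat("regex elapsed:", t_regex[["elapsed"]],
    " fixed elapsed:", t_fixed[["elapsed"]], "\n")
```

If the two elapsed times come out close, the regular-expression engine is probably not the bottleneck and the explanation lies elsewhere.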

Thanks again for your interest and efforts in this, and for pointing out Rprof!

Regards,

Mike Nielsen

On 8/20/06, Marc Schwartz <MSchwartz@mn.rr.com> wrote:
> On Sat, 2006-08-19 at 10:25 -0600, Mike Nielsen wrote:
> > Wow. New respect for parse/eval.
> >
> > Do you think this is a special case of a more general principle? I
> > suppose the cost is memory, but from time to time a speedup like this
> > would be very beneficial.
> >
> > Any hints about how R programmers could recognize such cases would, I
> > am sure, be of value to the list in general.
> >
> > Many thanks for your efforts, Marc!
>
> Mike,
>
> I think that one needs to consider where the time is being spent and
> then adjust accordingly. Once you understand that, you can develop some
> insight into what may be a more efficient approach. R provides good
> profiling tools that facilitate this process.
>
> In this case, almost all of the time in the first two examples using
> strsplit(), is in that function:
>
> > repeated.measures.columns <- paste(1:100000, collapse = ",")
>
> > library(utils)
> > Rprof(tmp <- tempfile())
> > res1 <- as.numeric(unlist(strsplit(repeated.measures.columns, ",")))
> > Rprof()
>
> > summaryRprof(tmp)
> $by.self
>                     self.time self.pct total.time total.pct
> "strsplit"              23.68     99.7      23.68      99.7
> "as.double.default"      0.06      0.3       0.06       0.3
> "as.numeric"             0.00      0.0      23.74     100.0
> "unlist"                 0.00      0.0      23.68      99.7
>
> $by.total
>                     total.time total.pct self.time self.pct
> "as.numeric"             23.74     100.0      0.00      0.0
> "strsplit"               23.68      99.7     23.68     99.7
> "unlist"                 23.68      99.7      0.00      0.0
> "as.double.default"       0.06       0.3      0.06      0.3
>
> $sampling.time
> [1] 23.74
>
>
> Contrast that with Prof. Ripley's approach:
>
> > Rprof(tmp <- tempfile())
> > res3 <- eval(parse(text=paste("c(", repeated.measures.columns, ")")))
> > Rprof()
>
> > summaryRprof(tmp)
> $by.self
>         self.time self.pct total.time total.pct
> "parse"      0.42     87.5       0.42      87.5
> "eval"       0.06     12.5       0.48     100.0
>
> $by.total
>         total.time total.pct self.time self.pct
> "eval"        0.48     100.0      0.06     12.5
> "parse"       0.42      87.5      0.42     87.5
>
> $sampling.time
> [1] 0.48
>
>
> To some extent, one could argue that my initial timing examples are
> contrived, in that they specifically demonstrate a worst case scenario
> using strsplit(). Real world examples may or may not show such gains.
>
> For example with Charles' initial query, the initial vector was rather
> short:
>
> > repeated.measures.columns
> [1] "3,6,10"
>
> So if this was a one-time conversion, we would not see such significant
> gains.
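
For a short one-off string like this one, the choice can rest on grounds other than speed. The sketch below is my own illustration, not something timed in this exchange. It checks that the two approaches discussed here agree on the short input, and adds scan(text = ...), available in current versions of R, as a third option. One caution that is certain: eval(parse()) will evaluate whatever R code the string contains, so it is best kept for input you trust.

```r
x <- "3,6,10"  # the short input from the original query

# Split on commas, then convert the pieces.
via_split <- as.numeric(strsplit(x, ",", fixed = TRUE)[[1]])

# Build the text "c( 3,6,10 )" and let the R parser do the work.
# Note: this evaluates arbitrary code if x is not trusted.
via_parse <- eval(parse(text = paste("c(", x, ")")))

# scan() reads numbers directly and never evaluates code.
via_scan <- scan(text = x, sep = ",", quiet = TRUE)

stopifnot(all(via_split == via_parse), all(via_split == via_scan))
print(via_split)
```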
>
> However, what if we had a long list of shorter entries:
>
> > repeated.measures.columns <- paste(1:10, collapse = ",")
> > repeated.measures.columns
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> > big.list <- replicate(10000, list(repeated.measures.columns))
>
> > head(big.list)
> [[1]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> [[2]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> [[3]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> [[4]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> [[5]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
> [[6]]
> [1] "1,2,3,4,5,6,7,8,9,10"
>
>
> > system.time(res1 <- t(sapply(big.list, function(x)
> as.numeric(unlist(strsplit(x, ","))))))
> [1] 1.972 0.044 2.411 0.000 0.000
>
> > str(res1)
> num [1:10000, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
>
> > head(res1)
>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> [1,]    1    2    3    4    5    6    7    8    9    10
> [2,]    1    2    3    4    5    6    7    8    9    10
> [3,]    1    2    3    4    5    6    7    8    9    10
> [4,]    1    2    3    4    5    6    7    8    9    10
> [5,]    1    2    3    4    5    6    7    8    9    10
> [6,]    1    2    3    4    5    6    7    8    9    10
>
>
>
> Now use Prof. Ripley's approach:
>
> > system.time(res3 <- t(sapply(big.list, function(x)
> eval(parse(text=paste("c(", x, ")"))))))
> [1] 1.676 0.012 1.877 0.000 0.000
>
> > str(res3)
> num [1:10000, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
>
> > head(res3)
>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> [1,]    1    2    3    4    5    6    7    8    9    10
> [2,]    1    2    3    4    5    6    7    8    9    10
> [3,]    1    2    3    4    5    6    7    8    9    10
> [4,]    1    2    3    4    5    6    7    8    9    10
> [5,]    1    2    3    4    5    6    7    8    9    10
> [6,]    1    2    3    4    5    6    7    8    9    10
>
>
>
> > all(res1 == res3)
> [1] TRUE
>
>
> We do see a notable reduction in time with strsplit(), but a notable
> increase in time using eval(parse()), even though we are converting the
> same net number of values (100,000).
>
> Much of the increase with eval(parse()) is of course due to the overhead
> of sapply() and navigating the list.
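
One way to see how much of that is per-element overhead, sketched under my own assumptions: strsplit() is vectorized over a character vector, so the whole list can be split in a single call with no sapply() closure. The list here is smaller than the one above so that it runs quickly, and the matrix step assumes every entry has exactly 10 fields.

```r
# Smaller than the 10000-entry list above (size chosen for illustration).
entry <- paste(1:10, collapse = ",")
big.list <- replicate(1000, list(entry))

# One strsplit() call handles every entry at once.
pieces <- strsplit(unlist(big.list), ",", fixed = TRUE)

# All entries have 10 fields, so the flattened values fill a matrix row by
# row. Ragged entries would need different handling.
res_vec <- matrix(as.numeric(unlist(pieces)), ncol = 10, byrow = TRUE)

# Same result as the per-element sapply() version.
res_loop <- t(sapply(big.list, function(x)
  as.numeric(unlist(strsplit(x, ",")))))
stopifnot(all(res_vec == res_loop))
```

Wrapping each of the two versions in system.time() would show how the costs compare on a given machine.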
>
>
> Let's increase the size of the list components to 1000:
>
> > repeated.measures.columns <- paste(1:1000, collapse = ",")
> > big.list <- replicate(10000, list(repeated.measures.columns))
>
> > system.time(res1 <- sapply(big.list, function(x)
> as.numeric(unlist(strsplit(x, ",")))))
> [1] 33.270 0.744 37.163 0.000 0.000
>
> > system.time(res3 <- t(sapply(big.list, function(x)
> eval(parse(text=paste("c(", x, ")"))))))
> [1] 15.893 0.928 18.139 0.000 0.000
>
>
> So we see here that as the size of the list components increases, there
> continues to be an advantage to Prof. Ripley's approach over using
> strsplit().
>
> Again, one needs to develop an understanding of where the time is spent
> in the processing by profiling and then consider how to introduce
> efficiencies, which in some cases may well require the use of compiled
> C or FORTRAN code if run times become too long.
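
As a rough illustration of measuring before choosing, the sketch below times both approaches over a range of input sizes. The helper names (split_num, parse_num, elapsed) and the sizes are my own, not taken from the messages above, and a single system.time() call per cell is only a coarse measurement.

```r
# Hypothetical helpers wrapping the two approaches compared above.
split_num <- function(x) as.numeric(unlist(strsplit(x, ",")))
parse_num <- function(x) eval(parse(text = paste("c(", x, ")")))

# Elapsed seconds for a single call of f on x.
elapsed <- function(f, x) system.time(f(x))[["elapsed"]]

sizes <- c(10, 1000, 100000)  # number of values per string (assumed)
timings <- t(sapply(sizes, function(n) {
  x <- paste(seq_len(n), collapse = ",")
  c(n = n, split = elapsed(split_num, x), parse = elapsed(parse_num, x))
}))
print(timings)
```

Short inputs will often time at or near zero, so repeating each call and taking the total gives a steadier picture.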
>
> HTH,
>
> Marc Schwartz
>
>
>

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Tue Aug 22 01:39:47 2006
