Re: [Rd] slow load() in R2.6.0

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Thu, 11 Oct 2007 09:11:54 +0100 (BST)

I still can't reproduce this with lots of empty strings, but the way they are handled was changed in R-patched. That change was not made to avoid a performance bottleneck, just to simplify the code.
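
For anyone who wants to try the same thing, a reproduction attempt could be sketched roughly as follows. This is only an illustration: the 90%/10% mix of '' and 'a' follows the description quoted further down, while the variable names, the seed and the number of repeated loads are assumptions, and a freshly created object may well not show the slowdown at all.

```r
# Sketch of a reproduction attempt (not the original test files).
# Assumed: seed, object name 'x', five repeated loads.
set.seed(1)
n <- 1e4
x <- sample(c("", "a"), n, replace = TRUE, prob = c(0.9, 0.1))

# Save to a temporary .rda file.
f <- tempfile(fileext = ".rda")
save(x, file = f)

# Load the same file several times, each time into a fresh environment,
# and record the elapsed time of every load.
elapsed <- vapply(1:5, function(i) {
  e <- new.env()
  system.time(load(f, envir = e))[["elapsed"]]
}, numeric(1))
print(elapsed)

# Confirm that the round trip preserved the vector.
e <- new.env()
load(f, envir = e)
stopifnot(identical(e$x, x))
unlink(f)
```

If the elapsed times grow from one load to the next, the session is showing the reported behaviour; roughly constant small times mean it has not been reproduced.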

I don't get object sizes as large as 500Kb, but "" is shared in the patched version, whereas each copy could need 28 bytes in 2.6.0 or 2.5.1. So (depending on how it was created), a vector of 10000 "" could reduce from 320Kb to about 40Kb. (All assuming a 32-bit system, but yours looked like Windows.)
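
The 320Kb and 40Kb figures can be checked with a little arithmetic. The sketch below assumes that each element of the vector costs one 4-byte pointer on a 32-bit build (that pointer size is an assumption, not stated above) and ignores the vector header; the 28 bytes per copy is the figure given above. The last lines compare object.size on whatever build is running, where the absolute numbers will differ, especially on 64-bit.

```r
# Back-of-envelope arithmetic for a 32-bit build.
n            <- 10000
ptr_bytes    <- 4    # assumed: one 4-byte pointer per element
string_bytes <- 28   # per-copy cost of "" quoted for 2.6.0 / 2.5.1

unshared <- n * (ptr_bytes + string_bytes)  # every "" stored separately
shared   <- n * ptr_bytes + string_bytes    # a single "" shared by all
print(c(unshared = unshared, shared = shared))

# Empirical comparison on the running R: a vector of identical empty
# strings against a vector of all-unique strings (names are illustrative).
empty          <- rep("", n)
unique_strings <- sprintf("s%05d", seq_len(n))
print(object.size(empty))
print(object.size(unique_strings))
```

The unshared total comes to 320000 bytes and the shared total to 40028 bytes, which matches the rough 320Kb and 40Kb above.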

This is work-in-progress in that R-devel is faster than R-patched, appreciably so on some problems, and that change will be ported to R-patched shortly.

On Thu, 11 Oct 2007, Mark.Bravington_at_csiro.au wrote:

> Problem fixed by R-patched, thanks; see comments below.
>
>> On Thu, 11 Oct 2007, Mark.Bravington_at_csiro.au wrote:
>>
>>> I'm encountering excruciatingly slow load times for character vectors
>>> in R 2.6.0-- up to 30sec for a 15K file that contains a no-attributes
>>> character vector of length ~1e4 and object size ~0.5MB. In R 2.5.1,
>>> repeated loads of the same set of files are near-instantaneous.
>>>
>>> The problem is proving tricky to reproduce consistently from scratch,
>>> so I have attached the 3 files used in the examples below.
>>
>> There was no attachment: since these are (I presume) binary files, can
>> you not put them on a website (as suggested by the posting guide)?
>
> Sorry, I would have if I could, but can't at present. The attachments
> got through OK to me at least, though. If anyone does have an interest
> in the files, let me know off-list and I'll re-send as a zip or
> somesuch.
>
>>
>>> If I create a similar-looking object from scratch, then save it and
>>> re-load it a few times, the problem doesn't always occur... at least
>>> not in that session.
>>>
>>>
>>> FWIW I have noticed that the time taken to load seems to be roughly a
>>> power of 2 of the "base slow load time"-- could be a red herring.
>>>
>>> The problem seems specific to character vectors-- I noticed it with
>>> entire workspaces and have whittled it down to char vecs only.
>>>
>>> The example below is from a brand-new session with only the basic
>>> packages loaded; delays in my real sessions are much longer.
>>
>> Can you please try R-patched or R-devel? We've found and solved a couple
>> of performance issues with creating STRSXPs, but with character vectors
>> of millions of elements.
>
> Thanks; R-patched fixed it. I did look in R-devel NEWS before posting,
> but that doesn't mention the bug fix on CHARSXP which is in the
> R-patched NEWS, so I didn't persist.
>
> FWIW in case work is still being done on new CHARSXP: my problems were
> with much shorter vectors (~1e4) than the millions mentioned in
> patched-NEWS, and the strings were short too: 90% were '' and the other
> 10% were 'a'. Also, when the previously offending objects are loaded
> into 2.6.0patched, they are 3-10X smaller (according to object.size)
> than in unpatched-- I was also amazed by the compression! Looks like
> unpatched R was allocating at least a 32-byte memory entry per
> individual zero-character string. It is down to about 4 bytes per
> (zero-character) string in R-patched.
>
>
> Mark Bravington
>
>>
>> I tried several examples of around 10000 elements and got times of at
>> most 0.05 secs in 2.6.0. These included parts of those examples on which
>> we had seen performance issues.
>>
>> A few clues:
>>
>> - even your base time is much slower than I would expect.
>>
>> - you say 'a 15K file ... object size ~0.5MB'. That's pretty phenomenal
>> compression, and I am seeing file sizes more like 100Kb for objects that
>> size. Since object.size does take into account duplication, one way to
>> get that would be to have all unique elements. At ca 50 bytes per
>> element you would need an average string length of about 15 chars. Such
>> an object takes about 200Kb as a .rda file.
>>
>>
>>>
>>>
>>> Mark Bravington
>>> CSIRO Mathematical & Information Sciences
>>> Marine Laboratory
>>> Castray Esplanade
>>> Hobart 7001
>>> TAS
>>>
>>> ph (+61) 3 6232 5118
>>> fax (+61) 3 6232 5012
>>> mob (+61) 438 315 623
>>>
>>>
>>>
>>> Type 'demo()' for some demos, 'help()' for on-line help, or
>>> 'help.start()' for an HTML browser interface to help. Type 'q()' to
>>> quit R.
>>>
>>>> system.time( load( 'd:/r2.0/t1.rda'))
>>> user system elapsed
>>> 0.5 0.0 0.5
>>>> system.time( load( 'd:/r2.0/t1.rda')) # same file; slower
>>> user system elapsed
>>> 3.5 0.0 3.5
>>>> system.time( load( 'd:/r2.0/t1.rda'))
>>> user system elapsed
>>> 4.13 0.00 4.13
>>>> system.time( load( 'd:/r2.0/t1.rda'))
>>> user system elapsed
>>> 3.51 0.00 3.52
>>>
>>>> system.time( load( 'd:/r2.0/t2.rda')) # different bigger file
>>> user system elapsed
>>> 4.42 0.00 4.42
>>>> system.time( load( 'd:/r2.0/t2.rda')) # same file; slower
>>> user system elapsed
>>> 10.44 0.00 10.44
>>>> system.time( load( 'd:/r2.0/t2.rda'))
>>> user system elapsed
>>> 10.79 0.00 10.80
>>>> system.time( load( 'd:/r2.0/t2.rda'))
>>> user system elapsed
>>> 10.39 0.00 10.41
>>>> system.time( load( 'd:/r2.0/t1.rda')) # the smaller file again; slower
>>> user system elapsed
>>> 10.67 0.00 10.69
>>>> system.time( load( 'd:/r2.0/t3.rda')) # different smaller file
>>> user system elapsed
>>> 10.51 0.00 10.52
>>>> system.time( load( 'd:/r2.0/t2.rda')) # now bigger file again: slower
>>> user system elapsed
>>> 14.61 0.00 14.61
>>>
>>>
>>>
>>> --please do not edit the information below--
>>>
>>> Version:
>>> platform = i386-pc-mingw32
>>> arch = i386
>>> os = mingw32
>>> system = i386, mingw32
>>> status =
>>> major = 2
>>> minor = 6.0
>>> year = 2007
>>> month = 10
>>> day = 03
>>> svn rev = 43063
>>> language = R
>>> version.string = R version 2.6.0 (2007-10-03)
>>>
>>> Windows XP (build 2600) Service Pack 2.0
>>>
>>> Locale:
>>> LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252
>>>
>>> Search Path:
>>> .GlobalEnv, package:stats, package:graphics, package:grDevices,
>>> package:utils, package:datasets, package:methods, Autoloads,
>>> package:base
>>>
>>
>> --
>> Brian D. Ripley, ripley_at_stats.ox.ac.uk
>> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford, Tel: +44 1865 272861 (self)
>> 1 South Parks Road, +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK Fax: +44 1865 272595
>>
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu 11 Oct 2007 - 08:14:49 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.