Re: [Rd] Importing csv files

From: Prof Brian Ripley
Date: Fri 24 Dec 2004 - 04:31:37 EST

On Thu, 23 Dec 2004, Frank E Harrell Jr wrote:

> Prof Brian Ripley wrote:
>> I think we need to know what you mean by `large' and why read.table is not
>> fast enough (and hence if some of the planned improvements might be all
>> that is needed).
> I was referring to the e-mail exchanges on r-help about read.table a few
> weeks ago, then there was a new discussion the other day concerning RAM usage
> and read.table not knowing the number of rows up front. I believe that the
> posters provided some timings and examples.

I have yet to see any example that used read.table competently and was still slow (although the RAM usage could be higher than some people expected). Unless people have followed _all_ the hints in the R Data Import/Export manual, I don't think there is anything to discuss.
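
As a rough illustration of what following those hints might look like, here is a minimal sketch. The file, its three columns, and their types are made up for the example; the arguments (colClasses, nrows, quote, comment.char) are the ones documented in ?read.table, and how much each helps will depend on the data.

```r
# Hypothetical data: a tiny CSV written to a temporary file for illustration.
csv <- tempfile(fileext = ".csv")
writeLines(c("id,score,group",
             "1,2.5,a",
             "2,3.5,b",
             "3,4.0,a"), csv)

dat <- read.table(csv, header = TRUE, sep = ",",
                  # Declare column types so read.table need not guess them.
                  colClasses = c("integer", "numeric", "character"),
                  # Give the row count (a mild over-estimate is fine) up front.
                  nrows = 3,
                  # Turn off quote and comment processing when the file has none.
                  quote = "",
                  comment.char = "")
str(dat)
```

For a real file the column types and row count would of course have to come from knowledge of the data (for example, a line count taken outside R).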

There is an issue with reading factors with just a few unique values, but that is one of the things being worked on.

>> Could you make some examples available for profiling?

Anyone who actually has a problem, then?

>> It seems to me that there are some delicate licensing issues in
>> distributing a product that writes .rda format except under GPL. See, for
>> example, the GPL FAQ.
> My understanding is that David is not distributing dataload any more, though
> I would not like to discourage commercial vendors (such as providers of
> Stat/Transfer and DBMSCOPY) from providing .rda output as an option. I
> assume that new code written under GPL would not be a problem. -Frank

I said `except under GPL'. I am not trying to discourage anyone, just pointing out that GPL has far-ranging implications that are often overlooked.

>> On Thu, 23 Dec 2004, Frank E Harrell Jr wrote:
>>> There is a recurring need for importing large csv files quickly. David
>>> Baird's dataload is a standalone program that will directly create .rda
>>> files from .csv (it also handles many other conversions). Unfortunately
>>> dataload is no longer publicly available because of some kind of
>>> relationship with Stat/Transfer. The idea is a good one, though. I
>>> wonder if anyone would volunteer to replicate the csv->rda standalone
>>> functionality or to provide some Perl or Python tools for making creation
>>> of .rda files somewhat easy outside of R.
>>> As an aside, I routinely see 30-fold reductions in file sizes for .rda
>>> files (made with save(..., compress=TRUE)) compared with the size of SAS
>>> binary datasets. And load( ) times are fast.
>>> It's been a great year for R. Let me take this opportunity to thank the R
>>> leaders for a fantastic job that gives immeasurable benefits to the
>>> community.

It's certainly been a great year for people to complain about R, R-help .... We say

         R is a collaborative project with many contributors.

but it seems to me much less than it used to be.

Brian D. Ripley,        
Professor of Applied Statistics,
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Received on Fri Dec 24 03:36:55 2004
