Re: [R] help with loading National Comorbidity Survey

From: Thomas Lumley <tlumley_at_u.washington.edu>
Date: Wed 05 Oct 2005 - 03:30:46 EST

On Sat, 1 Oct 2005, Jim Hurd wrote:
>
> Which provides data in DTA (STATA), XPT (SAS), and POR (SPSS) formats all
> of which I have tried to read with the foreign package but I am not able to
> load any of them. I have 2 gb of RAM, but R crashes when the memory gets
> just over 1 GB. I am using Windows version 2.1.1. The size of the DTA file
> is 48 MB; the xpt file is 188 MB.
>

If you mean the NCS 1 data file from that link (da06694-0001.dta) then I don't have this problem.

I have been able to load the .dta file under Windows on a computer with 1 GB of RAM. The maximum memory use was about 350 MB. It was very slow, though: about half an hour. This is because the processing of missing values and of factor levels in read.dta is very inefficient for very wide data frames. It calls [.data.frame, [<-.data.frame, etc., for each column, so the time is probably quadratic in the number of columns.
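To give a feel for the per-column overhead, here is a rough sketch rather than a measurement of read.dta itself. It converts every column of a synthetic wide data frame to a factor, once by assigning into the data frame column by column and once by working on a plain list and restoring the data frame class at the end. The function names, the 3000-column toy data, and the factor levels are all made up for illustration, and how large the gap is (if any) will depend on the R version.

```r
# Illustration only: synthetic data, not the NCS file.
set.seed(1)
n_col <- 3000
wide <- as.data.frame(matrix(sample(1:3, 10 * n_col, replace = TRUE), nrow = 10))

# One [[<-.data.frame call per column.
convert_on_data_frame <- function(df) {
  for (j in seq_along(df)) {
    df[[j]] <- factor(df[[j]], levels = 1:3)
  }
  df
}

# Convert the columns as a plain list, then rebuild the data frame once.
convert_on_list <- function(df) {
  cols <- lapply(df, factor, levels = 1:3)
  attr(cols, "row.names") <- .set_row_names(nrow(df))
  class(cols) <- "data.frame"
  cols
}

t_df   <- system.time(a <- convert_on_data_frame(wide))
t_list <- system.time(b <- convert_on_list(wide))

# Elapsed seconds for each approach; the results themselves should match.
print(c(data_frame = t_df[["elapsed"]], list = t_list[["elapsed"]]))
print(isTRUE(all.equal(a, b)))
```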

The call to .External that does the actual reading took less than 1% of the time. If you only want a hundred or so of the 3000 variables, it may be worth using just that .External() call to read the data, subsetting the result, and then working out how to apply the factor levels and so on.
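A minimal sketch of that read, subset, then label idea follows. It does not call .External() directly, since that entry point is internal to the foreign package. Instead it assumes that read.dta(convert.factors = FALSE) skips the per-column factor conversion and that the result carries the "val.labels" and "label.table" attributes, as it does in the versions of foreign I am aware of. The helper name read_dta_subset and the small demo file are hypothetical.

```r
library(foreign)

# Hypothetical helper: read without factor conversion, keep only the
# requested columns, and apply Stata value labels to just those columns.
read_dta_subset <- function(file, keep) {
  raw <- read.dta(file, convert.factors = FALSE)

  # Grab the label information first; subsetting would drop these attributes.
  val_labels  <- attr(raw, "val.labels")   # label-set name per variable ("" if none)
  label_table <- attr(raw, "label.table")  # named list: label-set name -> codes

  idx <- match(keep, names(raw))
  if (anyNA(idx)) stop("variables not found: ",
                       paste(keep[is.na(idx)], collapse = ", "))

  # Work on a plain list so no data frame methods run per column.
  cols <- unclass(raw)[idx]

  for (j in seq_along(idx)) {
    lab_name <- val_labels[idx[j]]
    if (length(lab_name) != 1 || is.na(lab_name) || lab_name == "") next
    labs <- label_table[[lab_name]]
    if (is.null(labs)) next
    # Leave partially labelled variables as numeric codes.
    if (!all(cols[[j]] %in% c(NA, labs))) next
    cols[[j]] <- factor(cols[[j]], levels = labs, labels = names(labs))
  }

  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Usage on a small made-up file (not the NCS data):
demo <- data.frame(id = 1:3,
                   sex = factor(c("m", "f", "m")),
                   score = c(1.5, 2, 3))
path <- tempfile(fileext = ".dta")
write.dta(demo, path)

small <- read_dta_subset(path, c("id", "sex"))
str(small)
```

Only the columns you ask for are converted to factors, so the conversion cost scales with the hundred or so variables you keep rather than with all 3000.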

read.dta clearly needs a different algorithm to handle very wide data sets efficiently.

         -thomas



Received on Wed Oct 05 03:34:53 2005
