Re: [R] Fw: Memory problem on a linux cluster using a large data set [Broadcast]

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Wed 10 Jan 2007 - 15:09:12 GMT

On Wed, 10 Jan 2007, Iris Kolder wrote:

> Hi
>
> I followed all your advice and ran my data on a computer with a 64-bit
> processor, but I still get the same error saying "it cannot allocate
> a vector of that size 1240 kb". I don't want to cut my data into smaller
> pieces because we are looking at interactions. So are there any other
> options for me to try, or should I wait for the development of more
> advanced computers?

Did you use a 64-bit build of R on that machine? If the message is the same, I strongly suspect not. 64-bit builds are not the default on most OSes.
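
One way to check this from inside the running session is sketched below. It uses only base R as found in current releases; the variable names are illustrative and not part of any API.

```r
# Pointer size in bytes: 8 on a 64-bit build of R, 4 on a 32-bit build.
ptr_bytes <- .Machine$sizeof.pointer
is_64bit <- ptr_bytes == 8L

# R.version$arch names the architecture this R was built for.
cat("Pointer size:", ptr_bytes, "bytes; architecture:", R.version$arch, "\n")
cat("64-bit build of R:", is_64bit, "\n")
```

If this reports a 4-byte pointer, the session is a 32-bit build even when the processor and operating system are 64-bit.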

>
> Thanks,
>
> Iris
>
>
> ----- Forwarded Message ----
> From: Iris Kolder <iriskolder@yahoo.com>
> To: r-help@stat.math.ethz.ch
> Sent: Thursday, December 21, 2006 2:07:08 PM
> Subject: Re: [R] Memory problem on a linux cluster using a large data set [Broadcast]
>
>
> Thank you all for your help!
>
> So, following all your suggestions, we will try to run it on a computer
> with a 64-bit processor. But I've been told that the new R versions all
> work on a 32-bit processor. I read in other posts that only the old R
> versions were capable of handling larger data sets and ran under 64-bit
> processors. I also read that they are adapting the new R version for
> 64-bit processors again, so does anyone know if there is a version
> available that we could use?
>
> Iris Kolder
>
> ----- Original Message ----
> From: "Liaw, Andy" <andy_liaw@merck.com>
> To: Martin Morgan <mtmorgan@fhcrc.org>; Iris Kolder <iriskolder@yahoo.com>
> Cc: r-help@stat.math.ethz.ch; N.C. Onland-moret <n.c.onland@umcutrecht.nl>
> Sent: Monday, December 18, 2006 7:48:23 PM
> Subject: RE: [R] Memory problem on a linux cluster using a large data set [Broadcast]
>
>
> In addition to my off-list reply to Iris (pointing her to an old post of
> mine that detailed the memory requirement of RF in R), she might
> consider the following:
>
> - Use larger nodesize
> - Use sampsize to control the size of bootstrap samples
>
> Both of these have the effect of reducing sizes of trees grown. For a
> data set that large, it may not matter to grow smaller trees.
>
> Still, with data of that size, I'd say 64-bit is the better solution.
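
The effect of the two arguments mentioned above can be seen on a small synthetic example. This is only a sketch: it assumes the randomForest package is installed, the data are random stand-ins rather than real SNP data, and the particular values of nodesize and sampsize are arbitrary choices for illustration.

```r
library(randomForest)

set.seed(1)
# Synthetic stand-in for the SNP data: 500 samples, 20 markers coded 0/1/2.
x <- matrix(sample(0:2, 500 * 20, replace = TRUE), nrow = 500,
            dimnames = list(NULL, paste0("snp", 1:20)))
y <- factor(sample(c("case", "control"), 500, replace = TRUE))

# Package defaults for classification: nodesize = 1, sampsize = nrow(x).
rf_default <- randomForest(x, y, ntree = 50)

# Larger terminal nodes and smaller per-tree samples give smaller trees.
rf_small <- randomForest(x, y, ntree = 50, nodesize = 25, sampsize = 100)

# treesize() counts the terminal nodes of each tree by default.
size_default <- mean(treesize(rf_default))
size_small <- mean(treesize(rf_small))
cat("Mean terminal nodes per tree:", size_default, "vs", size_small, "\n")
```

Whether the smaller trees cost any accuracy has to be checked on the real data; the sketch only shows that the trees themselves shrink.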
>
> Cheers,
> Andy
>
> From: Martin Morgan
>>
>> Iris --
>>
>> I hope the following helps; I think you have too much data
>> for a 32-bit machine.
>>
>> Martin
>>
>> Iris Kolder <iriskolder@yahoo.com> writes:
>>
>>> Hello,
>>>
>>> I have a large data set 320.000 rows and 1000 columns. All the data
>>> has the values 0,1,2.
>>
>> It seems like a single copy of this data set will be at least
>> a couple of gigabytes; I think you'll have access to only 4
>> GB on a 32-bit machine (see section 8 of the R Installation
>> and Administration guide), and R will probably end up, even
>> in the best of situations, making at least a couple of copies
>> of your data. Probably you'll need a 64-bit machine, or
>> figure out algorithms that work on chunks of data.
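
The rough arithmetic behind "a couple of gigabytes" can be written out as follows. It is a back-of-the-envelope sketch that assumes a plain matrix with the dimensions from the post and ignores the extra overhead of a data frame.

```r
n_rows <- 320000
n_cols <- 1000

# R stores a numeric (double) in 8 bytes and an integer in 4 bytes.
bytes_double <- n_rows * n_cols * 8
bytes_integer <- n_rows * n_cols * 4

gib <- 2^30
cat(sprintf("numeric matrix: %.2f GiB\n", bytes_double / gib))
cat(sprintf("integer matrix: %.2f GiB\n", bytes_integer / gib))

# Cross-check the per-element cost on a small matrix
# (the fixed object header is negligible at this size).
small <- matrix(0, nrow = 1000, ncol = 1000)
bytes_per_element <- as.numeric(object.size(small)) / length(small)
cat("bytes per numeric element:", round(bytes_per_element, 3), "\n")
```

One numeric copy is about 2.4 GiB, so two copies already exceed what a 32-bit process can address.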
>>
>>> on a linux cluster with R version R 2.1.0. which operates on a 32
>>
>> This is quite old, and in general it seems like R has become
>> more sensitive to big-data issues and tracking down
>> unnecessary memory copying.
>>
>>> "cannot allocate vector size 1240 kb". I've searched through
>>
>> Use traceback() or options(error=recover) to figure out where
>> this is actually occurring.
>>
>>> SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
>>
>> This makes a data.frame, and data frames have several aspects
>> (e.g., automatic creation of row names on sub-setting) that
>> can be problematic in terms of memory use. Probably better to
>> use a matrix, for which:
>>
>> 'read.table' is not the right tool for reading large matrices,
>> especially those with many columns: it is designed to read _data
>> frames_ which may have columns of very different classes. Use
>> 'scan' instead.
>>
>> (from the help page for read.table). I'm not sure of the
>> details of the algorithms you'll invoke, but it might be a
>> false economy to try to get scan to read in 'small' versions
>> (e.g., integer, rather than numeric) of the data -- the
>> algorithms might insist on numeric data, and then make a copy
>> during coercion from your small version to numeric.
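
A minimal sketch of the scan() route follows. The tiny temporary file stands in for "file.txt", and the number of columns is assumed to be known in advance (1000 in this thread). It reads integers; what = numeric() would read doubles instead, if the later algorithms need them.

```r
# Write a tiny whitespace-separated file to stand in for "file.txt".
path <- tempfile(fileext = ".txt")
writeLines(c("1 0 2 9", "0 1 1 2", "2 9 0 0"), path)

n_cols <- 4  # assumed known in advance

# scan() returns one long vector; byrow = TRUE restores the file's row layout.
values <- scan(path, what = integer(), quiet = TRUE)
SNP <- matrix(values, ncol = n_cols, byrow = TRUE)

# Recode the missing-value marker 9 as NA, as in the original script.
SNP[SNP == 9L] <- NA

print(dim(SNP))
unlink(path)
```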
>>
>>> SNP$total.NAs = rowSums(is.na(SNP))  # calculate the number of NAs per row and add a column with the total
>>
>> This adds a column to the data.frame or matrix, probably
>> causing at least one copy of the entire data. Create a
>> separate vector instead, even though this unties the
>> coordination between columns that a data frame provides.
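
The separate-vector approach might look like the sketch below. The small matrix and the threshold max_na are hypothetical stand-ins (the thread uses a cut-off of 46). Note that is.na() still allocates a temporary logical matrix of the same dimensions, so this avoids the extra copy of the data but not all temporary memory.

```r
# Small matrix standing in for the SNP data (NA marks a missing genotype).
SNP <- matrix(c(0L, 1L, NA, 2L,
                NA, NA, 1L, 0L,
                2L, 2L, 0L, 1L), nrow = 3, byrow = TRUE)

max_na <- 1  # hypothetical threshold for this toy example

# Keep the per-row NA counts in their own vector instead of adding a column.
total_nas <- rowSums(is.na(SNP))

# Subset once; drop = FALSE keeps a matrix even if only one row remains.
SNP_kept <- SNP[total_nas <= max_na, , drop = FALSE]

print(total_nas)
print(dim(SNP_kept))
```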
>>
>>> SNP = t(as.matrix(SNP))  # transpose rows and columns
>>
>> This will also probably trigger a copy.
>>
>>> snp.na<-SNP
>>
>> R might be clever enough to figure out that this simple
>> assignment does not trigger a copy. But it probably means
>> that any subsequent modification of snp.na or SNP *will*
>> trigger a copy, so avoid the assignment if possible.
>>
>>> snp.roughfix<-na.roughfix(snp.na)
>>
>>> fSNP<-factor(snp.roughfix[, 1])  # Assigns factor to case control status
>>>
>>> snp.narf<- randomForest(snp.roughfix[,-1], fSNP,
>>> na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE,
>>> keep.forest=FALSE, do.trace=100)
>>
>> Now you're entirely in the hands of the randomForest. If
>> memory problems occur here, perhaps you'll have gained enough
>> experience to point the package maintainer to the problem and
>> suggest a possible solution.
>>
>>> set it should be able to cope with that amount. Perhaps someone has
>>> tried this before in R or is Fortran a better choice? I added my R
>>
>> If you mean a pure Fortran solution, including coding the
>> random forest algorithm, then of course you have complete
>> control over memory management. You'd still likely be limited
>> to addressing 4 GB of memory.
>>
>>
>>> I wrote a script to remove all the rows with more than 46 missing
>>> values. This works perfectly on a smaller data set, but when I try to
>>> run it on the larger data set I get the error
>>> "cannot allocate vector size 1240 kb". I've searched through previous
>>> posts and found out that it might be because I'm running it on a Linux
>>> cluster with R version 2.1.0, which operates on a 32-bit processor.
>>> But I could not find a solution for this problem. The cluster is a
>>> really fast one and should be able to cope with these large amounts of
>>> data; the system's configuration is: speed 3.4 GHz, memory 4 GB. Is
>>> there a way to change the settings or processor under R? I want to run
>>> the function Random Forest on my large data set; it should be able to
>>> cope with that amount. Perhaps someone has tried this before in R, or
>>> is Fortran a better choice? I added my R script down below.
>>>
>>> Best regards,
>>>
>>> Iris Kolder
>>>
>>> SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
>>> SNP[SNP==9]<-NA                      # change missing values from 9 to NA
>>> SNP$total.NAs = rowSums(is.na(SNP))  # calculate the number of NAs per row and add a column with the total
>>> SNP = SNP[ SNP$total.NAs < 46, ]     # create a subset with no more than 5% (46) NAs
>>> SNP$total.NAs=NULL                   # remove added column with sum of NAs
>>> SNP = t(as.matrix(SNP))              # transpose rows and columns
>>> set.seed(1)
>>
>>> snp.na<-SNP
>>> snp.roughfix<-na.roughfix(snp.na)
>>
>>> fSNP<-factor(snp.roughfix[, 1])  # Assigns factor to case control status
>>>
>>> snp.narf<- randomForest(snp.roughfix[,-1], fSNP,
>>> na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE,
>>> keep.forest=FALSE, do.trace=100)
>>>
>>> print(snp.narf)
>>>
>>> ______________________________________________
>>> R-help@stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> --
>> Martin T. Morgan
>> Bioconductor / Computational Biology
>> http://bioconductor.org
>>

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Received on Thu Jan 11 02:14:24 2007

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Wed 10 Jan 2007 - 15:30:30 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.