Re: [R] Convert factor to numeric vector of labels

From: Bert Gunter <gunter.berton_at_gene.com>
Date: Tue, 14 Aug 2007 13:26:18 -0700


Matt:

I believe you have confused issues.

Setting stringsAsFactors = FALSE would dramatically **increase** the amount of memory used for storing character vectors, which is what factors are for. So your proposed solution does exactly the opposite of what you want.
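
As a rough illustration of the storage difference, the sketch below stores the same made-up data once as a character vector and once as a factor. The data and variable names are hypothetical, and the size of the gap depends on the R version and platform, so no figures are quoted here.

```r
# Hypothetical data: 100,000 values drawn from three distinct strings.
set.seed(1)
x_chr <- sample(c("control", "low dose", "high dose"), 1e5, replace = TRUE)
x_fac <- factor(x_chr)

# Approximate storage used by each representation.
print(object.size(x_chr))
print(object.size(x_fac))

# A factor holds one integer code per element plus a single copy of each level.
str(unclass(x_fac))
```

Running this on your own installation shows how much the default actually saves for data of this shape.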

The issue you are worried about is when numeric fields are somehow interpreted as non-numeric. This can happen for a variety of reasons (stray characters in numeric fields, quotes around numbers, ...). The solution is not to set a global default that does the opposite of what you want in its intended use, but to read the documentation and either set the appropriate arguments (perhaps colClasses of read.table) or fix the original data before R reads it (e.g. remove quotes and stray characters). Failing that, the "one-off" solutions given are the correct way to handle what is a data problem, not an R problem.
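
A minimal sketch of both routes, using invented sample data in which one numeric field carries a stray character. The column names, the sample values, and the cleaning pattern are assumptions made for illustration only; real data may need a different pattern (this one would also strip the "e" of scientific notation).

```r
# Hypothetical sample data: the second column is meant to be numeric,
# but one field carries a stray character ("1.34x").
raw <- "id value
a 0.71
b 1.34x
c 2.61"

# Default reading: the stray character makes the whole column non-numeric
# (a factor or a character vector, depending on stringsAsFactors).
d1 <- read.table(text = raw, header = TRUE)
print(is.numeric(d1$value))

# Declaring the type with colClasses turns the problem into an error at
# read time instead of a silent conversion; the message is captured here.
res <- tryCatch(
  read.table(text = raw, header = TRUE,
             colClasses = c("character", "numeric")),
  error = function(e) conditionMessage(e)
)
print(res)

# Fixing the data: read everything as character, strip anything that is
# not a digit, sign, or decimal point, then convert.
d2 <- read.table(text = raw, header = TRUE, colClasses = "character")
d2$value <- as.numeric(gsub("[^0-9.+-]", "", d2$value))
str(d2)
```

Cleaning the source file itself, before R reads it, achieves the same thing and keeps the reading code simple.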

However, I should add that there are arguments for making stringsAsFactors = FALSE the default; search the archives for discussions of why. The memory penalty will have to be paid, of course.

Bert Gunter
Genentech Nonclinical Statistics

-----Original Message-----
From: r-help-bounces_at_stat.math.ethz.ch
[mailto:r-help-bounces_at_stat.math.ethz.ch] On Behalf Of Matthew Keller
Sent: Tuesday, August 14, 2007 12:48 PM
To: John Kane
Cc: Falk Lieder; r-help_at_stat.math.ethz.ch
Subject: Re: [R] Convert factor to numeric vector of labels

Hi all,

If we, the R community, are endeavoring to make R user friendly (gasp!), I think that one of the first places to start would be in setting stringsAsFactors = FALSE. Several times I've run into instances of folks decrying R's "ridiculous usage of memory" in reading data, only to find out that these folks were unknowingly importing certain columns as factors. The fix is easy once you know it, but it isn't obvious to new users, and I'd bet that it turns some % of people off of the program. Factors are not used often enough to justify this default behavior in my opinion. When factors are used, the user knows to treat the variable as a factor, and so it can be done on a case-by-case (or should I say variable-by-variable?) basis.

Is this a default that should be changed?

Matt

On 8/13/07, John Kane <jrkrideau_at_yahoo.ca> wrote:
> This is one of R's rather _endearing_ little
> idiosyncrasies. I ran into it a while ago.
> http://finzi.psych.upenn.edu/R/Rhelp02a/archive/98090.html
>
>
> For some reason, possibly historical, the option
> "stringsAsFactors" is set to TRUE.
>
> As Prof Ripley says, FAQ 7.10 will tell you:
> as.numeric(as.character(f)) # for a one-off conversion
>
> From Gabor Grothendieck: a one-off solution for a
> complete data.frame
>
> DF <- data.frame(let = letters[1:3], num = 1:3,
> stringsAsFactors = FALSE)
>
> str(DF) # to see what has happened.
>
> You can reset the option globally, see below. However
> you might want to read Gabor Grothendieck's comment
> about this in the thread referenced above since it
> could cause problems if you transfer files a lot.
>
> Personally I went with the global option since I don't
> tend to transfer programs to other people and I was
> getting tired of tracking down errors in my programs
> caused by numeric and character variables suddenly
> deciding to become factors.
>
> From Steven Tucker:
>
> You can also set this option globally with
> options(stringsAsFactors = FALSE) # in
> \library\base\R\Rprofile
>
> --- Falk Lieder <falk.lieder_at_googlemail.com> wrote:
>
> > Hi,
> >
> > I have imported a data file to R. Unfortunately R
> > has interpreted some
> > numeric variables as factors. Therefore I want to
> > reconvert these to numeric
> > vectors whose values are the factor levels' labels.
> > I tried
> > as.numeric(<factor>),
> > but it returns a vector of factor levels (i.e.
> > 1,2,3,...) instead of labels
> > (i.e. 0.71, 1.34, 2.61, ...).
> > What can I do instead?
> >
> > Best wishes, Falk
>
> ______________________________________________
> R-help_at_stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
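
For reference, the one-off conversion discussed in the quoted messages can be sketched as follows. The factor `f` and the result names are made up for illustration; the last line is a variant that converts each level only once, which may be quicker on long vectors.

```r
# Hypothetical factor whose labels are numbers stored as text.
f <- factor(c("0.71", "2.61", "1.34", "0.71"))

# as.numeric() on a factor returns the internal integer codes.
codes <- as.numeric(f)
print(codes)   # 1 3 2 1

# Going through character first recovers the labels as numbers.
vals <- as.numeric(as.character(f))
print(vals)    # 0.71 2.61 1.34 0.71

# Variant: convert the levels once, then index by the integer codes.
vals2 <- as.numeric(levels(f))[as.integer(f)]
print(vals2)   # 0.71 2.61 1.34 0.71
```

Any label that is not a valid number becomes NA with a warning, which is a convenient way to locate the stray values in the data.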

-- 
Matthew C Keller
Postdoctoral Fellow
Virginia Institute for Psychiatric and Behavioral Genetics


Received on Tue 14 Aug 2007 - 20:44:17 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 14 Aug 2007 - 21:34:09 GMT.