Re: [R] factor : how does it work ?

From: Duncan Murdoch
Date: Thu 06 Oct 2005 - 23:36:28 EST

On 10/6/2005 9:14 AM, Florence Combes wrote:
> Dear all,
> I have tried for a long time to understand exactly what the factor type is and especially
> how it works, but it seems too difficult for me....
> I read paragraphs about it, and I understand quite well what it is (I think)
> but I still can't figure out how to deal with it.
> Especially these 2 mysteries (for me) :
> 1st when I make a dataframe (with the or the data.frame()
> commands) from vectors, it seems that some "columns" of the dataframe (which
> were vectors) are factors and some not, but I didn't find an explanation
> for which become factors and which don't.
> (I know I can use I() to avoid the factor transformation but I think it is
> not an optimal solution to avoid the factor type just because I don't know
> how to deal with it)

This is described in the ?data.frame man page: "Character variables passed to 'data.frame' are converted to factor columns unless protected by 'I'."
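As a minimal sketch of that rule: the snippet below builds the same data frame three ways. The argument stringsAsFactors is set explicitly here, because its default is an assumption that depends on your R version (from R 4.0.0 onward character columns are no longer converted unless you ask for it). The object names are made up for illustration.

```r
# Sketch: how data.frame() treats a character vector.
# stringsAsFactors is given explicitly because its default differs between R versions.
letters_vec <- c("a", "b", "a")

df_factor <- data.frame(x = 1:3, y = letters_vec, stringsAsFactors = TRUE)
df_char   <- data.frame(x = 1:3, y = letters_vec, stringsAsFactors = FALSE)
df_asis   <- data.frame(x = 1:3, y = I(letters_vec), stringsAsFactors = TRUE)

is.factor(df_factor$y)   # TRUE: the character vector became a factor
is.character(df_char$y)  # TRUE: conversion switched off
is.factor(df_asis$y)     # FALSE: I() protects the column
```

Numeric and logical vectors are never converted, which is why only some of your columns end up as factors.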

> 2d I can't manage to deal with factors, so when I have some, I transform
> them in vectors (with levels()), but I think I miss the power and utility of
> the factor type ?

levels() is not the conversion you want. That lists all the levels, but it doesn't tell you how they correspond to individual observations. For example,

 > df <- data.frame(x=1:3, y=c('a','b','a'))
 > df
   x y
1 1 a
2 2 b
3 3 a
 > levels(df$y)
[1] "a" "b"

If you need to convert back to character values, use as.character():

 > as.character(df$y)
[1] "a" "b" "a"
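To make the difference concrete without relying on how data.frame() built the column, here is a small sketch using factor() directly. The variable name f is just for illustration.

```r
# Sketch: levels() gives the distinct values, as.character() gives one value per observation.
f <- factor(c("a", "b", "a"))

lev   <- levels(f)        # "a" "b"
chars <- as.character(f)  # "a" "b" "a"

# A factor stores integer codes that index into the levels,
# so looking the codes up in the levels reproduces as.character(f).
rebuilt <- levels(f)[as.integer(f)]
identical(rebuilt, chars)  # TRUE
```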

For many purposes, you can ignore the fact that your data is stored as a factor instead of a character vector. There are a few differences:

  1. You can't compare the values of a factor with < or > unless you declared it to be ordered:

 > df$y[1] > df$y[2]
[1] NA
Warning message:
 > not meaningful for factors in: Ops.factor(df$y[1], df$y[2])


 > df$y <- ordered(df$y)
 > df$y[1] > df$y[2]
[1] FALSE

However, you need to watch out here: the comparison is done by the order of the levels, not an alphabetic comparison of their names:

 > levels(df$y) <- c("before", "after")
 > df
   x y
1 1 before
2 2 after
3 3 before
 > df$y[1] > df$y[2]
[1] FALSE

  2. as.integer() works differently on factors: it gets the position in the levels vector. For example,

 > as.integer(df$y)
[1] 1 2 1
 > as.integer(as.character(df$y))
[1] NA NA NA
Warning message:
NAs introduced by coercion
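This matters most when the labels look like numbers. A short sketch, with made-up values, of the difference between the codes and the numbers the labels stand for:

```r
# Sketch: a factor whose labels are numeric strings.
f <- factor(c("10", "20", "10", "5"))

levels(f)                               # "10" "20" "5" (sorted as strings)
codes  <- as.integer(f)                 # 1 2 1 3: positions in levels(f)
values <- as.numeric(as.character(f))   # 10 20 10 5: the numbers themselves
```

If you want the numbers, go through as.character() first; as.integer() on its own silently returns the codes.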

There are other differences, but these are the two main ones that are likely to cause you trouble.
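On the first point, one way to avoid surprises is to state the level order yourself when you create an ordered factor, rather than relabelling afterwards. A minimal sketch; the size labels are invented for illustration:

```r
# Sketch: an ordered factor with an explicitly chosen level order.
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

by_level <- sizes[1] < sizes[2]  # TRUE: "small" precedes "large" in the levels
by_name  <- "small" < "large"    # FALSE: plain strings compare alphabetically
```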

Duncan Murdoch

Received on Thu Oct 06 23:38:37 2005
