Re: [Rd] c.factor

From: Matthew Dowle <mdowle_at_concordiafunds.com>
Date: Wed 15 Nov 2006 - 12:51:17 GMT

Prof Ripley,

> Well, R has managed without a factor method for c() for most of its decade
> of existence (not that it originally had factors as we know them).

R has managed without other things for most of its decade too. For example, row names in data frames have only very recently been made efficient: R managed without that for a decade, yet the improvement was still worth making. As we become aware of what we believe is missing in R, I believe the correct approach, the approach you advocate, is to contribute back to the list. This is what I did. I also contributed a potential solution in the form of working source code.

I stand by my statement that the current result of c(x,y) when x and y are factors is not useful. It is a specific statement about a specific operation, not a general criticism of R. I agree with you that factors are best viewed as an enumeration type, but I would argue further that c() of two enumerated types should return an enumerated type, retaining the powerful feature of enumerated types in R. Currently, however, R ignores the fact that x and y are enumerated. It silently drops the levels information and returns an integer vector whose integers are, well, not useful. Or, if you prefer, not as useful as the proposal I posted.
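To make the point concrete, here is a minimal sketch using the x and y from my original post. It assumes nothing beyond base R. as.integer() is used to expose the underlying codes explicitly, so the illustration does not depend on how c() itself treats factors.

```r
x <- factor(c("a", "b", "c", "d", "e"))
y <- factor(c("d", "e", "f", "g", "h"))

# Concatenating the bare integer codes discards the levels entirely.
codes <- c(as.integer(x), as.integer(y))

# The first element of x and the first element of y share the code 1,
# yet they carry different labels ("a" versus "d"), so the combined
# codes are ambiguous once the levels are gone.
same_code  <- as.integer(x)[1] == as.integer(y)[1]
same_label <- as.character(x)[1] == as.character(y)[1]
```

Here same_code is TRUE while same_label is FALSE, which is exactly why the bare integers cannot be interpreted afterwards.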

I have a solution which works for me, and I have contributed it. One other person has shown some interest and taken it further to work with multiple arguments, which looks like a nice improvement.

My only comment, if c.factor does go further, is a request to avoid the use of as.character in the implementation. One key advantage of the factor type is precisely that it is enumerated, and is therefore efficient for categorical data sets. Intermediate coercion to character is inefficient in this case, which is why I avoided it in the solution I posted.
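As a rough sketch of what I mean, a multiple-argument version could remap the integer codes the same way my two-argument version did, touching only the level labels and never converting the data itself to character. The name combine_factors is hypothetical, chosen so as not to mask any c method. This is an illustration under those assumptions, not a proposed final implementation.

```r
# Hypothetical helper: combine any number of factors by remapping their
# integer codes onto the union of the level sets.
combine_factors <- function(...) {
  args <- list(...)
  if (!all(vapply(args, is.factor, logical(1))))
    stop("All arguments must be factors")

  # Union of all level sets, in order of first appearance.
  newlevels <- unique(unlist(lapply(args, levels)))

  # For each factor, look up where its levels sit in newlevels, then
  # index that lookup table by the factor's own integer codes.
  codes <- unlist(lapply(args, function(f) {
    match(levels(f), newlevels)[unclass(f)]
  }))

  structure(codes, levels = newlevels, class = "factor")
}

x <- factor(c("a", "b", "c", "d", "e"))
y <- factor(c("d", "e", "f", "g", "h"))
z <- combine_factors(x, y)
```

Note that match() runs over the level sets only, which are typically much shorter than the data, and the data are handled purely as integers.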

Regards,
Matthew

> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk]
> Sent: 14 November 2006 18:23
> To: Marc Schwartz
> Cc: Matthew Dowle; r-devel@r-project.org
> Subject: Re: [Rd] c.factor
>
>
> Well, R has managed without a factor method for c() for most of its decade
> of existence (not that it originally had factors as we know them).
>
> I would argue that factors are best viewed as an enumeration type, and
> anything which silently changes their level set is a bad idea. I can see
> a case for a c() method for factors that combines factors with the same
> level sets, but I can also see this is best done by users who know the
> level sets are the same (c.factor would have to expend a considerable
> effort to check).
>
> You also need to consider the dispatch rules. c.factor will be called
> whenever the first argument is a factor, whatever the others are. S4 (I
> think, definitely S4-based versions of S-PLUS) has an alternative concat()
> that works differently (recursively) and seems a more natural model.
>
>
> On Tue, 14 Nov 2006, Marc Schwartz wrote:
>
> > On Tue, 2006-11-14 at 11:51 -0600, Marc Schwartz wrote:
> >> On Tue, 2006-11-14 at 16:36 +0000, Matthew Dowle wrote:
> >>> Hi,
> >>>
> >>> Given factors x and y, c(x,y) does not seem to return a useful
> >>> result:
> >>> > x
> >>> [1] a b c d e
> >>> Levels: a b c d e
> >>> > y
> >>> [1] d e f g h
> >>> Levels: d e f g h
> >>> > c(x,y)
> >>> [1] 1 2 3 4 5 1 2 3 4 5
> >>>
> >>> Is there a case for a new method c.factor as follows? Does
> >>> something similar exist already? Is there a better way to write the
> >>> function?
> >>>
> >>> > c.factor = function(x,y)
> >>> {
> >>>     newlevels = union(levels(x),levels(y))
> >>>     m = match(levels(y), newlevels)
> >>>     ans = c(unclass(x),m[unclass(y)])
> >>>     levels(ans) = newlevels
> >>>     class(ans) = "factor"
> >>>     ans
> >>> }
> >>> > c(x,y)
> >>> [1] a b c d e d e f g h
> >>> Levels: a b c d e f g h
> >>> > as.integer(c(x,y))
> >>> [1] 1 2 3 4 5 4 5 6 7 8
> >>>
> >>> Regards,
> >>> Matthew
> >>
> >> I'll defer to others as to whether or not there is a basis for
> >> c.factor, however:
> >>
> >> c.factor <- function(...)
> >> {
> >>   args <- list(...)
> >>
> >>   # this could be optional
> >>   if (!all(sapply(args, is.factor)))
> >>     stop("All arguments must be factors")
> >>
> >>   factor(unlist(lapply(args, function(x) as.character(x))))
> >> }
> >
> >
> > That last line can even be cleaned up, as I was doing something else
> > initially:
> >
> > c.factor <- function(...)
> > {
> >   args <- list(...)
> >
> >   if (!all(sapply(args, is.factor)))
> >     stop("All arguments must be factors")
> >
> >   factor(unlist(lapply(args, as.character)))
> > }
> >
> >
> > Marc
> >
> > ______________________________________________
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
> --
> Brian D. Ripley, ripley@stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595
>
>



Received on Thu Nov 16 00:23:10 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 16 Nov 2006 - 04:30:45 GMT.
