Re: [Rd] problem using model.frame()

From: Gavin Simpson <gavin.simpson_at_ucl.ac.uk>
Date: Thu 18 Aug 2005 - 07:26:54 GMT

On Thu, 2005-08-18 at 07:57 +0300, Jari Oksanen wrote:
> On 18 Aug 2005, at 1:49, Gavin Simpson wrote:
>
> > On Wed, 2005-08-17 at 20:24 +0200, Martin Maechler wrote:
> >>>>>>> "GS" == Gavin Simpson <gavin.simpson@ucl.ac.uk>
> >>>>>>> on Tue, 16 Aug 2005 18:44:23 +0100 writes:
> >>
> >> GS> On Tue, 2005-08-16 at 12:35 -0400, Gabor Grothendieck
> >> GS> wrote:
> >>>> On 8/16/05, Gavin Simpson <gavin.simpson@ucl.ac.uk>
> >>>> wrote: > On Tue, 2005-08-16 at 11:25 -0400, Gabor
> >>>> Grothendieck wrote: > > It can handle data frames like
> >>>> this:
> >>>>>>
> >>>>>> model.frame(y1) > > or > > model.frame(~., y1)
> >>>>>
> >>>>> Thanks Gabor,
> >>>>>
> >>>>> Yes, I know that works, but I want the function
> >>>> coca.formula to accept a > formula like this y2 ~ y1,
> >>>> with both y1 and y2 being data frames. It is
> >>>>
> >>>> The expressions I gave work generally (i.e. lm, glm,
> >>>> ...), not just in model.matrix, so would it be ok if the
> >>>> user just does this?
> >>>>
> >>>> yourfunction(y2 ~., y1)
> >>
> >> GS> Thanks again Gabor for your comments,
> >>
> >> GS> I'd prefer the y1 ~ y2 as data frames - as this is the
> >> GS> most natural way of doing things. I'd like to have (y2
> >> GS> ~., y1) as well, and (y2 ~ spp1 + spp2 + spp3, y1) also
> >> GS> work - silently without any trouble.
> >>
> >> I'm sorry, Gavin, I tend to disagree quite a bit.
> >>
> >> The formula notation has quite a history in the S language, and
> >> AFAIK never was the idea to use data.frames as formula
> >> components, but rather as "environments" in which formula
> >> components are looked up --- exactly as Gabor has explained.
> >
> > Hi Martin, thanks for your comments,
> >
> > But then one could have a matrix of variables on the rhs of the formula
> > and it would work - whether this is a documented feature or un-intended
> > side-effect of matrices being stored as vectors with dims, I don't
> > know.
> >
> > And whilst the formula may have a long history, a number of packages
> > have extended the interface to implement a specific feature, which
> > don't
> > work with standard functions like lm, glm and friends. I don't see how
> > what I wanted to achieve is greatly different to that or using a
> > matrix.
> >
> >> To break with such a deeply rooted principle,
> >> you should have very very good reasons, because you're breaking
> >> the concepts on which all other uses of formulae are based.
> >> And this would potentially lead to much confusion of your users,
> >> at least in the way they should learn to think about what
> >> formulae mean.
> >
> > In the end I managed to treat y1 ~ y2 (both data frames) as a special
> > case, which allows the existing formula notation to work as well, so I
> > can use y1 ~ y2, y1 ~ ., data = y2, or y1 ~ var + var2, data = y2. This
> > is what I wanted all along, to extend my interface (not do anything to
> > R's formulae), but to also work in the traditional sense.
> >
> > The model I am writing code for really is modelling the relationship
> > between two matrices of data. In one version of the method, there is
> > real equivalence between both sides of the formula so it would seem odd
> > to treat the two sides of the formula differently. At least to me ;-)
>
> It seems that I may be responsible for one of these extensions (lhs as
> a data.frame in cca and rda in vegan package). There the response (lhs)
> is multivariate or a multispecies community, and you must take that as
> a whole without manipulation (and if you tried using VGAM you see there
> really is painful to define lhs with, say, 127 elements).

Hi Jari,

Thanks for reminding me about this - I'd forgotten about not normally being able to have a data.frame on the lhs of the formula either - I'm surprised no-one pulled me up on that one before, either ;-)

I guess what I'm proposing is really pushing the formula representation too far for some people. I'm coming round to the y1 ~ ., data = y2 way of doing things - still prefer y1 ~ y2 though ;-)

Also, both y1 and y2 are community matrices (i.e. both have many, many species, aka variables for the non-community ecologists reading this). I'm not sure that it makes sense to treat the two sides differently. In the predictive co-correspondence mode (the default), multivariate pls is used to regress one matrix on another, with the number of pls components being chosen by cross-validation or a permutation test.

> However, in
> general you shouldn't use models where you use all the 'explanatory'
> variables (rhs) that yo happen to have by accident. So much bad science
> has been created with that approach even in your field, Gav.

Well, I agree with you there...

> The whole
> idea of formula is the ability to choose from candidate variables. That
> is: to build a model. Therefore you have one-sided formulae in prcomp()
> and princomp(): you can say prcomp(~ x1 + log(x2) +x4, data) or
> prcomp(~ . - x3, data). I think you should try to keep it so. Do
> instead like Gabor suggested: you could have a function coca.default or
> coca.matrix with interface:
>
> coca.matrix(matx, maty, matz) -- or you can name this as coca.default.
>
> and coca.formula which essentially parses your formula and returns a
> list of matrices you need:
>
> coca.formula <- function(formula, data)
> {
> matricesout <- parsemyformula(formula, data)
> coca(matricesout$matx, matricesout$maty, matricesoutz)
> }
> Then you need the generic: coca <- function(...) UseMethod("coca") and
> it's done (but fails in R CMD check unless you add "..." in all
> specific functions...). The real work is always done in coca.matrix (or
> coca.default), and the others just chew your data into suitable form
> for your workhorse.
>
> If then somebody thinks that they need all possible variables as
> 'explanatory' variables (or perhaps constraints in your case), they
> just call the function as
>
> coca(matx, maty, matz)

My functions are already generic with coca.default and coca.formula. The issue with matrices/data.frames was only a problem in the formula interface.

> And if you have coca.data.frame they don't need 'quacking' with extra
> steps:
>
> coca.data.frame <- function(dfx, dfy dfz) coca(as.matrix(dfx),
> as.matrix(dfy), as.matrix(dfz)).
>
> This you call as coca(dfx, dfy, dfz) and there you go.
>
> The essential feature in formula is the ability to define the model.
> Don't give it away.

I think the point I'm trying to make is that I don't think what I'm trying to do is any different than doing lm(y ~ x, data), (where y, x are vectors) - it is just that my x and y happen to be multivariate. I think it is easier to think of each community as a single entity in this regard - the relationship *is* between community 1 and community 2, not parts of community 2, or some parsimonious model of community 2 - but that might just be semantics - unlike your cca/rda functions which really are a (weighted) multivariate multiple regression. Happy to be convinced otherwise though.

Also, it is worth re-iterating that I haven't broken the traditional way of working with formulae with my function - you can still do y1 ~ ., data = y2, or y1 ~ spp1 + spp2 + spp3, data = y2, for maximum flexibility. All I wanted (and worked out how) to do was treat the rhs in a special way if it were a data frame, just like Jari treats a data.frame on the lhs of formulae in package vegan as a special case.

Thanks everyone for your ideas and comments - lots of food for thought. I wavering between both camps on this - still time to be convinced and change it before I finish the package.

All the best,

G

>
> cheers, jazza
> --
> Jari Oksanen, Oulu, Finland
>

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [T] +44 (0)20 7679 5522
ENSIS Research Fellow             [F] +44 (0)20 7679 7565
ENSIS Ltd. & ECRC                 [E] gavin.simpsonATNOSPAMucl.ac.uk
UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
London.  WC1H 0AP.
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu Aug 18 17:36:59 2005

This archive was generated by hypermail 2.1.8 : Mon 24 Oct 2005 - 22:27:39 GMT