Re: [R] two cols in a data frame are the same factor

From: Andres Legarra <legarra_at_gmail.com>
Date: Fri, 21 Mar 2008 09:18:17 +0100

Looks like it works, although lm() automatically drops the first level. I'll manage to work around that. The second option looks good too.
Thanks

Andres

On Thu, Mar 20, 2008 at 6:21 PM, Greg Snow <Greg.Snow_at_imail.org> wrote:
> Here is one approach:
>
> First run a regular lm command without the restrictions, but specify
> y=TRUE, x=TRUE.
>
> This fits the unconstrained regression, and the returned object will
> also contain the y variable (after subsetting, NA removal, etc.) and
> the x matrix that was used. In this x matrix your 2 factors have been
> converted into indicator/dummy variables (alongside any other
> covariates mentioned). Take the x and y components of that return
> value and put them into a new data frame.
>
> Now do a regression using the new data frame as your data and include
> I(f1.1+f2.1) terms just like you would with numeric predictors to force
> the coefficients to be equal.
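
A minimal sketch of the three steps above, using the artificial data from the original question. The object names fit0, d2 and fit1 are illustrative only, and the real column names (facB, fac1B, ...) stand in for the placeholders f1.1 and f2.1. It assumes the intercept column of the x matrix should be dropped, because lm() adds its own intercept in the second fit.

```r
# Artificial data from the original question (exact values depend on the R version's RNG).
set.seed(1234)
L3 <- LETTERS[1:3]
d <- data.frame(y    = rnorm(10),
                fac  = factor(sample(L3, 10, replace = TRUE), levels = L3),
                fac1 = factor(sample(L3, 10, replace = TRUE), levels = L3))

# Step 1: unconstrained fit, keeping the response and the model matrix.
fit0 <- lm(y ~ fac + fac1, data = d, x = TRUE, y = TRUE)

# Step 2: put y and the dummy columns into a new data frame.
# The "(Intercept)" column is dropped; the second lm() call adds its own.
X  <- fit0$x[, colnames(fit0$x) != "(Intercept)", drop = FALSE]
d2 <- data.frame(y = fit0$y, X)   # columns: y, facB, facC, fac1B, fac1C

# Step 3: summing matching dummies forces one coefficient per level.
fit1 <- lm(y ~ I(facB + fac1B) + I(facC + fac1C), data = d2)
coef(fit1)
```

The fitted object is an ordinary lm, so summary(fit1), anova() and predict() should work as usual.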
>
> You could also accomplish the same idea in the original regression using
> a formula like:
>
> Y ~ I( (fac1=='A') + (fac2=='A') ) + I( (fac1=='B') + (fac2=='B') ) + ...
>
> with one such term for each level of fac1 and fac2 (other than the
> baseline level, or including it if you leave out the intercept). The
> inner parentheses are needed because + binds more tightly than == in R.
> Both approaches do essentially the same thing: they create your own set
> of indicator variables rather than depending on R to do it.
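
As a sketch of the formula approach, here it is applied to the three-record haplotype data quoted further down in this thread. It assumes no intercept, so both levels get a term. Each comparison is wrapped in its own parentheses, since in R `+` is evaluated before `==`.

```r
# Three-record example from the thread.
a <- data.frame(yield  = c(100, 151, 212),
                haplo1 = c("A", "B", "A"),
                haplo2 = c("B", "A", "A"))

# Each I(...) term counts the copies (0, 1 or 2) of one haplotype per record,
# so haplo1 and haplo2 share a single coefficient per level.
fit <- lm(yield ~ -1 + I((haplo1 == "A") + (haplo2 == "A")) +
                       I((haplo1 == "B") + (haplo2 == "B")),
          data = a)
coef(fit)
```

With many levels, writing one term per level by hand gets tedious, and building the count columns programmatically may be more practical.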
>
> Hope this helps,
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow_at_imail.org
> (801) 408-8111
>
> > -----Original Message-----
> > From: r-help-bounces_at_r-project.org
> > [mailto:r-help-bounces_at_r-project.org] On Behalf Of Andres Legarra
> > Sent: Thursday, March 20, 2008 2:25 AM
> > To: Michael Dewey
> > Cc: R-help_at_r-project.org
> > Subject: Re: [R] two cols in a data frame are the same factor
> >
> > Hi,
> > I am afraid you misunderstood it. I do not have repeated
> > records, but for every record I have two, possibly different,
> > simultaneously present instantiations of an explanatory variable.
> >
> > My data is as follows :
> >
> > yield haplo1 haplo2
> > 100 A B
> > 151 B A
> > 212 A A
> >
> > So I have one effect (haplo), but each record carries two copies of
> > it, and both affect "yield".
> > If I use lm() I get:
> >
> > > a=data.frame(yield=c(100,151,212),haplo1=c("A","B","A"),haplo2=c("B","A","A"))
> > > lm(yield ~ -1 + haplo1 + haplo2, data = a)
> >
> > Call:
> > lm(formula = yield ~ -1 + haplo1 + haplo2, data = a)
> >
> > Coefficients:
> > haplo1A  haplo1B  haplo2B
> >     212      151     -112
> >
> >
> > But I get different coefficients for the two "A"s (in fact one
> > was set to 0) and the two "B"s. That is, the model has four
> > unknowns but in my example I have just two!
> >
> > A least-squares solution is simple to do by hand:
> >
> > > X=matrix(c(1,1,1,1,2,0),ncol=2,byrow=TRUE) #the incidence matrix: copies of A and B per record
> > > X
> >      [,1] [,2]
> > [1,]    1    1
> > [2,]    1    1
> > [3,]    2    0
> > > solve(crossprod(X,X),crossprod(X,a$yield))
> >       [,1]
> > [1,] 106.0
> > [2,]  19.5
> >
> > where [1,] is the solution for A and [2,] is the solution for B
> >
> > This is not difficult to do by hand, but it is for a simple
> > case and I miss all the machinery in lm()
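
One way to keep the lm() machinery might be to build the incidence matrix programmatically and pass it to lm() as a matrix term. The sketch below assumes the incidence matrix is meant to hold the number of copies of each haplotype in a record. The names levs and X are illustrative.

```r
a <- data.frame(yield  = c(100, 151, 212),
                haplo1 = c("A", "B", "A"),
                haplo2 = c("B", "A", "A"))

# All haplotype classes seen in either column.
levs <- sort(unique(c(as.character(a$haplo1), as.character(a$haplo2))))

# One column per class: number of copies (0, 1 or 2) carried by each record.
X <- sapply(levs, function(l) (a$haplo1 == l) + (a$haplo2 == l))

# A matrix on the right-hand side gives one coefficient per column;
# -1 removes the intercept, which would be collinear with the columns of X
# because every row of X sums to 2.
fit <- lm(yield ~ -1 + X, data = a)
coef(fit)
```

Because fit is a regular lm object, summary(fit) then gives standard errors and tests without any hand-written algebra.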
> >
> > Thank you
> > Andres
> >
> > On Wed, Mar 19, 2008 at 6:57 PM, Michael Dewey
> > <info_at_aghmed.fsnet.co.uk> wrote:
> > > At 09:11 18/03/2008, Andres Legarra wrote:
> > > >Dear all,
> > > >I have a data set (QTL detection) where I have two cols of factors
> > > >in the data frame that correspond logically (in my model) to the
> > > >same factor. In fact these are haplotype classes.
> > > >Another real-life example would be family gas consumption as a
> > > >function of car company (e.g. Ford, GM, and Honda) (assuming 2 cars
> > > >by family).
> > >
> > > Unless I completely misunderstand this it looks like you have the
> > > dataset in wide format when you really wanted it in long format (to
> > > use the terminology of ?reshape). Then you would fit a model allowing
> > > for the clustering by family.
> > >
> > > >An artificial example follows:
> > > >set.seed(1234)
> > > >L3 <- LETTERS[1:3]
> > > >(d <- data.frame( y=rnorm(10), fac=sample(L3, 10, replace=TRUE),
> > > >                  fac1=sample(L3, 10, replace=TRUE)))
> > > >
> > > > lm(y ~ fac+fac1,data=d)
> > > >
> > > >and I get:
> > > >
> > > >Coefficients:
> > > >(Intercept) facB facC fac1B fac1C
> > > > 0.3612 -0.9359 -0.2004 -2.1376 -0.5438
> > > >
> > > >However, to respect my model, I need to constrain effects in fac and
> > > >fac1 to be the same, i.e. facB=fac1B and facC=fac1C. There are
> > > >logically just 4 unknowns (average,A,B,C).
> > > >With continuous covariates one might do y ~ I(cov1+cov2), but this
> > > >is not the case.
> > > >
> > > >Is there any trick to do that?
> > > >Thanks,
> > > >
> > > >Andres Legarra
> > > >INRA-SAGA
> > > >Toulouse, France
> > >
> > > Michael Dewey
> > > http://www.aghmed.fsnet.co.uk
> > >
> > >
> >
> > ______________________________________________
> > R-help_at_r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>



Received on Fri 21 Mar 2008 - 08:21:51 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 21 Mar 2008 - 09:30:24 GMT.
