From: Greg Snow <Greg.Snow_at_imail.org>

Date: Thu, 03 Mar 2011 14:16:47 -0700


What you might need to do is build your formula as a character string (loop through pairs of variables and use paste or sprintf), then convert that string to a formula with the as.formula function.
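A minimal sketch of that idea, under stated assumptions: the data frame `dat` below is a small simulated stand-in for adj0708 (its column names are borrowed from the thread, its values are made up), and the step that skips pairs whose product is zero in every row is an illustrative addition, not something confirmed to solve the memory error.

```r
# Hypothetical stand-in for adj0708: response, weight, and a few sparse predictors.
set.seed(1)
n <- 200
sparse_col <- function() sample(c(-1, 0, 1), n, replace = TRUE, prob = c(0.1, 0.8, 0.1))
dat <- data.frame(
  MARGIN = rnorm(n),
  Poss   = sample(1:30, n, replace = TRUE),
  P235   = sparse_col(),
  P247   = sparse_col(),
  P703   = sparse_col(),
  P218   = sparse_col()
)

# Every column except the response and the weight is a predictor.
predictors <- setdiff(names(dat), c("MARGIN", "Poss"))

# All unordered pairs of predictor names, one pair per column.
pairs <- combn(predictors, 2)

# Assumption: a pair whose product is zero in every row carries no information,
# so it is left out of the formula.
keep <- vapply(
  seq_len(ncol(pairs)),
  function(j) any(dat[[pairs[1, j]]] * dat[[pairs[2, j]]] != 0),
  logical(1)
)
int_terms <- paste(pairs[1, keep], pairs[2, keep], sep = ":")
stopifnot(length(int_terms) > 0)

# Build the formula as a string, then convert it.
fml <- as.formula(paste("MARGIN ~", paste(int_terms, collapse = " + ")))
fit <- lm(fml, data = dat, weights = Poss)
```

With 339 predictors the same loop runs over roughly 57000 pairs, so the filtering step is what keeps the formula (and the model matrix) from growing to the full set of interactions.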

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow_at_imail.org
801.408.8111

Received on Thu 03 Mar 2011 - 21:19:47 GMT

> -----Original Message-----
> From: Matthew Douglas [mailto:matt.douglas01_at_gmail.com]
> Sent: Thursday, March 03, 2011 2:09 PM
> To: Greg Snow
> Cc: r-help_at_r-project.org
> Subject: Re: [R] Regression with many independent variables
>
> Thanks greg,
>
> that formula was exactly what I was looking for. Except now when I
> run it on my data I get the following error:
>
> "Error in model.matrix.default(mt, mf, contrasts) : cannot allocate
> vector of length 2043479998"
>
> I know there are probably many 2-way interactions that are zero so I
> thought I could save space by removing these. Is there some way that
> can just delete all the two way interactions that are zero and keep
> the columns that have non-zero entries? I think that will
> significantly cut down the memory needed. Or is there just another way
> to get around this?
>
> thanks,
> Matt
>
> On Tue, Mar 1, 2011 at 3:56 PM, Greg Snow <Greg.Snow_at_imail.org> wrote:
> > You can use ^2 to get all 2 way interactions and ^3 to get all 3 way
> > interactions, e.g.:
> >
> > lm(Sepal.Width ~ (. - Sepal.Length)^2, data=iris)
> >
> > The lm.fit function is what actually does the fitting, so you could
> > go directly there, but then you lose the benefits of using . and ^.
> > The Matrix package has ways of dealing with sparse matricies, but I
> > don't know if that would help here or not.
> >
> > You could also just create x'x and x'y matricies directly since the
> > variables are 0/1 then use solve. A lot depends on what you are doing
> > and what questions you are trying to answer.
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow_at_imail.org
> > 801.408.8111
> >
> >
> >> -----Original Message-----
> >> From: Matthew Douglas [mailto:matt.douglas01_at_gmail.com]
> >> Sent: Tuesday, March 01, 2011 1:09 PM
> >> To: Greg Snow
> >> Cc: r-help_at_r-project.org
> >> Subject: Re: [R] Regression with many independent variables
> >>
> >> Hi Greg,
> >>
> >> Thanks for the help, it works perfectly. To answer your question,
> >> there are 339 independent variables but only 10 will be used at one
> >> time . So at any given line of the data set there will be 10 non zero
> >> entries for the independent variables and the rest will be zeros.
> >>
> >> One more question:
> >>
> >> 1. I still want to find a way to look at the interactions of the
> >> independent variables.
> >>
> >> the regression would look like this:
> >>
> >> y = b12*X1X2 + b23*X2X3 +...+ bk-1k*Xk-1Xk
> >>
> >> so I think the regression in R would look like this:
> >>
> >> lm(MARGIN, P235:P236+P236:P237+....,weights = Poss, data = adj0708),
> >>
> >> my problem is that since I have technically 339 independent variables,
> >> when I do this regression I would have 339 Choose 2 = approx 57000
> >> independent variables (a vast majority will be 0s though) so I dont
> >> want to have to write all of these out. Is there a way to do this
> >> quickly in R?
> >>
> >> Also just a curious question that I cant seem to find to online:
> >> is there a more efficient model other than lm() that is better for
> >> very sparse data sets like mine?
> >>
> >> Thanks,
> >> Matt
> >>
> >>
> >> On Mon, Feb 28, 2011 at 4:30 PM, Greg Snow <Greg.Snow_at_imail.org> wrote:
> >> > Don't put the name of the dataset in the formula, use the data
> >> > argument to lm to provide that. A single period (".") on the right
> >> > hand side of the formula will represent all the columns in the data set
> >> > that are not on the left hand side (you can then use "-" to remove any
> >> > other columns that you don't want included on the RHS).
> >> >
> >> > For example:
> >> >
> >> > > lm(Sepal.Width ~ . - Sepal.Length, data=iris)
> >> >
> >> > Call:
> >> > lm(formula = Sepal.Width ~ . - Sepal.Length, data = iris)
> >> >
> >> > Coefficients:
> >> >       (Intercept)       Petal.Length        Petal.Width  Speciesversicolor
> >> >            3.0485             0.1547             0.6234            -1.7641
> >> >  Speciesvirginica
> >> >           -2.1964
> >> >
> >> >
> >> > But, are you sure that a regression model with 339 predictors will be
> >> > meaningful?
> >> >
> >> > --
> >> > Gregory (Greg) L. Snow Ph.D.
> >> > Statistical Data Center
> >> > Intermountain Healthcare
> >> > greg.snow_at_imail.org
> >> > 801.408.8111
> >> >
> >> >
> >> >> -----Original Message-----
> >> >> From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On Behalf Of Matthew Douglas
> >> >> Sent: Monday, February 28, 2011 1:32 PM
> >> >> To: r-help_at_r-project.org
> >> >> Subject: [R] Regression with many independent variables
> >> >>
> >> >> Hi,
> >> >>
> >> >> I am trying use lm() on some data, the code works fine but I would
> >> >> like to use a more efficient way to do this.
> >> >>
> >> >> The data looks like this (the data is very sparse with a few 1s, -1s
> >> >> and the rest 0s):
> >> >>
> >> >> > head(adj0708)
> >> >>       MARGIN Poss P235 P247 P703 P218 P430 P489 P83 P307 P337 ....
> >> >> 1   64.28571   29    0    0    0    0    0    0   0    0    0  0  0  0  0
> >> >> 2 -100.00000    6    0    0    0    0    0    0   0    1    0  0  0  0  0
> >> >> 3  100.00000    4    0    0    0    0    0    0   0    1    0  0  0  0  0
> >> >> 4  -33.33333    7    0    0    0    0    0    0   0    0    0  0  0  0  0
> >> >> 5  200.00000    2    0    0    0    0    0    0   0    0    0  0 -1  0  0
> >> >> 6  -83.33333   12    0   -1    0    0    0    0   0    0    0  0  0  0  0
> >> >>
> >> >> adj0708 is actually a 35657x341 data set. Each column after "Poss" is
> >> >> an independent variable, the dependent variable is "MARGIN" and it is
> >> >> weighted by "Poss"
> >> >>
> >> >>
> >> >> The regression is below:
> >> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235 + adj0708$P247 +
> >> >> adj0708$P703 + adj0708$P430 + adj0708$P489 + adj0708$P218 +
> >> >> adj0708$P605 + adj0708$P337 + .... +
> >> >> adj0708$P510,weights=adj0708$Poss)
> >> >>
> >> >> I have two questions:
> >> >>
> >> >> 1. Is there a way to to condense how I write the independent variables
> >> >> in the lm(), instead of having such a long line of code (I have 339
> >> >> independent variables to be exact)?
> >> >> 2. I would like to pair the data to look a regression of the
> >> >> interactions between two independent variables. I think it would look
> >> >> something like this....
> >> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235:adj0708$P247 +
> >> >> adj0708$P703:adj0708$P430 + adj0708$P489:adj0708$P218 +
> >> >> adj0708$P605:adj0708$P337 + ....,weights=adj0708$Poss)
> >> >> but there will be 339 Choose 2 combinations, so a lot of independent
> >> >> variables! Is there a more efficient way of writing this code. Is
> >> >> there a way I can do this?
> >> >>
> >> >> Thanks,
> >> >> Matt
> >> >>
> >> >> ______________________________________________
> >> >> R-help_at_r-project.org mailing list
> >> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> >> and provide commented, minimal, self-contained, reproducible code.
> >> >
> >

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Thu 03 Mar 2011 - 23:00:18 GMT.
