From: Matthew Douglas <matt.douglas01_at_gmail.com>

Date: Thu, 03 Mar 2011 16:08:46 -0500

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Thu 03 Mar 2011 - 22:53:30 GMT

Thanks Greg,

That formula was exactly what I was looking for. But now when I run it on my data I get the following error:

"Error in model.matrix.default(mt, mf, contrasts) : cannot allocate vector of length 2043479998"

I know that many of the two-way interactions are probably all zero, so I thought I could save space by removing them. Is there some way to drop the two-way interaction columns that are entirely zero and keep only the columns that have non-zero entries? I think that would significantly cut down the memory needed. Or is there another way to get around this?

Thanks,

Matt
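
For readers following the thread, here is a minimal sketch of one way to keep only the non-zero two-way interactions. It is not taken from any reply in the thread. It assumes the Matrix package is installed, and it uses a small made-up matrix X in place of adj0708 (the names X, co, keep and Z are all hypothetical). The idea is to find which pairs of columns are ever non-zero in the same row, and to build sparse product columns for those pairs only, so model.matrix() never has to allocate the full dense matrix.

```r
library(Matrix)  # assumed to be installed; it is a recommended package

# Why the dense approach fails: rows x interaction columns x 8 bytes per double
n_pairs  <- choose(339, 2)               # 57291 two-way interactions
dense_gb <- 35657 * n_pairs * 8 / 1e9    # roughly 16 GB for the interactions alone

# Hypothetical toy stand-in for the predictor columns of adj0708:
# each row has 4 non-zero entries (-1 or 1), everything else is 0
set.seed(42)
n <- 30; p <- 15
X <- matrix(0, n, p, dimnames = list(NULL, paste0("P", seq_len(p))))
for (i in seq_len(n)) {
  X[i, sample(p, 4)] <- sample(c(-1, 1), 4, replace = TRUE)
}

# co[j, k] counts the rows in which columns j and k are both non-zero;
# this is only p x p, so it stays small even for p = 339
co   <- crossprod((X != 0) * 1)
keep <- which(co > 0 & upper.tri(co), arr.ind = TRUE)
j <- unname(keep[, "row"])
k <- unname(keep[, "col"])

# Build only the interaction columns that are not identically zero,
# stored as a sparse matrix
Xs <- Matrix(unname(X), sparse = TRUE)
Z  <- Xs[, j, drop = FALSE] * Xs[, k, drop = FALSE]
colnames(Z) <- paste(colnames(X)[j], colnames(X)[k], sep = ":")

dim(Z)  # rows x number of retained interactions
```

Z can then be handed to a fitting routine that accepts sparse matrices. The Matrix package also provides sparse.model.matrix(), which takes a formula directly, though whether it copes with all 57291 terms on the real data is not something this sketch establishes. The retained columns may still be collinear, so a rank check is advisable before solving.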

On Tue, Mar 1, 2011 at 3:56 PM, Greg Snow <Greg.Snow_at_imail.org> wrote:

> You can use ^2 to get all 2-way interactions and ^3 to get all 3-way interactions, e.g.:
>
> lm(Sepal.Width ~ (. - Sepal.Length)^2, data=iris)
>
> The lm.fit function is what actually does the fitting, so you could go directly there, but then you lose the benefits of using . and ^. The Matrix package has ways of dealing with sparse matrices, but I don't know if that would help here or not.
>
> You could also just create the x'x and x'y matrices directly, since the variables are 0/1, and then use solve. A lot depends on what you are doing and what questions you are trying to answer.
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow_at_imail.org
> 801.408.8111
>
>
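
A minimal sketch of the x'x and x'y route mentioned in the quoted reply, written for the weighted case since the original regression is weighted by Poss. It uses the built-in iris data rather than adj0708, and the weight vector w is hypothetical, standing in for Poss. The weighted least-squares coefficients solve (X'WX) b = X'Wy with W = diag(w), and the result is compared against lm() as a sanity check.

```r
# Hypothetical weights playing the role of Poss
w <- rep(c(1, 2, 3), length.out = nrow(iris))

# Design matrix and response for the small example model
X <- model.matrix(Sepal.Width ~ . - Sepal.Length, data = iris)
y <- iris$Sepal.Width

# w * X scales row i of X by w[i], so these are X'WX and X'Wy
XtWX <- crossprod(X, w * X)
XtWy <- crossprod(X, w * y)
beta <- solve(XtWX, XtWy)

# Reference fit; the two sets of coefficients should agree up to rounding
fit <- lm(Sepal.Width ~ . - Sepal.Length, data = iris, weights = w)
cbind(normal_eq = as.vector(beta), lm = unname(coef(fit)))
```

The cross-product matrix has only as many rows and columns as there are predictors, whatever the number of observations, which is what makes this route attractive when the design matrix is sparse. solve() will fail if X'WX is singular, which can happen when some interaction columns are all zero or collinear.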
>> -----Original Message-----
>> From: Matthew Douglas [mailto:matt.douglas01_at_gmail.com]
>> Sent: Tuesday, March 01, 2011 1:09 PM
>> To: Greg Snow
>> Cc: r-help_at_r-project.org
>> Subject: Re: [R] Regression with many independent variables
>>
>> Hi Greg,
>>
>> Thanks for the help, it works perfectly. To answer your question,
>> there are 339 independent variables but only 10 will be used at one
>> time. So on any given line of the data set there will be 10 non-zero
>> entries for the independent variables and the rest will be zeros.
>>
>> One more question:
>>
>> 1. I still want to find a way to look at the interactions of the
>> independent variables.
>>
>> The regression would look like this:
>>
>> y = b12*X1X2 + b23*X2X3 +...+ bk-1k*Xk-1Xk
>>
>> so I think the regression in R would look like this:
>>
>> lm(MARGIN ~ P235:P236 + P236:P237 + ...., weights = Poss, data = adj0708),
>>
>> My problem is that since I technically have 339 independent variables,
>> when I do this regression I would have 339 choose 2 = approx. 57,000
>> independent variables (the vast majority will be 0s, though), so I don't
>> want to have to write all of these out. Is there a way to do this
>> quickly in R?
>>
>> Also, just a curious question that I can't seem to find an answer to online:
>> is there a more efficient model than lm() that is better for
>> very sparse data sets like mine?
>>
>> Thanks,
>> Matt
>>
>>
>> On Mon, Feb 28, 2011 at 4:30 PM, Greg Snow <Greg.Snow_at_imail.org> wrote:
>> > Don't put the name of the dataset in the formula, use the data
>> > argument to lm to provide that. A single period (".") on the right
>> > hand side of the formula will represent all the columns in the data set
>> > that are not on the left hand side (you can then use "-" to remove any
>> > other columns that you don't want included on the RHS).
>> >
>> > For example:
>> >
>> > > lm(Sepal.Width ~ . - Sepal.Length, data=iris)
>> >
>> > Call:
>> > lm(formula = Sepal.Width ~ . - Sepal.Length, data = iris)
>> >
>> > Coefficients:
>> >       (Intercept)      Petal.Length       Petal.Width Speciesversicolor
>> >            3.0485            0.1547            0.6234           -1.7641
>> >  Speciesvirginica
>> >           -2.1964
>> >
>> >
>> > But, are you sure that a regression model with 339 predictors will be
>> > meaningful?
>> >
>> > --
>> > Gregory (Greg) L. Snow Ph.D.
>> > Statistical Data Center
>> > Intermountain Healthcare
>> > greg.snow_at_imail.org
>> > 801.408.8111
>> >
>> >
>> >> -----Original Message-----
>> >> From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On Behalf Of Matthew Douglas
>> >> Sent: Monday, February 28, 2011 1:32 PM
>> >> To: r-help_at_r-project.org
>> >> Subject: [R] Regression with many independent variables
>> >>
>> >> Hi,
>> >>
>> >> I am trying to use lm() on some data. The code works fine, but I would
>> >> like to use a more efficient way to do this.
>> >>
>> >> The data looks like this (the data is very sparse, with a few 1s and -1s
>> >> and the rest 0s):
>> >>
>> >> > head(adj0708)
>> >> MARGIN Poss P235 P247 P703 P218 P430 P489 P83 P307 P337....
>> >> 1 64.28571 29 0 0 0 0 0 0 0 0 0 0
>> >> 0 0 0
>> >> 2 -100.00000 6 0 0 0 0 0 0 0 1 0 0
>> >> 0 0 0
>> >> 3 100.00000 4 0 0 0 0 0 0 0 1 0 0
>> >> 0 0 0
>> >> 4 -33.33333 7 0 0 0 0 0 0 0 0 0 0
>> >> 0 0 0
>> >> 5 200.00000 2 0 0 0 0 0 0 0 0 0 0
>> >> -1 0 0
>> >> 6 -83.33333 12 0 -1 0 0 0 0 0 0 0 0
>> >> 0 0 0
>> >>
>> >> adj0708 is actually a 35657x341 data set. Each column after "Poss" is
>> >> an independent variable, the dependent variable is "MARGIN", and it is
>> >> weighted by "Poss".
>> >>
>> >>
>> >> The regression is below:
>> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235 + adj0708$P247 +
>> >> adj0708$P703 + adj0708$P430 + adj0708$P489 + adj0708$P218 +
>> >> adj0708$P605 + adj0708$P337 + .... +
>> >> adj0708$P510,weights=adj0708$Poss)
>> >>
>> >> I have two questions:
>> >>
>> >> 1. Is there a way to condense how I write the independent variables
>> >> in the lm(), instead of having such a long line of code (I have 339
>> >> independent variables to be exact)?
>> >> 2. I would like to pair the data to look at a regression of the
>> >> interactions between two independent variables. I think it would look
>> >> something like this....
>> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235:adj0708$P247 +
>> >> adj0708$P703:adj0708$P430 + adj0708$P489:adj0708$P218 +
>> >> adj0708$P605:adj0708$P337 + ....,weights=adj0708$Poss)
>> >> but there will be 339 choose 2 combinations, so a lot of independent
>> >> variables! Is there a more efficient way of writing this code? Is
>> >> there a way I can do this?
>> >>
>> >> Thanks,
>> >> Matt
>> >>

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Thu 03 Mar 2011 - 23:00:18 GMT.
