From: Joris Meys <jorismeys_at_gmail.com>

Date: Mon, 14 Jun 2010 15:38:52 +0200

Hi,

Marc's explanation is valid to a certain extent, but I don't agree with his conclusion. I'd like to point out the "curse of dimensionality" (Hughes effect), which starts to play a role rather quickly.

The curse of dimensionality is easily demonstrated by looking at the proximity between your data points. Say we scale the interval in one dimension to be 1 unit. If you have 20 evenly spaced observations, the distance between neighbouring observations is about 0.05 units. To keep that proximity in a 2-dimensional space, you need 20^2 = 400 observations. In a 10-dimensional space this becomes 20^10 ~ 10^13 data points. The distance between your observations is important, as a sparse dataset will definitely make your model misbehave.
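
The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes the observations sit on a regular grid, which real data never do, and the function names are made up for illustration. The second function turns the question around and asks how many grid points per axis 1800 observations would give you in 50 dimensions.

```python
# Sketch of the grid argument above; assumes observations lie on a regular grid.
# Function names are hypothetical, chosen for this illustration only.

def points_needed(dimensions, per_axis=20):
    """Observations needed to keep `per_axis` evenly spaced points on every axis."""
    return per_axis ** dimensions


def points_per_axis_available(n_obs, dimensions):
    """Points per axis you would get if n_obs were spread over a regular grid."""
    return n_obs ** (1.0 / dimensions)


if __name__ == "__main__":
    for d in (1, 2, 10):
        print(d, "dimensions:", points_needed(d), "observations")

    # The dataset from the original question: ~1800 observations, 50 variables.
    print("points per axis:", round(points_per_axis_available(1800, 50), 2))
```

For the 1800 x 50 case the result is barely more than one point per axis, which is the sparsity I'm worried about. It is a heuristic picture only, as it treats all 50 variables as continuous and ignores any correlation between them.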

On Mon, Jun 14, 2010 at 2:55 PM, Marc Schwartz <marc_schwartz_at_me.com> wrote:

> On Jun 13, 2010, at 10:20 PM, array chip wrote:

>
>> Hi, this is not R technical question per se. I know there are many excellent statisticians in this list, so here my questions: I have dataset with ~1800 observations and 50 independent variables, so there are about 35 samples per variable. Is it wise to build a stable multiple logistic model with 50 independent variables? Any problem with this approach? Thanks
>>
>> John
>
>
> The general rule of thumb is to have 10-20 'events' per covariate degree of freedom. Frank has suggested that in some cases that number should be as high as 25.
>
> The number of events is the smaller of the two possible outcomes for your binary dependent variable.
>
> Covariate degrees of freedom refers to the number of columns in the model matrix. Continuous variables are 1, binary factors are 1, K-level factors are K - 1.
>
> So if out of your 1800 records, you have at least 500 to 1000 events, depending upon how many of your 50 variables are K-level factors and whether or not you need to consider interactions, you may be OK. Better if towards the high end of that range, especially if the model is for prediction versus explanation.
>
> Two excellent references would be Frank's book:
>
> http://www.amazon.com/Regression-Modeling-Strategies-Frank-Harrell/dp/0387952322/
>
> and Steyerberg's book:
>
> http://www.amazon.com/Clinical-Prediction-Models-Development-Validation/dp/038777243X/
>
> to assist in providing guidance for model building/validation techniques.
>
> HTH,
>
> Marc Schwartz
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
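
For reference, the rule of thumb Marc describes above can be written down as a short Python sketch. The function names are hypothetical, and the 600 events in the usage lines are an invented figure, since the original question does not say how many events there are.

```python
# Sketch of the events-per-degree-of-freedom rule quoted above.
# All function names are hypothetical; the 10-20 range is the one Marc quotes.

def events(n_obs, n_positive):
    """Number of events: the count of the less frequent of the two outcomes."""
    return min(n_positive, n_obs - n_positive)


def covariate_df(n_continuous=0, n_binary=0, factor_levels=()):
    """Model-matrix columns: 1 per continuous or binary variable, K - 1 per K-level factor."""
    return n_continuous + n_binary + sum(k - 1 for k in factor_levels)


def events_needed(df, low=10, high=20):
    """Range of events suggested by the rule of thumb for `df` degrees of freedom."""
    return low * df, high * df


def events_per_df(n_obs, n_positive, df):
    """Events available per covariate degree of freedom."""
    return events(n_obs, n_positive) / df


if __name__ == "__main__":
    df = covariate_df(n_continuous=50)   # 50 variables, no multi-level factors
    print(events_needed(df))             # (500, 1000), the range Marc quotes
    # Invented example: 600 of the 1800 records have the rarer outcome.
    print(events_per_df(1800, 600, df))
```

Any K-level factors or interaction terms among the 50 variables push the degrees of freedom, and with them the required number of events, higher.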

-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
Joris.Meys_at_Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

Received on Mon 14 Jun 2010 - 13:41:08 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Mon 14 Jun 2010 - 16:20:31 GMT.
