Re: [R] logistic regression with 50 variables

From: Claudia Beleites <cbeleites_at_units.it>
Date: Mon, 14 Jun 2010 16:36:02 +0200

Dear all,

(I sent the first part of this email to John earlier today, but forgot to copy the list.)
Dear John,

> Hi, this is not an R technical question per se. I know there are many excellent
> statisticians on this list, so here are my questions: I have a dataset with ~1800
> observations and 50 independent variables, so there are about 35 samples per
> variable. Is it wise to build a stable multiple logistic model with 50
> independent variables? Any problem with this approach? Thanks

First: I'm not a statistician but a spectroscopist. I do, however, build logistic regression models with far fewer than 1800 samples and far more variates (e.g. 75 patients / 256 spectral measurement channels), though I have many measurements per sample: typically several hundred spectra each.

Question: are the 1800 real, independent samples?

Model stability is something you can measure. Do an honest validation of your model with truly _independent_ test data, and measure stability according to what your stability needs actually are (e.g. stable parameters or stable predictions?).
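
Measuring stability is cheap to sketch. Here is a toy example on simulated data (all names and sizes are made up, and the number of variates is kept small just for illustration): refit the model on bootstrap resamples and look at the spread of the coefficients.

```r
## Toy sketch (simulated data): refit a logistic model on bootstrap
## resamples and inspect the spread of the coefficients.
set.seed(1)
n <- 1800; p <- 5                       # small p just for illustration
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X %*% rep(0.5, p)))
dat <- data.frame(y = y, X)

coefs <- replicate(25, {
  i <- sample(n, replace = TRUE)        # bootstrap training set
  coef(glm(y ~ ., data = dat[i, ], family = binomial))
})
apply(coefs, 1, sd)                     # parameter stability
## for prediction stability, compare predict() on a fixed test set instead
```

The same loop with `predict()` on a held-back test set measures prediction stability, which is the distinction I come back to below.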

(From here on reply to Joris)

> Marc's explanation is valid to a certain extent, but I don't agree with
> his conclusion. I'd like to point out "the curse of
> dimensionality" (Hughes effect), which starts to play a role rather quickly.
No doubt.

> The curse of dimensionality is easily demonstrated by looking at the
> proximity between your data points. Say we scale the interval in one
> dimension to be 1 unit. If you have 20 evenly spaced observations, the
> distance between the observations is 0.05 units. To have a proximity
> like that in a 2-dimensional space, you need 20^2 = 400 observations. In
> a 10-dimensional space this becomes 20^10 ~ 10^13 data points. The
> distance between your observations is important, as a sparse dataset
> will definitely make your model misbehave.

But won't the distance between groups grow as well? No doubt high-dimensional spaces are _very_ unintuitive.
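
Joris' spacing figures are easy to reproduce, under his same assumption of 20 evenly spaced points per axis:

```r
# observations needed to keep a grid of 20 points per axis
d <- c(1, 2, 10)                 # number of dimensions
n <- 20^d                        # 20, 400, 1.024e+13
setNames(n, paste0(d, "-D"))
```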

However, the required sample size may grow substantially more slowly if the model has appropriate restrictions. I remember the recommendation of "at least 5 samples per class and variate" for linear classification models; i.e. not enough to guarantee a good model, but enough for a reasonable chance of getting a stable one.
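
For John's numbers, my reading of that rule of thumb (multiplying classes and variates; the binary case is an assumption on my part) works out as:

```r
# "at least 5 samples per class and per variate", read multiplicatively
p <- 50               # variates
k <- 2                # classes, assuming a binary logistic model
n_min <- 5 * k * p    # 500 -- John's ~1800 samples clear this bar
n_min
```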

> Even with about 35 samples per variable, using 50 independent
> variables will render a highly unstable model,
Am I wrong in thinking that there may be a substantial difference between the stability of predictions and the stability of model parameters?

BTW: if the models are unstable, there is also aggregation (e.g. bagging).
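
A minimal sketch of such aggregation on simulated data (`bagged_glm` is a made-up helper name): refit on bootstrap resamples and average the predicted probabilities, which is typically more stable than any single fit.

```r
## Bagging sketch (simulated data): average predictions over
## bootstrap refits of a logistic model.
set.seed(2)
n <- 200
dat <- data.frame(y = rbinom(n, 1, 0.5), x1 = rnorm(n), x2 = rnorm(n))

bagged_glm <- function(dat, newdata, B = 50) {
  preds <- replicate(B, {
    i <- sample(nrow(dat), replace = TRUE)          # bootstrap resample
    fit <- glm(y ~ ., data = dat[i, ], family = binomial)
    predict(fit, newdata = newdata, type = "response")
  })
  rowMeans(preds)   # aggregated probability per observation
}

p_bag <- bagged_glm(dat, dat[1:5, ], B = 20)
```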

At least for my spectra, I can give toy examples with a physical-chemical explanation that yield the same prediction with different parameters (because of correlation, of course).

> as your dataspace is
> about as sparse as it can get. On top of that, interpreting a model
> with 50 variables is close to impossible,
No, not necessarily. IMHO it depends very much on the meaning of the variables. E.g. for spectra, a set of model parameters may be interpreted like spectra or difference spectra. Of course this has to do with the fact that a parallel-coordinate plot is the more "natural" view of spectra than a point in so many dimensions.

> and then I didn't even start
> on interactions. No point in trying I'd say. If you really need all
> that information, you might want to take a look at some dimension
> reduction methods first.

Which brings to mind a question I've had for a long time: assume that all variables known beforehand to carry no information have already been discarded.
The dimensionality is then further reduced in a data-driven way (e.g. by PCA or PLS), and the model is built in the reduced space.

How many fewer samples are actually needed, considering that the dimension reduction is itself a model estimated from the data? Which of course also means that an honest validation embraces the data-driven dimensionality reduction as well...
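
That requirement can be sketched like this (simulated data; PCA via `prcomp` standing in for the data-driven reduction): the reduction is re-estimated inside every cross-validation fold, so the held-out fold never leaks into it.

```r
## Sketch: PCA is refit inside each CV fold; the test fold never
## influences the dimension reduction (simulated data).
set.seed(3)
n <- 300; p <- 50; ncomp <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", seq_len(p))
y <- rbinom(n, 1, plogis(X[, 1]))
fold <- sample(rep(1:5, length.out = n))        # 5-fold assignment

err <- sapply(1:5, function(k) {
  tr   <- fold != k
  pca  <- prcomp(X[tr, ], center = TRUE, scale. = TRUE)  # training fold only
  Ztr  <- pca$x[, 1:ncomp]
  Zte  <- predict(pca, X[!tr, ])[, 1:ncomp]              # project test fold
  fit  <- glm(y ~ ., data = data.frame(y = y[tr], Ztr), family = binomial)
  phat <- predict(fit, data.frame(Zte), type = "response")
  mean((phat > 0.5) != y[!tr])                           # misclassification rate
})
mean(err)
```

Doing the PCA once on all the data and cross-validating only the glm would give an optimistically biased estimate, which is exactly the point above.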

Are there recommendations about that?

The other question I'm curious about is:
I assume it is impossible for him to obtain the 10^xy samples required for comfortable model building.
So what is he to do?

Cheers,

Claudia

-- 
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: cbeleites_at_units.it

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Mon 14 Jun 2010 - 14:46:22 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 14 Jun 2010 - 15:30:29 GMT.
