From: Claudia Beleites <cbeleites_at_units.it>

Date: Mon, 14 Jun 2010 16:36:02 +0200

Dear all,

(I sent this first part of the email to John earlier today, but forgot to send
it to the list as well)

Dear John,

> Hi, this is not an R technical question per se. I know there are many excellent
> statisticians on this list, so here are my questions: I have a dataset with ~1800
> observations and 50 independent variables, so there are about 35 samples per
> variable. Is it wise to build a stable multiple logistic model with 50
> independent variables? Any problem with this approach? Thanks

First: I'm not a statistician, but a spectroscopist. But I do build logistic regression models with far fewer than 1800 samples and far more variates (e.g. 75 patients / 256 spectral measurement channels), though I have many measurements per sample: typically several hundred spectra per sample.

Question: are the 1800 real, independent samples?

Model stability is something you can measure. Do an honest validation of your model with truly _independent_ test data, and measure the stability according to what your stability needs are (e.g. stable parameters or stable predictions?).
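As a sketch of what I mean (simulated data, my own illustration, not John's actual data set): refit the model on bootstrap resamples and look at the spread of the coefficients; prediction stability can be checked the same way on held-out cases.

```r
set.seed(0)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)
y <- rbinom(n, 1, plogis(X %*% c(1, -0.5, 0)))
dat <- data.frame(y = y, X)

# Refit on bootstrap resamples; the spread of the coefficients across
# resamples indicates how stable the parameters are.
coefs <- replicate(200, {
  boot <- dat[sample(n, replace = TRUE), ]
  coef(glm(y ~ ., data = boot, family = binomial))
})
apply(coefs, 1, sd)   # one SD per coefficient (intercept + 3 slopes)
```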

(From here on reply to Joris)

> Marc's explanation is valid to a certain extent, but I don't agree with
> his conclusion. I'd like to point out "the curse of
> dimensionality" (Hughes effect) which starts to play rather quickly.

No doubt.

> The curse of dimensionality is easily demonstrated looking at the
> proximity between your datapoints. Say we scale the interval in one
> dimension to be 1 unit. If you have 20 evenly-spaced observations, the
> distance between the observations is 0.05 units. To have a proximity
> like that in a 2-dimensional space, you need 20^2 = 400 observations. In
> a 10-dimensional space this becomes 20^10 ~ 10^13 datapoints. The
> distance between your observations is important, as a sparse dataset
> will definitely make your model misbehave.

But won't the distance between the groups also grow? No doubt high-dimensional spaces are _very_ unintuitive.
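Joris's growth figures can be checked in a couple of lines (my own illustration, not part of the original thread):

```r
# Points needed to keep a grid spacing of 0.05 on the unit hypercube:
# the count grows as 20^d with the dimension d.
d <- c(1, 2, 5, 10)
n_needed <- 20^d
n_needed   # 20, 400, 3.2e6, 1.024e13
```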

However, the required sample size may grow substantially more slowly if the model has appropriate restrictions. I remember a recommendation of "at least 5 samples per class and variate" for linear classification models. That is, not to get a good model, but to have a reasonable chance of getting a stable model.

> Even with about 35 samples per variable, using 50 independent
> variables will render a highly unstable model,

Am I wrong in thinking that there may be a substantial difference between the
stability of predictions and the stability of the model parameters?

BTW: if the models are unstable, there's also aggregation.

At least for my spectra I can give toy examples with a physical-chemical explanation that yield the same predictions with different parameters (of course because of correlation).
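A minimal sketch of what I mean by aggregation (simulated data, my own illustration): fit logistic models on bootstrap resamples and average the predicted probabilities, i.e. bagging.

```r
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 5), n, 5)
y <- rbinom(n, 1, plogis(X %*% c(1, -1, 0.5, 0, 0)))
dat <- data.frame(y = y, X)

# Bagged logistic regression: each resample gives a (possibly unstable)
# set of coefficients, but the averaged predictions are much steadier.
bag_predict <- function(dat, newdata, B = 50) {
  preds <- replicate(B, {
    boot <- dat[sample(nrow(dat), replace = TRUE), ]
    fit <- glm(y ~ ., data = boot, family = binomial)
    predict(fit, newdata = newdata, type = "response")
  })
  rowMeans(preds)
}

p <- bag_predict(dat, dat)
```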

> as your dataspace is
> about as sparse as it can get. On top of that, interpreting a model
> with 50 variables is close to impossible,

No, not necessarily. IMHO it depends very much on the meaning of the variables.
E.g. for the spectra, a set of model parameters may be interpreted like spectra
or difference spectra. Of course this has to do with the fact that a parallel
coordinate plot is the more "natural" view of spectra, compared to a point in so
many dimensions.

> and then I didn't even start
> on interactions. No point in trying I'd say. If you really need all
> that information, you might want to take a look at some dimension
> reduction methods first.

Which brings to my mind a question I've had for a long time:
I assume that all variables that I know beforehand to be without information are
already discarded.

The dimensionality is then further reduced in a data-driven way (e.g. by PCA or
PLS). The model is built in the reduced space.

How many fewer samples are actually needed, considering that the dimension reduction is itself a model estimated from the data? Which of course also means that an honest validation has to embrace the data-driven dimensionality reduction as well...
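As a sketch of what "embracing the dimension reduction" means in practice (simulated data, my own illustration): the PCA is re-fitted inside each resampling loop, on the training fold only, never on the whole data set.

```r
set.seed(2)
n <- 80; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))

k <- 5                                  # cross-validation folds
folds <- sample(rep(1:k, length.out = n))
err <- numeric(k)
for (i in 1:k) {
  train <- folds != i
  # PCA fitted on the training fold only:
  pca <- prcomp(X[train, ], center = TRUE, scale. = TRUE)
  ncomp <- 3
  Ztr <- pca$x[, 1:ncomp]
  fit <- glm(y ~ ., data = data.frame(y = y[train], Ztr), family = binomial)
  # Project the test fold onto the training-fold PCA, then predict:
  Zte <- predict(pca, X[!train, ])[, 1:ncomp]
  phat <- predict(fit, newdata = data.frame(Zte), type = "response")
  err[i] <- mean((phat > 0.5) != y[!train])
}
mean(err)   # error estimate that includes the PCA step in the validation
```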

Are there recommendations about that?

The other curious question I have is:

I assume that it is impossible for him to obtain the 10^xy samples required for
comfortable model building.

So what is he to do?

Cheers,

Claudia

--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste
phone: +39 0 40 5 58-37 68
email: cbeleites_at_units.it

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Mon 14 Jun 2010 - 14:46:22 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 14 Jun 2010 - 15:30:29 GMT.
