Re: [R] Discretize continuous variables....

From: milicic.marko <milicic.marko_at_gmail.com>
Date: Sat, 19 Jul 2008 14:59:31 -0700 (PDT)

Frank/Daniel,

Thank you for the very good discussion on this.

The reason I'm doing this is that it is common industry practice to group a continuous variable (say, age) into a few buckets when developing scorecards to be used by business people. I don't see why I shouldn't discretize the variable AGE if I manage to retain the same information, or lose only a little.

However, I do agree that reading your book will be of great benefit.

Thanks a lot, and keep the discussion alive.

On Jul 19, 7:03pm, Frank E Harrell Jr <f.harr..._at_vanderbilt.edu> wrote:
> Daniel Malter wrote:
> > True. Thanks for the clarification. Is your conclusion from that that the
> > findings in such case should only be interpreted in the specific context
> > (with the awareness that it does not apply to changing contexts) or that
> > such an approach should not be taken at all?
>
> The latter, in general; in specific cases the former. But even then
> why condition on incomplete information when complete information is
> available? I.e., why compute Pr(Y=1 | X>x) in place of Pr(Y=1 | X=x)?
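
A minimal sketch of the difference between the two quantities, on simulated data. The logistic model, its coefficients, and the uniform age distribution are assumptions made up for illustration; nothing here comes from a real data set.

```r
set.seed(1)
n   <- 20000
age <- runif(n, 20, 90)              # assumed age distribution
p   <- plogis(-4 + 0.06 * age)       # assumed true Pr(Y = 1 | age)
y   <- rbinom(n, 1, p)

fit <- glm(y ~ age, family = binomial)

# Pr(Y = 1 | X = 65): estimated risk for a subject who is exactly 65
p_at_65 <- unname(predict(fit, data.frame(age = 65), type = "response"))

# Pr(Y = 1 | X > 65): average risk over everyone older than 65,
# which depends on how the ages above 65 are distributed in the sample
p_above_65 <- mean(y[age > 65])

c(at_65 = p_at_65, above_65 = p_above_65)
```

The second number describes the group of subjects over 65 in this particular sample, not any individual age.
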
>
> Frank
>
> > Frank E Harrell Jr wrote:
> >> Daniel Malter wrote:
> >>> This time I agree with Rolf Turner. This sounds like homework. Whether it
> >>> is or not, type
>
> >>> ?ifelse
>
> >>> in the R-prompt.
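
For reference, a small sketch of what the ifelse() route looks like, next to cut() for more than two buckets. The ages and the break points are made up for illustration; this only shows the mechanics and does not argue for doing it.

```r
age <- c(23, 37, 41, 58, 65, 72, 90)   # made-up ages

# Two groups with ifelse(); the cut point 65 is arbitrary
age2 <- ifelse(age >= 65, ">=65", "<65")

# Several buckets with cut(); right = FALSE makes the intervals [a, b)
age_grp <- cut(age, breaks = c(0, 30, 45, 65, Inf), right = FALSE)

table(age2)
table(age_grp)
```
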
>
> >>> Frank is right, it leads to a loss of information. However, I think it
> >>> remains interpretable. Further, it is common practice in certain fields, and
> >> I have to disagree. It is easy to show that odds ratios so obtained are
> >> functions of the entire distribution of the predictor in question. Thus
> >> they do not estimate a scientific quantity (something that can be
> >> interpreted out of context). For example if age is cut at 65 and one
> >> were to add to the sample several subjects aged 100, the >=65 : <65 odds
> >> ratio would change even if the age effect did not.
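
Frank's age-65 example can be sketched as a simulation. The model (logit Pr(Y = 1) = -4 + 0.06 * age), the sample sizes, and the helper names sim(), or_65() and slope() are all hypothetical and chosen only for illustration.

```r
set.seed(42)

sim <- function(age) {
  # same assumed age effect in every sample: logit Pr(Y = 1) = -4 + 0.06 * age
  data.frame(age = age, y = rbinom(length(age), 1, plogis(-4 + 0.06 * age)))
}

or_65 <- function(d) {
  # odds ratio for age >= 65 versus age < 65 from the dichotomized model
  fit <- glm(y ~ I(age >= 65), family = binomial, data = d)
  unname(exp(coef(fit)[2]))
}

slope <- function(d) {
  # per-year log odds ratio from the model that keeps age continuous
  unname(coef(glm(y ~ age, family = binomial, data = d))[2])
}

d0 <- sim(runif(5000, 40, 80))          # original sample
d1 <- rbind(d0, sim(rep(100, 1000)))    # add 1000 subjects aged 100

c(or_d0 = or_65(d0), or_d1 = or_65(d1))
c(slope_d0 = slope(d0), slope_d1 = slope(d1))
```

The dichotomized odds ratio moves when the centenarians are added, while the continuous slope should stay close to the assumed 0.06 in both samples.
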
>
> >>> it may be a reasonable way to check whether it is mostly outliers in X that
> >>> drive your results (although other approaches are available for that as
> >>> well). The main underlying question, however, should be: do you have reason
> >>> to expect that the response differs by the groups you create rather than
> >>> with the values of the continuous variable?
> >> Regression splines can help. Sometimes the splines are stated in terms
> >> of the cube root of the predictor to avoid excess influence.
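
A rough sketch of the spline idea using ns() from the splines package that ships with R (with lrm(), rcs() from the rms package plays a similar role). The U-shaped age effect, the sample size, and df = 4 are assumptions picked for illustration.

```r
library(splines)   # ships with R

set.seed(7)
n   <- 3000
age <- runif(n, 20, 90)
# assumed nonlinear (U-shaped) age effect, for illustration only
y   <- rbinom(n, 1, plogis(-1 + 0.002 * (age - 50)^2))

fit_lin    <- glm(y ~ age,             family = binomial)
fit_spline <- glm(y ~ ns(age, df = 4), family = binomial)

# likelihood ratio test: does the spline improve on a straight line?
anova(fit_lin, fit_spline, test = "Chisq")

# fitted risk over a grid of ages, no cut points needed
grid <- data.frame(age = seq(20, 90, by = 10))
risk <- predict(fit_spline, grid, type = "response")
cbind(grid, risk = risk)
```

The spline keeps age continuous and lets the data choose the shape, so no information is thrown away by bucketing.
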
>
> >> Frank
>
> >>> Regarding question 2: I thought you meant that you want to reduce the
> >>> number of levels (say 4) to a smaller number of levels (say 2) for one of
> >>> your independent variables (i.e. one of the Xs), not Y. This makes sense
> >>> only if there is a good conceptual reason to group these categories, not
> >>> just to get significance.
>
> >>> Best,
> >>> Daniel
>
> >>> Frank E Harrell Jr wrote:
> >>>> milicic.marko wrote:
> >>>>> Hi R helpers,
>
> >>>>> I'm preparing a dataset to fit a logistic regression model with lrm(). I
> >>>>> have various continuous and discrete variables and I would like to:
>
> >>>>> 1. Optimally discretize continuous variables (optimally meaning maximizing
> >>>>> information value, IV, for example)
> >>>> This will result in effects in the model that cannot be interpreted and
> >>>> will ruin the statistical inference from the lrm. It will also hurt
> >>>> predictive discrimination. You seem to be allergic to continuous
> >>>> variables.
>
> >>>>> 2. Regroup discrete variables to achieve perhaps a smaller number of
> >>>>> levels and better information value...
> >>>> If you use the Y variable to do this the same problems will result.
> >>>> Shrinkage is a better approach, or using marginal frequencies to combine
> >>>> levels. See the "pre-specification of complexity" strategy in my book
> >>>> Regression Modeling Strategies.
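
One possible reading of "using marginal frequencies to combine levels", sketched in base R: pool the levels that are rare in X, without ever looking at Y. The data, the threshold min_n, and the label "Other" are made up, and this is not necessarily the procedure described in the book.

```r
# made-up categorical predictor with some rare levels
x <- factor(c(rep("A", 50), rep("B", 30), rep("C", 4), rep("D", 3), rep("E", 2)))

min_n <- 10                      # arbitrary threshold, fixed before looking at Y
freq  <- table(x)                # marginal frequencies of X only
rare  <- names(freq)[freq < min_n]

x2 <- x
levels(x2)[levels(x2) %in% rare] <- "Other"   # merge the rare levels into one

table(x2)
```

Because the grouping depends only on the distribution of X, it does not distort inference the way outcome-driven regrouping does.
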
>
> >>>> Frank
>
> >>>>> Please suggest if there is some package providing this or similar
> >>>>> functionality for discretization...
>
> >>>>> If there is no package, please suggest how to achieve this.
>
> ______________________________________________
> R-h...@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Received on Sat 19 Jul 2008 - 22:02:34 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 20 Jul 2008 - 03:32:02 GMT.
