Re: [R] modeling binary response variables

From: Daniel Malter <>
Date: Mon, 14 Jul 2008 18:07:15 -0700 (PDT)

Hi Kevin, you mean an s-shaped relationship of a variable with your response? So you have a response that is strictly constrained to the interval [0,1], and these limits are not due to truncation or censoring (i.e. your response variable is truly a proportion).

This sounds like a good application for a binomial model, as fitting a linear model may give you fitted values outside the interval that you are allowed to observe (0,1). The binomial model with a logit (or probit, or cloglog) link fixes that issue.
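A minimal sketch of that point on simulated data (the names x, n, succ and the slope of 1.5 are made up for illustration, not taken from Kevin's data): a straight-line fit to a sigmoid proportion can produce fitted values outside the unit interval, whereas the binomial logit fit cannot.

```r
set.seed(1)
x <- seq(-4, 4, length.out = 50)                 # hypothetical predictor
n <- rep(100, 50)                                # hypothetical trials per observation
succ <- rbinom(50, size = n, prob = plogis(1.5 * x))
y <- succ / n                                    # observed proportion in [0, 1]

lin   <- lm(y ~ x)                               # ordinary linear model
logit <- glm(cbind(succ, n - succ) ~ x, family = binomial)

range(fitted(lin))    # may extend below 0 and above 1
range(fitted(logit))  # stays inside the unit interval
```

Comparing the two ranges shows whether the linear fit strays outside the limits for your own data.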

Since you have a proportion (the probability of success), you have something between 0 and 1. I suggest you transform it by multiplying the proportion by, say, 100 (or 1000) and rounding the result to the nearest integer. Say Y is currently your proportion: do new.Y=round(Y*100). Then you create the number of observations that make up the counter-probability of your observation: counter.Y=100-new.Y.

Then you can run the binomial as follows:

reg=glm(cbind(new.Y,counter.Y)~predictors,binomial) ##runs the regression
summary(reg) ##shows the summary output of your regression
fitted(reg) ##shows the predicted values given your data matrix and your estimated model
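A self-contained sketch of these steps on simulated data. The data frame d, the predictor x, and the denominator of 100 are illustrative assumptions, not Kevin's data. If the real number of trials behind each proportion is known, it is probably better to use those counts in place of the arbitrary 100, since the denominator drives the standard errors.

```r
set.seed(42)
d <- data.frame(x = runif(40, -3, 3))                  # hypothetical predictor
d$Y <- plogis(0.5 + 1.2 * d$x + rnorm(40, sd = 0.3))   # hypothetical proportion

d$new.Y     <- round(d$Y * 100)    # pseudo-count of successes out of 100
d$counter.Y <- 100 - d$new.Y       # pseudo-count of failures

reg <- glm(cbind(new.Y, counter.Y) ~ x, family = binomial, data = d)
summary(reg)

# Same model written with the proportion as response and the trials as weights
reg2 <- glm(new.Y / 100 ~ x, family = binomial,
            weights = rep(100, 40), data = d)
all.equal(coef(reg), coef(reg2))
```

The two calls describe the same model, so which form to use is mostly a matter of how the data are stored.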

You will want to check a) whether you need a binomial model at all (if your probabilities actually fall in a much narrower interval than 0 to 1, away from the limits, you may be okay with a linear model), and b) if a binomial is more appropriate, whether your data are overdispersed. Compare the residual deviance in the summary of your model with the residual degrees of freedom: the two should be about equal. If the residual deviance is much larger, use family=quasibinomial instead of family=binomial when fitting the model.
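One common way to check for overdispersion is to compare the residual deviance (or the Pearson chi-square statistic) with the residual degrees of freedom; a ratio well above 1 suggests overdispersion. A rough sketch on simulated data follows, where the extra noise on the logit scale is an assumption introduced purely to produce overdispersion:

```r
set.seed(7)
x <- runif(60, -2, 2)                        # hypothetical predictor
# Success probability varies beyond what x explains, which creates overdispersion
p <- plogis(0.8 * x + rnorm(60, sd = 1))
succ <- rbinom(60, size = 100, prob = p)
fail <- 100 - succ

reg <- glm(cbind(succ, fail) ~ x, family = binomial)

# Two dispersion estimates; values well above 1 suggest overdispersion
deviance(reg) / df.residual(reg)
sum(residuals(reg, type = "pearson")^2) / df.residual(reg)

# Quasibinomial refit: same coefficients, standard errors inflated
# by the square root of the estimated dispersion
qreg <- glm(cbind(succ, fail) ~ x, family = quasibinomial)
summary(qreg)$dispersion
```

The quasibinomial fit leaves the point estimates unchanged and only widens the standard errors, so it is a cheap safeguard when the ratio is clearly above 1.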


Kevin J Emerson wrote:
> R-devotees,
> I have a question about modeling in the case where the response variable
> is binary.
> I have a case where I have a response variable that is the probability of
> success, and four descriptor variables. The response has a sigmoid response
> with one of the variables. I would like to test for the effect of the
> various descriptor variables on the percentage success of the binary trait.
> I have looked at glm with family = "binomial" but am not sure I totally
> understand its use (and therefore am not sure it is the appropriate test)
> and am looking for two things: (1) is glm with family = 'binomial' the right
> way to do this, and (2) are there any good references on how it works.
> I have posted a plot of a sample of the data I am looking at as well as the
> sample data used to generate the plots.
> Sample Plot:
> Sample Data:
> Response variable is ( are the errors from binomial
> estimates given probability and number of samples).
> Descriptor variables are num.days, ppd, temp, and pop.
> Any help would be greatly appreciated.
> Cheers,
> Kevin Emerson
> ====================================
> Kevin J. Emerson
> Bradshaw - Holzapfel Lab
> 1210 University of Oregon
> Eugene, OR, 97403
> ______________________________________________
> mailing list
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.

Received on Tue 15 Jul 2008 - 01:14:12 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 15 Jul 2008 - 02:31:39 GMT.
