Re: [R] Fitting a distribution to peaks in histogram

From: hadley wickham <h.wickham_at_gmail.com>
Date: Thu 20 Jul 2006 - 21:23:13 EST

> I am afraid that, by comparison, I am the local statistican. I am also
> the local R-guru, and neither is saying much - so please bear with
> me.
>
> Do you know of some functions (built in hopefully) that I can try?

I'm a bit leery of offering advice without really sitting down and discussing your problem, but I've expanded my initial thoughts into hopefully something you can follow.

You mentioned that the peaks in the density correspond to different stages of the cell cycle. For this reason, I am going to assume that there are known number of peaks. If the number of peaks isn't known then this becomes more complicated. However, it sounds like even in that case there is a fairly low upper limit, so I think this approach should be ok. (And a model selection process based on AIC/BIC, if used with care, should be ok)

I think your problem boils down to fitting your data to a mixture of distributions. A good place to start would be a literature search to see what others have tried in the past, especially what distributions they used. I suspect you have already tried this with little luck. Because the patterns in the histogram you showed looked rather skewed, I would start with a mixture of gammas (this being a fairly flexible distribution esp wrt skewness).

To do this, the first step is to write out the likelihood equation for a mixture of gammas (you will need to do most of this yourself), which will be of the form:

f(x) = a_1 g(alpha_1, beta_1)(x) + a_2 g(alpha_2, beta_2)(x) + ... + (1- sum(a_i) g(alpha_n, beta_n)(x)

(this is for fixed n). This gives you (n-1) + 2*n parameters to estimate, which should be fine given the large amount of data you have. I would then use mle or optim to fit to estimate the parameters from the data. (You will need starting values for the algorthim, which you should be able to generate from the output you already have). This should be pretty fast if you vectorise appropriately.

(you will also need something to account for the final value which looked to have a high occurence in your histogram (perhaps from truncation?))

Once you have these estimates you should be able to get all the other data from them. You can also get standard errors for your estimates using standard ml theory (which should be reasonable given the large number of events)

Now for the caveats: I am only a PhD student, so there may be good reasons not to take this approach. If someone else suggested this approach to me, I would criticise the arbitrary choice of distribution, and would need reassurance there wouldn't be aliasing problems (eg. where very different combinations of the parameters given precisely the same results, will be somewhat alleviated with good defaults). There may be other problems that I am not aware of, and as I don't really know that much about what you are trying to do, there may be radically simpler or more accurate methods.

Regards,

Hadley



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Thu Jul 20 21:27:34 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 20 Jul 2006 - 22:17:33 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.