Re: [R] Comparison: glm() vs. bigglm()

From: Benilton Carvalho <bcarvalh_at_jhsph.edu>
Date: Fri, 29 Jun 2007 11:29:42 -0400

Hi Peter,

thank you very much for your feedback.

As for your observations, I do realize that I'm using 1.5 chunks for this particular case (10e6 gives around 8 chunks on other sets).

I just noticed that I didn't add the difference in the deviances that I observed:

m1$deviance-m2$deviance
[1] -93196.69

Thank you very much for the suggestion, I'll give it a try.

Best,
benilton

On Jun 29, 2007, at 11:05 AM, Peter Dalgaard wrote:

> Benilton Carvalho wrote:
>> Hi,
>>
>> Until now, I thought that the results of glm() and bigglm() would
>> coincide. Probably a naive assumption?
>>
>> Anyways, I've been using bigglm() on some datasets I have available.
>> One of the sets has >15M observations.
>>
>> I have 3 continuous predictors (A, B, C) and a binary outcome (Y).
>> And tried the following:
>>
>> m1 <- bigglm(Y~A+B+C, family=binomial(), data=dataset1,
>> chunksize=10e6)
>> m2 <- bigglm(Y~A*B+C, family=binomial(), data=dataset1,
>> chunksize=10e6)
>> imp <- m1$deviance-m2$deviance
>>
>> To my surprise, "imp" was negative.
>>
>> I then tried the same models, using glm() instead... and as I
>> expected, "imp" was positive.
>>
>> I also noticed differences in the coefficients estimated by glm() and
>> bigglm() - small differences, though, and CIs for the coefficients (a
>> given coefficient compared across methods) overlap.
>>
>> Are such incongruences expected? What can I use to check for
>> convergence with bigglm(), as this might be one plausible cause for a
>> negative difference in the deviances?
>>
> It doesn't sound right, but I cannot reproduce your problem on a similar
> sized problem (it pretty much killed my machine...). Some observations:
>
> A: You do realize that you are only using 1.5 chunks? (15M vs. 10e6
> chunksize)
>
> B: Deviance changes are O(1) under the null hypothesis but the deviances
> themselves are O(N). In a smaller variant (N=1e5), I got
>
>> m1$deviance
> [1] 138626.4
>> m2$deviance
> [1] 138626.4
>> m2$deviance - m1$deviance
> [1] -0.05865785
>
> This does leave some scope for roundoff to creep in. You may want to
> play with a lower setting of tol=...
>
> --
>    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard_at_biostat.ku.dk)              FAX: (+45) 35327907
>
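
For reference, the check discussed in this thread could be sketched roughly as below. This is only an illustration on simulated data, not the original analysis: the data frame `d`, the sample size, the coefficients and the chunksize are all made up, and the `maxit`/`tolerance` values are arbitrary choices meant to give bigglm() more iterations and a stricter stopping rule than its defaults. The bigglm() part runs only if the biglm package is installed.

```r
# Simulated stand-in for the real data (hypothetical names and sizes).
set.seed(1)
n <- 1e4
d <- data.frame(A = rnorm(n), B = rnorm(n), C = rnorm(n))
d$Y <- rbinom(n, 1, plogis(0.3 * d$A - 0.2 * d$B + 0.1 * d$C))

# Reference fits with glm(): the second model nests the first.
g1 <- glm(Y ~ A + B + C, family = binomial(), data = d)
g2 <- glm(Y ~ A * B + C, family = binomial(), data = d)

# For nested models fitted to convergence, the larger model cannot have
# the larger deviance, so this should be non-negative up to roundoff.
imp_glm <- deviance(g1) - deviance(g2)
print(imp_glm)

if (requireNamespace("biglm", quietly = TRUE)) {
  # Same two models with bigglm(), using several chunks, more iterations
  # and a tighter tolerance than the defaults (values chosen arbitrarily).
  b1 <- biglm::bigglm(Y ~ A + B + C, family = binomial(), data = d,
                      chunksize = 2000, maxit = 50, tolerance = 1e-10)
  b2 <- biglm::bigglm(Y ~ A * B + C, family = binomial(), data = d,
                      chunksize = 2000, maxit = 50, tolerance = 1e-10)
  imp_big <- b1$deviance - b2$deviance
  print(c(glm = imp_glm, bigglm = imp_big))
}
```

If the bigglm() difference only turns positive after raising `maxit` or lowering `tolerance`, that would point to the default fit stopping before convergence rather than to a genuine disagreement between the two functions.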



R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Fri 29 Jun 2007 - 15:56:53 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 29 Jun 2007 - 17:32:34 GMT.
