[R] GLM/GAM and unobserved heterogeneity

From: Kyle G. Lundstedt <kylelundstedt_at_hotmail.com>
Date: Thu 18 Aug 2005 - 07:51:00 EST


     I'm interested in correcting for and measuring unobserved heterogeneity ("missing variables") using R. In particular, I'm searching for a simple way to measure the amount of unobserved heterogeneity remaining in a series of increasingly complex models (each adding variables to the previous one) fit to the same data.

     I have a static database of 400,000 or so individual mortgage loans, each of which is observed monthly from origination (t=0) until termination (a binary yes/no variable). In my update database, there are up to 60 months of observed data for each loan in the static database, and an individual loan has an "average life" of roughly 36 months.

     Each loan has static covariates observed at origination, such as original loan amount and credit score, as well as time-varying covariates (TVC) such as age, interest rates, and house prices. Because these TVC change each month, I've constructed a modeling database that merges the static database with the update database.

     The resulting "loan-month" modeling database has one observation for every loan-month, and the static covariates remain the same for all loan-months for a given loan. Thus, the modeling database has roughly 14.4 million loan-month records. A loan is considered "active" as long as it has not yet terminated or been censored; my interest is in predicting termination.
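     For illustration, the loan-month layout described above could be built along the following lines. This is only a minimal sketch on toy data: the column names (loan_id, orig_amount, credit_score, life_months, terminated) and the three example loans are assumptions made up for the example, not the actual database schema.

```r
# Hypothetical toy version of the static database: one row per loan, with the
# number of months observed and whether the loan terminated (1) or was
# censored (0) at the end of that window.
static <- data.frame(
  loan_id      = 1:3,
  orig_amount  = c(150000, 220000, 310000),
  credit_score = c(680, 720, 750),
  life_months  = c(3, 5, 2),
  terminated   = c(1, 0, 1)
)

# Repeat each loan's row once per observed month to get loan-month records.
idx <- rep(seq_len(nrow(static)), times = static$life_months)
loan_month <- static[idx, c("loan_id", "orig_amount", "credit_score")]
loan_month$month <- sequence(static$life_months)

# The event indicator is 1 only in the final month of a terminated loan.
loan_month$event <- as.integer(
  loan_month$month == static$life_months[idx] & static$terminated[idx] == 1
)
rownames(loan_month) <- NULL

# Size check at the scale described in the text:
# 400,000 loans times an average life of 36 months.
expected_rows <- 400000 * 36   # 14.4 million loan-months
```

     Time-varying covariates would then be attached with something like merge(loan_month, update_db, by = c("loan_id", "month")), where update_db is a hypothetical name for the monthly update table.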

     This type of data is often referred to as "event history" or "discrete hazard" data. The standard R package to apply to such data is "survival", with which I could estimate a Cox proportional hazards model using coxph. The advantage of such an approach is that unobserved heterogeneity is easily addressed by adding a "frailty" term to the model.
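     As a point of reference, a frailty term in coxph looks roughly like the sketch below. It uses the kidney dataset that ships with the survival package as a stand-in, since the mortgage data is not available here; the covariates age and sex belong to that dataset and have nothing to do with loans.

```r
library(survival)

# Stand-in data: 'kidney' has two observations per patient (id), so the
# frailty term captures patient-level unobserved heterogeneity.
fit_frailty <- coxph(
  Surv(time, status) ~ age + sex + frailty(id),
  data = kidney
)

# The printed fit includes a line reporting the estimated variance of the
# random effect, which is one way to gauge remaining heterogeneity.
print(fit_frailty)
```

     For the mortgage data the analogous term would presumably be frailty(loan_id), with loan_id again a hypothetical column name.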

     The disadvantages, at least for my purposes, are twofold. First, my audience is unfamiliar with hazard models. Second, my monthly data has many "ties" (many terminations in the same month), and I've been told that coxph won't work well on a large dataset with many ties.

     On the other hand, because the data is measured discretely each month, many references suggest applying generalized linear models (GLM, "logit"-type models) or even generalized additive models (GAM, "logit"-type models that incorporate nonlinearity in individual covariates). The advantage to this approach is that GLM and GAM are readily available in R, and my audience is very familiar with logit-type models.
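     To make the logit-type approach concrete, here is a minimal sketch of a discrete-time hazard model fit with glm on simulated loan-month data. The data-generating coefficients, sample sizes, and column names are all assumptions chosen for the example.

```r
set.seed(1)
n_loans    <- 500
max_months <- 24
credit_score <- rnorm(n_loans, mean = 700, sd = 50)

# Simulate each loan month by month; stop at the first termination, or at
# max_months if the loan is still active (censored).
rows <- lapply(seq_len(n_loans), function(i) {
  eta   <- -3 + 0.03 * seq_len(max_months) - 0.005 * (credit_score[i] - 700)
  event <- rbinom(max_months, size = 1, prob = plogis(eta))
  last  <- if (any(event == 1)) which(event == 1)[1] else max_months
  data.frame(
    loan_id      = i,
    month        = seq_len(last),
    credit_score = credit_score[i],
    event        = event[seq_len(last)]
  )
})
loan_month <- do.call(rbind, rows)

# Discrete-time hazard: logistic regression of the monthly termination
# indicator on loan age (month) and a static covariate.
fit_glm <- glm(event ~ month + credit_score,
               family = binomial, data = loan_month)
summary(fit_glm)
```

     Each fitted probability is the estimated chance that a loan still active at the start of a month terminates during that month. Note that this model has no term for unobserved heterogeneity.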

     The disadvantage, however, is that I am totally unfamiliar with ways to correct for and measure unobserved heterogeneity using GLM/GAM-type models. I've been told that unobserved heterogeneity in the hazard framework is analogous to random effects in the GLM/GAM framework, but there seem to be a number of R packages that address this issue in different ways.
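     One of several possible ways to express the random-effects analogy is a random intercept per loan, which the mgcv package supports through a smooth with bs = "re". The sketch below is an illustration on simulated data, not a recommendation: the frailty standard deviation, coefficients, and names are assumptions, and with at most one termination per loan the random-effect variance may be only weakly identified.

```r
library(mgcv)

set.seed(2)
n_loans    <- 200
max_months <- 24
frailty    <- rnorm(n_loans, mean = 0, sd = 0.8)  # unobserved loan-level effect
score_z    <- rnorm(n_loans)                      # standardized credit score

# Simulate monthly termination with a loan-specific shift in the log-odds.
rows <- lapply(seq_len(n_loans), function(i) {
  eta   <- -3 + 0.03 * seq_len(max_months) - 0.3 * score_z[i] + frailty[i]
  event <- rbinom(max_months, size = 1, prob = plogis(eta))
  last  <- if (any(event == 1)) which(event == 1)[1] else max_months
  data.frame(loan_id = i, month = seq_len(last),
             score_z = score_z[i], event = event[seq_len(last)])
})
loan_month <- do.call(rbind, rows)
loan_month$loan_id <- factor(loan_month$loan_id)  # bs = "re" needs a factor

# Logit GAM: smooth baseline hazard in loan age, a linear covariate effect,
# and a Gaussian random intercept for each loan.
fit_re <- gam(
  event ~ s(month, k = 5) + score_z + s(loan_id, bs = "re"),
  family = binomial, data = loan_month, method = "REML"
)
summary(fit_re)
```

     In the summary, the s(loan_id) row reports the effective degrees of freedom used by the random intercepts; values near zero suggest little remaining loan-level heterogeneity. mgcv also provides gam.vcomp() to express the term as a standard deviation. This approach may not scale to hundreds of thousands of loans without further work.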

     So, I'd greatly appreciate suggestions on a simple way to incorporate unobserved heterogeneity into a GLM/GAM-type model. I'm not much of a statistician, so simple examples are always helpful. I'm also happy to track down specific article/book references, if folks think those might be of help.

Many thanks,

kyle  at  hotmail . com

(email altered in obvious ways)
Received on Thu Aug 18 07:56:02 2005
