Re: [R] Possible overfitting of a GAM

From: <Bill.Venables_at_csiro.au>
Date: Sun, 17 Feb 2008 11:39:18 +1000

Thomas L Jones asks:

> The subject is a Generalized Additive Model. Experts caution us
> against overfitting the data, which can cause inaccurate results.

Inaccurate *predictions*, to be more precise. The main problem with overfitting is that your model will capture too much of the noise in the data along with the signal. That noise then turns into prediction error. Randomness is not the absence of pattern: it can sometimes appear as a fairly striking pattern. The problem is that next time it's a different pattern.
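
To make the point concrete, here is a minimal sketch in Python (standard library only, not the code used in the study). Everything in it is an assumption made for illustration: the arrival-rate curve is invented, the "model" is a crude running mean rather than a GAM, and the names rpois, running_mean and mse are hypothetical helpers. It fits a very wiggly and a smooth curve to one simulated day, then scores both against a second day with the same signal but fresh noise.

```python
import math
import random

def rpois(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's method; fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def running_mean(y, half_width):
    """Crude smoother: average of y over a window of +/- half_width bins."""
    n = len(y)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out

def mse(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

rng = random.Random(2008)
n_bins = 156  # 13 hours of five-minute bins, as in the study

# Invented smooth arrival rate with two peaks (not taken from the real data).
rate = [4 + 3 * math.sin(2 * math.pi * i / n_bins) ** 2 for i in range(n_bins)]

today = [rpois(rng, r) for r in rate]      # data the curves are fitted to
next_time = [rpois(rng, r) for r in rate]  # same signal, fresh noise

wiggly = running_mean(today, 0)   # half_width 0 reproduces the data exactly
smooth = running_mean(today, 10)  # averages over up to 21 bins

print("fit to today's data:  wiggly %.2f  smooth %.2f"
      % (mse(wiggly, today), mse(smooth, today)))
print("prediction next time: wiggly %.2f  smooth %.2f"
      % (mse(wiggly, next_time), mse(smooth, next_time)))
```

The wiggly curve fits today's counts perfectly because it has memorised today's noise. Scored against the second day, that memorised noise counts against it, and the smooth curve should come out ahead.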

> I am not a statistician (my background is in Computer
> Science). Perhaps some kind soul would take a look and vet the model
> for overfitting the data.

You haven't given us very much to go on: just plots. To help you, we need to see what you have really done, not just what you think you've done. This requires us to see some code (and data wouldn't hurt, either).

>
> The study estimated the ebb and flow of traffic through a voting
> place. Just one voting place was studied; the election was the
> U.S. mid-term election about a year ago. Procedure: The voting day
> was divided into five-minute bins, and the number of voters arriving
> in each bin was recorded. The voting day was 13 hours long, giving
> 156 bins.
>
> See http://tinyurl.com/36vzop for the scatterplot. There is a rather
> high random variation, due in part to the fact that the bin width
> was intentionally set to be narrow, in order to improve the amount
> of timing information gathered.

A natural sort of model to consider first would be a Poisson model with a log link. Is that what you used? You may need to be a bit careful about overdispersion (more variation in the counts than a Poisson model allows for) if you want realistic standard errors.
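
One common rough check for overdispersion is the Pearson chi-square statistic divided by the residual degrees of freedom: values well above 1 suggest that Poisson standard errors are too small, and a quasi-Poisson style correction scales them by the square root of that ratio. The sketch below is in Python (standard library only). The function name pearson_dispersion is hypothetical, and the counts and fitted means are made up, not taken from the voting data.

```python
import math

def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by residual degrees of freedom.

    y        observed counts
    mu       fitted means on the response (count) scale
    n_params number of parameters used by the fit; for a smoother this
             would be its effective degrees of freedom, which need not
             be a whole number
    """
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Made-up counts and fitted means, for illustration only.
counts = [3, 9, 1, 12, 4, 0, 8, 11]
fitted = [5.0, 6.0, 4.0, 7.0, 6.0, 4.0, 7.0, 6.0]

phi = pearson_dispersion(counts, fitted, n_params=2)
se_inflation = math.sqrt(phi)  # quasi-Poisson style scaling factor

print("estimated dispersion: %.2f" % phi)
print("scale Poisson standard errors by about %.2f" % se_inflation)
```

With a real fit, the fitted means must be on the count scale, not the log scale, before being passed in.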

>
> http://tinyurl.com/3xjsyo displays the fitted curve. A GAM was used,
> with the loess smoothing algorithm (locally weighted
> regression). The default span was used. http://tinyurl.com/34av6l
> gives the scatterplot and the fitted curve. The two seem to match
> reasonably well.
>

This looks pretty reasonable to me.

> However, when I tried to generate the standard errors, things went
> awry. (Please see http://tinyurl.com/38ej2t ) There are three
> curves, seemingly the fitted curve and the curves for plus and minus
> two standard errors. The shapes seem okay, but there are large
> errors in the y values.

How did you "try to generate standard errors"? This is where actual code becomes important to work out what you have really done.

This looks to me like a plot of the additive component of the model on the log scale, with standard errors on that. This would explain why the component is on a totally different scale from the one you show above (there you had the response scale), and in particular why it goes negative. That would also account for the apparent distortion in the curve itself relative to its image on the response scale. Components, by construction, have mean zero. It's the intercept that lifts them to the right level for predictions, and the inverse link that takes them back to the response scale.
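
The arithmetic behind that last step can be sketched as follows, in Python (standard library only). The function to_response_scale is a hypothetical helper, and the intercept, component values and standard errors are made up. The sketch also assumes that the component's standard error can stand in for that of the whole linear predictor, which ignores uncertainty in the intercept. Modelling software can usually report standard errors for the full predictor directly.

```python
import math

def to_response_scale(intercept, component, se, n_se=2.0):
    """Back-transform a log-link additive component to the count scale.

    Returns (fit, lower, upper) lists. The band is built on the log scale
    and then exponentiated, so it is asymmetric and never goes negative.
    Assumption: `se` is treated as the standard error of the whole linear
    predictor, ignoring any uncertainty in the intercept.
    """
    fit, lower, upper = [], [], []
    for f, s in zip(component, se):
        eta = intercept + f          # linear predictor on the log scale
        fit.append(math.exp(eta))    # inverse of the log link
        lower.append(math.exp(eta - n_se * s))
        upper.append(math.exp(eta + n_se * s))
    return fit, lower, upper

# Made-up values: a centred component (mean zero) and its standard errors.
component = [-0.6, -0.1, 0.4, 0.3]
se = [0.20, 0.15, 0.15, 0.20]
intercept = math.log(5.0)  # hypothetical level of about 5 voters per bin

fit, lower, upper = to_response_scale(intercept, component, se)
for row in zip(component, fit, lower, upper):
    print("component %+.2f -> fit %.2f  (%.2f, %.2f)" % row)
```

A negative component value therefore does not mean a negative count. It means a rate below the level set by the intercept.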

>
> Question: Have I overfitted the data?

Most likely not. You may need to understand the model you are fitting a bit more, though, as well as the tools.

>
> Feedback?
>
> Tom
> Thomas L. Jones, PhD, Computer Science
>



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Sun 17 Feb 2008 - 01:41:55 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 17 Feb 2008 - 09:30:15 GMT.
