From: David Jarvis <thangalin_at_gmail.com>

Date: Fri, 18 Jun 2010 21:44:17 -0700

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Sat 19 Jun 2010 - 04:47:04 GMT

y <- seq(1900, 2009)     # years

o <- runif(110, 9, 15)   # observed values (random stand-ins)

m <- data.frame( y, fitted( gam( o ~ s(y) ) ) )   # gam() as provided by, e.g., the mgcv package

The values from m are then actually plotted as the trend line depicted at:

What I am trying to do now is to calculate how accurately GAM fits the data. The suggestion I was given was to use RMSE on the observed data versus the model data. It was also suggested that I use mean bins, with each bin containing 5 values, to reduce the amount of error in the calculation. Algorithmically, I pictured it as:

1. Let index = 1
2. Let size = 5
3. Let o = vector of observed data
4. Let ob = empty vector
5. Append mean( o[index:(index+size-1)] ) into ob
6. Let index = index + size
7. Repeat from Step 5 until no more elements in o

At this point, ob would contain the averages of the first five values, the second five values, and so on. Thus length( ob ) = ceiling( length( o ) / 5 ), counting a final bin that holds fewer than five values.
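The binning steps above can be sketched without an explicit loop. This is only one possible way to do it: bin_means is a hypothetical helper name, not something from a package, and it assumes that a shorter final bin should simply be averaged over whatever values it holds.

```r
# Hypothetical helper: mean of consecutive bins of `size` values.
# A shorter final bin is averaged over the values it actually has.
bin_means <- function(v, size = 5) {
  # Bin label for each element: 1,1,1,1,1,2,2,2,2,2,...
  bins <- ceiling(seq_along(v) / size)
  # tapply() applies mean() within each bin label
  as.vector(tapply(v, bins, mean))
}

set.seed(1)                 # reproducible stand-in data
o  <- runif(110, 9, 15)     # observed values, as in the post
ob <- bin_means(o)
length(ob)                  # 22 bins for 110 values
```

With 37 data points the same call gives 8 bins, the last one averaging only two values.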

I would then repeat the same calculation on m to get mb, the model's bins.
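A rough sketch of that step, assuming gam() and s() come from the mgcv package (the post does not say which package is in use) and using random stand-in data. The column names year and fitted are made up for the example.

```r
library(mgcv)   # assumption: gam() and s() come from mgcv

set.seed(42)                      # reproducible stand-in data
y <- seq(1900, 2009)              # years
o <- runif(110, 9, 15)            # observed values

fit <- gam(o ~ s(y))              # smooth trend of observations over years
m   <- data.frame(year = y, fitted = fitted(fit))

# Same 5-value bins as for the observed data
bins <- ceiling(seq_along(m$fitted) / 5)
mb   <- as.vector(tapply(m$fitted, bins, mean))
length(mb)
```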

With those averages, I could use ob and mb to calculate the normalized root mean square deviation:

nrmse <- sqrt( mean( (ob - mb) ^ 2 ) ) / (max( ob ) - min( ob ))

Then turn that into a percentage:

100 * (1 - nrmse)

At that point I was hoping I could say that, in general, the result indicates how closely the model fits the data. The closer to 100%, the more accurate the trend line.
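For reference, a minimal sketch of the calculation as a function. The numbers are made up purely for illustration, and the percentage line assumes the NRMSE is treated as a fraction of the observed range.

```r
# Normalized root mean square deviation between observed and model values.
# The differences are squared before the mean is taken.
nrmse <- function(obs, mod) {
  rmse <- sqrt(mean((obs - mod)^2))
  rmse / (max(obs) - min(obs))
}

# Made-up numbers for illustration only
ob <- c(10, 12, 14, 11, 13)
mb <- c(11, 12, 13, 11, 13)

e <- nrmse(ob, mb)
100 * (1 - e)   # one way to express closeness as a percentage
```

The placement of the square matters: in this example the errors are -1 and +1, so mean( ob - mb ) is zero, and squaring after taking the mean would report a perfect fit.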

As you can tell, I have very little experience in statistics and R, so any feedback, suggestions, or general guidance would be greatly appreciated.

Dave

P.S.

The years, the type of weather data, and the locations where the measurements
were taken can all be selected by users when they generate the report. So
sometimes the data will have 110 years, inclusive, other times it could be
37 years (thus 37 data points). So choosing to average 5 elements per bin is
a bit arbitrary... I am looking to get something working first before
tweaking the possible parameters for the calculation.

Thanks again!


