Re: [Rd] application to mentor syrfr package development for Google Summer of Code 2010

From: Chidambaram Annamalai <quantumelixir_at_gmail.com>
Date: Mon, 08 Mar 2010 12:09:27 +0530

> If I understand your concern, you want to lay the foundation for
> derivatives so that you can implement the search strategies described
> in Schmidt and Lipson (2010) --
> http://www.springerlink.com/content/l79v2183725413w0/ -- is that
> right?

Yes. Traditional "naive" error estimators, or fitness functions, fail miserably when used for symbolic regression (SR) on implicit equations, because they immediately close in on trivial "best" fits such as f(x) = x - x. No amount of regularization or complexity penalization helps in such cases, since x - x is simple by most measures of complexity and does have zero error. The paper outlines these problems with "direct" error estimators and instead infers the "triviality" of a fit by probing its estimates at nearby points and checking whether they follow the pattern dictated by the data points -- ergo derivatives.
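To make the failure concrete, here is a minimal sketch in Python. The data set (points on a unit circle), the candidate functions, and the name direct_error are all assumptions chosen for illustration; none of them come from the paper. It scores candidates f(x, y) = 0 by their mean absolute residual, which is the kind of "direct" estimator described above.

```python
import math

# Hypothetical data: 50 points on the unit circle, i.e. x^2 + y^2 - 1 = 0.
points = [(math.cos(2 * math.pi * k / 50), math.sin(2 * math.pi * k / 50))
          for k in range(50)]

def direct_error(f, data):
    """Mean absolute residual |f(x, y)| over the data (the "naive" fitness)."""
    return sum(abs(f(x, y)) for x, y in data) / len(data)

def circle(x, y):
    # The relation we would actually like the search to recover.
    return x * x + y * y - 1.0

def trivial(x, y):
    # Degenerate candidate: identically zero, so it "fits" any data set.
    return x - x

def line(x, y):
    # A genuinely wrong candidate, for contrast.
    return x + y - 1.0

errors = {f.__name__: direct_error(f, points) for f in (circle, trivial, line)}
for name, err in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{name:8s} {err:.3e}")
```

The trivial candidate can never score worse than the true relation under this measure, so a search guided by it has no reason to prefer the circle.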

As a side benefit, this method also lets us perform regression on closed loops and other implicit equations, since the fitness functions are based only on derivatives. The specific form of the error is equation 1.2 in the paper, which, I believe, is the core of the evaluation procedure used in Eureqa.
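A rough sketch of a derivative-based fitness, again in Python. This is NOT equation 1.2 from the paper; it is a simplified stand-in I am assuming for illustration. It checks whether the gradient of a candidate f is perpendicular to the tangent direction estimated from ordered data points, which is what f(x, y) = 0 along the curve would require. The name tangent_mismatch, the step size h, and the rule that a vanishing gradient gets the worst score are all assumptions.

```python
import math

# Hypothetical data: ordered samples around a closed loop (the unit circle).
points = [(math.cos(2 * math.pi * k / 50), math.sin(2 * math.pi * k / 50))
          for k in range(50)]

def tangent_mismatch(f, data, h=1e-5):
    """Mean |cos(angle)| between grad f and the data's tangent direction.

    0 means the gradient is perpendicular to the curve at every sample;
    1 is the worst score. A candidate whose gradient vanishes says nothing
    about the curve, so it is assigned the worst score (an assumption).
    """
    n = len(data)
    total = 0.0
    for i, (x, y) in enumerate(data):
        # Tangent from the two ordered neighbours; indices wrap around
        # because the curve is a closed loop.
        (xp, yp), (xn, yn) = data[i - 1], data[(i + 1) % n]
        tx, ty = xn - xp, yn - yp
        # Gradient of the candidate by central differences.
        fx = (f(x + h, y) - f(x - h, y)) / (2 * h)
        fy = (f(x, y + h) - f(x, y - h)) / (2 * h)
        grad_norm = math.hypot(fx, fy)
        tan_norm = math.hypot(tx, ty)
        if grad_norm < 1e-12 or tan_norm == 0.0:
            total += 1.0
        else:
            total += abs(fx * tx + fy * ty) / (grad_norm * tan_norm)
    return total / n

def circle(x, y):
    return x * x + y * y - 1.0

def trivial(x, y):
    return x - x

def line(x, y):
    return x + y - 1.0

scores = {f.__name__: tangent_mismatch(f, points) for f in (circle, trivial, line)}
print(scores)
```

Because only directions are compared, the measure stays defined all the way around the loop, including where the slope dy/dx is vertical, and the trivial candidate no longer wins.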

You are correct that there is no reason not to work in parallel, since GAs generally have a more or less fixed form (an evaluate-reproduce cycle) that is quite easily parallelized. I have used OpenMP in the past, where it is fairly trivial to parallelize well-formed for loops.
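As a sketch of why the evaluate step parallelizes so easily, here is a toy evaluate-reproduce loop in Python using the standard concurrent.futures module (not OpenMP). The fitness function, population size, mutation width, and the name evolve are hypothetical placeholders; a real symbolic-regression GA would evolve expression trees, not numbers.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fitness(x):
    """Toy fitness with a single maximum at x = 3 (hypothetical stand-in)."""
    return -(x - 3.0) ** 2

def evolve(generations=40, size=20, seed=0):
    rng = random.Random(seed)
    population = [rng.uniform(-10.0, 10.0) for _ in range(size)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(generations):
            # Evaluate: individuals are scored independently of each other,
            # so the map can be spread over workers without changing results.
            scores = list(pool.map(fitness, population))
            ranked = [x for _, x in
                      sorted(zip(scores, population), reverse=True)]
            # Reproduce: keep the better half, add mutated copies of it.
            parents = ranked[: size // 2]
            children = [p + rng.gauss(0.0, 0.5) for p in parents]
            population = parents + children
    return max(population, key=fitness)

best = evolve()
print(best)
```

Threads are used here only to keep the sketch short; for CPU-bound fitness functions in Python, a process pool (ProcessPoolExecutor from the same module) would likely be the better fit. The structure mirrors an OpenMP parallel for over the population.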

Chillu

> It is not clear to me how well this generalized approach will
> work in practice, but there is no reason not to proceed in parallel to
> establish a framework under which you could implement the metrics
> proposed by Schmidt and Lipson in the contemplated syrfr package.
>
> I have expanded the test I proposed with two more questions -- at
> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
> -- specifically:
>
> 5. Critique http://sites.google.com/site/gptips4matlab/
>
> 6. Use anova to compare the goodness-of-fit of a SSfpl nls fit with a
> linear model of your choice. How can you characterize the
> degree-of-freedom-adjusted goodness of fit of nonlinear models?
>
> I believe pairwise anova.nls is the optimal comparison for nonlinear
> models, but there are several good choices for approximations,
> including the residual standard error, which I believe can be adjusted
> for degrees of freedom, as can the F statistic which TableCurve uses;
> see: http://en.wikipedia.org/wiki/F-test#Regression_problems
>
> Best regards,
> James Salsman
>
>
> On Sun, Mar 7, 2010 at 7:35 PM, Chidambaram Annamalai
> <quantumelixir_at_gmail.com> wrote:
> > It's been a while since I proposed syrfr and I have been constantly in
> > contact with the many people in the R community and I wasn't able to find a
> > mentor for the project. I later got interested in the Automatic
> > Differentiation proposal (adinr) and, on consulting with a few others within
> > the R community, I mailed John Nash (who proposed adinr in the first place)
> > if he'd be willing to take me up on the project. I got a positive reply only
> > a few hours ago and it was my mistake to have not removed the syrfr proposal
> > in time from the wiki, as being listed under proposals looking for mentors.
> >
> > While I appreciate your interest in the syrfr proposal I am afraid my
> > allegiances have shifted towards the adinr proposal, as I got convinced that
> > it might interest a larger group of people and it has wider scope in
> > general.
> >
> > I apologize for having caused this trouble.
> >
> > Best Regards,
> > Chillu
> >
> > On Mon, Mar 8, 2010 at 6:41 AM, James Salsman <jsalsman_at_talknicer.com>
> > wrote:
> >>
> >> Per http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010
> >> -- and
> >> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
> >> -- I am applying to mentor the "Symbolic Regression for R" (syrfr)
> >> package for the Google Summer of Code 2010.
> >>
> >> I propose the following test which an applicant would have to pass in
> >> order to qualify for the topic:
> >>
> >> 1. Describe each of the following terms as they relate to statistical
> >> regression: categorical, periodic, modular, continuous, bimodal,
> >> log-normal, logistic, Gompertz, and nonlinear.
> >>
> >> 2. Explain which parts of http://bit.ly/tablecurve were adopted in
> >> SigmaPlot and which weren't.
> >>
> >> 3. Use the 'outliers' package to improve a regression fit maintaining
> >> the correct extrapolation confidence intervals as are between those
> >> with and without outlier exclusions in proportion to the confidence
> >> that the outliers were reasonably excluded. (Show your R transcript.)
> >>
> >> 4. Explain the relationship between degrees of freedom and correlated
> >> independent variables.
> >>
> >> Best regards,
> >>
> >> James Salsman
> >> jsalsman_at_talknicer.com
> >> http://talknicer.com
> >>
> >> ______________________________________________
> >> R-devel_at_r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> >
>

R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Mon 08 Mar 2010 - 06:41:46 GMT

