Re: [Rd] application to mentor syrfr package development for Google Summer of Code 2010

From: Michael Schmidt <mds47_at_cornell.edu>
Date: Mon, 08 Mar 2010 03:41:52 -0500

Hi James,

Thanks for contacting me. Eureqa takes into account the total size of an equation when comparing different candidate models. It attempts to find the set of possible equations that are non-dominated in both error and size. The final result is a short list consisting of the most accurate equation at each increasing equation size.

This is closely related to degrees of freedom, but not exactly the same because Eureqa needs to search for both the structure of the equations and their parameters simultaneously.
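
A minimal sketch of the error-versus-size selection described above, in Python. The `Candidate` fields, the use of a node count as the size measure, and the sample numbers are illustrative assumptions, not Eureqa's actual implementation:

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    expression: str  # human-readable form of the model
    size: int        # assumed complexity measure, e.g. expression node count
    error: float     # assumed fit error on the data; lower is better


def pareto_front(candidates):
    """Return the candidates that are non-dominated in (size, error),
    ordered by increasing size."""
    front = []
    best_error = float("inf")
    # Sort by size (ties broken by error), then sweep once: a candidate is
    # kept only if it is strictly more accurate than every candidate of
    # equal or smaller size seen so far.
    for cand in sorted(candidates, key=lambda c: (c.size, c.error)):
        if cand.error < best_error:
            front.append(cand)
            best_error = cand.error
    return front


# Made-up candidates for illustration only.
candidates = [
    Candidate("c", 1, 4.0),
    Candidate("a*x", 3, 1.5),
    Candidate("a*x + b", 5, 0.9),
    Candidate("a*x*x", 5, 1.2),
    Candidate("sin(x) + x", 6, 1.1),
    Candidate("a*x + b*sin(x)", 8, 0.2),
]

for cand in pareto_front(candidates):
    print(cand.size, cand.error, cand.expression)
```

Here `a*x*x` and `sin(x) + x` are dropped because `a*x + b` is at least as small and more accurate; the remaining four form the short list, one entry per size at which accuracy improves.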

Michael

On Mon, Mar 8, 2010 at 2:49 AM, James Salsman <jsalsman_at_talknicer.com> wrote:

> Chillu, I meant that development of a syrfr R package capable of
> using either F statistics or parametric derivatives should proceed in
> parallel with your work on such a derivatives package. You are right
> that genetic algorithm search (and general best-first search --
> http://en.wikipedia.org/wiki/Best-first_search -- of which genetic
> algorithms are various special cases) can be very effectively
> parallelized, too.
>
> In any case, thank you for pointing out Eureqa --
> http://ccsl.mae.cornell.edu/eureqa -- but I can see no evidence there
> or in the user manual or user forums that Eureqa is considering
> degrees of freedom in its goodness-of-fit estimation. That is a
> serious problem which will typically result in invalid symbolic
> regression. I am sending this message also to Michael Schmidt so that
> he might be able to comment on the extent to which Eureqa adjusts for
> degrees of freedom in his fit evaluations.
>
> Best regards,
> James Salsman
>
> On Sun, Mar 7, 2010 at 10:39 PM, Chidambaram Annamalai
> <quantumelixir_at_gmail.com> wrote:
> >
> >> If I understand your concern, you want to lay the foundation for
> >> derivatives so that you can implement the search strategies described
> >> in Schmidt and Lipson (2010) --
> >> http://www.springerlink.com/content/l79v2183725413w0/ -- is that
> >> right?
> >
> > Yes. Basically traditional "naive" error estimators or fitness functions
> > fail miserably when used in SR with implicit equations because they
> > immediately close in on "best" fits like f(x) = x - x and other trivial
> > solutions. In such cases no amount of regularization and complexity
> > penalizing methods will help since x - x is fairly simple by most measures
> > of complexity and it does have zero error. So the paper outlines such
> > problems associated with "direct" error estimators and thus they infer the
> > "triviality" of the fit by probing its estimates around nearby points and
> > seeing if it does follow the pattern dictated by the data points -- ergo
> > derivatives.
> >
> > Also, as a side benefit, this method enables us to perform
> > regression on closed loops and other implicit equations since the fitness
> > functions are based only on derivatives. The specific form of the error is
> > equation 1.2 which, I believe, comprises the internals of the
> > evaluation procedure used in Eureqa.
> >
> > You are correct in pointing out that there is no reason to not work in
> > parallel, since GAs generally have a more or less fixed form
> > (evaluate-reproduce cycle) which is quite easily parallelized. I have used
> > OpenMP in the past, in which it is fairly trivial to parallelize
> > well-formed for loops.
> >
> > Chillu
> >
> >> It is not clear to me how well this generalized approach will
> >> work in practice, but there is no reason not to proceed in parallel to
> >> establish a framework under which you could implement the metrics
> >> proposed by Schmidt and Lipson in the contemplated syrfr package.
> >>
> >> I have expanded the test I proposed with two more questions -- at
> >> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
> >> -- specifically:
> >>
> >> 5. Critique http://sites.google.com/site/gptips4matlab/
> >>
> >> 6. Use anova to compare the goodness-of-fit of a SSfpl nls fit with a
> >> linear model of your choice. How can you characterize the
> >> degree-of-freedom-adjusted goodness of fit of nonlinear models?
> >>
> >> I believe pairwise anova.nls is the optimal comparison for nonlinear
> >> models, but there are several good choices for approximations,
> >> including the residual standard error, which I believe can be adjusted
> >> for degrees of freedom, as can the F statistic which TableCurve uses;
> >> see: http://en.wikipedia.org/wiki/F-test#Regression_problems
> >>
> >> Best regards,
> >> James Salsman
> >>
> >>
> >> On Sun, Mar 7, 2010 at 7:35 PM, Chidambaram Annamalai
> >> <quantumelixir_at_gmail.com> wrote:
> >> > It's been a while since I proposed syrfr and I have been constantly in
> >> > contact with the many people in the R community and I wasn't able to
> >> > find a mentor for the project. I later got interested in the Automatic
> >> > Differentiation proposal (adinr) and, on consulting with a few others
> >> > within the R community, I mailed John Nash (who proposed adinr in the
> >> > first place) if he'd be willing to take me up on the project. I got a
> >> > positive reply only a few hours ago and it was my mistake to have not
> >> > removed the syrfr proposal in time from the wiki, as being listed under
> >> > proposals looking for mentors.
> >> >
> >> > While I appreciate your interest in the syrfr proposal I am afraid my
> >> > allegiances have shifted towards the adinr proposal, as I got convinced
> >> > that it might interest a larger group of people and it has wider scope
> >> > in general.
> >> >
> >> > I apologize for having caused this trouble.
> >> >
> >> > Best Regards,
> >> > Chillu
> >> >
> >> > On Mon, Mar 8, 2010 at 6:41 AM, James Salsman <jsalsman_at_talknicer.com>
> >> > wrote:
> >> >>
> >> >> Per http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010
> >> >> -- and
> >> >> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
> >> >> -- I am applying to mentor the "Symbolic Regression for R" (syrfr)
> >> >> package for the Google Summer of Code 2010.
> >> >>
> >> >> I propose the following test which an applicant would have to pass in
> >> >> order to qualify for the topic:
> >> >>
> >> >> 1. Describe each of the following terms as they relate to statistical
> >> >> regression: categorical, periodic, modular, continuous, bimodal,
> >> >> log-normal, logistic, Gompertz, and nonlinear.
> >> >>
> >> >> 2. Explain which parts of http://bit.ly/tablecurve were adopted in
> >> >> SigmaPlot and which weren't.
> >> >>
> >> >> 3. Use the 'outliers' package to improve a regression fit maintaining
> >> >> the correct extrapolation confidence intervals as are between those
> >> >> with and without outlier exclusions in proportion to the confidence
> >> >> that the outliers were reasonably excluded. (Show your R transcript.)
> >> >>
> >> >> 4. Explain the relationship between degrees of freedom and correlated
> >> >> independent variables.
> >> >>
> >> >> Best regards,
> >> >>
> >> >> James Salsman
> >> >> jsalsman_at_talknicer.com
> >> >> http://talknicer.com
> >> >>
> >> >> ______________________________________________
> >> >> R-devel_at_r-project.org mailing list
> >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >> >
> >> >
> >
> >
>

-- 
Michael Schmidt
Cornell Computational Synthesis Lab
Cornell University, 239 Upson Hall, Ithaca, NY 14853
email: mds47_at_cornell.edu

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Mon 08 Mar 2010 - 09:04:43 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 09 Mar 2010 - 07:50:57 GMT.
