From: Steve Lianoglou <mailinglist.honeypot_at_gmail.com>

Date: Thu, 05 Jan 2012 10:16:54 -0500

Good stuff, Max!

Would also be nice to nail your 14 theses to a more permanent wall than the r-help mailing list ... not sure where that would be, though ... isn't someone supposed to be redesigning the r-project.org website? [I jest, I jest] More seriously, though, it might be worth linking to from the developer.r-project.org site as well as from some blurb in the header of the ML task view.

-steve

On Wed, Jan 4, 2012 at 9:19 AM, Max Kuhn <mxkuhn_at_gmail.com> wrote:

> Working on the caret package has exposed me to the wide variety of
> approaches that different authors have taken to creating predictive
> modeling functions (aka machine learning)(aka pattern recognition).
>
> I suspect that many package authors are neophyte R users and are
> stumbling through the process of writing their first R package (or R
> code). As such, they may not have been exposed to some of the informal
> conventions that have evolved over time. Also, their package may be
> intended to demonstrate their research and not for "production"
> modeling. In any case, it might be a good idea to print up a few
> points for consideration when creating a predictive modeling package.
> I don't propose changes to existing code.
>
> Some of this is obvious and not limited to this class of modeling
> packages. Many of these points are arguable, so please do so.
>
> If this seems useful, perhaps we could repost the final list to R-Help
> to use as a checklist.
>
> Those of you who have used my code will probably realize that I am not
> a grand architect of R packages =] I'd love to get feedback from those
> of you with a broader perspective and better software engineering
> skills than I (a low bar to step over).
>
> I have marked a few of these items with an OCD tag since I might be
> taking it a bit too far.
>
> The list:
>
> (1) Extend the work of others. There is an amazing amount of unneeded
> redundancy. There are plenty of times that users implement their own
> version of a function because there is a missing feature, but a lot
> of time is spent re-creating duplicate functions. For example, kernlab
> has an excellent set of kernel functions that are really efficient and
> have useful ancillary functions. People may not be aware of these
> functions, but they are one RSiteSearch away. (Perhaps we could
> nominate a few packages like kernlab that implement a specific tool
> well)
>
> (2) When modeling a categorical outcome, use a factor as input (as
> opposed to 0/1 indicators or integers). Factors are exactly the kind
> of feature that separates R from other languages (I'm looking at you
> SAS) and is a natural structure for this data type.
>
> corollary (2a): save the factor levels in the model object somewhere
>
> corollary (2b): return predicted classes as factors with the same
> levels (and ordering of levels).
>
> (3) Implement a separate prediction function. Some packages only make
> predictions when the model is built, so effectively the model cannot
> be used at any point in the future.
>
> corollary (3a): use object-orientation (eg. predict.{class}) and not
> some made-up function name "modelPredict()" for predicting new
> samples.
>
> (4) If the method only accepts a specific type of input (eg. matrix or
> data frame), please do the conversion whenever appropriate.
>
> (5) Provide a formula interface (eg. foo(y~x, data = dat)) and
> non-formula interface (foo(x, y)) to the function. Formula methods are
> really inefficient at this time for large dimensional data but are
> fantastically convenient. There are some good reasons to not use
> formulas, such as functions that do not use a design matrix (eg.
> cforest()) or need factors to be handled in a non-standard way (eg.
> cubist()).
>
> (6) Don't require a test set when model building.
>
> (7) Control all written output during model-building time with a
> verbose option. Resampling can make a mess out of things if
> output/logging is always exposed.
>
> (8) Please use RSiteSearch to avoid name collisions between packages
> (eg. gam(), splsda(), roc(), LogitBoost()). Also search Bioconductor.
>
> (9) Allow the predict function to generate results from many different
> sub-models simultaneously. For example, pls() can return predictions
> across many values of ncomp. enet(), cubist(), blackboost() are other
> examples.
>
> corollary (9a): [OCD] ensure the same object type for predictions.
> There are occasions where predict() will return a vector or a matrix
> depending on the context. I would argue that this is not optimal.
>
> (10) Use a limited vocabulary for options. For example, some predict()
> functions have a "type" option to switch between predicted classes
> and class probabilities. Values of "type" pertaining to class
> probabilities range from "prob", "probability", "posterior", "raw",
> "response", etc. I'll make a suggestion of "prob" as a possible
> standard for this situation.
>
> (11) Make sure that class probabilities sum to one. Seriously.
>
> (12) If the model implicitly conducts feature selection, do not
> require un-used predictors to be present in future data sets for
> prediction. This may be a problem when the formula interface to models
> is used, but it looks like many functions reference columns by
> position and not name.
>
> (13) Packages that have their own cross-validation functions should
> allow the users to pass in the specific folds/resampling indicators to
> maintain consistency across similar functions in other packages.
>
> (14) [OCD] For binary classification models, model the probability of
> the first level of a factor as the event of interest (again, for
> consistency). Note that glm() does not do this, but most others use
> the first level.
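
For what it's worth, here is a minimal sketch of how points (2), (2a), (2b), (3a), (4), (10) and (11) could fit together in one small model. Everything in it is an assumption made up for illustration: nearest_centroid() is a hypothetical toy classifier, not a function from caret or any other package, and its "probabilities" are just a normalised distance score, not a calibrated estimate.

```r
# Hypothetical toy classifier (not from any package): nearest class centroid.
nearest_centroid <- function(x, y) {
  stopifnot(is.factor(y))              # point (2): the outcome is a factor
  x <- as.matrix(x)                    # point (4): convert the input ourselves
  centroids <- do.call(rbind, lapply(levels(y), function(lv) {
    colMeans(x[y == lv, , drop = FALSE])
  }))
  rownames(centroids) <- levels(y)
  structure(list(centroids = centroids,
                 levels = levels(y)),  # point (2a): keep the factor levels
            class = "nearest_centroid")
}

# Point (3a): a predict() S3 method rather than a made-up function name.
predict.nearest_centroid <- function(object, newdata,
                                     type = c("class", "prob"), ...) {
  type <- match.arg(type)              # point (10): "prob" for probabilities
  newdata <- as.matrix(newdata)
  # Squared Euclidean distance from each new sample to each centroid (n x K).
  d2 <- vapply(seq_len(nrow(object$centroids)), function(k) {
    rowSums(sweep(newdata, 2, object$centroids[k, ])^2)
  }, numeric(nrow(newdata)))
  d2 <- matrix(d2, nrow = nrow(newdata))
  if (type == "prob") {
    # Illustrative score only: softmax of negative distances.
    w <- exp(-(d2 - apply(d2, 1, min)))
    prob <- w / rowSums(w)             # point (11): rows sum to one
    colnames(prob) <- object$levels
    return(prob)
  }
  # Point (2b): a factor with the same levels, in the same order.
  factor(object$levels[max.col(-d2, ties.method = "first")],
         levels = object$levels)
}

fit  <- nearest_centroid(iris[, 1:4], iris$Species)
pred <- predict(fit, iris[1:5, 1:4])
prob <- predict(fit, iris[1:5, 1:4], type = "prob")
```

The fitted object can be saved and used for prediction later, since nothing about the new samples is needed at fitting time (points (3) and (6)).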
>
> Thanks,
>
> Max
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
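
On points (7) and (13) in the quoted list, a rough sketch of a cross-validation helper that takes the folds from the caller and stays quiet unless asked. The names cv_accuracy() and majority() are hypothetical and exist only for this example. The assumed convention is that `folds` is a list of integer vectors, each holding the row indices held out in one resample.

```r
# Hypothetical baseline model so the example is self-contained:
# always predicts the most frequent class seen during fitting.
majority <- function(x, y) {
  structure(list(levels = levels(y), top = names(which.max(table(y)))),
            class = "majority")
}

predict.majority <- function(object, newdata, ...) {
  factor(rep(object$top, nrow(newdata)), levels = object$levels)
}

# Hypothetical helper: k-fold CV that accepts caller-supplied folds.
cv_accuracy <- function(x, y, fit, folds = NULL, k = 5, verbose = FALSE) {
  if (is.null(folds)) {
    # Only invent folds when the caller did not supply any.
    folds <- split(sample(seq_along(y)), rep_len(seq_len(k), length(y)))
  }
  vapply(folds, function(held_out) {
    model <- fit(x[-held_out, , drop = FALSE], y[-held_out])
    pred  <- predict(model, x[held_out, , drop = FALSE])
    if (verbose) message("held out ", length(held_out), " rows")
    mean(pred == y[held_out])
  }, numeric(1))
}

# The same folds can be handed to any function that follows this convention.
folds <- split(seq_len(nrow(iris)), rep_len(1:5, nrow(iris)))
acc1 <- cv_accuracy(iris[, 1:4], iris$Species, majority, folds = folds)
acc2 <- cv_accuracy(iris[, 1:4], iris$Species, majority, folds = folds)
```

Because the folds come from outside, two calls (or two different packages) resample exactly the same rows, which is the consistency point (13) asks for.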

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Received on Thu 05 Jan 2012 - 15:22:46 GMT
