Re: [Rd] informal conventions/checklist for new predictive modeling packages

From: Steve Lianoglou <>
Date: Thu, 05 Jan 2012 10:16:54 -0500

Good stuff, Max!

Would also be nice to nail your 14 theses to a more permanent wall than the r-help mailing list ... not sure where that would be, though ... isn't someone supposed to be redesigning the website? [I jest, I jest] More seriously, though, it might be worth linking to it from the site as well as from some blurb in the header of the ML task view.


On Wed, Jan 4, 2012 at 9:19 AM, Max Kuhn <> wrote:
> Working on the caret package has exposed me to the wide variety of
> approaches that different authors have taken to creating predictive
> modeling functions (aka machine learning)(aka pattern recognition).
> I suspect that many package authors are neophyte R users and are
> stumbling through the process of writing their first R package (or R
> code). As such, they may not have been exposed to some of the informal
> conventions that have evolved over time. Also, their package may be
> intended to demonstrate their research and not for "production"
> modeling. In any case, it might be a good idea to print up a few
> points for consideration when creating a predictive modeling package.
> I don't propose changes to existing code.
> Some of this is obvious and not limited to this class of modeling
> packages. Many of these points are arguable, so please do so.
> If this seems useful, perhaps we could repost the final list to R-Help
> to use as a checklist.
> Those of you who have used my code will probably realize that I am not
> a grand architect of R packages =] I'd love to get feedback from those
> of you with a broader perspective and better software engineering
> skills than I (a low bar to step over).
> I have marked a few of these items with an OCD tag since I might be
> taking it a bit too far.
> The list:
> (1) Extend the work of others. There is an amazing amount of unneeded
> redundancy. There are plenty of times that users implement their own
> version of a function because there is a missing feature, but a lot
> of time is spent re-creating duplicate functions. For example, kernlab
> has an excellent set of kernel functions that are really efficient and
> have useful ancillary functions. People may not be aware of these
> functions, but they are one RSiteSearch away. (Perhaps we could
> nominate a few packages like kernlab that implement a specific tool
> well)
> (2) When modeling a categorical outcome, use a factor as input (as
> opposed to 0/1 indicators or integers). Factors are exactly the kind
> of feature that separates R from other languages (I'm looking at you
> SAS) and is a natural structure for this data type.
> corollary (2a): save the factor levels in the model object somewhere
> corollary (2b): return predicted classes as factors with the same
> levels (and ordering of levels).
> (3) Implement a separate prediction function. Some packages only make
> predictions when the model is built, so effectively the model cannot
> be used at any point in the future.
> corollary (3a): use object-orientation (eg. predict.{class}) and not
> some made-up function name "modelPredict()" for predicting new
> samples.
> (4) If the method only accepts a specific type of input (eg. matrix or
> data frame), please do the conversion whenever appropriate.
> (5) Provide a formula interface (eg. foo(y~x, data = dat)) and
> non-formula interface (foo(x, y)) to the function. Formula methods are
> really inefficient at this time for large dimensional data but are
> fantastically convenient. There are some good reasons to not use
> formulas, such as functions that do not use a design matrix (eg.
> cforest()) or need factors to be handled in a non-standard way (eg.
> cubist()).
> (6) Don't require a test set when model building.
> (7) Control all written output during model-building time with a
> verbose option. Resampling can make a mess out of things if
> output/logging is always exposed.
> (8) Please use RSiteSearch to avoid name collisions between packages
> (eg. gam(), splsda(), roc(), LogitBoost()). Also search Bioconductor.
> (9) Allow the predict function to generate results from many different
> sub-models simultaneously. For example, pls() can return predictions
> across many values of ncomp. enet(), cubist(), blackboost() are other
> examples.
> corollary (9a): [OCD] ensure the same object type for predictions.
> There are occasions where predict() will return a vector or a matrix
> depending on the context. I would argue that this is not optimal.
> (10) Use a limited vocabulary for options. For example, some predict()
> functions have a "type" option to switch between predicted classes
> and class probabilities. Values of "type" pertaining to class
> probabilities range from "prob", "probability", "posterior", "raw",
> "response", etc. I'll make a suggestion of "prob" as a possible
> standard for this situation.
> (11) Make sure that class probabilities sum to one. Seriously.
> (12) If the model implicitly conducts feature selection, do not
> require un-used predictors to be present in future data sets for
> prediction. This may be a problem when the formula interface to models
> is used, but it looks like many functions reference columns by
> position and not name.
> (13) Packages that have their own cross-validation functions should
> allow the users to pass in the specific folds/resampling indicators to
> maintain consistency across similar functions in other packages.
> (14) [OCD] For binary classification models, model the probability of
> the first level of a factor as the event of interest (again, for
> consistency). Note that glm() does not do this but most others use the
> first level.
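
[Inline note: points (2), (2a), (2b), (3), (3a), (4), (10) and (11) above could look roughly like the sketch below. It is only an illustration under stated assumptions: toy_centroid() and the "toy_centroid" class are made-up names, and the nearest-centroid model with softmax-style probabilities is a stand-in chosen for brevity, not something proposed in the list.]

```r
# Hypothetical toy model (nearest class centroid); all names here are invented
# for illustration and do not come from any existing package.
toy_centroid <- function(x, y) {
  x <- as.matrix(x)                       # (4) convert the input as needed
  stopifnot(is.factor(y), nrow(x) == length(y))   # (2) factor outcome
  centers <- do.call(rbind, lapply(levels(y), function(lv) {
    colMeans(x[y == lv, , drop = FALSE])
  }))
  rownames(centers) <- levels(y)
  # (2a) keep the factor levels inside the model object
  structure(list(centers = centers, levels = levels(y)),
            class = "toy_centroid")
}

# (3)/(3a) a separate S3 predict method, usable long after fitting
predict.toy_centroid <- function(object, newdata,
                                 type = c("class", "prob"), ...) {
  type <- match.arg(type)                 # (10) small vocabulary: "prob"
  newdata <- as.matrix(newdata)
  # squared Euclidean distance from each new sample to each centroid
  d2 <- sapply(seq_len(nrow(object$centers)), function(k) {
    rowSums(sweep(newdata, 2, object$centers[k, ])^2)
  })
  d2 <- matrix(d2, nrow = nrow(newdata))
  # turn distances into weights, then normalise each row
  w <- exp(-(d2 - apply(d2, 1, min)))
  prob <- w / rowSums(w)                  # (11) each row sums to one
  colnames(prob) <- object$levels
  if (type == "prob") return(prob)
  # (2b) predicted classes as a factor with the original levels and order
  factor(object$levels[max.col(prob, ties.method = "first")],
         levels = object$levels)
}

fit <- toy_centroid(iris[, 1:4], iris$Species)
cl  <- predict(fit, iris[1:5, 1:4])                  # factor of classes
p   <- predict(fit, iris[1:5, 1:4], type = "prob")   # probability matrix
```

Because the levels travel with the fitted object, predict() can return a factor with the full set of levels even when the new samples contain only one class.
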
> Thanks,
> Max
> ______________________________________________
> mailing list
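
For what it's worth, here is a rough sketch of how points (5), (7) and (13) from the list might fit together. Everything in it is an assumption made for illustration: foo() is just the placeholder name used in point (5), the least-squares fit via lm.fit() is a stand-in model, and cv_foo() is a hypothetical helper.

```r
# Placeholder generic with a non-formula and a formula interface (point 5).
foo <- function(x, ...) UseMethod("foo")

foo.default <- function(x, y, verbose = FALSE, ...) {
  x <- as.matrix(x)
  # (7) nothing is written unless the caller asks for it
  if (verbose) message("fitting on ", nrow(x), " samples, ",
                       ncol(x), " predictors")
  fit <- lm.fit(cbind(`(Intercept)` = 1, x), y)
  structure(list(coefficients = fit$coefficients), class = "foo")
}

foo.formula <- function(formula, data, ...) {
  mf <- model.frame(formula, data)
  x  <- model.matrix(attr(mf, "terms"), mf)
  # drop the intercept column; foo.default() adds its own
  x  <- x[, colnames(x) != "(Intercept)", drop = FALSE]
  foo.default(x, model.response(mf), ...)
}

predict.foo <- function(object, newdata, ...) {
  drop(cbind(1, as.matrix(newdata)) %*% object$coefficients)
}

# (13) the caller supplies the folds as a list of held-out row indices,
# so the same splits can be reused with functions from other packages.
cv_foo <- function(x, y, folds, ...) {
  sapply(folds, function(test) {
    fit <- foo(x[-test, , drop = FALSE], y[-test], ...)
    sqrt(mean((y[test] - predict(fit, x[test, , drop = FALSE]))^2))
  })
}

x <- as.matrix(mtcars[, c("wt", "hp")])
y <- mtcars$mpg
f1 <- foo(x, y)                            # non-formula interface
f2 <- foo(mpg ~ wt + hp, data = mtcars)    # formula interface
folds <- split(seq_len(nrow(x)), rep(1:4, length.out = nrow(x)))
rmse  <- cv_foo(x, y, folds)               # one RMSE per supplied fold
```

The formula method only builds the design matrix and then hands off to the default method, so the two interfaces cannot drift apart.
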

Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info:

Received on Thu 05 Jan 2012 - 15:22:46 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 05 Jan 2012 - 20:50:06 GMT.
