[Rd] informal conventions/checklist for new predictive modeling packages

From: Max Kuhn <mxkuhn_at_gmail.com>
Date: Wed, 04 Jan 2012 09:19:11 -0500


Working on the caret package has exposed me to the wide variety of approaches that different authors have taken to creating predictive modeling functions (aka machine learning)(aka pattern recognition).

I suspect that many package authors are neophyte R users and are stumbling through the process of writing their first R package (or R code). As such, they may not have been exposed to some of the informal conventions that have evolved over time. Also, their package may be intended to demonstrate their research and not for "production" modeling. In any case, it might be a good idea to print up a few points for consideration when creating a predictive modeling package. I don't propose changes to existing code.

Some of this is obvious and not limited to this class of modeling packages. Many of these points are arguable, so please do so.

If this seems useful, perhaps we could repost the final list to R-Help to use as a checklist.

Those of you who have used my code will probably realize that I am not a grand architect of R packages =] I'd love to get feedback from those of you with a broader perspective and better software engineering skills than I (a low bar to step over).

I have marked a few of these items with an OCD tag since I might be taking it a bit too far.

The list:

(1) Extend the work of others. There is an amazing amount of unneeded
redundancy. There are plenty of times that users implement their own version of a function because there is a missing feature, but a lot of time is spent re-creating duplicate functions. For example, kernlab has an excellent set of kernel functions that are really efficient and have useful ancillary functions. People may not be aware of these functions, but they are one RSiteSearch away. (Perhaps we could nominate a few packages like kernlab that implement a specific tool well.)

(2) When modeling a categorical outcome, use a factor as input (as
opposed to 0/1 indicators or integers). Factors are exactly the kind of feature that separates R from other languages (I'm looking at you SAS) and are a natural structure for this data type.

corollary (2a): save the factor levels in the model object somewhere

corollary (2b): return predicted classes as factors with the same levels (and ordering of levels).
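
A minimal sketch of (2), (2a) and (2b). The list 'fit' stands in for a real model object and the integer codes are made up for illustration; neither comes from an actual package.

```r
# Outcome supplied as a factor; the level order is chosen by the user.
y <- factor(c("no", "yes", "yes", "no"), levels = c("yes", "no"))

# (2a) Keep the levels in the fitted object ('fit' is a stand-in list here).
fit <- list(levels = levels(y))

# Suppose the underlying algorithm reports predicted classes as integer codes.
codes <- c(2L, 1L, 1L)

# (2b) Map the codes back to a factor with the original levels and ordering.
pred <- factor(fit$levels[codes], levels = fit$levels)
pred
levels(pred)
```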

(3) Implement a separate prediction function. Some packages only make
predictions when the model is built, so effectively the model cannot be used at any point in the future.

corollary (3a): use object-orientation (eg. predict.{class}) and not some made-up function name "modelPredict()" for predicting new samples.
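
One way (3) and (3a) might look, as a sketch only. The class "meanModel" and its trivial fitting rule are hypothetical; the point is that predict() dispatches on the class of the fitted object.

```r
# Hypothetical model: predicts the training mean of y for every new sample.
meanModel <- function(x, y) {
  structure(list(mean = mean(y)), class = "meanModel")
}

# S3 method, so users call the familiar predict() generic.
predict.meanModel <- function(object, newdata, ...) {
  rep(object$mean, nrow(newdata))
}

fit <- meanModel(x = data.frame(a = 1:4), y = c(1, 2, 3, 6))

# The fitted object is all that is needed later; no test set at fit time.
newSamples <- data.frame(a = 10:12)
predict(fit, newSamples)
```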

(4) If the method only accepts a specific type of input (eg. matrix or
data frame), please do the conversion whenever appropriate.
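
A small sketch of (4), assuming a fitting routine that needs a numeric matrix internally. The function name centerFit() is made up.

```r
# Hypothetical fitting function whose internals assume a numeric matrix.
centerFit <- function(x, y) {
  if (is.data.frame(x)) x <- as.matrix(x)   # convert instead of failing
  stopifnot(is.matrix(x), is.numeric(x))
  structure(list(center = colMeans(x)), class = "centerFit")
}

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
fitDf  <- centerFit(df, y = c(0, 1, 0))              # data frame input
fitMat <- centerFit(as.matrix(df), y = c(0, 1, 0))   # matrix input
identical(fitDf$center, fitMat$center)
```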

(5) Provide a formula interface (eg. foo(y~x, data = dat)) and
non-formula interface (foo(x, y)) to the function. Formula methods are really inefficient at this time for high-dimensional data but are fantastically convenient. There are some good reasons to not use formulas, such as functions that do not use a design matrix (eg. cforest()) or need factors to be handled in a non-standard way (eg. cubist()).
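
A sketch of how (5) could be laid out with S3 methods. foo() is the placeholder name used above; the bodies are illustrative assumptions, with the formula method doing nothing but building x and y and handing off to the non-formula method.

```r
# Placeholder generic with a non-formula method and a formula method.
foo <- function(x, ...) UseMethod("foo")

# Non-formula interface: the real work would happen here.
foo.default <- function(x, y, ...) {
  x <- as.matrix(x)
  structure(list(nPredictors = ncol(x), nSamples = nrow(x)), class = "foo")
}

# Formula interface: build the design matrix, then call the default method.
foo.formula <- function(x, data, ...) {
  mf <- model.frame(x, data = data)
  y <- model.response(mf)
  xMat <- model.matrix(attr(mf, "terms"), data = mf)
  xMat <- xMat[, colnames(xMat) != "(Intercept)", drop = FALSE]
  foo.default(xMat, y, ...)
}

dat <- data.frame(y = c(1, 2, 3), x1 = c(0.5, 1.5, 2.5), x2 = c(3, 1, 2))
fit1 <- foo(y ~ x1 + x2, data = dat)       # formula interface
fit2 <- foo(dat[, c("x1", "x2")], dat$y)   # non-formula interface
```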

(6) Don't require a test set when model building.

(7) Control all written output during model-building time with a
verbose option. Resampling can make a mess out of things if output/logging is always exposed.
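
A sketch of (7), assuming an iterative fitting routine. iterFit() and its arguments are hypothetical; all printed output is gated by 'verbose'.

```r
# Hypothetical iterative fit; progress is printed only when verbose = TRUE.
iterFit <- function(x, y, maxIter = 3, verbose = FALSE) {
  for (iter in seq_len(maxIter)) {
    if (verbose) cat(sprintf("iteration %d of %d\n", iter, maxIter))
    # ... one update step of the real algorithm would go here ...
  }
  invisible(list(iterations = maxIter))
}

quiet <- capture.output(iterFit(1, 1))                  # nothing printed
loud  <- capture.output(iterFit(1, 1, verbose = TRUE))  # one line per iteration
length(quiet)
length(loud)
```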

(8) Please use RSiteSearch to avoid name collisions between packages
(eg. gam(), splsda(), roc(), LogitBoost()). Also search Bioconductor.

(9) Allow the predict function to generate results from many different
sub-models simultaneously. For example, pls() can return predictions across many values of ncomp. enet(), cubist(), blackboost() are other examples.

corollary (9a): [OCD] ensure the same object type for predictions. There are occasions where predict() will return a vector or a matrix depending on the context. I would argue that this is not optimal.
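
A sketch of (9) and (9a) with a toy model. The class "shrinkFit" and its shrinkage rule are invented for illustration; the matrix is built explicitly so that the return type does not change with the number of sub-models or new samples.

```r
# Hypothetical model with 'ncomp' nested sub-models; sub-model k shrinks the
# training mean by k / ncomp (a toy rule, used only to get distinct columns).
shrinkFit <- function(y, ncomp = 3) {
  structure(list(mean = mean(y), ncomp = ncomp), class = "shrinkFit")
}

predict.shrinkFit <- function(object, newdata,
                              ncomp = seq_len(object$ncomp), ...) {
  values <- object$mean * ncomp / object$ncomp
  # Build the matrix explicitly so the result is a matrix even when there is
  # only one sub-model or one new sample.
  matrix(values, nrow = nrow(newdata), ncol = length(ncomp), byrow = TRUE,
         dimnames = list(NULL, paste0("ncomp", ncomp)))
}

fit <- shrinkFit(y = c(3, 6, 9))
newSamples <- data.frame(a = 1:2)
pAll <- predict(fit, newSamples)              # one column per sub-model
pOne <- predict(fit, newSamples, ncomp = 2)   # still a matrix, one column
```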

(10) Use a limited vocabulary for options. For example, some predict()
functions have a "type" option to switch between predicted classes and class probabilities. Values of "type" pertaining to class probabilities range from "prob", "probability", "posterior", "raw", "response", etc. I'll make a suggestion of "prob" as a possible standard for this situation.
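
A sketch of (10), assuming "class" and "prob" as the only accepted values. The class "freqModel" is hypothetical; match.arg() enforces the vocabulary.

```r
# Hypothetical model that only stores the observed class frequencies.
freqModel <- function(y) {
  stopifnot(is.factor(y))
  structure(list(levels = levels(y),
                 prob = as.vector(table(y)) / length(y)),
            class = "freqModel")
}

predict.freqModel <- function(object, newdata, type = c("class", "prob"), ...) {
  type <- match.arg(type)   # anything other than "class" or "prob" is an error
  n <- nrow(newdata)
  if (type == "prob") {
    matrix(object$prob, nrow = n, ncol = length(object$levels), byrow = TRUE,
           dimnames = list(NULL, object$levels))
  } else {
    best <- object$levels[which.max(object$prob)]
    factor(rep(best, n), levels = object$levels)
  }
}

fit <- freqModel(factor(c("a", "b", "b", "b")))
newSamples <- data.frame(x = 1:2)
predClass <- predict(fit, newSamples)                 # default: type = "class"
predProb  <- predict(fit, newSamples, type = "prob")
bad <- try(predict(fit, newSamples, type = "posterior"), silent = TRUE)
```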

(11) Make sure that class probabilities sum to one. Seriously.
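
A quick check along the lines of (11). The scores are made-up numbers; the normalization shown is one simple option and not necessarily what any given model should use.

```r
# Unnormalized class scores for three samples (made-up numbers).
scores <- matrix(c(2,   1,   1,
                   0.5, 0.5, 3,
                   1,   1,   1),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("a", "b", "c")))

# Divide each row by its own total so the rows sum to one.
probs <- scores / rowSums(scores)

# Compare with a tolerance rather than with ==, since these are doubles.
rowsOk <- isTRUE(all.equal(rowSums(probs), rep(1, nrow(probs))))
rowsOk
```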

(12) If the model implicitly conducts feature selection, do not
require un-used predictors to be present in future data sets for prediction. This may be a problem when the formula interface to models is used, but it looks like many functions reference columns by position and not name.
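
A sketch of (12), assuming a fitted object that records the names of the predictors it kept. The list 'fit' and the helper predictSelected() are hypothetical; columns are looked up by name, so unused predictors may be absent and the column order does not matter.

```r
# Hypothetical fit that retained only x1 and x3 after feature selection.
fit <- list(coef = c(x1 = 2, x3 = -1))

predictSelected <- function(object, newdata) {
  needed <- names(object$coef)
  absent <- setdiff(needed, colnames(newdata))
  if (length(absent) > 0) {
    stop("missing predictors: ", paste(absent, collapse = ", "))
  }
  # Select by name, so extra or reordered columns are harmless.
  x <- as.matrix(newdata[, needed, drop = FALSE])
  as.numeric(x %*% object$coef)
}

# New data without the unused predictor x2, and in a different column order.
newSamples <- data.frame(x3 = c(1, 2), x1 = c(10, 20))
pred <- predictSelected(fit, newSamples)
```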

(13) Packages that have their own cross-validation functions should
allow the users to pass in the specific folds/resampling indicators to maintain consistency across similar functions in other packages.
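
A sketch of (13), assuming folds are passed as a list of held-out row indices. cvMean() and its trivial "model" are hypothetical; the same 'folds' object could then be reused with other functions.

```r
# Hypothetical cross-validation routine that accepts user-supplied folds.
cvMean <- function(y, folds) {
  # 'folds' is a list of integer vectors holding the held-out row indices.
  sapply(folds, function(heldOut) {
    pred <- mean(y[-heldOut])           # "model" fit on the remaining rows
    sqrt(mean((y[heldOut] - pred)^2))   # RMSE on the held-out rows
  })
}

y <- c(1, 2, 3, 4, 5, 6)
folds <- list(Fold1 = c(1L, 4L), Fold2 = c(2L, 5L), Fold3 = c(3L, 6L))
rmse <- cvMean(y, folds)
rmse
```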

(14) [OCD] For binary classification models, model the probability of
the first level of a factor as the event of interest (again, for consistency). Note that glm() does not do this but most others use the first level.
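
To illustrate the glm() behaviour mentioned in (14): with a factor outcome and family = binomial, glm() treats the first level as "failure", so the fitted probabilities refer to the second level. The data below are made up.

```r
dat <- data.frame(
  x = 1:8,
  y = factor(c("event", "event", "noevent", "event",
               "noevent", "noevent", "event", "noevent"),
             levels = c("event", "noevent"))
)

fit <- glm(y ~ x, data = dat, family = binomial)

pSecond <- predict(fit, type = "response")   # P(y == "noevent"), second level
pFirst  <- 1 - pSecond                       # P(y == "event"), first level
head(cbind(pFirst, pSecond))
```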

Thanks,

Max



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed 04 Jan 2012 - 14:23:26 GMT

