[Rd] Wish R Core had a standard format (or generic function) for "newdata" objects

From: Paul Johnson <pauljohn32_at_gmail.com>
Date: Tue, 26 Apr 2011 10:13:23 -0500

Is anybody working on a way to standardize the creation of "newdata" objects for predict methods?

When using predict, I find it tedious to create newdata data frames when there are many variables. All variables have to be set at the mean, mode, or median, and then, for the variables of interest, one has to insert the values for which predictions are desired. I was at a presentation by Scott Long last week, and he discussed the increasing emphasis in Stata on calculating marginal predictions, along with "Spost" and several other packages. Coincidentally, a student who is learning to use polr from the MASS package (W. Venables and B. Ripley) visited me, and we wrestled for quite a while trying to reproduce the calculations that Stata makes automatically. Stata spits out predicted probabilities for each independent variable, keeping the other variables at a reference level.
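To make the manual procedure concrete, here is a minimal base-R sketch of it. The helper names `typical_value` and `make_newdata` are hypothetical and do not come from any package; the sketch assumes predictors are plain numeric or factor columns and uses the built-in mtcars data only for illustration.

```r
# Hypothetical helpers (not part of any package): hold every predictor at a
# typical value and vary only the focal ones.
typical_value <- function(x) {
  if (is.numeric(x)) {
    mean(x, na.rm = TRUE)
  } else {
    # modal category, keeping the original factor levels
    tab <- table(x)
    lev <- names(tab)[which.max(tab)]
    if (is.factor(x)) factor(lev, levels = levels(x)) else lev
  }
}

make_newdata <- function(data, focal = list()) {
  others <- setdiff(names(data), names(focal))
  fixed <- lapply(data[others], typical_value)
  # expand.grid crosses the focal values with the single fixed values
  expand.grid(c(focal, fixed), stringsAsFactors = FALSE,
              KEEP.OUT.ATTRS = FALSE)
}

dat <- transform(mtcars[c("mpg", "wt", "hp", "am")],
                 am = factor(am, labels = c("automatic", "manual")))
fit <- lm(mpg ~ wt + hp + am, data = dat)

# Vary wt; hp is held at its mean and am at its most frequent level.
nd <- make_newdata(dat[c("wt", "hp", "am")], focal = list(wt = c(2, 3, 4)))
nd$fit <- predict(fit, newdata = nd)
nd
```

Every model with many predictors needs some version of this bookkeeping, which is the tedium a standard "newdata" facility would remove.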

I've found R packages that aim to do essentially the same thing.

In Frank Harrell's Design/rms framework, he uses a "datadist" function that generates an object that the user has to put into the R options. I think many users trip over the use of "options" there. If I don't use it for a month or two, I completely forget the fine points and have to fight with it. But it does "work" to give the plot and predict functions the information they require.

In Zelig (by Kosuke Imai, Gary King, and Olivia Lau), a function "setx" does the work of creating "newdata" objects. That appears to be about right as a candidate for a generic "newdata" function. Perhaps it could directly generalize to all R regression functions, but right now it is tailored to the models in Zelig. It has separate methods for the different types of models, and that is a bit confusing to me, since the "newdata" in one model should be the same as the newdata in another, I'm guessing. But the code is all there, so I'll keep looking.
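As a rough illustration of what a generic "newdata" function might look like, here is a sketch using S3 dispatch. The generic name `newdata` and its arguments are hypothetical; this is not Zelig's setx or any existing R interface. It assumes the model has working model.frame() and terms() methods and that the formula uses plain numeric or factor variables (no transformations such as log(x) or poly(x, 2)).

```r
# Hypothetical S3 generic; name and arguments are illustrative only.
newdata <- function(model, ...) UseMethod("newdata")

newdata.default <- function(model, ...) {
  focal <- list(...)                      # values the user wants to vary
  mf <- model.frame(model)                # data actually used in the fit
  vars <- all.vars(delete.response(terms(model)))
  fixed <- lapply(mf[setdiff(vars, names(focal))], function(x) {
    if (is.numeric(x)) return(mean(x))
    # non-numeric: use the modal category, keeping all levels
    factor(names(which.max(table(x))), levels = levels(factor(x)))
  })
  expand.grid(c(focal, fixed), KEEP.OUT.ATTRS = FALSE)
}

# One default method serves different model classes with the same predictors.
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)

nd_lm  <- newdata(fit_lm,  wt = c(2, 3, 4))
nd_glm <- newdata(fit_glm, wt = c(2, 3, 4))

all.equal(nd_lm, nd_glm)
predict(fit_glm, newdata = nd_glm, type = "response")
```

Class-specific methods would only be needed where a model stores its data in a nonstandard way, which is consistent with the guess that the "newdata" for one model should look the same as for another.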

In the effects package (by John Fox), there are internal functions to create newdata and plot the marginal effects. If you load effects and run, for example, "effects:::effect.lm", you see that Prof. Fox has his own way of grabbing information from model columns and calculating predictions.

I think it is time the R Core Team looked at this and told "us" what the right way to do this is. I think the interface to setx in Zelig is pretty easy to understand, at least for numeric variables.

Such a thing could be put to use in R's termplot function. As far as I can tell, termplot already does most of the work of creating a newdata object, though not exactly.

It seems like it would be a shame to proliferate more functions that do the same job, when it is such a common task.

Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas

R-devel_at_r-project.org mailing list
Received on Tue 26 Apr 2011 - 15:15:32 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 27 Apr 2011 - 18:00:53 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.
