[R] A comment about R:

From: Peter Muhlberger <pmuhl1848_at_gmail.com>
Date: Thu 05 Jan 2006 - 06:43:08 EST


I'm someone who from time to time comes to R to do applied stats for social science research. I think the R language is excellent--much better than Stata for writing complex statistical programs. I am thrilled that I can do complex stats readily in R--sem, maximum likelihood, bootstrapping, some Bayesian analysis. I wish I could make R my main statistical package, but I find that a few stats that are important to my work are difficult to find or produce in R.

Before I list some examples, I recognize that people view R not as a statistical package but rather as a statistical programming environment. That said, it seems, from my admittedly limited perspective, that it would be fairly easy to make a few adjustments to R that would make it a lot more practical and friendly for a broader range of people--including people like me who from time to time want to do statistical programming but more often need to run canned procedures. I'm not a statistician, so I don't want to have to learn everything there is to know about the common procedures I use, including how to write them from scratch. I want to be able to focus my efforts on more novel problems w/o reinventing the wheel. I would also prefer not to have to work through a couple of books on R or S+ to learn how to meet common needs in R.

If R were extended a bit in the direction of helping people like me, I wonder whether it would not acquire a much broader audience. Then again, these may just be the rantings of someone not sufficiently familiar w/ R or the community of stat package users--so take my comments w/ a grain of salt.

Some examples of statistics I typically use that are difficult to find in R, or difficult to produce in a usefully formatted way--

Ex. 1) Wald tests of linear hypotheses after max. likelihood or even after a regression. "Wald" does not even turn up in a search of my standard R installation. There's no comment in the lm help or optim help about what function to use for hypothesis tests. I know that statisticians prefer likelihood ratio tests, but Wald tests are still useful and indeed crucial for first-pass analysis. After searching with Google for some time, I found several Wald functions in various contributed R packages I did not have installed. It was not clear which one would be relevant to my needs, and this took some time to resolve. I concluded, perhaps on insufficient evidence, that package car's Wald test would be most helpful. To use it, however, one has to put together a matrix for the hypotheses, which can be arduous for a many-term regression or a complex hypothesis. In comparison, in Stata one simply states the hypothesis in symbolic terms. I also don't know for certain that this function in car will work, or work properly, w/ various kinds of output, say from lm or from optim. To be sure, I'd need to run time-consuming tests comparing it with Stata output or examine the function's code. In Stata the test is easy to find, and there's no uncertainty about where it can be run or its accuracy. Simply having a comment or "see also" in lm help or mle or optim help pointing the user to the right Wald function would be of enormous help.
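
For reference, the hypothesis-matrix approach described in Ex. 1 can be sketched in base R alone. This is only an illustration, not the function in car: the mtcars data, the model, and the particular hypothesis (both the hp and qsec coefficients equal zero) are assumptions chosen for the example.

```r
# Illustrative model on the built-in mtcars data (an assumption for this sketch).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
b <- coef(fit)

# Hypothesis R b = r: coefficient on hp is 0 AND coefficient on qsec is 0.
# Each row of R picks out one linear combination of the coefficients.
R <- matrix(0, nrow = 2, ncol = length(b), dimnames = list(NULL, names(b)))
R[1, "hp"]   <- 1
R[2, "qsec"] <- 1
r <- c(0, 0)

# Wald statistic: (Rb - r)' [R V R']^{-1} (Rb - r), with V = vcov(fit).
d <- R %*% b - r
W <- drop(t(d) %*% solve(R %*% vcov(fit) %*% t(R)) %*% d)

q <- nrow(R)                                   # number of restrictions
p_chisq <- pchisq(W, df = q, lower.tail = FALSE)               # large-sample chi-square form
p_F <- pf(W / q, q, df.residual(fit), lower.tail = FALSE)      # F form for a linear model
```

For a fit from optim, the same algebra should apply with the parameter vector in place of coef(fit) and, assuming a negative log-likelihood was minimized with hessian = TRUE, the inverse of the returned Hessian in place of vcov(fit).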

Ex. 2) Getting neat output of a regression with a Huberized variance matrix. I frequently have to run regressions w/ robust variances. In Stata, one simply adds the word "robust" to the end of the command, or "cluster(cluster.variable)" for a cluster-robust error. In R, there are two functions, robcov and hccm. I had to run tests to figure out how they relate to each other and to Stata (robcov w/o cluster gives hccm's hc0; hccm's hc1 is equivalent to Stata's 'robust' w/o cluster; etc.). A single sentence in hccm's help saying something to the effect that statisticians prefer hc3 for most types of data might save me from having to scramble through the statistical literature to try to figure out which of these I should be using. A few sentences on what the differences are between these methods would be even better. Then, there's the problem of output. Given that hc1 or hc3 are preferred for non-clustered data, I'd need to be able to get regression output of the form summary(lm) out of hccm for any practical use. Getting this, however, would require programming my own function. Huberized t-stats for regressions are a commonplace need; an R oriented a little more toward everyday needs would not require users to program them. Also, I'm not sure yet how well any of the existing functions handle missing data.
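
To make the differences in Ex. 2 concrete, here is a minimal base-R sketch of the three estimators and a summary(lm)-style table. It does not call robcov or hccm; the mtcars model and the helper names robust_vcov and robust_table are hypothetical, introduced only for illustration. HC0 weights each observation by its squared residual, HC1 multiplies HC0 by n/(n - k), and HC3 divides each squared residual by (1 - h_i)^2, where h_i is the observation's leverage.

```r
# Illustrative model on the built-in mtcars data (an assumption for this sketch).
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)          # design matrix actually used in the fit
e <- residuals(fit)
h <- hatvalues(fit)             # leverages h_i
n <- nrow(X); k <- ncol(X)
bread <- solve(crossprod(X))    # (X'X)^{-1}

# Sandwich estimator: (X'X)^{-1} X' diag(omega) X (X'X)^{-1}
robust_vcov <- function(omega) bread %*% crossprod(X * sqrt(omega)) %*% bread

V_hc0 <- robust_vcov(e^2)
V_hc1 <- V_hc0 * n / (n - k)            # degrees-of-freedom correction
V_hc3 <- robust_vcov(e^2 / (1 - h)^2)   # leverage-adjusted residuals

# Coefficient table in the style of summary(lm), with robust standard errors.
robust_table <- function(V) {
  se <- sqrt(diag(V))
  tval <- coef(fit) / se
  cbind(Estimate = coef(fit), "Robust SE" = se, "t value" = tval,
        "Pr(>|t|)" = 2 * pt(abs(tval), df.residual(fit), lower.tail = FALSE))
}

print(round(robust_table(V_hc1), 4))
```

Because the design matrix, residuals, and leverages are all taken from the fitted object, rows dropped for missing values under lm's default na.action should be excluded consistently, though that is worth checking on real data.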

Ex. 3) I need to do bootstrapping w/ clustered data, again a common statistical need. I wasted a good deal of time reading the help contents of boot and bootstrap, only to conclude that I'd need to write my own, probably inefficient, function to bootstrap clustered data if I were to use boot. It's odd that boot can't handle this more directly. After more digging, I learned that bootcov in package Design would handle the cluster bootstrap and save the parameters. I wouldn't have found this if I had not needed bootcov for another purpose. Again, a few words in the boot help saying that 'for clustered data, you could use bootcov or program a function in boot' would be very helpful. I still don't know whether I can feed the results of bootcov back into functions in the boot package for further analysis.
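
The hand-written cluster bootstrap mentioned in Ex. 3 might look roughly like the following base-R sketch. It uses neither boot nor bootcov; the built-in ChickWeight data (with Chick as the cluster), the model, and the function name cluster_boot are assumptions made for illustration. The key point is that whole clusters, not individual rows, are drawn with replacement.

```r
set.seed(1)  # for reproducible resampling

# Built-in longitudinal data; each chick is treated as one cluster.
dat <- as.data.frame(ChickWeight)
rows_by_cluster <- split(seq_len(nrow(dat)), dat$Chick)

cluster_boot <- function(B = 200) {
  t(replicate(B, {
    # Draw clusters with replacement, keeping every row of each drawn cluster.
    picked <- sample(length(rows_by_cluster), replace = TRUE)
    idx <- unlist(rows_by_cluster[picked], use.names = FALSE)
    coef(lm(weight ~ Time, data = dat[idx, ]))
  }))
}

boot_coefs <- cluster_boot(200)        # one row of coefficients per replicate
boot_se <- apply(boot_coefs, 2, sd)    # cluster-bootstrap standard errors
```

Refitting lm on every replicate is simple but not fast, so this is a sketch of the idea rather than an efficient implementation.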

My 2 bits for what they're worth,

Peter



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help