Re: [R] A comment about R

From: John Maindonald <john.maindonald_at_anu.edu.au>
Date: Thu 05 Jan 2006 - 11:17:47 EST

Quoting from Thomas's message -

> "On the question of which system really is easier to learn I can
> only comment that this isn't the only question where education,
> as a field, would benefit from some good randomized controlled
> trials."

A Randomized Controlled Trial?:
Doing such trials would be a 30-year project. The entry criterion might be at least a pass score on a test that was designed to identify students with the potential to be reasonable statistical practitioners. (To make this work, coaching at a summer camp might be a necessary preliminary.) Students would be introduced to whatever system at various times in their educational development -- ages 11, 14, 18 or 24. For each age/system combination, there'd be a variety of dose levels(!). Half would be introduced via the GUI and half via the command line. Outcome measures would be (1) liking for the system; (2) quality of analysis, on several analysis tasks of a type that are likely to arise in several different analysis areas. Assessments would be made in early career and in mid-career. Analyses would of course be done using both SAS Proc Mixed and lmer() in lme4. There'd be bound to be enough missing data to make the design unbalanced, hence allowing plenty of room for argument about the informativeness of the missingness, and about the adequacy of the degrees of freedom approximation, or whether an approach that uses a df approximation was even worth considering.

What happens with those who decide, of their own accord or from necessity, to learn a system additional to the one to which they were assigned? (This may itself be an outcome.) Should there be control for exposure to another language?

The more one thinks about it, the worse the design problem gets. The situation is a bit different from the teaching of reading, where high quality randomized trials can and should be done, notwithstanding the complications of controlling for teacher effects. As always, it is however insightful to think about the randomized trial that would be required.

I can envisage a simple randomized trial, still extending over some years, where the outcome measure is the quality of statistical analysis, on problems that meet the criteria given above.

The height of the bar:
For proper comparison of ease of doing analyses, a staged set of analysis problems is required, from cases where most would agree that a t-test or chi-square test or CI or ... "answers" the question of interest, through to a variety of realistic regression problems. Agreement on some minimal set of steps needed to do an adequate analysis would be a necessary part of the process. This insists that the goalposts are always at the same height. Such an exercise could be highly insightful, and a useful contribution to the public scientific good.

Research questions:
To a smaller or larger extent, R is a component of a research exercise in the development of statistical computational abilities. Perhaps to the majority of users on this list, it is primarily an effective tool for the handling of statistical and other scientific computing tasks. Some see these two goals as somewhat distinct (at the boundaries, they obviously are); others see a large overlap.

In any case, this latter role has enormous importance, actual and potential, for the scientific community, and indeed for any area (especially business) where there is a continual and insistent demand to make sense of data. A variety of research questions that warrant attention:

(1) Who should learn R?

[In my view R is such a versatile tool for scientific computing that anyone contemplating a career in science, and who expects to do their own computations that have a substantial data analysis component, should learn R. The only serious competitors, in my view and depending on the area of application, are Genstat, Stata, and Matlab -- Genstat for the analysis of designed experiments and for the quality of its GUI, Stata for the reasons given by others, and Matlab for signal processing. SAS may be important for its efficiency in certain types of batch processing with large data sets, and because of the extent of existing large SAS repositories; SPSS may be important because of the extent of existing large SPSS data repositories. Some comment is also needed on S-PLUS? I am of course ignoring the skill investment that many researchers have made in these other packages. While this has somehow to be factored in, it surely has limited relevance to assessing priorities for those who are currently starting out.]

(2) R has clearly reduced the time lag between the development of new theory, and availability of the associated methodology to statistical practitioners. It has also, incidentally, raised the bar for commercial statistical software systems. What are the implications for statistical research, and for professional practice and training?

(3) Should learners use a GUI, or the command line, in getting started?

[A major issue for GUIs is documentation of steps in an analysis. This will become increasingly important as more journals demand, as I hope will happen, publication of Sweave or other reproducible versions of analyses. Some ultimate familiarity with the command line may in the medium term be essential.]

(4) When should students start learning R?

[Students should get their first exposure to a high-level programming language, in the style of R, Python or Octave, at age 11-14. There are now good alternatives to the former use of Fortran or Pascal, languages which have for good reason dropped out of favour for learning experience. They should start on R while their minds are still malleable, and long before they need it for serious research use.]

(5) What are the traps, in using R, for relative novices?

[Mechanisms are needed for identifying traps that routinely catch novices (even novices who may be quite sophisticated statistically), with a program to tackle these, in the medium to long term.]

(6) Default output requires (continuing) careful scrutiny from a "what will encourage good statistical practice" perspective.

(7) What, more widely, should go on the wish list?

John Maindonald             email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 4 Jan 2006, at 10:00 PM, r-help-request@stat.math.ethz.ch wrote:

> From: Thomas Lumley <tlumley@u.washington.edu>
> Date: 4 January 2006 6:23:18 AM
> To: Peter Dalgaard <p.dalgaard@biostat.ku.dk>
> Cc: R-help@stat.math.ethz.ch, Patrick Burns <pburns@pburns.seanet.com>
> Subject: Re: [R] A comment about R:
>
>
> On Tue, 3 Jan 2006, Peter Dalgaard wrote:
>> One thing that is often overlooked, and hasn't yet been mentioned in
>> the thread, is how much *simpler* R can be for certain completely
>> basic tasks of practical or pedagogical relevance: Calculate a simple
>> derived statistic, confidence intervals from estimate and SE,
>> percentage points of the binomial distribution - using dbinom or from
>> the formula, take the sum of each of 10 random samples from a set of
>> numbers, etc. This is where other packages get stuck in the
>> procedure+dataset mindset.
>
> Some of these things are actually fairly straightforward in Stata.
> For example, Stata will give confidence intervals and tests for
> linear combinations of coefficients and even (using symbolic
> differentiation and the delta method) for nonlinear combinations.
> The first is available in packages in R, the second is in "S
> Programming" but doesn't seem to be packaged.
>
> <snip>
>
> Now, I still prefer R both for data analysis and (even more so) for
> programming. There are some things that it is genuinely difficult
> to program in Stata -- and as evidence that this isn't just my
> ignorance of the best approaches, the language was substantially
> reworked in both versions 8 and 9 to allow the vendor to implement
> better graphics and linear mixed models.
>
> On the question of which system really is easier to learn I can
> only comment that this isn't the only question where education, as
> a field, would benefit from some good randomized controlled trials.
>
> -thomas
>
> Thomas Lumley Assoc. Professor, Biostatistics
> tlumley@u.washington.edu University of Washington, Seattle
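
The basic tasks Peter Dalgaard lists in the quoted message (a confidence interval from an estimate and SE, binomial probabilities from dbinom or from the formula, the sum of each of 10 random samples) can be sketched in a few lines of base R. This is only an illustration: the numbers, the sample size of 5, the normal-based interval and all variable names are assumptions, not taken from the thread.

```r
# Base R only. All values below are made-up illustrations.

# 1. Approximate 95% confidence interval from an estimate and its SE
#    (normal approximation assumed)
est <- 2.5   # hypothetical estimate
se  <- 0.4   # hypothetical standard error
ci  <- est + c(-1, 1) * qnorm(0.975) * se

# 2. Binomial probabilities: dbinom() versus the explicit formula
n <- 10      # hypothetical number of trials
p <- 0.3     # hypothetical success probability
k <- 0:n
from_dbinom  <- dbinom(k, size = n, prob = p)
from_formula <- choose(n, k) * p^k * (1 - p)^(n - k)
same <- isTRUE(all.equal(from_dbinom, from_formula))

# 3. Sum of each of 10 random samples from a set of numbers
set.seed(1)                       # for reproducibility of the sketch
x <- c(3, 8, 1, 9, 4, 7, 2, 6)    # hypothetical set of numbers
sums <- replicate(10, sum(sample(x, size = 5)))  # 5 drawn without replacement
```

Each task is a single expression on ordinary vectors, with no dataset or procedure step in between, which seems to be the point being made about the "procedure+dataset mindset".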



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Thu Jan 05 11:25:30 2006
