Re: [R] Tools For Preparing Data For Analysis

From: Chris Evans
Date: Fri, 08 Jun 2007 18:26:51 +0100

Martin Henry H. Stevens sent the following at 08/06/2007 15:11:
> Is there an example available of this sort of problematic data that
> requires this kind of data screening and filtering? For many of us,
> this issue would be nice to learn about, and deal with within R. If a
> package could be created, that would be optimal for some of us. I
> would like to learn a tad more, if it were not too much effort for
> someone else to point me in the right direction?
> Cheers,
> Hank
> On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:
>> On 6/7/07, Robert Wilkins wrote:
>>> As noted on the R-project web site itself ( ->

... rest snipped ...

OK, I can't resist that invitation. I think there are many kinds of problematic data. I handle some nasty textish things in Perl (and I loved the purgatory quote), and I'm afraid I do some things in Excel. Some cleaning I can handle in R, but I never enter data directly into R.

However, one very common scenario I have faced all my working life is psych data from questionnaires or interviews in low budget work, mostly student research or routine entry of therapists' data. Typically you have an identifier, a date, some demographics and then a lot of item data. There's little money (usually zero) involved for data entry and cleaning, but I've produced a lot of good(ish) papers out of this sort of very low budget work over the last 20 years. (Right at the other end of the financial spectrum from the FDA/validated s'ware thread, but this is about validation again!)

The problem I often face is that people are lousy data entry machines (well, actually, they vary ... enormously) and if they mess up the data entry we all know how horrible this can be.

SPSS (boo hiss) used to have an excellent "module", actually a standalone PC/Windoze program, that allowed you to define variables with a set of allowed values, and it would refuse to accept out-of-range or otherwise unacceptable entries. It also allowed you to create checking rules, and rules that would, in the light of earlier entries, set later values and not ask about them. In a rudimentary way you could also lay things out on the screen so that it paginated where the q'aire or paper data record did, etc. The final nice touch was that you could define some variables as invariant and then set the thing so an independent data entry person could re-enter the other data (i.e. pick up q'aire, see if ID fits the one showing on screen, if so, enter the rest of the data). It would bleep and not move on if you entered a value other than that entered by the first person, and you had to confirm which of you was right.
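The first two kinds of rule described above (allowed values, and later items that should be skipped in the light of earlier answers) can at least be checked after the event in R. What follows is only a minimal sketch under stated assumptions: the data frame, the column names (sex, item1, pregnancies), the allowed values and the helper check_allowed() are all made up for illustration and are not part of SPSS/DE or of any existing R package.

```r
# Hypothetical questionnaire data; every name and value here is an
# assumption made up for this illustration.
entries <- data.frame(
  id          = c(1, 2, 3, 4),
  sex         = c("M", "F", "F", "X"),
  item1       = c(3, 0, 6, 2),      # items are meant to be scored 0 to 4
  pregnancies = c(1, 1, 0, NA),     # should be left blank (NA) for males
  stringsAsFactors = FALSE
)

# Allowed values for each variable we want to screen
allowed <- list(
  sex   = c("M", "F"),
  item1 = 0:4
)

# Return one row per entry that is not among the allowed values.
# Note that a missing value (NA) is also reported as a problem here.
check_allowed <- function(dat, allowed) {
  problems <- lapply(names(allowed), function(v) {
    bad <- which(!(dat[[v]] %in% allowed[[v]]))
    data.frame(row      = bad,
               variable = rep(v, length(bad)),
               value    = as.character(dat[[v]][bad]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, problems)
}

problems <- check_allowed(entries, allowed)
print(problems)

# A simple skip rule: males should have no answer to the pregnancy item
skip_violations <- which(entries$sex == "M" & !is.na(entries$pregnancies))
print(entries$id[skip_violations])
```

This only reports problems once the data are already in a data frame; it does not refuse bad values at entry time, which is what the old SPSS program did.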

That saved me, I'm sure, weeks wasted on analysing data that turned out to be awful, and I'd love to see someone build something to replace it.

Currently I tend to use (boo hiss) Excel for this as everyone I work with seems to have it (and not all can install OpenOffice, and anyway I haven't had time to learn that properly yet either ...) and I set up spreadsheets with validation rules set. That doesn't give me the branching rules and checks (e.g. if male, skip questions about periods, PMT and pregnancies), or at least, with my poor Excel skills it doesn't. I just skip a column to indicate page breaks in the q'aire, and I get, when I can, two people to enter the data separately and then use R to compare the two spreadsheets, having yanked them into data frames.
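A minimal sketch of that sort of double-entry comparison in R, under stated assumptions: the two data frames below stand in for the two spreadsheets (in practice they might come from read.csv() on files exported from Excel), and the column names and the helper compare_entries() are hypothetical, invented for this illustration.

```r
# Two hypothetical data frames standing in for the two independently
# entered spreadsheets; all names and values are made up.
entry1 <- data.frame(id = c(101, 102, 103),
                     age = c(34, 27, 45),
                     item1 = c(2, 4, 1))
entry2 <- data.frame(id = c(101, 102, 103),
                     age = c(34, 72, 45),
                     item1 = c(2, 4, 0))

# List every cell where the two entries disagree, matched on the ID.
# Assumes both data frames have the same columns and the same set of IDs.
compare_entries <- function(a, b, id_var = "id") {
  stopifnot(identical(names(a), names(b)), nrow(a) == nrow(b))
  a <- a[order(a[[id_var]]), ]
  b <- b[order(b[[id_var]]), ]
  stopifnot(all(a[[id_var]] == b[[id_var]]))
  vars <- setdiff(names(a), id_var)
  out <- lapply(vars, function(v) {
    x <- a[[v]]
    y <- b[[v]]
    # A mismatch is either a differing value or a value missing in one only
    differs <- (is.na(x) != is.na(y)) | (!is.na(x) & !is.na(y) & x != y)
    idx <- which(differs)
    data.frame(id       = a[[id_var]][idx],
               variable = rep(v, length(idx)),
               first    = as.character(x[idx]),
               second   = as.character(y[idx]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

mismatches <- compare_entries(entry1, entry2)
print(mismatches)
```

Each row of the result names an ID, a variable and the two conflicting values, so someone can go back to the paper record and see which entry was right.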

I would really, really love someone to develop, or perhaps replace, the rather buggy edit() and fix() routines (they seem to hang on big data frames in Rcmdr, which is what I'm trying to get students onto) so that we had something that did some or all of what SPSS/DE used to do for me, or what I now bodge in Excel. If any generous coding whiz were willing to do this, I'll try to alpha and beta test and write help etc.

There _may_ be good open source things out there that do what I need but something that really integrated into R would be another huge step forward in being able to phase out SPSS in my work settings and phase in R.

Very best all,


Chris Evans    Skype: chris-psyctc
Professor of Psychotherapy, Nottingham University;
Consultant Psychiatrist in Psychotherapy, Notts PDD network;
Research Programmes Director, Nottinghamshire NHS Trust;
*If I am writing from one of those roles, it will be clear. Otherwise*
*my views are my own and not representative of those institutions    *

Received on Fri 08 Jun 2007 - 17:32:39 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.