Re: [R] Non-normal data issues in PhD software engineering experiment

From: hadley wickham <>
Date: Thu, 10 Jul 2008 12:07:06 -0500

On Thu, Jul 10, 2008 at 11:06 AM, Daniel Malter <> wrote:
> I hope you don't really want our patients :)
> It looks that you have an experiment with two groups. You have several
> trials for each group. And within each trial you observe your units at
> distinct points in time.
> The first advice for you is to graphically display your data. Before you
> start modeling your data wrong, you should have a strong feeling what the
> right approach will be. If your data is nonlinear, for example, you will
> take a different approach than when it is linear. So what I suggest you do is to
> plot your Ys (dependent variables) against time for each of your trials,
> optimally two plots, one for each group (but multiple plots are also okay).
> These plots should give you a firm intuition about how your dependent
> variable develops over time for each group. The modeling of your data in a
> regression model then depends on the presumed functional relationship
> between your dependent variable and your independent variables (time and
> group). An important question is the distribution of your dependent
> variable. Is it normally distributed? Or is it a proportion? All this is
> important information in deciding how to model your problem.

I'd suggest starting with looking at the overall distribution of sensitivity:

library(ggplot2)
exp <- read.csv("data.csv")

qplot(sensitivity, geom="histogram", data=exp, binwidth=.05)

This is revealing - sensitivity is discrete and quite clumpy. You could then look at this distribution conditioned on version and paradigm:

qplot(sensitivity, geom="histogram", data=exp, binwidth=.05, facets = version ~ paradigm)

This is a complex plot, but it rewards detailed study (and suggests that accurate modelling is going to be challenging). There's a clear change in sensitivity in paradigm one after version 3, and in paradigm two, versions 4, 9 and 10 look unusual.
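The clumpiness can also be checked numerically. A rough sketch, assuming the column names used in the plots above (distinct_values is just a helper name made up for this example, not something from any package):

```r
# Hypothetical helper: number of distinct sensitivity values
# in each version x paradigm cell.
# Assumes the data frame has columns sensitivity, version and paradigm.
distinct_values <- function(d) {
  with(d, tapply(sensitivity,
                 list(version = version, paradigm = paradigm),
                 function(x) length(unique(x))))
}
```

table(exp$sensitivity) would then give the overall counts per value, and distinct_values(exp) the number of distinct values in each cell. Cells with very few distinct values are probably the ones that look odd in the facetted histogram.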

Looking at the scatterplot of sensitivity vs version:

qplot(version, sensitivity, data=exp, colour=factor(paradigm))

isn't very helpful because the discrete values of sensitivity mean that many of the points are overplotted. Jittering the points and adding a smoothed line for each group helps a little, but it's not as revealing as the histograms.

qplot(version, sensitivity, data=exp, colour=factor(paradigm), geom="jitter") + geom_smooth()
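Another way to deal with the overplotting, instead of jittering, is to map the number of coincident points to point size. A minimal sketch, assuming a ggplot2 version that provides geom_count() (2.0.0 or later) and the same column names as above; count_plot is a made-up name for this example:

```r
library(ggplot2)

# Hypothetical helper: draw one point per distinct (version, sensitivity)
# pair and paradigm, with point size proportional to how many
# observations fall on it.
# Assumes columns version, sensitivity and paradigm.
count_plot <- function(d) {
  ggplot(d, aes(version, sensitivity, colour = factor(paradigm))) +
    geom_count(alpha = 0.5)
}
```

count_plot(exp) draws the plot for the data read in above. Whether it is more revealing than the jittered version will depend on how heavily the two paradigms overlap.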



Received on Thu 10 Jul 2008 - 17:12:24 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.