# Re: [R] Non-normal data issues in PhD software engineering experiment

Date: Thu, 10 Jul 2008 12:07:06 -0500

On Thu, Jul 10, 2008 at 11:06 AM, Daniel Malter <daniel_at_umd.edu> wrote:
>
> I hope you don't really want our patients :)
>
> It looks like you have an experiment with two groups. You have several
> trials for each group. And within each trial you observe your units at
> distinct points in time.
>
> The first advice for you is to graphically display your data. Before you
> start modeling your data wrong, you should have a strong feeling what the
> right approach will be. If your data is nonlinear, for example, you will
> take a different approach than when it is linear. So what I suggest you do is to
> plot your Ys (dependent variables) against time for each of your trials,
> optimally two plots, one for each group (but multiple plots are also okay).
> These plots should give you a firm intuition about how your dependent
> variable develops over time for each group. The modeling of your data in a
> regression model then depends on the presumed functional relationship
> between your Ys and the explanatory variables (time and
> group). An important question is the distribution of your dependent
> variable. Is it normally distributed? Or is it a proportion? All this is
> important information in deciding how to model your problem.

I'd suggest starting with looking at the overall distribution of sensitivity:

library(ggplot2)

qplot(sensitivity, geom="histogram", data=exp, binwidth=.05)

This is revealing - sensitivity is discrete and quite clumpy. You could then look at this distribution conditioned on version and paradigm:

qplot(sensitivity, geom="histogram", data=exp, binwidth=.05, facets = version ~ paradigm)

This is a complex plot, but it rewards detailed study (and suggests that accurate modelling is going to be challenging). There's a clear change in sensitivity in paradigm one after version 3, and in paradigm two, versions 4, 9 and 10 look unusual.
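A numeric cross-check can sit alongside the histograms. The sketch below is only illustrative: the poster's data frame is not reproduced in this thread, so `sim` is a simulated, hypothetical stand-in with the same column names (`sensitivity`, `version`, `paradigm`), and the rounding to multiples of 0.05 is an assumption made purely to mimic a discrete response.

```r
# Hypothetical stand-in for the poster's data; the real values are not
# shown in this thread, so everything here is simulated.
set.seed(1)
sim <- expand.grid(version = 1:10, paradigm = 1:2, trial = 1:20)

# Assumption: round to multiples of 0.05 to mimic a discrete, clumpy response
sim$sensitivity <- round(runif(nrow(sim)) * 20) / 20

# How many distinct values does the response take, and how often does each occur?
value_counts <- table(sim$sensitivity)
print(value_counts)

# Median sensitivity in each version x paradigm cell, to compare with the facets
cell_medians <- aggregate(sensitivity ~ version + paradigm, data = sim, FUN = median)
print(head(cell_medians))
```

On the real data, a short table of distinct values would confirm the discreteness, and the per-cell medians give something concrete to set against what the faceted histograms suggest.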

Looking at the scatterplot of sensitivity vs version isn't very helpful because the discrete values of sensitivity mean that many of the points are overplotted. Jittering the points and adding a smoothed line for each group helps a little, but it's not as revealing as the histograms:

qplot(version, sensitivity, data=exp, colour=factor(paradigm), geom="jitter") + geom_smooth()
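If you want to see how bad the overplotting is, one option (not something used above, just a base R sketch) is to count the observations at each (version, sensitivity) pair and draw each distinct point with area proportional to that count. The data frame `sim` below is again simulated and hypothetical, since the real data are not available here.

```r
# Simulated, hypothetical data with a discrete response (not the poster's data)
set.seed(2)
sim <- data.frame(
  version     = sample(1:10, 300, replace = TRUE),
  sensitivity = sample(seq(0, 1, by = 0.05), 300, replace = TRUE)
)

# Count the observations that fall on each (version, sensitivity) pair
counts <- as.data.frame(table(version = sim$version, sensitivity = sim$sensitivity))
counts <- counts[counts$Freq > 0, ]

# table() returns factors; convert them back to numbers for plotting
counts$version     <- as.numeric(as.character(counts$version))
counts$sensitivity <- as.numeric(as.character(counts$sensitivity))

# Share of observations that sit exactly on top of at least one other
overplotted <- sum(counts$Freq[counts$Freq > 1]) / nrow(sim)
print(overplotted)

# One symbol per distinct point; sqrt() makes symbol area proportional to count
plot(counts$version, counts$sensitivity, cex = sqrt(counts$Freq),
     xlab = "version", ylab = "sensitivity")
```

Unlike jittering, this keeps every point at its true position, at the cost of having to read symbol size as a count.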
