[R] Non-normal data issues in PhD software engineering experiment

From: Andrew Jackson <ajackso_at_cs.tcd.ie>
Date: Thu, 10 Jul 2008 16:15:42 +0100 (IST)


Hi All,

This is a rather general post to begin with, because I need to give you some important context. There are very R-specific elements further along in this rather long posting, so I thank you in advance for your patience.

I will do my best to clearly explain my experiment, data, goals, and the problems I have. Please let me know if I have left out any vital information, or if there is any ambiguity I need to address so that you can help me.

I have a very limited background in statistics - I have just completed a postgraduate course in Statistics at Trinity College Dublin, Ireland. So I have the basics and not much more.

I would also like to say up front that I am not the most gifted at maths. With that in mind, I would appreciate it if, when responding with long equations and mathematical notation, you could also describe at a high level what each equation does or represents.

For each version and paradigm type I have collected a data set of measures called sensitivity measures. So, for instance, I have 20 different data sets: 10 for the 10 versions of software under design paradigm 1, and 10 for the 10 versions under design paradigm 2.

My data can be found at - https://www.cs.tcd.ie/~ajackso/data.csv

In this data file there are a number of columns - "version","paradigm","location","coverage","execution","infection","propogation","sensitivity"

Sensitivity is the main response - please ignore "coverage", "execution", "infection" and "propogation", as these were only used to calculate sensitivity.

All 20 of my data sets are in this file - the columns version (1 - 10) and paradigm (1 or 2) differentiate them.
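For reference, this is roughly how I read the file and split it into the 20 groups. (The data frame below is a hypothetical stand-in with the same columns, so the snippet is self-contained; in practice it would be `data <- read.csv("data.csv")`.)

```r
# In practice: data <- read.csv("data.csv")
# Hypothetical stand-in with the same layout, for illustration:
set.seed(42)
data <- data.frame(
  version     = factor(rep(rep(1:10, each = 5), 2)),
  paradigm    = factor(rep(1:2, each = 50)),
  sensitivity = runif(100)
)

# One vector of sensitivity values per version x paradigm combination
groups <- split(data$sensitivity, interaction(data$version, data$paradigm))
length(groups)  # 20 groups
```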

1) The problem I have is that the 20 data sets are of variable size. These data sets are also non-normal. I have assessed this using normality tests (ad.test etc. in R, and in Minitab). So, as far as I understand it, I had two choices: the first is to transform my non-normal data into normal data; the second is to use non-parametric approaches.
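To show what I mean by assessing normality per group, here is a sketch using base R's shapiro.test (ad.test, the Anderson-Darling test I mentioned, is the nortest package's equivalent). The data frame is a hypothetical stand-in with the same shape as my file:

```r
# Hypothetical stand-in for data.csv, with skewed values
set.seed(1)
data <- data.frame(
  version     = factor(rep(rep(1:10, each = 5), 2)),
  paradigm    = factor(rep(1:2, each = 50)),
  sensitivity = rexp(100)
)

# Shapiro-Wilk normality test applied within each of the 20 groups;
# a small p-value rejects normality for that group
pvals <- tapply(data$sensitivity,
                interaction(data$version, data$paradigm),
                function(x) shapiro.test(x)$p.value)
head(pvals)
```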

2) So I tried to use R to conduct a Box-Cox transformation for each of my 20 data sets, but I couldn't figure it out past generating an optimal lambda. I then turned to Minitab and found that I could make transformations there - the problem, however, was that there was a subset group option I didn't understand. I set it to various numbers but always seemed to get the same result, so it didn't seem to affect the outcome much, if at all. The result of this was non-normal data again. I then turned to the Johnson transformation and found that it also failed to transform my non-normal data to normal data.
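For what it's worth, this is as far as I can sketch the R route: find the optimal lambda with MASS::boxcox and then apply the transformation by hand (I am assuming this is the intended workflow; Box-Cox also requires strictly positive data):

```r
library(MASS)  # ships with R; provides boxcox()

set.seed(1)
y <- rexp(50) + 0.01  # hypothetical positive, skewed response

# boxcox() profiles the log-likelihood over a grid of lambda values
bc <- boxcox(lm(y ~ 1), lambda = seq(-2, 2, 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]  # lambda maximising the likelihood

# Apply the Box-Cox transformation with the chosen lambda
y_t <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda

shapiro.test(y_t)  # re-check normality after transforming
```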

3) I have looked at the Friedman test as a means of performing a two-way analysis of variance to address my scenario. I have tried to execute it in R and Minitab but can't figure out what my arguments should be.

Using R: I read my data into a frame using "read.table(data)". I then proceed with the following - friedman.test( data$sensitivity ~ data$paradigm | data$version, data, data$version, na.action=na.exclude). This produces the error "incorrect specification for 'formula'". I see that my formula needs to be of length == 3 for this test to be used (https://svn.r-project.org/R/trunk/src/library/stats/R/friedman.test.R). I don't think my formula should even be like this, but I wanted to stay as close as possible to the example provided by R.
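The closest working call I can construct looks like this (a sketch, assuming paradigm is the treatment and version is the block): use bare column names in the formula together with a `data =` argument, and first collapse each version/paradigm cell to a single value, because friedman.test requires an unreplicated complete block design - exactly one observation per treatment/block combination:

```r
# Hypothetical stand-in for data.csv (unequal cell sizes, as in my data)
set.seed(1)
n <- sample(3:6, 20, replace = TRUE)
cells <- expand.grid(version = factor(1:10), paradigm = factor(1:2))
data <- cells[rep(seq_len(nrow(cells)), times = n), ]
data$sensitivity <- runif(nrow(data))

# Collapse replicates: one median sensitivity per version x paradigm cell
agg <- aggregate(sensitivity ~ paradigm + version, data = data, FUN = median)

# Friedman test: paradigm is the treatment, version is the block
res <- friedman.test(sensitivity ~ paradigm | version, data = agg)
res
```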

I then tried to use kruskal.test as follows - kruskal.test(data$sensitivity ~ data$paradigm, data = data, na.action=na.exclude) - this gave me a result; however, there was no account of the variance between versions.
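In formula-plus-data style (again with a hypothetical stand-in for my file), the Kruskal-Wallis call I mean is:

```r
# Hypothetical data in the same shape as data.csv
set.seed(1)
data <- data.frame(
  version     = factor(rep(rep(1:10, each = 5), 2)),
  paradigm    = factor(rep(1:2, each = 50)),
  sensitivity = runif(100)
)

# Kruskal-Wallis: compares the two paradigms but, as noted, ignores version
res <- kruskal.test(sensitivity ~ paradigm, data = data)
res

# Versions could be examined the same way, within one paradigm:
kruskal.test(sensitivity ~ version, data = subset(data, paradigm == 1))
```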

I have no idea if these tests are the right thing to do here. The Kruskal-Wallis test is advertised as a substitute for one-way ANOVA. My instinct tells me that I need to use friedman.test - but, as you can see, I am not having much luck with it. I have looked at the code in R (see the link above) and can see where it is rejecting my formula - I just don't understand what I need to do to my model for it to be accepted.

4) I have looked at the outputs of kruskal.test and friedman.test, and they differ from an ANOVA table -

By following and executing the R man-page examples I can see that friedman.test produces the following output:

You can also see from the above point that the output of kruskal.test looks similar enough. This is a big contrast to an ANOVA table. In an ANOVA table I can see the components of variance and the significance of each F test. These alternative tests do not seem to provide me this information.

Using Minitab: I go to Stat -> Nonparametrics -> Friedman. This prompts me to provide columns for response, treatment and blocks.

I provide the following:

response <- sensitivity
treatment <- paradigm
blocks <- version

When I try to execute this I get an error; the command Minitab runs is:

Friedman 'sensitivity' 'paradigm' 'version' 'RESI1' 'FITS1'.

Thank you and best regards,
Andrew Jackson



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Received on Thu 10 Jul 2008 - 15:31:06 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 10 Jul 2008 - 17:31:14 GMT.

