From: andrewjacksonTCD <Andrew.Jackson_at_cs.tcd.ie>

Date: Thu, 10 Jul 2008 07:01:29 -0700 (PDT)

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Thu 10 Jul 2008 - 17:22:43 GMT

Title: Non-normal data issues in PhD software engineering experiment

Hi All,

I hope I am not breaching any terms of this forum with this rather general post; there are, however, very R-specific elements in this rather long posting.

I will do my best to explain my experiment, goals and problems clearly, but please let me know if I have left out any vital information, or if there is any ambiguity I need to address, so that you can help me.

I have a very limited background in statistics - I have just completed a postgraduate course in Statistics at TCD Dublin, Ireland.

- Experimental setup ***

I have conducted a software engineering experiment in which I have taken measures of quality for a software system built using 2 different design paradigms (1 and 2) over 10 evolutionary versions of the system (1 - 10). So for each version I have a pair of systems, identical in that they do precisely the same thing and differing only in that they are built using the 2 different design paradigms.

For each version and paradigm type I have collected a data set of measures called sensitivity measures. So I have 20 data sets in total: 10 for the 10 versions of software under design paradigm 1, and 10 for the 10 versions under design paradigm 2.

- Data ***

My data can be found at - https://www.cs.tcd.ie/~ajackso/data.csv

In this data file there are a number of columns - "version","paradigm","location","coverage","execution","infection","propogation","sensitivity"

Sensitivity is the main response - please ignore "coverage","execution","infection","propogation" as these were used to calculate sensitivity.

All 20 of my data sets are in this file - the columns version (1 - 10) and paradigm (1 or 2) differentiate them.

- Goals ***
With this data collected I now want to do a number of things -
- I want to look at the analysis of variance to see if there is a difference in mean between the two paradigms over the 10 versions. I want to remove the version-related variance by blocking on version; with this done I should be able to get a picture of the variance related to paradigm only. My null hypothesis is that the means of the two paradigms are the same. I also want to look at each version individually to see if there is any difference within each pair of system designs.
- I want to create two regression models, one for each paradigm, to let me see how the quality of each paradigm is affected over time (versions). It would also be nice to have both confidence and prediction bands.
- I want to be able to look at the power of all of this, and possibly see how many replications I would need to have solid evidence that one paradigm is different from / the same as / better or worse than the other.
- I am not 100% sure if it's relevant - but the analysis of divergence (something I came across when reading an R book - Introductory Statistics with R - Peter Dalgaard - Springer - p197) may fit what I am looking for to assess the difference between the two regression models stated in goal 2. I think that this will assess the degree to which the regression models diverge over time.
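As a parametric sketch of goals 1 and 2 (hedged: this assumes the usual ANOVA/regression assumptions hold, which is exactly what is in doubt for my data, and uses toy data standing in for data.csv):

```r
## Toy stand-in for data.csv (hypothetical values; real column names).
set.seed(1)
d <- data.frame(version     = rep(1:10, each = 20),
                paradigm    = rep(1:2, times = 100),
                sensitivity = runif(200))

## Goal 1: two-way ANOVA, blocking on version to isolate the paradigm effect.
fit <- aov(sensitivity ~ factor(paradigm) + factor(version), data = d)
summary(fit)

## Goal 2: one regression per paradigm, with confidence and prediction bands.
m1  <- lm(sensitivity ~ version, data = subset(d, paradigm == 1))
new <- data.frame(version = 1:10)
ci  <- predict(m1, new, interval = "confidence")
pi  <- predict(m1, new, interval = "prediction")
```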

- Problems ***

1) The problem I have is that each of the 20 data sets is of a different size. The data sets are also non-normal; I have assessed this using normality tests (ad.test etc. in R, and Minitab). So as far as I understand it I have two choices - the first is to transform my non-normal data into normal data; the second is to use non-parametric approaches.

So I tried to use R to conduct a Box-Cox transformation for each of my 20 data sets, but I couldn't figure it out past generating an optimal lambda. I then turned to Minitab and found that I could make the transformation there - the problem, however, was a subset group option I didn't understand. I set it to various numbers but always seemed to get the same result, so it didn't seem to affect the outcome much, if at all. The result of this was non-normal data again. I then turned to the Johnson transformation and found that it also failed to transform my non-normal data to normal.
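The step I was stuck on - going from the optimal lambda out of MASS::boxcox() to actually transformed data - might look like this (a sketch on toy data standing in for one of my 20 sets; it assumes the response is strictly positive, as Box-Cox requires):

```r
library(MASS)  # for boxcox()

## Toy positively skewed, strictly positive data.
set.seed(1)
y <- rexp(50) + 0.1

## Profile log-likelihood over a grid of lambda values.
bc     <- boxcox(y ~ 1, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]   # lambda maximising the log-likelihood

## Apply the Box-Cox transform (log when lambda is effectively 0).
y.t <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda

## Re-check normality on the transformed values.
shapiro.test(y.t)
```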

2) I have looked at the Friedman test as a means of performing a two-way analysis of variance for my scenario. I have tried to execute it in R and Minitab but can't really figure out what my arguments should be.

Using R: I read my data into a frame using read.table(). I then proceeded with the following - friedman.test(data$sensitivity ~ data$paradigm | data$version, data, data$version, na.action=na.exclude). This produces the error "incorrect specification for 'formula'". I see that my formula needs to be of length == 3 for this test to be used (https://svn.r-project.org/R/trunk/src/library/stats/R/friedman.test.R). I don't think my formula should even look like this, but I wanted to stay as close as possible to the example provided by R.
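(A hedged guess at what the call should look like: friedman.test() wants bare column names in the formula plus a separate data argument, and it requires an unreplicated complete block design - exactly one observation per paradigm x version cell - so my replicated cells would have to be aggregated first, e.g. by the median. Sketched on toy data standing in for data.csv:)

```r
## Toy stand-in for data.csv: several observations per paradigm x version cell.
set.seed(3)
d <- data.frame(version     = rep(1:10, each = 20),
                paradigm    = rep(1:2, times = 100),
                sensitivity = runif(200))

## Collapse to one value (here the median) per paradigm x version cell.
agg <- aggregate(sensitivity ~ paradigm + version, data = d, FUN = median)

## Formula uses bare column names: response ~ groups | blocks.
friedman.test(sensitivity ~ paradigm | version, data = agg)
```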

I then tried to use kruskal.test as follows - kruskal.test(data$sensitivity ~ data$sensitivity, data = data, na.action=na.exclude) - this gave me a result; however, there was no account of the variance between versions.

- kruskal.test(data$sensitivity ~ data$version + data$paradigm, data = sensResults, na.action=na.exclude)

        Kruskal-Wallis rank sum test

data:  data$sensitivity by data$version by data$paradigm
Kruskal-Wallis chi-squared = 12.1449, df = 9, p-value = 0.2053

I have no idea if these tests are the right thing to do here. The Kruskal-Wallis test is advertised as a substitute for one-way ANOVA. My instinct tells me that I need to use friedman.test - but as you can see I am not having much luck with it. I have looked at the R source (linked above) and can see where it is rejecting my formula - I just don't understand what I need to do to my call for it to be accepted.

3) I have looked at the outputs of kruskal.test and friedman.test and they differ from an ANOVA table -

By following and executing the examples in the R man pages I can see that friedman.test produces the following output:

> friedman.test(x ~ w | t, data = wb)

        Friedman rank sum test

data:  x and w and t
Friedman chi-squared = 0.3333, df = 1, p-value = 0.5637

You can also see from the point above that the output of kruskal.test looks similar enough. This is a big contrast to an ANOVA table: in an ANOVA table I can see the components of variance and the significance of each F test. These alternative tests do not seem to provide that information.
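(One possible middle ground, hedged: the rank-transform approach of Conover & Iman - run an ordinary blocked two-way ANOVA on the ranks of the response - gives a familiar ANOVA table while staying broadly non-parametric. Sketched on toy data standing in for data.csv:)

```r
## Toy stand-in for data.csv.
set.seed(4)
d <- data.frame(version     = rep(1:10, each = 20),
                paradigm    = rep(1:2, times = 100),
                sensitivity = runif(200))

## Rank-transform approach: replace the response by its ranks,
## then fit the usual blocked two-way ANOVA.
d$r <- rank(d$sensitivity)
rt  <- aov(r ~ factor(paradigm) + factor(version), data = d)
summary(rt)  # full ANOVA table: df, sums of squares, F and p per term
```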

Using Minitab

I go to stats->nonparametrics->Friedman

This prompts me to provide columns for response, treatment and blocks

I provide the following

response <- sensitivity

treatment <- paradigm

blocks <- version

When I try to execute this I get the following error

Friedman 'sensitivity' 'paradigm' 'version' 'RESI1' 'FITS1'.

- ERROR * Must have one observation per cell.
- ERROR * Completion of computation impossible.

4) I have looked briefly at non-parametric approaches to regression - there seem to be many paths that can be taken (http://socserv.mcmaster.ca/jfox/Courses/Oxford-2005/R-nonparametric-regression.html). I need some guidance on which approach I should follow. What are the trade-offs? How do I do this?
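(A hedged starting point for that last question, on toy data standing in for data.csv: local regression via loess(), one fit per paradigm, with an approximate pointwise 95% band built from the standard errors - one of the simpler routes among those described at the link above.)

```r
## Toy stand-in for data.csv.
set.seed(5)
d <- data.frame(version     = rep(1:10, each = 20),
                paradigm    = rep(1:2, times = 100),
                sensitivity = runif(200))

## One local-regression fit per paradigm.
fit1 <- loess(sensitivity ~ version, data = subset(d, paradigm == 1))
fit2 <- loess(sensitivity ~ version, data = subset(d, paradigm == 2))

## Approximate pointwise 95% confidence band for paradigm 1.
grid <- data.frame(version = 1:10)
p    <- predict(fit1, newdata = grid, se = TRUE)
band <- data.frame(version = grid$version,
                   fit     = p$fit,
                   lower   = p$fit - 1.96 * p$se.fit,
                   upper   = p$fit + 1.96 * p$se.fit)
```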

Thank you and best regards,

Andrew Jackson

--

View this message in context: http://www.nabble.com/Non-normal-data-issues-in-PhD-software-engineering-experiment-tp18383175p18383175.html
Sent from the R help mailing list archive at Nabble.com.


