Re: [R] What to do with this data?

From: Jim Lemon <jim_at_bitwrit.com.au>
Date: Fri, 04 Apr 2008 23:11:05 +1000

mika03 wrote:
> ...
> Here's what we did:
> We showed a fairly large number of subjects search engine queries and
> different possible search engine responses. We assumed that users would like
> some of our responses better than others and wanted to check this. Subjects
> could rate a query/response pair on a scale from 0 (very bad response) to 10
> (very good response).
>
> Here are all the judgments we received for one particular class of response
> to queries which we thought users would like:
>
> Predicted-Good-0, 4
> Predicted-Good-1, 1
> Predicted-Good-2, 11
> Predicted-Good-3, 8
> Predicted-Good-4, 25
> Predicted-Good-5, 12
> Predicted-Good-6, 21
> Predicted-Good-7, 25
> Predicted-Good-8, 30
> Predicted-Good-9, 52
> Predicted-Good-10, 189
>
> And here are all the judgments we received for one particular class of
> response to queries which we thought users would NOT like:
>
> Predicted-Bad-0, 34
> Predicted-Bad-1, 23
> Predicted-Bad-2, 45
> Predicted-Bad-3, 60
> Predicted-Bad-4, 42
> Predicted-Bad-5, 50
> Predicted-Bad-6, 21
> Predicted-Bad-7, 20
> Predicted-Bad-8, 25
> Predicted-Bad-9, 19
> Predicted-Bad-10, 39
>
I interpret these as counts for each option on the scale 0-10.

> Here's a small table listing number of observations, mean, standard
> deviation and standard error:
>
> Type, N, Mean, StDev, StErr
> Predicted-Good, 378, 8.21693121693122, 2.47110906286224, 0.12710013550711
> Predicted-Bad, 378, 4.5978835978836, 3.02059872953413, 0.155362834286119
>
> The questions we have are:
>
> a) It doesn't seem like our data follows a standard distribution. Therefore
> is it okay to calculate mean, standard deviation and standard error at all?
>
Yes, the mean is one way of describing the location of the aggregate response. The median is another. The calculations give sensible numbers, but ...
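For reference, both summaries can be recomputed directly from the counts. This is a minimal sketch, assuming the counts are typed in by hand from the quoted tables; the names good, bad, ratings and loc are only illustrative:

```r
# counts for ratings 0..10, copied from the quoted tables
good <- c(4, 1, 11, 8, 25, 12, 21, 25, 30, 52, 189)
bad  <- c(34, 23, 45, 60, 42, 50, 21, 20, 25, 19, 39)

# expand the counts into one rating per judgment
ratings <- list(good = rep(0:10, good), bad = rep(0:10, bad))

# number of judgments, mean and median for each class
loc <- sapply(ratings, function(x) {
  c(n = length(x), mean = mean(x), median = median(x))
})
print(loc)
```

The medians come out at 9.5 and 4, further apart than the means, which reflects the pile-up of "good" ratings at 10.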

> b) We initially created a figure plotting the mean and a bar around it
> indicating standard deviation. Then somebody who knows more about statistics
> told us we should display the mean and error bars around it "to depict a 95%
> Confidence Interval, mean +/- 1.96*SE". But if we are doing this, aren't we
> forgetting to mention vital parts of our data, that is that we indeed get
> better means for "Good" responses, but that the individual data points are
> all over the place (especially for "Predicted-Bad")? We would capture this
> by showing standard deviation.
>
when you start talking about confidence intervals, you have to assume that your observations come from some distribution whose distribution function is known or can be calculated. As the responses aren't normally distributed, you can't use the normal distribution function to calculate confidence intervals. You could estimate them by bootstrapping, or see below.
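A percentile bootstrap for the mean might look something like this. It is only a sketch: the helper name bootci, the 2000 resamples and the seed are arbitrary assumptions, and the counts are again copied from the quoted tables.

```r
# counts for ratings 0..10, copied from the quoted tables
good <- c(4, 1, 11, 8, 25, 12, 21, 25, 30, 52, 189)
bad  <- c(34, 23, 45, 60, 42, 50, 21, 20, 25, 19, 39)
goodratings <- rep(0:10, good)
badratings  <- rep(0:10, bad)

# percentile bootstrap interval for the mean (hypothetical helper)
bootci <- function(x, nboot = 2000, level = 0.95) {
  # resample the ratings with replacement and record each mean
  bootmeans <- replicate(nboot, mean(sample(x, replace = TRUE)))
  alpha <- (1 - level) / 2
  quantile(bootmeans, c(alpha, 1 - alpha))
}

set.seed(42)  # only so that the run can be repeated
goodci <- bootci(goodratings)
badci  <- bootci(badratings)
print(rbind(good = goodci, bad = badci))
```

Because the ratings are resampled as they are, no distributional form has to be assumed for either class.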

> c) And finally: What would be the best way to present this data anyway?
>
Here's a start - cmdf is a data frame with two columns, good (counts of "good" responses) and bad (counts of "bad" responses):

# counts of ratings 0-10, taken from the tables quoted above
cmdf<-data.frame(good=c(4,1,11,8,25,12,21,25,30,52,189),
  bad=c(34,23,45,60,42,50,21,20,25,19,39))
# expand the counts into individual ratings
goodratings<-rep(0:10,cmdf$good)
badratings<-rep(0:10,cmdf$bad)
goodmean<-mean(goodratings)
badmean<-mean(badratings)
goodmad<-mad(goodratings)
badmad<-mad(badratings)
plot(0:10,cmdf$good,pch=1,col=3,type="b",
  main="Distribution of response ratings",xlab="Rating",ylab="Count")
points(0:10,cmdf$bad,pch=2,col=2,type="b")
# means at y=150, with bars extending one MAD to either side
points(goodmean,150,pch=1,col=3)
points(badmean,150,pch=2,col=2)
arrows(goodmean+c(-0.1,0.1),150,goodmean+c(-goodmad,goodmad),150,
  angle=90,col=3)
arrows(badmean+c(-0.1,0.1),150,badmean+c(-badmad,badmad),150,
  angle=90,col=2)
text(goodmean,170,"Good mean",col=3)
text(badmean,170,"Bad mean",col=2)

I'm being lazy here; you probably want confidence intervals, either bootstrapped or based on the assumption that "good" responses are exponentially distributed and "bad" ones uniformly distributed.
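Before leaning on the uniform assumption for the "bad" responses, it may be worth checking how well it fits. A rough sketch, assuming a chi-squared goodness-of-fit test against equal probabilities for the eleven ratings is an acceptable check (the name unifcheck is only illustrative):

```r
# counts of "bad" ratings 0..10, copied from the quoted table
bad <- c(34, 23, 45, 60, 42, 50, 21, 20, 25, 19, 39)

# with a single vector of counts, chisq.test() tests against
# equal probabilities for all categories
unifcheck <- chisq.test(bad)
print(unifcheck)
```

A small p-value would suggest that the "bad" ratings are not spread evenly over the scale, in which case the bootstrap is probably the safer route.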

Jim



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Fri 04 Apr 2008 - 12:07:06 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 04 Apr 2008 - 12:30:26 GMT.
