Re: [R] Simple qqplot question

From: Bert Gunter <gunter.berton_at_gene.com>
Date: Fri, 25 Jun 2010 09:02:41 -0700

To add to/modify what Joris (and I) previously said:

  1. qqplots are not cumulative distribution plots. Hence, as Joris said, the S-shape indicates short tails/bimodality compared to the normal. Why you continue to insist on carrying out normality tests that, with so many points, will obviously reject is beyond me! The bimodality is what's important. Why is it there? What is it telling you about your data (perhaps some sort of measurement shift...)?
  2. My prior suggestion for plotting a reference line -- and Joris's confidence interval recommendations -- are in some sense wrong. The reason is that they give the conditional expectation, and confidence intervals thereof, of the quantiles of the "y" distribution conditioned on those of the "x". What you probably want is the "correlation" line. One simple "robust" estimate of this, and one that is quick to calculate, is to mimic qqline(): calculate the 1st and 3rd quartiles of both distributions and use the line joining the corresponding quartile pairs, (1st,1st) and (3rd,3rd). I leave the trivial algebra to you -- quantile() gets the quartiles.
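
For what it is worth, the quartile-pair line described in point 2 could be sketched roughly as follows. This is only an illustration: qq_quartile_line() is a hypothetical helper name (not a base R function), and data/data2 here are simulated stand-ins for the original poster's vectors.

```r
# Hypothetical helper (not part of base R): intercept and slope of the
# line through the (Q1, Q1) and (Q3, Q3) quartile pairs of x and y.
qq_quartile_line <- function(x, y, probs = c(0.25, 0.75)) {
  qx <- quantile(x, probs, names = FALSE, na.rm = TRUE)
  qy <- quantile(y, probs, names = FALSE, na.rm = TRUE)
  slope <- diff(qy) / diff(qx)          # rise over run between the two pairs
  intercept <- qy[1] - slope * qx[1]    # line passes through (qx[1], qy[1])
  c(intercept = intercept, slope = slope)
}

# Simulated stand-ins for the poster's data and data2 (assumption)
set.seed(1)
data  <- rnorm(1e4)
data2 <- 2 * rnorm(1e4) + 1

ab <- qq_quartile_line(data, data2)
qqplot(data, data2, xlab = "1", ylab = "2")
abline(a = ab[["intercept"]], b = ab[["slope"]], col = "red")
```

Because only the two middle quartiles enter the calculation, a few extreme points in either tail do not move the line, which is the sense in which it is "robust".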

Of course, there's a literature on this if you want to do something authoritative -- and perhaps R functions somewhere based on it. Perhaps some kind (and wiser than I) soul will provide references.

(However, I doubt that the line so obtained will differ appreciably from my earlier "incorrect" recommendation, which was probably good enough for eyeballing in most cases.)

Finally, risking hubris again, I would suggest that if the two distributions with so many points really are essentially identical, then this is scientifically "uninteresting" -- that is, the identity is a logical (and trivial) consequence of the systematic way in which the data were obtained, some sort of software (data collection?) issue, or the like -- i.e. not indicative of a scientifically interesting phenomenon. It might even indicate a problem with the data/measurements. My reasoning: real variability prohibits such identity. The identical bimodality may be a clue here. Again, note that I know nothing about what you are doing, and you are therefore justified in publicly chastising me for such ignorant speculation if I am wrong.

I would welcome comments and criticisms from others on such speculation also.

HTH,

Bert Gunter
Genentech Nonclinical Biostatistics    

-----Original Message-----
From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On Behalf Of Joris Meys
Sent: Friday, June 25, 2010 2:15 AM
To: Ralf B
Cc: R mailing list
Subject: Re: [R] Simple qqplot question

Sorry, missed the two variable thing. Go with the lm solution then, and you can tweak the plot yourself (the confidence intervals are easily obtained via predict(lm.object, interval="prediction") ). The function qq.plot uses robust regression, but in your case normal regression will do.
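
A rough sketch of the lm approach, assuming simulated stand-ins for data and data2 (the names qq, qq_df, fit and band are illustrative only). The regression is fitted to the sorted value pairs that qqplot() would draw, so the band is best read as an eyeballing aid rather than a formal interval.

```r
# Simulated stand-ins for the poster's two samples (assumption)
set.seed(2)
data  <- rnorm(500)
data2 <- rnorm(500)

# With plot.it = FALSE, qqplot() returns the sorted (x, y) pairs
# instead of drawing them.
qq <- qqplot(data, data2, plot.it = FALSE)
qq_df <- data.frame(x = qq$x, y = qq$y)

# Ordinary least-squares line through the quantile pairs
fit <- lm(y ~ x, data = qq_df)

# Prediction band evaluated at the observed x quantiles;
# the result has columns "fit", "lwr" and "upr".
band <- predict(fit, newdata = qq_df, interval = "prediction")

plot(qq_df$x, qq_df$y, xlab = "1", ylab = "2")
abline(fit, col = "red")
matlines(qq_df$x, band[, c("lwr", "upr")], lty = 2, col = "blue")
```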

Regarding the shapes: this just indicates both tails are shorter than expected, so you would expect a kurtosis less than 3 (or negative, if you subtract 3 to get the excess kurtosis).

Cheers
Joris

On Fri, Jun 25, 2010 at 4:10 AM, Ralf B <ralf.bierig_at_gmail.com> wrote:
> Short rep: I have two distributions, data and data2; each built from
> about 3 million data points; they appear similar when looking at
> densities and histograms. I plotted qqplots for further eye-balling:
>
> qqplot(data, data2, xlab = "1", ylab = "2")
>
> and get an almost perfect diagonal line which means they are in fact
> very alike. Now I tried to check normality using qqnorm -- and I think
> I am doing something wrong here:
>
> qqnorm(data, main = "Q-Q normality plot for 1")
> qqnorm(data2, main = "Q-Q normality plot for 2")
>
> I am getting perfect S-shaped curves (??) for both distributions. Am I
> missing something here?
>
> |
> |                               *  *   *  *
> |                           *
> |                        *
> |                    *
> |               *
> |            *
> |         *
> | * * *
> |---------------------------------------------
>
> Thanks, Ralf
>

-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
Joris.Meys_at_Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Fri 25 Jun 2010 - 16:05:32 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 25 Jun 2010 - 16:40:35 GMT.