Re: [R] Difficulty with qqline in logarithmic context

From: François Pinard <pinard_at_iro.umontreal.ca>
Date: Sat 04 Feb 2006 - 06:08:32 EST

[Brian Ripley]
>Is there a good reason to use qqnorm in a single-log context?

Yes. Googling around reveals this is not so uncommon.

> Should one not rather use

>>qqnorm(log(freq))
>>qqline(log(freq))

In the display produced by "qqnorm", the y-axis would then show "log(value)" labels, while the user (me!) expects "value" labels.

>since you are (I guess) looking at log-normality of freq?

Once again, I was merely toying with "qqplot". I found it intriguing that, while shuffling messages around between folders for a good while, the distribution of log(number of messages) per folder appears vaguely normal, as I do not readily see a reasonable justification for this.

>Another way to look at that is

>>qqplot(qlnorm(ppoints(length(freq))), freq, log="xy")

>the same plot, different scales.

Interesting, thanks for teaching me about "ppoints". Yet, I remain happier with the abscissa scale produced by "qqnorm". Besides, how would one use "qqline" with the above?
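
One possible answer, as a minimal sketch only: do by hand what "qqline" does, but in log10 coordinates on both axes. The helper name "lnorm_line" is hypothetical, not part of R, and the sketch assumes that "abline" draws in the transformed (log10) coordinate system when its "untf" argument is left at FALSE, as its help page describes in current versions of R.

```r
# Hypothetical helper (not part of R): intercept and slope, in log10
# coordinates, of the line through the quartiles of a log-log QQ plot
# of y against the standard log-normal distribution.
lnorm_line <- function(y, probs = c(0.25, 0.75)) {
    ly <- quantile(log10(y), probs, names = FALSE, na.rm = TRUE)
    lx <- log10(qlnorm(probs))          # same as qnorm(probs) / log(10)
    slope <- diff(ly) / diff(lx)
    c(intercept = ly[1] - slope * lx[1], slope = slope)
}

# Usage, with 'freq' as in the original message:
#   qqplot(qlnorm(ppoints(length(freq))), freq, log = "xy")
#   cf <- lnorm_line(freq)
#   abline(cf[["intercept"]], cf[["slope"]])
```

The quartiles are taken on log10(y) rather than on y, so that the line agrees with what "qqline(log10(freq))" would compute.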

>(I believe a QQ plot should always have comparable scales on the two
>axes.)

While comparable scales are somewhat simpler to compare, they are not necessarily what is most adequate for the user. The proof is that, while quantiles are being compared here, the scales do not show quantiles, but units meaningful to the user. One might want to compare variables scaled very differently, maybe because of different units from the same distribution, or from different but similar distributions using different scales and shifted to different means. Or even, why not, a log scale, if this is what is meaningful for users.

>The point is that qqline is tied to normality, not to log-normality.

As it stands, yes. As a convenience, it could be extended (probably easily) to log-normality. "qqnorm" already does something sensible in log-context, so a user might expect "qqline" to do equally well.

The real point might be that "qqline" is tied to "abline" a bit too blindly. What is the meaning of the intercept and slope of a straight line on a graphic in log context? First, the intercept might not even exist. Second, the "abline" interpretation depends on the clipping, and possibly on the extrema of the pretty breakpoints chosen for scales, which makes it hard to predict in ordinary use. There ought to be some reason for the log-aware code in "abline", yet I did not find documentation for it.
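
To make the question concrete, here is a small sketch of the two possible readings of a line with intercept a and slope b once the y axis is logarithmic. It assumes what the "abline" help page says about the "untf" argument in current versions of R: by default the line is drawn in the transformed coordinate system, that is on log10(y), while untf=TRUE draws the curve y = a + b*x of the original coordinates. The two function names are hypothetical.

```r
# y values reached at abscissa x under each reading of abline(a, b)
# when the y axis is logarithmic (hypothetical helpers, not part of R).
line_transformed <- function(x, a, b) 10^(a + b * x)  # straight on a log-y plot
line_original    <- function(x, a, b) a + b * x       # curved on a log-y plot

x <- c(-2, 0, 2)
line_transformed(x, a = 2, b = 0.5)   # 10, 100, 1000
line_original(x, a = 2, b = 0.5)      # 1, 2, 3

# On an actual plot:
#   plot(x, c(10, 100, 1000), log = "y", ylim = c(1, 1000))
#   abline(2, 0.5)               # passes through the three points
#   abline(2, 0.5, untf = TRUE)  # the curve y = 2 + 0.5 * x
```

Under the first reading, an intercept and slope computed from untransformed data, as "qqline(freq)" does, get interpreted as if they were in log10 units, which would explain a line that looks random.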

The wisest for "abline", in my very humble opinion, would be for it to complain whenever it is called in log context. Then "qqline" would indirectly complain through "abline", unless "qqline" is modified to do something more proper. Moreover, if it is definitely out of the question that "qqline" ever be meaningfully called in log context, then the same holds for "qqnorm", which should then complain as well.

Currently, "qqline" misbehaves, in that it silently produces a meaningless result, while it could either diagnose that the result is meaningless, or produce a meaningful result.
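
As an illustration of the second option, here is a minimal sketch of a log-aware variant. The names "qqline_coef" and "qqline_log" are hypothetical, the "datax" argument is not handled, and the sketch again assumes that "abline" draws in log10 coordinates on a log axis when "untf" is left at FALSE. It is not a patch against the real "qqline".

```r
# Hypothetical log-aware variant of qqline (datax is not handled).
# With ylog = TRUE the quartile line is computed on log10(y), which is
# assumed to be the coordinate system abline uses on a log y axis.
qqline_coef <- function(y, ylog = FALSE, probs = c(0.25, 0.75)) {
    if (ylog) y <- log10(y)
    qy <- quantile(y, probs, names = FALSE, na.rm = TRUE)
    qx <- qnorm(probs)
    slope <- diff(qy) / diff(qx)
    c(intercept = qy[1] - slope * qx[1], slope = slope)
}

qqline_log <- function(y, ...) {
    # par("ylog") is TRUE after a plot drawn with log = "y"
    cf <- qqline_coef(y, ylog = par("ylog"))
    abline(cf[["intercept"]], cf[["slope"]], ...)
}

# Usage:
#   qqnorm(freq, log = "y")
#   qqline_log(freq)
```

Keeping the computation in "qqline_coef" separate from the drawing makes the coefficients easy to inspect without opening a graphics device.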

[Remainder of the reply top-quoted, as usual on r-help.]

>On Wed, 1 Feb 2006, François Pinard wrote:

>>Hi, R friends. I had some difficulty with the following code:

>> qqnorm(freq, log='y')
>> qqline(freq)

>>as the line drawn was seemingly random. The exact data I used appears
>>below. After wandering a bit within the source code for "abline",
>>I figured out I should rather write:

>> qqnorm(freq, log='y')
>> par(ylog=FALSE)
>> qqline(log10(freq))
>> par(ylog=TRUE)

>>I'm proposing that this little stunt rather be hidden and
>>automatically effected within "qqline" proper, whenever par('ylog') is
>>TRUE. I thought about providing a patch, as "qqline" is so small. Yet
>>it would be more noise than useful, as I'm not familiar with the "datax"
>>argument usage, which should probably be addressed as well.

>>Here is the data, in case useful:

>>freq <-
>>as.integer(c(33, 79, 21, 436, 58, 18, 1106, 498, 1567, 393, 2,
>>104, 50, 67, 113, 76, 327, 331, 196, 145, 86, 59, 12, 215, 293,
>>154, 500, 314, 246, 587, 85, 23, 323, 3, 13, 576, 29, 37, 24,
>>21, 1230, 137, 13, 93, 3, 101, 72, 218, 59, 17, 2, 8, 86, 143,
>>150, 22, 19, 234, 119, 157, 4, 255, 146, 126, 76, 15, 271, 170,
>>4, 6, 16, 3048, 2175, 3350, 5017, 5706, 1610, 665, 322, 1, 16,
>>47, 51, 168, 94, 66, 154, 99, 11, 547, 953, 1, 1071, 80, 184,
>>168, 52, 187, 103, 187, 361, 46, 85, 135, 597, 121, 283, 26,
>>12, 20, 169, 9, 79, 15, 114, 75, 30, 111, 556, 173, 32, 99, 438,
>>2, 2, 1, 117, 5, 3, 51, 8, 41, 12, 23, 2, 13, 5, 1, 9, 4, 1,
>>7, 15, 5, 48, 16, 112, 6, 1, 39, 60, 5, 23, 5, 19, 1, 8, 32,
>>4, 13, 1, 14, 71, 5, 1, 35, 30, 100, 389, 22, 8, 1, 192, 40,
>>6, 3, 17, 2, 14, 71, 14, 1, 5, 4, 32, 21, 18, 13, 2, 2, 45, 342,
>>46, 144, 18, 131, 188, 112, 37, 85, 90, 8, 195, 173, 5, 53, 96,
>>37, 16, 16, 281, 64, 50, 92, 336, 31, 744, 4, 134, 74, 1, 227,
>>6, 48, 418, 64, 66, 59, 20, 45, 20, 370, 148, 22, 7, 30, 601,
>>29, 82, 113, 938, 252, 65, 137, 72, 22, 98, 12, 152, 212, 13,
>>8, 35, 3, 77))

>>Yet this really is the value of "courriel$freq" after "data(courriel)",
>>with a file ".../R/data/courriel.R" here, holding:

>>courriel <- read.table(pipe('grep -c \'^From \' ../courriel/*'),
>> sep=':', as.is=TRUE, row.names=1,
>> col.names=c('fichier', 'freq'))

>>My goal, which is nothing serious, was merely to toy with the number of
>>messages per folder, for folders massaged out of R archives.

>>Version:
>>platform = i686-pc-linux-gnu
>>arch = i686
>>os = linux-gnu
>>system = i686, linux-gnu
>>status =
>>major = 2
>>minor = 2.1
>>year = 2005
>>month = 12
>>day = 20
>>svn rev = 36812
>>language = R

>>Locale:
>>LC_CTYPE=fr_CA.UTF-8;LC_NUMERIC=C;LC_TIME=fr_CA.UTF-8;LC_COLLATE=fr_CA.UTF-8;LC_MONETARY=fr_CA.UTF-8;LC_MESSAGES=fr_CA.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C

>>Search Path:
>>.GlobalEnv, package:methods, package:stats, package:graphics,
>>package:grDevices, package:utils, package:datasets, fp.etc, Autoloads,
>>package:base

>>--
>>François Pinard http://pinard.progiciels-bpi.ca

>>______________________________________________
>>R-help@stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide!
>>http://www.R-project.org/posting-guide.html

>--
>Brian D. Ripley, ripley@stats.ox.ac.uk
>Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
>University of Oxford, Tel: +44 1865 272861 (self)
>1 South Parks Road, +44 1865 272866 (PA)
>Oxford OX1 3TG, UK Fax: +44 1865 272595

-- 
François Pinard   http://pinard.progiciels-bpi.ca

Received on Sat Feb 04 06:20:02 2006