Re: [R] Scatterplot Showing All Points

From: Duncan Murdoch <murdoch_at_stats.uwo.ca>
Date: Tue, 18 Dec 2007 10:49:50 -0500

On 18/12/2007 10:01 AM, Antony Unwin wrote:
> On 18 Dec 2007, at 2:42 pm, Duncan Murdoch wrote:
>

>>> (I must admit to being very surprised that jittering and  
>>> sunflower  plots have been suggested for a dataset of 5000  
>>> points.  Do those who  mentioned these methods have examples on  
>>> that scale where they are  effective?)
>> Sure.  The original post said there were about 50-60 unique  
>> locations. This plot:
>>
>> x <- rbinom(5000, 20, 0.15)
>> y <- rbinom(5000, 20, 0.15)
>> plot(x,y)
>>
>> has a few more unique locations; tune those probabilities if you  
>> want it closer.  Due to the overlap, the distribution is very  
>> unclear.  But this plot
>>
>> plot(jitter(x), jitter(y))
>>
>> makes the distribution quite clear.

>
> No it doesn't! It makes it moderately clearer than the plot without
> jittering. One good alternative here is the fluctuation diagram
> variant of a mosaic plot:
>
> xx<-as.factor(x)
> yy<-as.factor(y)
> imosaic(xx,yy, type="f")

That plot is better than jittering, but the mosaic plot has the problem of understanding the scale of the rectangles: is it area or linear size that encodes the count? With a jittered plot, you lose resolution when the number of points gets too high, because you just see a mess of ink, but at least the viewer only has to count in order to get a close numerical reading from the plot.

I could also claim that while imperfect, at least jittering is widely applicable. For example, if the data were not on a regular grid, perhaps because they had been generated like this:

xloc <- rnorm(50)
yloc <- rnorm(50)
index <- sample(1:50, 5000, replace=TRUE, prob = abs(xloc))
x <- xloc[index]
y <- yloc[index]

then jittering still works as well (or as poorly), but the imosaic would not work at all. There are better plots than jittering available, but jittering is easy.

(Actually, with this dataset, plot(jitter(x), jitter(y)) is really poor, because jitter() chooses a bad amount of jittering. But with manual tuning (e.g. plot(jitter(x, amount=0.1), jitter(y, amount=0.1), pch=".")) it's not too bad. So I'd say jittering worked, but the R implementation of it may need improvement.)
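
To make the tuning point concrete, here is a minimal sketch that compares how far jitter() moves the points by default with how far it moves them when the amount is set by hand. It regenerates the irregular data from above; the fixed seed and the names default_shift and manual_shift are assumptions added for illustration. As far as the jitter() documentation describes it, the default amount is derived from the smallest difference between distinct values, which would explain why it comes out so small for irregularly spaced locations.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Irregular locations, sampled unevenly, as in the example above.
xloc <- rnorm(50)
yloc <- rnorm(50)
index <- sample(1:50, 5000, replace = TRUE, prob = abs(xloc))
x <- xloc[index]
y <- yloc[index]

# Largest displacement applied by jitter() with its default settings.
default_shift <- max(abs(jitter(x) - x))

# Largest displacement with the manually chosen amount of 0.1.
manual_shift <- max(abs(jitter(x, amount = 0.1) - x))

# Compare the two; the default is expected to be much smaller here.
print(c(default = default_shift, manual = manual_shift))

# The manually tuned plot, using small plotting symbols.
plot(jitter(x, amount = 0.1), jitter(y, amount = 0.1), pch = ".")
```

The value 0.1 is only the hand-picked amount from the example; other datasets would likely need a different one.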

> Using jittering for categorical data is really not to be recommended
> and will certainly degrade in performance as the dataset gets bigger.

Yes, I probably wouldn't recommend jittering if there were more than a few hundred replications at any point, or more than a few hundred unique points.
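
A rough way to check a dataset against that rule of thumb is to count the unique locations and the replications at each one. This is only a sketch: the helper name location_summary and the cutoff of 300 (standing in for "a few hundred") are assumptions, not anything built into R.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Gridded example data, as in the binomial example earlier in the thread.
x <- rbinom(5000, 20, 0.15)
y <- rbinom(5000, 20, 0.15)

# Hypothetical helper: count unique (x, y) locations and the largest
# number of replications at any single location.
location_summary <- function(x, y, limit = 300) {
  # interaction() forms one factor level per observed (x, y) pair.
  reps <- as.vector(table(interaction(x, y, drop = TRUE)))
  list(
    n_unique  = length(reps),
    max_reps  = max(reps),
    # 'limit' is an assumed stand-in for "a few hundred".
    jitter_ok = length(reps) <= limit && max(reps) <= limit
  )
}

s <- location_summary(x, y)
str(s)
```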

Duncan Murdoch

P.S. iplots 1.1-1 may have an init problem in Windows: in my first attempt, the plot made the boxes too large to fit in their cells, but it fixed itself when I resized the window, and the bug doesn't seem to be repeatable.

>
> Antony Unwin
> Professor of Computer-Oriented Statistics and Data Analysis,
> University of Augsburg,
> Germany



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Tue 18 Dec 2007 - 16:02:06 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 18 Dec 2007 - 18:30:21 GMT.