From: Gabor Grothendieck <ggrothendieck_at_gmail.com>

Date: Tue 28 Jun 2005 - 15:25:40 GMT

R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed Jun 29 01:28:37 2005

Based on Andy's comment, one workaround is to avoid boxplot.formula altogether, e.g. using the data frame d defined by the original poster (see below):

boxplot( by(d, d$b, function(x)x$a) )
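
A similar workaround can be sketched without by(). This is only a minimal illustration under assumptions: the helper name groups_by_level is hypothetical, and set.seed() is added so that the original poster's data frame can be recreated reproducibly. The idea is to build one list element per factor level, so that the empty level "B" keeps its slot, and to hand that list to boxplot().

```r
# Recreate the original poster's data frame
# (set.seed is an addition for reproducibility, not part of the original post)
set.seed(1)
d <- data.frame(a = sort(rnorm(10) * 10),
                b = factor(c(rep("A", 4), rep("C", 6)),
                           levels = c("A", "B", "C")))

# Hypothetical helper: one list element per level, empty levels included
groups_by_level <- function(y, f) {
  stopifnot(is.factor(f))
  out <- lapply(levels(f), function(l) y[!is.na(f) & f == l])
  names(out) <- levels(f)
  out
}

g <- groups_by_level(d$a, d$b)
sapply(g, length)  # number of observations per level, including the empty "B"

# A list argument goes to boxplot.default, bypassing boxplot.formula
boxplot(g)
```

Because the list has three named elements, the plot should reserve a position and a label for "B" even though no box is drawn there.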

On 6/28/05, Liaw, Andy <andy_liaw@merck.com> wrote:

> The issue is not with boxplot, but with split. boxplot.formula()
> calls boxplot(split(mf[[response]], mf[-response]), ...),
> but look at what split() returns when there are empty levels in
> the factor:
>
> > f <- factor(gl(3, 6), levels=1:5)
> > y <- rnorm(f)
> > split(y, f)
> $"1"
> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
>
> $"2"
> [1] -1.1296642 -0.4808355 -0.2789933 0.1220718 0.1287742 -0.7573801
>
> $"3"
> [1] 1.2320902 0.5090700 -1.5508074 2.1373780 1.1681297 -0.7151561
>
> The "culprit" is the following in split.default():
>
> f <- factor(f)
>
> which drops empty levels in f, if there are any. BTW, ?split doesn't
> mention what it does in such a situation. Perhaps it should?
>
> If this is to be "fixed", I suppose an additional argument, e.g.,
> drop=TRUE, can be added, and the corresponding line mentioned
> above changed to something like:
>
> if (drop || !is.factor(f)) f <- factor(f)
>
> Then this additional argument can be passed on from boxplot.formula() to
> split().
>
> Just my $0.02...
>
> Andy
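
The change proposed above can be sketched outside of base R. This is only an illustration under assumptions: split_drop is a hypothetical stand-in, not the actual split.default() source, and the grouping loop is a simplification made for this sketch. It shows how the suggested drop argument would decide whether empty levels survive.

```r
# Hypothetical stand-in for the proposed behaviour; NOT the real split.default
split_drop <- function(x, f, drop = TRUE) {
  # The proposed line: re-factor (and so drop empty levels) only when asked,
  # or when f is not yet a factor
  if (drop || !is.factor(f)) f <- factor(f)
  # Simplified grouping: one list element per remaining level
  out <- lapply(levels(f), function(l) x[!is.na(f) & f == l])
  names(out) <- levels(f)
  out
}

# Same factor as in the example above: levels 4 and 5 are empty
f <- factor(gl(3, 6), levels = 1:5)
y <- seq_along(f)

names(split_drop(y, f))                # drop = TRUE: levels "1", "2", "3" only
names(split_drop(y, f, drop = FALSE))  # all five levels; "4" and "5" are empty
```

With drop = TRUE as the default, existing callers would see no change, while a caller such as boxplot.formula() could pass drop = FALSE to keep a slot for every level.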
>
> > From: mwtoews@sfu.ca
> >
> > I consider this to be an old bug, which also persists in Splus 7. It
> > is unnecessary, and annoying.
> >
> > ## Section 1: Consider a simple data frame with a factor that has
> > ## three possible levels
> >
> > d <- data.frame(a=sort(rnorm(10)*10),
> >                 b=factor(c(rep("A",4), rep("C",6)), levels=c("A","B","C")))
> > tapply(d$a, d$b, mean) # returns three results, which I would expect
> > plot(a ~ b, d) # plots only two of three boxes, with "C" in the
> >                # second position, where "B" belongs
> >
> > # if I tried to plot a blank in between the two boxplots:
> > plot(a ~ b, d, at=1:3) # nope: error
> > plot(a ~ b, d, at=c(1,3)) # nope: out of range (also xlim does
> >                           # nothing for the formula boxplot method)
> >
> > # to make this work with the current R/Splus implementation, I have
> > # to add a zero:
> > d <- rbind(d, data.frame(a=0,b="B")) # which I don't want to do,
> >                                      # since there are no "B"
> > plot(a ~ b, d) # yuk!
> >
> > ## Section 2: Why is this important? Consider another realistic
> > ## example of [synthetic] daily temperature
> >
> > temp <- 5 - 10*cos(1:365*2*pi/365) + rnorm(365)*3
> > # jday is Julian day [1,365]
> > d1 <- data.frame(year=2005, jday=1:365, date=NA, month=NA, temp)
> > d1$date <- as.Date(paste(d1$year, d1$jday), "%Y %j")
> > d1$month <- factor(months(d1$date,TRUE), levels=month.abb)
> > plot(temp ~ month, d1) # perfect, in a perfect meteorological world
> >
> > # now let's remove some data
> > d2 <- d1[!d1$month %in% c("Mar","Apr","May","Sep"),]
> > tapply(d2$temp,d2$month,mean) # perfect
> > plot(temp ~ month, d2) # ugly, not 12 months, etc. (despite having
> >                        # 12 levels)
> >
> > # again the only cure is to add zeros to the missing months
> > # (unnecessary forgery of data)
> > d3 <- d2
> > for (i in c("Mar","Apr","May","Sep")) {
> >   d3 <- rbind(d3,NA)
> >   d3$month[nrow(d3)] <- i
> >   d3$temp[nrow(d3)] <- 0
> > }
> > plot(temp ~ month, d3) # still ugly, but at least has 12 months!
> >
> > ## Section 3: Solution
> > The obvious solution is to leave a blank where a boxplot should go,
> > similar to tapply. This would have 1:n positions, where n is the
> > number of levels of the factor, not the number of levels that have
> > one or more observations. The position should also have a label
> > under the tick mark.
> > I don't see any reason why the missing data should be completely
> > ignored. Users wishing to not plot the blanks where the data could go
> > can simply type (for back-compatibility):
> >
> > d2$month <- factor(d2$month) # from 12 to 8 levels
> >
> > Which will produce the same 8-level plot as above.
> >
> > ## Section 4: Conclusion
> > I consider this to be a bug with regard to data representation, and
> > this function is not consistent with other functions like `tapply'.
> > Considering that the back-compatibility solution is very simple, and
> > most users would probably prefer a result including all levels (NULL
> > or real values in each), I feel this is an appropriate improvement (and
> > easy to fix in the code). At the very least, include an option to
> > honour the factor levels.
> >
> > Thanks.
> > -mt
> >
> > --please do not edit the information below--
> >
> > Version:
> > platform = powerpc-apple-darwin8.1.0
> > arch = powerpc
> > os = darwin8.1.0
> > system = powerpc, darwin8.1.0
> > status = Patched
> > major = 2
> > minor = 1.1
> > year = 2005
> > month = 06
> > day = 26
> > language = R
> >
> > Locale:
> > en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
> >
> > Search Path:
> > .GlobalEnv, package:methods, package:stats, package:graphics,
> > package:grDevices, package:utils, package:datasets, Autoloads,
> > package:base
> >

This archive was generated by hypermail 2.1.8: Mon 24 Oct 2005 - 22:27:23 GMT