[Rd] Possible bug in "unsplit" (PR#14084)

From: <ivar.herfindal_at_bio.ntnu.no>
Date: Wed, 25 Nov 2009 15:30:12 +0100 (CET)


Dear R-bug-people

I have encountered a problem with "unsplit", which I believe may be caused by a bug in the function. However, being inexperienced with bug reports, I apologise if this is merely a user problem rather than a problem within R.

The problem occurs if an object is split by several grouping factors with levels not occurring in the data, using drop = TRUE. This may appear to be a special and hardly relevant case, but I had to split a data frame on several factors, do some analyses on each of the subsets in the split object, and then unsplit it. I had to use drop = TRUE, otherwise my analyses would not run. Nevertheless, I found a fix for unsplit; I suggest the problem arises because the drop argument is not passed on in the recursive call to unsplit within unsplit. Description and example below. The problem was found on R versions 2.9.0 and 2.10.0 on Windows XP.

> sessionInfo()

R version 2.10.0 (2009-10-26)
i386-pc-mingw32

locale:
[1] LC_COLLATE=Norwegian (Bokmål)_Norway.1252 LC_CTYPE=Norwegian (Bokmål)_Norway.1252
[3] LC_MONETARY=Norwegian (Bokmål)_Norway.1252 LC_NUMERIC=C
[5] LC_TIME=Norwegian (Bokmål)_Norway.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] tools_2.10.0
>

## a reproducible example:
dff <- data.frame(gr1=factor(c(1,1,1,1,1,2,2,2,2,2,2), levels=c(1,2,3,4)), gr2=factor(c(1,2,1,2,1,2,1,2,1,2,3), levels=c(1,2,3,4)), yy=rnorm(11))
# note that the two grouping factors "gr1" and "gr2" have defined levels that do not occur in the data.

dff2 <- split(dff, list(dff$gr1, dff$gr2), drop=TRUE) # I don't want empty objects, so I use drop=TRUE
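
To make the effect of drop concrete, it may help to count the components that split returns with and without it. The following is only a sketch on data with the same grouping structure; set.seed and the names f, without_drop and with_drop are additions for illustration and are not part of the report.

```r
# Same grouping structure as in the example; set.seed is added only so that
# the sketch is reproducible.
set.seed(1)
dff <- data.frame(
  gr1 = factor(c(1,1,1,1,1,2,2,2,2,2,2), levels = c(1,2,3,4)),
  gr2 = factor(c(1,2,1,2,1,2,1,2,1,2,3), levels = c(1,2,3,4)),
  yy  = rnorm(11)
)
f <- list(dff$gr1, dff$gr2)

without_drop <- split(dff, f)               # one component per level combination
with_drop    <- split(dff, f, drop = TRUE)  # only combinations present in the data

length(without_drop)  # 16 (4 x 4 level combinations, most of them empty)
length(with_drop)     # 5
names(with_drop)      # "1.1" "2.1" "1.2" "2.2" "2.3"
```

The mismatch between 16 possible groups and 5 actual components is what matters when the pieces are put back together.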

# now I want to unsplit it, and expect the following to work:
dff3 <- unsplit(dff2, list(dff$gr1, dff$gr2), drop=TRUE)
Error in `row.names<-.data.frame`(`*tmp*`, value = c("1", "11", "3", "11", :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘11’, ‘3’, ‘5’

### end

Looking at the unsplit function, we find:
> unsplit

function (value, f, drop = FALSE)
{
    len <- length(if (is.list(f)) f[[1L]] else f)
    if (is.data.frame(value[[1L]])) {
        x <- value[[1L]][rep(NA, len), , drop = FALSE]
        rownames(x) <- unsplit(lapply(value, rownames), f)
    }
    else x <- value[[1L]][rep(NA, len)]
    split(x, f, drop = drop) <- value
    x
}
<environment: namespace:base>
>

Note that if "value" is a list of data frames, then the row names for the output x are created by the call:
rownames(x) <- unsplit(lapply(value, rownames), f)

This call to unsplit does not pass on the drop argument, and in the example above we get from this call:
> unsplit(lapply(dff2, rownames), list(dff$gr1, dff$gr2))
[1] "1" "11" "3" "11" "5" "1" "7" "3" "9" "5" "11"

i.e. the row names for the output x are not unique.
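
The duplicates can be reproduced on plain character vectors, without any data frame involved. The sketch below assumes that the replacement form of split walks through every level combination (16 here) while cycling through the supplied components (5 here), so the two get out of step as soon as an empty combination is encountered; the names rn, bad and good are hypothetical.

```r
gr1 <- factor(c(1,1,1,1,1,2,2,2,2,2,2), levels = c(1,2,3,4))
gr2 <- factor(c(1,2,1,2,1,2,1,2,1,2,3), levels = c(1,2,3,4))
f <- list(gr1, gr2)

# Row names per non-empty group, in the form lapply(dff2, rownames) would give.
rn <- split(as.character(1:11), f, drop = TRUE)

bad  <- unsplit(rn, f)               # drop = FALSE: 16 index groups, 5 components
good <- unsplit(rn, f, drop = TRUE)  # drop = TRUE:   5 index groups, 5 components

bad                  # contains repeated values such as "11"
anyDuplicated(bad)   # non-zero, so these cannot be used as row names
good                 # "1" "2" ... "11", the original order
```

With drop = TRUE in both the split and the unsplit, the index groups and the components line up one to one.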

A simple fix is to add drop = drop to that call, so that the updated unsplit (here called unsplit2) looks like this:

unsplit2 <- function (value, f, drop = FALSE) {
    len <- length(if (is.list(f)) f[[1L]] else f)
    if (is.data.frame(value[[1L]])) {
        x <- value[[1L]][rep(NA, len), , drop = FALSE]
        rownames(x) <- unsplit(lapply(value, rownames), f, drop=drop) # note new "drop=drop"
    }
    else x <- value[[1L]][rep(NA, len)]
    split(x, f, drop = drop) <- value
    x
}

This works fine in the example above, and the original levels in gr1 and gr2 (i.e. they both have four levels) are maintained in the output data frame, so that it has attributes similar to those of the original dff:

> dff3 <- unsplit2(dff2, list(dff$gr1, dff$gr2), drop=TRUE)
> dff3

   gr1 gr2          yy
1    1   1  2.13749771
2    1   2 -0.02166458
3    1   1  0.45960452
4    1   2  2.72074958
5    1   1 -0.17536995
6    2   2 -0.08909495
7    2   1  0.94260802
8    2   2 -0.09979505
9    2   1  1.22240834
10   2   2 -0.81710781
11   2   3  0.76071130
>

I must admit that I have not been able to check whether such a quick fix conflicts with other uses of unsplit or with other types of data, but I cannot see why it should be a problem.
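
The concern about other uses can be probed with a couple of round trips. This is only a sketch, not a full regression test: unsplit_fixed is a hypothetical name for a local copy of the proposed function, written with NA_integer_ as the index (an assumption that differs cosmetically from the listing above), so that the check does not depend on which version of base unsplit is installed.

```r
# Hypothetical local copy of the proposed fix; drop is passed to the inner call.
unsplit_fixed <- function(value, f, drop = FALSE) {
  len <- length(if (is.list(f)) f[[1L]] else f)
  if (is.data.frame(value[[1L]])) {
    x <- value[[1L]][rep(NA_integer_, len), , drop = FALSE]
    rownames(x) <- unsplit_fixed(lapply(value, rownames), f, drop = drop)
  } else {
    x <- value[[1L]][rep(NA_integer_, len)]
  }
  split(x, f, drop = drop) <- value
  x
}

# (a) Plain vector, single factor, default drop = FALSE.
v <- c(10, 20, 30, 40, 50)
g <- factor(c("a", "b", "a", "b", "a"))
v_back <- unsplit_fixed(split(v, g), g)

# (b) Data frame split on two factors with unused levels, drop = TRUE.
set.seed(1)  # added only for reproducibility
dff <- data.frame(
  gr1 = factor(c(1,1,1,1,1,2,2,2,2,2,2), levels = c(1,2,3,4)),
  gr2 = factor(c(1,2,1,2,1,2,1,2,1,2,3), levels = c(1,2,3,4)),
  yy  = rnorm(11)
)
f <- list(dff$gr1, dff$gr2)
dff_back <- unsplit_fixed(split(dff, f, drop = TRUE), f, drop = TRUE)

identical(v_back, v)                          # vector round trip
identical(dff_back$yy, dff$yy)                # values back in original order
identical(levels(dff_back$gr1), levels(dff$gr1))  # unused levels kept
identical(rownames(dff_back), rownames(dff))  # row names unique and in order
```

These checks cover only the two cases shown; other input types would need their own round trips.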

Sincerely

Ivar Herfindal



Centre for Conservation Biology
Norwegian University for Science and Technology
N-7491 Trondheim, Norway

email: ivar.herfindal_at_bio.ntnu.no



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed 25 Nov 2009 - 16:13:08 GMT