Re: [Rd] median and data frames

From: S Ellison <S.Ellison_at_lgc.co.uk>
Date: Thu, 28 Apr 2011 14:09:12 +0100


This seems trivially fixable using something like

median.data.frame <- function(x, na.rm=FALSE) {

   sapply(x, function(y, na.rm=FALSE) if(is.factor(y)) NA else median(y, na.rm=na.rm), na.rm=na.rm)
}

>>> Paul Johnson <pauljohn32_at_gmail.com> 28/04/2011 06:20 >>>
On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns <pburns_at_pburns.seanet.com> wrote:
> Here are some data frames:
>
> df3.2 <- data.frame(1:3, 7:9)
> df4.2 <- data.frame(1:4, 7:10)
> df3.3 <- data.frame(1:3, 7:9, 10:12)
> df4.3 <- data.frame(1:4, 7:10, 10:13)
> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>
> Now here are some commands and their answers:

>> median(df4.4)
> [1] 8.5 11.5
>> median(df3.2[c(1,2,3),])
> [1] 2 8
>> median(df3.2[c(1,3,2),])
> [1] 2 NA
> Warning message:
> In mean.default(X[[2L]], ...) :
> argument is not numeric or logical: returning NA
>
>
>
> The sessionInfo is below, but it looks
> to me like the present behavior started
> in 2.10.0.
>
> Sometimes it gets the right answer. I'd
> be grateful to hear how it does that -- I
> can't figure it out.
>

Hello, Pat.

Nice poetry there! I think I have an actual answer, as opposed to the usual crap I spew.

I would agree if you said median.data.frame ought to be written to work columnwise, similar to mean.data.frame.

apply and sapply always give the correct answer

> apply(df3.3, 2, median)
  X1.3 X7.9 X10.12

     2 8 11

> apply(df3.2, 2, median)
X1.3 X7.9

   2 8

> apply(df3.2[c(1,3,2),], 2, median)
X1.3 X7.9

   2 8

mean.data.frame is now implemented as

mean.data.frame <- function(x, ...) sapply(x, mean, ...)

I think we would suggest this for medians:

??????????????????????

median.data.frame <- function(x,...) sapply(x, median, ...)

?????????????????????

It works, see:

> median.data.frame(df3.2[c(1,3,2),])
X1.3 X7.9

   2 8

Would our next step be to enter that somewhere in R bugzilla? (I'm not joking--I'm that naive).

I think I can explain why the current median works intermittently in those cases you mention. Give it a small set of pre-sorted data, all is well. median.default uses a sort function, and it is confused when it is given a data.frame object rather than just a vector.

I put a browser() at the top of median.default

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ]) Browse[1]> n
debug at <tmp>#4: if (is.factor(x)) stop("need numeric data") Browse[2]> n
debug at <tmp>#4: NULL
Browse[2]> n
debug at <tmp>#6: if (length(names(x))) names(x) <- NULL Browse[2]> n
debug at <tmp>#6: names(x) <- NULL
Browse[2]> n
debug at <tmp>#8: if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at <tmp>#8: if (any(is.na(x))) return(x[FALSE][NA]) Browse[2]> n
debug at <tmp>#8: NULL
Browse[2]> n
debug at <tmp>#12: n <- length(x)
Browse[2]> n
debug at <tmp>#13: if (n == 0L) return(x[FALSE][NA]) Browse[2]> n
debug at <tmp>#13: NULL
Browse[2]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x,

    partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> n
[1] 2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

Note the sort there in step 16. I think that's what is killing us.

If you are lucky, give it a small data frame that is in order, like df3.2, the sort doesn't produce gibberish. When I get to that point, I will show you the sort's effect.

First, the case that "works". I moved the browser() down, because I got tired of looking at the same old not-yet-erroneous output.

> median(df3.2)

Called from: median.default(df3.2)
Browse[1]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x,

    partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively, type

Browse[2]> sort(x, partial = half + 0L:1L)   NA NA NA NA NA NA
1 1 7 NULL NULL NULL NULL
2 2 8 <NA> <NA> <NA> <NA>
3 3 9 <NA> <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :   corrupt data frame: columns will be truncated or padded with NAs

But it still gives you a "right" answer:

Browse[2]> n
[1] 2 8

But if you give it data out of order, the second column turns to NA, and that causes doom.

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ]) Browse[1]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x,

    partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively:

Browse[2]> sort(x, partial = half + 0L:1L)   NA NA NA NA NA NA
1 1 NULL 7 NULL NULL NULL
3 3 <NA> 9 <NA> <NA> <NA>
2 2 <NA> 8 <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :   corrupt data frame: columns will be truncated or padded with NAs

Browse[2]> n
[1] 2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

Here's a larger test case. Note columns 1 and 3 turn to NULL

> df8.8 <- data.frame(a=2:8, b=1:7)

median(df8.8)

debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x,

    partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> sort(x, partial = half + 0L:1L)

    NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NULL 2 NULL 1 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL

2 <NA>  3 <NA>  2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA>  4 <NA>  3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA>  5 <NA>  4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA>  6 <NA>  5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 <NA>  7 <NA>  6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
7 <NA>  8 <NA>  7 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :   corrupt data frame: columns will be truncated or padded with NAs

Run ?sort and you see it was not intended for data.frames.

In conclusion, I think median applied to a data.frame causes undefined behavior because median is not intending to deal with several columns at once.

I don't see any changes in median.default that would explain the changes you see. Compare:

>From R-2.13, src/library/stats/R/median.R
median.default <- function(x, na.rm = FALSE) {

    if(is.factor(x)) stop("need numeric data")     ## all other objects only need sort() & mean() to be working

    if(length(names(x))) names(x) <- NULL # for e.g., c(x = NA_real_)
    if(na.rm) x <- x[!is.na(x)] else if(any(is.na(x)))
return(x[FALSE][NA])

    n <- length(x)
    if (n == 0L) return(x[FALSE][NA])
    half <- (n + 1L) %/% 2L
    if(n %% 2L == 1L) sort(x, partial = half)[half]     else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) }

>From R-2.9.2

median.default <- function(x, na.rm = FALSE) {

    if(is.factor(x)) stop("need numeric data")     ## all other objects only need sort() & mean() to be working

    if(length(names(x))) names(x) <- NULL # for e.g., c(x = NA_real_)
    if(na.rm) x <- x[!is.na(x)] else if(any(is.na(x)))
return(x[FALSE][NA])

    n <- length(x)
    if (n == 0L) return(x[FALSE][NA])
    half <- (n + 1L) %/% 2L
    if(n %% 2L == 1L) sort(x, partial = half)[half]     else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) }

pj

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

*******************************************************************
This email and any attachments are confidential. Any use...{{dropped:8}}

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu 28 Apr 2011 - 13:12:59 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 28 Apr 2011 - 14:30:53 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive