Re: [R] Variable passed to function not used in function in select=... in subset

From: Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk_at_idi.ntnu.no>
Date: Mon, 10 Nov 2008 20:04:45 +0100

pardon me, but does this address in any way the legitimate complaint of the rightfully confused user?

consider the following:

d = data.frame(a=1, b=2)
a = c("a", "b")
z = a

# that is, both a and z are c("a", "b")

subset(d, select=z)
# gives two columns, since z is a two element vector whose elements are valid column names

subset(d, select=a)
# gives one column: the symbol a matches a column name, so the value of the variable a is never consulted

subset(d, select=c(a,b))
# gives two columns

this is certainly what the authors intended, and they may have good grounds for this smart design. but this must break the expectation of a naive (r-naive, for that matter) user, who may otherwise have excellent experience in using a functional programming language, e.g., scheme. (especially scheme, where symbols and expressions are first-class objects, yet the distinction between a symbol or an expression and their referent is made painfully clear, perhaps except for when one hacks with macros.)

the examples above illustrate the notorious problem with r that one can never tell whether 'a' means "the value referred to with the identifier 'a'" or "the symbol 'a'", unless one gets ugly surprises and is forced to study the documentation. and even then one may not get a clear answer.
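a rough sketch of the lookup that seems to produce this behaviour, assuming select is handled by substituting the argument and evaluating it in a list of column positions named by column names, with the caller's frame as the fallback. the helper name pick is made up for illustration and is not part of r:

```r
# pick() is a hypothetical stand-in for the select= handling in subset
pick = function(x, select) {
    nl = as.list(seq_along(x))   # column positions ...
    names(nl) = names(x)         # ... named by the column names
    # the unevaluated argument is looked up among the columns first,
    # and only then in the caller's environment
    idx = eval(substitute(select), nl, parent.frame())
    x[, idx, drop=FALSE]
}

d = data.frame(a=1, b=2)
a = c("a", "b")
z = a

names(pick(d, z))   # z is not a column, so its value c("a", "b") is used
names(pick(d, a))   # a is a column, so the symbol wins over the variable
```

under these assumptions the same identifier is a column reference or an ordinary variable depending solely on what the data frame happens to contain.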

the example given by the confused user is a red flag warning. it's a typical abstraction where a nested sequence of operations (here print over names over subset) is abstracted into a single procedure, which can be called with whatever arguments are valid:

pns = function(d, g) print(names(subset(d, select=g)))

what sane person, without carefully studying the gory details of subset, would ever expect that if the first argument happens to have a column named 'g', only that column will be selected, while if it doesn't, subset will select the columns named by the components of whatever 'g' evaluates to? i wonder how many users have *not* noticed that what they get is not what they assume they get because of such tricky tricks, and in consequence were not able to publish their analyses (or worse, have published them).
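to make the trap concrete, here is a small sketch. the data frames with_g and without_g and the wrapper pns2 are invented for illustration; pns2 simply indexes with [ so that g is evaluated the ordinary way:

```r
pns = function(d, g) print(names(subset(d, select=g)))

with_g    = data.frame(g=1, h=2)
without_g = data.frame(x=1, h=2)

pns(with_g, "h")      # column g shadows the argument g, so column g is selected
pns(without_g, "h")   # no column g, so the value "h" is used

# hypothetical alternative: [ uses the value of g, so nothing can shadow it
pns2 = function(d, g) print(names(d[g]))
pns2(with_g, "h")
pns2(without_g, "h")
```

the two calls to pns differ only in the column names of the data, yet they select different columns; the two calls to pns2 agree.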

what is scary is that this may happen with just about any other function in r, because the design is pervasive. no one should ever use any r function without first carefully reading the docs (which is not guaranteed to help) or trying it on a number of carefully crafted test cases. if such care is not taken, results obtained with r cannot be taken seriously.

vQ

Gabor Grothendieck wrote:
> Forgot the name part. Try:
>
> TestFunc2 <- function(DF, group) names(DF[group])
> TestFunc3 <- function(...) names(subset(..., subset = TRUE))
> TestFunc4 <- function(...) eval.parent(names(subset(..., subset = TRUE)))
>
> # e.g.
> df1 <- data.frame(group = "G1", visit = "V1", value = 0.9)
> TestFunc2(df1, c("group", "visit"))
> TestFunc3(df1, c("group", "visit"))
> TestFunc4(df1, c("group", "visit"))
> TestFunc4(df1, c(group, visit)) # this works too
>
> On Mon, Nov 10, 2008 at 10:43 AM, Gabor Grothendieck
> <ggrothendieck_at_gmail.com> wrote:
>
>> Here are a few things to try:
>>
>> TestFunc1 <- get("[")
>>
>> TestFunc2 <- function(DF, group) DF[group]
>>
>> TestFunc3 <- function(...) subset(..., subset = TRUE)
>>
>>
>>
>> On Mon, Nov 10, 2008 at 10:18 AM, Karl Knoblick <karlknoblich_at_yahoo.de> wrote:
>>
>>> Hello!
>>>
>>> I have the problem that in my function the passed variable is not used, but the variable name of the dataframe itself - difficult to explain, but an easy example:
>>>
>>> TestFunc<-function(df, group) {
>>> print(names(subset(df, select=group)))
>>> }
>>> df1<-data.frame(group="G1", visit="V1", value=0.9)
>>> TestFunc(df1, c("group", "visit"))
>>>
>>> Result:
>>> [1] "group"
>>>
>>> But I expected and want to have [1] "group" "visit" as result! Does anybody know how to get this result?
>>>
>>> Thanks!
>>> Karl
>>>



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Mon 10 Nov 2008 - 19:08:49 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 11 Nov 2008 - 14:30:24 GMT.
