Re: [R] Keep value labels with data frame manipulation

From: Heinz Tuechler <tuechler_at_gmx.at>
Date: Thu 13 Jul 2006 - 23:48:55 EST

At 08:11 13.07.2006 -0500, Frank E Harrell Jr wrote:
>Heinz Tuechler wrote:
>> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>>> Dear R,
>>>>
>>>> I import data from spss into a R data.frame. On this rawdata I do some
>>>> data processing (selection of observations, normalization, recoding of
>>>> variables etc..). The result is stored in a new data.frame, however, in
>>>> this new data.frame the value labels are lost.
>>>>
>>>> Example of what I do in code:
>>>>
>>>> # read raw data from spss
>>>> rawdata <- read.spss("./data/T50937.SAV",
>>>> use.value.labels=FALSE,to.data.frame=TRUE)
>>>>
>>>> # select the observations that we need
>>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>>>> rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>>>> rawdata$D22==24 | rawdata$D22==33,]
>>>>
>>>> The result is that rawdata$D22 has value labels and that diarydata$D22
>>>> is numeric without value labels.
>>>>
>>>> Question: How can I prevent this from happening?
>>>>
>>>> Thanks in advance!
>>>> Groeten,
>>>> Arne
>>> Two things:
>>>
>>> 1. With respect to your subsetting, your lengthy code can be replaced
>>> with the following:
>>>
>>> diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>>
>>> See ?subset and ?"%in%" for more information.
>>>
>>>
>>> 2. With respect to keeping the label related attributes, the
>>> 'value.labels' attribute and the 'variable.labels' attribute will not by
>>> default survive the use of "[".data.frame in R (see ?Extract
>>> and ?"[.data.frame").
>>>
>>> On the other hand, based upon my review of ?read.spss, the SPSS value
>>> labels should be converted to the factor levels of the respective
>>> columns when 'use.value.labels = TRUE' and these would survive a
>>> subsetting.
>>>
>>> If you want to consider a solution to the attribute subsetting issue,
>>> you might want to review the following post by Gabor Grothendieck in
>>> May, which provides a possible solution:
>>>
>>> https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>>
>>> and this post by me, for an explanation of what is happening in Gabor's
>>> solution:
>>>
>>> https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>>
>>> HTH,
>>>
>>> Marc Schwartz
>>>
>> Hello Mark and Arne,
>>
>> I worked on the suggestions of Gabor and Mark and programmed some functions
>> in this way, but they are very, very preliminary (see below).
>> In my view there is a lack of convenient possibilities in R to document
>> empirical data by variable labels, value labels, etc. I would prefer to
>> have these possibilities in the "standard" configuration.
>> So I sketched a concept, but in my view it would only be useful, if there
>> was some acceptance by the core developers of R.
>>
>> The concept would be to define a class. For now I call it "source.data".
>> To design it more flexible than the Hmisc class "labelled" I would define a
>> related option "source.data.attributes" with default c('value.labels',
>> 'variable.name', 'label')). This option contains all attributes that should
>> persist in subsetting/indexing.
>>
>> I made only some very, very preliminary tests with these functions, mainly
>> because I am not happy with defining a new class. Instead I would prefer,
>> if this functionality could be integrated in the Hmisc class "labelled",
>> since this is in my view the best known starting point for data
>> documentation in R.
>>
>> I would be happy, if there were some discussion about the wishes/needs of
>> other Rusers concerning data documentation.
>>
>> Greetings,
>>
>> Heinz
>
>I feel that separating variable labels and value labels and just using
>factors for value labels works fine, and I would urge you not to create
>a new system that will not benefit from the many Hmisc functions that
>use variable labels and units. [.data.frame in Hmisc keeps all attributes.
>
>Frank
>

Frank,

Of course I agree with you about the importance of Hmisc and, as I said, I do not want to define a new class. In my view, however, factors are no good substitute for value labels.
As the language definition (version 2.3.1 (2006-06-05) Draft, page 7) says: "Factors are currently implemented using an integer array to specify the actual levels and a second array of names that are mapped to the integers. Rather unfortunately users often make use of the implementation in order to make some calculations easier."
So, in my view, the levels represent the "values" of the factor. This is inconvenient if you want to use value labels in different languages. Furthermore, I do not see a simple method to label numerical variables. I often encounter discrete but still metric data, e.g. risk scores. Usually it would be nice to use them in their original coding, which may include zero or decimal places, and to label them at the same time.
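
To illustrate the point about factors and numerically coded data, here is a minimal sketch. The risk score values and the labels are made up for illustration; they are not taken from any real data set.

```r
# Hypothetical risk score with the original codes 0, 0.5 and 2
score <- c(0, 0.5, 2, 0.5, 0)

# Representing the value labels as factor levels
f <- factor(score, levels = c(0, 0.5, 2),
            labels = c("low", "medium", "high"))

# The factor stores integer codes 1..3, not the original values
codes <- as.integer(f)        # 1 2 3 2 1

# The original coding can only be recovered through a separate lookup
lookup <- c(low = 0, medium = 0.5, high = 2)
recovered <- unname(lookup[as.character(f)])
```

Once the factor is built, the codes 0, 0.5 and 2 are no longer part of the object, so they have to be kept somewhere else if calculations on the original scale are needed.
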
Personally, at the moment I try to solve this problem by following a suggestion of Martin, Dimitris and others to use names instead. I doubt that this is a good solution, but at least it makes it possible to keep the source data numerically coded and in this sense "language free" (see first attempts of functions below).
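
A small sketch of why names are attractive for this purpose, assuming a plain atomic vector and base R indexing only. The values, the labels and the variable name x_pos are hypothetical; the "value.labels" attribute merely mimics the kind of attribute discussed earlier in this thread.

```r
# Hypothetical numerically coded variable; names act as value labels
x <- c(0, 0.5, 2, 0.5)
names(x) <- c("low", "medium", "high", "medium")

# A custom attribute of the kind discussed earlier in the thread
attr(x, "value.labels") <- c(low = 0, medium = 0.5, high = 2)

# Plain "[" keeps the names but drops the custom attribute
x_pos <- x[x > 0]

names(x_pos)                  # names are still attached
attr(x_pos, "value.labels")   # NULL: the attribute did not survive
mean(x_pos)                   # the original numeric coding is still usable
```

The names travel with each element through indexing, which is the property the functions below try to build on.
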

Heinz

### These are very preliminary and untested versions.
### They are intended only to demonstrate the concept, not for production
### work.

### function "value.names<-" - version 0.3.0 - 11.7.2006
### function to assign names of elements according to their value
##
##  value.names<-
##  - arguments:
##    - action
##      - set:           delete any existing names, then set the value names
##      - add.overwrite: replace both empty and non-empty names with new ones
##      - add:           replace only empty names with new ones
##    - tolerance:       assigns names to the values within the tolerance.
##                       If a value lies within the tolerance range of
##                       several names, the one with the smallest difference
##                       is chosen.
##    - round:           rounds values in value before matching
##                       This may lead to collapsing of different names in
##                       value to one name (and one value)
##    - col.str:         string used when collapsing several names
##    - others:          name for values not named by other names
##    - value:
##      
##
##  function description:
##  - x must be atomic, preferably numeric or character
##  - if tolerance is given, it must not be NA. tolerance < 0 is ignored
##  - to ensure consistency, value is processed by value.names()
##  - new.names are built by matching with/without tolerance
##  - new.names are assigned to names depending on argument action
##  - if argument others is given, others-name is assigned to all valid values
## without name
##

"value.names<-" <- function(x, action='set', tolerance=NULL, round=NULL,
                            col.str=' ', others=NULL, value)
{
  ## checking parameters
  if(!is.atomic(x)) stop('x must be an atomic object')
  if(!is.null(tolerance) &&
     is.na(tolerance)) stop('if given, tolerance must not be NA')
  ## to ensure consistency, process value by value.names
  value <- value.names(value, round=round, col.str=col.str)
  ## delete values with NA-name from value
  value <- value[!is.na(names(value))]
  old.names <- names(x) # store original names
  ## -- building names
  ## - matching with/without tolerance
  if(!is.null(tolerance) && tolerance > 0 && is.numeric(x))
    ## - matching with tolerance
    { dif <- abs(outer(x, value, '-'))
      dif[dif>tolerance] <- NA
      within.tolerance <- apply(dif, 1, function(x) sum(!is.na(x)))
      old.option.warn <- options('warn')[[1]]
      options(warn=-1)
      min.dif <- apply(dif, 1, function(x) which(x==min(x, na.rm=TRUE))[1])
      options(warn=old.option.warn)
      new.names <- names(value)[min.dif] }
  else
    ##      - matching without tolerance, i.e. exact matching
    new.names <- names(value)[match( x, value)]
  ##      - matching names for NA-values
  if(length(names(value[is.na(value)]))==1)
    new.names[is.na(x)] <- names(value[is.na(value)])
  ## assign names depending on action
  if (action=='set') new.names <- new.names
  if (action=='add.overwrite') new.names[is.na(new.names)] <-
    old.names[is.na(new.names)]
  if (action=='add') new.names[!is.na(old.names)] <-
    old.names[!is.na(old.names)]
  ## assigning others-name to all valid values without name
  if (!is.null(others)) new.names[!is.na(x) & is.na(new.names)] <-
    as.character(others)
  names(x) <- new.names
  return(x)
}

### function value.names - version 0.3.0 - 11.7.2006
### function to return names of elements according to their value
##
##  - arguments:
##    - x         source vector with names for (some) elements
##                x must be atomic.
##                If x is a factor, value will be a factor. Consequently
##                names are only seen, if unclass() or print.default is used.
##    - col.str:         string used when collapsing several names
##                       default: "/"
##    - round:           rounds values in x
##                       This may lead to collapsing of different names for
##                       one value of x to one name (and one value)
##
##  - value:
##  - vector of the same class as x with sorted unique values and their names,
##    NULL, if x is NULL
##    - NA-values in x appear at the end
##    - if there is a 1:1 relation between values and names in x, value
##      contains all unique combinations of value and name.
##    - if identical values in x have different (non-NA) names, these names
##      get collapsed to one new name, separated by the string col.str.
##      This also applies to NA-values in x with different names.
##    - NA-names get suppressed, if non-NA-names for the same x-value exist.
##    - Different values in x with identical names remain separate.
##    - values in x without name appear in value with name NA

value.names <- function(x, col.str=' ', round=NULL) {
  ## checking parameters
  if(!is.atomic(x)) stop('x must be an atomic object')
  ## -- define function for pasting unique non empty names
  pasteunique <- function(names.i, col.str)
    { names.i <- sort(unique(names.i))
      names.i <- names.i[!names.i=='' & !is.na(names.i)] # exclude ''
      if (length(names.i))
        names.i <- paste(names.i, sep='', collapse=col.str)
      else names.i <- NA
      invisible(names.i)
    }
  ## branching: if x is.null or has no names
  if (is.null(x)) {
    return(NULL) }
  else {
    x <- sort(x, na.last=TRUE) # sort x
    if (!is.null(round)) x <- round(x, round)
    ## vector of unique values
    values <- unique(x, na.last = TRUE)
    ## names per value
    nam <- NA
    for (i in seq(along=values)) {
      names.i <- names(x)[x==values[i]]
      if (!is.null(names.i)) nam[i] <- pasteunique(names.i, col.str)
      else nam[i] <- NA
    }
    ## names for NA
    if (is.na(values[length(values)]))
      { names.i <- names(x)[is.na(x)]
        nam[length(values)] <- pasteunique(names.i, col.str)
      }
    names(values) <- nam
    return(values)
  }
}

### function factvn - version 0.3.0 - 11.7.2006
### function to build a factor from vector with named elements
##
##  function description:
##  - if fromvaluesnames is not given factvn calls factor
##  - if fromvaluesnames is in c('values', 'names') a factor based on
##    names(x) is constructed
##
##  - arguments:
##    - x         source vector with names for (some) elements
##                x must be numeric or character.
##    - fromvaluesnames:
##      - fromvaluesnames='values': levels are ordered according to the values
##        of x
##      - fromvaluesnames='names': levels are ordered according to the names
##        of x
##    - ordered:
##      - fromvaluesnames is not given: ordered=is.ordered(x)
##      - fromvaluesnames='values': ordered=TRUE
##      - fromvaluesnames='names': ordered=FALSE
##
##  - value:
##  - if fromvaluesnames is not given see factor
##  - if fromvaluesnames is in c('values', 'names') a factor based on
##    names(x) is constructed. All x-values without names are NA.
##    The (final) levels of value are the unique(names(x)).

factvn <- function (x = character(), levels = sort(unique.default(x),
                    na.last = TRUE), labels = levels, exclude = NA,
                    ordered = is.ordered(x), fromvaluesnames=NULL)
{
  ## set ordered depending on fromvaluesnames
  if (!missing(fromvaluesnames))
    if (missing(ordered)) {
      if (fromvaluesnames=='values') ord <- TRUE
      if (fromvaluesnames=='names') ord <- FALSE
    } else ord <- ordered
  if (!missing(fromvaluesnames)) {
    if (fromvaluesnames=='values')
      fx <- factor(names(x), levels=unique(names(value.names(x))),
                   exclude=exclude, ordered=ord)
    if (fromvaluesnames=='names')
      fx <- factor(names(x), levels=sort(unique(names(value.names(x)))),
                   exclude=exclude, ordered=ord)
  } else
    ## pass exclude and ordered by name; the fourth positional argument of
    ## factor() is exclude, not ordered
    fx <- factor(x, levels, labels, exclude=exclude, ordered=ordered)
  return(fx)
}

>>

...snip...

>--
>Frank E Harrell Jr Professor and Chair School of Medicine
> Department of Biostatistics Vanderbilt University
>



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Thu Jul 13 23:54:29 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 14 Jul 2006 - 04:14:22 EST.
