Re: [R] Keep value lables with data frame manipulation

From: Frank E Harrell Jr <f.harrell_at_vanderbilt.edu>
Date: Thu 13 Jul 2006 - 23:11:52 EST

Heinz Tuechler wrote:
> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:

>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>> Dear R,
>>>
>>> I import data from spss into a R data.frame. On this rawdata I do some
>>> data processing (selection of observations, normalization, recoding of
>>> variables etc..). The result is stored in a new data.frame, however, in
>>> this new data.frame the value labels are lost.
>>>
>>> Example of what I do in code:
>>>
>>> # read raw data from spss
>>> rawdata <- read.spss("./data/T50937.SAV",
>>> 	use.value.labels=FALSE,to.data.frame=TRUE)
>>>
>>> # select the observations that we need
>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>>> rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>>>  			rawdata$D22==24 | rawdata$D22==33,]
>>>
>>> The result is that rawdata$D22 has value labels and that diarydata$D22
>>> is numeric without value labels.
>>>
>>> Question: How can I prevent this from happening?
>>>
>>> Thanks in advance!
>>> Groeten,
>>> Arne
>> Two things:
>>
>> 1. With respect to your subsetting, your lengthy code can be replaced
>> with the following:
>>
>>  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>
>> See ?subset and ?"%in%" for more information.
>>
>>
>> 2. With respect to keeping the label related attributes, the
>> 'value.labels' attribute and the 'variable.labels' attribute will not by
>> default survive the use of "[".data.frame in R (see ?Extract
>> and ?"[.data.frame").
>>
>> On the other hand, based upon my review of ?read.spss, the SPSS value
>> labels should be converted to the factor levels of the respective
>> columns when 'use.value.labels = TRUE' and these would survive a
>> subsetting.
>>
>> If you want to consider a solution to the attribute subsetting issue,
>> you might want to review the following post by Gabor Grothendieck in
>> May, which provides a possible solution:
>>
>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>
>> and this post by me, for an explanation of what is happening in Gabor's
>> solution:
>>
>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>
>> HTH,
>>
>> Marc Schwartz
>>

> Hello Mark and Arne,
>
> I worked on the suggestions of Gabor and Mark and programmed some functions
> in this way, but they are very, very preliminary (see below).
> In my view there is a lack of convenient possibilities in R to document
> empirical data by variable labels, value labels, etc. I would prefer to
> have these possibilities in the "standard" configuration.
> So I sketched a concept, but in my view it would only be useful, if there
> was some acceptance by the core developers of R.
>
> The concept would be to define a class. For now I call it "source.data".
> To design it more flexible than the Hmisc class "labelled" I would define a
> related option "source.data.attributes" with default c('value.labels',
> 'variable.name', 'label')). This option contains all attributes that should
> persist in subsetting/indexing.
>
> I made only some very, very preliminary tests with these functions, mainly
> because I am not happy with defining a new class. Instead I would prefer,
> if this functionality could be integrated in the Hmisc class "labelled",
> since this is in my view the best known starting point for data
> documentation in R.
>
> I would be happy, if there were some discussion about the wishes/needs of
> other Rusers concerning data documentation.
>
> Greetings,
>
> Heinz

I feel that separating variable labels and value labels and just using factors for value labels works fine, and I would urge you not to create a new system that will not benefit from the many Hmisc functions that use variable labels and units. [.data.frame in Hmisc keeps all attributes.

Frank

>
>
> ### intention and concept
> # There should be a convenient possibility to keep source data numerical
> # coded and at the same time have labelled categories.
> # Such labelled categorical numerical data should be easily converted
> # to factors.
> # Indexing/subsetting should preserve the concerned attributes of this data.
>
> ### description of (intended!!!) functionality
> # - a class source.data is defined. It is intended only for atomic objects.
> # - option source.data.attributes defines which attributes will be copied
> # in indexing/subsetting objects of class source.data
> # - option source.data.is.ordered sets defining factors as ordered, when
> # built from objects of class source.data by the function factsd
> # - function 'value.labels<-' assigns an attribute value.labels and sets
> # class source.data
> # - function value.labels reads the attribute value.labels
> # - the indexing method '[.source.data' defines indexing for source.data
> # - the print method print.source.data ignores source.data.attributes in
> # printing
> # - the as.data.frame method as.data.frame.source.data enables inclusion
> # of objects of class source.data in data.frames
> # - function factsd should in general behave as function factor but should
> # in case of an object of class source.data by default use the
> value.labels
> # as levels and the names(value.labels) as the labels of the new built
> # factor.
> # If the parameter ordered is NULL it should create ordered factors
> # according to the option source.data.is.ordered.
>
> ### set option for source.data.attributes
> options(source.data.attributes=c('value.labels', 'variable.name', 'label'))
> ### set option for converting source.data class in ordered factors
> options(source.data.is.ordered=TRUE)
>
> ### function to assign value.labels
> 'value.labels<-' <- function (x, value)
> ## adapted from Hmisc function label 30.6.2006
> {
> if(!is.atomic(x)) stop('value.labels<- is applicabel to atomic objects
> only')
> structure(x, value.labels = value, class = c("source.data",
> attr(x, "class")[attr(x, "class") != "source.data"]))
> }
>
> ### function to read value.labels
> value.labels <- function (x) { attr(x, 'value.labels') }
>
> ### definition of indexing method for class=source.data
> ## source.data.attributes shall be conserved
> "[.source.data" <- function(x, ...)
> {
> atr <- attributes(x)
> atr.names <- names(atr)
> sda <- options()$'source.data.attributes'
> sda.match <- match(atr.names, sda)
> sda.match <- sda.match[!is.na(sda.match)]
> x <- NextMethod("[")
> ## assign source.data.attributes to result
> if(length(sda.match))
> for (i in sda.match) attr(x, sda[i]) <- atr[[sda[i]]]
> ## assign class source.data to result
> class(x) <- c('source.data', attr(x, "class")[attr(x, "class")
> != "source.data"])
> x
> }
>
> ### print method for source.data
> 'print.source.data' <- function (x, ...)
> {
> ## adapted from Hmisc print.labelled 31.5.2006
> x.orig <- x
> ## look if there are source.data.attributes
> sda <- options()$'source.data.attributes'
> sda.match <- match(names(attributes(x)), sda)
> sda.match <- sda.match[!is.na(sda.match)]
> ## delete source.data.attributes for printing
> if(length(sda.match))
> for (i in sda.match) attr(x, sda[i]) <- NULL
> ## delete class source.data for printing
> class(x) <- if (length(class(x)) == 1 && class(x) == "source.data")
> NULL
> else class(x)[class(x) != "source.data"]
> NextMethod("print")
> invisible(x.orig)
> }
>
> ### Define function as.data.frame.source.data (copy from as.data.frame.vector)
> # many as.data.frame methods are identical to this
> ## different functions as.data.frame are besides others:
> # as.data.frame.list, as.data.frame.default, as.data.frame.data.frame,
> # as.data.frame.character, as.data.frame.AsIs, as.data.frame.array,
>
> as.data.frame.source.data <-
> function (x, row.names = NULL, optional = FALSE)
> ## copy from as.data.frame.vector 1.6.2006
> {
> nrows <- length(x)
> nm <- paste(deparse(substitute(x), width.cutoff = 500), collapse = " ")
> if (is.null(row.names)) {
> if (nrows == 0)
> row.names <- character(0)
> else if (length(row.names <- names(x)) == nrows &&
> !any(duplicated(row.names))) {
> }
> else if (optional)
> row.names <- character(nrows)
> else row.names <- as.character(1:nrows)
> }
> names(x) <- NULL
> value <- list(x)
> if (!optional)
> names(value) <- nm
> attr(value, "row.names") <- row.names
> class(value) <- "data.frame"
> value
> }
>
> ### function to create factor from source.data class applying variable.labels
> # and copying all source.data.attributes
> # remark: factor(factsd(x)) drops unused factor levels and source.data class
> # factsd(x)[, drop=TRUE] drops unused factor levels but keeps
> # source.data class and attributes
>
> factsd <- function(x = character(),
> levels = sort(unique.default(x), na.last = TRUE),
> labels = levels, exclude = NA, ordered = NULL)
> {
> ## check if is of class source.data
> if ('source.data' %in% class(x))
> {
> if(is.null(ordered)) ordered <- options()$source.data.is.ordered
> fx <- factor(x = x, levels = value.labels(x),
> labels = names(value.labels(x)),
> exclude = exclude,
> ordered = ordered)
> ## copy source.data.attributes
> atr <- attributes(x)
> atr.names <- names(atr)
> sda <- options()$'source.data.attributes'
> sda.match <- match(atr.names, sda)
> sda.match <- sda.match[!is.na(sda.match)]
> ## assign source.data.attributes to result
> if(length(sda.match))
> for (i in sda.match) attr(fx, sda[i]) <- atr[[sda[i]]]
> ## add class source.data to result
> class(fx) <- c('source.data', attr(fx, 'class'))
> }
> else {
> if(is.null(ordered)) ordered <- is.ordered(x)
> fx <- factor(x = x, levels = levels, labels = labels,
> exclude = exclude, ordered = ordered)
> }
> fx
> }
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Thu Jul 13 23:05:36 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 14 Jul 2006 - 00:15:09 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.