Re: [R] Keep value labels with data frame manipulation

From: Frank E Harrell Jr <f.harrell_at_vanderbilt.edu>
Date: Fri 14 Jul 2006 - 02:02:01 EST

Heinz Tuechler wrote:
> At 08:11 13.07.2006 -0500, Frank E Harrell Jr wrote:

>> Heinz Tuechler wrote:
>>> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>>>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>>>> Dear R,
>>>>>
>>>>> I import data from SPSS into an R data.frame. On this raw data I do some
>>>>> data processing (selection of observations, normalization, recoding of
>>>>> variables, etc.). The result is stored in a new data.frame; however, in
>>>>> this new data.frame the value labels are lost.
>>>>>
>>>>> Example of what I do in code:
>>>>>
>>>>> # read raw data from SPSS
>>>>> library(foreign)  # read.spss() is provided by the foreign package
>>>>> rawdata <- read.spss("./data/T50937.SAV",
>>>>> 	use.value.labels=FALSE,to.data.frame=TRUE)
>>>>>
>>>>> # select the observations that we need
>>>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>>>>> rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>>>>>  			rawdata$D22==24 | rawdata$D22==33,]
>>>>>
>>>>> The result is that rawdata$D22 has value labels and that diarydata$D22
>>>>> is numeric without value labels.
>>>>>
>>>>> Question: How can I prevent this from happening?
>>>>>
>>>>> Thanks in advance!
>>>>> Groeten,
>>>>> Arne
>>>> Two things:
>>>>
>>>> 1. With respect to your subsetting, your lengthy code can be replaced
>>>> with the following:
>>>>
>>>>  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>>>
>>>> See ?subset and ?"%in%" for more information.
>>>>
>>>>
>>>> 2. With respect to keeping the label-related attributes, the
>>>> 'value.labels' attribute and the 'variable.labels' attribute will not by
>>>> default survive the use of "[.data.frame" in R (see ?Extract
>>>> and ?"[.data.frame").
>>>>
>>>> On the other hand, based upon my review of ?read.spss, the SPSS value
>>>> labels should be converted to the factor levels of the respective
>>>> columns when 'use.value.labels = TRUE' and these would survive a
>>>> subsetting.
>>>>
>>>> If you want to consider a solution to the attribute subsetting issue,
>>>> you might want to review the following post by Gabor Grothendieck in
>>>> May, which provides a possible solution:
>>>>
>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>>>
>>>> and this post by me, for an explanation of what is happening in Gabor's
>>>> solution:
>>>>
>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>>>
>>>> HTH,
>>>>
>>>> Marc Schwartz
>>>>
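The two behaviours described in point 2 can be checked with a small base-R sketch. The data, the labels and the hand-set 'value.labels' attribute are made up for illustration; the attribute only stands in for what read.spss() is assumed to attach when use.value.labels = FALSE.

```r
# Hypothetical stand-in for an SPSS-imported column: numeric codes plus a
# hand-made 'value.labels' attribute (an assumption, not real read.spss output).
d22 <- c(2, 3, 5, 17, 2)
attr(d22, "value.labels") <- c(two = 2, three = 3, five = 5, seventeen = 17)

rawdata <- data.frame(id = 1:5)
rawdata$D22 <- d22
# The same codes stored as a factor, with the labels as levels.
rawdata$D22f <- factor(d22, levels = c(2, 3, 5, 17),
                       labels = c("two", "three", "five", "seventeen"))

# Row subsetting goes through "[.data.frame".
diarydata <- rawdata[rawdata$D22 %in% c(2, 3, 17), ]

# The custom attribute is gone on the numeric column ...
lost <- is.null(attr(diarydata$D22, "value.labels"))
# ... while the factor keeps its full set of levels, including the unused one.
kept <- levels(diarydata$D22f)
```

In plain base R (without Hmisc loaded) `lost` should be TRUE and `kept` should still list all four labels, which is why converting value labels to factor levels survives subsetting.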
>>> Hello Marc and Arne,
>>>
>>> I worked on the suggestions of Gabor and Marc and programmed some functions
>>> along these lines, but they are very, very preliminary (see below).
>>> In my view there is a lack of convenient possibilities in R to document
>>> empirical data by variable labels, value labels, etc. I would prefer to
>>> have these possibilities in the "standard" configuration.
>>> So I sketched a concept, but in my view it would only be useful, if there
>>> was some acceptance by the core developers of R.
>>>
>>> The concept would be to define a class. For now I call it "source.data".
>>> To make it more flexible than the Hmisc class "labelled", I would define a
>>> related option "source.data.attributes" with default c('value.labels',
>>> 'variable.name', 'label'). This option contains all attributes that should
>>> persist in subsetting/indexing.
>>>
>>> I made only some very, very preliminary tests with these functions, mainly
>>> because I am not happy with defining a new class. Instead I would prefer,
>>> if this functionality could be integrated in the Hmisc class "labelled",
>>> since this is in my view the best known starting point for data
>>> documentation in R.
>>>
>>> I would be happy if there were some discussion about the wishes/needs of
>>> other R users concerning data documentation.
>>>
>>> Greetings,
>>>
>>> Heinz
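The class concept sketched in the message above could look roughly like the following. This is only an illustration under assumptions: the class name "source.data" and the option "source.data.attributes" are taken from the proposal, while the method body and the example vector are invented here and are not the functions Heinz refers to.

```r
# Attributes that should persist through indexing (option name from the proposal).
options(source.data.attributes = c("value.labels", "variable.name", "label"))

# Hypothetical "[" method: subset with the default method, then restore
# every listed attribute that the original object carried.
"[.source.data" <- function(x, ...) {
  keep  <- intersect(getOption("source.data.attributes"), names(attributes(x)))
  saved <- attributes(x)[keep]
  y <- NextMethod("[")
  for (a in keep) attr(y, a) <- saved[[a]]
  class(y) <- oldClass(x)
  y
}

# Made-up example: a risk score with labels and a variable label.
risk <- structure(c(0, 1.5, 3, 1.5),
                  value.labels = c(low = 0, medium = 1.5, high = 3),
                  label = "Risk score",
                  class = "source.data")

sub <- risk[risk > 0]
sub_label <- attr(sub, "label")
```

Because the list of persistent attributes lives in an option rather than in the method, a user could extend it without touching the class, which seems to be the flexibility the proposal is after.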
>> I feel that separating variable labels and value labels and just using 
>> factors for value labels works fine, and I would urge you not to create 
>> a new system that will not benefit from the many Hmisc functions that 
>> use variable labels and units.  [.data.frame in Hmisc keeps all attributes.
>>
>> Frank
>>

>
> Frank,
>
> of course I agree with you about the importance of Hmisc and as I said, I
> do not want to define a new class, but in my view factors are no good
> substitute for value labels.
> As the language definition (version 2.3.1 (2006-06-05) Draft, page 7) says:
> "Factors are currently implemented using an integer array to specify the
> actual levels and a second array of names that are mapped to the integers.
> Rather unfortunately users often make use of the implementation in order to
> make some calculations easier."
> So, in my view, the levels represent the "values" of the factor.
> This is inconvenient if you want to use value labels in different
> languages. Furthermore, I do not see a simple method to label numerical
> variables. I often encounter discrete but still metric data, e.g. risk
> scores. Usually it would be nice to use them in their original coding,
> which may include zero or decimal places, and to label them at the same time.
> Personally, at the moment I try to solve this problem by following a
> suggestion of Martin, Dimitris and others to use names instead. I doubt,
> however, that this is a good solution, but at least it makes it possible to
> have the source data numerically coded and in this sense "language free"
> (see first attempts of functions below).
>
> Heinz
>
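The "names as value labels" idea from the message above can be tried in a few lines of base R. The score, the code tables and the helper label_with() are hypothetical and exist only for this sketch; they are not part of the functions posted below.

```r
# Hypothetical risk score kept in its original numeric coding (0, 0.5, 1).
codes_en <- c(low = 0, medium = 0.5, high = 1)
codes_de <- c(niedrig = 0, mittel = 0.5, hoch = 1)   # same codes, German labels

score <- c(0.5, 0, 1, 0.5, NA)

# Attach labels as element names by looking each value up in a code table.
label_with <- function(x, codes) {
  names(x) <- names(codes)[match(x, codes)]
  x
}

score_en <- label_with(score, codes_en)
score_de <- label_with(score, codes_de)

# Names survive ordinary vector subsetting, and the values stay numeric,
# so arithmetic on the original coding keeps working.
at_least_medium <- score_en[!is.na(score_en) & score_en >= 0.5]
mean_score <- mean(score_en, na.rm = TRUE)
```

The data stay "language free" in the sense used above: switching from English to German labels only swaps the names, never the stored values.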
Those are excellent points, Heinz. I addressed that problem partially in
sas.get; see the sascodes attribute.

Frank

>
> ### These are very preliminary and untested versions.
> ### They are intended only to demonstrate the concept, not for production
> ### use.
>
> ### function "value.names<-" - version 0.3.0 - 11.7.2006
> ### function to assign names of elements according to their value
> ##
> ## value.names<-
> ## - arguments:
> ## - action
> ## - set: delete any existing names, then set the value names
> ## - add.overwrite: replace both empty and non-empty names with the new ones
> ## - add: replace only empty names with the new ones
> ## - tolerance: assigns names to values that lie within the tolerance.
> ## If a value lies within the tolerance range of several
> ## names, the closest one (smallest difference) is chosen.
> ## - round: rounds values in value before matching
> ## This may lead to collapsing of different names in
> ## value to one name (and one value)
> ## - col.str: string used when collapsing several names
> ## - others: name for values not named by other names
> ## - value:
> ##
> ##
> ## function description:
> ## - x must be atomic, preferably numeric or character
> ## - if tolerance is given, it must not be NA. tolerance < 0 is ignored
> ## - to ensure consistency, value is processed by value.names()
> ## - new.names are built by matching with/without tolerance
> ## - new.names are assigned to names depending on argument action
> ## - if argument others is given, others-name is assigned to all valid values
> ## without name
> ##
>
> "value.names<-" <- function(x, action='set', tolerance=NULL, round=NULL,
>                             col.str=' ', others=NULL, value)
> {
>   ## checking parameters
>   if(!is.atomic(x)) stop('x must be an atomic object')
>   if(!is.null(tolerance) &&
>      is.na(tolerance)) stop('if given, tolerance must not be NA')
>   ## to ensure consistency, process value by value.names
>   value <- value.names(value, round=round, col.str=col.str)
>   ## delete values with NA-name from value
>   value <- value[!is.na(names(value))]
>   old.names <- names(x) # store original names
>   ## -- building names
>   ## - matching with/without tolerance
>   if(!is.null(tolerance) && tolerance > 0 && is.numeric(x))
>   ## - matching with tolerance
>   { dif <- abs(outer(x, value, '-'))
>     dif[dif>tolerance] <- NA
>     within.tolerance <- apply(dif, 1, function(x) sum(!is.na(x)))
>     old.option.warn <- options('warn')[[1]]
>     options(warn=-1)
>     min.dif <- apply(dif, 1, function(x) which(x==min(x, na.rm=TRUE))[1])
>     options(warn=old.option.warn)
>     new.names <- names(value)[min.dif] }
>   else
>   ## - matching without tolerance, i.e. exact matching
>     new.names <- names(value)[match( x, value)]
>   ## - matching names for NA-values
>   if(length(names(value[is.na(value)]))==1)
>     new.names[is.na(x)] <- names(value[is.na(value)])
>   ## assign names depending on action
>   if (action=='set') new.names <- new.names
>   if (action=='add.overwrite') new.names[is.na(new.names)] <-
>     old.names[is.na(new.names)]
>   if (action=='add') new.names[!is.na(old.names)] <-
>     old.names[!is.na(old.names)]
>   ## assigning others-name to all valid values without name
>   if (!is.null(others)) new.names[!is.na(x) & is.na(new.names)] <-
>     as.character(others)
>   names(x) <- new.names
>   return(x)
> }
>
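The tolerance step in the function above can be isolated as a small standalone sketch. It is a simplified variant, not the posted code: the helper name nearest_label() is made up, and which.min() replaces the temporary options(warn=-1) switch by handling the "no label within tolerance" case explicitly.

```r
# Hypothetical helper: for each x, return the name of the closest labelled
# value, or NA when no labelled value lies within the tolerance.
nearest_label <- function(x, value, tolerance) {
  dif <- abs(outer(x, value, "-"))   # rows: elements of x, columns: labelled values
  dif[dif > tolerance] <- NA         # discard candidates outside the tolerance
  idx <- apply(dif, 1, function(d) {
    if (all(is.na(d))) NA_integer_ else which.min(d)
  })
  names(value)[idx]
}

labels <- c(one = 1, two = 2, three = 3)
matched <- nearest_label(c(0.98, 2.4, 3.02), labels, tolerance = 0.1)
```

Here 0.98 and 3.02 fall within 0.1 of a labelled value, while 2.4 does not and stays unlabelled.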
>
> ### function value.names - version 0.3.0 - 11.7.2006
> ### function to return names of elements according to their value
> ##
> ## - arguments:
> ## - x source vector with names for (some) elements
> ## x must be atomic.
> ## If x is a factor, value will be a factor. Consequently
> ## names are only seen, if unclass() or print.default is used.
> ## - col.str: string used when collapsing several names
> ## default: ' ' (a single space, as in the function definition)
> ## - round: rounds values in x
> ## This may lead to collapsing of different names for
> ## one value of x to one name (and one value)
> ##
> ## - value:
> ## - vector of the same class as x with sorted unique values and their names,
> ## NULL, if x is NULL
> ## - NA-values in x appear at the end
> ## - if there is a 1:1 relation between values and names in x, value
> ## contains all unique combinations of value and name.
> ## - if identical values in x have different (non-NA) names, these names
> ## get collapsed to one new name, separated by the string col.str.
> ## This applies also to NA-values in x with different names.
> ## - NA-names get suppressed, if non-NA-names for the same x-value exist.
> ## - Different values in x with identical names remain separated.
> ## - values in x without name appear in value with name NA
>
> value.names <- function(x, col.str=' ', round=NULL) {
>   ## checking parameters
>   if(!is.atomic(x)) stop('x must be an atomic object')
>   ## -- define function for pasting unique non empty names
>   pasteunique <- function(names.i, col.str)
>   { names.i <- sort(unique(names.i))
>     names.i <- names.i[!names.i=='' & !is.na(names.i)] # exclude ''
>     if (length(names.i))
>       names.i <- paste(names.i, sep='', collapse=col.str)
>     else names.i <- NA
>     invisible(names.i)
>   }
>   ## branching: if x is.null or has no names
>   if (is.null(x)) {
>     return(NULL) }
>   else {
>     x <- sort(x, na.last=TRUE) # sort x
>     if (!is.null(round)) x <- round(x, round)
>     ## vector of unique values
>     values <- unique(x, na.last = TRUE)
>     ## names per value
>     nam <- NA
>     for (i in seq(along=values)) {
>       names.i <- names(x)[x==values[i]]
>       if (!is.null(names.i)) nam[i] <- pasteunique(names.i, col.str)
>       else nam[i] <- NA
>     }
>     ## names for NA
>     if (is.na(values[length(values)]))
>     { names.i <- names(x)[is.na(x)]
>       nam[length(values)] <- pasteunique(names.i, col.str)
>     }
>     names(values) <- nam
>     return(values)
>   }
> }
>
>
> ### function factvn - version 0.3.0 - 11.7.2006
> ### function to build a factor from vector with named elements
> ##
> ## function description:
> ## - if fromvaluesnames is not given factvn calls factor
> ## - if fromvaluesnames is in c('values', 'names') a factor based on
> ## names(x) is constructed
> ##
> ## - arguments:
> ## - x source vector with names for (some) elements
> ## x must be numeric or character.
> ## - fromvaluesnames:
> ## - fromvaluesnames='values': levels are ordered according to the values
> ## of x
> ## - fromvaluesnames='names': levels are ordered according to the names
> ## of x
> ## - ordered:
> ## - fromvaluesnames is not given: ordered=is.ordered(x)
> ## - fromvaluesnames='values': ordered=TRUE
> ## - fromvaluesnames='names': ordered=FALSE
> ##
> ## - value:
> ## - if fromvaluesnames is not given see factor
> ## - if fromvaluesnames is in c('values', 'names') a factor based on
> ## names(x) is constructed. All x-values without names are NA.
> ## The (final) levels of value are the unique(names(x)).
>
> factvn <- function (x = character(), levels = sort(unique.default(x),
>                     na.last = TRUE), labels = levels, exclude = NA,
>                     ordered = is.ordered(x), fromvaluesnames=NULL)
> {
>   ## set ordered depending on fromvaluesnames
>   if (!missing(fromvaluesnames))
>     if (missing(ordered)) {
>       if (fromvaluesnames=='values') ord <- TRUE
>       if (fromvaluesnames=='names') ord <- FALSE
>     } else ord <- ordered
>   ## in the fallback call to factor() the arguments are passed by name:
>   ## positionally, the 4th argument of factor() is exclude, not ordered
>   if (!missing(fromvaluesnames)) {
>     if (fromvaluesnames=='values')
>       fx <- factor(names(x), levels=unique(names(value.names(x))),
>                    exclude=exclude, ordered=ord)
>     if (fromvaluesnames=='names')
>       fx <- factor(names(x), levels=sort(unique(names(value.names(x)))),
>                    exclude=exclude, ordered=ord)
>   } else fx <- factor(x, levels=levels, labels=labels, exclude=exclude,
>                       ordered=ordered)
>   return(fx)
> }
>
>
> ...snip...
>
>



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Fri Jul 14 02:08:36 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Sat 15 Jul 2006 - 06:14:07 EST.