Re: [Rd] (PR#9896) read.spss converts string variables with

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Mon, 17 Sep 2007 12:06:01 +0100 (BST)

The problem here is that the values in the data do not have trailing blanks, and the corresponding values in the label table do. That's an issue about the specific SPSS file, not mentioned in this report.

Take a look at the data read with use.value.labels=FALSE:

> read.spss("problem_file.sav", use.value.labels=FALSE)$CNT
  [1] "CZE" "CZE" "CZE" "CZE" "CZE" "CZE" "CZE" "CZE" "CZE" "CZE"
attr(,"value.labels")
      United States            Uruguay             Turkey
         "USA     "         "URY     "         "TUR     "
            Tunisia           Thailand     Chinese Taipei
         "TUN     "         "THA     "         "TAP     "
             Sweden           Slovenia    Slovak Republic
         "SWE     "         "SVN     "         "SVK     "
...
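
For illustration only (a sketch, not code taken from read.spss): the toy vectors below copy a few of the codes shown above, and the variable names are made up. Exact matching of the unpadded data values against the padded codes finds nothing, whereas matching after stripping trailing blanks succeeds.

```r
# Data values as stored in the file: no trailing blanks.
values <- c("USA", "SWE", "USA")

# Label table as returned: codes padded with blanks to width 8.
value_labels <- c("United States" = "USA     ", "Sweden" = "SWE     ")

# Exact matching fails for every element because of the padding.
idx_exact <- match(values, value_labels)

# Stripping trailing blanks from the codes first makes the match succeed.
idx_trim <- match(values, sub(" +$", "", value_labels))

# Look up the label names through the trimmed match.
names(value_labels)[idx_trim]
```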

There is another example of this in the test suite:

> electric<-read.spss("electric.sav",TRUE,TRUE)
> summary(electric)

...

      WT58          DAYOFWK       VITAL10    FAMHXCVR       CHD
  Min.   :123.0   MISSING :130   ALIVE:179   NO  :  0   Min.   :0.0
  1st Qu.:156.0   SUNDAY  : 19   DEAD : 61   YES :  0   1st Qu.:0.0
  Median :171.0   TUESDAY : 19               NA's:240   Median :0.5
  Mean   :173.4   WEDNSDAY: 17                          Mean   :0.5
  3rd Qu.:187.0   SATURDAY: 16                          3rd Qu.:1.0
  Max.   :278.0   THURSDAY: 15                          Max.   :1.0
                  (Other) : 24

where the label.table attribute has

Browse[1]> vl[[12]]

        YES         NO
"Y       " "N       "

but the values are "Y" or "N". And it has been that way since at least R 1.6.2.

I think this has to be a case unanticipated by the original author of read.spss, and needs to be covered by a new argument to read.spss, since presumably trimming when matching might not always be required.
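
A minimal sketch of what such an option could look like. The helper apply_value_labels() and its 'trim' argument are hypothetical, made up for this illustration; they are not the read.spss interface, and the sketch assumes that labels are applied by building a factor whose levels are the codes in the label table.

```r
# Hypothetical helper (not part of the foreign package): apply a value-label
# table 'vl' (names = labels, values = codes) to a character vector 'x'.
# With trim = TRUE, trailing blanks are ignored on both sides when matching.
apply_value_labels <- function(x, vl, trim = FALSE) {
  codes <- unname(vl)
  if (trim) {
    x <- sub(" +$", "", x)
    codes <- sub(" +$", "", codes)
  }
  factor(x, levels = codes, labels = names(vl))
}

# Toy data modelled on the FAMHXCVR example above.
vl <- c(YES = "Y       ", NO = "N       ")
x <- c("Y", "N", "N")

exact <- apply_value_labels(x, vl)                # padded codes never match
trimmed <- apply_value_labels(x, vl, trim = TRUE) # labels are recovered

table(exact, useNA = "ifany")
table(trimmed, useNA = "ifany")
```

Keeping the default at trim = FALSE preserves the existing behaviour for files in which trailing blanks are significant.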

On Wed, 5 Sep 2007, Prof Brian Ripley wrote:

> Thank you.
>
> If anyone wants to work on a patch I've put the unencoded files from my
> direct copy at
>
> http://www.stats.ox.ac.uk/pub/bdr/problem_file.sav
> http://www.stats.ox.ac.uk/pub/bdr/problem_file_read.RData
>
> I am afraid I won't have a chance to take a look for at least a couple of
> weeks.
>
>
> On Wed, 5 Sep 2007, honza_at_ifolk.cz wrote:
>
>> I am sending two files attached. The file problem_file.sav was saved in
>> SPSS 10.0. It contains variables of various types, with and without
>> labeling etc., so that you can make experiments.
>>
>> The file problem_file_read.RData was saved in R 2.5.1 (foreign library
>> version 0.8-20). It contains two data frames, Schools and Schools2. The
>> former is the result of
>> read.spss("problem_file.sav",to.data.frame=TRUE,use.value.labels=TRUE),
>> the latter differs only in use.value.labels=FALSE. As you can see, in the
>> first case read.spss has not read the values of labeled string variables
>> at all.
>>
>> I use WinXP.
>>
>> Thank you for your work!
>>
>> Jan Hucin
>>
>> -------------------------------------------------------------------
>> Reference: <20070903104656.1D1566691D@slim.kubism.ku.dk>
>>
>> There is nothing we can do to reproduce this without an example 'some.sav'
>> file exhibiting the problem. Can you please supply one?
>>
>> On Mon, 3 Sep 2007, honza_at_ifolk.cz wrote:
>>
>>> Full_Name: Jan Hucin
>>> Version: 2.5.1 (foreign 0.8-20)
>>> OS: WinXP
>>> Submission from: (NULL) (195.113.83.7)
>>>
>>>
>>> When reading an SPSS file:
>>>
>>> - containing some variable of type String
>>> - with value labels on that variable
>>> - and with a declaration of which values of that variable are considered
>>>   to be missing,
>>>
>>> I always get <NA> where digits were in the original SPSS file.
>>>
>>> Example:
>>> Let's have in an SPSS file "some.sav" the variable A. The type of the
>>> variable is String of length 1.
>>> Let's have a value labeling: 1 = Yes, 2 = No, 8 = Invalid, 9 = Missing.
>>> Let's determine that value 9 is considered to be missing.
>>> When this file is read by
>>> abc=read.spss("some.sav",use.value.labels=TRUE), we get <NA> in abc$A on
>>> places where "1", "2" etc. were. Surprisingly, we get "N/A" (not <NA>!)
>>> on the place where the string "N/A" is.
>>>
>>> If we specify use.value.labels=FALSE, then we get string values (such as
>>> "1", "2") but we lose value labels (Yes, No etc.).
>>>
>>> Let me add that if the variable in the original SPSS file was of type
>>> Numeric (not String), there would be no problem.
>>>
>>> ______________________________________________
>>> R-devel_at_r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>>
>
>

-- 
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Received on Mon 17 Sep 2007 - 11:17:22 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 17 Sep 2007 - 18:40:54 GMT.
