[Rd] read.spss issues

From: Jeroen Ooms <jeroen.ooms_at_stat.ucla.edu>
Date: Tue, 14 Feb 2012 22:05:29 -0800


Someone sent me a small SPSS data file that caused a buffer overflow and then a crash when read into R. This seems like a fairly serious issue to me. Unfortunately I can't share the dataset, and I am having a hard time reproducing the crash with a toy example. However, I found at least two issues that might be related.

The first is that when the SPSS dataset has a 'string' variable longer than 200 characters, read.spss generates a number of warnings and then creates additional variables in the dataset. E.g.:

library(foreign)
x <- read.spss("http://www.stat.ucla.edu/~jeroen/spss/longstring.sav")
str(x)

The second problem is that the SPSS data format allows 'duplicate labels' to be specified, whereas duplicate levels are not allowed in factors. read.spss does not handle this and creates a bad factor

x <- read.spss("http://www.stat.ucla.edu/~jeroen/spss/duplicate_labels.sav", use.value.labels = TRUE)
levels(x$opinion)

which causes issues downstream. I am not sure whether this is an issue in read.spss() or in as.factor(), but it might be wise to detect duplicate levels and assign them all the same integer code when converting to a factor.
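
A minimal sketch of that suggestion, not the actual read.spss code. The helper name labels_to_factor is hypothetical, and the sketch assumes the value labels arrive as a named vector whose names are the labels and whose values are the raw codes. The toy data stands in for the SPSS file.

```r
# Hypothetical helper (not part of the foreign package): build a factor from
# raw codes and a named vector of value labels, merging duplicate labels.
labels_to_factor <- function(codes, value_labels) {
  lab <- names(value_labels)
  dup <- unique(lab[duplicated(lab)])
  if (length(dup) > 0) {
    warning("duplicate value labels merged: ", paste(dup, collapse = ", "))
  }
  # Map each raw code to its label; factor() then assigns one integer
  # code per distinct label, so duplicates collapse into a single level.
  mapped <- lab[match(codes, value_labels)]
  factor(mapped, levels = unique(lab))
}

# Toy data: the label "agree" is attached to two different codes (1 and 3).
value_labels <- c("agree" = 1, "neutral" = 2, "agree" = 3)
codes <- c(1, 3, 2, 1, NA)

f <- suppressWarnings(labels_to_factor(codes, value_labels))
levels(f)
as.integer(f)
```

Merging by label loses the distinction between the original codes, so a warning seems preferable to doing it silently.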

Thank you,

Jeroen



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Wed 15 Feb 2012 - 06:11:00 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 15 Feb 2012 - 22:10:18 GMT.
