[R] Appending new values to an existing factor vector

From: David Hall (coding) <hacking_at_gringer.org>
Date: Sat, 15 Mar 2008 13:25:03 +1300


I've recently come across a situation where I'm trying to read in [genotype data] files that have around 80,000,000 lines and 4 fields, with a high proportion of repeated strings. Here's a sample:

rsXXXXXXX       SAMPLE0001      CG      0.05302
rsXXXXXX        SAMPLE0001      CC      0.06817
rsXXXXXXXX      SAMPLE0001      CC      0.01369
rsXXXXXXY       SAMPLE0001      GG      0.01816
rsXXXXXXZ       SAMPLE0001      GG      0.006711
rsXXXXXXX       SAMPLE0002      GG      0.05813

[For the purpose of the work I'm doing at the moment, I don't care about the last column]

What's the best way to read in these data?

My understanding of what happens when I do read.table on such a file is that it reads the file into a matrix (or perhaps a list) of character strings, then carries out the character conversions [i.e. as.factor(data[[i]])].

infile.df <- read.table(gzfile("large_file.txt.gz"), nrows = 82000000)

Doing this all in one go results in R complaining about not having enough memory to store a data structure of that size [I'm running on Linux, with 1.5GB memory + 2GB swap]. That means I need to do it piecewise, but I suspect the memory issues will still be present if I do that.
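One thing that might reduce the footprint of a single read, sketched here on a tiny stand-in file: passing colClasses to read.table declares the column types up front, and a class of "NULL" skips a column entirely, so the unwanted last column is never stored. The column names (marker, sample, genotype, score) are assumptions made for illustration, not names from the real files.

```r
# Hypothetical stand-in for the real file, written to a temporary path.
sample_lines <- c(
  "rsXXXXXXX\tSAMPLE0001\tCG\t0.05302",
  "rsXXXXXX\tSAMPLE0001\tCC\t0.06817",
  "rsXXXXXXX\tSAMPLE0002\tGG\t0.05813"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Declare the first three columns as factors; "NULL" drops the fourth
# column while reading. The column names are assumptions for this sketch.
small.df <- read.table(
  tmp,
  col.names    = c("marker", "sample", "genotype", "score"),
  colClasses   = c("factor", "factor", "factor", "NULL"),
  comment.char = ""
)
str(small.df)
```

Whether this is enough on its own for an 80,000,000-line file on this machine is not something I have measured.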

What I'd like is a way to read in, say, a million lines at a time, do the factor conversion, then append to my existing data frame, which has columns of factors.
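A rough sketch of what I have in mind, assuming the file is whitespace-delimited with four columns. The function name read_in_chunks, its arguments, and the column names are all hypothetical. It reads a block of lines from an open connection, converts that block to factors, and relies on rbind for data frames to merge the factor levels of the pieces.

```r
# Read a whitespace-delimited file in chunks of `chunk_size` lines,
# keeping the first three columns as factors and dropping the fourth.
# read_in_chunks and the column names are hypothetical, for illustration.
read_in_chunks <- function(path, chunk_size = 1e6) {
  con <- gzfile(path, open = "r")  # gzfile can also read uncompressed files
  on.exit(close(con))
  chunks <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break  # end of file reached
    chunks[[length(chunks) + 1]] <- read.table(
      text         = lines,
      col.names    = c("marker", "sample", "genotype", "score"),
      colClasses   = c("factor", "factor", "factor", "NULL"),
      comment.char = ""
    )
  }
  # rbind on data frames extends the levels of each factor column as needed
  do.call(rbind, chunks)
}

# Intended use (not run here):
# infile.df <- read_in_chunks("large_file.txt.gz")
```

Note that this version keeps every chunk in a list and combines them at the end, so the peak memory use during the final rbind may still be a problem for a file of this size.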

However, something I came across while participating in the ICFP 2007 contest (http://www.icfpcontest.org/) using R was the strange behaviour when adding new/unknown values to a factor vector:

> (a <- factor(c("I","C","I","C","F","I")))
[1] I C I C F I
Levels: C F I
> append(a,"P")

[1] "3" "1" "3" "1" "2" "3" "P"

What would be nice is for unknown levels to be added and encoded as a new value, without having to refactor the whole list, as follows:

> factor(append(as.character(a),"P"))

[1] I C I C F I P
Levels: C F I P

Is there a better way to do this that means I don't need to do the character conversion process?

The need to do this character conversion seems to remove one of the useful features of a factor vector, namely that it substantially reduces space requirements.
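To make the request concrete, here is a minimal sketch of the kind of helper I mean. The name append_factor is hypothetical, not a base R function. It works on the existing integer codes and only treats the new values as strings, so the existing factor is never expanded back to character.

```r
# Append values to a factor, adding any unseen values as new levels.
# append_factor is a hypothetical helper name, not part of base R.
append_factor <- function(f, values) {
  values <- as.character(values)
  unseen <- setdiff(unique(values[!is.na(values)]), levels(f))
  new_levels <- c(levels(f), unseen)
  # Existing codes are reused as-is; only `values` is matched as strings.
  codes <- c(as.integer(f), match(values, new_levels))
  structure(codes, levels = new_levels, class = "factor")
}

a <- factor(c("I", "C", "I", "C", "F", "I"))
b <- append_factor(a, "P")
print(b)
```

For a single value, extending the levels first with levels(a) <- c(levels(a), "P") and then assigning a[length(a) + 1] <- "P" should have the same effect.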

Thanks for your help,
David Hall

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Sat 15 Mar 2008 - 10:13:10 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 15 Mar 2008 - 11:30:23 GMT.
