[R] Efficient passing through big data.frame and modifying select fields

From: Johannes Graumann <johannes_graumann_at_web.de>
Date: Tue, 25 Nov 2008 15:16:01 +0100


Hi all,

I have relatively big data frames (> 10000 rows by 80 columns) that need to be passed to "merge". This works marvelously well in general, but some fields of the data frames contain multiple ";"-separated values, encoded as a single character string in no defined order, so fields holding the same values do not match each other.

Example:

> frame1[1,1]
[1] "some;thing"
> frame2[2,1]
[1] "thing;some"
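
The mismatch can be reproduced with two small made-up frames. This is only a sketch: the column names `id`, `x` and `y` and the values are assumptions for illustration, not taken from the real data.

```r
# Hypothetical toy data; the real frames have > 10000 rows and 80 columns.
frame1 <- data.frame(id = c("some;thing", "other"), x = 1:2,
                     stringsAsFactors = FALSE)
frame2 <- data.frame(id = c("other", "thing;some"), y = 3:4,
                     stringsAsFactors = FALSE)

# merge() compares the key strings literally, so "some;thing" and
# "thing;some" count as different keys and only "other" is matched.
merged <- merge(frame1, frame2, by = "id")
nrow(merged)
```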

In order to enable merging/duplicate identification of columns containing these strings, I wrote the following function, which passes through the rows one by one, identifies ";"-containing cells, and splits and re-sorts their contents.

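ResortCombinedFields <- function(dframe){
  if(!is.data.frame(dframe)){
    stop("\"ResortCombinedFields\" input needs to be a data frame.")
  }
  # Walk the rows one by one; seq_len() also copes with zero-row input.
  for(row in seq_len(nrow(dframe))){
    # Columns of this row whose value contains ";".
    for(mef in grep(";",dframe[row,])){
      # Split the cell, sort the parts, and glue them back together.
      dframe[row,mef] <- paste(sort(unlist(strsplit(dframe[row,mef],";"))),collapse=";")
    }
  }
  return(dframe)
}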

This works fine, but is horribly inefficient. How might this be tackled more elegantly?
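
One possible direction, offered as a sketch rather than a definitive answer, is to work column by column instead of cell by cell, so that each column is scanned once and only the elements that actually contain ";" are split. The names `resort_field` and `resort_combined_fields` are hypothetical, and the sketch assumes the affected columns are character vectors (not factors); other column types are left untouched.

```r
# Sort the ";"-separated parts of each element of a character vector.
# Elements without the separator (including NA) are returned unchanged.
resort_field <- function(x, sep = ";") {
  hit <- grepl(sep, x, fixed = TRUE)
  x[hit] <- vapply(strsplit(x[hit], sep, fixed = TRUE),
                   function(parts) paste(sort(parts), collapse = sep),
                   character(1))
  x
}

# Apply resort_field() to every character column of a data frame.
# Assumption: multi-value fields live in character columns only.
resort_combined_fields <- function(dframe) {
  if (!is.data.frame(dframe)) {
    stop("input needs to be a data frame.")
  }
  is_chr <- vapply(dframe, is.character, logical(1))
  for (j in which(is_chr)) {
    dframe[[j]] <- resort_field(dframe[[j]])
  }
  dframe
}

# Usage on a made-up frame (column names are illustrative only).
df <- data.frame(key = c("thing;some", "plain"), n = 1:2,
                 stringsAsFactors = FALSE)
fixed <- resort_combined_fields(df)
fixed$key
```

Note that sort() orders strings according to the current locale, so both frames should be normalized in the same session before they are merged.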

Thanks for any input, Joh



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Received on Tue 25 Nov 2008 - 14:24:30 GMT

