From: Torsten Schindler <Torsten.Schindler_at_chello.at>

Date: Tue 09 Aug 2005 - 20:53:50 EST

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Tue Aug 09 20:57:55 2005


You are right, but unfortunately this is not the limiting step or
bottleneck in the code below.

The filter.const() function is only used to find the non-constant
columns in the training data set, which is small to begin with (49 rows
and 525 columns). It is applied only to the training set and takes
about 2 seconds on my PowerBook.
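
For reference, the same column filter can be written without growing 'realdata' inside a loop. The following is only a sketch: the name filter.const.cols and the demo data are made up for illustration, it covers the column case only, and it assumes numeric columns without missing values. Unlike the original it uses drop = FALSE, so a single surviving column stays a data frame.

```r
# Hypothetical loop-free variant of filter.const() for column filtering.
# A column is kept when more than `tol` of its values differ from its
# median, which is the same test the original loop applies.
# Assumes numeric columns with no NA values.
filter.const.cols <- function(X, tol = 0) {
  keep <- vapply(seq_len(ncol(X)),
                 function(j) sum(X[, j] != median(X[, j])) > tol,
                 logical(1))
  list(x = X[, keep, drop = FALSE], ix = which(keep))
}

# Small usage example with made-up data: columns a and c are constant
demo <- data.frame(a = c(1, 1, 1), b = c(1, 2, 3),
                   c = c(5, 5, 5), d = c(0, 0, 9))
result <- filter.const.cols(demo)
result$ix            # positions of the non-constant columns
colnames(result$x)   # their names
```

As in the original function, the names of the kept columns can then be taken from colnames(result$x).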

After filtering the training data set, only the list of column names
is used to filter the huge "prediction.set".
I think the really time- and memory-consuming part is the for-loop
below, but I don't know how to improve it.
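
One likely reason the loop is slow is that every read.csv() call with skip= has to read through the file from the top again, so later blocks get more and more expensive. A possible way around this, sketched below under assumptions, is to open the file once as a connection and let each read.csv() call continue from where the previous one stopped. The function name filter.columns.blockwise and its arguments are hypothetical, the header is assumed to be unquoted and to name every column (no separate row-name column as in the original file), and any read error is treated as end of file.

```r
# Hypothetical helper: copy only the columns named in `keep` from a large
# CSV file, reading it block by block through a single open connection.
# Assumes an unquoted header line that names every column.
filter.columns.blockwise <- function(input.file, output.file, keep,
                                     blocksize = 1000) {
  infile <- file(input.file, open = "r")
  on.exit(close(infile), add = TRUE)
  outfile <- file(output.file, open = "w")
  on.exit(close(outfile), add = TRUE)

  # Read the header once; later reads continue from the current position.
  header <- strsplit(readLines(infile, n = 1), ",")[[1]]
  writeLines(paste(keep, collapse = ","), outfile)

  nrows.done <- 0
  repeat {
    # At end of file read.csv() either signals an error or returns zero
    # rows. Note that this also hides genuine read errors.
    block <- tryCatch(
      read.csv(infile, header = FALSE, col.names = header,
               nrows = blocksize),
      error = function(e) NULL)
    if (is.null(block) || nrow(block) == 0) break

    # Select the wanted columns by name, in the order given by `keep`.
    write.table(block[keep], file = outfile, sep = ",",
                row.names = FALSE, col.names = FALSE)
    nrows.done <- nrows.done + nrow(block)
  }
  nrows.done   # number of data rows copied
}
```

A call might look like filter.columns.blockwise('big_data_set.txt', 'big_data_set_filtered.txt', keep = training.filtered.property.colnames), adjusted for the row-name column of the real file.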

Anyway, thanks for the hint!!!

Best,

Torsten

On Aug 9, 2005, at 12:37 PM, Patrick Burns wrote:

> Building up an object like you do with 'realdata' is very
> wasteful (S Poetry says why). I think you want something
> along the lines of:
>
> if(vectors[1] == 'column') {
>     realdata <- apply(X, 2, function(x) diff(range(x))) > tol
>     filteredX <- X[, realdata]
> } else {
>     realdata <- apply(X, 1, function(x) diff(range(x))) > tol
>     filteredX <- X[realdata, ]
> }
>
> Patrick Burns
>
> patrick@burns-stat.com
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
>
> Torsten Schindler wrote:
>
>> Hi,
>>
>> I'm an R newbie and want to accelerate the following pre-filtering
>> step of a data set with more than 115,000 rows:
>>
>> #-----------------
>> # Function to filter out constant data columns
>> filter.const<-function(X, vectors=c('column', 'row'), tol=0){
>>     realdata=c()
>>     filteredX<-matrix()
>>     if( vectors[1] == 'row' ){
>>         for( row in (1:nrow(X)) ){
>>             if( length(which(X[row,]!=median(X[row,])))>tol ){
>>                 realdata[length(realdata)+1]=row
>>             }
>>         }
>>         filteredX=X[realdata,]
>>     } else if( vectors[1] == 'column' ){
>>         for( col in (1:ncol(X)) ){
>>             if( length(which(X[,col]!=median(X[,col])))>tol ){
>>                 realdata[length(realdata)+1]=col
>>             }
>>         }
>>         filteredX=X[,realdata]
>>     }
>>     return(list(x=filteredX, ix=realdata))
>> }
>>
>> #-----------------
>> # Filter out all all-constant columns in my training data set
>> #
>> # Read training data set with class information in the first column
>> training <- read.csv('training_data.txt')
>> dim(training) # => 49 rows and 525 columns
>>
>> # Prepare column names by stripping the underline and the number
>> # at the end
>> colnames(training) <- sub('_\\d+$', '', colnames(training),
>>                           perl=TRUE)
>>
>> # Filter out the all-constant columns, exclude column 1, the
>> # class column called myclass
>> training.filter <- filter.const(training[,-1])
>>
>> # The filtered data frame is
>> training.filtered <- cbind(myclass=training[,1], training.filter$x)
>> dim(training.filtered) # => 49 rows and 250 columns
>>
>> # Save the filtered training set for later use in classification
>> filtered.data <- 'training_set_filtered.Rdata'
>> save(training.filtered, file=filtered.data)
>>
>> #-----------------
>> # THE FOLLOWING FILTERING STEP TAKES 3 HOURS ON MY PowerBook
>> # AND CONSUMES ABOUT 600 MB OF MEMORY.
>> #
>> # I WOULD BE HAPPY ABOUT ANY HINT ON HOW TO IMPROVE THIS.
>>
>> # Pre-filter the big data set (more than 115,000 rows and 524
>> # columns) for later class predictions.
>> # The big data set contains the same column names as the training
>> # set, but in a different order.
>>
>> input.file <- 'big_data_set.txt'
>> filtered.file <- 'big_data_set_filtered.txt'
>>
>> # Read header with first row
>> prediction.set <- read.csv(input.file, header=TRUE, skip=0, nrow=1)
>>
>> # Prepare column names by stripping the underline and the number
>> # at the end
>> colnames(prediction.set) <- sub('_\\d+$', '',
>>                                 colnames(prediction.set), perl=TRUE)
>> prediction.set.header <- colnames(prediction.set)
>>
>> # Get descriptor columns of the training data set without the
>> # Activity_Class column
>> training.filtered.property.colnames <-
>>     colnames(training.filtered)[-1]
>>
>> # Filter out the all-constant columns from the training set
>> prediction.set.filtered <-
>>     prediction.set[training.filtered.property.colnames]
>> dim(prediction.set.filtered) # => 1 row and 249 columns
>>
>> # Write header and the first filtered row
>> write.csv(prediction.set.filtered, file=filtered.file,
>>           append=FALSE,
>>           col.names=training.filtered.property.colnames)
>>
>> blocksize <- 1000
>> for (lineid in (0:120)*blocksize) {
>>     cat('lineid: ', lineid, '\n')
>>
>>     # Read block of data
>>     # We have to add a dummy colname "x" in the col.names, when
>>     # the header is not read!
>>     prediction.set <- try(read.csv(input.file, header=FALSE,
>>                                    col.names=c('x',prediction.set.header),
>>                                    row.names=1,
>>                                    skip=lineid+2, nrow=blocksize))
>>     if (class(prediction.set) == "try-error") break
>>
>>     # Filter out all-constant training set columns from the block
>>     prediction.set.filtered <-
>>         prediction.set[training.filtered.property.colnames]
>>
>>     # Append the data
>>     # (I know this function is slow, but I couldn't figure out how
>>     # to do it faster, so far.)
>>     write.table(prediction.set.filtered, file=filtered.file,
>>                 append=TRUE, col.names=FALSE, sep=",")
>> }
>>
>> #-------------
>> # Now read in the filtered data set and save it for later use in
>> # classification
>> prediction.set.filtered <- read.csv(filtered.file, header=TRUE,
>>                                     row.names=1)
>> filtered.data <- 'prediction_set_filtered.Rdata'
>> save(prediction.set.filtered, file=filtered.data)
>>
>> I would be very happy about any hints on how to improve the code
>> above!!!
>>
>> Best regards,
>>
>> Torsten
>>
>> ______________________________________________
>> R-help@stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide!
>> http://www.R-project.org/posting-guide.html


This archive was generated by hypermail 2.1.8: Sun 23 Oct 2005 - 15:11:03 EST