Hi Sergey,

This is not an answer to your exact question, but can you use a
matrix? If you can use a matrix instead of a data frame, you should
get a considerable performance boost. Even for very large matrices
(at least on my system), it is fast enough that I find it hard to
believe it is a bottleneck in the overall imputation process. For
example, for a 1000 by 100 object

as a data frame:

> system.time(r0 <- random.del(mat, 100, 50))
   user  system elapsed
   1.09    0.02    1.12

and as a matrix:

> system.time(r0 <- random.del(mat, 100, 50))
   user  system elapsed
   0.02    0.00    0.01
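To reproduce the comparison, something like the following should work. It is only a sketch: the object names `dat` and `mat` and the random test data are made up for illustration, `random.del` is copied from the original question, and the timings will differ from system to system. Converting with `as.matrix` only makes sense when all columns share one type.

```r
# random.del as posted in the original question
random.del <- function(x, n.keeprows, del.percent) {
  n.items <- ncol(x)
  k <- n.items * (del.percent / 100)
  x.del <- x
  for (i in (n.keeprows + 1):nrow(x)) {
    j <- sample(1:n.items, k)
    x.del[i, j] <- NA
  }
  return(x.del)
}

set.seed(1)  # arbitrary seed, only so the run is repeatable
mat <- matrix(rnorm(1000 * 100), nrow = 1000, ncol = 100)  # made-up data
dat <- as.data.frame(mat)

system.time(r.df  <- random.del(dat, 100, 50))  # data frame version
system.time(r.mat <- random.del(mat, 100, 50))  # matrix version

# If a data frame is needed downstream, convert back once at the end
r.back <- as.data.frame(r.mat)
```

The point of the last line is that the conversion cost is paid once, rather than on every assignment inside the loop.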

Beyond that, for very large objects, this revision gives a slight performance increase (around 5 seconds for a 1 million by 100 object on my system). The gain is small for matrices and completely dwarfed by other bottlenecks for data frames, and it comes at the cost of readability and flexibility:

rdel <- function (x, n.keeprows, del.percent){
  n.items <- ncol(x)
  # number of columns to set to NA in each row
  k <- as.integer(n.items * del.percent / 100)
  cols <- 1:n.items
  lcols <- length(cols)

  for (i in (n.keeprows+1):nrow(x)){
    # call the internal sampler directly to skip the overhead of sample()
    j <- cols[.Internal(sample(lcols, k, FALSE, NULL))]
    x[i,j] <- NA
  }

  return(x)
}
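`.Internal` is not part of R's documented interface, so the same idea can also be written with `sample.int`, the documented function for sampling from `1:n`. The version below is a sketch under that substitution, not the function timed above. The name `rdel.sampleint` and the small test matrix are made up, and it will probably give back a little of the speed gain.

```r
# Hypothetical variant of rdel that avoids .Internal
rdel.sampleint <- function(x, n.keeprows, del.percent) {
  n.items <- ncol(x)
  k <- as.integer(n.items * del.percent / 100)  # columns to blank per row
  for (i in (n.keeprows + 1):nrow(x)) {
    j <- sample.int(n.items, k)  # k distinct column indices
    x[i, j] <- NA
  }
  x
}

m <- matrix(1:200, nrow = 20, ncol = 10)  # small made-up example
res <- rdel.sampleint(m, n.keeprows = 5, del.percent = 30)
rowSums(is.na(res))  # 0 for the first 5 rows, 3 for each of the rest
```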

If you must use a data frame, you can gain some performance increase (for a 10000 by 100 data frame, it takes about 30 seconds on my system versus 40 for your original function) by using:

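random.del2 <- function (x, n.keeprows, del.percent){
  n.items <- ncol(x)
  k <- n.items*(del.percent/100)

  for (i in (n.keeprows+1):nrow(x)){
    j <- sample(1:n.items, k)
    # call the data frame method directly; the result has to be
    # assigned back, otherwise x is left unchanged
    x <- `[<-.data.frame`(x, i, j, NA)
  }

  return(x)
}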

which basically just saves R the trouble of figuring out which assignment method to use. Of course the problem is that your function becomes extremely specialized. If you pass anything to it but a data frame, good things will not happen.
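To make the dispatch point concrete: `x[i, j] <- NA` on a data frame ends up calling the `[<-.data.frame` method, so calling that method by name gives the same result, as long as its return value is assigned back. A small sketch with made-up data (the names `d1`, `d2` and `d3` are only for illustration):

```r
d1 <- data.frame(a = 1:3, b = 4:6, c = 7:9)  # made-up data
d2 <- d1

d1[2, c(1, 3)] <- NA                                # usual form, R dispatches
d2 <- `[<-.data.frame`(d2, 2, c(1, 3), value = NA)  # method called directly

identical(d1, d2)  # both routes give the same data frame

# The method returns a modified copy; if the result is discarded,
# the original object is untouched
d3 <- data.frame(a = 1:3, b = 4:6)
invisible(`[<-.data.frame`(d3, 2, 1, value = NA))
anyNA(d3)  # FALSE
```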

Cheers,

Josh

On Sat, Apr 23, 2011 at 5:37 PM, sneaffer <sneaffer_at_mail.ru> wrote:

> Hello R-world,
>
> Please, help me to get round my little mess
> I have a data.frame in which I'd rather like some values to be NA for the
> future imputation process.
>
> I've come up with the following piece of code:
>
> random.del <- function (x, n.keeprows, del.percent){
> n.items <- ncol(x)
> k <- n.items*(del.percent/100)
> x.del <- x
> for (i in (n.keeprows+1):nrow(x)){
> j <- sample(1:n.items, k)
> x.del[i,j] <- NA
> }
> return (x.del)
> }
>
> The problem is that random.del turns out to be slow on huge samples.
> Is there any other more effective/charming way to do the same?
>
> Thanks,
> Sergey

