From: Marc Schwartz <MSchwartz_at_mn.rr.com>
Date: Mon 15 May 2006 - 05:43:17 EST

On Sat, 2006-05-13 at 23:53 -0400, jim holtman wrote:
> Using the algorithm in the reference
> www.everything2.com/index.pl?node_id=449702 and the 'bitops' package,
> here is some code that will select the numeric values with a specified
> number of bits. It takes less than 12 seconds to select from 21-bit
> numbers. If I had more memory, I would expect that for 25-bit numbers
> it would take about 200 seconds to compute the indices.
>
> > require(bitops)
>  TRUE
> > x <- 0:(2^21 - 1) # test all numbers with 21 bits
> > system.time({
> + })
>  10.40 0.87 11.82 NA NA
> > # print the first 10 numbers
> > which(x == 9)[1:10] - 1 # account for sequence starting at zero
>  511 767 895 959 991 1007 1015 1019 1021 1022
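
For reference, the bit-count step Jim describes can be sketched without the bitops package, using base R's bitwAnd()/bitwShiftR(); the mask constants below follow the standard parallel popcount algorithm at the cited everything2 reference, so treat this as a reconstruction rather than Jim's exact code:

```r
## Parallel popcount: fold pairs, nibbles, bytes, etc. of bits into counts.
x <- 0:(2^21 - 1)
b <- x
b <- bitwAnd(b, 0x55555555) + bitwAnd(bitwShiftR(b, 1), 0x55555555)
b <- bitwAnd(b, 0x33333333) + bitwAnd(bitwShiftR(b, 2), 0x33333333)
b <- bitwAnd(b, 0x0F0F0F0F) + bitwAnd(bitwShiftR(b, 4), 0x0F0F0F0F)
b <- bitwAnd(b, 0x00FF00FF) + bitwAnd(bitwShiftR(b, 8), 0x00FF00FF)
b <- bitwAnd(b, 0x0000FFFF) + bitwAnd(bitwShiftR(b, 16), 0x0000FFFF)

## first ten 21-bit numbers with exactly 9 bits set
## (subtract 1 because the sequence starts at zero)
head(which(b == 9) - 1, 10)
```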

Very nice Jim. I was not aware of the bitops package.

BTW, FWIW: x <- 0:(2^25 - 1)

> system.time({
+   x <- bitAnd(x, mask1) + bitAnd(bitShiftR(x, 1), mask1)
+ })
 52.803 9.016 67.885 0.000 0.000

> str(x)

num [1:33554432] 0 1 1 2 1 2 2 3 1 2 ...

> object.size(x)
 268435484

That compares to:

 4.880 0.700 5.706 0.000 0.000

for x <- 0:(2^21 - 1) on my system. So the timing appears to grow less than linearly with the size of 'x': 16 times the data takes roughly 11 times the user time.

Now, if Nameeta does indeed require the binary representations of the values as character vectors, as she noted, she will need to be careful about both RAM and time efficiency in that conversion.

I tried doing this on my system in one step and quickly ran out of RAM (2 Gb) and swap (1 Gb), watching both go to 0% free in gkrellm. So the conversion will likely need to be done stepwise to avoid multiple internal copies of large objects.

Here is one approach.

For example:

# Using x <- 0:(2^25 - 1)
> length(which(x == 9))
 2042975

So, first convert the matching values to binary, again using digitsBase() from the sfsmisc package:

> system.time(z <- digitsBase(which(x == 9) - 1, ndigits = 25))
 22.125 5.189 29.080 0.000 0.000

> str(z)

 basedInt [1:25, 1:2042975] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "class")= chr "basedInt"
 - attr(*, "base")= num 2
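
For readers without sfsmisc, base R's intToBits() can stand in for digitsBase(); this sketch assumes the same layout as the str() output above (25 rows, one column per number, most significant digit in row 1):

```r
## Base-R stand-in for sfsmisc::digitsBase(v, ndigits = 25):
## intToBits() gives 32 raw bits, least significant first, so keep the
## low ndigits bits and reverse them to get MSB-first columns.
to_bits <- function(v, ndigits = 25) {
  vapply(v, function(n) as.integer(intToBits(n))[ndigits:1],
         integer(ndigits))
}

z <- to_bits(c(9, 511))
apply(z, 2, paste, collapse = "")
```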

# z is 400 Mb
> object.size(z)

 408595348

# Now paste() each column of z into a single character string
> system.time(mat <- matrix(apply(z, 2, paste, collapse = ""), ncol = 1))
 349.762 4.196 391.827 0.000 0.000

> str(mat)

 chr [1:2042975, 1] "0000000000000000111111111" ...

# mat is 128 Mb
> object.size(mat)

 130750524

      [,1]
 [1,] "0000000000000000111111111"
 [2,] "0000000000000001011111111"
 [3,] "0000000000000001101111111"
 [4,] "0000000000000001110111111"
 [5,] "0000000000000001111011111"
 [6,] "0000000000000001111101111"
 [7,] "0000000000000001111110111"
 [8,] "0000000000000001111111011"
 [9,] "0000000000000001111111101"
[10,] "0000000000000001111111110"

Note however, that even after forcing gc() I get:

> gc()

            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   2271012  60.7    7706096  205.8   9405098  251.2
Vcells  93910018 716.5  193159857 1473.7 170025834 1297.2

Before gc(), gkrellm was showing 12% RAM free; after, it shows 38% free.

So even with 2 Gb of RAM, the above is pushing the envelope of practicality, I suspect.
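
A stepwise version of the conversion, along the lines suggested above, might look like the sketch below; the 1e5 chunk size is an arbitrary choice for illustration, and base R's intToBits() stands in for digitsBase():

```r
## Convert indices to 25-digit binary strings in chunks, so only one
## chunk's digit matrix is live at a time (limits peak RAM).
to_binary_strings <- function(idx, ndigits = 25, chunk = 1e5) {
  out <- character(length(idx))
  for (start in seq(1, length(idx), by = chunk)) {
    end <- min(start + chunk - 1, length(idx))
    bits <- vapply(idx[start:end],
                   function(n) as.integer(intToBits(n))[ndigits:1],
                   integer(ndigits))
    out[start:end] <- apply(bits, 2, paste, collapse = "")
    rm(bits)  # release this chunk's digit matrix before the next pass
  }
  out
}

to_binary_strings(c(511, 767, 1022))
```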

HTH, Marc Schwartz

R-help@stat.math.ethz.ch mailing list