Re: [R] bootstrap resampling - simplified

From: Dennis Murphy <djmuser_at_gmail.com>
Date: Tue, 01 Mar 2011 11:13:37 -0800

Hi:

On Tue, Mar 1, 2011 at 8:22 AM, Bodnar Laszlo EB_HU <Laszlo.Bodnar_at_erstebank.hu> wrote:

> Hello there,
>
> I have a problem concerning bootstrapping in R - especially focusing on the
> resampling part of it. I try to sum it up in a simplified way so that I
> would not confuse anybody.
>
> I have a small database consisting of 20 observations (basically numbers
> from 1 to 20, I mean: 1, 2, 3, 4, 5, ... 18, 19, 20).
>

To check on the probability of this event happening, I ran the following:

    bootmat <- matrix(sample(1:20, 200000, replace = TRUE), nrow = 10000)
    sum(apply(bootmat, 1, function(x) any(table(x) >= 5)))
    [1] 492

That's about 0.05 (492 out of 10000). A Q&D (quick and dirty) 'solution' would be to oversample by at least 5% (let's do 10% just to be on the safe side), throw away the bad samples, and then pick out the first B of those that remain. In the above example, we could draw 11000 samples instead and keep the first 10000 that meet the criterion:

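    bootmat <- matrix(sample(1:20, 220000, replace = TRUE), nrow = 11000)
    # TRUE for rows in which some value occurs 5 or more times
    badsamps <- apply(bootmat, 1, function(x) any(tabulate(x) >= 5))
    # badsamps is logical, so drop the flagged rows with !, not with -
    bootfin <- bootmat[!badsamps, ][1:10000, ]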

Time:

   user system elapsed
   0.28 0.00 0.28

(Note 1: Using table instead of tabulate took 4.22 seconds on my machine - tabulate is much faster.)
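For positive integer data such as 1:20, the two counting functions should flag the same rows; only the speed differs. A minimal sketch to check this on a smaller matrix follows. The seed, the 2000-row size, and the names flag_table and flag_tabulate are assumptions made for illustration, and the timings will vary by machine.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
bootmat <- matrix(sample(1:20, 20 * 2000, replace = TRUE), nrow = 2000)

# Flag rows in which some value occurs 5 or more times, counted two ways
flag_table    <- apply(bootmat, 1, function(x) any(table(x) >= 5))
flag_tabulate <- apply(bootmat, 1, function(x) any(tabulate(x, nbins = 20) >= 5))

# Both approaches should flag exactly the same rows
identical(flag_table, flag_tabulate)

# Rough timing comparison; absolute numbers depend on the machine
system.time(apply(bootmat, 1, function(x) any(table(x) >= 5)))
system.time(apply(bootmat, 1, function(x) any(tabulate(x, nbins = 20) >= 5)))
```

Passing nbins = 20 fixes the length of the count vector; it does not change which rows are flagged.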
(Note 2: In the call above, there were 539 bad samples, so the 5% ballpark estimate seems plausible.)
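The 5% figure can also be sanity-checked without simulation. The sketch below is an approximation, not an exact calculation: the count of any one value in a resample is Binomial(20, 1/20), and multiplying the tail probability by 20 gives a union bound (an upper bound, since the counts of different values are not independent). The variable names are chosen here for illustration.

```r
n <- 20        # values per resample, each drawn uniformly from 1:20
p <- 1 / 20    # chance that a single draw equals one particular value

# P(one particular value occurs 5 or more times in a resample)
p_one <- pbinom(4, size = n, prob = p, lower.tail = FALSE)

# Union bound over the 20 values: upper bound on P(a resample is rejected)
p_bound <- 20 * p_one
p_bound
```

The bound comes out at roughly 0.05, in line with the simulated rejection rates above.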

This is a simple application of the accept-reject criterion. I don't know how large 'many' is to you, but 10,000 seems to be a reasonable starting point. I ran it again for 1,000,000 such samples, and the completion time was

   user system elapsed
  36.74 0.31 37.15

so the processing time grows somewhat faster than linearly in the number of samples. If your simulations are of this magnitude and are to be run repeatedly, you probably need to write a function to improve the speed and to get rid of the waste produced by a rejection sampling approach. If this is a one-off deal, perhaps the above is sufficient.
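One way such a function might look is sketched below. It is still rejection sampling, but it redraws only the rejected rows instead of guessing an oversampling margin up front. The function name capped_boot and its arguments B, n and max_count are hypothetical, not from any package. Note that max_count = 4 corresponds to the >= 5 test used above, whereas max_count = 5 lets a value appear exactly five times, as the question describes.

```r
# Hypothetical helper, not part of the original post or of any package.
# B: number of resamples wanted; n: number of observations (values 1..n);
# max_count: largest number of times one value may appear in a resample.
capped_boot <- function(B, n = 20, max_count = 5) {
  # k resamples of size n, one per row
  draw <- function(k) matrix(sample(n, n * k, replace = TRUE), nrow = k)
  # TRUE for rows in which some value occurs more than max_count times
  is_bad <- function(m) {
    apply(m, 1, function(x) any(tabulate(x, nbins = n) > max_count))
  }

  out <- draw(B)
  bad <- is_bad(out)
  while (any(bad)) {
    out[bad, ] <- draw(sum(bad))                  # redraw only rejected rows
    bad[bad] <- is_bad(out[bad, , drop = FALSE])  # recheck only those rows
  }
  out
}

set.seed(42)  # arbitrary seed for reproducibility
boots <- capped_boot(1000, n = 20, max_count = 5)
dim(boots)
# Largest count of any single value across all resamples
max(apply(boots, 1, function(x) max(tabulate(x, nbins = 20))))
```

Each row is redrawn independently until it passes, so every row is a draw from the same constrained distribution as in the oversample-and-filter approach. The row-wise apply is still the main cost, so this removes the wasted draws rather than the counting overhead.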

HTH,
Dennis

> I would like to resample this database many times for the bootstrap process
> with the following conditions. Firstly, every resampled database should also
> include 20 observations. Secondly, when selecting a number from the
> above-mentioned 20 numbers, you can do this selection with replacement. The
> difficult part comes now: one number can be selected only maximum 5 times.
> In order to make this clear I show you a couple of examples. So the
> resampled databases might be like the following ones:
>
> (1st database) 1,2,1,2,1,2,1,2,1,2,3,3,3,3,3,4,4,4,4,4
> 4 different numbers are chosen (1, 2, 3, 4), each selected - for the
> maximum possible - 5 times.
>
> (2nd database) 1,8,8,6,8,8,8,2,3,4,5,6,6,6,6,7,19,1,1,1
> Two numbers - 8 and 6 - selected 5 times (the maximum possible times),
> number 1 selected 4 times, the others selected less than 4 times.
>
> (3rd database) 1,1,2,2,3,3,4,4,9,9,9,10,10,13,10,9,3,9,2,1
> Number 9 chosen for the maximum possible 5 times, number 10, 3, 2, 1 chosen
> for 3 times, number 4 selected twice and number 13 selected only once.
>
> ...
>
> Anybody knows how to implement my "tricky" condition into one of the R
> functions - that one number can be selected only 5 times at most? Are 'boot'
> and 'bootstrap' packages capable of managing this? I guess they are, I just
> couldn't figure it out yet...
>
> Thanks very much! Best regards,
> Laszlo Bodnar
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Received on Tue 01 Mar 2011 - 19:17:34 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 01 Mar 2011 - 19:40:18 GMT.
