Re: [R] how to control the sampling to make each sample unique

From: HelponR <suncertain_at_gmail.com>
Date: Thu, 10 May 2007 13:47:07 -0700

Yeah, I want to get all unique combinations of choosing ntest from ntotal.

For example, choosing 4000 training records from 10,000 total records.

Suppose they are indexed as 1:10000.

One obvious combination is 1:4000

Then I run

sample(1:10000, 4000)

it may output 4000 numbers:

1, 3, 5, .... 7999

Then I run again,

it may output another 4000 numbers:

2, 4, 6, ..., 8000

I know the number of such unique combinations is

Choose 4000 from 10,000

(I forgot how to denote this.)

Anyway, I remember that choosing m from n is computed as T = n! / (m! (n-m)!)

! is factorial
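
To get a feel for how large T is in this example, here is a rough sketch in R. It assumes the numbers above (n = 10000, m = 4000); the variable names are only illustrative. The count itself is far too large to represent as a double, so the sketch works on the log scale with lchoose().

```r
n <- 10000   # total records
m <- 4000    # records drawn for training

# lchoose() returns the natural log of "n choose m";
# dividing by log(10) converts it to a base-10 logarithm.
log10_T <- lchoose(n, m) / log(10)

# Approximate number of decimal digits in T
digits_T <- floor(log10_T) + 1
print(digits_T)
```

With a T that has thousands of digits, two independent random draws are extremely unlikely to produce the same combination by chance.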

My concern is:
when will the sample output start to repeat?

For example, the next time I run it, the output might be the same as the first combination:
1, 2, 3, ..., 4000
That's not what I want.

I hope to get T different (unique) combinations in T runs. It is fine if it starts to repeat after T runs.

sample() may already work this way, but I am not sure.
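
One way to guarantee that no combination is reused, regardless of how sample() behaves internally, is to remember each combination already drawn and redraw on a match. The following is only a minimal sketch of that idea, not a statement of how sample() works: n_runs, seen, and splits are hypothetical names, and set.seed() is included only to make the sequence of draws reproducible.

```r
set.seed(1)        # optional: same seed gives the same sequence of draws

n_total <- 10000   # total records
n_train <- 4000    # training set size
n_runs  <- 20      # hypothetical number of shuffles

seen   <- character(0)            # keys of combinations already used
splits <- vector("list", n_runs)  # one train/test split per run

for (i in seq_len(n_runs)) {
  repeat {
    # Sort so the same set of rows always maps to the same key
    train <- sort(sample(n_total, n_train))
    key   <- paste(train, collapse = ",")
    if (!(key %in% seen)) break   # redraw only if this set was used before
  }
  seen <- c(seen, key)
  # The test set is everything not in the training set, so the pair is disjoint
  splits[[i]] <- list(train = train,
                      test  = setdiff(seq_len(n_total), train))
}
```

Each element of splits then holds row indices, for example splits[[1]]$train and splits[[1]]$test. The linear lookup in seen is fine for a modest number of runs; for very many runs a hashed lookup would scale better.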

Thank you!

On 5/10/07, Rory Martin <rory.martin_at_comcast.net> wrote:
>
> I think you're asking a design question about a Monte Carlo simulation. You
> have a "population" (size 10,000) from which you're defining an empirical
> distribution, and you're sampling from this to create pairs of training and
> test samples.
>
> You need to ensure that each specific pair of training and test samples is
> disjoint, meaning no observations in common. Normally, you wouldn't want to
> make the different training samples disjoint, if that's what you meant by
> them being "unique". Or were you using it to mean "identical"?
>
> Regards
> Rory Martin
>
>
> > From: HelponR <suncertain_at_gmail.com> Date: Wed, 09 May 2007 17:28:19
> >
> > I have a dataset of 10000 records which I want to use to compare two
> > prediction models.
> >
> > I split the records into a test dataset (size = ntest) and a training
> > dataset (size = ntrain). Then I run the two models.
> >
> > Now I want to shuffle the data and rerun the models. I want many shuffles.
> >
> > I know that the following command
> >
> > sample ((1:10000), ntrain)
> >
> > can pick ntrain numbers from 1 to 10000. Then I just use these rows as the
> > training dataset.
> >
> > But how can I make sure each run of sample produces different results? I
> > want the output to be unique each time. I tested sample() and found that it
> > usually produces different combinations. But can I control it somehow? Is
> > there a better way to write this?
>
> ______________________________________________
> R-help_at_stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


Received on Thu 10 May 2007 - 20:52:48 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 10 May 2007 - 21:32:10 GMT.
