Re: [Rd] proposed change to 'sample'

From: William Dunlap <wdunlap_at_tibco.com>
Date: Sun, 20 Jun 2010 10:49:43 -0700

> -----Original Message-----
> From: r-devel-bounces_at_r-project.org
> [mailto:r-devel-bounces_at_r-project.org] On Behalf Of Patrick Burns
> Sent: Sunday, June 20, 2010 3:08 AM
> To: r-devel_at_r-project.org
> Subject: [Rd] proposed change to 'sample'
>
> There is a weakness in the 'sample'
> function that is highlighted in the
> help file. The 'x' argument can be
> either the vector from which to sample,
> or the maximum value of the sequence
> from which to sample.
>
> This can be ambiguous if the length of
> 'x' is one.
>
> I propose adding an argument that allows
> the user (programmer) to avoid that
> ambiguity:
>
> function (x, size, replace = FALSE, prob = NULL,
> max = length(x) == 1L && is.numeric(x) && x >= 1)

S+'s sample() has an argument 'n' to achieve the same result. It has been there since at least 2005 (S+ 7.0.6). sample(n=n) means to return a sample from seq_len(n), where n must be a scalar nonnegative integer. sample(x=x) retains its old ambiguous meaning.
  sample(x, size = n, replace = F, prob = NULL, n = NULL, ...)
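
To make the idea concrete in R terms, here is a rough sketch of a wrapper with a separate 'n' argument. The name sample_pop and its body are hypothetical; this is not the S+ source. Unlike S+, the sketch always treats 'x' as the population, so only 'n' can mean "sample from 1:n".

```r
# Hypothetical wrapper (illustrative only, not the S+ implementation):
# the caller passes either a population `x` or a population size `n`.
sample_pop <- function(x = NULL, size = NULL, replace = FALSE,
                       prob = NULL, n = NULL) {
  if (!is.null(n)) {
    # `n` must be a single nonnegative whole number
    stopifnot(is.numeric(n), length(n) == 1L, n >= 0, n == floor(n))
    if (is.null(size)) size <- n
    return(sample.int(n, size, replace, prob))
  }
  # Assumption of this sketch: `x` is always the population, even if length 1
  if (is.null(size)) size <- length(x)
  x[sample.int(length(x), size, replace, prob)]
}

sample_pop(n = 5)    # a permutation of 1:5
sample_pop(x = 12)   # always 12, never a draw from 1:12
```

With this split the caller states the intent explicitly, so a population that happens to shrink to length one is not reinterpreted as a maximum.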

S+ also has an rsample function where n (with the same meaning) is the only way to specify the population. It also has an order=TRUE/FALSE argument where order=TRUE means to randomly order the output. order=FALSE means that the ordering of the output is unspecified, but it allows the person writing rsample methods to use the quickest way to get a random sample (for big data it can be fastest to return the sample from 1:n in increasing order).
  rsample(n, size = n, replace = F, prob = NULL,
        bigdata = F, minimal = NULL, ..., order = T)

I like the idea of separating the concepts of sampling and permuting data. Many statistics are invariant to ordering of the data and it can be a waste of time to randomly order a sample to feed to such functions.
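
As an illustration of sampling without permuting, here is a minimal R sketch that draws a sample from 1:n in increasing order in a single pass. It uses selection sampling (Knuth's Algorithm S), which is my choice for the example and not necessarily what rsample does; the name ordered_sample is hypothetical.

```r
# Hypothetical helper: selection sampling (Knuth's Algorithm S).
# Returns `size` distinct values from 1:n, in increasing order.
ordered_sample <- function(n, size) {
  stopifnot(size >= 0, size <= n)
  out <- integer(size)
  taken <- 0L
  for (i in seq_len(n)) {
    if (taken == size) break
    # Keep i with probability (still needed) / (still available)
    if (runif(1) * (n - i + 1) < (size - taken)) {
      taken <- taken + 1L
      out[taken] <- i
    }
  }
  out
}

set.seed(42)
x <- rnorm(100)
idx <- ordered_sample(100, 10)
shuffled <- idx[sample.int(length(idx))]    # same sample, random order
all.equal(mean(x[idx]), mean(x[shuffled]))  # the mean ignores ordering
```

A statistic such as the mean gives the same answer for the ordered and the shuffled indices, so the shuffle is work that could be skipped.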

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> {
>     if (max) {
>         if (missing(size))
>             size <- x
>         .Internal(sample(x, size, replace, prob))
>     }
>     else {
>         if (missing(size))
>             size <- length(x)
>         x[.Internal(sample(length(x), size, replace, prob))]
>     }
> }
> <environment: namespace:base>
>
>
> This just takes the condition of the first
> 'if' to be the default value of the new 'max'
> argument.
>
> So in the "surprise" section of the examples
> in the 'sample' help file
>
> sample(x[x > 9])
>
> and
>
> sample(x[x > 9], max=FALSE)
>
> have different behaviours.
>
> By the way, I'm certainly not convinced that
> 'max' is the best name for the argument.
>
> --
> Patrick Burns
> pburns_at_pburns.seanet.com
> http://www.burns-stat.com
> (home of 'Some hints for the R beginner'
> and 'The R Inferno')
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
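
The quoted proposal can be tried out at user level with a sketch like the following. It assumes sample.int() as a stand-in for the .Internal call, and the name sample_max is hypothetical; the argument list and branching follow the quoted code.

```r
# Hypothetical user-level version of the quoted proposal; sample.int()
# replaces .Internal(sample(...)), which user code should not call.
sample_max <- function(x, size, replace = FALSE, prob = NULL,
                       max = length(x) == 1L && is.numeric(x) && x >= 1) {
  if (max) {
    # x is the upper end of the sequence 1:x
    if (missing(size)) size <- x
    sample.int(x, size, replace, prob)
  } else {
    # x is the population itself
    if (missing(size)) size <- length(x)
    x[sample.int(length(x), size, replace, prob)]
  }
}

x <- c(3, 10, 12)
sample_max(x[x > 11])               # 12 read as a maximum: permutation of 1:12
sample_max(x[x > 11], max = FALSE)  # 12 read as the population: returns 12
```

Passing max = FALSE in programmatic code keeps the result the same length as the subset, whatever that length turns out to be.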



Received on Sun 20 Jun 2010 - 18:04:20 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.