Re: [Rd] proposed change to 'sample'

From: William Dunlap <wdunlap_at_tibco.com>
Date: Sun, 20 Jun 2010 21:04:13 -0700

> -----Original Message-----
> From: Peter Dalgaard [mailto:pdalgd_at_gmail.com]
> Sent: Sunday, June 20, 2010 2:12 PM
> To: William Dunlap
> Cc: Patrick Burns; r-devel_at_r-project.org
> Subject: Re: [Rd] proposed change to 'sample'
>
> William Dunlap wrote:
> >> -----Original Message-----
> >> From: r-devel-bounces_at_r-project.org
> >> [mailto:r-devel-bounces_at_r-project.org] On Behalf Of Patrick Burns
> ....
> >>
> >> I propose adding an argument that allows
> >> the user (programmer) to avoid that
> >> ambiguity:
> >>
> >> function (x, size, replace = FALSE, prob = NULL,
> >> max = length(x) == 1L && is.numeric(x) && x >= 1)
> >
> > S+'s sample() has an argument 'n' to achieve
> > the same result. It has been there since at
> > least 2005 (S+ 7.0.6). sample(n=n) means to
> > return a sample from seq_len(n), where n must
> > be a scalar nonnegative integer. sample(x=x)
> > retains its old ambiguous meaning.
> > sample(x, size = n, replace = F, prob = NULL, n = NULL, ...)
>
> Hmm, that doesn't really solve the issue does it? I.e., you still
> cannot conveniently sample from a vector that is possibly of size 1.
>
> I would be more inclined to make sampling from a vector the normal
> case, and default x to say 1:max(n, size), forcing users to say
> sample(n=5) if sampling from x=1:5 is desired. This could be a
> manageable change; the deprecation sequence is a bit painful to
> think through, though.

I think that the risk of breaking old code was why we allowed the user to write an unambiguous sample(n=n) but didn't change how sample(x=scalar) worked. Internally, we had long discouraged using sample(x=vector) because of the ambiguity problem, preferring x[sample(length(x),...)].
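
A minimal sketch of that idiom in R, for illustration only. The helper name safe_sample is hypothetical (it is not part of R or S+); it simply wraps x[sample(length(x), ...)] so that x is always treated as the population, even when it has length 1.

```r
# Hypothetical helper (not part of R or S+): always samples from the
# elements of x, so a length-1 numeric x is not expanded to 1:x.
safe_sample <- function(x, size = length(x), replace = FALSE, prob = NULL) {
  x[sample(length(x), size = size, replace = replace, prob = prob)]
}

x <- 10
length(sample(x))   # 10: a scalar x >= 1 is treated as 1:x
safe_sample(x)      # 10: the single element itself
```

Sampling indices first and subscripting afterwards sidesteps the scalar special case, at the cost of one extra subscripting call.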

I notice that S+'s rsample() does not allow sampling from a vector, only from seq_len(n). I think that is because it was felt that sampling rows from a data.frame (or the bigdata equivalent, bdframe) was a more common operation, and the code was simpler/faster if rsample didn't have to call out to possible subscripting methods. Relaxing the requirement that the output be a randomly permuted sample mattered more when dealing with long datasets.

In any case, I was just stating that if sample were changed to allow disambiguation of its first argument, using 'n' instead of 'max' would be compatible with S+.
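
As a rough sketch of what an 'n' argument could look like in R, assuming the semantics described earlier in this thread (n is a scalar nonnegative integer and the result is drawn from seq_len(n)): the wrapper name sample_n and its argument checks are hypothetical, not the S+ implementation or a proposed patch to R.

```r
# Hypothetical wrapper (not the S+ code): 'n' selects the unambiguous
# "sample from seq_len(n)" form; otherwise fall back to base sample().
sample_n <- function(x, size, replace = FALSE, prob = NULL, n = NULL) {
  if (!is.null(n)) {
    if (!is.numeric(n) || length(n) != 1L || is.na(n) || n < 0 || n != round(n)) {
      stop("'n' must be a scalar nonnegative integer")
    }
    if (missing(size)) size <- n
    return(sample.int(n, size, replace, prob))
  }
  # Old behaviour, including the scalar-x special case.
  if (missing(size)) {
    sample(x, replace = replace, prob = prob)
  } else {
    sample(x, size, replace, prob)
  }
}

sample_n(n = 5)     # a permutation of 1:5
sample_n(n = 1)     # always 1
```

Calls that pass x positionally behave as before, so existing code would be unaffected under this sketch.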

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

>
> --
> Peter Dalgaard
> Center for Statistics, Copenhagen Business School
> Phone: (+45)38153501
> Email: pd.mes_at_cbs.dk Priv: PDalgd_at_gmail.com
>



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Mon 21 Jun 2010 - 04:14:30 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 21 Jun 2010 - 19:31:12 GMT.