Re: [Rd] sweep sanity checking?

From: Martin Maechler <maechler_at_stat.math.ethz.ch>
Date: Mon, 03 Sep 2007 08:44:40 +0200

I've been asked to close this thread "in public"

>>>>> "TH" == Turner, Heather <Heather.Turner_at_warwick.ac.uk>
>>>>> on Wed, 22 Aug 2007 10:20:21 +0100 writes:

    TH> Petr Savicky kindly brought this thread to my attention
    TH> as I'm afraid it had passed me by. As one of the
    TH> contributors to the earlier discussion on adding
    TH> warnings to sweep I would like to give my support to
    TH> Petr's proposed patch.

Petr's proposed patch has been added to R-devel (to become R-2.6.0 alpha in two days) last week.
Actually it was found subsequently that there was a small problem in the patch [basically because n %% n is zero only for n != 0 and the case n==0 is valid].

If you are interested in using the new sweep() version, and because we are entering alpha testing of R 2.6.0, I'd recommend you to start using "R-devel".

Martin Maechler

    TH> For the record I should say that Petr was right to point
    TH> out that the use of MARGIN in my examples did not make
    TH> sense
    TH> https://stat.ethz.ch/pipermail/r-devel/2007-July/046487.html
    TH> so I have no quibble with that.

    TH> I think it is sensible too, to use the dim attribute of
    TH> STATS as the basis of the test, when the dim attribute
    TH> is present. This provides a way to control the strength
    TH> of the test in the case of sweeping out a vector, as
    TH> Petr describes in his message below. I think that the
    TH> proposed patch successfully brings together the
    TH> different views on what should be tested, which was the
    TH> stumbling block last time around     TH> https://stat.ethz.ch/pipermail/r-help/2005-June/074037.html
    TH> Even if people set check.margin = FALSE for reasons of
    TH> speed, this in itself should be a useful check, since
    TH> they will need to be confident that the test is
    TH> unnecessary.

    TH> Heather

    TH> -----Original Message----- From:
    TH> r-devel-bounces_at_r-project.org
    TH> [mailto:r-devel-bounces_at_r-project.org] On Behalf Of Petr
    TH> Savicky Sent: 08 August 2007 07:54 To:
    TH> r-devel_at_r-project.org Subject: Re: [Rd] sweep sanity
    TH> checking?

    TH> Thanks to Martin Maechler for his comments, advice and
    TH> for pointing out the speed problem. Thanks also to Ben
    TH> Bolker for tests of speed, which confirm that for small
    TH> arrays, a slow down by a factor of about 1.2 - 1.5 may
    TH> occur. Now, I would like to present a new version of
    TH> sweep, which is simpler and has an option to avoid the
    TH> test. This is expected to be used in scripts, where the
    TH> programmer is quite sure that the usage is correct and
    TH> speed is required. The new version differs from the     TH> previous one in the following:
    TH> 1. The option check.margin has a different meaning. It
    TH> defaults to TRUE and it determines whether the test is
    TH> performed or not.

    TH> 2. Since check.margin has the meaning above, it cannot
    TH> be used to select, which test should be performed. This
    TH> depends on the type of STATS. The suggested sweep
    TH> function contains two tests: - a vector test by Heather
    TH> Turner, which is used, if STATS has no dim attribute
    TH> and, hence, is a vector (STATS should not be anything
    TH> else than a vector or an array) - an array test used if
    TH> STATS has dim attribute.  The vector test allows some
    TH> kinds of recycling, while the array test does
    TH> not. Hence, in the most common case, where x is a matrix
    TH> and STATS is a vector, if the user likes to be warned if
    TH> the length of the vector is not exactly the right one,
    TH> the following call is suggested:
    TH> sweep(x,MARGIN,as.array(STATS)). Otherwise, a warning
    TH> will be generated only if length(STATS) does not divide
    TH> the specified dimension of x, which is nrow(x)     TH> (MARGIN=1) or ncol(x) (MARGIN=2).
    TH> 3. If STATS is an array, then the test is more
    TH> restrictive than in the previous version. It is now
    TH> required that after deleting dimensions with one level,
    TH> the remaining dimensions coincide.  The previous version
    TH> allowed additionally the cases, when dim(STATS) is a
    TH> prefix of dim(x)[MARGIN], for example, if dim(STATS) =
    TH> k1 and dim(x)[MARGIN] = c(k1,k2).

    TH> The code of the tests in the suggested sweep is based on
    TH> the previous suggestions
    TH> https://stat.ethz.ch/pipermail/r-help/2005-June/073989.html
    TH> by Robin Hankin
    TH> https://stat.ethz.ch/pipermail/r-help/2005-June/074001.html
    TH> by Heather Turner

    TH> https://stat.ethz.ch/pipermail/r-devel/2007-June/046217.html     TH> by Ben Bolker with some further modifications.

    TH> The modification of sweep.Rd was prepared by Ben Bolker     TH> and me.

    TH> I would like to encourage everybody who likes to express
    TH> his opinion on the patch to do it now. In my opinion,
    TH> the suggestion of the new code stabilized in the sense
    TH> that I will not modify it unless there is a negative
    TH> feedback.

    TH> A patch against R-devel_2007-08-06 is attached. It
    TH> contains tabs. If they are corrupted by email transfer,
    TH> use the link

    TH> http://www.cs.cas.cz/~savicky/R-devel/patch-sweep which     TH> is an identical copy.

    TH> Petr Savicky.

    TH> ========================================================================
    TH> ======================== ---
    TH> R-devel_2007-08-06/src/library/base/R/sweep.R 2007-07-27
    TH> 17:51:13.000000000 +0200 +++
    TH> R-devel_2007-08-06-sweep/src/library/base/R/sweep.R
    TH> 2007-08-07 10:30:12.383672960 +0200 @@ -14,10 +14,29 @@
    TH> # A copy of the GNU General Public License is available     TH> at # http://www.r-project.org/Licenses/  
    TH> -sweep <- function(x, MARGIN, STATS, FUN = "-", ...)
    TH> +sweep <- function(x, MARGIN, STATS, FUN = "-",
    TH> check.margin=TRUE, ...)  { FUN <- match.fun(FUN) dims <-
    TH> dim(x) + if (check.margin) { + dimmargin <- dims[MARGIN]
    TH> + dimstats <- dim(STATS) + lstats <- length(STATS) + if
    TH> (lstats > prod(dimmargin)) { + warning("length of STATS
    TH> greater than the extent of dim(x)[MARGIN]") + } else if
    TH> (is.null(dimstats)) { # STATS is a vector + cumDim <-
    TH> c(1, cumprod(dimmargin)) + upper <- min(cumDim[cumDim >=
    TH> lstats]) + lower <- max(cumDim[cumDim <= lstats]) + if
    TH> (upper %% lstats != 0 || lstats %% lower != 0) +
    TH> warning("STATS does not recycle exactly across MARGIN")
    TH> + } else { + dimmargin <- dimmargin[dimmargin > 1] +
    TH> dimstats <- dimstats[dimstats > 1] + if
    TH> (length(dimstats) != length(dimmargin) || any(dimstats
    TH> != dimmargin)) + warning("length(STATS) or dim(STATS) do
    TH> not match dim(x)[MARGIN]") + } + } perm <- c(MARGIN,
    TH> (1:length(dims))[ - MARGIN]) FUN(x, aperm(array(STATS,
    TH> dims[perm]), order(perm)), ...)  } ---
    TH> R-devel_2007-08-06/src/library/base/man/sweep.Rd
    TH> 2007-07-27 17:51:35.000000000 +0200 +++
    TH> R-devel_2007-08-06-sweep/src/library/base/man/sweep.Rd
    TH> 2007-08-07 10:29:45.517757200 +0200 @@ -11,7 +11,7 @@
    TH> statistic.  } \usage{ -sweep(x, MARGIN, STATS, FUN="-",
    TH> \dots) +sweep(x, MARGIN, STATS, FUN="-",
    TH> check.margin=TRUE, \dots) } \arguments{ \item{x}{an
    TH> array.}  @@ -22,8 +22,18 @@ case of binary operators
    TH> such as \code{"/"} etc., the function name must
    TH> backquoted or quoted. (\code{FUN} is found by a call to
    TH> \code{\link{match.fun}}.)}  +
    TH> \item{check.margin}{logical. If \code{TRUE} (the
    TH> default), warn if the + length or dimensions of
    TH> \code{STATS} do + not match the specified dimensions of
    TH> \code{x}.}  \item{\dots}{optional arguments to
    TH> \code{FUN}.}  } +\details{ + The consistency check among
    TH> \code{STATS}, \code{MARGIN} and \code{x} + is stricter
    TH> if \code{STATS} is an array than if it is a vector.  +
    TH> In the vector case, some kinds of recycling are allowed
    TH> without a + warning. Use
    TH> \code{sweep(x,MARGIN,as.array(STATS))} if \code{STATS} +
    TH> is a vector and you want to be warned if any recycling
    TH> occurs.  +} \value{ An array with the same shape as
    TH> \code{x}, but with the summary statistics swept out.

    TH> ______________________________________________
    TH> R-devel_at_r-project.org mailing list     TH> https://stat.ethz.ch/mailman/listinfo/r-devel
    TH> ______________________________________________
    TH> R-devel_at_r-project.org mailing list     TH> https://stat.ethz.ch/mailman/listinfo/r-devel

R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Mon 03 Sep 2007 - 06:52:49 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 03 Sep 2007 - 12:40:10 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.