Re: [Rd] Possible page inefficiency in do_matrix in array.c

From: Simon Urbanek <simon.urbanek_at_r-project.org>
Date: Tue, 04 Sep 2012 17:05:14 -0400

On Sep 4, 2012, at 8:07 AM, "Matthew Dowle" <mdowle_at_mdowle.plus.com> wrote:

>

>> Actually, my apologies, I was assuming that your example was based on the
>> SO question while it is not at all (the code is not involved in that test
>> case). Reversing the order does indeed cause a delay. Switching to a
>> single index doesn't seem to have any impact. R-devel has the faster
>> version now (which now also works with large vectors).
>> 
>> Cheers,
>> Simon

>
> I was intrigued why the compiler doesn't swap the loops when you thought
> it should, though. You're not usually wrong! From GCC's documentation (end
> of last paragraph is the most significant) :
>
> ====
>
> -floop-interchange
> Perform loop interchange transformations on loops. Interchanging two
> nested loops switches the inner and outer loops. For example, given a loop
> like:
> DO J = 1, M
> DO I = 1, N
> A(J, I) = A(J, I) * C
> ENDDO
> ENDDO
>
> loop interchange transforms the loop as if it were written:
>
> DO I = 1, N
> DO J = 1, M
> A(J, I) = A(J, I) * C
> ENDDO
> ENDDO
>
> which can be beneficial when N is larger than the caches, because in
> Fortran, the elements of an array are stored in memory contiguously by
> column, and the original loop iterates over rows, potentially creating at
> each access a cache miss. This optimization applies to all the languages
> supported by GCC and is not limited to Fortran. To use this code
> transformation, GCC has to be configured with --with-ppl and --with-cloog
> to enable the Graphite loop transformation infrastructure.
>
> ====
>
> Could R build scripts be configured to set these gcc flags to turn on
> "Graphite", then? I guess one downside could be the time to compile.
>

The is something odd happening - when I use stand-alone code it works:

$ gcc -o t2 -O3 t2.c

$ time ./t2
0x7fbae3b4f010

real	0m1.045s
user	0m0.784s
sys	0m0.260s

$ gcc -o t2 -floop-interchange -O3 t2.c

$ time ./t2
0x7f4e516f2010

real	0m0.418s
user	0m0.044s
sys	0m0.372s

However, when I split off the loop into a parametrized function it doesn't:

$ gcc -floop-interchange -O3 t.c tt.c -o t && ./t 0x7fdd37cca010
loop time = 1772.085ms
$ gcc -O3 t.c tt.c -o t && ./t
0x7f3aa8777010
loop time = 1763.888ms

For comparison , manually swapping i and j: $ gcc -floop-interchange -O3 t.c tt.c -o t && ./t 0x7feecd4c9010
loop time = 451.744ms

For the same reason, it doesn't work for the old R code. I wonder what's happening there - I guess the optimizer is not smart enough to realize the coverage is the entire m*n span despite the fact that m and n are parameters ... But it's certainly something it should optimize as it did when used directly in a function... Odd ...

Cheers,
Simon

--

PS:  Note this has nothing to do with "R builds scripts" - it is user's responsibility to add optimization flags since they are very much compiler and architecture-dependent. Also optimizations are occasionally known to backfire, so the default is always more conservative.

FWIW this is my latest incarnation of configuring optimized R on E5 machines (minus BLAS and prefix flags which will vary by system):

'--enable-lto' '--enable-R-shlib' 'CFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'CXXFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'FFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'FCFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 


-- test code (test compiled with gcc (Ubuntu/Linaro 4.7.1-7ubuntu1) 4.7.1 20120814 (prerelease))

t.c:

void foo(int *x, const unsigned int m, const unsigned int n) {
    int i, j;
    for (i = 0; i < m; i++)
	for(j = 0; j < n; j++)
	    x[i + j * m] = 1;
}

tt.c:

#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>

void foo(int *, int, int);

static double ts() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (double)tv.tv_sec + ((double) tv.tv_usec) / 1000000.0;
}

int main() {
    int m = 10000, n = 10000;
    int *x = (int*) malloc(m * n * sizeof(int));
    double a = ts(), b;
    foo(x, m , n);
    b = ts();
    printf("%p\nloop time = %.3fms\n", x, (b - a) * 1000);
    return 0;
}

t2.c:

#include <stdio.h>
#include <stdlib.h>

int *foo() {
    int m = 10000, n = 10000;
    int *x = (int*) malloc(m * n * sizeof(int));
    int i, j;
    for (i = 0; i < m; i++)
	for(j = 0; j < n; j++)
	    x[i + j * m] = 1;
    return x;
}

int main() {
    printf("%p", foo());
    return 0;
}



> Matthew
>
>
>> >> On Sep 2, 2012, at 10:32 PM, Simon Urbanek wrote: >> >>> On Sep 2, 2012, at 10:04 PM, Matthew Dowle wrote: >>> >>>> >>>> In do_matrix in src/array.c there is a type switch containing : >>>> >>>> case LGLSXP : >>>> for (i = 0; i < nr; i++) >>>> for (j = 0; j < nc; j++) >>>> LOGICAL(ans)[i + j * NR] = NA_LOGICAL; >>>> >>>> That seems page inefficient, iiuc. Think it should be : >>>> >>>> case LGLSXP : >>>> for (j = 0; j < nc; j++) >>>> for (i = 0; i < nr; i++) >>>> LOGICAL(ans)[i + j * NR] = NA_LOGICAL; >>>> >>>> or more simply : >>>> >>>> case LGLSXP : >>>> for (i = 0; i < nc*nr; i++) >>>> LOGICAL(ans)[i] = NA_LOGICAL; >>>> >>>> ( with some fine tuning required since NR is type R_xlen_t whilst i, nc >>>> and nr are type int ). >>>> >>>> Same goes for all the other types in that switch. >>>> >>>> This came up on Stack Overflow here : >>>> http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r >>>> >>> >>> That is completely irrelevant - modern compilers will optimize the loops >>> accordingly and there is no difference in speed. If you don't believe >>> it, run benchmarks ;) >>> >>> original >>>> microbenchmark(matrix(nrow=10000, ncol=9999), times=10) >>> Unit: milliseconds >>> expr min lq median uq >>> max >>> 1 matrix(nrow = 10000, ncol = 9999) 940.5519 940.6644 941.136 954.7196 >>> 1409.901 >>> >>> >>> swapped >>>> microbenchmark(matrix(nrow=10000, ncol=9999), times=10) >>> Unit: milliseconds >>> expr min lq median uq >>> max >>> 1 matrix(nrow = 10000, ncol = 9999) 949.9638 950.6642 952.7497 961.001 >>> 1246.573 >>> >>> Cheers, >>> Simon >>> >>> >>>> Matthew >>>> >>>> ______________________________________________ >>>> R-devel_at_r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>>> >>>> >>> >>> ______________________________________________ >>> R-devel_at_r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >>> >>> >> >>
>
>
>
______________________________________________ R-devel_at_r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Tue 04 Sep 2012 - 21:08:03 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

All messages

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 04 Sep 2012 - 21:10:42 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive