Thursday 22 June 2017

Vectorization and broadcasting

At the moment I'm actually ahead of my schedule and this week I started to work on support for vectorization on N-dimensional arrays. The by far biggest challenge was to implement proper broadcasting and most of this post will be devoted to going through that. At the end I also mention some of the other things I have done during the week.

Broadcasting arrays

At the moment I have implement support for broadcasting on all binary functions. Since all binary functions behave similarly in respect to broadcasting I will use $+$ in all my example below, but this could in principle be any binary function working on intervals.

When adding to arrays, $A, B$, of the same size the result is just an arrays of the same size with each entry containing the sum of the corresponding entries in $A$ and $B$. If $A$ and $B$ does not have the same size then we try to perform broadcasting. The simplest form of broadcasting is when $A$ is an arrays and $B$ is a scalar. Then we just take the value of $B$ and add to every element in $A$. For example

> A = infsupdec ([1, 2; 3, 4])
A = 2×2 interval matrix
   [1]_com   [2]_com
   [3]_com   [4]_com
> B = infsupdec (5)
B = [5]_com
> A + B
ans = 2×2 interval matrix
   [6]_com   [7]_com
   [8]_com   [9]_com

However it is not only when one of the inputs is a scalar that broadcasting can be performed. Broadcasting is performed separately for each dimension of the input. We require either that the dimensions are equal, and no broadcasting is performed, or that one of the inputs have that dimension equal to $1$, we then concatenate this input along that dimension until they are of equal size. If for example $A$ has dimension $4\times4\times4$ and $B$ dimension $4\times4\times1$ we concatenate $B$ with itself along the third dimension four times to get two arrays of the same size. Since a scalar has all dimensions equal to 1 we see that it can be broadcasted to any size. Both $A$ and $B$ can also be broadcasted at the same time, along different dimensions, for example

> A = infsupdec (ones (1, 5, 2))
A = 1×5×2 interval array
ans(:,:,1) =
   [1]_com   [1]_com   [1]_com   [1]_com   [1]_com
ans(:,:,2) =
   [1]_com   [1]_com   [1]_com   [1]_com   [1]_com
> B = infsupdec ([1, 2, 3, 4, 5; 6, 7, 8, 9, 10])
B = 2×5 interval matrix
   [1]_com   [2]_com   [3]_com   [4]_com    [5]_com
   [6]_com   [7]_com   [8]_com   [9]_com   [10]_com
> A + B
ans = 2×5×2 interval array
ans(:,:,1) =
   [2]_com   [3]_com   [4]_com    [5]_com    [6]_com
   [7]_com   [8]_com   [9]_com   [10]_com   [11]_com
ans(:,:,2) =
   [2]_com   [3]_com   [4]_com    [5]_com    [6]_com
   [7]_com   [8]_com   [9]_com   [10]_com   [11]_com

The implementation

I'll go through a little bit about my implementation. I warn you that I'm not that familiar with the internals of Octave so some things I say might be wrong, or at least not totally correct.

Internally all, numerical, arrays are stored as a linear vector and the dimensions are only metadata. This means that the most efficient way to walk through an array is with a linearly increasing index. When $A$ and $B$ have the same size the most efficient way to sum them is to linearly go through the arrays. In pseudo code

// Calculate C = A + B
for (int i = 0; i < numel (A); i++) {
  C(i) = A(i) + B(i);
}

This works fine, and apart from unrolling the loop or doing optimizations like that it is probably the most efficient way to do it.

If $A$ and $B$ are not of the same size then one way to do it would be to simply extend $A$ or/and $B$ along the needed dimensions. This would however require coping a lot of data, something we want to avoid (memory access is expensive). Instead we try to be smart with our indexing to access the right data from both $A$ and $B$.

After asking on the IRC-channel I got pointed to this Octave function which performs broadcasting. My implementation, which can be found here, is heavily inspired by that function.

Performance

Here I compare the performance of the new implementation with the old one. Since the old one could only handle matrices we are limited by that. We can measure the time it takes to add two matrices $A$, $B$ with the code

tic; A + B; toc;

We do 10 runs for each test and all times are in seconds.

Addition of large matrices

Case 1: A = B = infsupdec (ones (1000, 1000));
       Old         New
       0.324722    0.277179
       0.320914    0.276116
       0.322018    0.276075
       0.318713    0.279258
       0.332041    0.279593
       0.318429    0.279987
       0.323752    0.279089
       0.317823    0.276036
       0.320509    0.280964
       0.320610    0.281123
Mean:  0.32195     0.27854
Case 2: A = B = infsupdec (ones (10, 100000));
        Old         New
        0.299321    0.272691
        0.297020    0.282591
        0.296460    0.274298
        0.294541    0.279661
        0.298306    0.277274
        0.301532    0.275531
        0.298163    0.278576
        0.298954    0.279868
        0.302849    0.275991
        0.297765    0.278806
Mean:   0.29849    0.27753

Case 3: A = B = infsupdec (ones (100000, 10));
        Old         New
        0.286433    0.279107
        0.289503    0.278251
        0.297562    0.279579
        0.292759    0.283311
        0.292983    0.281306
        0.290947    0.282310
        0.293025    0.286172
        0.294153    0.278886
        0.293457    0.278625
        0.296661    0.280804
Mean:   0.29275     0.28084

Broadcasting scalars

Case 4: A = infsupdec (ones (1000, 1000));
             B = infsupdec (1);
        Old         New
        0.298695    0.292419
        0.298158    0.292274
        0.305242    0.296036
        0.295867    0.291311
        0.296971    0.297255
        0.304297    0.292871
        0.298172    0.300329
        0.297251    0.291668
        0.299236    0.294128
        0.300457    0.298005
Mean;   0.29943     0.29463

Case 5: A = infsupdec (1);
             B = infsupdec (ones (1000, 1000));
         Old         New
        0.317276    0.291100
        0.316858    0.296519
        0.316617    0.292958
        0.316159    0.299662
        0.317939    0.301558
        0.322162    0.295338
        0.321277    0.293561
        0.314640    0.291500
        0.317211    0.295487
        0.317177    0.294376
Mean:   0.31773     0.29521

Broadcasting vectors

Case 6: A = infsupdec (ones (1000, 1000));
             B = infsupdec (ones (1000, 1));
        Old         New
        0.299546    0.284229
        0.301177    0.284458
        0.300725    0.276269
        0.299368    0.276957
        0.303953    0.278034
        0.300894    0.275058
        0.301776    0.276692
        0.302462    0.282946
        0.304010    0.275573
        0.301196    0.273109
Mean:   0.30151     0.27833

Case 7: A = infsupdec (ones (1000, 1000));
             B = infsupdec (ones (1, 1000));
         Old         New
        0.300554    0.295892
        0.301361    0.294287
        0.302575    0.299116
        0.304808    0.294184
        0.306700    0.291606
        0.301233    0.298059
        0.301591    0.292777
        0.302998    0.290288
        0.300452    0.291975
        0.305531    0.290178
Mean:   0.30278     0.29384

We see that in all cases the new version is faster or at least equally fast as the old version. In the old version the order of the input made a slight difference in performance (case 4 vs case 5). In the new version both inputs are treated in exactly the same way so we no longer see that difference.

Possible improvements

In theory the cases when we broadcast a scalar could be the fastest ones. If $B$ is a scalar we could, in pseudo code, do something similar to
// Calculate C = A + B with B scalar
for (int i = 0; i < numel (A); i++) {
  C(i) = A(i) + B;
}

This is however not implemented at the moment. Instead we use the ordinary routine to calculate the index for $B$ (since it is a scalar it will always evaluate to $1$). If we would like to optimize more for this case we could add a check for if $A$ or $B$ are scalars and then optimize for that. Of course this would also make the code more complicated, something to watch out for. At the moment I leave it like this but if we later want to optimize for that case it could be done.

Other work

Apart from the work to fix the broadcasting for binary functions there were very little to do for many of the functions. All binary functions that use this code, and all unary functions using an even simpler code, worked directly after fixing the oct-files. Some of them required small changes to the documentation but other than that the octave-scripts were fine. So mainly it has been a matter of actually going through all files and check that they actually did work.

Bug #51283

When going through all the functions I noticed a bug in the interval version of $\sin$,

 > sin (infsupdec (0))
ans = [0]_com
> sin (infsupdec ([0, 0]))
ans = 1×2 interval vector
   [0, -0]_com   [0, -0]_com

The second version here is wrong, $-0$ should never be allowed as a value for the supremum of an interval. I was able to track this down to how Octaves $\max$ function works, see bug #51283. As Oliver writes there the exact behaviour of the $\max$-function is not specified in IEEE Std 754-2008 so we cannot rely on that. To solve this I have added a line to manually set all $-0$ to $+0$ in the supremum of the interval.

 

No comments:

Post a Comment