Optimization: Restrict Keyword

The restrict keyword is specific to C and C++ to my knowledge. Its most common use is in using pointers to pass arrays to functions.

I thought I had a good example of its use, but the optimizer for gcc has gotten smarter. I'll show my initial journey of failing to find an example where the restrict keyword yields faster code. Later, I plan to find an example that really shows a difference in speed.

Example: sum of two vectors

Here's an example function in C that saves the sum of two vectors.

   void sum(size_t n, double *a, double *b, double *c)
      for (size_t i = 0; i < n; i++)
         c[i] = a[i] + b[i];

The code says to do the following sequence in order.

   c[0] = a[0] + b[0]
   c[1] = a[1] + b[1]
   c[2] = a[2] + b[2]

Because the compiler only sees pointers to the arrays, it can't safely parallelize this code and use vector instructions. Why? Imagine that c points to memory location a[1]. Then the first line above will change c[0] which will also change a[1]. Hence, computing the first line affects the computation of the second line. Then, computing the second line, affects the third, and so on.

This is called an aliasing problem. To tell the compiler that a[i], b[j], and c[k] cannot equal each other, regardless of the indices, i, j, and k, you use the keyword restrict. Below is the same function in C using the restrict keyword. Note: C++ does not have restrict in the standard, but every compiler has some version of it like __restrict__.

   void rsum(size_t n, double * restrict a, double * restrict b, double * restrict c)
      for (size_t i = 0; i < n; i++)
         c[i] = a[i] + b[i];

The restrict keyword allows the compiler to vectorize the for loop and sum 2 or more values of the array in parallel. HOWEVER, the optimizer in gcc was more clever than I realized. Both sum and rsum are vectorized with the compiler flags -O3 -march=native using gcc 9.2.0.

The function rsum is vectorized easily enough. To vectorize the function sum, gcc inserts a vectorized and non-vectorized version of the code and inserts the equivalent of an if statement to see if the pointers point to memory locations far enough apart to allow vectorization.

I confirmed this in two ways. First, I read the assembly output of both functions using the -S flag. Second, I used the flag -fopt-info-vec-all to get vectorization information. Here is the output of that second flag from sum and rsum.

   sum.c:4:4: optimized: loop vectorized using 32 byte vectors
   sum.c:4:4: optimized:  loop versioned for vectorization because of possible aliasing
   sum.c:2:6: note: vectorized 1 loops in function.

   rsum.c:4:4: optimized: loop vectorized using 32 byte vectors
   rsum.c:2:6: note: vectorized 1 loops in function.

Notice how sum.c has an extra caveat about its vectorization due to "possible aliasing".

I ran both functions 1000 times, with 10 million elements in the arrays. Both sum and rsum took about 15 seconds to run on my laptop with no clear winner in timings. The source code can be found here on github.

To be continued ...