Blocked Algorithms and Recursive Algorithms
Blocked algorithms can be implemented using two different
approaches: loop nests or recursion (i.e., recursive
divide-and-conquer algorithms).
Modern high-performance libraries are mostly loop-based
implementations, for several reasons:
- Code legacy: the first high-performance libraries were
  written in FORTRAN (recall that recursion is not supported
  in FORTRAN 77 and earlier).
- Register allocation and instruction scheduling: some of the
  most effective register allocation and instruction scheduling
  algorithms aim at the optimization of basic loop nests.
- Overhead: a recursive algorithm can often be rewritten using
  loop nests, avoiding the overhead of function calls.
- Loop tiling: loop tiling can be used to implement blocked
  algorithms as effectively as recursion can (both approaches
  are sketched below).
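
As a concrete illustration of the two approaches, the sketch below
contrasts a loop-tiled and a recursive blocked matrix multiplication
in C. The block size B, the function names, and the row-major layout
are illustrative assumptions, not taken from any particular library.

    #include <stddef.h>

    #define B 64  /* assumed tile/leaf size; tuned per machine in practice */

    /* Loop-tiled (blocked) matrix multiply: C += A * Bm for n x n
     * row-major matrices.  The three outer loops walk over tiles; the
     * three inner loops form the basic loop nest that register
     * allocators and instruction schedulers handle well. */
    static void mm_tiled(size_t n, const double *A, const double *Bm,
                         double *C)
    {
        for (size_t ii = 0; ii < n; ii += B)
            for (size_t kk = 0; kk < n; kk += B)
                for (size_t jj = 0; jj < n; jj += B)
                    for (size_t i = ii; i < ii + B && i < n; i++)
                        for (size_t k = kk; k < kk + B && k < n; k++)
                            for (size_t j = jj; j < jj + B && j < n; j++)
                                C[i * n + j] += A[i * n + k] * Bm[k * n + j];
    }

    /* Recursive (divide-and-conquer) version of the same computation:
     * split the largest index range in half until the sub-problem fits
     * the leaf size, then fall back to a plain loop nest. */
    static void mm_rec(size_t n,
                       size_t i0, size_t i1, size_t j0, size_t j1,
                       size_t k0, size_t k1,
                       const double *A, const double *Bm, double *C)
    {
        size_t di = i1 - i0, dj = j1 - j0, dk = k1 - k0;
        if (di <= B && dj <= B && dk <= B) {   /* leaf: simple basic block */
            for (size_t i = i0; i < i1; i++)
                for (size_t k = k0; k < k1; k++)
                    for (size_t j = j0; j < j1; j++)
                        C[i * n + j] += A[i * n + k] * Bm[k * n + j];
            return;
        }
        if (di >= dj && di >= dk) {
            size_t im = i0 + di / 2;
            mm_rec(n, i0, im, j0, j1, k0, k1, A, Bm, C);
            mm_rec(n, im, i1, j0, j1, k0, k1, A, Bm, C);
        } else if (dj >= dk) {
            size_t jm = j0 + dj / 2;
            mm_rec(n, i0, i1, j0, jm, k0, k1, A, Bm, C);
            mm_rec(n, i0, i1, jm, j1, k0, k1, A, Bm, C);
        } else {
            size_t km = k0 + dk / 2;
            mm_rec(n, i0, i1, j0, j1, k0, km, A, Bm, C);
            mm_rec(n, i0, i1, j0, j1, km, k1, A, Bm, C);
        }
    }

Both versions reach the same leaf computation, but the recursive one
adapts its split points to every level of the memory hierarchy instead
of relying on one fixed tile size.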
Blocked algorithms can be naturally translated into applications
using recursive algorithms. This straightforward translation makes
recursive algorithms a powerful tool for developing new applications
or prototypes. With the arrival on the market of modern processors
and compilers, some of the important reasons to opt for a pure
loop-nest-based implementation are weakening.
- New architectures need new compilers and new libraries. The
  life span of architectures and of compilers does not justify
  the investment in hand-tuned high-performance libraries.
- If the environment of choice is based on an interpreter,
  optimal register allocation may be a quixotic attempt.
- Some compilers are increasingly aggressive in allocating
  function parameters to registers, reducing part of the
  function-call overhead.
- Modern processors and architectures have deeper and deeper
  memory hierarchies: to exploit their performance potential,
  tiling must be applied several times, tailored and tuned at
  each level. A recursive implementation will not always be
  optimal, but it will rarely be the worst, and it needs little
  maintenance.
Compiler Optimizations for Divide and Conquer Applications
Recursive implementations of divide-and-conquer algorithms have room
for tremendous improvement, but only if the compiler has a model of
the computation available.
For example, the Fastest Fourier Transform in the West (FFTW) is an
application where multiple algorithms are available and the choice is
postponed until a plan has been determined. The choice is based on
profiling the performance of several versions of an algorithm (on the
same input). The plan is a model of the dynamic behavior of the
computation.
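
A minimal sketch of the plan idea, deliberately simplified and not
FFTW's actual interface: the names variant_fn, make_plan, and execute
are hypothetical, introduced only for illustration. Several candidate
implementations of the same transform are timed on the same input, and
the fastest one is recorded in a plan that is reused for later
executions.

    #include <time.h>

    /* Hypothetical candidate implementations of the same transform;
     * in a planner like FFTW's these would be different algorithms
     * or codelets for one problem size. */
    typedef void (*variant_fn)(double *data, int n);

    struct plan {
        variant_fn best;    /* variant chosen by profiling   */
        double     seconds; /* measured cost of that variant */
    };

    /* Build a "plan": run every variant on the same input and keep
     * the fastest.  The planning cost is paid once; the plan is then
     * reused for many executions on inputs of the same size. */
    static struct plan make_plan(variant_fn *variants, int nvariants,
                                 double *scratch, int n)
    {
        struct plan p = { variants[0], 1e30 };
        for (int v = 0; v < nvariants; v++) {
            clock_t t0 = clock();
            variants[v](scratch, n);
            double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (dt < p.seconds) { p.seconds = dt; p.best = variants[v]; }
        }
        return p;
    }

    static void execute(struct plan *p, double *data, int n)
    {
        p->best(data, n);   /* the profiled choice models the behavior */
    }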
A compiler may optimize the solution for the leaves aggressively
(because leaves are simple basic blocks) when the leaves have enough
code to work on (and some pointer disambiguation is possible).
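
For instance, a leaf might look like the small fixed-size kernel
below (a hypothetical 4x4 block update, not code from any specific
library); the C99 restrict qualifiers provide exactly the kind of
pointer disambiguation that lets the compiler unroll, vectorize, and
schedule the basic block aggressively.

    /* Illustrative leaf kernel: a fixed-size 4x4 block update.  The
     * restrict qualifiers assert that A, B, and C do not alias. */
    static void leaf4x4(double *restrict C,
                        const double *restrict A,
                        const double *restrict B,
                        int ldc, int lda, int ldb)
    {
        for (int i = 0; i < 4; i++)
            for (int k = 0; k < 4; k++)
                for (int j = 0; j < 4; j++)
                    C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
    }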
More interestingly, a compiler can determine a model of the
dynamic behavior and generate run-time support for the execution of
the application:
- If a problem of size n is divided into 8 sub-problems, of which
  one has size n/2 and the other 7 have size n/8, and each
  sub-problem can be executed independently, then the compiler may
  suggest allocating the largest sub-problem to the fastest
  processor in a heterogeneous network, while the others are
  allocated to less powerful processors.
- If a problem is too large to obtain useful information through
  profiling, it may be useful to have a static tool that can
  estimate how many times a particular leaf is executed. A hybrid
  solution, where partial execution is allowed, can offer extremely
  valuable information as well as fast executions (see the sketch
  after this list).
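
As a sketch of the last point, the recurrence below follows the
decomposition of the earlier example (one sub-problem of size n/2 and
seven of size n/8) without performing any of the numerical work, and
counts how many leaves would be executed; the leaf threshold LEAF and
the function names are assumptions made only for illustration. Running
just the divide phase in this way is one form of the partial execution
mentioned above.

    #include <stdio.h>

    #define LEAF 64  /* assumed size below which the work is done directly */

    /* Walk the decomposition tree of the example split and count the
     * leaves, without executing any leaf work. */
    static long count_leaves(long n)
    {
        if (n <= LEAF)
            return 1;
        return count_leaves(n / 2) + 7 * count_leaves(n / 8);
    }

    int main(void)
    {
        for (long n = 1 << 10; n <= 1 << 20; n <<= 5)
            printf("n = %8ld  leaves = %ld\n", n, count_leaves(n));
        return 0;
    }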
This project aims at modeling the decomposition process and at
minimizing the decomposition work.