Compiler Optimizations
for
Divide and Conquer Applications


  1. Blocked Algorithms and Recursive Algorithms

    Blocked algorithms can be implemented using basically two different approaches: using loop nests or using recursion (i.e., recursive divide and conquer algorithms).

    Modern high-performance libraries are loop-based implementations mostly because of:

    1. Code Legacy: The first high-performance libraries were written in FORTRAN (e.g., recall that recursion is not allowed in FORTRAN).
      • Register allocations and scheduling algorithms: some of the most effective register allocations and instruction scheduling algorithms aim at the optimizations of basic loop nests.
    2. Overhead: a recursive algorithm can be often re-written using loop nests without the overhead due to function calls.
    3. Loop Tiling: loop tiling can be used to implement blocked algorithms as effectively as recursion can.

    Blocked algorithms can be naturally translated into applications using recursive algorithms. This straightforward translation makes recursive algorithms a powerful tools in developing new applications or prototypes. With the presentation in the market of modern processors and compilers, some of the important reasons to opt for a pure loop-nest-based implementation are weakening.

    1. New architectures need new compilers and new libraries. The life-spam of architectures and of compilers, does not justify the investment for hand-tuned high-performance libraries.
    2. In case the environment of choice is based on an interpreter, an optimal register allocation may be a quixotic attempt.
    3. Some compiler are more and more aggressive in the allocation of function parameters into registers, reducing one of the function call overhead.
    4. Modern processors and architectures are based on deeper and deeper memory hierarchy: to exploit the performance potential, tiling must be used several times, tailored and tuned. A recursive implementation will not be always optimal but it will not be always the worst and it will need little maintenance.


  2. Compiler Optimizations for Divide and Conquer Applications

    Recursive implementations of divide and conquer algorithms have space for tremendous improvements, only if the compiler has available a model of the computation.

    For example, the Fast Fourier Transform in the West (FFTW) is an application where multiple algorithms are available and the choice is postponed as long as a plan has been determined. The choice is based on the profiling of the performance for several versions of an algorithm (but for the same input). The plan is a model of the dynamic behavior of the computation.

    A compiler may optimize the solution for the leaves aggressively (because leaves are simple basic blocks) when the leaves have enough code to work on (and some pointer disambiguation are possible).

    More interestingly, a compiler can determine a model of the dynamic behavior and generate run-time support for the execution of the application:

    This project aims at the modeling of the decomposition process and it aims at the minimization of decomposition work.