Blocked Algorithms and Recursive Algorithms
Blocked algorithms can be implemented using two different
approaches: loop nests or recursion (i.e., recursive
divide-and-conquer algorithms).
Modern high-performance libraries are mostly loop-based
implementations, for several reasons:
- Code legacy: the first high-performance libraries were
  written in FORTRAN (recall that recursion is not supported
  in FORTRAN 77 and earlier).
- Register allocation and instruction scheduling: some of the
  most effective register allocation and instruction scheduling
  algorithms aim at the optimization of basic loop nests.
- Overhead: a recursive algorithm can often be rewritten using
  loop nests, avoiding the overhead of function calls.
- Loop tiling: loop tiling can be used to implement blocked
  algorithms as effectively as recursion can (both approaches
  are sketched below).
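
As a concrete illustration of the two approaches, the sketch below
contrasts a loop-tiled and a recursive blocked matrix multiplication
in C. The block size B, the function names, and the row-major layout
are illustrative assumptions, not taken from any particular library.

    #include <stddef.h>

    #define B 64  /* assumed tile/leaf size; tuned per machine in practice */

    /* Loop-tiled (blocked) matrix multiply: C += A * Bm for n x n
     * row-major matrices.  The three outer loops walk over tiles; the
     * three inner loops form the basic loop nest that register
     * allocators and instruction schedulers handle well. */
    static void mm_tiled(size_t n, const double *A, const double *Bm,
                         double *C)
    {
        for (size_t ii = 0; ii < n; ii += B)
            for (size_t kk = 0; kk < n; kk += B)
                for (size_t jj = 0; jj < n; jj += B)
                    for (size_t i = ii; i < ii + B && i < n; i++)
                        for (size_t k = kk; k < kk + B && k < n; k++)
                            for (size_t j = jj; j < jj + B && j < n; j++)
                                C[i * n + j] += A[i * n + k] * Bm[k * n + j];
    }

    /* Recursive (divide-and-conquer) version of the same computation:
     * split the largest index range in half until the sub-problem fits
     * the leaf size, then fall back to a plain loop nest. */
    static void mm_rec(size_t n,
                       size_t i0, size_t i1, size_t j0, size_t j1,
                       size_t k0, size_t k1,
                       const double *A, const double *Bm, double *C)
    {
        size_t di = i1 - i0, dj = j1 - j0, dk = k1 - k0;
        if (di <= B && dj <= B && dk <= B) {   /* leaf: simple basic block */
            for (size_t i = i0; i < i1; i++)
                for (size_t k = k0; k < k1; k++)
                    for (size_t j = j0; j < j1; j++)
                        C[i * n + j] += A[i * n + k] * Bm[k * n + j];
            return;
        }
        if (di >= dj && di >= dk) {
            size_t im = i0 + di / 2;
            mm_rec(n, i0, im, j0, j1, k0, k1, A, Bm, C);
            mm_rec(n, im, i1, j0, j1, k0, k1, A, Bm, C);
        } else if (dj >= dk) {
            size_t jm = j0 + dj / 2;
            mm_rec(n, i0, i1, j0, jm, k0, k1, A, Bm, C);
            mm_rec(n, i0, i1, jm, j1, k0, k1, A, Bm, C);
        } else {
            size_t km = k0 + dk / 2;
            mm_rec(n, i0, i1, j0, j1, k0, km, A, Bm, C);
            mm_rec(n, i0, i1, j0, j1, km, k1, A, Bm, C);
        }
    }

Both versions reach the same leaf computation, but the recursive one
adapts its split points to every level of the memory hierarchy instead
of relying on one fixed tile size.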
Blocked algorithms can be naturally translated into applications
using recursive algorithms. This straightforward translation makes
recursive algorithms a powerful tool for developing new applications
or prototypes. With the arrival on the market of modern processors
and compilers, some of the important reasons to opt for a pure
loop-nest-based implementation are weakening.
- New architectures need new compilers and new libraries. The
  life span of architectures and of compilers does not justify
  the investment in hand-tuned high-performance libraries.
- If the environment of choice is based on an interpreter,
  optimal register allocation may be a quixotic attempt.
- Some compilers are increasingly aggressive in allocating
  function parameters to registers, reducing part of the
  function-call overhead.
- Modern processors and architectures have deeper and deeper
  memory hierarchies: to exploit their performance potential,
  tiling must be applied several times, tailored and tuned at
  each level. A recursive implementation will not always be
  optimal, but it will rarely be the worst, and it needs little
  maintenance.
Compiler Optimizations for Divide and Conquer Applications
Recursive implementations of divide-and-conquer algorithms have room
for tremendous improvement, but only if the compiler has a model of
the computation available.
For example, the Fastest Fourier Transform in the West (FFTW) is an
application where multiple algorithms are available and the choice is
postponed until a plan has been determined. The choice is based on
profiling the performance of several versions of an algorithm (on the
same input). The plan is a model of the dynamic behavior of the
computation.
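
A minimal sketch of the plan idea, deliberately simplified and not
FFTW's actual interface: the names variant_fn, make_plan, and execute
are hypothetical, introduced only for illustration. Several candidate
implementations of the same transform are timed on the same input, and
the fastest one is recorded in a plan that is reused for later
executions.

    #include <time.h>

    /* Hypothetical candidate implementations of the same transform;
     * in a planner like FFTW's these would be different algorithms
     * or codelets for one problem size. */
    typedef void (*variant_fn)(double *data, int n);

    struct plan {
        variant_fn best;    /* variant chosen by profiling   */
        double     seconds; /* measured cost of that variant */
    };

    /* Build a "plan": run every variant on the same input and keep
     * the fastest.  The planning cost is paid once; the plan is then
     * reused for many executions on inputs of the same size. */
    static struct plan make_plan(variant_fn *variants, int nvariants,
                                 double *scratch, int n)
    {
        struct plan p = { variants[0], 1e30 };
        for (int v = 0; v < nvariants; v++) {
            clock_t t0 = clock();
            variants[v](scratch, n);
            double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (dt < p.seconds) { p.seconds = dt; p.best = variants[v]; }
        }
        return p;
    }

    static void execute(struct plan *p, double *data, int n)
    {
        p->best(data, n);   /* the profiled choice models the behavior */
    }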
A compiler may optimize the solution for the leaves aggressively
(because leaves are simple basic blocks) when the leaves have enough
code to work on (and some pointer disambiguation is possible).
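
For instance, a leaf might look like the small fixed-size kernel
below (a hypothetical 4x4 block update, not code from any specific
library); the C99 restrict qualifiers provide exactly the kind of
pointer disambiguation that lets the compiler unroll, vectorize, and
schedule the basic block aggressively.

    /* Illustrative leaf kernel: a fixed-size 4x4 block update.  The
     * restrict qualifiers assert that A, B, and C do not alias. */
    static void leaf4x4(double *restrict C,
                        const double *restrict A,
                        const double *restrict B,
                        int ldc, int lda, int ldb)
    {
        for (int i = 0; i < 4; i++)
            for (int k = 0; k < 4; k++)
                for (int j = 0; j < 4; j++)
                    C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
    }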
More interestingly, a compiler can determine a model of the
dynamic behavior and generate run-time support for the execution of
the application:
- If a problem of size n is divided into 8 sub-problems, of which
  one has size n/2 and the other 7 have size n/8, and each
  sub-problem can be executed independently, then the compiler may
  suggest allocating the largest sub-problem to the fastest
  processor in a heterogeneous network, while the others are
  allocated to less powerful processors.
- If a problem is too large to obtain useful information through
  profiling, it may be useful to have a static tool that can
  estimate how many times a particular leaf is executed. A hybrid
  solution, where partial execution is allowed, can offer extremely
  valuable information as well as fast executions (see the sketch
  after this list).
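
As a sketch of the last point, the recurrence below follows the
decomposition of the earlier example (one sub-problem of size n/2 and
seven of size n/8) without performing any of the numerical work, and
counts how many leaves would be executed; the leaf threshold LEAF and
the function names are assumptions made only for illustration. Running
just the divide phase in this way is one form of the partial execution
mentioned above.

    #include <stdio.h>

    #define LEAF 64  /* assumed size below which the work is done directly */

    /* Walk the decomposition tree of the example split and count the
     * leaves, without executing any leaf work. */
    static long count_leaves(long n)
    {
        if (n <= LEAF)
            return 1;
        return count_leaves(n / 2) + 7 * count_leaves(n / 8);
    }

    int main(void)
    {
        for (long n = 1 << 10; n <= 1 << 20; n <<= 5)
            printf("n = %8ld  leaves = %ld\n", n, count_leaves(n));
        return 0;
    }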
This project aims at modeling the decomposition process and at
minimizing the decomposition work.