The goal of this project is to improve memory hierarchy performance by varying the block size dynamically within a given range of allowed sizes. This makes better use of the memory at a given level of the hierarchy, reduces conflicts, and gives better control over memory traffic. The hierarchy behavior can be optimized for either miss rate or traffic.
This work was started in the Fall of 1998 and is supported in part by DARPA via the AMRM project.
The main idea is to vary the block size dynamically and adapt it to application behavior as the program executes. The project is investigating several approaches to accomplish this, which are briefly described below. The project has initially concentrated on varying the cache block size.
This work investigates the effect of block-size adaptivity in the L1 and L2 caches.
The block size can be changed by hardware in one of two ways:
1. on block replacement, for each block individually, or
2. for all blocks during a specified time interval.
In either case, past behavior is used to decide on the next block size. The use
of individual words within a block, as well as the presence of an "adjacent" block of
the same size, indicates whether to grow or shrink the block by a factor of 2.
Block size is allowed to vary from 8B to 256B for L1 caches and from 64B to 512B for
L2 caches. Significant performance improvement or traffic reduction can be achieved,
exceeding the performance obtained with the "optimal" fixed block size.
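
The grow/shrink decision described above can be sketched as follows. This is only
an illustration under stated assumptions, not the project's hardware algorithm: the
function names, the 75% and 50% usage thresholds, and the majority vote used in the
per-interval variant are hypothetical. Only the factor of 2 and the L1 and L2 size
ranges come from the text.

```python
# Size ranges in bytes, as stated in the text.
L1_MIN, L1_MAX = 8, 256
L2_MIN, L2_MAX = 64, 512


def next_block_size(size, words_used, words_total, adjacent_present,
                    min_size=L1_MIN, max_size=L1_MAX,
                    grow_fraction=0.75, shrink_fraction=0.5):
    """Pick the next size for one block at replacement time.

    The thresholds are hypothetical. The block grows when most of its words
    were used and an adjacent block of the same size is present, and shrinks
    when few words were used. The result stays within [min_size, max_size].
    """
    used = words_used / words_total
    if used >= grow_fraction and adjacent_present:
        return min(size * 2, max_size)
    if used < shrink_fraction:
        return max(size // 2, min_size)
    return size


def interval_block_size(size, samples, **limits):
    """Pick one size for all blocks from the samples of a time interval.

    Each sample is (words_used, words_total, adjacent_present) for a block
    replaced during the interval. Aggregating by total word usage and a
    majority vote on adjacency is an assumption made for this sketch.
    """
    if not samples:
        return size
    used = sum(s[0] for s in samples)
    total = sum(s[1] for s in samples)
    adjacent_votes = sum(1 for s in samples if s[2])
    majority_adjacent = adjacent_votes * 2 > len(samples)
    return next_block_size(size, used, total, majority_adjacent, **limits)
```

For example, a fully used 32B L1 block with an adjacent block present would grow to
64B, while a 32B block with only 2 of 8 words used would shrink to 16B.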
In the software-controlled approach, the decision to change the block size is made by
software; the hardware simply supports multiple sizes and provides an interface that
lets the software change the size. We have investigated this approach, using profiling
in a variety of ways to select the appropriate size. We are currently investigating
the use of compile-time analysis instead of profiling to make the block size selection.
The compiler can currently generate code, using profiling or user-supplied
information, to run a program on our prototype hardware (see below).
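
Profile-based selection of a block size can be illustrated with a small simulation.
This is a minimal sketch, not the project's profiling tool: it assumes the profile is
a list of byte addresses, models a direct-mapped cache of a hypothetical 1024B
capacity, and counts traffic as the number of misses times the block size. The
function names and parameters are made up for the example.

```python
def simulate(trace, block_size, cache_bytes=1024):
    """Run an address trace through a direct-mapped cache model.

    Returns (misses, traffic_bytes), where traffic is the number of bytes
    fetched on misses. The cache organization is an assumption of the sketch.
    """
    num_sets = cache_bytes // block_size
    tags = [None] * num_sets
    misses = 0
    for addr in trace:
        block = addr // block_size
        index = block % num_sets
        if tags[index] != block:
            tags[index] = block
            misses += 1
    return misses, misses * block_size


def select_block_size(trace, sizes=(8, 16, 32, 64, 128, 256),
                      metric="misses", cache_bytes=1024):
    """Return the candidate size with the lowest cost on the profiled trace.

    metric is "misses" or "traffic"; ties go to the smallest size.
    """
    if metric not in ("misses", "traffic"):
        raise ValueError("metric must be 'misses' or 'traffic'")
    best_cost, best_size = None, None
    for size in sizes:
        misses, traffic = simulate(trace, size, cache_bytes)
        cost = misses if metric == "misses" else traffic
        if best_cost is None or cost < best_cost:
            best_cost, best_size = cost, size
    return best_size


# A sequential scan favors large blocks when optimizing for miss rate,
# while a widely strided access pattern favors small blocks for traffic.
sequential = list(range(0, 1024, 4))
strided = [i * 256 for i in range(4)] * 2
```

With these traces, select_block_size(sequential) picks the largest candidate, and
select_block_size(strided, metric="traffic") picks the smallest. In the setting
described above, the chosen size would then be passed to the hardware through its
software interface.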


We have designed a board (above, bare and under test) to implement some of the above ideas.
The board consists of an L1 cache, memory, and a PCI interface.
It can be used with any system with a PCI bus.
The hierarchy control is implemented in an FPGA and can be changed to
support other adaptation algorithms.
The L1 cache has a software-controllable block size (as well as
other parameters, such as write policy and size).
A program can be compiled to run out of this memory hierarchy instead
of the host memory. Current software allows this to be done under
Windows NT.