Projects

Improving single core performance via compiler-assisted out-of-order commit

The growth in uniprocessor (single-core) performance resulting from improvements in semiconductor technology has slowed significantly. Sequential applications, and the sequential portions of parallel applications, require further advances to improve their performance. Today's out-of-order (OOO) processors complete instructions in program order, which is a major performance bottleneck: any long-latency instruction, such as a memory access, delays the completion of all subsequent instructions. This project aims to achieve higher single-core performance by defining a new, compiler-assisted mechanism for out-of-order instruction completion. It investigates how compile-time program knowledge can be passed to the hardware and used to simplify the architectural checks required for such out-of-order completion. The architecture of a standard processor is fully preserved, and legacy software can execute without modification.
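The benefit of relaxing in-order completion can be illustrated with a toy timing model (all names and latencies below are illustrative, not taken from the project). Each instruction has a completion time; under in-order commit a long-latency load delays everything behind it, while instructions the compiler has marked as safe may retire as soon as they complete:

```python
def commit_times(instrs, ooo_commit=False):
    """instrs: list of (completion_time, compiler_marked_safe)."""
    times = []
    last_inorder_commit = 0
    for done, safe in instrs:
        if ooo_commit and safe:
            commit = done                    # retire independently of predecessors
        else:
            commit = max(done, last_inorder_commit)
            last_inorder_commit = commit     # later instructions wait on this
        times.append(commit)
    return times

# A 200-cycle cache-miss load followed by short, independent ALU ops.
program = [(200, False), (2, True), (3, True), (4, True)]

print(commit_times(program, ooo_commit=False))  # [200, 200, 200, 200]
print(commit_times(program, ooo_commit=True))   # [200, 2, 3, 4]
```

In the in-order case every instruction is held back to cycle 200 by the load; with compiler-marked early commit, the independent instructions retire in cycles 2-4 and release their resources immediately.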

Supported by the National Science Foundation

Cache-Aware Synchronization and Scheduling of Data-Parallel Programs for Multi-Core Processors

Multi-core (parallel) processors have become ubiquitous. Such systems are key to science, engineering, finance, and other major areas of the economy. However, increased application performance on these systems can only be achieved with advances in mapping applications to multi-core machines. This task is made more difficult by the presence of complex memory organizations, which are perhaps the key bottleneck to efficient execution and which have not been addressed effectively. This research makes the mapping of the program to the machine aware of the complexities of the memory hierarchy in all phases of the compilation process. This ensures a good fit between the application code and the actual machine and thereby enables much more effective utilization of the hardware (and thus faster execution) than was previously possible.
Multi-cores can benefit from a new cache-hierarchy-aware compilation and runtime system (encompassing compilation, scheduling, and static/dynamic processor mapping of parallel programs). These tasks have one thing in common: they all need accurate estimates of per-iteration (or per-task) computation and memory access times, which are beyond the current (cache-oblivious) state of the art. This research therefore develops new techniques for iteration space partitioning, scheduling, and synchronization that capture the variability due to cache, memory, and conditional-statement behavior and their interaction.
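The core idea of cost-aware iteration space partitioning can be sketched as follows. Instead of giving every core an equal number of iterations, the iteration space is split so that each core receives roughly equal *estimated* work; the per-iteration cost array below stands in for the kind of cache-behavior estimates the research develops, and is purely illustrative:

```python
def partition(costs, n_cores):
    """Split iterations [0, len(costs)) into n_cores contiguous chunks
    with approximately equal total estimated cost."""
    total = sum(costs)
    target = total / n_cores
    chunks, start, acc = [], 0, 0.0
    for i, c in enumerate(costs):
        acc += c
        if acc >= target and len(chunks) < n_cores - 1:
            chunks.append((start, i + 1))   # close this core's chunk
            start, acc = i + 1, 0.0
    chunks.append((start, len(costs)))      # remainder to the last core
    return chunks

# First half of the iterations hit in cache (cheap), second half miss.
costs = [1] * 8 + [5] * 8
print(partition(costs, 2))  # [(0, 12), (12, 16)]
```

A naive equal-count split would give one core cost 8 and the other cost 40; the cost-aware split (28 vs. 20) is far closer to balanced, which is exactly the kind of improvement an accurate cache model makes possible.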

Supported by the National Science Foundation

Acceleration of neural simulations

We are collaborating with scientists who study and model how the human brain performs certain tasks, e.g., vision. Computer simulation of such models is extremely compute-intensive. We are investigating parallel, application-specific, and custom architectures to accelerate these computations. Preliminary experience with FPGA-based, Cell, GPU, and parallel architectures is very encouraging.

Reducing Power Consumption in Processors and Systems

Power dissipation is a major issue in designing new processors and systems. In particular, CMOS technology scaling has significantly increased leakage power, which now accounts for an increasingly large share of total processor power dissipation. One of the main issues is how to achieve power savings without loss of performance.
Much of our work in this area has focused on cache power dissipation. We addressed both dynamic and static power consumption in L1 instruction and data caches. This included way caching to save static and dynamic power in high-associativity caches (as an alternative to way prediction), a cached load-store queue as a low-cost alternative to an L0 cache, and the use of branch prediction information to save power in instruction caches. We also addressed L2 power consumption, in particular leakage power in L2 peripheral circuits. The results of this research are applicable to both embedded and high-performance processors.
Another aspect of this research is low-power instruction queue design for out-of-order processors. CAM-based instruction queues are not scalable and consume a significant amount of power due to wide issue and a CAM search on every cycle. One approach we proposed used a banked queue, dividing the CAM into smaller banks with faster search; a pointer table indicates which bank an instruction belongs to. A more complex approach disposed of the CAM-based queue altogether and used instruction dependence pointers with a RAM-based queue for "direct" wakeup. It also solved the problem of achieving fast branch misprediction recovery while using dependence pointers.
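The contrast between CAM search and pointer-based wakeup can be modeled in a few lines. In this illustrative sketch (the structure and names are ours, not the project's), each queue entry stores a pointer to the dependent instruction it should wake, so completing an instruction performs one RAM read instead of a tag comparison across every queue entry:

```python
class PointerQueue:
    """Toy RAM-based instruction queue with 'direct' dependence-pointer wakeup."""

    def __init__(self, size):
        self.wakeup_ptr = [None] * size   # RAM of dependence pointers
        self.ready = [False] * size       # per-entry ready bits

    def link(self, producer, consumer):
        """Record that `consumer` waits on `producer`'s result."""
        self.wakeup_ptr[producer] = consumer

    def complete(self, entry):
        """On completion: one RAM read, no CAM search over all entries."""
        dep = self.wakeup_ptr[entry]
        if dep is not None:
            self.ready[dep] = True

q = PointerQueue(8)
q.link(0, 3)       # instruction 3 waits on instruction 0's result
q.complete(0)
print(q.ready[3])  # True
```

A real design must also handle instructions with multiple consumers and recover the pointer state on a branch misprediction; this sketch shows only the basic wakeup path.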
We have also investigated power consumption in the register file. A content-aware register file uses knowledge of instruction operand and effective-address width to reduce the number of bits read from the register file and to speed up TLB access using an "L0 TLB". This type of register file was also shown to enable a new type of clustered processor with improved performance and reduced power.
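The underlying observation is that many operands need far fewer bits than the full register width. The sketch below illustrates the width-detection idea; the 16-bit "narrow" threshold and 64-bit register width are assumptions for illustration, not parameters from the project:

```python
def significant_width(value):
    """Bits needed to hold a two's-complement value (rest is sign extension)."""
    if value < 0:
        value = ~value                     # symmetric handling of negatives
    return max(1, value.bit_length() + 1)  # +1 for the sign bit

def bits_read(value, narrow=16, reg_bits=64):
    """Narrow operands activate only the low-order bitlines."""
    return narrow if significant_width(value) <= narrow else reg_bits

print(bits_read(42))       # 16 -- narrow operand, partial read
print(bits_read(1 << 40))  # 64 -- full-width read
```

Since small constants, loop counters, and many addresses are narrow in practice, most register reads in such a design would touch only a fraction of the bitlines.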
Leakage in the peripheral circuits of SRAM-based units is a major contributor to overall power dissipation as well as to temperature increases. We have developed a number of circuit techniques that use sleep transistors to reduce this leakage, as well as architectural techniques to control when leakage reduction is applied.
Finally, we studied power consumption in the main memory (DRAM) of embedded systems. For certain types of embedded systems and applications, DRAM is as important a component of overall power as the processor itself. We proposed ways to reduce DRAM power consumption by utilizing buffering, delayed writes, and prefetching techniques.
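The delayed-write idea can be sketched with a toy buffer model (capacity and policy here are illustrative, not the project's actual design): stores are held and coalesced so that the DRAM, which can sit in a low-power state between accesses, is powered up once per batch rather than once per write:

```python
class DelayedWriteBuffer:
    """Toy model: buffer and coalesce writes, flushing to DRAM in batches."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pending = {}        # address -> latest value (coalesced)
        self.dram_wakeups = 0    # each flush powers the DRAM up once

    def write(self, addr, value):
        self.pending[addr] = value          # same-address writes merge here
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        if self.pending:
            self.dram_wakeups += 1          # one power-up serves many writes
            self.pending.clear()

buf = DelayedWriteBuffer(capacity=4)
for i in range(8):
    buf.write(i % 4, i)   # 8 writes, only 4 distinct addresses
buf.flush()
print(buf.dram_wakeups)   # 2 -- versus 8 wakeups with write-through
```

Delaying writes this way trades a little write latency (and the need to keep buffered data safe) for longer DRAM idle intervals, which is where the power savings come from.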

Supported by the National Science Foundation and DARPA

Past Projects

Speeding up Mobile Code Execution on Resource-Constrained Embedded Processors
Supported by the National Science Foundation

Compiler-Controlled Continuous Power-Performance Management
Supported by DARPA

Adaptive Memory Reconfiguration & Management
Supported by DARPA