HW2: Pipelines, out-of-order execution

Note: Make reasonable assumptions where necessary and clearly state them. Feel free to discuss problems with classmates, but the only written material that you may consult while writing your solutions are the textbook, lecture notes, and lecture slides.

Q1: Data Dependences

Consider a 32-bit in-order pipeline that has the following stages. Note some differences from the examples in class: a stage to do register reads, one stage to do register writes, two stages to access the data memory, and 3 stages for the FP-ALU.

Fetch Decode Regread IntALU Regwrite
IntALU Datamem1 Datamem2 Regwrite
FPALU1 FPALU2 FPALU3 Regwrite

After instruction fetch, the instruction goes through the Decode stage where dependences are analyzed, and a Regread stage where input operands are read from the register file. After this, an instruction takes one of three possible paths.

Int-adds go through the stages labeled "IntALU" and "Regwrite".

Loads/stores go through the stages labeled "IntALU", "Datamem1", "Datamem2", and "Regwrite".

FP-adds go through the stages labeled "FPALU1", "FPALU2", "FPALU3", and "Regwrite".

Assume that the register file has an infinite number of write ports so stalls are never introduced because of structural hazards, also assume that register read and register write take half a cycle same as in our simple 5-stage pipeline that we discussed in class.

How many stall cycles are introduced between the following pairs of successive instructions (i) for a processor with no register bypassing and (ii) for a processor with full bypassing?

  1. Int-add, followed by a dependent Int-add
  2. FP-add, followed by a dependent FP-add
  3. Load, providing the address for a store
  4. Load, providing the data for a store
  5. FP-add, providing the data for a store

Q2: Out-of-order processing

Consider an out-of-order processor similar to the one described in class. The architecture has 32 logical registers (also known as architected registers or program-defined registers and indicated as LR*) and 38 physical registers (indicated as PR*). On power up, the following program starts executing. To simplify the problem, some of the initialization code is not shown and you can ignore that code. The loop in the program is executed for at least three iterations.

Line1: L.D LR1 0(LR2)
       DADD LR3, LR3, LR1
       DADD LR2, LR2, 8
       BNE LR2, LR4, Line1

The processor has a width of 2, i.e., every pipeline stage can move up to 2 instructions through in every cycle. Show the renamed code for the first 12 instructions of this program. In what cycle will the 12th instruction get committed?

Assumptions:

Assume that branch prediction is perfect for a simple program like this. With the help of a trace cache, even fetch is perfect. Assume that caches are perfect as well. Assume that the dependent of a DADD instruction can leave the issue queue in the cycle right after the DADD. Assume that the dependent of an L.D cannot leave in the next cycle, but the cycle after that. Assume a ROB, an issue queue, and an LSQ with 20 entries each. When the thread starts executing, its logical register LR1 is mapped to physical register PR1, LR2 is mapped to PR2, and so on. An instruction goes through 5 pipeline stages before it gets placed in the issue queue and an additional 5 pipeline stages (6 for a LD/ST) after it leaves the issue queue (in other words, an instruction will take a minimum of 11 cycles to go through the pipeline).

Q3: Load-Store Queue

The table below lists a sequence of loads and stores in the LSQ, and the cycles when their one (for loads) or two (for stores) input operands are made available, and their computed effective addresses. Estimate when the address calculation happens for each ld/st and when each ld/st accesses the data memory. Assume that the processor does no memory dependence prediction to speculatively issue loads.

LD/ST The register for the address calculation is made available The register that must be stored into memory is made available The calculated effective address
LD 2 - abcd
ST 7 10 abde
LD 5 - abde
LD 3 - abcd
ST 3 4 abde
LD 1 - abde
LD 4 - abcd

Submit your solution through Gradescope HW1 (as a PDF file (please mark which parts of the PDF are used for each question (this can be done through Gradescope)).

Updated: November, 2019