Execution Unit: Mini Superscalar Processor Design

My processor uses five execution units: two ALUs, one Branch Handling Unit, and two Load/Store Units. An overview of the system is shown in the figure below, and each unit is described in detail in the sections that follow. After the execution units, I explain how the equivalence of R7 and the PC is handled.

Arithmetic Logic Unit, The ALU

The instructions which are executed by this unit are:

The result, along with the carry and zero flags, is broadcast when execution finishes. Before the instruction completes, these flags are updated and checked against the old values to decide whether or not to write back to the register file. This is a single-cycle stage and can be visualized as:
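As a rough illustration of the flag handling described above, here is a minimal Python sketch of a conditional add. The 16-bit width, the condition names (`always`, `if_carry`, `if_zero`), and the choice to leave the flags untouched when the write-back is suppressed are all my assumptions for the example, not details taken from the actual design.

```python
MASK16 = 0xFFFF  # assumed 16-bit datapath; the width is not stated in the post


def alu_add(a, b):
    """Add two operands and return (result, carry, zero)."""
    total = (a & MASK16) + (b & MASK16)
    result = total & MASK16
    carry = total > MASK16   # carry out of the top bit
    zero = result == 0
    return result, carry, zero


def execute_conditional_add(a, b, old_carry, old_zero, condition):
    """Return (write_back, result, carry, zero).

    `condition` is a hypothetical selector: 'always', 'if_carry' or 'if_zero'.
    The old flag values decide whether the result is written back.
    """
    result, carry, zero = alu_add(a, b)
    if condition == "if_carry":
        write_back = old_carry
    elif condition == "if_zero":
        write_back = old_zero
    else:
        write_back = True
    if not write_back:
        # Assumption: a suppressed instruction leaves the flags unchanged.
        return False, None, old_carry, old_zero
    return True, result, carry, zero
```

For instance, `execute_conditional_add(1, 2, False, False, "if_carry")` suppresses the write-back because the old carry flag is clear.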

The Branch Handling Unit

The instructions executed by this unit are:

These instructions either use or modify the value of the PC. The unit consists of an adder and an equality check unit and, depending on the instruction, performs the corresponding action. The PC value calculated here is checked against the PC of the instruction that follows the branch in the ROB. If they match, the instruction is marked for completion; otherwise, the ROB is flushed after the branch instruction and the corresponding tags in the RRF are declared void.
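The check against the ROB can be sketched in Python as follows. This is only an illustrative model of a branch-if-equal: the `RobEntry` layout, the word-addressed PC with a fall-through of `pc + 1`, and marking the branch complete even on a mismatch are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    pc: int            # PC carried with the instruction
    dest_tag: int      # rename tag held in the RRF (hypothetical field)
    completed: bool = False


def resolve_branch(rob, index, operand_a, operand_b, offset):
    """Resolve a branch-if-equal at rob[index].

    Returns (next_pc, voided_tags). On a mismatch the ROB is truncated
    after the branch and the tags of the flushed entries are returned
    so that the RRF can declare them void.
    """
    branch = rob[index]
    taken = operand_a == operand_b                   # equality check unit
    next_pc = branch.pc + (offset if taken else 1)   # adder
    branch.completed = True
    younger = rob[index + 1:]
    if younger and younger[0].pc != next_pc:
        voided_tags = [entry.dest_tag for entry in younger]
        del rob[index + 1:]                          # flush after the branch
        return next_pc, voided_tags
    return next_pc, []
```

Redirecting the fetch stage to `next_pc` after a flush is left out of the sketch.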

Load/Store Unit

The instructions executed by this unit are:

Executing load/store instructions in parallel is one of the most challenging tasks because of the limited memory reference bandwidth available. Techniques such as 'Load Bypassing' and 'Load Forwarding' are used to address this. During execution, store instructions do not form the bottleneck, because no instructions other than loads depend on them. Many instructions, on the other hand, depend on the value fetched by a load. Therefore, stores are completed lazily and loads are given priority for access to memory.

Lazy store completion and retirement are handled by a store buffer working together with the reorder buffer (ROB). The store buffer consists of two segments: (1) Finished and (2) Completed. Finished stores are completed with the help of the ROB.
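One way to picture the two segments is the small Python model below. The class and method names, the use of a ROB tag to identify a store, and the `port_busy_with_load` flag that models load priority are hypothetical and chosen only for this sketch.

```python
from collections import deque


class StoreBuffer:
    """Two-segment store buffer: finished stores, then completed stores."""

    def __init__(self):
        self.finished = deque()    # executed, but not yet completed by the ROB
        self.completed = deque()   # completed, waiting to be written to memory

    def finish(self, rob_tag, address, data):
        """A store has executed: record its address and data."""
        self.finished.append((rob_tag, address, data))

    def complete(self, rob_tag):
        """The ROB completes a store; only the oldest finished store may move."""
        if self.finished and self.finished[0][0] == rob_tag:
            self.completed.append(self.finished.popleft())
            return True
        return False

    def retire_one(self, memory, port_busy_with_load):
        """Write the oldest completed store to memory when no load needs the port."""
        if port_busy_with_load or not self.completed:
            return False
        _, address, data = self.completed.popleft()
        memory[address] = data
        return True
```

Stores move from `finished` to `completed` strictly in order, and they reach memory only in cycles where a load is not using the memory port.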

Note: Loads and stores are executed in order. One of the units is dedicated to stores and the other to loads.

Load Bypassing and Load Forwarding: These techniques let load instructions execute before earlier store instructions are retired, i.e., written back to memory. Load bypassing checks the load's address for aliasing against the stores present in the store buffer. If there is no aliasing, the load can bypass the stores and read directly from memory without any loss of correctness. Load forwarding handles the case where aliasing does exist: the data held by the matching store is copied directly to the load's destination register. The store buffer search is done in parallel with the memory read, and the data is chosen depending on whether a match was found.
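The selection between bypassing and forwarding can be sketched in Python as below. The function name, the representation of the store buffer as a list of `(address, data)` pairs ordered oldest first, and the dictionary standing in for memory are assumptions for illustration. In hardware the memory read and the buffer search happen in parallel; here they are simply written one after the other.

```python
def execute_load(address, store_buffer, memory):
    """Return (value, source) for a load.

    `store_buffer` holds (address, data) pairs, oldest first, for stores
    that have not yet been written to memory.
    """
    memory_value = memory.get(address, 0)   # memory read (parallel path)

    forwarded = None
    for store_address, store_data in store_buffer:
        if store_address == address:
            forwarded = store_data          # keep the youngest matching store

    if forwarded is not None:
        return forwarded, "forwarded"       # aliasing: take the store's data
    return memory_value, "bypassed"         # no aliasing: take memory's data


memory = {100: 7, 104: 9}
pending_stores = [(100, 1), (108, 2), (100, 3)]
print(execute_load(100, pending_stores, memory))
print(execute_load(104, pending_stores, memory))
```

When several pending stores alias the load address, the sketch forwards from the youngest one, since that is the value the load would see if all the stores had already reached memory.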

The structure of the unit can be visualized as:

R7 equivalence to PC

This equivalence can be broken down into two parts:

Maintaining R7 to be equal to PC
Maintaining PC to be equal to R7

To solve the first part, we use the fact that the PC value corresponding to an instruction is forwarded along with the instruction from the fetch stage. Therefore, whenever the value of R7 is required, the corresponding PC value is used. The value of R7 in the ARF is also updated after the instruction is retired.

To solve the second part, we take the help of the ROB. When an instruction that modifies R7 is being marked as complete in the ROB, its write-back data is checked against the PC of the instruction that follows it. If they match, no extra work is needed; if they do not, all instructions after the R7-modifying instruction are flushed, just as they are after a completed branch instruction that changes the control flow.
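Both parts can be sketched in a few lines of Python. The function names and the dictionary layout of a ROB entry (`pc`, `dest`, `value`) are hypothetical and serve only to illustrate the two checks.

```python
R7 = 7  # register index that mirrors the PC


def read_register(index, arf, instruction_pc):
    """Part one: a read of R7 returns the PC carried with the instruction."""
    return instruction_pc if index == R7 else arf[index]


def complete_r7_write(rob, index):
    """Part two: check an R7 write against the PC of the next ROB entry.

    Returns True if younger instructions had to be flushed.
    """
    entry = rob[index]
    if entry["dest"] != R7 or index + 1 >= len(rob):
        return False                    # not an R7 write, or nothing younger
    if rob[index + 1]["pc"] == entry["value"]:
        return False                    # fetch already followed the new PC
    del rob[index + 1:]                 # flush everything after the R7 write
    return True
```

The flush here is the same truncation of the ROB that a mispredicted branch would cause.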
