Designing a mini Superscalar Processor

Sub-parts of a Superscalar Processor

The processor is designed by dividing the task into several smaller parts. This lets us concentrate on one problem at a time while keeping track of how the overall design is progressing. Here I will be designing a 2-way-fetch, out-of-order superscalar microprocessor, IITB-RISC. Its ISA can be found here. I have divided the overall design into the following sub-designs:

Fetch
Decode
Dispatch
Execute
Complete
Retire

The implemented design can be found here.

Instruction Fetch

To fetch 2 instructions per cycle, we need an instruction memory with two read ports. As we always read from 2 consecutive addresses, we need a single address register to store the current address (PC). To store the fetched instructions, we need 2 additional registers (Register1 and Register2).

To improve the efficiency of the processor, we will use a Branch Predictor, which helps avoid stalling the processor on branches. The arrangement of these components can be visualized as:

I will discuss the structure of the Branch Predictor in detail later. For simplicity, you can assume that the Branch Predictor at present just returns PC+1 as its output.
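As a rough illustration of the fetch stage described above, here is a minimal Python sketch. The names `FetchStage`, `predict_next` and `imem` are hypothetical, and the sketch assumes the predictor is queried with the address of the last fetched instruction, so that "PC+1" advances past both instructions. It does not handle reads past the end of the instruction memory.

```python
from typing import Callable, List, Optional, Tuple


def predict_next(pc: int) -> int:
    """Placeholder predictor: always predicts the next sequential address."""
    return pc + 1


class FetchStage:
    def __init__(self, imem: List[int],
                 predictor: Callable[[int], int] = predict_next):
        self.imem = imem            # instruction memory with two read ports
        self.predictor = predictor
        self.pc = 0                 # address register (PC)
        self.register1: Optional[int] = None
        self.register2: Optional[int] = None

    def cycle(self) -> Tuple[int, int]:
        # Both read ports are used in the same cycle on consecutive addresses.
        self.register1 = self.imem[self.pc]
        self.register2 = self.imem[self.pc + 1]
        # Assumption: the predictor sees the address of the last fetched
        # instruction and returns the address to fetch from next.
        self.pc = self.predictor(self.pc + 1)
        return self.register1, self.register2
```

Calling `cycle()` repeatedly on a four-word memory fetches words 0 and 1, then words 2 and 3.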

Instruction Decode

In the decode stage, we need to decode the instruction code into (a) the operation, (b) the registers it reads, and (c) the registers it modifies. This is done side by side for both fetched instructions using comparators. Along with this, we need to make the corresponding changes in the Architectural Register File (ARF), the Renamed Register File (RRF) and the Re-order Buffer. Because this stage is complex, we break it down further into the following steps:

1. Decoding

The fetched instructions are first decoded and then compared with each other to detect data dependencies. This can be viewed with the help of the following diagram:
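The comparison between the two decoded instructions can be sketched in Python as follows. This is only an illustration: the `Decoded` record and the `raw_dependencies` function are hypothetical names, and the sketch checks only read-after-write dependencies of the second instruction on the first, which is the job the comparators do in hardware.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Decoded:
    op: str                    # decoded operation
    srcs: Tuple[int, ...]      # registers the instruction reads
    dest: Optional[int]        # register it modifies, or None


def raw_dependencies(first: Decoded, second: Decoded) -> Tuple[bool, ...]:
    """For each source of `second`, report whether it reads the register
    that `first` writes (one comparator per source operand)."""
    return tuple(first.dest is not None and src == first.dest
                 for src in second.srcs)
```

For example, if the first instruction writes R3 and the second reads R3 and R4, the result is `(True, False)`: the first operand of the second instruction must wait for the first instruction's result.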

2. The Register Files

The information about the decoded instructions is now passed to the register files, which return either the operand value or a tag for each source register, as the case may be. The complete working of the architectural register file is described here.
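To make the "value or tag" behaviour concrete, here is a minimal Python sketch of one common ARF/RRF renaming scheme. It is an assumption about the organisation, not a description of the actual design: the field names (`busy`, `tag`, `valid`, `in_use`), the class `RegisterFiles` and the sizes (8 architectural and 4 rename registers) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ArfEntry:
    value: int = 0
    busy: bool = False   # True while a newer value is pending in the RRF
    tag: int = 0         # index of the RRF entry holding the newer value


@dataclass
class RrfEntry:
    value: int = 0
    valid: bool = False  # True once the result has been written
    in_use: bool = False


class RegisterFiles:
    def __init__(self, n_arf: int = 8, n_rrf: int = 4):
        self.arf = [ArfEntry() for _ in range(n_arf)]
        self.rrf = [RrfEntry() for _ in range(n_rrf)]

    def read_operand(self, reg: int) -> Tuple[str, int]:
        """Return ("value", v) if the operand is available, else ("tag", t)."""
        entry = self.arf[reg]
        if not entry.busy:
            return ("value", entry.value)
        renamed = self.rrf[entry.tag]
        if renamed.valid:
            return ("value", renamed.value)
        return ("tag", entry.tag)

    def rename_dest(self, reg: int) -> int:
        """Allocate a free rename register for a destination register."""
        for tag, r in enumerate(self.rrf):
            if not r.in_use:
                self.rrf[tag] = RrfEntry(in_use=True)
                self.arf[reg].busy = True
                self.arf[reg].tag = tag
                return tag
        raise RuntimeError("no free rename register; decode must stall")

    def write_result(self, tag: int, value: int) -> None:
        """Called when an execution unit produces the renamed result."""
        self.rrf[tag].value = value
        self.rrf[tag].valid = True
```

A source register that has been renamed but not yet computed yields a tag; once `write_result` is called for that tag, the same read yields the value.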

3. The Re-order Buffer

After the operand values are obtained, the corresponding instruction information is transferred to the Re-order Buffer (ROB) and the reservation stations. Note that up to this point all instructions have been processed in program order. Once they enter the reservation stations, instructions can be issued out of order. Details about the ROB are described here.

Instruction Dispatch

Decoded instructions are transferred to reservation stations, from where they are dispatched when they are ready. An instruction is said to be ready when all its input operands are present. This is indicated by the validity bit. This stage interfaces with three other stages: (1) Decode, (2) Execute, and (3) Complete. There are two types of reservation stations: (1) Centralized, and (2) Distributed. The design of the Centralized Reservation Station (CRS) is described next.
Each register in the RS has the following format:

Tag	Occupancy	Data/Reference-Tag
X bits	1 bit	N bits

Here the occupancy bit indicates whether that register is occupied. The structure of the reservation station (assuming that we have 5 execution units) can be visualized as:

The three units: (1) Allocating Unit, (2) Dispatching Unit, and (3) Update Unit are described in detail here.
Besides the CRS there are Distributed Reservation Stations (DRS), which is what I have used in my design of the superscalar processor. The difference in the design is that the registers are distributed, with separate 'Dispatching' and 'Update' Units for each station. The allocating unit is now divided into 2 parts: (1) Central and (2) Distributed. These are explained alongside the CRS units.
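The roles of the three units can be illustrated with a small Python sketch of a single reservation station. This is a simplified model under stated assumptions: the names `Operand`, `RsEntry` and `ReservationStation` are hypothetical, each operand carries its own validity bit, and the dispatching logic simply picks the lowest-numbered ready slot rather than the oldest instruction.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Operand:
    valid: bool    # validity bit: True if payload is the actual value
    payload: int   # the value if valid, otherwise the producer's tag


@dataclass
class RsEntry:
    op: str
    operands: List[Operand]

    def ready(self) -> bool:
        return all(o.valid for o in self.operands)


class ReservationStation:
    def __init__(self, size: int):
        # None models an unoccupied slot (occupancy bit cleared).
        self.slots: List[Optional[RsEntry]] = [None] * size

    def allocate(self, entry: RsEntry) -> bool:
        """Allocating unit: place the entry in a free slot, if any."""
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = entry
                return True
        return False  # station full; the decode stage must stall

    def update(self, tag: int, value: int) -> None:
        """Update unit: a finished instruction broadcasts (tag, value)."""
        for slot in self.slots:
            if slot is None:
                continue
            for o in slot.operands:
                if not o.valid and o.payload == tag:
                    o.valid = True
                    o.payload = value

    def dispatch(self) -> Optional[RsEntry]:
        """Dispatching unit: remove and return one ready entry, if any."""
        for i, slot in enumerate(self.slots):
            if slot is not None and slot.ready():
                self.slots[i] = None
                return slot
        return None
```

An entry waiting on tag 3 stays in the station until `update(3, value)` is called, after which `dispatch()` returns it and frees the slot.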

Instruction Execute

This stage consists of various execution units of different types in parallel. This can be viewed as follows:

Each execution unit is described in detail here.

Instruction Complete

Every cycle, the ROB sends a maximum of 2 instructions to the Complete stage. These instructions are stored in the Instruction Complete Buffer, where they are kept until they are retired by the Retire stage. Separating instruction finish from instruction completion provides a mechanism for lazily retiring instructions that write to memory.
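The hand-off from the ROB to the completion buffer can be sketched in Python as follows. This is a minimal model, assuming the ROB is a FIFO whose head entry must have finished execution before it can move on; `RobEntry` and `complete_cycle` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    name: str
    finished: bool = False   # set when the execution unit reports a result


def complete_cycle(rob: deque, completion_buffer: list, width: int = 2) -> int:
    """Move up to `width` finished instructions, in program order, from the
    head of the ROB to the completion buffer. Returns how many were moved."""
    moved = 0
    while moved < width and rob and rob[0].finished:
        completion_buffer.append(rob.popleft())
        moved += 1
    return moved
```

Because only the head of the ROB is examined, an unfinished instruction blocks all younger ones, which is what keeps the completion buffer in program order.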

Instruction Retire

Instructions are retired after they are completed. As the completed instructions are held in program order, retiring is simple. Every cycle, the top two instructions in the completion buffer are written back to memory or the register file, as the case may be. Writes to memory are done lazily so as to leave maximum memory bandwidth for load instructions. For this purpose a store queue is maintained, which is written by the store execution unit and can be referenced by the load execution unit.
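A store queue of this kind can be sketched in Python as below. This is an illustrative model only: the class `StoreQueue`, its method names, and the `memory_port_free` flag are hypothetical, memory is modelled as a dictionary, and a load is assumed to take the youngest matching store from the queue before falling back to memory.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class StoreQueue:
    def __init__(self, memory: Dict[int, int]):
        self.memory = memory
        self.pending: Deque[Tuple[int, int]] = deque()  # (address, value)

    def push(self, addr: int, value: int) -> None:
        """Store unit: record the store without touching memory yet."""
        self.pending.append((addr, value))

    def load(self, addr: int) -> int:
        """Load unit: the youngest pending store to `addr` wins; otherwise
        read memory (unwritten addresses read as 0 in this sketch)."""
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain_one(self, memory_port_free: bool) -> bool:
        """Lazy write-back: commit the oldest store only when loads are not
        using the memory port. Returns True if a store was written."""
        if memory_port_free and self.pending:
            a, v = self.pending.popleft()
            self.memory[a] = v
            return True
        return False
```

Loads see the latest stored value immediately, while the actual memory write is deferred to a cycle in which the memory port is otherwise idle.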
