REGISTER RENAMING

In computer engineering, 'register renaming' refers to a technique used
to avoid unnecessary serialization of program operations imposed by the reuse
of registers by those operations.

Contents
Problem definition
Hazards and renaming
Architectural vs physical registers
Details: tag-indexed register file
Details: reservation stations
Comparison between the schemes
History
References

Problem definition


Programs are composed of instructions which operate on values. The
instructions must name these values in order to distinguish them from one
another. A typical instruction might say, add X and Y and put the result
in Z. In this instruction, X, Y, and Z are the names of storage locations.
In order to have a compact instruction encoding, most processor instruction
sets have a small set of special locations which can be directly named.
For example, the x86 instruction set architecture has 8 integer registers,
x86-64 has 16, many RISCs have 32, and IA-64 has 128. In smaller processors,
the names of these locations correspond directly to elements of a
register file.
Different instructions take different amounts of time. For instance, a
processor may be able to execute hundreds of instructions while a
single load from main memory is in progress. Shorter instructions executed while the load is outstanding will finish first, thus the instructions are finishing out of the original program order. Out of order execution has been used in most recent high-performance CPUs to achieve some of their speed gains.
Consider this piece of code running on an out-of-order CPU:
# Load register 1 from memory location 1024
# Add the number 2 to register 1
# Store register 1 to memory location 1032
# Load register 1 from memory location 2048
# Add the number 4 to register 1
# Store register 1 to memory location 2056
Instructions 4, 5, and 6 are independent of instructions 1, 2, and 3, but
the processor cannot finish 4 until 3 is done, because 3 would then write
the wrong value.
We can eliminate this restriction by changing the names of some of the
registers:
# Load register 1 from memory location 1024
# Add the number 2 to register 1
# Store register 1 to memory location 1032
# Load register 2 from memory location 2048
# Add the number 4 to register 2
# Store register 2 to memory location 2056
Now instructions 4, 5, and 6 can be executed in parallel with instructions
1, 2, and 3, so that the program can be executed faster.
When possible, the compiler performs this renaming. The compiler is constrained in many ways, primarily by the finite number of register names
in the instruction set. Many high performance CPUs have more physical registers than may be named directly in the instruction set, so rename registers in hardware to achieve additional parallelism.

Hazards and renaming


When more than one instruction references a particular location for an operand, either reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to three kinds of problems, also known as hazards:

★ 'Read-after-write (RAW)'
:A read from a register or memory location must return the value placed there by the last write in program order, not some other write. This is referred to as a 'true dependency' or 'flow dependency', and requires the instructions to execute in program order.

★ 'Write-after-write (WAW)'
:Successive writes to a particular register or memory location must leave that location containing the result of the second write. This can be resolved by 'squashing' (synonyms: cancelling, annulling, mooting) the first write if necessary. WAW dependencies are also known as 'output dependencies'.

★ 'Write-after-read (WAR)'
:A read from a register or memory location must return the last prior value written to that location, and not one written programmatically after the read. This is the sort of 'false dependency' that can be resolved by renaming. WAR dependencies are also known as 'anti-dependencies'.
:Instead of delaying the write until all reads are completed, two copies of the location can be maintained, the old value and the new value. Reads that precede, in program order, the write of the new value can be provided with the old value, even while other reads that follow the write are provided with the new value. The false dependency is broken and additional opportunities for out-of-order execution are created. When all reads needing the old value have been satisfied, it can be discarded. This is the essential concept behind register renaming.
Anything that is read and written can be renamed. While the general-purpose and floating-point registers are discussed the most, flag and status registers or even individual status bits are commonly renamed as well.
Memory locations can also be renamed, although it is not commonly done to the extent practiced in register renaming. The Transmeta Crusoe processor's gated store buffer is a form of memory renaming.
If programs refrained from reusing registers immediately, there
would be no need for register renaming. Some instruction sets (e.g. IA-64)
specify very large numbers of registers for specifically this reason.
There are limitations to this approach:

★ It is very difficult for the compiler to avoid reusing registers without large code size increases. In loops, for instance, successive iterations would have to use different registers, which requires replicating the code (but see register rotation)

★ Large numbers of registers require lots of bits to specify those registers, making the code size increase.

★ Many instruction sets historically specified smaller numbers of registers and cannot be changed now.
Code size increases are important because when the program code is
larger, the instruction cache misses more often and the processor
stalls waiting for new instructions.

Architectural vs physical registers


Machine language programs specify reads and writes to a limited set of registers specified by the instruction set architecture (ISA).
For instance, the DEC Alpha ISA specifies 32 integer registers, each 64 bits wide, and 32 floating-point registers, each 64 bits wide. These are the 'architectural' registers. Programs written for processors running the Alpha instruction set will specify operations reading and writing those 64 registers. If a programmer stops the program in a debugger, she or he can observe the contents of these 64 registers (and a few status registers) to determine the progress of the machine.
One particular processor which implements this ISA, the Alpha 21264, has 80 integer and 72 floating-point 'physical' registers. There are, on an Alpha 21264 chip, 80 physically separate locations which can store the results of integer operations, and 72 locations which can store the results of floating point operations. (In fact, there are even more locations than that, but those extra locations are not germane to the register renaming operation.)
Below are described two styles of register renaming, distinguished by the circuit which holds data ready for an execution unit.
In all renaming schemes, the machine converts the architectural
registers referenced in the instruction stream into tags. Where the
architectural registers might be specified by 3 to 5 bits, the tags
are usually a 6 to 8 bit number. The rename file must have
a read port for every input of every instruction renamed every cycle,
and a write port for every output of every instruction renamed every
cycle. Because the size of a register file generally grows as the
square of the number of ports, the rename file is usually physically
large and consumes significant power.
In the 'tag-indexed register file' style, there is one large register file for data values, containing one register for every tag. For example, if the machine has 80 physical registers, then it would use 7 bit tags. 48 of the possible tag values in this case are unused.
In this style, when an instruction is issued to an execution unit, the tags of the source registers are sent to the physical register file, where the values corresponding to those tags are read and sent to the execution unit.
In the 'reservation station' style, there are many small associative register files, usually one at the inputs to each execution unit. Each operand of each instruction in an issue queue has a place for a value in one of these register files.
In this style, when an instruction is issued to an execution unit, the register file entries corresponding to the issue queue entry are read and forwarded to the execution unit.







Architectural Register File
or Retirement Register File (RRF)

The committed register state of the machine.
RAM indexed by logical register number.
Typically written into as results are retired
or committed out of a reorder buffer.

Future File

The most speculative register state of the machine.
RAM indexed by logical register number.

Active Register File

The Intel P6 group's term for Future File.

History Buffer

Typically used in combination with a future file.
Contains the "old" values of registers that have
been overwritten. If the producer is still in flight
it may be RAM indexed by history buffer number.
After a branch misprediction must use results
from the history buffer - either they are copied,
or the future file lookup is disabled and the
history buffer is CAM indexed by logical register
number.

Reorder Buffer (ROB)

Pretty much any structure that is sequentially (circularly)
indexed on a per operation basis, for instructions in flight.
Except... differs from a history buffer, in that the
reorder buffer typically comes after the future file (if it exists)
and before the architectural register file.


Reorder buffers come in data-less and data-ful versions.


In Willamette's ROB, the ROB entries point to registers
in the physical register file (PRF), and also contain other
bookkeeping. This was also the first OOO design done by Andy Glew,
at Illinois with HaRRM.


In P6's ROB, the ROB entries contain data; there is no
separate PRF. Data values from the ROB are copied
from the ROB to the RRF at retirement.



Small detail: if there is temporal
locality in ROB entries, i.e. if instructions close together
in the Von Neuman instruction sequence write back
close together in time, it may be possible to perform write
combining on ROB entries and so have fewer ports than
a separate ROB/PRF would. It's not clear if it makes a difference,
since a PRF should be banked.
ROBs usually don't have associative logic,
and certainly none of the ROBs designed by Andy Glew have CAMs.
Keith Dieffendorf insisted that ROBs have
complex associative logic for many years. The first
ROB proposal may have had CAMs.

Details: tag-indexed register file


register_renaming:tag_indexed_scheme.png

This is the renaming style used in the MIPS R10000, the 21264, and in
the FP section of the AMD Athlon.
In the renaming stage, every architectural register referenced (for
read or write) is looked up in an architecturally-indexed 'remap file'.
This file returns a tag and a ready bit. The tag is unready
if there is a queued instruction which will write to it that has not
yet executed. For read operands, this tag takes the place of the
architectural register in the instruction. For every register write,
a new tag is pulled from a 'free tag FIFO',
and a new mapping is written into the remap file, so that
future instructions reading the architectural register will refer to
this new tag. The tag is marked as unready, because the instruction
has not yet executed. The previous physical register allocated for
that architectural register is saved with the instruction in the
'reorder buffer', which is a FIFO that holds the instructions in
program order between the decode and graduation stages.
The instructions are then placed in various 'issue queues'.
As instructions are executed, the tags for their results are
broadcast, and the issue queues match these tags against the tags
of their unready source operands. A match means that the operand
is ready. The remap file also matches these tags, so that it can mark
the corresponding physical registers as ready.
When all the operands to an instruction in an issue queue are ready,
that instruction is ready to issue. The issue queues pick ready
instructions to send to the various functional units each cycle.
Unready instructions stay in the issue queues. This unordered removal
of instructions from the issue queues is one of the things that makes
them large and use lots of power.
Issued instructions read from a tag-indexed physical register file
(bypassing just-broadcast operands), then execute.
Execution results are written to tag-indexed physical register file,
as well as broadcast to the bypass network preceding each functional
unit.
Graduation puts the previous tag for the written architectural
register into the free queue so that it can be reused for a newly
decoded instruction.
An exception or branch misprediction causes the remap file to back up
to the remap state at last valid instruction via combination of state
snapshots and cycling through the previous tags in the in-order
pre-graduation queue. Since this mechanism is required, and since it
can recover any remap state (not just the state before the instruction
currently being graduated), branch mispredictions can be handled
before the branch reaches graduation, potentially hiding the branch
misprediction latency.

Details: reservation stations


register_renaming:reservation_station_scheme.png

This is the style used in the integer section of the AMD K7 and K8
designs.
In the renaming stage, every architectural register referenced for
reads is looked up in both the architecturally-indexed 'future file'
and the rename file.
The future file read gives the value of that register, if there is no
outstanding instruction yet to write to it (i.e. it's "ready").
When the instruction is placed in an issue queue, the values read from
the future file are written into the corresponding entries in the
reservation stations. Register writes in the instruction cause a new,
unready tag to be written into the rename file. The tag number is
usually serially allocated in instruction order -- no free tag FIFO is
necessary.
Just as with the tag-indexed scheme, the issue queues wait for unready
operands to see matching tag broadcasts. Unlike the tag-indexed
scheme, matching tags cause the corresponding broadcast value to be
written into the issue queue entry's reservation station.
Issued instructions read their arguments from the reservation station,
bypass just-broadcast operands, and then execute. As mentioned
earlier, the reservation station register files are usually small,
with perhaps eight entries.
Execution results are written to the reorder buffer, to the reservation
stations (if the issue queue entry has a matching tag), and to the
future file if this is the last instruction to target that
architectural register (in which case register is marked ready).
Graduation copies the value from the reorder buffer into the
architectural register file. The sole use of the architectural
register file is to recover from exceptions and branch mispredictions.
Exceptions and branch mispredictions, recognised at graduation, cause
the architectural file to be copied to the future file, and all
registers marked as ready in the rename file. There is usually no way to
reconstruct the state of the future file for some instruction intermediate
between decode and graduation, so there is usually no way to do early
recovery from branch mispredictions.

Comparison between the schemes


In both schemes, instructions are inserted in-order into the issue
queues, but are removed out-of-order. If the queues do not collapse
"holes", then they will either have many unused entries, or require
some sort of variable priority encoding for when multiple instructions
are simultaneously ready to go. Queues that collapse holes have
simpler priority encoding, but require simple but large circuitry to
advance instructions through the queue.
Reservation stations have better latency from rename to execute,
because the rename stage finds the register values directly, rather
than finding the physical register number, and then using that to find
the value. This latency shows up as a component of the branch
mispredict latency.
Reservation stations also have better latency from instruction issue
to execution, because each local register file is smaller than the
large central file of the tag-indexed scheme. Tag generation and
exception processing are also simpler in the reservation station
scheme, as discussed below.
The physical register files used by reservation stations usually
collapse unused entries in parallel with the issue queue they serve,
which makes these register files larger in aggregate, and burn more power,
and more complicated than the simpler register files used in a tag-indexed
scheme. Worse yet, every entry in each reservation station can be
written by every result bus, so that a reservation-station machine
with e.g. 8 issue queue entries per functional unit will typically
have 9 times as many bypass networks as an equivalent tag-indexed
machine. Result forwarding thus takes much more power and area than
in a tag-indexed design.
Furthermore, the reservation station scheme has four places (Future
File, Reservation Station, Reorder Buffer and Architectural File)
where a result value can be stored, where the tag-indexed scheme has
just one (the physical register file). Because the results from the
functional units, broadcast to all these storage locations, must reach
a much larger number of locations in the machine than in the
tag-indexed scheme, this function consumes more power, area, and time.

History


The original R10000 design had neither collapsing issue queues nor
variable priority encoding, and suffered starvation problems as a
result -- the oldest instruction in the queue would sometimes not be
issued until both instruction decode stopped completely for lack of
rename registers, and every other instruction had been issued. Later
revisions of the design used a partially variable priority encoder to
mitigate this problem.
Early out-of-order machines did not separate the renaming and
ROB/PRF storage functions. For that matter, some of the
earliest, such as Sohi's RUU or the Metaflow DCAF,
combined scheduling, renaming, and storage all in the same
structure.
Most modern machines do renaming by RAM indexing a map table
with the logical register number. E.g. P6 did this; future files do this,
and have data storage in the same structure.
However, earlier machines used CAMs in the renamer.
E.g. the HPSM RAT, or Register Alias Table, essentially
used a CAM on the logical register number in combination
with different versions of the register.
In many ways, the story of out-of-order microarchitecture
has been how these CAMs have been progressively eliminated.
Small CAMs are useful; large CAMs are impractical.
The P6 "microarchitecture" was the first Intel based processors that implemented both out-of-order execution and register renaming. The P6 microarchitecture manifested in Pentium Pro, Pentium II, Pentium III, Pentium M, Core 2 and Core 2 DUO microprocessors.

References



"Implementing Precise Exceptions in Pipelined Processors", J.E. Smith and A.R. Pleszkun, 1985

This article provided by Wikipedia. To edit the contents of this article, click here for original source.

psst.. try this: add to faves