Some notes about ir3, the compiler and machine-specific IR for the shader ISA introduced with adreno a3xx. The same shader ISA is present, with some small differences, on adreno a4xx.
Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a “simple” scalar instruction set. However, the compiler is responsible, in most cases, for scheduling the instructions. The hardware does not try to hide the shader core pipeline stages. For example, a common (cat2) ALU instruction takes four cycles, so a subsequent cat2 instruction which uses the result must have three intervening instructions (or nops). When operating on vec4’s, the corresponding scalar instructions for the remaining three components can typically fill those slots. But that results in a lot of edge cases where things fall over, like:
ADD TEMP[0], TEMP[1], TEMP[2]
MUL TEMP[0], TEMP[1], TEMP[0].wzyx
Here, the second instruction needs the output of the first group of scalar instructions in the wrong order, so there are not enough instruction slots between the add r0.w, r1.w, r2.w and the mul r0.x, r1.x, r0.w. This is why the original/old compiler, which merely translated nearly literally from TGSI to ir3, had a strong tendency to fall over.
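To make the delay-slot problem concrete, here is a rough sketch (not compiler code) that counts how many instructions sit between each scalar add and the scalar mul that consumes its result, when both vec4 ops are naively expanded in x, y, z, w order. The function name and the fixed three-slot delay for every instruction are assumptions for illustration.

```python
DELAY_SLOTS = 3  # assumption: every instruction here is a cat2 needing 3 intervening slots
COMPS = "xyzw"


def shortfalls(swizzle):
    """Return, per mul component, how many delay slots are missing.

    Models the naive expansion
        add.x add.y add.z add.w mul.x mul.y mul.z mul.w
    where mul.<c> reads the add result selected by `swizzle`.
    """
    add_pos = {c: i for i, c in enumerate(COMPS)}  # position of each scalar add
    missing = {}
    for i, c in enumerate(COMPS):
        mul_pos = len(COMPS) + i          # the muls follow the four adds
        producer = add_pos[swizzle[i]]    # add that wrote the swizzled source
        between = mul_pos - producer - 1  # instructions already in between
        missing[c] = max(0, DELAY_SLOTS - between)
    return missing


print(shortfalls("wzyx"))
print(shortfalls("xyzw"))
```

With the .wzyx swizzle the x component comes out three slots short and y one slot short, while an identity swizzle leaves nothing short, which is the case the naive translation happened to get right.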
So the current compiler instead, in the frontend, generates a directed-acyclic-graph of instructions and basic blocks, which go through various additional passes to eventually schedule and do register assignment.
For additional documentation about the hardware, see wiki: a3xx ISA.
The ir3 IR maps quite directly to the hardware, in that instruction opcodes map directly to hardware opcodes, and that dst/src register(s) map directly to the hardware dst/src register(s). But there are a few extensions, in the form of meta instructions. And additionally, for normal (non-const, etc) src registers, the IR3_REG_SSA flag is set and reg->instr points to the source instruction which produced that value. So, for example, the following TGSI shader:
VERT
DCL IN[0]
DCL IN[1]
DCL OUT[0], POSITION
DCL TEMP[0], LOCAL
1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
2: MOV OUT[0], TEMP[0].xxxx
3: END
eventually generates:
(after scheduling, etc, but before register assignment).
Represents a basic block.
TODO: currently blocks are nested, but I think I need to change that to a more conventional arrangement before implementing proper flow control. Currently the only flow control handled is if/else, which gets flattened out with results chosen by sel instructions.
TODO
Certain instructions, such as texture sample instructions, consume multiple consecutive scalar registers via a single src register encoded in the instruction, and/or write multiple consecutive scalar registers. In the simplest example:
sam (f32)(xyz)r2.x, r0.z, s#0, t#0
for a 2d texture, would read r0.zw to get the coordinate, and write r2.xyz.
Before register assignment, to group the two components of the texture src together:
The frontend sets up the SSA ptrs from sam source register to the fanin meta instruction, which in turn points to the instructions producing the coord.x and coord.y values. And the grouping pass sets up the left and right neighbor pointers to the fanin's sources, used later by the register assignment pass to assign blocks of scalar registers.
And likewise, for the consecutive scalar registers for the destination:
Most instructions support a relative src flag, in which case the src value is taken from r<a0.x + n> or c<a0.x + n> (where n is encoded in the instruction). Relative addressing of the const file (for example, a uniform array) is relatively simple, as we don’t have to do register assignment of the const file. A deref instruction points to the address register src and encodes the constant offset. For example:
Note that nops for scheduling constraints, type specifiers (ie. add.f vs add.u), etc, are omitted for brevity in examples.
mova a0.x, hr1.y
add r0.x, r1.y, c<a0.x + 2>
results in:
The scheduling pass has some smarts to schedule things such that only a single a0.x value is used at any one time.
To implement variable arrays, values are stored in consecutive scalar registers. This has some overlap with register groups, in that fanin and fanout are used to help group things for the register assignment pass.
To use a variable array as a src register, a slight variation of what is done for a const array src is used. The deref instruction has a second src which is the fanin instruction that groups all the array members:
mova a0.x, hr1.y
add r0.x, r1.y, r<a0.x + 2>
results in:
TODO better describe how actual deref offset is derived, ie. based on array base register.
To do an indirect write to a variable array, a fanout is used. Say the array was assigned to registers r0.z through r1.y (hence the constant offset of 2):
min r0.z, c0.x, c0.y
mova a0.x, hr1.y
add r<a0.x + 2>, r1.y, r2.x
mul r0.x, r0.z, c0.z
In this case, the fanout has a second src which points back to the instruction which last wrote the array element. In this case, the add instruction does not write all elements of the array (compared to usage of fanout for sam instructions in grouping). So now each array element has two potential assigners, which needs to be accounted for in scheduling and register assignment stages. This is handled via that second src, and drawing left/right arrows between them.
Note that there would in fact be fanin nodes generated for each array element (although only the reachable ones will be scheduled, etc).
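The constant offset of 2 above falls out of treating the register file as a flat array of scalars, four per rN. A small sketch of that arithmetic, assuming the flat index is 4 * N + component (the helper names are made up for illustration and are not ir3 functions):

```python
COMPS = "xyzw"


def scalar_index(reg):
    """Flat scalar index of a register name, e.g. "r1.y" -> 5 (assumed layout)."""
    num, comp = reg[1:].split(".")
    return int(num) * 4 + COMPS.index(comp)


def reg_name(index, prefix="r"):
    """Inverse of scalar_index: flat index back to a register name."""
    return f"{prefix}{index // 4}.{COMPS[index % 4]}"


def resolve_relative(a0x, n):
    """Register actually accessed by r<a0.x + n> for a given a0.x value."""
    return reg_name(a0x + n)


base = scalar_index("r0.z")   # first array element, the constant offset n
last = scalar_index("r1.y")   # last array element
print(base, last, [resolve_relative(i, base) for i in range(last - base + 1)])
```

Under that assumption, an array living in r0.z through r1.y has four elements, and r<a0.x + 2> with a0.x in 0..3 walks r0.z, r0.w, r1.x, r1.y.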
TODO left/right ptrs need to be set on the 2nd src’s to the fanouts, but how will that work, when the 2nd src might be the fanout from a previous indirect access???
TODO scheduler probably needs to somehow enforce that the fanin's first src is scheduled before its second src!
After the frontend has generated the use-def graph of instructions, they are run through various passes which include scheduling and register assignment. Because inserting mov instructions after scheduling would also require inserting additional nop instructions (since it is too late to reschedule to try and fill the bubbles), the earlier stages try to ensure (at least given an infinite supply of registers) that register assignment after scheduling cannot fail.
Note that we essentially have ~256 scalar registers in the architecture (although larger register usage will at some thresholds limit the number of threads which can run in parallel). And at some point we will have to deal with spilling.
In this stage, simple if/else blocks are flattened into a single block with phi nodes converted into sel instructions. The a3xx ISA has very few predicated instructions, and we would prefer not to use branches for simple if/else.
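Roughly, flattening amounts to something like the following sketch. Instructions are plain tuples, the function name is hypothetical, and the operand order shown for sel is only illustrative, not taken from the ISA docs. It also assumes neither arm has side effects, since both arms end up executing unconditionally.

```python
def flatten_if_else(cond, then_instrs, else_instrs, phis):
    """Merge both arms of a simple if/else into one straight-line block.

    phis is a list of (dst, then_value, else_value) tuples, one per phi node.
    """
    block = list(then_instrs) + list(else_instrs)  # both arms always execute
    for dst, then_val, else_val in phis:
        # Illustrative operand order: dst = cond ? then_val : else_val
        block.append(("sel", dst, then_val, cond, else_val))
    return block


flat = flatten_if_else(
    "cond",
    [("add", "t0", "a", "b")],   # then arm
    [("mul", "t1", "a", "b")],   # else arm
    [("result", "t0", "t1")],    # phi merging the two arms
)
for instr in flat:
    print(instr)
```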
Currently the frontend inserts movs in various cases, because certain categories of instructions have limitations about const regs as sources. And the CP pass simply removes all simple movs (ie. src-type is same as dst-type, no abs/neg flags, etc).
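A minimal sketch of what removing simple movs could look like on the SSA graph. The class and field names here are invented for illustration (only the idea of a src pointing at its producing instruction comes from the notes above), and const sources are not modelled.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Instr:
    op: str
    srcs: list = field(default_factory=list)  # list of Src
    src_type: str = "f32"
    dst_type: str = "f32"


@dataclass(eq=False)
class Src:
    instr: Instr                    # producer of the value (the SSA pointer)
    flags: frozenset = frozenset()  # e.g. {"abs"} or {"neg"}


def is_simple_mov(instr):
    """Same src/dst type and no abs/neg flags on the mov's source."""
    return (instr.op == "mov"
            and instr.src_type == instr.dst_type
            and not instr.srcs[0].flags)


def copy_propagate(instr, seen=None):
    """Rewrite every src reachable from instr to skip over simple movs."""
    seen = set() if seen is None else seen
    if id(instr) in seen:
        return
    seen.add(id(instr))
    for src in instr.srcs:
        while is_simple_mov(src.instr):
            src.instr = src.instr.srcs[0].instr  # point past the mov
        copy_propagate(src.instr, seen)


inp = Instr("input")
plain = Instr("mov", [Src(inp)])                       # simple, gets skipped
negated = Instr("mov", [Src(inp, frozenset({"neg"}))])  # not simple, is kept
add = Instr("add", [Src(plain), Src(negated)])
copy_propagate(add)
print(add.srcs[0].instr.op, add.srcs[1].instr.op)
```

The mov carrying a neg flag survives because dropping it would change the value; only the plain copy is bypassed.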
The eventual plan is to invert that, with the front-end inserting no movs and CP legalizing things.
In the grouping pass, instructions which need to be grouped (for fanins, etc) have their left / right neighbor pointers set up. In cases where there is a conflict (ie. one instruction cannot have two unique left or right neighbors), an additional mov instruction is inserted. This ensures that there is some possible valid register assignment at the later stages.
In the depth pass, a depth is calculated for each instruction node within its basic block. The depth is the sum of the required cycles (delay slots needed between two instructions plus one) of each instruction plus the max depth of any of its source instructions. (meta instructions don’t add to the depth). As an instruction’s depth is calculated, it is inserted into a per block list sorted by deepest instruction. Unreachable instructions and inputs are marked.
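As a sketch of the neighbor-pointer idea, the snippet below groups the sources of a fanin and falls back to a mov on conflict. It is deliberately conservative: any source that already has a neighbor from another group, or that appears twice in the same fanin, gets a mov, which may insert more movs than a real implementation would. All names are hypothetical.

```python
class Instr:
    def __init__(self, name, srcs=()):
        self.name = name
        self.srcs = list(srcs)
        self.left = None    # instruction that must land in the register before
        self.right = None   # instruction that must land in the register after


def group_fanin(fanin):
    """Set left/right neighbor pointers on the sources of one fanin."""
    seen = set()
    # First resolve conflicts, so the pointers below refer to the final sources.
    for i, src in enumerate(fanin.srcs):
        already_grouped = src.left is not None or src.right is not None
        if already_grouped or id(src) in seen:
            fanin.srcs[i] = Instr("mov", [src])  # fresh value, free to be placed
        seen.add(id(fanin.srcs[i]))
    # Then chain the (possibly replaced) sources together.
    for i, src in enumerate(fanin.srcs):
        src.left = fanin.srcs[i - 1] if i > 0 else None
        src.right = fanin.srcs[i + 1] if i + 1 < len(fanin.srcs) else None


x, y, z = Instr("x"), Instr("y"), Instr("z")
first = Instr("fanin", [x, y])
second = Instr("fanin", [y, z])   # y cannot sit next to both x and z
group_fanin(first)
group_fanin(second)
print([s.name for s in first.srcs], [s.name for s in second.srcs])
```

Here y keeps its place next to x, and the second group gets a mov of y that can be assigned a register next to z.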
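The depth calculation can be sketched as a memoized recursion over sources. For simplicity this assumes a single delay value per instruction (in practice the delay depends on the producer/consumer pair), and the class and function names are made up for illustration.

```python
class Instr:
    def __init__(self, name, srcs=(), delay=0, meta=False):
        self.name = name
        self.srcs = list(srcs)
        self.delay = delay   # assumed: delay slots this instruction requires
        self.meta = meta     # meta instructions (fanin/fanout, ...) cost nothing


def compute_depth(instr, depths):
    """Depth = own cycles (delay + 1, or 0 for meta) + deepest source."""
    if instr in depths:
        return depths[instr]
    deepest_src = max((compute_depth(s, depths) for s in instr.srcs), default=0)
    own = 0 if instr.meta else instr.delay + 1
    depths[instr] = own + deepest_src
    return depths[instr]


add = Instr("add", delay=3)
mul = Instr("mul", [add], delay=3)
fanin = Instr("fanin", [add, mul], meta=True)

depths = {}
compute_depth(fanin, depths)
# Per-block list, deepest instruction first.
by_depth = sorted(depths, key=depths.get, reverse=True)
print([(i.name, depths[i]) for i in by_depth])
```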
TODO: we should probably calculate both hard and soft depths (?) to try to coax additional instructions to fit in places where we need to use sync bits, such as after a texture fetch.
After the grouping pass, there are no more instructions to insert or remove. Start scheduling each basic block from the deepest node in the depth sorted list created by the depth pass, recursively trying to schedule each instruction after its source instructions plus delay slots. Insert nops as required.
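A very rough sketch of that recursion, using plain dicts keyed by instruction name. It only ever appends to the end of the block and pads with nops, so unlike a real scheduler it never goes back to fill a bubble with an independent instruction. The function and parameter names are assumptions for illustration.

```python
def schedule(order, srcs, delay):
    """Emit instructions so each lands after its sources plus their delay slots.

    order: instruction names, deepest first (as produced by a depth pass)
    srcs:  name -> list of source instruction names
    delay: name -> delay slots required after that instruction
    """
    out, pos = [], {}

    def emit(name):
        if name in pos:
            return
        for s in srcs[name]:
            emit(s)  # sources first
        # Earliest slot at which every source result is available.
        ready = max((pos[s] + delay[s] + 1 for s in srcs[name]), default=0)
        out.extend(["nop"] * (ready - len(out)))  # pad the bubble if needed
        pos[name] = len(out)
        out.append(name)

    for name in order:
        emit(name)
    return out


srcs = {"a": [], "b": [], "c": ["a"], "d": ["c", "b"]}
delay = {name: 3 for name in srcs}
print(schedule(["d", "c", "a", "b"], srcs, delay))
```

In this example b ends up after c and forces three more nops before d; placing b in the bubble between a and c would avoid that, which is the kind of thing the real pass tries to do.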
TODO