Branch at https://gitlab.freedesktop.org/idr/mesa/-/commits/constbuf-lolz has a commit that enables some basic addc and subb support for iadd64. This improves some cases that use iadd64 for address arithmetic.
Ice Lake shader-db:
total instructions in shared programs: 19915312 -> 19915820 (<.01%)
instructions in affected programs: 71169 -> 71677 (0.71%)
helped: 2
HURT: 103
total cycles in shared programs: 855253936 -> 855255634 (<.01%)
cycles in affected programs: 13087960 -> 13089658 (0.01%)
helped: 44
HURT: 55
total spills in shared programs: 6498 -> 6490 (-0.12%)
spills in affected programs: 283 -> 275 (-2.83%)
helped: 3
HURT: 0
total fills in shared programs: 8149 -> 8138 (-0.13%)
fills in affected programs: 858 -> 847 (-1.28%)
helped: 3
HURT: 0
Ice Lake fossil-db:
Instructions in all programs: 141431828 -> 141462306 (+0.0%)
Instructions helped: 1
Instructions hurt: 12805
Cycles in all programs: 9133187485 -> 9132468847 (-0.0%)
Cycles helped: 7432
Cycles hurt: 5356
Spills in all programs: 19583 -> 19581 (-0.0%)
Spills helped: 1
Spills hurt: 1
Fills in all programs: 31464 -> 31462 (-0.0%)
Fills helped: 1
Fills hurt: 1
Introduce six new opcodes: iadd64_split[234]_(hi|lo). The count is the number of 32-bit sources for the instruction. The first sources are the low 32-bits of each addend, and the remaining sources are the high 32-bits of each addend. The _hi or _lo suffix selects whether the upper or lower 32-bits of the 64-bit result is written to the destination.
The actual instruction definitions are near the bottom.
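The intended semantics can be sketched as a scalar Python model (helper names here are illustrative, not the real NIR opcode implementations), showing that a matched _lo/_hi pair reconstructs a full 64-bit add:

```python
MASK32 = 0xffffffff

def iadd64_split4_lo(a_lo, b_lo, a_hi, b_hi):
    # dst = src0 + src1 (32-bit wraparound)
    return (a_lo + b_lo) & MASK32

def iadd64_split4_hi(a_lo, b_lo, a_hi, b_hi):
    # dst = (int)(src0 + src1 < src1) + src2 + src3
    carry = 1 if (a_lo + b_lo) & MASK32 < b_lo else 0
    return (carry + a_hi + b_hi) & MASK32

def add64(a, b):
    # Reassemble a 64-bit add from the matched pair.
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    lo = iadd64_split4_lo(a_lo, b_lo, a_hi, b_hi)
    hi = iadd64_split4_hi(a_lo, b_lo, a_hi, b_hi)
    return (hi << 32) | lo
```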
The idea is that the instructions will appear in matched pairs, and the driver will generate nearly identical code for each. After lower_simd_width(), we might further lower iadd64_split4_hi to:
add(8) g101<1>UD g84<8,8,1>UD g99<8,8,1>UD
addc(8) null<1>UD g16<8,8,1>UD g1<8,8,1>UD
add(8) g101<1>UD g101<8,8,1>UD acc0<8,8,1>UD
On DG2 it might be possible to use add3, but I don't know if that can use the accumulator as a source.
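For clarity, the three instructions above can be modeled in scalar Python (illustrative names only; the actual lowering is SIMD8, with the addc carry-out standing in for acc0):

```python
MASK32 = 0xffffffff

def lower_split4_hi(a_lo, b_lo, a_hi, b_hi):
    tmp = (a_hi + b_hi) & MASK32   # add(8)  g101, g84, g99
    acc0 = (a_lo + b_lo) >> 32     # addc(8) null, g16, g1  (carry-out -> acc0)
    return (tmp + acc0) & MASK32   # add(8)  g101, g101, acc0
```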
After lower_simd_width(), we might further lower iadd64_split4_lo to:
addc(8) g102<1>UD g16<8,8,1>UD g1<8,8,1>UD
We'd then cross our fingers that backend optimization passes would combine the two addc instructions!
Open question: Is it possible to lower iadd64_split4_hi before lower_simd_width() to generate a SIMD16 or SIMD32 version of the first add instruction?
add(16) g100<1>UD g84<8,8,1>UD g99<8,8,1>UD
addc(8) null<1>UD g16<8,8,1>UD g1<8,8,1>UD
add(8) g100<1>UD g100<8,8,1>UD acc0<8,8,1>UD
addc(8) null<1>UD g16.1<8,8,1>UD g1.1<8,8,1>UD
add(8) g100.1<1>UD g100.1<8,8,1>UD acc0<8,8,1>UD
(I know those registers are wrong, but I hope that communicates the point.)
Another possibility is to lower iadd64_split4_hi without any SIMD lowering to something like:
add(16) g101<1>UD g84<8,8,1>UD g99<8,8,1>UD
add.o.f0(16) null<1>UD g16<8,8,1>UD g1<8,8,1>UD
(+f0) add(16) g101<1>UD g101<8,8,1>UD 1UD
This replaces the accumulator with a flag dependency. It's not immediately obvious to me which is better. It would be easy enough to try both.
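A scalar sketch of the flag-based variant (this assumes add.o on UD sources sets the flag on unsigned carry-out, which is worth double-checking against the ISA docs; names are illustrative):

```python
MASK32 = 0xffffffff

def lower_split4_hi_flag(a_lo, b_lo, a_hi, b_hi):
    dst = (a_hi + b_hi) & MASK32       # add(16)      g101, g84, g99
    f0 = (a_lo + b_lo) > MASK32        # add.o.f0(16) null, g16, g1
    if f0:                             # (+f0) add(16) g101, g101, 1
        dst = (dst + 1) & MASK32
    return dst
```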
The instruction definitions:
# Takes four sources, a_lo, b_lo, a_hi, and b_hi, to produce the high 32-bits
# of a 64-bit addition.
quadop("iadd64_split4_hi", tuint32, _2src_commutative,
"dst = (int)(src0 + src1 < src1) + src2 + src3;")
# Takes four sources, a_lo, b_lo, a_hi, and b_hi, to produce the low 32-bits
# of a 64-bit addition.
quadop("iadd64_split4_lo", tuint32, _2src_commutative,
"dst = src0 + src1;")
# Takes three sources, a_lo, b_lo, and a_hi, to produce the high 32-bits of a
# 64-bit addition.
triop("iadd64_split3_hi", tuint32, _2src_commutative,
"dst = (int)(src0 + src1 < src1) + src2;")
# Takes three sources, a_lo, b_lo, and a_hi, to produce the low 32-bits of a
# 64-bit addition.
triop("iadd64_split3_lo", tuint32, _2src_commutative,
"dst = src0 + src1;")
# Takes two sources, a_lo and b_lo, to produce the high 32-bits of a 64-bit
# addition. This is basically uadd_carry. DO NOT LOWER THIS TO uadd_carry!
# This will break backends that want to use these iadd64_split instructions.
binop("iadd64_split2_hi", tuint32, _2src_commutative + associative,
"dst = (int)(src0 + src1 < src1);")
# Takes two sources, a_lo and b_lo, to produce the low 32-bits of a 64-bit
# addition.
binop("iadd64_split2_lo", tuint32, _2src_commutative + associative,
"dst = src0 + src1;")
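As a sanity check, the split2 pair should reconstruct a full 64-bit add whenever both high words are zero (scalar Python model of the const_expr strings; names are illustrative):

```python
MASK32 = 0xffffffff

def iadd64_split2_lo(a_lo, b_lo):
    # dst = src0 + src1
    return (a_lo + b_lo) & MASK32

def iadd64_split2_hi(a_lo, b_lo):
    # dst = (int)(src0 + src1 < src1)
    return 1 if (a_lo + b_lo) & MASK32 < b_lo else 0
```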
Some algebraic optimizations to reduce the instruction count:
# If either "hi" component is zero, reduce to the next smaller number of
# parameters.
(('iadd64_split4_hi', a, b, 0, c), ('iadd64_split3_hi', a, b, c)),
(('iadd64_split4_lo', a, b, 0, c), ('iadd64_split3_lo', a, b, c)),
(('iadd64_split4_hi', a, b, c, 0), ('iadd64_split3_hi', a, b, c)),
(('iadd64_split4_lo', a, b, c, 0), ('iadd64_split3_lo', a, b, c)),
(('iadd64_split3_hi', a, b, 0), ('iadd64_split2_hi', a, b)),
(('iadd64_split3_lo', a, b, 0), ('iadd64_split2_lo', a, b)),
# If either "lo" component is zero, there cannot be any carry into the upper
# 32-bits. Convert to regular 32-bit addition.
(('iadd64_split4_hi', a, 0, b, c), ('iadd', b, c)),
(('iadd64_split4_lo', a, 0, b, c), a),
(('iadd64_split3_hi', a, 0, b), b),
(('iadd64_split3_lo', a, 0, b), a),
(('iadd64_split2_hi', a, 0), 0),
(('iadd64_split2_lo', a, 0), a),
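These rules can be spot-checked with a scalar Python model of the const_expr strings above (illustrative helpers, not real NIR):

```python
MASK32 = 0xffffffff

def carry(s0, s1):
    # Carry-out of a 32-bit add: (int)(src0 + src1 < src1)
    return 1 if (s0 + s1) & MASK32 < s1 else 0

def split4_hi(s0, s1, s2, s3): return (carry(s0, s1) + s2 + s3) & MASK32
def split4_lo(s0, s1, s2, s3): return (s0 + s1) & MASK32
def split3_hi(s0, s1, s2):     return (carry(s0, s1) + s2) & MASK32
def split3_lo(s0, s1, s2):     return (s0 + s1) & MASK32
def split2_hi(s0, s1):         return carry(s0, s1)
```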