Branch at https://gitlab.freedesktop.org/idr/mesa/-/commits/constbuf-lolz has a commit that enables some basic addc and subb support for iadd64. This improves some cases that use iadd64 for address arithmetic.

Ice Lake shader-db:

    Ice Lake
    total instructions in shared programs: 19915312 -> 19915820 (<.01%)
    instructions in affected programs: 71169 -> 71677 (0.71%)
    helped: 2
    HURT: 103

    total cycles in shared programs: 855253936 -> 855255634 (<.01%)
    cycles in affected programs: 13087960 -> 13089658 (0.01%)
    helped: 44
    HURT: 55

    total spills in shared programs: 6498 -> 6490 (-0.12%)
    spills in affected programs: 283 -> 275 (-2.83%)
    helped: 3
    HURT: 0

    total fills in shared programs: 8149 -> 8138 (-0.13%)
    fills in affected programs: 858 -> 847 (-1.28%)
    helped: 3
    HURT: 0

Ice Lake fossil-db:

    Instructions in all programs: 141431828 -> 141462306 (+0.0%)
    Instructions helped: 1
    Instructions hurt: 12805

    Cycles in all programs: 9133187485 -> 9132468847 (-0.0%)
    Cycles helped: 7432
    Cycles hurt: 5356

    Spills in all programs: 19583 -> 19581 (-0.0%)
    Spills helped: 1
    Spills hurt: 1

    Fills in all programs: 31464 -> 31462 (-0.0%)
    Fills helped: 1
    Fills hurt: 1

Introduce six new opcodes: iadd64_split[234]_(hi|lo). The number in the name is the count of 32-bit sources the instruction takes. The first two sources are the low 32 bits of each addend, and any remaining sources are the high 32 bits of the addends. The _hi or _lo suffix selects whether the upper or lower 32 bits of the 64-bit result are written to the destination.
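
As a rough reference model of those semantics (not code from the branch), the four-source pair can be written in Python and compared against a plain 64-bit add. `MASK32` and `split` are made-up helper names for illustration only.

```python
MASK32 = 0xFFFFFFFF

def iadd64_split4_lo(a_lo, b_lo, a_hi, b_hi):
    """Low 32 bits of the 64-bit sum; the high halves do not affect it."""
    return (a_lo + b_lo) & MASK32

def iadd64_split4_hi(a_lo, b_lo, a_hi, b_hi):
    """High 32 bits: a_hi + b_hi plus the carry out of the low 32-bit add."""
    carry = 1 if ((a_lo + b_lo) & MASK32) < b_lo else 0
    return (carry + a_hi + b_hi) & MASK32

def split(x):
    """Hypothetical helper: split a 64-bit value into (lo, hi) 32-bit halves."""
    return x & MASK32, (x >> 32) & MASK32

a = 0x00000001_FFFFFFFF
b = 0x00000002_00000001
a_lo, a_hi = split(a)
b_lo, b_hi = split(b)

lo = iadd64_split4_lo(a_lo, b_lo, a_hi, b_hi)
hi = iadd64_split4_hi(a_lo, b_lo, a_hi, b_hi)
result = (hi << 32) | lo
expected = (a + b) & 0xFFFFFFFFFFFFFFFF
print(hex(result), hex(expected))
```

The three- and two-source variants follow the same pattern with the missing high halves treated as zero.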

The actual instruction definitions are near the bottom.

The idea is that the instructions will appear in matched pairs, and the driver will generate nearly identical code for each. After lower_simd_width(), we might further lower iadd64_split4_hi to:

      add(8)          g101<1>UD       g84<8,8,1>UD    g99<8,8,1>UD
      addc(8)         null<1>UD       g16<8,8,1>UD    g1<8,8,1>UD
      add(8)          g101<1>UD       g101<8,8,1>UD   acc0<8,8,1>UD

On DG2 it might be possible to use add3, but I don't know if that can use the accumulator as a source.

After lower_simd_width(), we might further lower iadd64_split4_lo to:

      addc(8)         g102<1>UD       g16<8,8,1>UD    g1<8,8,1>UD

We'd then cross our fingers that backend optimization passes would combine the two addc instructions!
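
To make the intent of the two sequences concrete, here is a per-channel Python sketch of the lowering. It assumes `addc` produces the low 32 bits of the sum in its destination and the carry-out in acc0, which is how the assembly above uses it. The function names are invented for this example and are not backend functions.

```python
MASK32 = 0xFFFFFFFF

def addc(src0, src1):
    """One-channel model of addc: returns (32-bit sum, carry-out placed in acc0)."""
    total = src0 + src1
    return total & MASK32, total >> 32

def lower_split4_hi(a_lo, b_lo, a_hi, b_hi):
    dst = (a_hi + b_hi) & MASK32        # add   dst,  a_hi, b_hi
    _, acc0 = addc(a_lo, b_lo)          # addc  null, a_lo, b_lo
    return (dst + acc0) & MASK32        # add   dst,  dst,  acc0

def lower_split4_lo(a_lo, b_lo, a_hi, b_hi):
    dst, _ = addc(a_lo, b_lo)           # addc  dst,  a_lo, b_lo
    return dst

def lower_pair(a_lo, b_lo, a_hi, b_hi):
    """What the matched pair computes if the two addc instructions are merged."""
    hi = (a_hi + b_hi) & MASK32
    lo, acc0 = addc(a_lo, b_lo)         # single addc feeds both results
    return lo, (hi + acc0) & MASK32

a_lo, a_hi = 0xFFFFFFFF, 0x00000001
b_lo, b_hi = 0x00000001, 0x00000002
hi = lower_split4_hi(a_lo, b_lo, a_hi, b_hi)
lo = lower_split4_lo(a_lo, b_lo, a_hi, b_hi)
print(hex(hi), hex(lo), lower_pair(a_lo, b_lo, a_hi, b_hi) == (lo, hi))
```

The merged form is the payoff being hoped for: three instructions for the whole 64-bit add instead of four.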

Open question: Is it possible to lower iadd64_split4_hi before lower_simd_width() to generate a SIMD16 or SIMD32 version of the first add instruction?

      add(16)         g100<1>UD       g84<8,8,1>UD    g99<8,8,1>UD
      addc(8)         null<1>UD       g16<8,8,1>UD    g1<8,8,1>UD
      add(8)          g100<1>UD       g100<8,8,1>UD   acc0<8,8,1>UD
      addc(8)         null<1>UD       g16.1<8,8,1>UD  g1.1<8,8,1>UD
      add(8)          g100.1<1>UD     g100.1<8,8,1>UD acc0<8,8,1>UD

(I know those registers are wrong, but I hope that communicates the point.)

Another possibility is to lower iadd64_split4_hi without any SIMD lowering to something like:

      add(16)         g101<1>UD       g84<8,8,1>UD    g99<8,8,1>UD
      add.o.f0(16)    null<1>UD       g16<8,8,1>UD    g1<8,8,1>UD
(+f0) add(16)         g101<1>UD       g101<8,8,1>UD   1UD

This replaces the accumulator with a flag dependency. It's not immediately obvious to me which is better. It would be easy enough to try both.
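
Whichever performs better, the two forms should compute the same value. The Python sketch below compares them per channel. It assumes the `.o` conditional modifier on UD operands sets the flag exactly when the unsigned add overflows; that is my reading of the sequence above, not something verified against hardware documentation. The function names are hypothetical.

```python
import random

MASK32 = 0xFFFFFFFF

def hi_via_accumulator(a_lo, b_lo, a_hi, b_hi):
    dst = (a_hi + b_hi) & MASK32        # add   dst,  a_hi, b_hi
    acc0 = (a_lo + b_lo) >> 32          # addc  null, a_lo, b_lo
    return (dst + acc0) & MASK32        # add   dst,  dst,  acc0

def hi_via_flag(a_lo, b_lo, a_hi, b_hi):
    dst = (a_hi + b_hi) & MASK32        # add        dst,  a_hi, b_hi
    f0 = (a_lo + b_lo) > MASK32         # add.o.f0   null, a_lo, b_lo (assumed: unsigned overflow)
    if f0:
        dst = (dst + 1) & MASK32        # (+f0) add  dst,  dst,  1
    return dst

rng = random.Random(0)                  # fixed seed so the check is repeatable
samples = [tuple(rng.getrandbits(32) for _ in range(4)) for _ in range(1000)]
samples += [(MASK32, 1, 0, 0), (MASK32, MASK32, MASK32, MASK32), (0, 0, 0, 0)]

mismatches = [s for s in samples if hi_via_accumulator(*s) != hi_via_flag(*s)]
print(len(mismatches))
```

This only shows functional equivalence; it says nothing about scheduling or register pressure, which is what would decide between the two.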

The instruction definitions:

    # Takes four sources, a_lo, b_lo, a_hi, and b_hi, to produce the high 32 bits
    # of a 64-bit addition.
    quadop("iadd64_split4_hi", tuint32, _2src_commutative,
           "dst = (int)(src0 + src1 < src1) + src2 + src3;")

    # Takes four sources, a_lo, b_lo, a_hi, and b_hi, to produce the low 32 bits
    # of a 64-bit addition.
    quadop("iadd64_split4_lo", tuint32, _2src_commutative,
           "dst = src0 + src1;")

    # Takes three sources, a_lo, b_lo, and a_hi, to produce the high 32 bits of a
    # 64-bit addition.
    triop("iadd64_split3_hi", tuint32, _2src_commutative,
          "dst = (int)(src0 + src1 < src1) + src2;")

    # Takes three sources, a_lo, b_lo, and a_hi, to produce the low 32 bits of a
    # 64-bit addition.
    triop("iadd64_split3_lo", tuint32, _2src_commutative,
          "dst = src0 + src1;")

    # Takes two sources, a_lo and b_lo, to produce the high 32 bits of a 64-bit
    # addition.  This is basically uadd_carry. DO NOT LOWER THIS TO uadd_carry!
    # That would break backends that want to use these iadd64_split instructions.
    # Not marked associative: the carry-out of an add is not an associative
    # operation.
    binop("iadd64_split2_hi", tuint32, _2src_commutative,
          "dst = (int)(src0 + src1 < src1);")

    # Takes two sources, a_lo and b_lo, to produce the low 32 bits of a 64-bit
    # addition.
    binop("iadd64_split2_lo", tuint32, _2src_commutative + associative,
          "dst = src0 + src1;")

Some optimizations to reduce the instructions:

    # If either "hi" component is zero, reduce to the next smaller number of
    # parameters.
    (('iadd64_split4_hi', a, b, 0, c), ('iadd64_split3_hi', a, b, c)),
    (('iadd64_split4_lo', a, b, 0, c), ('iadd64_split3_lo', a, b, c)),
    (('iadd64_split4_hi', a, b, c, 0), ('iadd64_split3_hi', a, b, c)),
    (('iadd64_split4_lo', a, b, c, 0), ('iadd64_split3_lo', a, b, c)),

    (('iadd64_split3_hi', a, b, 0),    ('iadd64_split2_hi', a, b)),
    (('iadd64_split3_lo', a, b, 0),    ('iadd64_split2_lo', a, b)),

    # If either "lo" component is zero, there cannot be any carry into the upper
    # 32 bits.  The "hi" result is a regular 32-bit addition of the high halves,
    # and the "lo" result is just the other low half.
    (('iadd64_split4_hi', a, 0, b, c), ('iadd', b, c)),
    (('iadd64_split4_lo', a, 0, b, c), a),

    (('iadd64_split3_hi', a, 0, b), b),
    (('iadd64_split3_lo', a, 0, b), a),

    (('iadd64_split2_hi', a, 0), 0),
    (('iadd64_split2_lo', a, 0), a),
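
Rewrite rules like these are easy to get subtly wrong, so it can be worth checking them against a reference model. The Python sketch below does that exhaustively for a selection of the rules, using a 4-bit word in place of 32 bits so every input combination can be tried. The rules do not depend on the word size, but that shortcut is an assumption of this sketch. The model functions mirror the constant expressions in the definitions above; none of this is Mesa code.

```python
from itertools import product

BITS = 4                                 # small word so the check is exhaustive
MASK = (1 << BITS) - 1

def carry(a_lo, b_lo):
    """Carry out of the low add, written as in the definitions: sum < src1."""
    return 1 if ((a_lo + b_lo) & MASK) < b_lo else 0

def split4_hi(a, b, c, d): return (carry(a, b) + c + d) & MASK
def split3_hi(a, b, c):    return (carry(a, b) + c) & MASK
def split3_lo(a, b, c):    return (a + b) & MASK
def split2_hi(a, b):       return carry(a, b)
def split2_lo(a, b):       return (a + b) & MASK
def iadd(a, b):            return (a + b) & MASK

# (description, expression before the rewrite, expression after the rewrite)
RULES = [
    ("split4_hi(a,b,0,c) -> split3_hi(a,b,c)",
     lambda a, b, c: split4_hi(a, b, 0, c), lambda a, b, c: split3_hi(a, b, c)),
    ("split4_hi(a,b,c,0) -> split3_hi(a,b,c)",
     lambda a, b, c: split4_hi(a, b, c, 0), lambda a, b, c: split3_hi(a, b, c)),
    ("split3_hi(a,b,0) -> split2_hi(a,b)",
     lambda a, b, c: split3_hi(a, b, 0), lambda a, b, c: split2_hi(a, b)),
    ("split4_hi(a,0,b,c) -> iadd(b,c)",
     lambda a, b, c: split4_hi(a, 0, b, c), lambda a, b, c: iadd(b, c)),
    ("split3_hi(a,0,b) -> b",
     lambda a, b, c: split3_hi(a, 0, b), lambda a, b, c: b),
    ("split3_lo(a,0,b) -> a",
     lambda a, b, c: split3_lo(a, 0, b), lambda a, b, c: a),
    ("split2_hi(a,0) -> 0",
     lambda a, b, c: split2_hi(a, 0), lambda a, b, c: 0),
    ("split2_lo(a,0) -> a",
     lambda a, b, c: split2_lo(a, 0), lambda a, b, c: a),
]

def rule_holds(before, after):
    """True if both sides agree for every combination of a, b, c."""
    return all(before(a, b, c) == after(a, b, c)
               for a, b, c in product(range(MASK + 1), repeat=3))

failures = [name for name, before, after in RULES if not rule_holds(before, after)]
print(failures or "all rules hold")
```

Adding a new rule to the list is a one-line change, which makes this a cheap sanity check to run before putting a pattern into the algebraic pass.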