17:39fdobridge: <gfxstrand> Right now, NAK should be lowering away 64-bit shifts already. The only 64-bit op we implement right now is iadd
18:34benjaminl: the instr in the IR is OpShf, which has low and high srcs, but nak_from_nir always sets one of them to zero
18:34benjaminl: would it be a good idea for me to replace this with OpShl, and then emit SHF on sm75 only?
18:35benjaminl: right now I'm doing the reverse of this, where the IR still has the full SHF, and then the sm50 encoding asserts that one of the two srcs is zero and emits SHL/SHR
18:39karolherbst: apparently on sm50 we use SHF only for 64 bit shifts
18:40karolherbst: the two sources put together make up a 64 bit value
18:40benjaminl: yeah, SHL/SHR doesn't exist on sm75, right?
18:40benjaminl: I haven't checked this
18:40karolherbst: correct
18:41karolherbst: on sm75+ SHL/SHR are just assembly aliases for SHF
18:41karolherbst: but on sm50 it _should_ work the same
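A minimal sketch of the funnel-shift idea discussed above: the two 32-bit sources form one 64-bit value, and a plain 32-bit shift falls out when one source is zero. The semantics are modeled on the PTX `shf` instruction in wrap mode. That the SASS SHF encoding behaves identically is an assumption, and the function names are hypothetical, not NAK's.

```rust
/// Funnel shift left: the high 32 bits of ((hi:lo) << shift).
/// Wrap mode is assumed, so only the low 5 bits of `shift` are used.
fn shf_l(lo: u32, hi: u32, shift: u32) -> u32 {
    let n = shift & 31;
    let v = ((hi as u64) << 32) | lo as u64;
    ((v << n) >> 32) as u32
}

/// Funnel shift right: the low 32 bits of ((hi:lo) >> shift).
fn shf_r(lo: u32, hi: u32, shift: u32) -> u32 {
    let n = shift & 31;
    let v = ((hi as u64) << 32) | lo as u64;
    (v >> n) as u32
}

/// A 32-bit shift left is a funnel shift with the low source zeroed.
fn shl32(x: u32, shift: u32) -> u32 {
    shf_l(0, x, shift)
}

/// A 32-bit logical shift right is a funnel shift with the high source zeroed.
fn shr32(x: u32, shift: u32) -> u32 {
    shf_r(x, 0, shift)
}
```

With both sources populated, `shf_l(lo, hi, n)` yields the high word of a 64-bit left shift by `n` (for `n` below 32), which matches the remark that SHF is what 64-bit shifts use.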
18:41benjaminl: nak has a similar situation with IMad64, where we're generating this instruction for imul/imul_high and then discarding one of the outputs
18:42benjaminl: IMAD64 doesn't exist on sm50, though, so I currently have the legalization pass replacing it with IMAD.LO.CC and IMAD.HI.X
18:42karolherbst: imad64 doesn't exist anywhere
18:42benjaminl: wait what's encode_sm75 doing then
18:43karolherbst: there is no 64 bit alu on nvidia
18:43karolherbst: only shifts and selects exist in 64 bit
18:43karolherbst: well.. and some fp64 stuff
18:43karolherbst: also
18:44karolherbst: on sm50 you don't want to use IMAD
18:44karolherbst: you want to use XMAD
18:45karolherbst: XMAD is a constant-latency 16 bit IMAD and is usually faster than using IMAD
18:46benjaminl: ah, that's good to know
18:46karolherbst: you can also always compile some ptx to SASS and use nvdisasm to see what Nvidia generates
18:46fdobridge: <gfxstrand> Oh, fun. More 16-bit multiply shenanigans. We had that on Intel, too.
18:47fdobridge: <gfxstrand> 🙄
18:47fdobridge: <karolherbst🐧🦀> yeah.. the tldr is that IMAD has variable runtime
18:47fdobridge: <karolherbst🐧🦀> and even using 3 XMADs to make a full 32 bit IMAD is faster
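The arithmetic behind "3 XMADs for a full 32 bit IMAD" can be sketched as below. This only shows the decomposition into three 16x16 multiplies. `mad16` is a hypothetical stand-in for a 16-bit multiply-add, and the way real XMAD modifiers fold the shift and merge into the instructions is not modeled.

```rust
/// Hypothetical building block: a 16x16 -> 32-bit multiply plus a 32-bit addend.
fn mad16(a: u16, b: u16, c: u32) -> u32 {
    // 0xffff * 0xffff fits in 32 bits, so only the add can wrap.
    (a as u32 * b as u32).wrapping_add(c)
}

/// Low 32 bits of a * b + c, built from three 16-bit multiplies.
fn imad32_from_16(a: u32, b: u32, c: u32) -> u32 {
    let (a_lo, a_hi) = (a as u16, (a >> 16) as u16);
    let (b_lo, b_hi) = (b as u16, (b >> 16) as u16);

    // The two cross terms only contribute to the upper half of the result.
    let cross = mad16(a_hi, b_lo, 0);
    let cross = mad16(a_lo, b_hi, cross);

    // a_hi * b_hi would land entirely above bit 31, so it is never computed.
    mad16(a_lo, b_lo, c).wrapping_add(cross << 16)
}
```

The high-by-high product is dropped because it only affects bits 32 and up, which is why three multiplies are enough for the low 32 bits.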
18:48fdobridge: <karolherbst🐧🦀> luckily, all the subops were already REed: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/codegen/nv50_ir.h?ref_type=heads#L284-305
18:48benjaminl: ... weird, are there any situations where you would want IMAD?
18:48karolherbst: I honestly don't know
18:48karolherbst: nvidia never seems to use it
18:49karolherbst: but maybe there are some rough corner cases
18:50fdobridge: <karolherbst🐧🦀> maybe they weren't sure if the hardware is correct?
18:50fdobridge: <gfxstrand> Those really aren't 64-bit, though. Just two sources which combine to a 64-bit thing. The only real 64-bit integer thing is conversions.
18:51fdobridge: <karolherbst🐧🦀> yeah...
18:51fdobridge: <karolherbst🐧🦀> I guess conversions do count as a proper 64 bit alu
18:52fdobridge: <gfxstrand> Could be. Could also be that they needed it for imul_high and just made it do imul too, because why not?
18:52fdobridge: <karolherbst🐧🦀> ohhh.... I just found a cool instruction: I2IP
18:52fdobridge: <karolherbst🐧🦀> heh..
18:52fdobridge: <karolherbst🐧🦀> Turing also got an actual `I2I`
18:53fdobridge: <karolherbst🐧🦀> always saturating though
18:53fdobridge: <karolherbst🐧🦀> and always S32 source
18:53fdobridge: <karolherbst🐧🦀> dest only 8/16
18:53fdobridge: <gfxstrand> Saturating is what you want
18:53fdobridge: <karolherbst🐧🦀> yeah
18:53fdobridge: <karolherbst🐧🦀> just curious why Volta doesn't have it
18:53fdobridge: <gfxstrand> 🤷🏻‍♀️
18:54fdobridge: <karolherbst🐧🦀> I2IP is like F2FP
18:54fdobridge: <karolherbst🐧🦀> though F2FP is ampere only
18:54fdobridge: <gfxstrand> For most integer down-casts, you just want &
18:54fdobridge: <karolherbst🐧🦀> yeah
18:54fdobridge: <karolherbst🐧🦀> anyway.. no 64 bit alu on nvidia 😄
18:55fdobridge: <gfxstrand> There's no actual conversion. The only interesting integer down-conversion is a saturating one.
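To illustrate the difference between the two kinds of down-cast mentioned here, a small sketch: a truncating down-cast is just a mask, while a saturating one clamps to the destination range first. The function names are hypothetical, and this models the arithmetic only, not the `I2I` encoding.

```rust
/// Truncating 32 -> 16 bit down-cast: keep the low 16 bits, nothing else.
fn trunc_to_u16(x: u32) -> u16 {
    (x & 0xffff) as u16
}

/// Saturating signed 32 -> 16 bit down-cast: clamp to the i16 range.
fn sat_to_i16(x: i32) -> i16 {
    x.clamp(i16::MIN as i32, i16::MAX as i32) as i16
}
```

For example, 70000 truncates to 4464 but saturates to 32767.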
18:55fdobridge: <gfxstrand> Are there 64-bit ISetP?
18:56fdobridge: <karolherbst🐧🦀> mhhh.. I thought there is/was
18:57benjaminl: I'm still kinda confused about sm75 here. does it have a real IMAD64?
18:57fdobridge: <gfxstrand> No, it doesn't.
18:57fdobridge: <karolherbst🐧🦀> mhh.. there is also `SGXT`
18:58benjaminl: is the encode_imad64 function in encode_sm75 emitting something else?
18:58fdobridge: <karolherbst🐧🦀> ohh we even have `OP_SGXT` in codegen 🙃 do we even use it
18:58fdobridge: <gfxstrand> It has IMAD.HI which returns the top 32 bits of the result. You have to use those to build a full 64-bit imad
18:58fdobridge: <karolherbst🐧🦀> for EXTBF lowering.. figures
18:59fdobridge: <gfxstrand> Hrm... Oh, right. The IMad64 is the version that takes 32-bit sources and produces a 64-bit result.
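A rough sketch of how a 32x32 + 64 -> 64 multiply-add can be assembled from 32-bit pieces, in the spirit of the IMAD.LO.CC / IMAD.HI.X pair mentioned earlier. It is unsigned only, the function names are hypothetical, and the carry is passed explicitly rather than through a condition-code register.

```rust
/// Low 32 bits of the 32x32 product.
fn imul_lo(a: u32, b: u32) -> u32 {
    a.wrapping_mul(b)
}

/// High 32 bits of the unsigned 32x32 product.
fn imul_hi(a: u32, b: u32) -> u32 {
    ((a as u64 * b as u64) >> 32) as u32
}

/// Returns (lo, hi) of a * b + (c_hi:c_lo), wrapping at 64 bits.
fn imad64(a: u32, b: u32, c_lo: u32, c_hi: u32) -> (u32, u32) {
    // Low half: multiply-add that also reports the carry out.
    let (lo, carry) = imul_lo(a, b).overflowing_add(c_lo);
    // High half: multiply-high plus the addend's high word plus the carry in.
    let hi = imul_hi(a, b).wrapping_add(c_hi).wrapping_add(carry as u32);
    (lo, hi)
}
```

Discarding `hi` gives imul and discarding `lo` (with a zero addend) gives imul_high, which matches the "generate it and drop one output" pattern described for IMad64.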
18:59benjaminl: yeah
19:00benjaminl: ah, and this isn't "really 64-bit" because it doesn't take 64-bit sources?
19:00fdobridge: <karolherbst🐧🦀> ohh right.. the only 64 bit SET was a fp64 one
19:00fdobridge: <gfxstrand> We really need to make the NIR lowering pass produce those... Need to think about that.
19:00fdobridge: <karolherbst🐧🦀> `DSETP`
19:02fdobridge: <gfxstrand> ISETP does have two destination predicates. We need to figure out what the second one does. I suspect it's a free equality check so you can chain two ISETP to get a 64-bit one.
19:02fdobridge: <karolherbst🐧🦀> `IMNMX` is Turing+ only 🙃
19:02fdobridge: <karolherbst🐧🦀> and they ditched `DMNMX` with Volta+
19:04fdobridge: <karolherbst🐧🦀> @gfxstrand the other predicate is the negation of the first one
19:04fdobridge: <gfxstrand> Unsurprising, given that there's no NaN behavior to worry about with imin/max. We can just ask NIR to lower, I think.
19:04fdobridge: <karolherbst🐧🦀> uhm.. for 32 bit ops
19:04fdobridge: <karolherbst🐧🦀> the second predicate is always true with `.EX`
19:05fdobridge: <gfxstrand> Really? That's dumb. Why would I want a predicate and its negation? I can just negate the first predicate.
19:05fdobridge: <karolherbst🐧🦀> 🤷
19:05benjaminl: hmm, why is it useful to have a negated predicate output when most places that take a predicate already allow negation?
19:05fdobridge: <gfxstrand> That feels off to me.
19:05fdobridge: <karolherbst🐧🦀> ohhh wait
19:05fdobridge: <karolherbst🐧🦀> it's a _bit_ smarter than that
19:05fdobridge: <karolherbst🐧🦀> the negation happens before the chaining
19:05benjaminl: ahhh
19:06fdobridge: <karolherbst🐧🦀> so you do the chaining with the actual and the negated result of the operands
19:06fdobridge: <karolherbst🐧🦀> I guess it's only useful for `.XOR` and `.OR` then
19:08fdobridge: <gfxstrand> We should get my unit test framework in an upstreamable state and have tests for this stuff. It was really handy when figuring out the exact behavior of IADD3.
19:08fdobridge: <karolherbst🐧🦀> and I guess it does the same with `.EX` 🙃 I just got confused by the doc
19:08fdobridge: <karolherbst🐧🦀> without chaining the second output pred is always true
19:08fdobridge: <gfxstrand> What do you mean by "without chaining"?
19:08fdobridge: <karolherbst🐧🦀> a plain `ISETP`
19:08fdobridge: <karolherbst🐧🦀> not `ISETP.AND` or whatever
19:08fdobridge: <gfxstrand> Without .EX?
19:09fdobridge: <karolherbst🐧🦀> .EX can also chain
19:09fdobridge: <karolherbst🐧🦀> then you can even input two preds
19:11fdobridge: <karolherbst🐧🦀> no idea how `.EX` works
19:12fdobridge: <karolherbst🐧🦀> anyway..
19:12fdobridge: <karolherbst🐧🦀> oP0 = (src0 cmp src1) .op inP0
19:12fdobridge: <karolherbst🐧🦀> oP1 = (!(src0 cmp src1)) .op inP0
19:13fdobridge: <karolherbst🐧🦀> maybe with `.EX` you chain both inputs?
19:14fdobridge: <karolherbst🐧🦀> unchained `.EX` still has the second input predicate...
19:14fdobridge: <karolherbst🐧🦀> unchained I think is basically `.AND` with the first input pred set to `PT`
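The two formulas above can be written out as a small model. This is only a transcription of the `oP0`/`oP1` equations as stated in the chat, not verified hardware behavior. The type and function names are hypothetical, and `.EX` is not modeled.

```rust
/// Boolean op used to chain the compare result with the input predicate.
#[derive(Clone, Copy)]
enum BoolOp {
    And,
    Or,
    Xor,
}

fn combine(op: BoolOp, x: bool, y: bool) -> bool {
    match op {
        BoolOp::And => x && y,
        BoolOp::Or => x || y,
        BoolOp::Xor => x != y,
    }
}

/// Returns (oP0, oP1) for a compare result chained with `in_p0`.
/// The compare is negated before chaining to produce oP1.
fn isetp(cmp: bool, op: BoolOp, in_p0: bool) -> (bool, bool) {
    (combine(op, cmp, in_p0), combine(op, !cmp, in_p0))
}

/// Unchained form, assumed to be `.AND` with the input predicate set to PT (true).
fn isetp_plain(cmp: bool) -> (bool, bool) {
    isetp(cmp, BoolOp::And, true)
}
```

Under this model the unchained form returns the compare and its negation, while with `.OR` and a true input predicate both outputs are true, so `oP1` is not simply `!oP0` once chaining is involved.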
19:20fdobridge: <gfxstrand> Okay. That's something potentially useful. IDK how to use it yet but might be able to. 😅
19:21fdobridge: <karolherbst🐧🦀> you can do it to optimize `OR`s and such away
19:21fdobridge: <karolherbst🐧🦀> uhm.. not sure about the oP1 stuff tho
19:21fdobridge: <karolherbst🐧🦀> maybe if the code does both
19:21fdobridge: <karolherbst🐧🦀> ohh
19:21fdobridge: <karolherbst🐧🦀> I know
19:22fdobridge: <karolherbst🐧🦀> if you have an `or(not(or(a, b)), c)` thing going
19:22fdobridge: <karolherbst🐧🦀> ehh.. a cmp instead of the inner or
19:23fdobridge: <karolherbst🐧🦀> `or(not(ieq(a, b)), c)`
19:23fdobridge: <karolherbst🐧🦀> or something silly like that
19:23fdobridge: <karolherbst🐧🦀> mhhhh
19:23fdobridge: <gfxstrand> Yeah, it's so you can more easily fold stuff in.
19:23fdobridge: <karolherbst🐧🦀> yeah....
19:23fdobridge: <karolherbst🐧🦀> I guess it makes sense if you need both values
19:23fdobridge: <karolherbst🐧🦀> otherwise you could fold the `not` with the compare
19:23fdobridge: <gfxstrand> You can still do it most of the time if you know demorgan's laws but that makes it easier, I guess.
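As a small illustration of the folding being discussed: a `not` around a compare can be absorbed by flipping the compare, and a `not` around a chained `and` can be pushed inward with De Morgan's laws, assuming the predicate source can be negated for free. The function names are hypothetical and this is plain boolean logic, not NAK code.

```rust
/// or(not(ieq(a, b)), c) as written.
fn or_not_ieq(a: u32, b: u32, c: bool) -> bool {
    !(a == b) || c
}

/// The same value as a single "not equal" compare chained with OR.
fn ine_or(a: u32, b: u32, c: bool) -> bool {
    (a != b) || c
}

/// not(and(ieq(a, b), c)) as written.
fn not_and_ieq(a: u32, b: u32, c: bool) -> bool {
    !((a == b) && c)
}

/// De Morgan: flip the compare, negate the predicate input, chain with OR.
fn ine_or_not(a: u32, b: u32, c: bool) -> bool {
    (a != b) || !c
}
```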
20:40benjaminl: got 'compute.basic.*' down to two CTS failures on nak sm50
20:40benjaminl: both of the failures are image atomics