07:19tzimmermann: sima, airlied, hi. could you please forward drm-next to -rc6?
07:26sima: tzimmermann, I guess backmerge and do you need something specific? or did you mean drm-fixes?
07:27tzimmermann: sima, yeah backmerge. i want to get drm-misc-next and -next-fixes to -rc6
07:32sima: tzimmermann, will look into it later today, going to the gym this morning
07:50airlied: tzimmermann: doing a backmerge at the moment
07:50airlied: needed it for msm
07:51airlied: tzimmermann: pushed out now
07:56tzimmermann: airlied, thanks
09:07jani: hwentlan__: stumbled on parse_edid_displayid_vrr() and parse_amd_vsdb() etc. in amdgpu_dm.c... why is this being added in the driver, it's all supposed to be in drm_edid.c...
09:08jani: and the idea kind of is that drivers don't modify connector->display_info
09:23tzimmermann: PSA: With -rc6 tagged, drm-misc-next-fixes is now open. Features still go into drm-misc-next. Fixes for v6.17 or stable go into drm-misc-fixes. Fixes for v6.18 go into drm-misc-next-fixes. Patches in -fixes branches should be small and have a Fixes tag.
13:58sima: airlied, thx :-)
15:51alyssa: how is b2b32 actually defined?
15:51alyssa: is (('ineg', ('b2i32', a)), ('b2b32', a)) legal?
15:52pendingchaos: IIRC b2b1(b2b32(a@1)) == a
15:52karolherbst: my gut feeling says yes, but who knows
15:52alyssa: pendingchaos: right. that's not strong enough for the rule I'd like
15:53pendingchaos: (I could be wrong, I just remember the opcode being added so that the 32-bit value can be some faster but backend specific value)
15:54alyssa: right..
15:54alyssa: Intel would like an opcode that's actually explicit 0/~0
15:54alyssa: so then we can optimize ineg/b2i32 in NIR
15:55alyssa: Or maybe the crazier rule this shader could benefit from -- i2i32(unpack_32_4x8(ineg(b2i32(x))).<whatever>) -> b2b32(x)
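For reference, a minimal sketch of how the first of those rewrites could be written, in the (search, replace) tuple style nir_opt_algebraic.py already uses; this is illustrative only and assumes b2b32 is pinned down as 0/~0, which is exactly the open question below:

```python
# Hypothetical nir_opt_algebraic.py entry (not from the tree): only sound
# if b2b32 is defined to produce 0/~0, since ineg(b2i32(a)) is exactly 0/-1.
optimizations = [
    (('ineg', ('b2i32', 'a')), ('b2b32', 'a')),
]
```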
15:56karolherbst: I wonder if there is hardware that defines those bools differently?
15:56karolherbst: afaik nvidia defines them the same as intel
15:58alyssa: ac/llvm seems to do 0/1 but could maybe be fixed
15:58karolherbst: mhhh
15:58karolherbst: could add configurable opts
15:59jenatali: HLSL defines 32-bit bools as 0/1, not 0/~0
15:59pendingchaos: ACO has always implemented it as 0/1
15:59pendingchaos: problem with just "b2b1(b2b32(a@1)) == a" is that it can break if the backend doesn't match constant folding
15:59alyssa: jenatali: that's not really relevant. we can opt_algebraic chew thru whatever.
15:59alyssa: pendingchaos: yeah, exactly
16:00karolherbst: at least on nvidia: 0/-1 and 0.0/1.0
16:00alyssa: I don't care what encoding we pick but we should really have one canonical encoding
16:00alyssa: in NIR
16:00alyssa: because yeah not matching constant folding is.. bad
16:00alyssa: and "NIR ops that change behaviour by backend" are.. bad
16:00karolherbst: could make it part of the nir options
16:01alyssa: no
16:01alyssa: r600/sfn I can't tell at a quick glance what it does
16:01alyssa: zink does 0/1
16:01alyssa: as does dxil
16:02alyssa: as does nak seemingly
16:02alyssa: well this is a mess.
16:02karolherbst: nvidia might have gotten rid of a canonical format in hw
16:02alyssa: half the backends do one thing and half do another
16:02jenatali: What's constant folding do?
16:02alyssa: jenatali: 0/~0
16:02alyssa: i think
16:02jenatali: Ouch
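For context, the convention being attributed to constant folding here, written out as a plain Python sketch (an assumption of the 0/~0 reading, not code copied from NIR):

```python
# Reference semantics for the 0/~0 convention under discussion.
def b2b32(a: bool) -> int:
    return 0xFFFFFFFF if a else 0      # true -> all-ones (~0), false -> 0

def b2i32(a: bool) -> int:
    return 1 if a else 0               # true -> 1, false -> 0

def b2b1(x: int) -> bool:
    return x != 0                      # any non-zero value reads back as true
```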
16:03karolherbst: I'm actually curious why nak does 0/1 🙃
16:03karolherbst: though I think the hw used to accept anything
16:04karolherbst: as long as it's 0 vs not 0
16:04alyssa: r600/sfn I think does 0/~0 if I'm reading the code right
16:04alyssa: it's zink/dxil, aco/ac-llvm, and nak vs everything else
16:05alyssa: zink/dxil & nak look trivial to change to 0/~0 with no/minimal perf impact
16:05alyssa: amd idk, C++ scares me, which is why I work on compilers at Intel -- frick
16:07karolherbst: just port it over to C before touching it, ez
16:08alyssa: dont tempt me with a bad time
16:08karolherbst: what scares me is that, with you, I'm not sure you wouldn't suddenly do it
16:08alyssa: of note, AGX internally uses 0/1 booleans (it's better for the hw)
16:09alyssa: but I still implemented b2b32 as 0/~0 because I thought I had to :)
16:09alyssa: seems fine \shrug/
16:09karolherbst: so in case it matters, CLC actually defines them as 0/1
16:09alyssa: it really doesn't matter what frontends or backends want, it's easy to massage to whatever
16:09alyssa: we just need to pick something and be consistent
16:11alyssa: pendingchaos: do you have concerns from an AMD perspective about changing to 0/~0?
16:11alyssa: gfxstrand: ^ from an nvidia perspective
16:13pendingchaos: b2b32 is faster for uniform booleans because it can use SCC directly
16:13pendingchaos: but because we always use b2b32 right before a shared memory store (unless that changed at some point), we would insert a copy to convert to VGPR anyway
16:13pendingchaos: for divergent booleans, 0/1 allows a trick with "a + b + b2b32(c)" to use only one instruction, but that's the shared store thing again
16:13pendingchaos: so 0/~0 for b2b32 is probably fine
16:16alyssa: the `a + b + b2b32(c)` trick being.. add-with-carry instruction?
16:17pendingchaos: yes
16:18alyssa: right..
16:18pendingchaos: the carry-in is the same representation as divergent booleans
16:18alyssa: where is the b2b32 coming from in that case? why is it not a b2i32?
16:18alyssa: it's concerning given b2b32 is currently underdefined, it seems
16:21pendingchaos: I'm not sure if that code actually appears, because IIRC b2b32 is only used for shared stores
16:22pendingchaos: the carry-in opt was probably made to optimize "a + b + (c ? 1 : 0)" instead, but both look the same to ACO at this point
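As a side note on the carry trick, a small sketch of why it wants a 0/1 boolean rather than 0/~0: the 0/1 value is literally the carry-in of a 32-bit add-with-carry (names here are illustrative, not ACO code).

```python
# a + b + (c ? 1 : 0) folded into one add-with-carry: the 0/1 boolean is
# the carry-in operand. With a 0/~0 boolean the same fold would need an
# extra negate first.
MASK32 = 0xFFFFFFFF

def add_with_carry_in(a: int, b: int, c: bool) -> int:
    return (a + b + (1 if c else 0)) & MASK32
```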
16:23alyssa: right.. so that should be fixed to only look for b2i32 instead
16:28alyssa: actually the aco opt is already fine
16:29alyssa: yeah so all of this to me sounds like "define b2b32 as 0/~0, leave b2i32 as 0/1, fix isel in a few backends, move on"
18:32alyssa: Or.. delete b2b32 altogether and just use bcsel(0, ~0) explicitly
18:33alyssa: (and make nir_b2b32 a helper that generates a bcsel)
18:33alyssa: similar to what idr did with i2b32 years ago
18:34alyssa: also probably delete b2b1 and make it a helper doing ine
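A sketch of what that lowering could look like, again in the nir_opt_algebraic-style tuple syntax (the rule list name is made up, and this assumes the 0/~0 convention for the 32-bit case):

```python
# Hypothetical lowering rules for retiring the b2b opcodes.
lower_b2b = [
    # b2b32(a) -> explicit all-ones/all-zeros select.
    (('b2b32', 'a'), ('bcsel', 'a', -1, 0)),
    # b2b1(a) -> "is the value non-zero?".
    (('b2b1', 'a'), ('ine', 'a', 0)),
]
```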
18:35alyssa: this might require aco's bcsel & ine becoming more clever to avoid regressing codegen for scalar
18:37alyssa: ir3 does a trick with ABSNEG_S which would need to be replumbed
18:38alyssa: all of zink_nir_algebraic would get deleted which is nice
18:38alyssa: ok. Yeah I think this is worth doing
18:40idr: alyssa: Not to throw a wrench in things...
18:40alyssa: /o\
18:41idr: I have a branch that I've been poking at from time to time that tries to emit 16-bit Booleans to decrease register pressure.
18:41alyssa: Great!
18:41alyssa: ..So?
18:41idr: That ends up producing some b2b32 when a b16 and a b32 would be mixed.
18:42idr: I don't know if that would run afoul of what you're thinking of doing.
18:42idr: The branch hasn't shown a clear win yet, so... *shrug*.
18:43alyssa: idr: My current proposal is simply "remove b2b1 & b2b32 opcodes, systematically convert producers to ine/bcsel, make backend's ine/bcsel smarter to match codegen if needed"
18:43idr: Okay. That sounds reasonable.
18:44idr: I've been thinking we might want type conversion opcodes in brw, but that's a topic for another day.
18:45idr: (Short version: MOV is too flexible. It's a hassle to determine, "Is this just type conversion, or is it doing other regioning nonsense too?")
18:45alyssa: Sure. I don't think that has any bearing on the NIR clean up
19:28alyssa: ..Ok, NIR trivia question..
19:28alyssa: Is ieq valid on 1-bit bools? What about ine?
19:29alyssa: (Can we do xnor in one op?)
19:33alyssa: 27 files changed, 36 insertions(+), 211 deletions(-)
19:33alyssa: Yeah...
19:39alyssa: Am I scared to CTS/shader-db this? Sure am.
19:43pendingchaos: I don't know if it's valid, but it should work with ACO anyway
19:45glehmann: alyssa: iirc both 1bit ieq and ine are valid and used
19:46alyssa: Cool
19:46alyssa: because ir3 doesn't think so (:
19:46glehmann: we should probably document/validate which ops can be used with 1bit vals, but that's annoying work
19:49alyssa: yeah..
19:49alyssa: in the interest of fairness we should also allow iadd
19:50alyssa: with equivalent behaviour to ine
19:50alyssa: ineg, with equivalent behaviour to mov
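A quick exhaustive check of those 1-bit claims (plain Python, nothing NIR-specific): at width 1, ieq behaves as xnor, ine and iadd both behave as xor, and ineg is the identity.

```python
# Width-1 sanity check: ieq == xnor, ine == iadd == xor, ineg == mov.
for a in (0, 1):
    for b in (0, 1):
        assert (1 if a == b else 0) == (a ^ b ^ 1)   # ieq is xnor
        assert (1 if a != b else 0) == (a ^ b)       # ine is xor
        assert ((a + b) & 1) == (a ^ b)              # iadd is also xor at width 1
    assert ((-a) & 1) == a                           # ineg is a no-op at width 1
```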
19:50alyssa: ...
19:50alyssa: :p
20:19alyssa: Kayden: yeah but those stages suck :p
20:19alyssa: oops
20:20jenatali: Wouldn't ineg be not, instead of mov, if we're treating i1 as 0/~0 (i.e. -1 in 2s complement)?
20:22alyssa: jenatali: -x = (~x) + 1 = ~(~x) = x
20:22alyssa: yes i am trolling
20:22jenatali: Ah yeah ok
20:23alyssa: jenatali: or if you prefer, the only bit is the sign bit
20:23jenatali: Right, -0 == 0, and -1 would be 1 but that wraps back around to -1
20:23alyssa: -(INT32_MIN) = INT32_MIN and all that jazz
20:23alyssa: likewise, -(INT1_MIN) = INT1_MIN
20:24alyssa: isn't modular arithmetic fun
20:24alyssa: `-1 = 1 mod 2`
20:24jenatali: Yep
20:25alyssa: or if you prefer, `2x = 0 mod 2` hence `x = -x mod 2`
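The same wrap-around, spelled out as a throwaway check (plain Python, illustrative only): with a single bit there is only a sign bit, so negation lands back on the same value, just like -(INT32_MIN) at 32 bits.

```python
# Width-1 analogue of -(INT32_MIN) == INT32_MIN: negation is a mov.
def ineg1(x: int) -> int:
    return (-x) & 1          # two's-complement negate, truncated to 1 bit

assert ineg1(0) == 0
assert ineg1(1) == 1         # "-1 == 1 (mod 2)"

# And the 32-bit case it mirrors:
INT32_MIN = -2**31
assert (-INT32_MIN) & 0xFFFFFFFF == INT32_MIN & 0xFFFFFFFF
```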
20:25jenatali: I'm sorry I said anything :P
20:25alyssa: i am waiting for intel shader-db to compile, i've got lots of math trolling time free :p
20:26alyssa: (building AGX shaderdb on x86 is.. a lot faster than building for the intel gpu)
20:26alyssa: compiling every fragment shader twice is great
20:29HdkR: More cores for more faster, just get like 48 :D
20:30alyssa: HdkR: I can tell when shaderdb is done based on when the fans stop (:
20:31HdkR: Oh hey, that's how I recognize that binaryninja is done
20:31alyssa: my macbook is fanless \o/
20:34HdkR: I think if I put enough radiator on this Threadripper I /could/ be fanless.
20:34alyssa: Lol
20:35dwfreed: anything can be fanless with a large enough radiator
20:35dwfreed: the problem is "large enough" is often substantially larger than one usually has space for
20:40alyssa: Oh gahhhhh
20:41alyssa: I now see why these opcodes exist.
20:41alyssa: *twiddles her blue badge in frustration*
20:44alyssa: awful. well, I tried
21:18pac85: Maybe AI could make use of those 1-bit signed integers
21:21Kayden: maybe we can lower those 1 bit numbers to tomatoes and throw them at things
21:21alyssa: pac85: vec32!
21:23pac85: lol
21:28pac85: Mmm would dot product be bit_count(a&b) & 1
21:32karolherbst: pac85: ..... well cuda has support for it
21:32pac85: Ah
21:33karolherbst: behold a 16x8x256 matmul: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-168256
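For what it's worth, the formula pac85 gave is the GF(2) (mod-2) dot product of two bit-packed vectors; a small Python sketch follows (this is just the arithmetic, not a description of how the CUDA b1 MMA accumulates):

```python
# 1-bit dot product with vectors packed one element per bit.
# popcount(a & b) is the integer dot product of two 0/1 vectors;
# "& 1" reduces it mod 2, giving the GF(2) dot product.
def dot1(a: int, b: int) -> int:
    return (a & b).bit_count() & 1       # int.bit_count() needs Python 3.10+

assert dot1(0b1011, 0b0011) == 0         # two overlapping bits -> even
assert dot1(0b1011, 0b0001) == 1         # one overlapping bit  -> odd
```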
21:34karolherbst: I know for certain it's not a thing on hw 🙃
21:42pac85: Messed up lol
22:10Mis012[m]: AI likes 1.5bit, Nvidia should really get on that