00:00gfxstrand[d]: Can someone at least tell me how to repro? Prime or no? Nouveau or Zink? Wayland or X11? Compositor?
00:01redsheep[d]: Oh wow I went to go test on 6.9.3 and it's completely busted for me on plasma, I can't even get into sddm
00:02gfxstrand[d]: 😫
00:02gfxstrand[d]: It's got to be related to explicit sync somehow
00:03redsheep[d]: I'm not sure if that's the same as what you were all talking about reproducing, but that's plasma 6.0.5 and mesa main from a week ago
00:04redsheep[d]: Oh and zink
00:04airlied[d]: gfxstrand[d]: f40 just running against gnome-shell does it here
00:04gfxstrand[d]: Kk
00:04gfxstrand[d]: That's easy
00:04gfxstrand[d]: airlied[d]: Prime or no?
00:04airlied[d]: no prime
00:04gfxstrand[d]: Kk
00:04gfxstrand[d]: I'll test on the desktop then
00:04airlied[d]: we try to use the kernel API due to getting a dedicated allocation with modifiers
00:05airlied[d]: VK_IMAGE_TILING_DRM_FORMAT_MODIFIER_EXT
00:05gfxstrand[d]: Yeah but why? Why is the WSI code doing that to us?
00:06esdrastarsis[d]: gfxstrand[d]: My setup: stand-alone, wayland, sway, just nvk, kernel 6.9.x and nvk from mesa 24.1
00:07esdrastarsis[d]: I didn't use zink
00:07gfxstrand[d]: Kk
00:08gfxstrand[d]: I'm out buying groceries at the moment. I'll poke about when I get home.
00:08airlied[d]: https://paste.centos.org/view/33bff343
00:08airlied[d]: that seems to fix it
00:25gfxstrand[d]: airlied[d]: Ugh... I thought I did that. 🤦🏻‍♀️ RB. Add a `Fixes:` and some `Closes:` tags and merge it. Or I can after a bit
00:31airlied[d]: thrown to marge
00:43gfxstrand[d]: Thanks
00:46gfxstrand[d]: The worst part is that I was sure I'd double-checked like 3 times that we did that. 🤦🏻‍♀️
00:48esdrastarsis[d]: And the fix missed the 24.1.1 release 💀
00:49redsheep[d]: Shame I was caught up in testing the special gfxstrand nvk linux branch the whole time, probably could have gotten more eyes on this earlier
01:02gfxstrand[d]: 🤷🏻‍♀️
01:02gfxstrand[d]: We'll get it in the next one
01:19gfxstrand[d]: I tweaked it a bit to line wrap and add another closes tag
02:45gfxstrand[d]: And... Merged. 🥳
02:47tiredchiku[d]: I can keep an eye out for issues that seem like mistakes from our end, if you'd like that
02:48tiredchiku[d]: and are not perf/missing-feature related
14:03zmike[d]: DB next I assume
15:08gfxstrand[d]: Dependencies! It's always instruction dependencies...
15:16karolherbst[d]: it is
16:01gfxstrand[d]: I don't know why we use `bmov.clear`...
16:04gfxstrand[d]: Uh.... The opclass for `bmov` is `bmov_dst64`...
16:04gfxstrand[d]: Does that mean a barrier is actually 64 bits and I've been doing it wrong this whole time?!?
16:07gfxstrand[d]: I'm using bmov.32
16:07karolherbst[d]: gfxstrand[d]: it clears the value
16:08gfxstrand[d]: karolherbst[d]: Well yes, I know that. But I don't get why we need to do that before `bssy`. Especially given that I'm always using `pT` for the predicate.
16:08karolherbst[d]: barrier to barrier requires .CLEAR
16:09karolherbst[d]: barrier to register it's optional
16:09karolherbst[d]: and with barrier to barrier, I mean the general purpose ones
16:09gfxstrand[d]: Oh, .64 is for those weird other bar values
16:10karolherbst[d]: ATEXIT_PC only
16:10karolherbst[d]: but yeah
16:10gfxstrand[d]: what about register to barrier?
16:10karolherbst[d]: that has no .CLEAR
16:11karolherbst[d]: but it wouldn't make sense anyway, no?
16:11karolherbst[d]: (though I also don't know why clearing the source barrier reg is needed in the first place)
16:33dadschoorse[d]: does nvidia have native inverse_ballot?
16:33gfxstrand[d]: I don't think so
16:34gfxstrand[d]: I mean, we have ballot and a `!` modifier so...
16:34karolherbst[d]: what's inverse ballot?
16:34karolherbst[d]: ballot == `VOTE`, right?
16:35karolherbst[d]: but yeah.. `VOTE` has `.ALL`, `.ANY` and `.EQ` modifiers and I'm sure you can do whatever you want with it
16:35ahuillet[d]: it's dictatorship, the strongest thread wins
16:37dadschoorse[d]: karolherbst[d]: inverse_ballot(a) -> (a & (1 << subgroup_invocation_id)) != 0
16:37cwabbott: karolherbst[d]: no, ballot is not vote
16:37dadschoorse[d]: basically, ballot takes a boolean and gives you a mask, inverse_ballot takes a mask and gives you a boolean
16:37gfxstrand[d]: cwabbott: It is on nvidia
16:38gfxstrand[d]: dadschoorse[d]: Yeah, we don't have that. We do have a system value with the mask so we can avoid the shift
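The ballot / inverse_ballot relationship discussed above can be sketched in plain Python. This is only a model of the semantics, not NVIDIA or NVK code: it assumes a 32-lane subgroup, and the names `ballot`, `inverse_ballot` and `lane_eq_mask` are hypothetical. `lane_eq_mask` stands in for the per-lane system value mentioned above, which already has only that lane's bit set, so no shift is needed at run time.

```python
SUBGROUP_SIZE = 32  # assumption: 32 lanes per subgroup


def ballot(preds):
    """Pack one boolean per lane into a single mask that every lane sees."""
    mask = 0
    for lane, pred in enumerate(preds):
        if pred:
            mask |= 1 << lane
    return mask


def lane_eq_mask(lane):
    """Hypothetical stand-in for a per-lane mask with only that lane's bit set."""
    return 1 << lane


def inverse_ballot(mask):
    """Unpack a mask into one boolean per lane."""
    return [(mask & lane_eq_mask(lane)) != 0 for lane in range(SUBGROUP_SIZE)]


# Round trip: inverse_ballot undoes ballot.
preds = [lane % 3 == 0 for lane in range(SUBGROUP_SIZE)]
assert inverse_ballot(ballot(preds)) == preds
```

Note the parentheses around the `&`: in C-like languages `!=` binds tighter than `&`, so the mask test has to be grouped explicitly.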
16:38dadschoorse[d]: afaiu nvidia has predicate registers, is there no way to move a value from a gpr to those?
16:39dadschoorse[d]: or from a uniform register to a predicate
16:39karolherbst[d]: okay
17:23gfxstrand[d]: dadschoorse[d]: Not in a way that splits it across lanes
17:24gfxstrand[d]: We can do `x != 0` but that's about it
17:24gfxstrand[d]: Pretty much the only cross-lane communication NVIDIA has is VOTE (which also does ballot) and SHFL
17:25gfxstrand[d]: (And SWZADD but that's really only for derivatives)
17:26gfxstrand[d]: Importantly, no lane can write data in another lane. VOTE and SHFL only read the other lanes' data.
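The "read only" point can be modelled as a gather: each lane names a source lane and reads from it, and no lane ever stores into another lane's slot. The sketch below is a simplified Python model under that assumption. `shfl_idx` and `broadcast` are hypothetical helper names, and the model ignores the real instruction's modes and out-of-range handling.

```python
def shfl_idx(values, src_lanes):
    """Each lane reads the value held by the lane it names (a gather)."""
    return [values[src] for src in src_lanes]


def broadcast(values, src_lane):
    """Every lane reads from the same source lane."""
    return shfl_idx(values, [src_lane] * len(values))


values = list(range(100, 132))  # one value per lane, 32 lanes assumed
rotated = shfl_idx(values, [(lane + 1) % 32 for lane in range(32)])
everyone_reads_lane_5 = broadcast(values, 5)
```

A scatter (lane A writing into lane B's register) has no equivalent here, which matches the constraint described above.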
17:35dadschoorse[d]: no subgroupBroadcast that reads one lane and writes a ugpr?
17:40gfxstrand[d]: I don't think USHFL exists
17:41gfxstrand[d]: You can `shfl` and then `r2ur`
17:42karolherbst[d]: there is `VOTEU`
17:42karolherbst[d]: but that only operates on uniform inputs
17:42karolherbst[d]: ehh wait
17:42karolherbst[d]: and writes into one
17:43karolherbst[d]: gfxstrand[d]: ^^ in case you weren't aware of that
17:43karolherbst[d]: why isn't it called `UVOTE`? No idea 🙂
17:43karolherbst[d]: ohh
17:43karolherbst[d]: the ballot is non uniform
17:44asdqueerfromeu[d]: karolherbst[d]: `WEVOTE` 🗳️
17:44karolherbst[d]: anyway
17:44karolherbst[d]: `VOTEU` exists
17:45dadschoorse[d]: how fast is SHFL?
17:46dadschoorse[d]: gfxstrand[d]: is `r2ur` subgroupBroadcastFirst?
17:48gfxstrand[d]: karolherbst[d]: Yes, I'm aware. VOTEU takes a non-uniform predicate source but returns a uniform ballot and vote.
17:48gfxstrand[d]: dadschoorse[d]: I don't know
17:51karolherbst[d]: r2ur is funky, but you can't select which thread
17:51gfxstrand[d]: Yeah, the question is what thread does it pick?
17:51gfxstrand[d]: I don't expect it's random
17:52karolherbst[d]: "a thread"
17:52karolherbst[d]: it will tell each thread if their value differs from the chosen thread
17:53karolherbst[d]: gfxstrand[d]: predicate it off for any other thread?
17:53karolherbst[d]: oh yeah :ferrisUpsideDown:
17:53karolherbst[d]: just have a predicate which is true only for thread 0, and then...
17:54karolherbst[d]: but anyway, I don't know which thread is choosen
20:08gfxstrand[d]: Oh, right. I forgot there was a predicate. That's pretty neat, actually.
21:05karolherbst[d]: gfxstrand[d]: as you mentioned.. uldc is fast, it's the only thing that needs a WaR latency
21:05karolherbst[d]: uldc, umov, voteu needs to wait three cycles
21:05karolherbst[d]: on other U* instructions
21:06gfxstrand[d]: Yeah... pretty sure I've seen other WaR races
21:06karolherbst[d]: ohh
21:06karolherbst[d]: I have to correct myself
21:06karolherbst[d]: WaR on uniform regs
21:06karolherbst[d]: on normal regs the max wait is 2
21:07karolherbst[d]: but the list is a bit ... complex
21:07karolherbst[d]: voteu needs 2 for uniform predicates
21:07karolherbst[d]: anyway...
21:07karolherbst[d]: it's 3