16:51fdobridge: <gfxstrand> @zmike. Should I try running the GL or ES CTS again? I'm a bit lost on where everything's at at the moment.
17:36fdobridge: <zmike.> @gfxstrand what's your motivation for asking?
17:37fdobridge: <zmike.> I don't think anything has significantly changed in terms of CLs being merged in CTS
17:37fdobridge: <zmike.> and I don't think I've merged anything that would affect nvk this week-ish?
18:16fdobridge: <gfxstrand> Okay, I saw a WSI thing go in
18:17fdobridge: <zmike.> oh
18:17fdobridge: <zmike.> yeah I guess there's that
18:17fdobridge: <zmike.> but EGL caselists aren't required for...some versions of cts
18:54fdobridge: <gfxstrand> Okay, I'll try again now that those are fixed and see where we're at.
19:25fdobridge: <gfxstrand> ES tests are dying in EGL. Some sort of GPU hang that doesn't give us any useful information. It's in the `multi_context` tests and looks like what happens when we have too many contexts active in the GSP at the same time.
19:26fdobridge: <gfxstrand> Anyway, we'll see how my desktop GL run goes
19:28fdobridge: <gfxstrand> Of course my sparse fixes aren't merged yet so I can't actually submit anything I run today
19:28fdobridge: <gfxstrand> But we'll see how it goes
20:08fdobridge: <zmike.> I run cts pretty regularly
20:08fdobridge: <zmike.> everything is passing for me, though that's with all cts changes applied and whatever the hell has gathered in my branches after 2 weeks of nir fugue
20:09fdobridge: <zmike.> I don't think I run the EGL list though since historically that one always had issues so I never added it to my scripts
20:13fdobridge: <gfxstrand> Yeah, we need to figure out EGL for ES conformance. We can do GL without.
20:13fdobridge: <gfxstrand> Which, honestly, is the one I care about
20:17fdobridge: <zmike.> if I ever finish nir hell I'll get back to it in the course of fixing some of the issues that have been piling up
21:12fdobridge: <gfxstrand> @karolherbst @mhenning I'm poking about with `bssy`. It appears that the result, when copied to a GPR via `bmov` is, indeed, simply `ballot(true)`.
21:12fdobridge: <gfxstrand> I'm not sure how to get at bsync, though.
21:12fdobridge: <karolherbst🐧🦀> bsync doesn't write to the barrier afaik
21:12fdobridge: <gfxstrand> Incidentally, this means one could optimize `bssy+bmov` to `vote`
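(For reference: a minimal CUDA sketch of the observation above, not NAK or SASS output. It only illustrates that a ballot over a true predicate is the active-thread mask, which is the value read back from the barrier register via `bmov`; it says nothing about what `bssy` does internally. Kernel name and launch shape are arbitrary.)
```cuda
// Plain CUDA illustration: ballot(true) over the active threads is just the
// active-thread mask.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void ballot_true(void)
{
    unsigned active = __activemask();
    // Every active thread contributes a 1 bit, so this equals `active`.
    unsigned ballot = __ballot_sync(active, 1);
    if (threadIdx.x == 0)
        printf("active = %08x, ballot(true) = %08x\n", active, ballot);
}

int main(void)
{
    ballot_true<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```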
21:13fdobridge: <karolherbst🐧🦀> I wouldn't bet on it, as there might be some internal magic they are doing
21:13fdobridge: <gfxstrand> Eh, if they're doing internal magic then my spilling strategy is hosed.
21:13fdobridge: <gfxstrand> But that's totally testable...
21:14fdobridge: <karolherbst🐧🦀> if one could optimize that to vote, why doesn't nvidia then?
21:14fdobridge: <karolherbst🐧🦀> mhh well.. vote can't write to a barrier anyway
21:14fdobridge: <gfxstrand> 🤷🏻♀️
21:14fdobridge: <gfxstrand> You wouldn't want to most of the time
21:14fdobridge: <karolherbst🐧🦀> also, bssy has a constant thread mask as the input
21:15fdobridge: <karolherbst🐧🦀> ehh wait
21:15fdobridge: <karolherbst🐧🦀> is it the mask?
21:15fdobridge: <karolherbst🐧🦀> ehh no, it's the jump target...
21:15fdobridge: <karolherbst🐧🦀> it had an input predicate
21:15fdobridge: <karolherbst🐧🦀> but vote has that as well.. but also an output predicate
21:16fdobridge: <karolherbst🐧🦀> the one thing I'm still curious about is that apparently there is an aliasing bit _somewhere_, but I have no idea what it's even doing
21:16fdobridge: <gfxstrand> I suspect nvidia doesn't because those registers are magic and have special scheduling rules and it's easier to just use them for barriers and only bmov when you need to spill
21:16fdobridge: <karolherbst🐧🦀> yeah.. probably
21:17fdobridge: <gfxstrand> But it's good to know that they are masks.
21:17fdobridge: <karolherbst🐧🦀> I wonder if the order is funky
21:18fdobridge: <karolherbst🐧🦀> like.. there is an aliasing bit, but it's not part of the barrier, that much I know
21:18fdobridge: <karolherbst🐧🦀> and it gets cleared on `BMOV.CLEAR`
21:19fdobridge: <karolherbst🐧🦀> any BMOV writing zero actually
22:33fdobridge: <gfxstrand> Actually, it looks like `~ballot(true)`
22:35fdobridge: <gfxstrand> No, just `ballot(true)`.
22:35fdobridge: <gfxstrand> Ugh. Writing tests is complicated
22:56fdobridge: <gfxstrand> Okay, so figuring out all this sync stuff is tricky. It looks like the hardware has some sort of deadlock detection where, if every thread is blocked, it kicks all the active `bsync`s.
22:57fdobridge: <gfxstrand> This is good for not getting the GPU stuck. Tricky for R/E
23:01fdobridge: <karolherbst🐧🦀> mhhh
23:01fdobridge: <karolherbst🐧🦀> but that basically means that after a `bsync` you are not guaranteed to have all threads converged?
23:02fdobridge: <gfxstrand> Not if you do it wrong and deadlock, you aren't.
23:03fdobridge: <karolherbst🐧🦀> right, just means that bsync kicking threads waiting on other bsyncs is indeed how the hardware works
23:03fdobridge: <gfxstrand> Maybe?
23:03fdobridge: <gfxstrand> It just means the HW has deadlock detection
23:03fdobridge: <karolherbst🐧🦀> I mean.. the docs state that if all threads in the mask are blocked/sleeping/exited, it kicks them all
23:04fdobridge: <gfxstrand> Yes
23:04fdobridge: <gfxstrand> Which is fine
23:04fdobridge: <karolherbst🐧🦀> regardless of where the threads currently wait
23:04fdobridge: <gfxstrand> Sure
23:04fdobridge: <karolherbst🐧🦀> only `WARPSYNC` guarantees that after unblocking _all_ threads execute the next instruction
23:05fdobridge: <karolherbst🐧🦀> next relative to `WARPSYNC`
23:05fdobridge: <gfxstrand> And I've determined that it only kicks the current set of hung threads. If you wait again, that wait works.
23:05fdobridge: <gfxstrand> So it's not like one hang disables bsync or something (not that I would expect it to).
23:06fdobridge: <gfxstrand> That sounds like a different but fairly important distinction.
23:06fdobridge: <karolherbst🐧🦀> I wonder what the hardware does if you have multiple masks fulfilled
23:06fdobridge: <karolherbst🐧🦀> like.. different masks, or even subsets
23:06fdobridge: <karolherbst🐧🦀> yeah.. `WARPSYNC` is like the cuda `__syncwarp` thing
23:07fdobridge: <gfxstrand> That's tricky to figure out because most of the cases where that would happen are also deadlock cases. 😭
23:07fdobridge: <karolherbst🐧🦀> including the memory barrier
23:07fdobridge: <karolherbst🐧🦀> well..
23:07fdobridge: <karolherbst🐧🦀> worst case we trust the docs, which say that bsync doesn't care about anything besides the threads' status
23:08fdobridge: <gfxstrand> That's important because with bsync nothing actually guarantees that a subgroup op actually executes in lock step. Depending on `$details`, you may have threads converged but still executing separately.
23:08fdobridge: <karolherbst🐧🦀> I wonder how `YIELD` plays into all of this
23:08fdobridge: <karolherbst🐧🦀> like...
23:08fdobridge: <karolherbst🐧🦀> nvidia puts that in front of a loop continue e.g.
23:09fdobridge: <karolherbst🐧🦀> well
23:09fdobridge: <karolherbst🐧🦀> sometimes
23:09fdobridge: <karolherbst🐧🦀> it's one example
23:09fdobridge: <gfxstrand> Which means my subgroup implementation may not be 100% correct. I may need to throw in some warpsync/bsync to ensure things don't get too far out of sync.
23:10fdobridge: <karolherbst🐧🦀> yeah...
23:10fdobridge: <karolherbst🐧🦀> I think that's what nvidia is doing for subgroup ops.. at least Ben threw some of those in front of them for codegen because that's what nvidia was doing
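(A point of comparison rather than a statement about what codegen emits: on the CUDA side the same requirement has been explicit since Volta, because the warp-level primitives take a participation mask and you sync on that mask before relying on them in divergent code. The mask, kernel, and reduction below are illustrative assumptions.)
```cuda
// CUDA sketch: sync the participating lanes (the warpsync analog) before a
// warp shuffle in a divergent branch, using the sync variants' explicit mask.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergent_reduce(int *out)
{
    int lane = threadIdx.x & 31;
    int val = lane;

    if (lane < 16) {
        unsigned mask = 0x0000ffffu;   // the lanes we expect in this branch
        __syncwarp(mask);              // converge just those lanes
        // Butterfly sum across the 16 participating lanes.
        for (int offset = 8; offset > 0; offset >>= 1)
            val += __shfl_xor_sync(mask, val, offset, 16);
        if (lane == 0)
            *out = val;                // 0 + 1 + ... + 15 = 120
    }
}

int main(void)
{
    int *out;
    cudaMallocManaged(&out, sizeof(int));
    divergent_reduce<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    printf("sum = %d\n", *out);
    cudaFree(out);
    return 0;
}
```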
23:11fdobridge: <karolherbst🐧🦀> but that brings us to the question: what's the actual purpose of bssy+bsync
23:12fdobridge: <karolherbst🐧🦀> though as long as you have nothing nested going on it's good enough
23:15fdobridge: <karolherbst🐧🦀> but yeah.. that kinda depends on what happens if you have subsets of masks going on
23:17fdobridge: <karolherbst🐧🦀> @gfxstrand ohhh.. I have an idea
23:17fdobridge: <karolherbst🐧🦀> what if at arrival it determines what happens
23:18fdobridge: <karolherbst🐧🦀> because that would entirely explain the semantics
23:18fdobridge: <karolherbst🐧🦀> like.. the thread either blocks or it unblocks the group
23:18fdobridge: <karolherbst🐧🦀> so if you arrive in an inner bsync, the outer one does nothing
23:18fdobridge: <karolherbst🐧🦀> because there is no active thread arriving
23:18fdobridge: <karolherbst🐧🦀> (in case some lonely thread waits on the outer one since forever)
23:20fdobridge: <karolherbst🐧🦀> and you can't have multiple groups of threads executing in different places within a subgroup afaik, so there is only one active group of threads per subgroup
23:20fdobridge: <karolherbst🐧🦀> so if half the threads arrive at the outer bsync, they block and transfer execution to the other half
23:21fdobridge: <karolherbst🐧🦀> and they loop until they arrive at the outer one as well, converging and unblocking everything
23:21fdobridge: <gfxstrand> Nope! Not on Turing+. Turing can totally have multiple groups of threads in different parts of the program going at the same time.
23:22fdobridge: <gfxstrand> Or maybe Volta?
23:22fdobridge: <karolherbst🐧🦀> in a subgroup?
23:22fdobridge: <gfxstrand> Yup
23:22fdobridge: <karolherbst🐧🦀> mhhh
23:22fdobridge: <gfxstrand> There might be an enable bit for it somewhere
23:22fdobridge: <karolherbst🐧🦀> might explain why they have the barrier file now...
23:22fdobridge: <gfxstrand> And it might be compute-only
23:22fdobridge: <karolherbst🐧🦀> but yeah.. that's kinda funky
23:23fdobridge: <karolherbst🐧🦀> maybe that's why they added WARPSYNC?
23:23fdobridge: <karolherbst🐧🦀> because it didn't exist before
23:23fdobridge: <gfxstrand> I think that's more because they want independent forward progress where things may not be nicely nested.
23:24fdobridge: <karolherbst🐧🦀> they added YIELD for forward progress
23:24fdobridge: <gfxstrand> It's certainly why they added `__syncwarp()`
23:25fdobridge: <karolherbst🐧🦀> maybe bssy+bsync is just good enough for 99% of all cases and they simply ignore that you can get very unlucky timing, and for the cases where it really matters you use WARPSYNC just to be sure
23:26fdobridge: <butterflies> Volta
23:28fdobridge: <gfxstrand> I kinda suspect that `warpsync` and `bsync` are the same under the hood, just targeting different registers and with `bsync` having the deadlock detection.
23:29fdobridge: <karolherbst🐧🦀> mhh
23:29fdobridge: <gfxstrand> I've definitely seen `warpsync` hang the GPU
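(This lines up with the documented CUDA rule for `__syncwarp`: every non-exited thread named in the mask has to execute a `__syncwarp` with that same mask, otherwise behavior is undefined and in practice the warp can hang. A minimal example of that misuse, purely illustrative:)
```cuda
// The mask names all 32 lanes, but only half of them ever reach the barrier.
// Per the CUDA docs this is undefined, and in practice it can hang the warp.
__global__ void bad_syncwarp(void)
{
    if ((threadIdx.x & 31) < 16) {
        __syncwarp(0xffffffffu);   // WRONG: lanes 16..31 never get here
    }
}
```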
23:29fdobridge: <karolherbst🐧🦀> I doubt it, because warpsync also acts as a memory barrier
23:29fdobridge: <gfxstrand> What kind of memory barrier?
23:30fdobridge: <karolherbst🐧🦀> shared I think... let's see
23:30fdobridge: <karolherbst🐧🦀> mhhh
23:31fdobridge: <karolherbst🐧🦀> it doesn't say
23:31fdobridge: <karolherbst🐧🦀> just that the memory ordering of participating threads is the same as if you'd executed a MEMBAR
23:31fdobridge: <gfxstrand> I expect it's `membar.cta` then
23:31fdobridge: <karolherbst🐧🦀> well
23:32fdobridge: <karolherbst🐧🦀> only participating threads
23:32fdobridge: <gfxstrand> Sure
23:32fdobridge: <karolherbst🐧🦀> so even weaker I'd say
23:32fdobridge: <karolherbst🐧🦀> but maybe it's just membar.cta
23:32fdobridge: <karolherbst🐧🦀> whatever `__syncwarp` says 😄
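(For what it's worth, the CUDA docs do describe `__syncwarp` as ordering memory among the participating threads, which is why it shows up in warp-synchronous shared-memory reductions. A sketch of that pattern, assumed representative rather than taken from any compiler output; launch with one 32-thread block.)
```cuda
// The __syncwarp calls order the shared-memory writes and reads among the
// lanes of the warp; without them the reads below could race the writes.
__global__ void warp_reduce_shared(const int *in, int *out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x & 31;

    buf[lane] = in[lane];
    __syncwarp();                      // publish buf[lane] to the warp

    for (int offset = 16; offset > 0; offset >>= 1) {
        int other = 0;
        if (lane < offset)
            other = buf[lane + offset];
        __syncwarp();                  // reads done before anyone writes
        if (lane < offset)
            buf[lane] += other;
        __syncwarp();                  // writes visible before next round
    }

    if (lane == 0)
        *out = buf[0];
}
```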
23:33fdobridge: <karolherbst🐧🦀> there is a funky `.EXCLUSIVE` flag on `WARPSYNC` though
23:33fdobridge: <gfxstrand> What's that do?
23:34fdobridge: <karolherbst🐧🦀> only a single set of threads passes at once
23:34fdobridge: <karolherbst🐧🦀> like..
23:34fdobridge: <karolherbst🐧🦀> `EXCLUSIVE` operates on a gpr only
23:34fdobridge: <karolherbst🐧🦀> so there can be different sets of threads
23:35fdobridge: <karolherbst🐧🦀> apparently allows for funky access controls for `REDUX`, `SHFL`, `VOTE`, etc...
23:36fdobridge: <karolherbst🐧🦀> so you can prevent different sets of threads from messing up their subgroup ops
23:50fdobridge: <mhenning> yeah, I was thinking about this yesterday, and I think a reasonable guess for the semantics is something like:
23:50fdobridge: <mhenning>
23:50fdobridge: <mhenning> each thread is in one of three states:
23:50fdobridge: <mhenning> - active (executing the current instruction),
23:50fdobridge: <mhenning> - inactive (eligible to execute instructions, but not executing the current instruction), or
23:50fdobridge: <mhenning> - blocked (waiting on a bsync)
23:50fdobridge: <mhenning>
23:50fdobridge: <mhenning> and a bsync is something like:
23:50fdobridge: <mhenning> ```
23:50fdobridge: <mhenning> bsync(int mask) {
23:50fdobridge: <mhenning>     if all threads in mask are either blocked or active {
23:50fdobridge: <mhenning>         unblock all threads in mask
23:50fdobridge: <mhenning>     } else {
23:50fdobridge: <mhenning>         block all active threads
23:50fdobridge: <mhenning>     }
23:50fdobridge: <mhenning> }
23:50fdobridge: <mhenning> ```
23:50fdobridge: <mhenning> which then wouldn't require any kind of checking for "is this the same barrier?" - that's all implicit from the masks
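(Putting that guess together with the deadlock release observed earlier, a host-side C++ sketch of the assumed model. The struct, the state names, treating bsync as a pure mask check, and the release rule are all guesses from this conversation, not documented hardware behavior.)
```cpp
#include <cstdint>

enum class ThreadState { Active, Inactive, Blocked };

struct WarpModel {
    ThreadState state[32] = {};

    // Guessed bsync: `mask` is the barrier's thread mask, `arriving` the
    // threads executing this bsync right now.  If every thread in the mask
    // is either arriving or already blocked, release them all; otherwise
    // the arriving threads go to sleep and some other set of threads runs.
    // Note there is no barrier identity anywhere -- it's all in the masks.
    void bsync(uint32_t mask, uint32_t arriving) {
        for (int i = 0; i < 32; i++) {
            if (!(mask & (1u << i)))
                continue;
            bool ready = (arriving & (1u << i)) ||
                         state[i] == ThreadState::Blocked;
            if (!ready) {
                for (int j = 0; j < 32; j++)
                    if (arriving & (1u << j))
                        state[j] = ThreadState::Blocked;
                return;
            }
        }
        for (int i = 0; i < 32; i++)
            if (mask & (1u << i))
                state[i] = ThreadState::Active;
    }

    // Observed deadlock release: if every thread ends up blocked, the
    // hardware kicks all current waiters instead of hanging; a later
    // bsync still behaves normally afterwards.
    void deadlock_release() {
        for (int i = 0; i < 32; i++)
            if (state[i] != ThreadState::Blocked)
                return;
        for (int i = 0; i < 32; i++)
            state[i] = ThreadState::Active;
    }
};
```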
23:51fdobridge: <karolherbst🐧🦀> yeah
23:51fdobridge: <karolherbst🐧🦀> but also
23:51fdobridge: <karolherbst🐧🦀> the active thread's mask decides which threads are relevant
23:51fdobridge: <karolherbst🐧🦀> so you enter a bsync and that mask is the only one that matters
23:51fdobridge: <karolherbst🐧🦀> ohhh wait
23:52fdobridge: <karolherbst🐧🦀> what if it cascades?
23:52fdobridge: <karolherbst🐧🦀> like..
23:52fdobridge: <karolherbst🐧🦀> if a thread was blocked on a bsync
23:52fdobridge: <karolherbst🐧🦀> and it gets woken up
23:52fdobridge: <karolherbst🐧🦀> does it check again?
23:52fdobridge: <karolherbst🐧🦀> and blocks if one of the threads in that mask is running?
23:53fdobridge: <mhenning> I would guess not - that sounds like it's harder to implement and I'm not sure what it buys you
23:53fdobridge: <mhenning> but also I don't know that we know those details
23:53fdobridge: <karolherbst🐧🦀> nested bsyncs actually working
23:54fdobridge: <karolherbst🐧🦀> maybe
23:54fdobridge: <mhenning> You don't need to re-check for nested to work
23:54fdobridge: <karolherbst🐧🦀> well
23:54fdobridge: <karolherbst🐧🦀> if you have some threads on an outer, and some in an inner
23:54fdobridge: <karolherbst🐧🦀> mhhh
23:54fdobridge: <karolherbst🐧🦀> though they'd still all pass regardless if you are unlucky
23:55fdobridge: <karolherbst🐧🦀> so yeah.. probably doesn't change anything
23:55fdobridge: <mhenning> Only the inner will be able to arrive with all threads active or blocked
23:55fdobridge: <karolherbst🐧🦀> well
23:55fdobridge: <karolherbst🐧🦀> depends on how unlucky the timing is
23:55fdobridge: <karolherbst🐧🦀> the inner one could arrive later
23:55fdobridge: <karolherbst🐧🦀> ehh
23:55fdobridge: <karolherbst🐧🦀> or earlier
23:55fdobridge: <karolherbst🐧🦀> depends on the code really
23:56fdobridge: <mhenning> If the inner arrives first, then those threads get woken up again and the outer doesn't pass. If the outer arrives first, the inner is still active and the outer doesn't pass
23:57fdobridge: <karolherbst🐧🦀> mhhh
23:57fdobridge: <karolherbst🐧🦀> yeah so they could only arrive all at the same time
23:57fdobridge: <karolherbst🐧🦀> but yeah.. the outer wouldn't be able to, because the inner ones would already be unblocked at some point