00:38fdobridge: <mhenning> Based on the way nvcc handles nested if statements, I'm pretty confident that a BSYNC will wait for threads in the mask to reach the "same" BSYNC and not just any BSYNC. What isn't clear to me is what "same" means in this context. Equivalence by thread mask and equivalence by program counter both sound like plausible theories.
00:41fdobridge: <karolherbst🐧🦀> wouldn't the same barrier register make more sense?
00:42fdobridge: <karolherbst🐧🦀> but keep in mind, that it's not just about threads waiting on a BSYNC
00:43fdobridge: <karolherbst🐧🦀> like.. how would you model it if threads waiting on a `YIELD` are also woken up
00:43fdobridge: <karolherbst🐧🦀> if "waiting" is even the right term here
00:45fdobridge: <karolherbst🐧🦀> mhh.. though the exact same barrier reg would also be very cursed...
01:03fdobridge: <mhenning> Yeah, same barrier reg is also possible, although I really hope they didn't do that
01:07fdobridge: <mhenning> For the interactions with YIELD, EXIT, etc - I don't know yet, but I think it's reasonable to assume the ISA docs are describing those correctly
01:11fdobridge: <gfxstrand> I think threads waiting on `YIELD` being woken up is about making sure all the threads get to the barrier, not that those participate in the barrier.
01:11fdobridge: <gfxstrand> Yeah, that would be ROUGH
01:12fdobridge: <gfxstrand> This is why I really need access to the docs. I need to see the exact text or I'm not going to be able to parse any of this.
01:12fdobridge: <gfxstrand> There's too much subtlety for the telephone game
01:46fdobridge: <mhenning> Okay, I wrote a kernel with enough nested if statements that it spills barriers, and nvcc is happy to spill and reuse barrier regs
01:46fdobridge: <mhenning> so I don't think equivalence is based on barrier reg number
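For context, a rough sketch of the kind of kernel being discussed: nested data-dependent branches, which nvcc handles by allocating a reconvergence barrier (a BSSY/BSYNC pair) per nesting level. This is an illustrative stand-in, not the actual test case (that one is linked below); the observation is that with enough nesting the barrier registers spill.
```cuda
// Illustrative only: each level of divergent nesting gets its own BSSY/BSYNC
// pair in the generated SASS; deep enough nesting forces barrier spills.
__global__ void nested_divergence(const int *in, int *out)
{
    int tid = threadIdx.x;
    int v = in[tid];

    if (v & 1) {            // outer divergence: BSSY records a barrier here
        if (v & 2) {        // inner divergence: another barrier, covering a
            if (v & 4)      // subset of the outer barrier's thread mask
                v += 3;
            v *= 2;
        }                   // inner BSYNC: the inner subset reconverges
        v -= 1;
    }                       // outer BSYNC: the whole warp reconverges
    out[tid] = v;
}
```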
02:02fdobridge: <karolherbst🐧🦀> no, if all relevant threads (marked by the barrier) are either blocked (waiting on a bsync), yielded, or exited, or any combination of those, the barrier releases and every thread gets unlocked
02:03fdobridge: <gfxstrand> That makes sense. That's just deadlock prevention
02:03fdobridge: <karolherbst🐧🦀> yeah
02:03fdobridge: <karolherbst🐧🦀> but nothing states it has to be the exact same instruction
02:03fdobridge: <karolherbst🐧🦀> so if a random thread yields within your CF structure and all other threads reached the end, you still won't have converged threads, because that one yielded thread still hasn't reached the bsync
02:04fdobridge: <karolherbst🐧🦀> but I have no idea why it's like that
02:05fdobridge: <karolherbst🐧🦀> @mhenning do you actually manage to get nvidia to emit nested bssy+bsync pairs operating on subsets of threads? because all I managed to get was convergence on the outer structure
02:07fdobridge: <mhenning> Yes: https://gitlab.freedesktop.org/mhenning/re/-/commit/aee1babfab8e85ab9c7d8ccc064d00b266f20375
02:10fdobridge: <karolherbst🐧🦀> oh wow, funky
02:12fdobridge: <karolherbst🐧🦀> the issue here is that there are hints that the barrier reg contains more than just the threadmask, but that's also uhh.. weird, because you can only spill to a 32 bit reg
02:14fdobridge: <mhenning> What does it have other than threadmask?
02:14fdobridge: <karolherbst🐧🦀> there is some aliasing bit apparently
02:15fdobridge: <karolherbst🐧🦀> like when BSSY gets called on a barrier which already has a value, then that bit is set
02:15fdobridge: <karolherbst🐧🦀> but that might also live somewhere else.. dunno
02:15fdobridge: <karolherbst🐧🦀> I also don't know what that bit even does
02:18fdobridge: <karolherbst🐧🦀> myyy
02:18fdobridge: <karolherbst🐧🦀> mhhh
02:18fdobridge: <karolherbst🐧🦀> I wonder if something else is going on
02:18fdobridge: <karolherbst🐧🦀> like..
02:19fdobridge: <karolherbst🐧🦀> bsync might keep threads blocked if not all threads are blocked/yielded/whatever, but if bsyncs across the code operate on different masks, it doesn't really matter what mask there is as long as all threads are in such a state
02:20fdobridge: <gfxstrand> Yeah, if all threads are in a bsync, you're stuck and it makes sense to start unblocking at random.
02:21fdobridge: <karolherbst🐧🦀> well.. that's a different thing though
02:21fdobridge: <karolherbst🐧🦀> I mean, if you have a bsync with threads 1+2+4 and another with 1+3+4, as long as all threads 1-4 reach both, they both unblock
02:21fdobridge: <karolherbst🐧🦀> ehh
02:21fdobridge: <karolherbst🐧🦀> all threads reach either
02:23fdobridge: <karolherbst🐧🦀> that wouldn't contradict @mhenning's example as they are clearly subsets going inner
02:23fdobridge: <karolherbst🐧🦀> *deeper or whatever
02:23fdobridge: <karolherbst🐧🦀> mhh
02:23fdobridge: <karolherbst🐧🦀> or would it?
02:24fdobridge: <karolherbst🐧🦀> I think somebody needs to run some code :ferrisUpsideDown:
02:26fdobridge: <karolherbst🐧🦀> it's kinda weird, because the wording is pretty unambiguous..
02:28fdobridge: <mhenning> If I'm understanding you right, these semantics are what I was calling "equivalence by program counter", which is compatible with what I've seen from nvcc
02:28fdobridge: <mhenning> To be honest, whether hardware judges equivalence by program counter vs by mask might not matter much because we can write a compiler that works under both models.
02:29fdobridge: <mhenning> but yeah, our best bet for more reverse engineering might be to start throwing instruction sequences at hardware
02:29fdobridge: <karolherbst🐧🦀> yeah.. dunno.. but the docs don't really make any statement on that, it's simply about the state of the threads associated with the barrier
02:29fdobridge: <karolherbst🐧🦀> nothing else
02:29fdobridge: <karolherbst🐧🦀> and as I said, the wording on WARPSYNC is way stronger
02:30fdobridge: <karolherbst🐧🦀> so either the docs on BSYNC are just bad
02:30fdobridge: <karolherbst🐧🦀> or something funky is going on
07:22fdobridge: <gobrosse> does the bsync instruction provide feedback? if there were a way to tell between (1):
07:22fdobridge: <gobrosse> ```md
07:22fdobridge: <gobrosse> | Thread 0 | Thread 1 |
07:22fdobridge: <gobrosse> | --------- | --------- |
07:22fdobridge: <gobrosse> | bsync 11 | ... |
07:22fdobridge: <gobrosse> | <blocked> | ... |
07:22fdobridge: <gobrosse> | <blocked> | bsync 01 | <- T1 doesn't care about T0, probably reconverging something else
07:22fdobridge: <gobrosse> | proceed | proceed |
07:22fdobridge: <gobrosse> ```
07:22fdobridge: <gobrosse> and (2)
07:22fdobridge: <gobrosse> ```md
07:22fdobridge: <gobrosse> | Thread 0 | Thread 1 |
07:22fdobridge: <gobrosse> | --------- | --------- |
07:22fdobridge: <gobrosse> | bsync 11 | ... |
07:22fdobridge: <gobrosse> | <blocked> | ... |
07:22fdobridge: <gobrosse> | <blocked> | bsync 11 | <- T1 syncs with T0 too - matching masks
07:22fdobridge: <gobrosse> | proceed | proceed |
07:22fdobridge: <gobrosse> ```
07:22fdobridge: <gobrosse> you could write code so that in (1) T0 just waits again until T1 syncs with it explicitly
07:23fdobridge: <gobrosse> maybe it doesn't need to provide feedback and you can just rely on this weird non-blocking behavior when masks overlap, and then do subgroup stuff to figure out whether the barrier is an actual pass
07:24fdobridge: <gobrosse> I have a more in-depth writeup, but the validity of this scheme hinges on the precise behaviour of bsync (or other similar instructions that might exist, idk)
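A rough sketch of the scheme gobrosse describes, expressed with public CUDA warp intrinsics rather than raw BSYNC (which isn't exposed to developers). One caveat: under CUDA's documented semantics `__syncwarp` already guarantees the named lanes converge, so the retry branch below can never fire; it only becomes meaningful if the underlying hardware sync can release threads early, which is exactly the open question here. The function name and mask are hypothetical.
```cuda
// Hypothetical sketch: after a warp sync, verify that the lanes we expected
// actually arrived, and go back to waiting if they did not.
__device__ void sync_until_real_pass(unsigned expected_mask)
{
    for (;;) {
        __syncwarp(expected_mask);                      // wait on the lanes we care about
        // Ballot over the same mask: which of the expected lanes are here now?
        unsigned arrived = __ballot_sync(expected_mask, 1);
        if (arrived == expected_mask)
            break;                                      // a "real pass": everyone converged
        // otherwise we were released for some other reason; wait again
    }
}
```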
07:41fdobridge: <gobrosse> https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar-warp-sync
07:42fdobridge: <gobrosse> > Syntax
07:42fdobridge: <gobrosse> >
07:42fdobridge: <gobrosse> > bar.warp.sync membermask;
07:42fdobridge: <gobrosse> >
07:42fdobridge: <gobrosse> > Description
07:42fdobridge: <gobrosse> >
07:42fdobridge: <gobrosse> > bar.warp.sync will cause executing thread to wait until all threads corresponding to membermask have executed a bar.warp.sync with the same membermask value before resuming execution.
07:42fdobridge: <gobrosse> pretty sure this is the PTX equivalent and yup, no mention of what happens in a "deadlock" situation
09:01fdobridge: <karolherbst🐧🦀> it isn't
09:01fdobridge: <karolherbst🐧🦀> `bar.warp.sync` is `WARPSYNC`
09:02fdobridge: <karolherbst🐧🦀> which has the semantics as described in the PTX docs
09:02fdobridge: <karolherbst🐧🦀> `BSSY` and `BSYNC` are not something developers can emit, because that's an internal detail of nvidia's compiler
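For reference, the developer-visible construct here is `__syncwarp(mask)`, which compiles to PTX `bar.warp.sync` and hence to `WARPSYNC` in SASS; BSSY/BSYNC have no source- or PTX-level equivalent and are inserted by ptxas for reconvergence. A minimal usage sketch (the kernel itself is made up):
```cuda
// Lanes 0-15 of each warp sync among themselves; lanes 16-31 never execute
// the sync and are not named in the mask, so this is well defined.
__global__ void warp_sync_example(int *data)
{
    const unsigned mask = 0x0000ffffu;    // lanes 0-15
    if (threadIdx.x % 32 < 16) {
        data[threadIdx.x] *= 2;
        __syncwarp(mask);                 // -> bar.warp.sync -> WARPSYNC
    }
}
```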
13:59fdobridge: <karolherbst🐧🦀> https://blog.rust-lang.org/2024/03/21/Rust-1.77.0.html :ferrisBongo: I need almost everything
14:03fdobridge: <esdrastarsis> C-string literals are very interesting
14:08fdobridge: <karolherbst🐧🦀> and `offset_of` as well
14:08fdobridge: <karolherbst🐧🦀> I need both 😄
16:54fdobridge: <gfxstrand> Yeah, both are good
17:07fdobridge: <karolherbst🐧🦀> @airlied @gfxstrand what was the solution we settled with in regards to VM_BIND and modifiers?
17:32fdobridge: <gfxstrand> I don't know that we ever have. :blobcatnotlikethis:
17:32fdobridge: <karolherbst🐧🦀> :blobcatnotlikethis:
17:38fdobridge: <karolherbst🐧🦀> wasn't it something like: if there is metadata on a bo, we assume old-gl behavior and make it kinda work, and for the VM_BIND case we just assume it works correctly?
18:09fdobridge: <Joshie with Max-Q Design> :blobcatnotlikethis:
18:20fdobridge: <airlied> Yeah I don't think we figured out a good answer for the container problem, tbh I'm not sure I want old containers with old nouveau nvc0 drivers being used
18:20fdobridge: <airlied> The driver isn't exactly high quality
18:44fdobridge: <airlied> I think I would accept a kernel patch to take metadata on bo creation for vmbind BOs we know would be shared, but I'm not sure I want to write it myself :-p
18:46fdobridge: <karolherbst🐧🦀> could require the flags at bind time to be the same
18:56fdobridge: <airlied> Nah I don't think we want that, the kernel wouldn't use the metadata
18:57fdobridge: <airlied> Though I think for CPU maps we might need to on Kepler
18:57fdobridge: <gfxstrand> We need to figure out what's going on with CPU maps on Kepler
18:57fdobridge: <gfxstrand> Or not. I mean, we can always just not...
18:57fdobridge: <airlied> Just missing metadata
18:58fdobridge: <airlied> Not in front of the code, but I think we have tile mode and kind
18:58fdobridge: <airlied> On older hw I think tile mode might still need to be metadata
18:58fdobridge: <airlied> Or we need to pass it in on cpu maps
19:34fdobridge: <gfxstrand> I'm fine with CPU maps just always being metadata on those platforms
19:35fdobridge: <gfxstrand> IDK that there's much point in tiled maps. We can't use them on modern hardware so we need swizzle code anyway.
19:35fdobridge: <gfxstrand> Unless modern hardware can tile maps in which case that's kinda cool
19:35fdobridge: <gfxstrand> But also potentially problematic so IDK
23:50airlied: bridge test
23:50fdobridge: <airlied> other bridge test
23:51fdobridge: <airlied> just thought I missed some msgs but seems fine
23:51fdobridge: <karolherbst🐧🦀> yeah..
23:51fdobridge: <karolherbst🐧🦀> though
23:51fdobridge: <karolherbst🐧🦀> I do plan to test a new bridge at some point
23:51fdobridge: <karolherbst🐧🦀> one which looks way nicer