00:17fdobridge: <airlied> It will be a while, since at least one fix is queued for Linus rc1 and won't be backported until next week
00:20fdobridge: <zmike.> Oof
00:20fdobridge: <zmike.> Revert it is then
00:36fdobridge: <gfxstrand> It's really quite nice. @zmike. is just using it in the most painful way possible. (Not his fault. Just that reconstructing SPIR-V is painful.)
00:49fdobridge: <zmike.> Yeah I suppose if I didn't have that restriction I might enjoy it more
00:51fdobridge: <airlied> Not sure which crash you are seeing though, there are two different fixes for different BAR problems
00:54fdobridge: <zmike.> I get an unrecoverable hang that only triggers during full CTS runs
00:54fdobridge: <zmike.> Can't repro any other way I've tried
00:55airlied: ah okay that sounds like the fix that isn't in, the other fix was for an oops
03:55fdobridge: <gfxstrand> I really wish I knew why the `dispatch_base` tests are crashing. They only crash in parallel runs, they're the only ones that crash, and they segfault. Everything's fine when I run with valgrind
03:57fdobridge: <gfxstrand> I suppose I could turn coredumps back on
03:58fdobridge: <airlied> do they segfault or device lost and kernel msg logs it?
04:03fdobridge: <gfxstrand> segfault
04:03fdobridge: <gfxstrand> dmesg logs the segfault
04:03fdobridge: <gfxstrand> no GPU errors
04:03fdobridge: <gfxstrand> If I turned on coredumps, I could probably gather some data
04:03fdobridge: <gfxstrand> but ugh... coredumps...
04:10fdobridge: <airlied> doesn't coredumpctl show them?
05:51fdobridge: <airlied> @gfxstrand I think there is either a CTS bug or a runner bug around the device id
05:52fdobridge: <airlied> or maybe both
05:52fdobridge: <airlied> getVKDeviceId usage is inconsistent
05:52fdobridge: <airlied> some tests call getVKDeviceId() - 1, which with the command line we pass in translates to -1
05:53fdobridge: <airlied> so I think changing your run script to use a 1-based id instead of 0 will fix some crashes
06:14fdobridge: <airlied> @gfxstrand fixing that seems to make things a lot less crashy here
08:44fdobridge: <!DodoNVK (she) 🇱🇹> There are 9 extensions with non-draft MRs for :triangle_nvk: (hopefully all of these can be merged to compete with Turnip)
09:13fdobridge: <valentineburley> Turnip has a ton of trivial and easy to implement extensions left, I kind of wish I had some hardware to take a swing at them
09:14fdobridge: <valentineburley> It's a close race tho 😄
09:20fdobridge: <valentineburley> Has anyone tried Path of Exile with NVK? A couple of years ago it needed a trivial Google extension on RADV, I wonder if it's still the case?
09:20fdobridge: <valentineburley> I have an MR for it: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28155
09:21fdobridge: <!DodoNVK (she) 🇱🇹> I included your MRs in the list
13:13fdobridge: <gfxstrand> I've been running with 1 for a while
13:15fdobridge: <gfxstrand> Once I get my compiler MR posted, I'm going to dig through the backlog. There's some compiler improvements from @mhenning I need to review, too.
13:15fdobridge: <gfxstrand> IDK if I'll get to them this morning but hopefully this afternoon
13:48fdobridge: <!DodoNVK (she) 🇱🇹> What will that MR contain?
13:49fdobridge: <gfxstrand> That was it! This test doesn't -1
13:55fdobridge: <tom3026> perhaps not exactly nouveau related, but since you guys are pretty much gpu driver gurus: `pci=pcie_bus_perf`,
13:55fdobridge: <tom3026> ```
13:55fdobridge: <tom3026> Set device MPS to the largest allowable MPS
13:55fdobridge: <tom3026> based on its parent bus. Also set MRRS (Max Read Request Size)
13:55fdobridge: <tom3026> to the largest supported value (no larger than the MPS that the device or bus can support) for best performance.
13:55fdobridge: <tom3026> ```
13:56fdobridge: <tom3026> what is this? O_o just noticed it in the kernel parameters manual
14:02fdobridge: <gfxstrand> @airlied https://gitlab.khronos.org/Tracker/vk-gl-cts/-/issues/3232
14:02fdobridge: <gfxstrand> Okay, now I know I can safely ignore those fails
14:07fdobridge: <!DodoNVK (she) 🇱🇹> What does that issue say?
14:17fdobridge: <Sid> pcie max payload size
14:17fdobridge: <Sid> technically sets MPS and MRRS to the max value permitted by the hardware
14:18fdobridge: <Sid> or, well, permitted/supported by a pcie device's parent bus
14:24fdobridge: <Sid> basically should allow larger data transfers where possible
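(A minimal sketch of the selection rule the quoted doc describes; illustrative C++ with made-up byte values, not the kernel's actual code.)
```
// Illustration of the pcie_bus_perf rule quoted above (not kernel code;
// the capability values are hypothetical).
#include <algorithm>
#include <cstdio>

int main() {
    int bus_max_mps = 256;  // largest MPS the parent bus allows, in bytes
    int dev_max_mps = 512;  // largest MPS the device itself supports
    // Device MPS = largest allowable value based on its parent bus:
    int mps = std::min(dev_max_mps, bus_max_mps);
    // MRRS = largest supported value, no larger than what device/bus handle:
    int mrrs = mps;
    printf("MPS=%d MRRS=%d\n", mps, mrrs);
    return 0;
}
```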
14:25fdobridge: <gfxstrand> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28300
14:26fdobridge: <gfxstrand> Okay, now all the uniform convergence tests pass
14:27fdobridge: <tom3026> so why isn't it always set to that? heh
14:27fdobridge: <Sid> nooo idea
14:28fdobridge: <tom3026> oh well I'll turn it on, let's see if it blows up
14:28fdobridge: <Sid> would be interesting to see perf benchmarks
14:28fdobridge: <Sid> with and without it
14:30fdobridge: <tom3026> it's the holy grail for all fps dips
14:31fdobridge: <gfxstrand> It's an issue for a bunch of CTS tests where `--deqp-vk-device-id` and `--vk-deqp-device-group-id` interact badly.
14:32fdobridge: <gfxstrand> The problem is that the `*dispatch_base*` tests are technically device group tests, so they try to treat `--deqp-vk-device-id` as the id within the group, but we only report individual groups with 1 instance each, so it indexes OOB and blows up.
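(A hypothetical sketch of that off-by-one; the names and structure are invented for illustration, not the real CTS runner code.)
```
// Hypothetical illustration of the OOB indexing described above.
#include <vector>
#include <cstdio>

int main() {
    // NVK-style report: one device group containing a single device.
    std::vector<int> group = {42};
    int vkDeviceId = 0;              // --deqp-vk-device-id=0
    int idxInGroup = vkDeviceId - 1; // some tests treat the id as 1-based
    // idxInGroup == -1: out of bounds; in the real runner this segfaults.
    printf("%d\n", group.at(idxInGroup)); // at() throws instead of crashing
    return 0;
}
```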
14:34fdobridge: <tom3026> oh nvm, found a mailing list. seems a bit misleading; it depends on a bunch of things and in general it can even reduce perf compared to the sane default heh
14:34fdobridge: <tom3026> i think
14:34fdobridge: <tom3026> oh well no fun without trying
14:38fdobridge: <Sid> heh
14:46fdobridge: <gfxstrand> @airlied Not out of the woods yet. 😩
14:46fdobridge: <gfxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1220020593621991496/dmesg.txt?ex=660d6bb8&is=65faf6b8&hm=e1a9ed9e0ba90941dd650ddc3d50466c1512a918b70c9c909bf87ebb2b5e7fbf&
14:47fdobridge: <gfxstrand> I don't have an easy reproducer for that. AFAIK I saw it for the first time just now
15:01fdobridge: <marysaka> Sad to see that we need to go unstructured
15:01fdobridge: <gfxstrand> The hardware is unstructured. 🤷🏻♀️
15:02fdobridge: <gfxstrand> On Maxwell, we just won't run the new pass and we'll use the structured merge intrinsics
15:02fdobridge: <marysaka> makes sense yeah...
15:03fdobridge: <gfxstrand> I joked to Jeff (NVIDIA) one time that while AMD has spent the last decade trying to get LLVM to work better for their hardware, NVIDIA's solution seems to be to make their hardware better for LLVM. Jeff's response was something like "Yeah, well, it seems to be working out for us."
15:04fdobridge: <marysaka> lol
15:06fdobridge: <gfxstrand> IDK if that's a joke about AMD SW architects, LLVM, or NVIDIA hardware. All three, maybe?
15:07HdkR: Considering it took me like a week to get a working Ampere backend into LLVM, it's a pretty good fit :P
15:07HdkR: Then a month to clean it up so it supported more than basic code blocks, but whatever
15:08HdkR: Ampere? No, Volta
15:18fdobridge: <gfxstrand> Yeah, the fact that you can just throw totally unstructured control flow at it is pretty neat
15:18fdobridge: <gfxstrand> Of course, once your threads diverge, there's no getting them back so there is that...
15:20fdobridge: <karolherbst🐧🦀> it's funky how all that CL work we were doing becomes useful for a lot of other things 😄
15:20fdobridge: <gfxstrand> Yeah
15:20fdobridge: <gfxstrand> And with the new validation rule for NIR that I added in my MR, I think we can just flip on unstructured for a LOT of passes.
15:21fdobridge: <gfxstrand> The other thing we need to do to make it work is to have a pass which auto-converts nir_if to unstructured.
15:21fdobridge: <gfxstrand> Which could be part of the block sorting pass, honestly.
15:21fdobridge: <gfxstrand> Possibly integrated with nir_builder somehow
15:25fdobridge: <karolherbst🐧🦀> yeah... though maybe we need a different way of defining nir passes at some point, and have it be more declarative about what they require, what metadata they invalidate, etc.. but most of it can also be done in the entry function, so not sure it's even all that helpful
15:26fdobridge: <karolherbst🐧🦀> but I think we'll arrive at a place where a pass will have to state it only works on structured or unstructured CF
15:30fdobridge: <redsheep> I'm not certain I'm clear on what this means, and it doesn't seem to be an easy Google search, but do you think this might have been motivated by raytracing performance? Seems like that required a lot of crazy changes around control flow
15:32fdobridge: <karolherbst🐧🦀> nvidia hardware was always like that.. more or less
15:32fdobridge: <karolherbst🐧🦀> I think they just wanted to get rid of the internal stack because it was a real pita
15:32fdobridge: <karolherbst🐧🦀> and more complex programs required you to spill it to VRAM
15:36fdobridge: <dadschoorse> I've seen their compiler get control flow with subgroups wrong too, tbf not as often as amdvlk's llvm backend, but it happens
15:38fdobridge: <karolherbst🐧🦀> sums up their mindsets pretty neatly tbh
15:40fdobridge: <gfxstrand> Unstructured control-flow is where you just have branch and conditional branch instructions. Structured is where the control-flow is represented as ifs, loops, and other high-level constructs
15:40fdobridge: <gfxstrand> What motivates it? CUDA.
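(To make the distinction concrete, a toy CUDA kernel; the comments sketch the two compiler views in made-up notation, not actual NIR or SASS.)
```
// A divergent branch, with the structured vs. unstructured compiler
// views sketched in comments (illustrative notation only).
__global__ void toy(int *data) {
    int tid = threadIdx.x;
    // Structured view: an if-construct with an explicit merge point:
    //     if (cond) { then } merge
    // Unstructured view: just blocks and branches:
    //     entry: @!cond bra merge
    //     then:  ...    bra merge
    //     merge: ...
    if (tid & 1)
        data[tid] *= 2;  // odd lanes take the branch and diverge
    data[tid] += 1;      // merge point: lanes must reconverge here
}
```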
15:41fdobridge: <gfxstrand> Hopefully NAK is now correct. I'm not sure how well tested it all is but I'm fairly convinced that my lowering pass re-converges at all the right places.
15:45fdobridge: <redsheep> Hmm interesting, I thought all those high level concepts were just implemented through branches, wasn't aware hardware having more awareness than that was an option. What little I have learned down at this low level was mostly about old CPUs though.
15:45fdobridge: <karolherbst🐧🦀> depends on the hardware
15:46fdobridge: <karolherbst🐧🦀> I think some hardware has like structured CF in the ISA
15:47fdobridge: <dadschoorse> yeah, it's either that, nvidia's solution or uniform branches + explicit exec mask like amd
15:49fdobridge: <dadschoorse> x86+avx512 is kind of like amd gpu hw, you have uniform branches and exec masks for simd ops
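(A minimal sketch of that style, assuming AVX-512F intrinsics; the function is illustrative.)
```
// "Uniform branch + exec mask" in AVX-512 terms: the per-lane condition
// becomes a mask register, and masked ops merge with the inactive lanes.
#include <immintrin.h>

void clamp_negatives(float v[16]) {
    __m512 x    = _mm512_loadu_ps(v);
    __m512 zero = _mm512_setzero_ps();
    // Per-lane predicate, playing the role of a GPU exec mask:
    __mmask16 k = _mm512_cmp_ps_mask(x, zero, _CMP_LT_OQ);
    // Only lanes in k are written; the rest keep their old value --
    // the lane-preserving merge zen4 reportedly handles slowly (see above).
    x = _mm512_mask_mov_ps(x, k, zero);
    _mm512_storeu_ps(v, x);
}
```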
15:50fdobridge: <redsheep> Oh interesting, didn't realize anything like that existed in cpu land
15:50fdobridge: <redsheep> I guess it makes sense for avx512 though
15:51HdkR: SVE on ARM also has predication masks
15:52HdkR: Matches AVX512 behaviour relatively well. At least for AVX512F
15:52fdobridge: <dadschoorse> zen4's avx512 implementation makes actually using the exec masks like you would on gpus unattractive though, because they have worse performance if you want to preserve inactive lanes
15:53fdobridge: <redsheep> Yeah I mean it's not a GPU. As a whole their implementation seems to be working out for them though.
15:55fdobridge: <redsheep> They don't need to make all cases performant if it saves power and such.
15:57fdobridge: <redsheep> AMD has drifted towards nvidia's philosophy a bit with rdna, right? Wave32 was intended to help with this kind of thing from what I gathered
15:57fdobridge: <redsheep> Trying to rely less on compiler engineering to get good performance
15:58fdobridge: <dadschoorse> if anything rdna made compiler engineering more important
15:58fdobridge: <dadschoorse> with gcn, you didn't have to care about alu latency at all
15:59fdobridge: <redsheep> Just because it didn't vary, right?
16:02fdobridge: <redsheep> Well I suppose it probably still doesn't, that whole thing was kind of confusing to be honest
16:02fdobridge: <dadschoorse> with gcn, the latency was always hidden because the SIMD16 unit runs each instruction 4 times. with wave32 on rdna, one instruction is issued in one cycle, but it takes another 4 cycles until you can use the result
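(A worked sketch of that difference, using the numbers above; the schedules are illustrative pseudo-ISA, not real disassembly.)
```
GCN (wave64 on a SIMD16 unit): each instruction issues over 4 cycles
(64 lanes / 16 ALUs), so a ~4-cycle ALU result is ready before the next
instruction needs it -- dependent code never stalls:
    v_add v0, a, b    ; cycles 0-3
    v_mul v1, v0, c   ; cycles 4-7, v0 already available

RDNA (wave32 on a SIMD32 unit): issue takes 1 cycle but the result is
not usable for several more, so the compiler must interleave work:
    v_add v0, a, b    ; cycle 0
    v_mul v1, v0, c   ; stalls waiting on v0
vs.
    v_add v0, a, b
    v_add v2, d, e    ; independent instruction hides the latency
    v_mul v1, v0, c
```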
16:04fdobridge: <redsheep> Ah right, so you end up with interleaving where you wouldn't have it before?
16:06fdobridge: <redsheep> Not sure if that's the right word here.
16:13fdobridge: <redsheep> Ok I went and reread that part of the rdna whitepaper and I think I understand now.
16:13fdobridge: <redsheep> I can see how that would actually lead to more compiler work
16:15fdobridge: <mohamexiety> @gobrosse may find it interesting too. I know he has some cursed stuff that breaks some compilers
16:15fdobridge: <gobrosse> Faith pinged me earlier today, I DM'd her following the masto post a few days ago 🙂
16:16fdobridge: <gobrosse> Gonna try to review it tonight, but should avoid making promises (publish or perish, all that ...)
17:18fdobridge: <gobrosse> had a quick look, so far seems sound, but I need to study how the nvidia sync stuff works more to be 100% positive
17:20fdobridge: <gobrosse> the design where you reconverge before breaking out of a loop is super heavy-handed, but to not do that you'd need threads to talk to each other at different points of control-flow, somehow
17:22fdobridge: <karolherbst🐧🦀> the reconvergence happens on the sync point though, no?
17:23fdobridge: <gobrosse> i mean, ideally you'd have threads that break out only reconverge once, where they're breaking/returning to
17:23fdobridge: <gobrosse> but here the threads that didn't break need to know of the other ones, and to know that they gotta ask the ones who are breaking out
17:24fdobridge: <gobrosse> and since this stuff lives in per-lane gprs ... you gotta talk to 'em (which means sync, which means you need to know who takes part ... cyclical dependency if that changed since you entered the current scope!)
17:25fdobridge: <gobrosse> i _think_ you can do it another way (ie have two barriers at two locations talk, possibly exchanging depth info?) but I actually don't know and I wouldn't want to stall anything
17:25fdobridge: <gobrosse> even if I figure this out I think this design is safer and there's also the question of what happens with pre-volta or whatever
17:28fdobridge: <gobrosse> wait does nak support that even
19:04fdobridge: <airlied> Ughh I think we have one of those in a gitlab report already
19:07fdobridge: <karolherbst🐧🦀> @airlied _any_ idea on this bug? It seems like the bar mapping is either screwed up or... dunno 🙂 `boot0` contains `0xbad0ac00`
19:09fdobridge: <gobrosse> @gfxstrand so hum what are the semantics of `bar_sync_nv` / where can I read about them ? I believe `set` and `break` are basically just modifying the barrier bitmask.
19:09fdobridge: <gobrosse>
19:09fdobridge: <gobrosse> I'd like to know what happens if there are two outstanding `bar_sync_nv` with one having a mask that's a subset of the other one, depending on what happens I might have a nifty idea
19:12fdobridge: <airlied> @karolherbst either my thread is broken or I didn't get a link to a bug
19:20fdobridge: <mhenning> @karolherbst do you have docs on BSYNC that explain this case?
19:23fdobridge: <gobrosse> full open-access NV ISA docs when 🐸
19:26fdobridge: <karolherbst🐧🦀> https://gitlab.freedesktop.org/drm/nouveau/-/issues/342
19:26fdobridge: <karolherbst🐧🦀> only for turing+
19:27fdobridge: <gobrosse> that's fine
19:27fdobridge: <karolherbst🐧🦀> well.. what they store is implementation defined, so it's not documented
19:27fdobridge: <karolherbst🐧🦀> just the semantics on how they are supposed to be used
19:30fdobridge: <airlied> oh that bug is pretty special, esp as a regression, I can't think of anything nouveau related, I wonder if it's pci or runpm somehow
19:30fdobridge: <karolherbst🐧🦀> yeah.. maybe...
19:31fdobridge: <karolherbst🐧🦀> runpm would be kinda weird, because it looks like the GPU responds
19:31fdobridge: <karolherbst🐧🦀> `0xbad.....` is such a common pattern in nvidia
19:31fdobridge: <karolherbst🐧🦀> basically means there was an error accessing the mmio range
19:31fdobridge: <karolherbst🐧🦀> either because it's not there, or because of other reasons
19:32fdobridge: <karolherbst🐧🦀> those codes even mean something, but I have no idea if we ever got docs on that
19:32fdobridge: <airlied> yeah it's just so early to get that, like we don't even touch fw or anything
19:32fdobridge: <karolherbst🐧🦀> yeah...
19:32fdobridge: <karolherbst🐧🦀> it's like the first mmio read we do
19:32fdobridge: <karolherbst🐧🦀> maybe second
19:32fdobridge: <airlied> it could be aspm or some other pcie thing also, but never seen that particular behaviour
19:33fdobridge: <karolherbst🐧🦀> yeah.. dunno.. at least _something_ is up, and the GPU is responding...
19:33fdobridge: <karolherbst🐧🦀> we should ask for docs on those codes tbh
19:50fdobridge: <mhenning> @karolherbst Right. The question is "what happens if there are two outstanding BSYNCs with one synchronizing a group of threads that's a subset of the other one?"
19:51fdobridge: <karolherbst🐧🦀> what does "outstanding" mean?
19:51fdobridge: <karolherbst🐧🦀> there is no stack
19:51fdobridge: <karolherbst🐧🦀> the only input is the barrier
19:53fdobridge: <mhenning> I read it as: Some lanes are waiting on one BSYNC and other lanes are waiting on a different BSYNC
19:53fdobridge: <karolherbst🐧🦀> sounds like a deadlock situation
19:54fdobridge: <karolherbst🐧🦀> `BSYNC` waits until all threads specified through the barrier arrive
19:54fdobridge: <karolherbst🐧🦀> mhh well
19:54fdobridge: <karolherbst🐧🦀> actually
19:55fdobridge: <karolherbst🐧🦀> that's not true
19:55fdobridge: <karolherbst🐧🦀> `BSYNC` just checks if all relevant threads are yielded, blocked or exited
19:56fdobridge: <karolherbst🐧🦀> and all threads will be unblocked
19:57fdobridge: <karolherbst🐧🦀> so I don't think it actually matters on which `BSYNC` those threads are blocked on
19:58fdobridge: <karolherbst🐧🦀> `YIELD` indicates that as well
19:59fdobridge: <karolherbst🐧🦀> `YIELD` is just there to guarantee forward progress
19:59fdobridge: <gobrosse> wait I am asking about the case where one _is_ the subset of the other...
19:59fdobridge: <gobrosse> i thought of bsync as a barrier that blocks until all the threads in the mask arrive at a barrier too but it's still a bit fuzzy in my head
20:00fdobridge: <karolherbst🐧🦀> there is no relation between those barriers
20:00fdobridge: <karolherbst🐧🦀> but anyway...
20:00fdobridge: <karolherbst🐧🦀> `BSYNC` waits until all threads in that barrier are blocked (waiting via `BSYNC`, are yielded via `YIELD` or exited)
20:01fdobridge: <karolherbst🐧🦀> and then releases _all_ threads in that barrier
20:03fdobridge: <karolherbst🐧🦀> it apparently doesn't matter where those threads are waiting
20:03fdobridge: <karolherbst🐧🦀> and on what
20:03fdobridge: <karolherbst🐧🦀> @gfxstrand ^^ in case you didn't know
20:05fdobridge: <mhenning> @karolherbst really? So then if we
20:05fdobridge: <mhenning> ```
20:05fdobridge: <mhenning> if (divergent cond) {
20:05fdobridge: <mhenning> if (divergent cond) {
20:05fdobridge: <mhenning> }
20:05fdobridge: <mhenning> // BSYNC on threads A, B
20:05fdobridge: <mhenning> }
20:05fdobridge: <mhenning> // BSYNC on threads A, B, C
20:05fdobridge: <mhenning> ```
20:05fdobridge: <mhenning> If C reaches the last BSYNC first, then A, B reaching the inner sync will unblock C?
20:05fdobridge: <karolherbst🐧🦀> no
20:05fdobridge: <karolherbst🐧🦀> only the threads participating
20:06fdobridge: <karolherbst🐧🦀> but like
20:06fdobridge: <karolherbst🐧🦀> if threads A and B are waiting on the inner BSYNC, and C on the outer one
20:06fdobridge: <karolherbst🐧🦀> all threads will unblock
20:06fdobridge: <gobrosse> which all threads? does C proceed ?
20:06fdobridge: <karolherbst🐧🦀> yes
20:06fdobridge: <karolherbst🐧🦀> all participating threads
20:07fdobridge: <karolherbst🐧🦀> and on the outer one, all threads participate
20:07fdobridge: <gobrosse> hum what does participating mean exactly
20:07fdobridge: <karolherbst🐧🦀> part of the barrier
20:07fdobridge: <gobrosse> so like, threads that reached _a_ bsync?
20:08fdobridge: <karolherbst🐧🦀> no
20:08fdobridge: <karolherbst🐧🦀> the _barrier_ not the instruction
20:08fdobridge: <karolherbst🐧🦀> the input to those instructions
20:08fdobridge: <gobrosse> ah by barrier you mean like the threadmask right?
20:08fdobridge: <karolherbst🐧🦀> well.. it might be a threadmask, but that's considered opaque in the docs
20:09fdobridge: <gobrosse> pretty sure it is iirc, i have a couple CUDA/nv experts in my group with whom I discussed the topic a while ago, plus what the heck else could it be
20:09fdobridge: <karolherbst🐧🦀> `BREAK` e.g. also modifies the barrier because it's cursed
20:10fdobridge: <karolherbst🐧🦀> it is very likely to be a mask
20:10fdobridge: <karolherbst🐧🦀> but again.. I wouldn't know
20:10fdobridge: <karolherbst🐧🦀> it's also not relevant
20:11fdobridge: <karolherbst🐧🦀> what is relevant is that instructions like `BSSY` mark the threads as participating in the barrier which gets returned
20:12fdobridge: <karolherbst🐧🦀> apparently there is an aliasing bit on the barrier 🙂
20:12fdobridge: <karolherbst🐧🦀> which is set if you use a barrier with existing threads on `BSSY`
20:13fdobridge: <karolherbst🐧🦀> I have no idea if that bit can even be read out and if it matters for anything
20:13fdobridge: <karolherbst🐧🦀> 😄
20:14fdobridge: <mhenning> I'm struggling to understand how we can ever reconverge nested control flow if this is true
20:14fdobridge: <karolherbst🐧🦀> good question
20:15fdobridge: <karolherbst🐧🦀> it might be that in hardware it's different
20:15fdobridge: <gobrosse> (AFAIU) The way the MR does it is by reconverging one construct at a time, never multiple levels at once, so there never is ambiguity about who you're reconverging with and what threads you're waiting on, but this also means that you can't just jump arbitrarily far out in one go
20:15fdobridge: <karolherbst🐧🦀> I can only tell what I know
20:16fdobridge: <gobrosse> pretty ironic when the HW's claim to fame is supporting unstructured CF 🙃
20:17fdobridge: <gobrosse> but I _think_ you can do better by basically spinning, doing `bsync` and checking that other threads have indeed reconverged enough to proceed
20:17fdobridge: <gobrosse> i have a draft writeup about it but I'm holding on posting it in a comment until I do some whiteboarding to convince myself it works at all
20:20fdobridge: <mhenning> Unless I'm misunderstanding karol's description, one construct at a time isn't enough to handle my example above - the inner reconvergence will restart thread C
20:21fdobridge: <gobrosse> yes, so,
20:21fdobridge: <gobrosse> C would still need to check that A and B's depth is low enough; since it's not, it bsyncs again until A and B get done with the inner if and reach the outer sync
20:22fdobridge: <karolherbst🐧🦀> you could use `YIELD` with a vote
20:22fdobridge: <karolherbst🐧🦀> but yeah.. _maybe_ it makes sense to do what nvidia is doing
20:22fdobridge: <karolherbst🐧🦀> but knowing nvidia, they just loop merge it into one construct
20:22fdobridge: <karolherbst🐧🦀> done and done
20:24fdobridge: <gobrosse> wdym? what do they loop merge ? the example ?
20:24fdobridge: <gobrosse> i feel like they need to have a general solution, or did you mean it's similar to what I propose?
20:27fdobridge: <karolherbst🐧🦀> they merge nested loops into one loop e.g.
20:27fdobridge: <karolherbst🐧🦀> and ifs are just predicates
20:28fdobridge: <karolherbst🐧🦀> so all threads run in lock step within that loop until they are all done
20:29fdobridge: <karolherbst🐧🦀> and one thread can break out of it at any time, because there is just one barrier to sync on to begin with
20:33fdobridge: <mhenning> This all sounds bizarre to me. I might spend some time figuring out what nvcc emits
20:34fdobridge: <karolherbst🐧🦀> yeah.. check what nvidia is doing. But whenever I checked more complex control flow, they used a ton of predicates and loop merged the heck out of everything
20:35fdobridge: <karolherbst🐧🦀> though I never checked what they'd do with optimizations disabled tbh
20:53fdobridge: <gfxstrand> That doesn't make any sense.
20:54fdobridge: <karolherbst🐧🦀> well...
20:54fdobridge: <karolherbst🐧🦀> in the end only the hardware is actual truth
20:54fdobridge: <karolherbst🐧🦀> but
20:55fdobridge: <karolherbst🐧🦀> yeah.. no idea really... because with `YIELD` in the mix it's kinda hard to require the same `BSYNC`
20:55fdobridge: <karolherbst🐧🦀> unless it's the same `BSYNC` or any `YIELD`
21:08fdobridge: <gfxstrand> According to the NVIDIA blog (https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/):
21:08fdobridge: <gfxstrand> > The `__syncwarp()` primitive causes the executing thread to wait until all threads specified in mask have executed a `__syncwarp()` (with the same mask) before resuming execution. It also provides a memory fence to allow threads to communicate via memory before and after calling the primitive.
21:10fdobridge: <gfxstrand> So it doesn't care about it being the same instruction, it cares about them all being on a sync with the same mask. As long as you don't screw up your masks, the mask should uniquely identify the sync instruction within the set of active syncs
21:13fdobridge: <gfxstrand> My understanding of `bssy` is that it's basically `ballot(true)` and `break` is basically `bar &= ~ballot(pred)`
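(A CUDA-level analogy of that model, assuming Volta+ warp intrinsics; it mirrors the described semantics rather than the actual SASS NAK emits.)
```
// CUDA analogy for bssy/break/bsync as described above (illustrative):
//   bssy  ~ ballot(true)        -- snapshot the lanes entering the region
//   break ~ bar &= ~ballot(p)   -- lanes leaving drop out of the mask
//   bsync ~ __syncwarp(bar)     -- wait for the lanes still in the mask
__global__ void region(int *out) {
    unsigned bar = __ballot_sync(__activemask(), 1); // "bssy"
    bool leaving = (threadIdx.x & 1);
    bar &= ~__ballot_sync(bar, leaving);             // "break"
    if (!leaving) {
        out[threadIdx.x] = threadIdx.x;
        __syncwarp(bar);                             // "bsync"
    }
}
```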
21:14fdobridge: <pac85> I'm curious about this unstructured cf in nir, how do we retain the reconvergence information?
21:21fdobridge: <karolherbst🐧🦀> sounds about right
21:21fdobridge: <karolherbst🐧🦀> I suspect there is no `YIELD` thing for cuda, because that's something the compiler inserts apparently
21:22fdobridge: <karolherbst🐧🦀> who is saying anything about the same mask?
21:22fdobridge: <karolherbst🐧🦀> ohh wait
21:22fdobridge: <karolherbst🐧🦀> mhhh
21:22fdobridge: <karolherbst🐧🦀> that blog does...
21:22fdobridge: <karolherbst🐧🦀> well.. my docs don't :ferrisUpsideDown:
21:23fdobridge: <karolherbst🐧🦀> soo
21:23fdobridge: <karolherbst🐧🦀> the question is.. is it something CUDA guarantees by lowering it or if it's something the hardware actually checks
21:23fdobridge: <karolherbst🐧🦀> but it would make sense if it works like that
21:24fdobridge: <karolherbst🐧🦀> let's check ptx...
21:25fdobridge: <karolherbst🐧🦀> `bar.warp.sync will cause executing thread to wait until all threads corresponding to membermask have executed a bar.warp.sync with the same membermask value before resuming execution.`
21:25fdobridge: <karolherbst🐧🦀> `For .target sm_6x or below, all threads in membermask must execute the same bar.warp.sync instruction in convergence, and only threads belonging to some membermask can be active when the bar.warp.sync instruction is executed. Otherwise, the behavior is undefined.` 😄
21:25fdobridge: <karolherbst🐧🦀> figures
21:26fdobridge: <karolherbst🐧🦀> so yeah.. assume my docs are trashy, but it could also be something that nvidia deals with internally
21:27fdobridge: <gfxstrand> For these sorts of things, PTX tends to match the hardware
21:27fdobridge: <gfxstrand> I'm willing to make that assumption
21:27fdobridge: <gfxstrand> What I don't know is what this looks like pre-Volta
21:27fdobridge: <gfxstrand> Or does that text mean you already have to be re-converged
21:28fdobridge: <karolherbst🐧🦀> I think you have to push/pop in order
21:28fdobridge: <karolherbst🐧🦀> there is a hierarchy though
21:28fdobridge: <gfxstrand> Yes, that's fine
21:28fdobridge: <karolherbst🐧🦀> like... I think a break also pops all precont entries
21:28fdobridge: <karolherbst🐧🦀> ehh maybe not all
21:28fdobridge: <karolherbst🐧🦀> but anyway.. something like that was going on
21:30fdobridge: <karolherbst🐧🦀> mhhhh
21:30fdobridge: <karolherbst🐧🦀> actually...
21:30fdobridge: <karolherbst🐧🦀> what if it needs to be the same _barrier_
21:30fdobridge: <karolherbst🐧🦀> because that's trivial to check in hardware
21:31fdobridge: <karolherbst🐧🦀> or rather.. easier than checking if each thread passed the same mask
21:31fdobridge: <karolherbst🐧🦀> because then it would make sense
21:32fdobridge: <karolherbst🐧🦀> which also explains why break doesn't return a barrier
21:32fdobridge: <karolherbst🐧🦀> @gfxstrand ^^ I'd verify this theory if I were you
21:33fdobridge: <mhenning> Yes, pre-volta a break will pop the precont entries for you
21:34fdobridge: <karolherbst🐧🦀> and I wonder, if you have two waiting points, whether it needs to get the exact same barrier passed in as well (which I'd assume you'd have to do)
21:34fdobridge: <mhenning> Note that for that blog post, I'd guess that `__syncwarp()` becomes WARPSYNC in hardware, which could plausibly have different semantics from ~~BSSY~~ BSYNC
21:35fdobridge: <karolherbst🐧🦀> ohh yeah..
21:35fdobridge: <karolherbst🐧🦀> `WARPSYNC` waits explicitly on the same instruction
21:36fdobridge: <karolherbst🐧🦀> it also has way stronger wording than `BSYNC`
21:36fdobridge: <karolherbst🐧🦀> like it explicitly guarantees that the active mask of threads executing the _next_ instruction is the same as the mask passed to `WARPSYNC`
21:37fdobridge: <karolherbst🐧🦀> (minus threads who have exited)
21:37fdobridge: <karolherbst🐧🦀> yeah...
21:37fdobridge: <karolherbst🐧🦀> I think `__syncwarp` == `WARPSYNC`
21:37fdobridge: <karolherbst🐧🦀> it also has this memory barrier thing going on
21:39fdobridge: <karolherbst🐧🦀> `BSYNC` also states that it doesn't wait on sleeping threads as they are considered to be yielded...
21:40fdobridge: <karolherbst🐧🦀> and `NANOSLEEP` doesn't even have an input barrier
21:43fdobridge: <gfxstrand> Can you throw a __syncwarp at the cuda compiler and find out?
21:45fdobridge: <karolherbst🐧🦀> never done that...
21:46fdobridge: <karolherbst🐧🦀> let's see...
21:47fdobridge: <karolherbst🐧🦀> `cuda_runtime.h: No such file or directory` 🥲
21:49fdobridge: <karolherbst🐧🦀> `error: #error -- unsupported GNU version! gcc versions later than 12 are not supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.` 🥲
21:49fdobridge: <karolherbst🐧🦀> let's see if 12.3 is any better
21:50fdobridge: <karolherbst🐧🦀> ehh 12.4 actually
21:51fdobridge: <karolherbst🐧🦀> ahh that worked
21:53fdobridge: <karolherbst🐧🦀> `/usr/local/cuda-12.4/bin/nvcc -arch=sm_75 test.cu --cubin`
21:54fdobridge: <karolherbst🐧🦀> @gfxstrand perfect... nvidia optimizes it to a `NOP ;` 🥲
21:54fdobridge: <karolherbst🐧🦀> even with `-O0`
21:54fdobridge: <gfxstrand> Awesome!
21:55fdobridge: <karolherbst🐧🦀> lemme grab some demo code 😄
21:55fdobridge: <mhenning> It's a little tricky to get it to avoid DCEing the warpsync, but I get
21:55fdobridge: <mhenning> ```
21:55fdobridge: <mhenning> /*00c0*/ MOV R4, 0xffffffff ; /* 0xffffffff00047802 */
21:55fdobridge: <mhenning> /* 0x000fe40000000f00 */
21:55fdobridge: <mhenning> /*00f0*/ CALL.ABS.NOINC `(__cuda_sm70_warpsync) ; /* 0x0000000000007943 */
21:55fdobridge: <mhenning> /* 0x000fea0003c00000 */
21:55fdobridge: <mhenning> ```
21:55fdobridge: <mhenning> where that function is:
21:55fdobridge: <mhenning> ```
21:55fdobridge: <mhenning> __cuda_sm70_warpsync:
21:55fdobridge: <mhenning> /*0000*/ WARPSYNC R4 ; /* 0x0000000400007348 */
21:55fdobridge: <mhenning> /* 0x000fe80003800000 */
21:55fdobridge: <mhenning> /*0010*/ RET.ABS.NODEC R20 0x0 ; /* 0x0000000014007950 */
21:55fdobridge: <mhenning> /* 0x000fea0003e00000 */
21:55fdobridge: <mhenning> ```
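(One untested guess at the kind of kernel that keeps the sync alive: make lanes actually communicate through shared memory across it.)
```
// Untested guess at a kernel that keeps __syncwarp() from being DCE'd:
// lanes exchange data through shared memory across the sync, so the
// memory fence it provides is load-bearing.
__global__ void keep_sync(int *out) {
    __shared__ int buf[32];
    int lane = threadIdx.x & 31;
    buf[lane] = lane * lane;    // producer
    __syncwarp(0xffffffff);     // orders the shared-memory stores
    out[lane] = buf[lane ^ 1];  // consumer reads a neighbouring slot
}
```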
21:56fdobridge: <gfxstrand> Right, so we could implement it with vote and warpsync
21:57fdobridge: <karolherbst🐧🦀> okay.. sooo
21:57fdobridge: <karolherbst🐧🦀> yeah...
21:58fdobridge: <karolherbst🐧🦀> @mhenning yes.. just have it inside an if :ferrisUpsideDown:
21:58fdobridge: <karolherbst🐧🦀> apparently
21:58fdobridge: <karolherbst🐧🦀> ah!
21:58fdobridge: <karolherbst🐧🦀> https://gist.github.com/karolherbst/67cda39755f35b86372786addd2e73dc
21:58fdobridge: <karolherbst🐧🦀> it also uses BSYNC
21:58fdobridge: <karolherbst🐧🦀> after the if/else
21:58fdobridge: <gfxstrand> So... crazy plan... We have warpsync on Maxwell, right? We have vote, too. We can do the same thing for both.
21:58fdobridge: <karolherbst🐧🦀> we don't
21:58fdobridge: <karolherbst🐧🦀> it's Volta+
21:59fdobridge: <karolherbst🐧🦀> WARPSYNC is the wrong thing here anyway
21:59fdobridge: <karolherbst🐧🦀> BSSY+BSYNC is the right thing to converge around CF
21:59fdobridge: <karolherbst🐧🦀> it just has funky semantics which don't matter for nvidia, because their compiler just optimizes the hell out of it
22:01fdobridge: <gfxstrand> Sure but I'm still confused on the semantics.
22:02fdobridge: <karolherbst🐧🦀> anyway
22:02fdobridge: <karolherbst🐧🦀> nvidia loop merges
22:03fdobridge: <karolherbst🐧🦀> uhh
22:03fdobridge: <karolherbst🐧🦀> and unrolls
22:03fdobridge: <karolherbst🐧🦀> funky
22:03fdobridge: <karolherbst🐧🦀> nvidia optimizes from the C++ side :ferrisUpsideDown:
22:03fdobridge: <karolherbst🐧🦀> like constant arguments to the kernel
22:03fdobridge: <karolherbst🐧🦀> what a pain
22:05fdobridge: <karolherbst🐧🦀> omg is this impressive
22:10fdobridge: <karolherbst🐧🦀> yeah.. so the issue is that nvidia only inserts BSSY+BSYNC when they absolutely have to
22:10fdobridge: <karolherbst🐧🦀> and they optimize loops and ifs in a way that they don't really break up threads
22:11fdobridge: <karolherbst🐧🦀> like nested loops are just one loop with predication inside, and one bra to break out
22:11fdobridge: <karolherbst🐧🦀> that's it
22:11fdobridge: <karolherbst🐧🦀> well, and if threads diverge inside, it doesn't matter because there is nothing that needs converged threads in the first place
22:11fdobridge: <karolherbst🐧🦀> so they only sync once, even it's all nested and everything
22:12fdobridge: <karolherbst🐧🦀> so yeah.. as I said: loop merging and predication, that's what nvidia is doing
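(A source-level sketch of that loop-merging + predication shape; illustrative CUDA, the real compiler works on its own IR.)
```
// Nested loops collapsed into one merged loop whose body is predicated,
// per the description above (source-level illustration only).
__global__ void before(float *a, int n, int m) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            a[i * m + j] *= 2.0f;
}

__global__ void after(float *a, int n, int m) {
    int i = 0, j = 0;
    while (i < n) {              // one merged loop, one place to break out
        bool body = (j < m);     // inner "loop" becomes a predicate
        if (body) a[i * m + j] *= 2.0f;  // predicated body
        j = body ? j + 1 : 0;    // advance inner counter, or reset it
        i = body ? i : i + 1;    // advance outer counter when inner is done
    }
}
```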
22:15fdobridge: <karolherbst🐧🦀> but anyway.. cuda is too much to actually RE anything here, because they are doing crazy shit
22:17fdobridge: <karolherbst🐧🦀> but they also blow up my single line loop into ~500 instructions?
22:18fdobridge: <karolherbst🐧🦀> maybe 200..
22:18fdobridge: <karolherbst🐧🦀> anyway
22:18fdobridge: <karolherbst🐧🦀> a lot of things are going on there and I think ptx is the easier target 😄
22:21fdobridge: <karolherbst🐧🦀> mhhh.. maybe I shouldn't have used an idiv...