12:46 cat_fights[d]: Now that ReactOS is able to use the Windows XP Nvidia drivers, could it help reveal some of the missing information needed for power management/reclocking of the NV50 - NV130 families of Nvidia GPUs?
15:59 marysaka[d]: Mesh shader MR should be ready for review now, hopefully the documentation I typed isn't too bad 😅 https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27196
16:24 zmike[d]: does it pass glcts?
16:37 mohamexiety[d]: oh god i forgot gl has ext_mesh now
17:00 zmike[d]: CI is not testing it
17:01 zmike[d]: and also you'll need to pick https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37408 to fix nir explosions because I merged an MR 2 months ago which broke everything and nobody noticed
17:03 esdrastarsis[d]: zmike[d]: Are there any plans for zink video?
17:04 zmike[d]: someday
17:36 marysaka[d]: zmike[d]: didn't test but I can give that a shot tomorrow, the more tests the better
19:08 marysaka[d]: mhenning[d]: Talking about the whole shared atomics lowering madness, if you have a bit of time I wouldn't mind if you could run a full CTS (or `dEQP-VK.mesh_shader.*` and `dEQP-VK.glsl.atomic_operations.*mesh*`) on Blackwell
19:09 marysaka[d]: I *think* the lowering should be behaving nicely even on SM90+ but want to be sure it's not doing any weird things...
19:19 mhenning[d]: Sure, I can try that
19:19 marysaka[d]: Thanks!
19:35 mhenning[d]: `dEQP-VK.mesh_shader.*` and `dEQP-VK.glsl.atomic_operations.*mesh*` both pass on blackwell here. Can do a full cts run tonight
19:40 marysaka[d]: Okay that's reassuring at least, thanks for testing
21:55 mhenning[d]: karolherbst[d]: Do you have any documentation on warpsync.collective and endcollective? (I think they're sm90+)
21:55 karolherbst[d]: mhenning[d]: a bit and they are a bit weird
21:59 karolherbst[d]: mhenning[d]: have you checked for new barrier stats?
22:00 mhenning[d]: new barrier stats? I don't know what that means
22:00 karolherbst[d]: the thing that is called `TSSemantic` in codegen
22:01 karolherbst[d]: the special barrier registers
22:02 mhenning[d]: I suppose I haven't checked. Should I?
22:02 karolherbst[d]: there should be new ones
22:03 mhenning[d]: okay, that's another thing to look at then
22:04 karolherbst[d]: well it's related is my point rather
22:04 karolherbst[d]: there should be a new one for the collective state
22:04 mhenning[d]: sure
22:06 karolherbst[d]: I know that you aren't allowed to cause any divergence in a collective section
22:06 karolherbst[d]: not sure what the hw does if you do
22:08 mhenning[d]: cuda generates them in order to write uregs from nonuniform control flow on sm90+ (see https://gitlab.freedesktop.org/mhenning/re/-/blob/main/uniform_write_nonuniform_control_flow/notes?ref_type=heads ), and I suspect they prevent diverged threads from trampling on the ureg there
22:08 mhenning[d]: and I suspect they'll also be necessary for the mesh shader shared atomic lowering for a similar reason
22:09 karolherbst[d]: mhhhhh
22:09 mhenning[d]: and yeah, I wouldn't do any control flow in the middle of a collective block
22:13 karolherbst[d]: the input is a thread mask, right?
22:15 mhenning[d]: for warpsync, yes
22:15 karolherbst[d]: probably a result of a MATCH?
22:16 mhenning[d]: no, in cuda the mask is user-provided
22:16 karolherbst[d]: ahh
22:16 karolherbst[d]: what's the ptx construct that generates this?
22:17 mhenning[d]: The cuda program is calling __reduce_xor_sync from divergent control flow
22:17 karolherbst[d]: from my understanding, the sm80 code is a bit dodgy already tbh
22:20 karolherbst[d]: so from what I know, .EXCLUSIVE makes sure that the next instruction is only executed based on the input thread mask
22:22 mhenning[d]: right, my best guess is that the sm80 code it generates is legal because there's no control flow between the warpsync and when the ureg is last used
22:23 karolherbst[d]: mhhh
22:23 karolherbst[d]: not sure
22:23 karolherbst[d]: I suspect if the UREG would be read later it would be invalid
22:24 mhenning[d]: my current theory is that before sm90 there's no switching without a YIELD or control flow, but sm90+ needs the additional COLLECTIVE to prevent that
22:25 mhenning[d]: karolherbst[d]: right, the point is that it doesn't live for more than an instruction or two
22:25 karolherbst[d]: there is a new thread mask for collective sections and I suspect it just guarantees a set of threads executes between the warpsync and endcollective
22:26 mhenning[d]: that wouldn't explain why the cuda compiler can emit the code that it does
22:26 karolherbst[d]: why not?
22:27 karolherbst[d]: I mean it also prevents other threads from entering that section
22:27 mhenning[d]: It must prevent other threads from writing the ureg somehow
22:28 karolherbst[d]: how could it? I'm sure you could thrash the value by writing to the same ureg from a different part of the shader
22:29 mhenning[d]: right, the point is that nothing can write the ureg between REDUX.XOR and the MOV or else the code is illegal
22:29 karolherbst[d]: right
22:29 mhenning[d]: but anyway yes, still in the middle of reverse engineering this
22:30 karolherbst[d]: my understanding is that the warpsync just blocks other threads from entering the section, and only once all the threads in the provided mask have reached it are they allowed to enter collectively, until they reach an endcollective
22:31 karolherbst[d]: but I think the new thread mask barrier special reg will help to figure this out as well
22:33 karolherbst[d]: but it's not really clear to me what that CALL part in your example is changing...
22:34 mhenning[d]: Yeah, it's not clear to me if the CALL is necessary or not
22:34 karolherbst[d]: except WARPSYNC.COLLECTIVE just allows for more aggressive inlining
22:34 karolherbst[d]: because of more guarantees
22:35 karolherbst[d]: like the cuda functions are often provided as SASS, right? So a compiler doesn't really know if it can inline things safely or not
22:35 karolherbst[d]: at least that's my assumption
22:36 karolherbst[d]: but yeah dunno...
22:38 karolherbst[d]: anyway, I've seen the shader nvidia generates for the shared memory lowering and it's.... how do I put it: very smart
22:40 mhenning[d]: Is there a shader I can dump to see the equivalent for blackwell?
22:41 karolherbst[d]: uhh would have to ask Mary, I only saw the dumps
22:42 mhenning[d]: Okay, is the existing dump anywhere I can see it?
22:42 karolherbst[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1481422207874891837/snippet1.asm?ex=69b34171&is=69b1eff1&hm=121a99d617c392c7a0c0350a4459d8f36cd337e0d6e273cb81b331cbde74695c&
22:43 karolherbst[d]: there was another one somewhere
22:44 karolherbst[d]: but they have a fast path based on uniform ops
22:44 karolherbst[d]: and the code there checks for it and falls back to the non uniform one you see there
22:44 karolherbst[d]: I think that's an atomic add there
22:45 mhenning[d]: yeah, sounds a bit like nir_opt_uniform_atomics
22:46 karolherbst[d]: the uniform one is counting the active threads
22:46 karolherbst[d]: multiplies the addend by the thread count and does a single load + store pair without looping
22:47 karolherbst[d]: mhenning[d]: yeah.. that was my thought as well
22:47 karolherbst[d]: I was wondering if we should use that
22:47 karolherbst[d]: for mesh shaders
22:47 mhenning[d]: yeah, might be worth it
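The uniform fast path described above (count the active threads, scale the addend, do one load + store pair) can be sketched as a small lane-level simulation. This is only an illustration in Python, not the shader nvidia generates and not what `nir_opt_uniform_atomics` emits; `active_lanes`, `looped_add` and `uniform_add` are hypothetical names, and the sketch assumes the address and the addend are the same for every active lane and that no lane needs the old value back.

```python
WARP_SIZE = 32


def active_lanes(mask):
    """Return the lane indices whose bit is set in a 32-bit thread mask."""
    return [lane for lane in range(WARP_SIZE) if (mask >> lane) & 1]


def looped_add(shared, addr, addend, mask):
    """Slow path: every active lane does its own load + store pair."""
    pairs = 0
    for _ in active_lanes(mask):
        shared[addr] = (shared[addr] + addend) & 0xFFFFFFFF
        pairs += 1
    return pairs


def uniform_add(shared, addr, addend, mask):
    """Fast path (assumes uniform addr and addend): one load + store pair."""
    count = len(active_lanes(mask))
    shared[addr] = (shared[addr] + addend * count) & 0xFFFFFFFF
    return 1


slow, fast = [10], [10]
slow_pairs = looped_add(slow, 0, 3, 0b1011)   # lanes 0, 1 and 3 active
fast_pairs = uniform_add(fast, 0, 3, 0b1011)
print(slow[0], fast[0], slow_pairs, fast_pairs)  # 19 19 3 1
```

Both paths leave the same value in memory; the fast one just gets there with a single read-modify-write instead of one per active lane.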
22:48 karolherbst[d]: though MATCH allows us to optimize more, so there is that
22:49 mhenning[d]: I mean, match is just a nir_intrinsic_vote_ieq
22:49 karolherbst[d]: but anyway.. it's super cursed that nvidia didn't add proper atomics here 🙃
22:49 karolherbst[d]: but they also wanted to implement it only with 32 threads..
22:49 karolherbst[d]: mhenning[d]: not quite
22:50 mhenning[d]: match.all is
22:50 karolherbst[d]: right
22:50 karolherbst[d]: match.any tells you which threads also have the same value
22:51 mhenning[d]: I'm well aware
22:58 karolherbst[d]: anyway.. kinda feels like with match.any we could have more optimized code there, but dunno.. atomics are funky
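The match.all / match.any distinction discussed above can be sketched with a lane-level simulation in Python. The semantics modelled here follow CUDA's `__match_all_sync` and `__match_any_sync` as documented; `grouped_add` is a hypothetical illustration of how match.any could group lanes by address so that each distinct address gets one load + store pair. It is an assumption about what "more optimized code" might look like, not anything NVK or nvidia is known to do.

```python
WARP_SIZE = 32


def lanes(mask):
    """Return the lane indices whose bit is set in a 32-bit thread mask."""
    return [lane for lane in range(WARP_SIZE) if (mask >> lane) & 1]


def match_any(values, mask):
    """Per active lane: mask of the active lanes holding the same value."""
    out = {}
    for lane in lanes(mask):
        out[lane] = sum(1 << o for o in lanes(mask) if values[o] == values[lane])
    return out


def match_all(values, mask):
    """(mask, True) if all active lanes agree on the value, else (0, False)."""
    active = lanes(mask)
    same = all(values[lane] == values[active[0]] for lane in active)
    return (mask, True) if same else (0, False)


def grouped_add(shared, addrs, addend, mask):
    """Hypothetical: the lowest lane of each match_any group adds for its peers."""
    pairs = 0
    for lane, peers in match_any(addrs, mask).items():
        if lane == lanes(peers)[0]:
            shared[addrs[lane]] += addend * len(lanes(peers))
            pairs += 1
    return pairs


addrs = [0, 0, 1, 0] + [0] * 28      # lanes 0, 1, 3 hit address 0; lane 2 hits 1
groups = match_any(addrs, 0b1111)
shared = [0, 0]
pairs = grouped_add(shared, addrs, 5, 0b1111)
print(bin(groups[0]), bin(groups[2]), match_all(addrs, 0b1111), shared, pairs)
```

With match.all the whole warp either agrees or falls back; with match.any a partially divergent warp still collapses to one update per distinct address.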
23:02 marysaka[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1481427022331904225/debug.mesh.glsl?ex=69b345ed&is=69b1f46d&hm=5c6c550e2c44223e10611752066572761b3e72971c82ded1463e577c4834aeb0&
23:02 marysaka[d]: mhenning[d]: This one has the pattern for the second atomic (line 52), it's from `dEQP-VK.glsl.atomic_operations.add_unsigned_mesh_shared`
23:04 marysaka[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1481427529326657718/shader_data.asm?ex=69b34666&is=69b1f4e6&hm=ab3654300a5c64414c69418a1afb6a13fbefae6f441828b700958e2c29d280f7&
23:04 marysaka[d]: output for SM86
23:04 marysaka[d]: for reference the RED at the start is for the mesh invocations counter
23:06 mhenning[d]: do you use nvdump to generate the asm?
23:06 marysaka[d]: I have my own tooling but it's essentially the same
23:06 marysaka[d]: https://github.com/marysaka/usami
23:07 marysaka[d]: it's basically split, with an http server that outputs the shader binary etc
23:07 karolherbst[d]: we really should use nir_opt_uniform_atomics 😄
23:07 karolherbst[d]: I wonder if nvidia does it in more places
23:07 karolherbst[d]: or if they even do it with normal atomics
23:08 marysaka[d]: karolherbst[d]: I was using it initially but wasn't sure how to handle detection of it being done or not for certain atomics
23:08 karolherbst[d]: but I don't think I ever saw nvidia optimize uniform atomics?
23:08 karolherbst[d]: I wonder if the hw is smart enough
23:08 mhenning[d]: karolherbst[d]: I think I've seen cuda generate a similar pattern, although I forget for what
23:08 karolherbst[d]: yeah.... not sure myself tbh
23:09 marysaka[d]: marysaka[d]: Ah I just realized that nvdump would change nvdisasm output so I guess my thing is different 😄
23:11 marysaka[d]: but if you are interested, shader-dump is the http server and compile_glsl script is the boilerplate that will do various things
23:11 marysaka[d]: hardcoded of course to various host configuration I have here but I should probably make that more generic 😄
23:12 mhenning[d]: Okay, I'll probably try nvdump first but will look at the other repo if I have issues 😛
23:14 marysaka[d]: yeah no need to poke with my thing it's basically the same but remote 😄
23:15 marysaka[d]: One thing that would be nice to add to nvdump would be the ability to set the shader flags so we could test DGC indirect or without task shaders (my tool has a very dirty python parser that grabs the info from the start of the file and passes it to the http request)