02:01 karolherbst[d]: okay.. `dEQP-VK.reconvergence.maximal.compute.nesting4.7.38` is a test where `nir_convert_to_lcssa` makes some stuff divergent that was convergent before... mhh but in the worst way
02:05 karolherbst[d]: I wonder if I should enable `nir_opt_licm` 🙃
02:05 karolherbst[d]: because that actually fixes a bunch of cases
02:05 karolherbst[d]: well with what I'm doing here it actually makes sense..
02:06 karolherbst[d]: but also hides bugs I hate it 🙂
02:07 karolherbst[d]: well not bugs, because legalize would clean it all up, it's really just nir validation at this point
02:08 Mangix: does nouveau work for a T1000?
02:09 Mangix: I haven't used nouveau since the GTX 980
02:10 karolherbst[d]: it should
10:34 karolherbst[d]: mhh maybe I should just give up on validating the uniform source, because various passes always have very smart ways to fuck it up for me lol
10:34 karolherbst[d]: or I do the thing after lower_cf..., but that's potentially very invalid
10:35 karolherbst[d]: mhhh copy_prop is legal after `nir_convert_to_lcssa`, so I should at least be allowed to do it after that
10:37 karolherbst[d]: but that's after int64 lowering, so that's not great 🥲
10:37 karolherbst[d]: though should be fine?
10:44 karolherbst[d]: `nak_nir_rematerialize_load_const` never reports progress btw
11:15 karolherbst[d]: okay okay...
11:15 karolherbst[d]: it's not `nir_convert_to_lcssa`
11:15 karolherbst[d]: it's `nak_nir_rematerialize_load_const`
11:15 karolherbst[d]: ohhhhhh
11:16 karolherbst[d]: ahhhh
11:16 karolherbst[d]: I found a bug 🙃
11:16 karolherbst[d]: `nak_nir_rematerialize_load_const` preserves `nir_metadata_control_flow | nir_metadata_divergence` but it totally doesn't preserve `nir_metadata_divergence` actually
11:17 karolherbst[d]: okay but what's going wrong here...
11:18 karolherbst[d]: `div 32 %34 = load_const (0x00000000)`
11:19 karolherbst[d]: even removing the preserve is still causing issues...
11:20 karolherbst[d]: ohh right..
11:20 karolherbst[d]: because it doesn't report progess at all
11:20 karolherbst[d]: ahh
11:24 karolherbst[d]: maybe a simple fix would be to just mark the consts as convergent and move on...
12:55 karolherbst[d]: okay soo.. `nak_nir_rematerialize_load_const` turns certain blocks divergent, because how it rematerializes constant sources of phis...
13:35 karolherbst[d]: gfxstrand[d]: is this transformation intended behavior? And if not, any ideas how to improve it? https://gist.github.com/karolherbst/ae8489787005675451a9daea7b225931
13:56 karolherbst[d]: okay.. I think I have a fix for this...
13:58 karolherbst[d]: looks much better (b is with my fix) https://gist.github.com/karolherbst/33f68c6c07586d8cf2bcd253
13:59 karolherbst[d]: I hope the stats agree...
14:19 karolherbst[d]: okay the fix regresses the stats a bit ...
14:24 karolherbst[d]: but I also have a few shaders that are helped significantly.. I'm wondering if I can come up with better heuristics there..
14:55 gfxstrand[d]: karolherbst[d]: Not sure what you think is wrong there
14:56 karolherbst[d]: gfxstrand[d]: it turns certain blocks into divergent ones, which seems preventable. And that makes it more difficult to do the GPR + UGPR thing for load/stores
14:56 karolherbst[d]: because that happens before that pass
14:56 karolherbst[d]: and then I have the situation where I end up with two divergent sources
14:56 gfxstrand[d]: wait, what? That shouldn't be happening
14:56 karolherbst[d]: well it does
14:56 karolherbst[d]: compare the both
14:57 karolherbst[d]: tldr: phi selects constants, the constants gets moved from convergent into divergent blocks
14:57 karolherbst[d]: and then it propagates
14:57 karolherbst[d]: and makes the blocks divergent
14:57 karolherbst[d]: karolherbst[d]: here you see how I got a better result with a little tweak
14:57 gfxstrand[d]: Uh... Divergence analysis should know better
14:58 karolherbst[d]: but that is still worse on avarage
14:58 gfxstrand[d]: karolherbst[d]: I can't see that. Your gists are locked or something
14:58 karolherbst[d]: heh..
14:58 karolherbst[d]: it's gone?
14:58 marysaka[d]: it's gone yeah
14:58 gfxstrand[d]: 404
14:59 karolherbst[d]: https://gist.github.com/karolherbst/33f68c6c07586d8cf2bcd25370b62fa6
14:59 karolherbst[d]: and fix my fix: ae8489787005675451a9daea7b225931
14:59 karolherbst[d]: https://gist.github.com/karolherbst/ae8489787005675451a9daea7b225931
14:59 gfxstrand[d]: karolherbst[d]: That's what I can't see. I can see the gists
15:00 gfxstrand[d]: Oh, you don't have a commit to look at
15:01 karolherbst[d]: ahh no, it's not a commit 😄
15:01 karolherbst[d]: fix is just a `if (block->divergent && !succ->divergent) continue;` thing
15:01 karolherbst[d]: anyway.. kinda wondering what to do about that
15:03 gfxstrand[d]: Is it? This happens before we lower control-flow
15:03 gfxstrand[d]: The convergence information in a looks wrong
15:04 gfxstrand[d]: It claims the 2nd loop is convergent but there's a continue in a divergent block.
15:04 karolherbst[d]: I was force requesting the metadata to print it, and I wouldn't be surprised if some passes forget to set it
15:05 karolherbst[d]: *invalidate it
15:05 karolherbst[d]: though if it runs it should be properly generated.. dunno
15:05 gfxstrand[d]: What shader is this?
15:05 karolherbst[d]: `dEQP-VK.graphicsfuzz.cov-large-int-array-nested-loops-set-ivec-index-component-sum`
15:05 gfxstrand[d]: Any patches I need on top or is main okay?
15:06 karolherbst[d]: I have turned the progress thing to `return nir_progress(true, impl, nir_metadata_control_flow);` in `nak_nir_rematerialize_load_const`, because it turns out it caused those issues and never reported progress
15:07 karolherbst[d]: but besides that, I'm on kinda main
15:07 karolherbst[d]: `177e712728dc613e8e1d72d9f94a94e865f7482b` is what I'm at atm
15:08 gfxstrand[d]: Oh, yeah, progress is clearly wrong.
15:12 karolherbst[d]: I first blamed `nir_convert_to_lcssa` for things, because it changed divergency, because after `nak_nir_rematerialize_load_const` I never got something printed
15:12 karolherbst[d]: but also: `if (OPT(nir, nak_nir_rematerialize_load_const)) OPT(nir, nir_opt_dce);` 🙃
15:15 gfxstrand[d]: Okay, one is easy to fix: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40464
15:19 karolherbst[d]: I almost started to fix this and... somehow I didn't see this easy solution to this problem 😄
15:26 gfxstrand[d]: wait... I see what's going wrong. 🤦🏻‍♀️
15:27 gfxstrand[d]: :xenia_sob:
15:27 karolherbst[d]: now I'm curious
15:27 karolherbst[d]: I had an idea, but that idea would be painful to implement, so I'm wondering what you came up with 😄
15:27 gfxstrand[d]: `con 1 %9 = phi b1: %6 (false), b10: %5 (true), b14: %5 (true)`
15:28 karolherbst[d]: yeah.. the constant rematerializes in a divergent block
15:28 gfxstrand[d]: That starts off with the outside value false and all inside values true. So it doesn't add to divergence.
15:28 gfxstrand[d]: But then we make it two different SSA values which are both true
15:28 karolherbst[d]: ohhh
15:28 karolherbst[d]: big sad
15:29 gfxstrand[d]: Yeah...
15:29 karolherbst[d]: so I was considering to find the first common predecessor of both paths to place it there...
15:29 karolherbst[d]: guess that would also solve that issue
15:30 gfxstrand[d]: I think we just need to be more careful with phis somehow
15:31 gfxstrand[d]: Most of the time, putting the load_const in the same block as the use is harmless
15:31 gfxstrand[d]: It can never add to divergence if you do that.
15:31 gfxstrand[d]: But phis are special and weird
15:31 karolherbst[d]: right
15:31 karolherbst[d]: my patch did cut instruction count by over 10% in that shader
15:32 karolherbst[d]: I did try skipping it for phis, but that resulted in worse stats on avg
15:34 karolherbst[d]: I think for phis that pass needs to work in reverse, looking from the phi and picking a good block to place the constant, but...
15:34 karolherbst[d]: might as well write a new pass
15:44 gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40465
16:54 karolherbst[d]: okay, let's see if my UGPR branch doesn't regress the CTS with that one 😄
16:54 karolherbst[d]: ~~I also could just remove the `nir_validate` bits~~
17:14 karolherbst[d]: okay.. `dEQP-VK.reconvergence.maximal.compute.nesting4.7.38` might have a different issue with divergency 🙃
17:21 karolherbst[d]: okay.. this is... weird
17:27 karolherbst[d]: okay.. this is sad
17:29 karolherbst[d]: con 32x2 %15 = @ldc_nv (%6 (0x1), %7 (0x0)) (base=16, access=none, align_mul=16, align_offset=0)
17:29 karolherbst[d]: con 64 %36 = pack_64_2x32 %15
17:29 karolherbst[d]: ...
17:29 karolherbst[d]: con 64 %453 = phi b17: %36
17:29 karolherbst[d]: ...
17:29 karolherbst[d]: div 64 %743 = phi b72: %453
17:29 karolherbst[d]: @store_global_nv (%752 (0x100a2), %751, %743) (base=0, access=coherent|writeonly, align_mul=16, align_offset=0)
17:29 karolherbst[d]: gfxstrand[d]: still some brain left for another weird issue? 😄
17:30 karolherbst[d]: mhh it's `nir_convert_to_lcssa` generating this, soo..
17:30 mhenning[d]: that looks right to me, depending on what the control flow is
17:30 karolherbst[d]: well it's convergent prior `nir_convert_to_lcssa`
17:31 karolherbst[d]: so `%751` is supposed to be the divergent source and `%743` the convergent one
17:31 karolherbst[d]: and that's all fine until `nir_convert_to_lcssa` gets ran and changes things
17:31 karolherbst[d]: *run
17:33 karolherbst[d]: https://gist.github.com/karolherbst/2849787aa09f4c09342ff7a2e1b64eda the full thing
17:35 mhenning[d]: It looks like the loop has a divergent break, so %743 being divergent seems right to me
17:37 mhenning[d]: how are you determining if it's divergent before lcssa? Note that testing the variable isn't enough, you need to use nir_def_is_divergent_at_use_block or nir_src_is_divergent (or convert to lcssa)
17:37 karolherbst[d]: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/1762c32e1f295993dc81661bdc8bbd82089d525f
17:37 karolherbst[d]: but yeah..
17:38 karolherbst[d]: I guess I should use the helpers
17:38 karolherbst[d]: `nir_def_is_divergent_at_use_block` sounds like good?
17:38 karolherbst[d]: let's see if those helper fix it
17:39 mhenning[d]: yeah, the `iadd->src[0].src.ssa->divergent` is wrong unless you're already in lcssa
17:39 karolherbst[d]: okay
17:40 marysaka[d]: mhenning[d]: About the divergent threads for shared atomics lowering, I tried to get the NVIDIA compiler in that specific case to see what would be emitted but they emit the same thing in both path, the only sync that happen is that they have a BSSY and BSYNC around both loops but nothing more
17:40 marysaka[d]: The read + operation is also done by all threads
17:42 mhenning[d]: marysaka[d]: Which architecture are you on, again?
17:43 marysaka[d]: Ada
17:43 mhenning[d]: also, did you disassemble sched info? do they do anything special with the yield flag?
17:43 marysaka[d]: that's a good point let me dig that
17:44 mhenning[d]: If you want to send me the test that you did, I'd like to also run it on blackwell to see if it's any different
17:45 mhenning[d]: but yeah, if they don't do anything special on either ada or blackwell then the current lowering is probably fine
17:46 marysaka[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1483521979117076583/debug2.mesh.glsl?ex=69bae502&is=69b99382&hm=fc9c7dde744680f29b12566e058a97476a978cf64a2924bc074570a6253872fb&
17:46 marysaka[d]: mhenning[d]: This is the shader
17:47 marysaka[d]: I just noticed that we never added support in nvdump for mesh/task so let me quickly fix that 🙃
17:49 marysaka[d]: okay pushed a fix for mesh/task in nvdump
17:51 marysaka[d]: oh they yield on every instructions of both path huh mhenning[d]
17:51 marysaka[d]: or actually no I'm not too sure if it's because of the divergence they yield even at the start quite a bit
17:52 karolherbst[d]: uhhh... doesn't seem like `nir_src_is_divergent` is good enough 🥲
17:52 karolherbst[d]: ohh wait... uhm...
17:52 karolherbst[d]: yeah. I have to use `nir_def_is_divergent_at_use_block` because I change the block it's used.. duh
17:55 karolherbst[d]: okay.. that works
18:05 marysaka[d]: marysaka[d]: discussed with karol a bit and well yield is unrelated to the warp scheduling
18:09 karolherbst[d]: `Pass: 197889, ExpectedFail: 25, Skip: 207586, Duration: 12:25, Remaining: 1:16:40` looking good so far
18:29 mhenning[d]: marysaka[d]: Not sure what that means - the yield bit should prevent scheduling other threads which is very relevant here
18:29 karolherbst[d]: there is no yield bit tho
18:30 karolherbst[d]: at least not since volta
18:30 mhenning[d]: sure, nvidia calls it something different. that's not the point
18:30 karolherbst[d]: it's not a bit either way
18:30 karolherbst[d]: it's part of the wait count and it's an enum
18:31 mhenning[d]: okay? the enum is still relevant
18:31 karolherbst[d]: it's a perf hint in the end anyway, and not required for correctness
18:32 mhenning[d]: in what way? you mean they're allowed to switch anyway?
18:32 karolherbst[d]: however and this is important, the values with the high bit set can only wait up to 11 cycles
18:32 karolherbst[d]: not 15
18:32 karolherbst[d]: afaik, yes
18:33 karolherbst[d]: I should probably look into this, because there are other performance implications in all of this
18:35 karolherbst[d]: e.g. registers can't be reused
18:35 karolherbst[d]: so even if we'd set the .reuse flags, it wouldn't do anything
18:36 mhenning[d]: right, we'd need to start setting yield if we wanted to use .reuse
18:36 karolherbst[d]: yeah.. I really should throw it into code, but the big question to answer here is: "would it make sense to switch to a different warp?"
18:36 karolherbst[d]: and then set it all
18:37 mhenning[d]: marysaka[d]: Could you post the disasm?
18:37 karolherbst[d]: the annoying part is just that it reduces the max wait to 11
18:37 karolherbst[d]: so it requires some changes
18:38 karolherbst[d]: anyway.. maybe I should look into it.. it probably all matters for performance :d
18:41 marysaka[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1483535756419928104/shader_data_nvdump.asm?ex=69baf1d7&is=69b9a057&hm=ea10f1c858e6768f2b7f1e7cd0f221fbcff211681445cbcc65440d9a46d9feca&
18:41 marysaka[d]: mhenning[d]:
18:44 karolherbst[d]: oh another thing.. the special thing happens when the bit isn't set
18:45 mhenning[d]: karolherbst[d]: yeah, I know
18:46 mhenning[d]: marysaka[d]: If I'm reading that right, it looks like there's a mix of .yld on and off between each isbewr and isberd pair, so yeah .yld probably isn't important
18:49 karolherbst[d]: anyway yeah, I have found no indication that this is required for correctness in any way, but that might be different on other generations besides ampere
18:53 karolherbst[d]: not sure I'll ever become friends with opt_licm:
18:53 karolherbst[d]: Totals from 11875 (2.24% of 529141) affected shaders:
18:53 karolherbst[d]: CodeSize: 524447552 -> 525145408 (+0.13%); split: -0.03%, +0.17%
18:53 karolherbst[d]: Number of GPRs: 1397848 -> 1402091 (+0.30%); split: -0.03%, +0.34%
18:53 karolherbst[d]: SLM Size: 153532 -> 156948 (+2.22%)
18:53 karolherbst[d]: Static cycle count: 806764290 -> 805394937 (-0.17%); split: -0.32%, +0.15%
18:53 karolherbst[d]: Spills to memory: 3120 -> 5036 (+61.41%); split: -0.10%, +61.51%
18:53 karolherbst[d]: Fills from memory: 3120 -> 5036 (+61.41%); split: -0.10%, +61.51%
18:53 karolherbst[d]: Spills to reg: 15199 -> 16912 (+11.27%); split: -0.22%, +11.49%
18:53 karolherbst[d]: Fills from reg: 12740 -> 13792 (+8.26%); split: -1.69%, +9.95%
18:53 karolherbst[d]: Max warps/SM: 214012 -> 213132 (-0.41%); split: +0.04%, -0.46%
18:59 mhenning[d]: marysaka[d]: what command are you using to compile that? glslang is giving me: `ERROR: 0:4: 'GL_EXT_mesh_shader' : not supported for current targeted SPIR-V version`
19:00 mhenning[d]: okay nvm I just needed `--target-env vulkan1.4`
19:01 marysaka[d]: yeah was going to say
19:09 gfxstrand[d]: mhenning[d]: Yeah, so part of the problem is that NIR's divergence analysis doesn't have a concept of scopes. It only has divergent at the point of generation. So it can't tell the difference between a `load_const` or a UBO load and a global memory load with a constant address. The former two are going to be globally uniform, regardless of when they were executed while they later is uniform at the
19:09 gfxstrand[d]: load site but might not be once you break out of the loop.
19:14 mhenning[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1483544088211886284/blackwell.disasm?ex=69baf999&is=69b9a819&hm=881184277d6f89427a78c2bab2b7d2346c75f9ffa65992c342abfb9b0cebff70&
19:14 mhenning[d]: marysaka[d]: Okay, blackwell looks pretty similar, so I guess we don't need to do anything special
20:14 karolherbst[d]: I think the disaster GPR + UGPR MR is ready, passes the CTS on my TU102 https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39384
23:09 mhenning[d]: I have a hacky version of the root_table stuff working, although it doesn't seem to help the games I've tried so far https://gitlab.freedesktop.org/mhenning/mesa/-/tree/root_table2?ref_type=heads
23:26 phomes_[d]: I will start testing that and Karol's MR
23:31 phomes_[d]: and btw I started a branch to rebase and fix some small things for dlss. I hide it behind an env var NVK_EXPERIMENTAL_DLSS for now. Maybe it would be nice to to land it and find/fix the remaining issues there?
23:31 phomes_[d]: https://gitlab.freedesktop.org/phomes/mesa/-/tree/dlss-rebase?ref_type=heads
23:55 mhenning[d]: Yeah, it might be reasonable to land it behind an env var