00:02karolherbst[d]: I have a shader here whose `total_gpr` value is... 922?!?
00:07karolherbst[d]: `SLM size: 2728` 💀
00:08karolherbst[d]: yeah.. if we are hitting shaders like that in games, no wonder why some things are slow 🙃
00:09karolherbst[d]: something really silly seems to be going on there
00:11karolherbst[d]: what is this shader 🙃
00:16karolherbst[d]: okay okay...
00:17karolherbst[d]: can we move load_globals closer to their usage? 🙃
00:17karolherbst[d]: this shader here loads like 4k of data per thread and then does some really deep if/else fuckery
00:18karolherbst[d]: and then most of the loaded values are just never used
00:22karolherbst[d]: ...
00:22karolherbst[d]: Instruction count: 9720 -> 7774
00:22karolherbst[d]: Static cycle count: 150265 -> 139526
00:22karolherbst[d]: Max warps/SM: 8 -> 12
00:22karolherbst[d]: Spills to mem: 670 -> 0
00:22karolherbst[d]: Spills to reg: 4
00:22karolherbst[d]: Fills from mem: 726 -> 0
00:22karolherbst[d]: Fills from reg: 4
00:22karolherbst[d]: Num GPRs: 255 -> 142
00:22karolherbst[d]: SLM size: 2728 -> 24
00:23karolherbst[d]: 🙃
00:23karolherbst[d]: behold my magic trick
00:38karolherbst[d]: looks like this is a bit controversial in other shaders..
00:46karolherbst[d]: mhhhh...
01:09mhenning[d]: karolherbst[d]: not if there's a write in-between
01:10mhenning[d]: in theory the scheduler could do some of that but it doesn't have the logic for it yet
01:10karolherbst[d]: mhenning[d]: yeah.. nir_opt_sink only does it with reordable
01:10karolherbst[d]:but
01:10karolherbst[d]: somehow that's worse on avarage
01:10karolherbst[d]: Totals from 28494 (7.51% of 379184) affected shaders:
01:10karolherbst[d]: CodeSize: 833480368 -> 846908272 (+1.61%); split: -0.11%, +1.72%
01:10karolherbst[d]: Number of GPRs: 2273590 -> 2394544 (+5.32%); split: -1.03%, +6.35%
01:10karolherbst[d]: SLM Size: 158752 -> 169152 (+6.55%); split: -15.60%, +22.15%
01:10karolherbst[d]: Static cycle count: 1347166251 -> 1363978770 (+1.25%); split: -0.56%, +1.81%
01:10karolherbst[d]: Spills to memory: 60785 -> 67217 (+10.58%); split: -13.45%, +24.03%
01:10karolherbst[d]: Fills from memory: 60785 -> 67217 (+10.58%); split: -13.45%, +24.03%
01:10karolherbst[d]: Spills to reg: 86080 -> 334782 (+288.92%); split: -0.86%, +289.78%
01:10karolherbst[d]: Fills from reg: 65353 -> 282393 (+332.10%); split: -0.65%, +332.75%
01:10karolherbst[d]: Max warps/SM: 837404 -> 816416 (-2.51%); split: +0.84%, -3.35%
01:10mhenning[d]: yeah, I think I've tried nir_opt_sink before
01:11karolherbst[d]: but it does really help in this one shader.. so I'm wondering if we can tweak it a bit somehow...
01:11karolherbst[d]: I also don't see why some of the shaders actually get worse..
01:11karolherbst[d]: well except blaming RA for being weird
01:11karolherbst[d]: in one of the ones that regress a lot, the load_globals just get moved a bit further down
01:12karolherbst[d]: and no idea why that would double live values really
01:12karolherbst[d]: 4 different addresses were used...
01:12karolherbst[d]: for like 30 loads
01:13karolherbst[d]: and like 40 -> 80 regs
01:14karolherbst[d]: ohh wait.. I missed a couple of thinks.. it's indeed a lot of loads with different addresses mhh...
01:14karolherbst[d]: yeah I guess if they load 4 bytes, keeping the 8 byte addresses alive is worse...
01:15karolherbst[d]: I do wonder if we want to add some heuristics there
01:16karolherbst[d]: the shader where it helps a lot is doing this: https://gist.githubusercontent.com/karolherbst/fe2b58b71dded35359cf07b4dc53e443/raw/bec296e7bf0fb907362859530c06c1ebcb9d2c63/gistfile1.txt
01:16karolherbst[d]: you might see why that helps there 🙃
01:17karolherbst[d]: maybe only sink if the base address is used in other loads? Only sink if the load bigger >= address?
01:17karolherbst[d]: I should play around with that tomorrow
01:19karolherbst[d]: yeah... I think that would work out... if the address is smaller than all collective uses are loading, sink it, if the address is bigger, then don't
01:20karolherbst[d]: so if the address is used only in this single load then don't sink..
01:27karolherbst[d]: ehh maybe I also want to sink the things used by the loads 🙃
01:37karolherbst[d]: ohh I think I figured it out 🙂
01:39karolherbst[d]: yeah.. much better
01:39karolherbst[d]: now even the shader that regressed heavily is better
01:40karolherbst[d]: just had to be smarter using the pass
01:40karolherbst[d]: what killed the stats where the bound check that were left where it was, so the predicate was also causing issues
01:42karolherbst[d]: still running but..
01:42karolherbst[d]: Totals from 13966 (15.56% of 89752) affected shaders:
01:42karolherbst[d]: CodeSize: 386947712 -> 386530288 (-0.11%); split: -0.36%, +0.25%
01:42karolherbst[d]: Number of GPRs: 1128655 -> 1113501 (-1.34%); split: -2.40%, +1.05%
01:42karolherbst[d]: SLM Size: 139628 -> 113836 (-18.47%); split: -18.53%, +0.06%
01:42karolherbst[d]: Static cycle count: 824277518 -> 832221772 (+0.96%); split: -0.65%, +1.62%
01:42karolherbst[d]: Spills to memory: 59705 -> 51469 (-13.79%); split: -14.04%, +0.24%
01:42karolherbst[d]: Fills from memory: 59705 -> 51469 (-13.79%); split: -14.04%, +0.24%
01:42karolherbst[d]: Spills to reg: 59678 -> 59721 (+0.07%); split: -4.09%, +4.16%
01:42karolherbst[d]: Fills from reg: 46018 -> 45605 (-0.90%); split: -4.05%, +3.15%
01:42karolherbst[d]: Max warps/SM: 414288 -> 416060 (+0.43%); split: +1.40%, -0.97%
01:43karolherbst[d]: I'll let it run and figure something out tomorrow, but this looks much much better now
09:23marysaka[d]: mhenning[d]: for the ZCULL issue, I'm not too sure if you saw this but we need to handle clearing in some image transitions cases (when the old layout is undefined for example but I am not too sure if there isn't any other cases)
09:26marysaka[d]: I have been doing a little bit of testing with barriers on the blobs and noticed that (nothing too fancy, just trying to understand what things we have that could avoid a WFI apart from pixel barriers)
12:44karolherbst[d]: okay.. it looks like we are in a good position to enabline opt_sink indeed
12:45karolherbst[d]: having a few shaders that regress in weird ways, but that all seems solvable
12:45karolherbst[d]: that's gonna boost perf by a lot :ups
13:50karolherbst[d]: mhenning[d]: wanna look at this MR? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40897 It's already reviewed, but you had some opinions on that matter (copy prop for ftz values)
14:14karolherbst[d]: mhh but also.. the static cycle count the postpass scheduler is generating isn't reliable 😢 like it accounts for cycles multiple times due to dependencies and things
14:14karolherbst[d]: guess I just ignore that metric
14:15karolherbst[d]: like I have a shader that doesn't really change in regards to cycle counts (no loops, no increase in latencies overall, etc..) and the count increases from 210 to 320
14:16karolherbst[d]: I think the issue is the `ready_cycle` thing that can have big values, and then the block returns something bigger than the block itself needs
15:33karolherbst[d]: This is such a can of worms 🙃
15:45esdrastarsis[d]: I think this issue and the batman origins issue should be reopened: https://gitlab.freedesktop.org/mesa/mesa/-/work_items/14776
15:47karolherbst[d]: esdrastarsis[d]: tried with most recent proton experimental?
15:54esdrastarsis[d]: karolherbst[d]: I tried with most recent ProtonGE on Heroic (I have the Epic Games version)
15:59snowycoder[d]: orowith2os[d]: I'm not really, right now I'm working around panfrost/mali, but I can help!
16:00snowycoder[d]: If you're searching for Kepler work, the scheduling bits are most likely worng, I copied them from the old compiler but they aren't complete at all
16:02karolherbst[d]: esdrastarsis[d]: I wonder if there is a difference between the Epic game version and the steam one?
16:04phomes_[d]: I can test batman origins tonight
16:18mhenning[d]: marysaka[d]: Yeah, I have that part mostly figured out already. Out of curiosity, what does the blob set the zcull region to on layout transitions? I've just been clearing it to all zeros but I don't actually know what that representation means.
16:18mhenning[d]: karolherbst[d]: Yes, can you wait for me to take a look at it on monday?
16:20karolherbst[d]: mhenning[d]: sure!
16:29marysaka[d]: mhenning[d]: From what I saw it was all 0s
16:30marysaka[d]: I also saw some FLUSH_AND_INVALIDATE_ROP_MINI_CACHE but I'm not too sure if it was related to it
17:40karolherbst[d]: phomes_[d]: what's up with the surge2? I see that my spill MR speeds it up a lot, but I'm also concerned why it was n/a in previous runs?
17:40karolherbst[d]: maybe I'm papering over an issue there?
18:16phomes_[d]: it was n/a because the game broke with a steam runtime update. It was fixed in a new update today. I will recheck the baseline and add results for the other MR's
18:42phomes_[d]: updated now
18:47phomes_[d]: 40903 and 40904 each take it to 64 fps. Combining the two MR's however also gives 64 fps
18:48karolherbst[d]: and the base is still at 59?
18:48karolherbst[d]: okay.. nice
18:48karolherbst[d]: kinda hoped it would help x4 though...
18:49phomes_[d]: yes baseline is still at 59
18:50karolherbst[d]: but I'm still reworking 40904 and I think I'll have a better version soonish
18:50karolherbst[d]: phomes_[d]: what might help if you could create a fossil for x4
18:50karolherbst[d]: then I could maybe take a look and see if something stands out
18:51karolherbst[d]: but +10% in the surge 2 with no known regression is kinda nice
18:54phomes_[d]: esdrastarsis[d]: Batman Origins still crash here. With current mesa main and proton experimental
18:58karolherbst[d]: karolherbst[d]: or well.. create fossils for all the benchmarks you are running, then I have an easier time to tell how likely those are impacted by patches
19:15mhenning[d]: karolherbst[d]: I'm not sure that's actually useful. We're not trying to optimize for these games specifically, we're trying to optimize general performance. Specifically trying to align shaderdb with other testing makes that testing a less useful indicator
19:15mhenning[d]: In ML there's this idea of a train/validation/test split. You don't want your validation and test sets to overlap because that biases predictions from the test set.
19:15karolherbst[d]: mhh right...
19:18karolherbst[d]: anyway, I've finished the opt_sink work for loads and it has quite nice results now:
19:18karolherbst[d]: Totals from 128823 (10.62% of 1212873) affected shaders:
19:18karolherbst[d]: CodeSize: 2867289392 -> 2812664720 (-1.91%); split: -2.08%, +0.17%
19:18karolherbst[d]: Number of GPRs: 8538611 -> 8467897 (-0.83%); split: -1.25%, +0.43%
19:18karolherbst[d]: SLM Size: 426004 -> 399628 (-6.19%); split: -6.22%, +0.03%
19:18karolherbst[d]: Static cycle count: 3056398965 -> 3100050134 (+1.43%); split: -0.41%, +1.84%
19:18karolherbst[d]: Spills to memory: 65870 -> 57100 (-13.31%); split: -13.83%, +0.51%
19:18karolherbst[d]: Fills from memory: 65870 -> 57100 (-13.31%); split: -13.83%, +0.51%
19:18karolherbst[d]: Spills to reg: 121290 -> 124287 (+2.47%); split: -3.24%, +5.71%
19:18karolherbst[d]: Fills from reg: 99831 -> 100900 (+1.07%); split: -3.12%, +4.19%
19:18karolherbst[d]: Max warps/SM: 4316376 -> 4336404 (+0.46%); split: +0.76%, -0.29%
19:18karolherbst[d]: phomes_[d]: you could do another round of testing. Mainly want to verify it's not removing any perf gains
19:34karolherbst[d]: wait a second.....
19:34karolherbst[d]: mhhh
19:35karolherbst[d]: wasn't there some funny relation between sin and cos?
19:36karolherbst[d]: mhh guess there is no simple way to calculate `cos(x)` based on `sin(x)`
19:39mhenning[d]: I mean there's the pythagorean theorem thing `cos(x) = sqrt(1 - sin(x)^2)`
19:41karolherbst[d]: I doubt that's any cheaper than just executing `sin` and `cos` tho...
19:43karolherbst[d]: yeah... looking at this shader, the only obvious thing left is `R2P` and `P2R`...
19:43karolherbst[d]: but not sure what's a good way to wire that up
19:44karolherbst[d]: RA spills more than one predicate at once in this shader, and if I'd pick a "pred spilling reg" I could spill up to 32 predicate values with a single register
19:46karolherbst[d]: it can even spill multiple preds at once
19:47karolherbst[d]: but especially `R2P` has weird semantics
19:49karolherbst[d]: so `R2P` writes the entire register file and I can select which byte to read from from the input register, and I can also specify which bits to write to, but it can't swizzle the bits
19:49karolherbst[d]: such a pain
19:52mhenning[d]: karolherbst[d]: oh yeah, it's not useful for us as a compiler optimization
19:53mhenning[d]: karolherbst[d]: yeah, I looked at that a bit and couldn't really think of a good way to use it
19:55karolherbst[d]: we could use it with lots of compiler smart, because we'd have to ensure to unspill the predicate into the same index, but...
19:55karolherbst[d]: I'm sure there are more lower hanging fruits 🙃
19:56mhenning[d]: yeah, I don't consider it a priority
19:57phomes_[d]: karolherbst[d]: done. Perf is the same. There is a build warning about an unused var though
19:57karolherbst[d]: kinda annoying they got rid of `ISET`...
19:59karolherbst[d]: phomes_[d]: thanks!
20:00karolherbst[d]: anyway, good to know that some games are slow due to heavy spilling 🙃
20:01esdrastarsis[d]: phomes_[d]: With DX11 features or without them?
20:03phomes_[d]: I just launch it from steam with no specific options
20:24karolherbst[d]: anyway... maybe I should stop, already enough MRs open 🙃