00:31 karolherbst[d]: okay.. uhm.. I think I have another perf regression on main 🥲
00:45 karolherbst[d]: ahh nvm.. I messed up
00:53 karolherbst[d]: ohhhh I have a fun idea...
00:55 karolherbst[d]: noooot again...
00:56 karolherbst[d]: I think my shared occupancy code is slightly wrong still 🥲
00:57 karolherbst[d]: ohh no
00:57 karolherbst[d]: it's fine
01:05 karolherbst[d]: okay.. this loop executes exactly once: https://gist.githubusercontent.com/karolherbst/25f0dfafafec9406158547570203726e/raw/9cde7ef01a24dd55fab75b12a011f75cf1431944/gistfile1.txt
01:06 karolherbst[d]: or well.. never
01:06 karolherbst[d]: I wonder if uub can help here if `@load_sysval_nv (access=reorderable, base=33, divergent=1)` gets restricted...
01:06 karolherbst[d]: I had a patch for that...
01:15 karolherbst[d]: so one thing that's interesting here is that through shared memory this shader is restricted to 4 warps, but it only uses 96 registers -> 5 warps
01:15 karolherbst[d]: I _wonder_ if it makes sense to teach the prepass about this
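The trade-off being described (shared memory caps the shader at 4 warps even though the register count alone would allow 5) is just a minimum over per-resource limits. A rough sketch in Python; the 16K-registers-per-SM-subpartition figure matches recent NVIDIA parts but is an assumption here, and all constants are illustrative:

```python
def warps_by_regs(regs_per_thread, regfile=16384, warp_size=32):
    # register-file limit: how many warps fit in one SM subpartition
    # (16384 registers per subpartition is assumed, not taken from the log)
    return regfile // (regs_per_thread * warp_size)

def occupancy(regs_per_thread, smem_limit_warps):
    # occupancy is the minimum over all per-resource limits
    return min(warps_by_regs(regs_per_thread), smem_limit_warps)

# the case from the log: 96 registers alone would allow 5 warps,
# but shared memory usage caps the shader at 4
occupancy(96, smem_limit_warps=4)
```

Teaching the prepass about the shared-memory limit would let it stop trading for register pressure once registers are no longer the binding constraint.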
02:03 hatfielde1: yeah i think fmul_pdiv_nv is like fma
02:13 hatfielde1: I'm pretty sure it's commutative in the first 2 operands though, which is fun
02:55 karolherbst[d]: I actually do wonder how common that pattern still is in shaders
03:08 karolherbst[d]: ur6 = iadd3 ur2 -ur6 rZ
03:08 karolherbst[d]: ur8 = iadd3 ur6 -ur3 rZ
03:08 karolherbst[d]: could be: `ur8 = iadd3 ur2 -ur3 -ur6`
03:08 karolherbst[d]: but mhh not sure if we'd optimize towards that if it's not a single use (ur6 that is)
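The fold being described can be sketched as a toy peephole over SSA-style sources. This is illustrative Python, not the actual NAK pass; `fold_iadd3` and the tuple encoding are invented for the example:

```python
# toy model: an iadd3 is a tuple of three (reg, negate) sources; rZ is zero
def fold_iadd3(outer, inner, inner_single_use=True):
    """Fold  t = iadd3(a, -b, rZ); r = iadd3(t, -c, rZ)  ->  r = iadd3(a, -c, -b).

    Assumes the outer add consumes the inner result un-negated in source 0.
    Only legal when t has a single use, and only when there are enough
    free (rZ) slots to absorb the extra source."""
    if not inner_single_use:
        return None
    srcs = [s for s in list(inner) + list(outer[1:]) if s[0] != 'rZ']
    if len(srcs) > 3:
        return None  # doesn't fit in a single iadd3
    return tuple(srcs) + (('rZ', False),) * (3 - len(srcs))

# the pattern from the log: ur6 = iadd3 ur2 -ur6 rZ ; ur8 = iadd3 ur6 -ur3 rZ
inner = (('ur2', False), ('ur6', True), ('rZ', False))
outer = (('ur6', False), ('ur3', True), ('rZ', False))
fold_iadd3(outer, inner)  # folds to iadd3 ur2 -ur6 -ur3
```

The single-use check matters because keeping both the folded and unfolded forms alive would otherwise cost an extra live value instead of saving an instruction.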
03:16 karolherbst[d]: ohh right.. I wanted to do something about `imnmx + r2ur`
03:21 karolherbst[d]: mhhh... `UCLEA`....
03:22 karolherbst[d]: weird thing
03:25 karolherbst[d]: ohh I found something
03:26 karolherbst[d]: we can vectorize `f2f16`
15:12 hatfielde1: How did you find that
15:23 karolherbst[d]: there is F2FP on Ampere+ which is a packed f2f16
15:24 karolherbst[d]: we already support it, but apparently there are a few cases that weren't optimized
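An F2FP-style packed convert is easy to emulate on the CPU for reference; Python's `struct` module speaks IEEE half precision via the `e` format code. Rounding here is fixed to round-to-nearest-even, whereas the hardware op's rounding mode is selectable:

```python
import struct

def f2fp_pack(a, b):
    # emulate a packed f32 -> 2 x f16 convert:
    # low 16 bits from a, high 16 bits from b
    lo = struct.unpack('<H', struct.pack('<e', a))[0]
    hi = struct.unpack('<H', struct.pack('<e', b))[0]
    return lo | (hi << 16)

hex(f2fp_pack(1.0, -2.0))  # 1.0 is 0x3C00 in f16, -2.0 is 0xC000
```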
15:58 orowith2os[d]: hope you don't mind if I poke my head in here.
15:58 orowith2os[d]: how do y'all figure out what to optimize for shaders?
15:58 orowith2os[d]: it *is* allowed to compare the finished output of shader code from prop and nvk, right?
15:59 orowith2os[d]: do you just try to mimic what prop does?
16:09 karolherbst[d]: nah.. most of the things are just obvious
16:10 karolherbst[d]: more or less
16:10 karolherbst[d]: like you see something and you go "huh.. that looks like it could be done better"
16:20 karolherbst[d]: it's kinda like.. you see that something does `c = a + b; d = c + e;`, but you know the hardware can do it in one op: `d = a + b + e`
16:27 orowith2os[d]: karolherbst[d]: so there's not really any point in checking shaders?
16:28 karolherbst[d]: well.. it depends
16:28 karolherbst[d]: sometimes they have ideas you wouldn't come up with
16:28 karolherbst[d]: or use unknown instructions
16:28 orowith2os[d]: I would assume it would be good to also look around for instructions that haven't been touched in NVK yet
16:28 orowith2os[d]: yeah
16:34 glehmann: if you want some easy task regarding new instructions, I think nvidia can support has_fneo_fcmpu and has_ford_funord
16:35 glehmann: and maybe has_bitfield_select using the ternary logic op?
16:36 glehmann: or maybe 16bit texture loads using nir_opt_16bit_tex_image
16:36 karolherbst[d]: there is a lot
17:29 mhenning[d]: glehmann: oh, I have a branch for this one. Never did figure out how to test it though
17:30 mhenning[d]: the 16bit texture loads, that is
17:30 mhenning[d]: For bitfield_select we might already optimize it okay, not sure
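For reference, the standard trick for turning a 3-input boolean function like `bitfield_select` into a single LOP3 immediate is to evaluate it on the canonical constants a=0xF0, b=0xCC, c=0xAA; whether NAK already emits this pattern is exactly what's in question above:

```python
def lop3_lut(f):
    # evaluate the boolean function on the canonical LOP3 constants;
    # the low 8 bits are the hardware's truth-table immediate
    return f(0xF0, 0xCC, 0xAA) & 0xFF

# NIR's bitfield_select(mask, insert, base) = (mask & insert) | (~mask & base)
lop3_lut(lambda a, b, c: (a & b) | (~a & c))  # -> 0xCA, the classic select LUT
```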
17:33 karolherbst[d]: oh there was this issue
17:33 karolherbst[d]: BMSK as well
17:33 gfxstrand[d]: Wow. Since when did DA:TV hit 60 FPS?
17:33 karolherbst[d]: DA:TV?
17:34 karolherbst[d]: are you testing my MR maybe? 😛
17:34 gfxstrand[d]: DragonAge: The Veilguard
17:34 gfxstrand[d]: I am
17:34 karolherbst[d]: well
17:34 karolherbst[d]: 😄
17:34 karolherbst[d]: maybe because of that
17:34 gfxstrand[d]: But I was testing before your MR
17:34 gfxstrand[d]: to get a baseline
17:34 karolherbst[d]: heh
17:34 karolherbst[d]: is compression enabled?
17:34 karolherbst[d]: but yeah...
17:35 gfxstrand[d]: Yeah, It's probably compression
17:36 karolherbst[d]: if it's compute heavy, then prolly something I've been doing
17:36 gfxstrand[d]: It's everything heavy
17:38 glehmann: mhenning[d]: it's not rare in d3d12 games, iirc for example doom dark ages and helldivers
18:05 karolherbst[d]: gfxstrand[d]: anyway, let me know how much my MR does, because it seems to do between 0% and 10% in games 🙃
18:09 karolherbst[d]: well I have some compute stuff that went +35%, but it's doing a lot of memory operations, so that's not really surprising
18:12 gfxstrand[d]: Last time I saw probably 5-10%
18:12 gfxstrand[d]: But it's at 60 FPS now. 🙃
18:13 gfxstrand[d]: It did reduce compile times 20%, though.
18:14 karolherbst[d]: ahh yeah...
18:14 karolherbst[d]: get yourself a 4K@165 display 😛
18:18 gfxstrand[d]: The game has a vsync option
18:19 karolherbst[d]: I wished I could trust those options
18:19 gfxstrand[d]: Okay, 70 FPS without your patches
18:19 karolherbst[d]: I've tested borderlands 3 on ultra settings on my ga102 yesterday.. 22 fps on 4k 🥲
18:19 karolherbst[d]: we are almost there with 4k gaming
18:20 gfxstrand[d]: Star Wars Fallen Order was running at probably 20-25 FPS at 4k on a laptop last week. I'm sure my 4090 could hit 4k@60 easy
18:21 karolherbst[d]: yeah.. I mean I could have reduced the quality settings as well 😄
18:21 karolherbst[d]: but it's stuttering in a very weird way
18:22 gfxstrand[d]: gfxstrand[d]: 70 FPS with your patches. No measurable change. 😢
18:22 karolherbst[d]: maybe more 3D <-> compute WFI nonsense or something
18:22 karolherbst[d]: gfxstrand[d]: yeah...
18:22 karolherbst[d]: lego builders got +10%, but we already know it's very compute heavy
18:23 karolherbst[d]: but faster compiles are nice.. prolly some huge compute shader they run 5 times in total
18:29 gfxstrand[d]: No. It's across the board
18:29 gfxstrand[d]: Using a predicate means we don't have a new basic block for every load. Fewer basic blocks is much nicer on RA.
18:30 gfxstrand[d]: So literally every shader that's using robustness on SSBOs (everything in DXVK/VKD3D) compiles faster.
18:30 gfxstrand[d]: It also means we can schedule loads
18:30 karolherbst[d]: ahh
18:30 karolherbst[d]: fair
18:31 karolherbst[d]: but yeah.. just another data point that our shaders might not be the reason why things are slow 🥲
18:34 karolherbst[d]: ohh right that reminds me.. we have this weird issue with the prepass scheduler 🥲
18:35 karolherbst[d]: sometimes unfortunate things happen and the live value prediction is wrong and we end up with +1 in actual RA
18:35 karolherbst[d]: and then reduce warps/SM
18:35 karolherbst[d]: saw it in a shader that was also running out of preds and upreds
18:36 mhenning[d]: yeah, I've been meaning to get to that. other things have felt higher priority though
18:38 gfxstrand[d]: karolherbst[d]: NVIDIA seems well balanced enough that it all matters somewhere. That just also means we're running out of silver bullets.
18:38 karolherbst[d]: yeah...
18:38 karolherbst[d]: I mean.. for compute it's impressive tho.. 😄
18:39 karolherbst[d]: I still have a couple of address calc things to get done, but that's also like "benefits compute, games? prolly not"
18:39 karolherbst[d]: but.. root constants might help, but... who knows
18:40 karolherbst[d]: ohh SCG will help for some games, so that might be something
18:40 karolherbst[d]: and the tiled rendering thing
18:44 gfxstrand[d]: Tiled rendering probably matters. Root constants definitely do
18:44 karolherbst[d]: mhhhh I wonder about root constants, because their benefit over ubo isn't _that_ big
18:45 gfxstrand[d]: Jeff made it sound significant
18:45 karolherbst[d]: mhhh
18:45 karolherbst[d]: afaik they are pre cached, but not sure if there is more to it
18:46 karolherbst[d]: but that alone could also make a difference
18:47 mhenning[d]: In synthetic benchmarks I've also seen fewer stalls from LOAD_ROOT_TABLE compared to cbufs which also helps
18:47 karolherbst[d]: but they also don't require messing with actual memory, because they are always? updated inline?
18:47 karolherbst[d]: ahh
18:47 karolherbst[d]: mhenning[d]: stalls between jobs? or stalls generally?
18:48 karolherbst[d]: like is "sm hits a cbuf that's not cached, and moves to different warps" a stall here or not?
18:48 mhenning[d]: I mean you can execute more LOAD_ROOT_TABLE commands (between jobs) before preventing the jobs from overlapping
18:49 karolherbst[d]: okay
18:49 karolherbst[d]: but yeah, that sounds significant indeed
18:49 mhenning[d]: I suppose I don't actually know if that's because LOAD_ROOT_TABLE is faster or if it's less likely to wfi
18:49 karolherbst[d]: could be a mix of both..
18:50 karolherbst[d]: root tables don't live in actual memory afaik, and I'm sure that matters
18:50 karolherbst[d]: like they probably allow for more efficient ubo versioning
18:51 karolherbst[d]: like.. ubos don't change that often, but root tables generally change every invocation, so having them available right away is surely making a difference
18:53 karolherbst[d]: I'm just wondering if the difference is _that_ huge
18:53 mhenning[d]: we'll see once I finish implementing it
18:54 karolherbst[d]: nice, looking forward to it
18:55 karolherbst[d]: I might look into shared variable aliasing soon.. I found a shader where we could run 5 warps instead of 4 if we'd alias shared memory properly 🙃 but that's going to be a mess to implement
18:57 orowith2os[d]: glehmann: glehmann: I was just curious
19:06 airlied[d]: karolherbst[d]: there is an MR for nir shared aliasing open
19:07 karolherbst[d]: ohh where
19:08 airlied[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33914
19:08 karolherbst[d]: thanks!
19:08 karolherbst[d]: I want to reflect that in our shader stats and occupancy calculation as well
19:09 karolherbst[d]: because it should be interesting for prepass scheduling
21:09 karolherbst[d]: gfxstrand[d]: oh right.. I was running into a super annoying issue with the UGPR + GPR loads and that is that the last `nir_convert_to_lcssa` invocation can make blocks be divergent that weren't before and it kinda messes up the load store lowering. I wonder if you have any good ideas? I kinda want very predictable behavior here so I don't have to spill the UGPR + GPR combo to an `iadd`. That
21:09 karolherbst[d]: becomes even worse with `ULDC` with a global VA, because it could make the entire usage of it invalid...
21:10 gfxstrand[d]: Right
21:11 gfxstrand[d]: I don't even remember what cases we get where blocks end up extra divergent.
21:12 karolherbst[d]: I know I was running into it with a shader, but I didn't write down which one 🥲 but I kinda want to look into ULDC soonish, because with the bounded one that's gonna prevent the upred -> pred spilling
21:12 karolherbst[d]: as well
21:13 karolherbst[d]: the really painful part about ULDC is that its max width is half of LDG
21:13 hatfielde1: karolherbst[d]: `fmul (fmul a, pdiv) b == fmul_pdiv_nv (a, b, pdiv)` when `(a * pdiv) * b` does NOT go to INF and back, does NOT go to denormal and back (rounding only affects denormal results when multiplying by a power of 2). If FTZ is on, then going to denormals and back will be wrong whether or not there's rounding. If DNZ is on, then I think it's a similar situation. Rounding mode would still cause issues if we are going back
21:13 hatfielde1: and forth from denormals. I can't check right now but if saturate just prevents underflows in denormals then I could see a case where fmul_pdiv is wrong and saves some bits from being clobbered, similarly for the overflow case. I think there may be places where we could mix/match some of these flags in the first/second fmuls to guarantee correctness. But we're getting closer, I think I'm getting closer to writing some kind of
21:13 hatfielde1: table that will tell me all the valid substitutions
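The "goes to denormal and back" hazard is easy to demonstrate on the CPU with doubles (same effect as f32 on the GPU, just at a different exponent range). Here pdiv is taken to be a power-of-two scale, as the message above implies:

```python
a = 1e-300
pdiv = 2.0 ** -70   # power-of-two scale, standing in for the pdiv factor
b = 2.0 ** 70

# a * pdiv lands deep in the subnormal range, so mantissa bits are lost
two_step = (a * pdiv) * b
# reassociated: pdiv * b == 1.0 exactly, so no precision is lost
one_step = a * (pdiv * b)

two_step == a  # False: the subnormal round trip clobbered low bits
one_step == a  # True
```

So the two-fmul sequence and the fused form only agree when the intermediate product stays in the normal range, which is exactly the substitution condition being worked out above.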
21:14 karolherbst[d]: cool!
21:15 karolherbst[d]: so yeah.. I kinda want to know with close to 100% reliability that I can use ULDC on the nir side so it all stays pretty optimal.. though even in the suboptimal situation the stats were good: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40232
21:16 karolherbst[d]: (also need to reverse engineer the caching behavior here..)
21:17 karolherbst[d]: _maybe_ I just convert back to ldg after the last `nir_convert_to_lcssa` and call it a day...
21:17 karolherbst[d]: or I just reorder things..
21:17 karolherbst[d]: dunno..
21:29 mhenning[d]: karolherbst[d]: how does nir_convert_to_lcssa make blocks divergent?
21:30 karolherbst[d]: mhenning[d]: if I remember correctly it's making some phis divergent and they are used as sources
21:30 karolherbst[d]: but I can maybe take a look tomorrow again and see if I find it
21:35 mhenning[d]: that doesn't really make sense to me. I'd expect lcssa to mean fewer divergent variables, not more
21:42 karolherbst[d]: yeah.. dunno.. let me rebase the branch and dig it out
21:48 karolherbst[d]: ohh and also I should wire up `UCLEA`
21:48 karolherbst[d]: which is like `ULEA` except that it also does a bounds check and outputs a predicate 🙃
21:49 karolherbst[d]: so we can have a bounds-checked ULDC in two instructions
21:50 karolherbst[d]: though I guess the same is true for a LDG, because it would take the 64 bit UGPR base address + a 32 bit GPR offset, soo...
21:50 karolherbst[d]: the worst part about ULDC is that, as you might have guessed, the input predicate sets the load to 0 on true, not on false like LDG does
21:52 karolherbst[d]: heh... wait a second...
21:53 karolherbst[d]: `UCLEA` is a 64 bit + 32 bit add
21:54 karolherbst[d]: mhhhh
21:54 karolherbst[d]: I wonder if the 32 bit source needs to be 32 or 64 bit aligned