00:04gfxstrand[d]: karolherbst[d]: So here's a thought... What if they're doing uniform float ops and we aren't? Those registers are free and the ops are a bit faster.
00:06gfxstrand[d]: Depending on the calculation, it could make a lot of difference. They had to have added uniform float ops and extra registers on Blackwell for some reason and good money says that reason was AI.
00:19karolherbst[d]: well, I'm doing my testing on ampere atm
00:19gfxstrand[d]: 😕
00:22karolherbst[d]: so in the shader there are three big differences:
00:22karolherbst[d]: 1. predication
00:22karolherbst[d]: 2. scheduling (which also includes using IMAD instead of SHL/MOV/IADD)
00:22karolherbst[d]: 3. more alu optimizations
00:22karolherbst[d]: 3 is kinda, I dunno.. the loop is back-to-back and there's almost no room for improvement in terms of opts
00:22karolherbst[d]: but there is still stuff outside the loop
00:22karolherbst[d]: I doubt it matters
00:23karolherbst[d]: the gpr+ugpr stuff cut out 200 instructions, static cycle count got reduced by 25% (including nir_opt_licm), perf difference? almost 0
00:24karolherbst[d]: but then the question is why would predication of uniform control flow and scheduling matter much
00:24karolherbst[d]: I think the shader is good enough already
00:24karolherbst[d]: the compute MME patches improve perf by 15% which is kinda massive
00:25karolherbst[d]: maybe the way nvidia uses LAUNCH_DMA + semaphores in the QMD + other things I couldn't identify as related helps achieve higher throughput?
00:26karolherbst[d]: and there is also `.reuse` still... I wonder if that matters at this point.. my gpu does get kinda hot, and reducing load on the register file might indeed help? but then why do the other opts not help with anything
00:28karolherbst[d]: maybe I should check how the benchmark calculates the TFLOPS number 🙃 maybe it's doing something weird which just triggers bad paths or something
00:28karolherbst[d]: benchmarks calculating perf numbers in weird ways isn't even unusual
00:28karolherbst[d]: yeah well.. they use `std::chrono`
00:29karolherbst[d]: mhhh
00:29karolherbst[d]: start_time, `vkQueueSubmit`, `vkQueueWaitIdle`, end_time
00:30karolherbst[d]: maybe I should push it through a profiler and see if there are weirdo stalls
00:30karolherbst[d]: anyway.. that's for tomorrow
01:28gfxstrand[d]: karolherbst[d]: Could be. Depends on what dependencies between compute jobs look like. It's possible they are using semaphores to get more parallelism rather than just doing WFI between things.
02:31gfxstrand[d]: That's on my list of perf things to look at.
02:34gfxstrand[d]: I have similar concerns about copies
03:34airlied[d]: are there any mme opts for turing we can do?
03:38mhenning[d]: there's stuff we can do to rely on the mme less, yes
03:51airlied[d]: commenting out the cs invocations on turing didn't help the benchmark I was playing with at least
04:21gfxstrand[d]: airlied[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13789
04:21gfxstrand[d]: Kind of annoying, though.
04:23airlied[d]: I tried commenting out cs invocs for the coop mat test I was using, and it didn't make any difference
04:28gfxstrand[d]: Yeah. I think we might be stalling anyway
04:31gfxstrand[d]: Hmm... I thought there was something suspicious but I'm not seeing it now.
04:34steel01[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36719
04:34steel01[d]: So, uh. The auto-mod thinks me posting a link to lkml is spam. Can someone poke buttons on my pending post, please?
04:38gfxstrand[d]: I'm not sure who has buttons they can poke. I'm not seeing anything.
04:39steel01[d]: Hmm. I'm not even sure what it was auto-modding on. I put the link in a quote block to prevent the auto-link part, but it still blocked me posting normally. So I had to post it 'for review'.
04:40steel01[d]: Which I presume is a giant black hole that no one looks at in normal day-to-day.
04:40gfxstrand[d]: What's your handle?
04:41steel01[d]: `@webeek1234`
04:41steel01[d]: Yes sorry, my default username on discord isn't what I use for dev. And by the time I noticed I'd forgotten to change it, conversations had already started, so it was too late to change.
04:42gfxstrand[d]: I'm not seeing that as a GitLab handle
04:42steel01[d]: https://gitlab.freedesktop.org/webgeek1234 ?
04:43chikuwad[d]: steel01[d]: ~~there's a typo here~~
04:43steel01[d]: I mean... context, I guess. The header on the pending post says:
04:43steel01[d]: Aaron Kling @webgeek1234 Pending
04:44steel01[d]: Everyone else's name also has the @ prefix.
04:44steel01[d]: Oh.
04:45steel01[d]: Not that difference. I'd blame it on the time being 11:45 PM here, but... that's supposed to be my more active brain time...
04:46gfxstrand[d]: There. I think I just made you a "guest". Maybe that's enough to cut through the filters.
04:46steel01[d]: Alright, lemme try to post again.
04:47steel01[d]: Nope, still got blocked. So I tossed it back in the review queue.
04:47steel01[d]: https://lore.kernel.org/all/CALHNRZ8uHmx3nqpg1-F6RCprDavx3nY55en5gJds54RU8MDR5Q@mail.gmail.com/
04:47steel01[d]: Regarding missing firmware, I've had an open request for seven months to get the missing stuff released to linux-firmware, but it hasn't gotten anywhere. Maybe if it was mentioned on list that this is holding up mesa userspace progress, the priority could be raised.
04:47steel01[d]: This is all I'm trying to post, fwiw.
04:49gfxstrand[d]: Yeah, IDK what's going on. Maybe it's because you started with a link? 🤷🏻♀️ In any case, I suspect people do check the queue eventually.
04:50steel01[d]: I guess we'll see.
05:06airlied[d]: karolherbst[d]: the 60 submits are 10 per round, it does 5 warmup rounds, and one timed one (in non-correctness)
05:06airlied[d]: so maybe they can merge them?
05:07airlied[d]: repeat count is only 1 for correctness
05:10airlied[d]: the most I ever got on my turing was 19TFlops, but I've no idea what branch it was in, but I have the shader that did it
05:13airlied[d]: I've attached the shader log to https://gitlab.freedesktop.org/mesa/mesa/-/issues/12817
05:54airlied[d]: karolherbst[d]: your ldsm-opts branch + membar removed get me 19TF on the same turing
06:22airlied[d]: I think getting rid of those shifts with the i2i64 in between them was something I'd hacked up in my git stash
06:39karolherbst[d]: I get 85 TFLOPs for fp16 and 110 for int on ampere
06:42karolherbst[d]: pushed my branch again
06:43airlied[d]: well the numbers don't mean much, I'm just using this one shader since I know nvidia did 27Tf on it
06:44airlied[d]: with the shader that is linked in the issue
06:45airlied[d]: and it was the first test in the vk_cooperative_matrix_perf which made it easy to hack on 😛
06:46airlied[d]: there looks to still be scope to get more constant values into the ld/st instructions
06:46airlied[d]: instead of them being buried under shl/shr/u2u64
06:46karolherbst[d]: yeah, use my branch
06:46karolherbst[d]: I think the stg at the end could get some constants but...
06:47karolherbst[d]: this part of the shader doesn't matter
06:47karolherbst[d]: the loop gets executed like 1000 times or more
06:47karolherbst[d]: what does help me get perf is to unroll the whole thing 😄
06:48airlied[d]: yeah but NVIDIA don't seem to do that, but if it makes it faster, maybe it's fine :0
06:49karolherbst[d]: mhh well nvidia is also smarter with predication and scheduling
06:50karolherbst[d]: but...
06:52karolherbst[d]: airlied[d]: https://gist.github.com/karolherbst/ad689e7f88e6fd5c44e05c01d973bdea
06:53karolherbst[d]: instruction count 600 -> 10380
06:53karolherbst[d]: that's why I don't think anything outside the loop matters much 😄
06:53airlied[d]: I think we should aim for equiv to NVIDIA then make it faster 🙂
06:54airlied[d]: unrolling when they don't seems like it will make it harder to do comparisons
06:54karolherbst[d]: yeah...
06:54karolherbst[d]: they are better in the loop
06:54karolherbst[d]: less pointless movs
06:54karolherbst[d]: less 64 bit math
06:54airlied[d]: div 32 %60 = iadd %53, %59 (0x40000)
06:54airlied[d]: div 32 %61 = ushr %60, %16 (0x3)
06:54airlied[d]: div 64 %62 = u2u64 %61
06:54airlied[d]: div 64 %63 = ishl %62, %7 (0x4)
06:54airlied[d]: div 32x4 %64 = @load_global_nv (%63, %26) (base=0, access=none, align_mul=16, align_offset=0)
06:55karolherbst[d]: yep
06:55airlied[d]: like it seems like we should be able to get that const into that load
06:55karolherbst[d]: well...
06:55karolherbst[d]: you can't just do that
06:56airlied[d]: while I understand you can't generically, I expect for an align_mul=16 load and a 0x40000 offset, we should be able to
06:56karolherbst[d]: but it's also outside the loop
06:56karolherbst[d]: so it doesn't matter perf wise
06:57airlied[d]: indeed, it's just more of a nice to have
06:57karolherbst[d]: right
06:57karolherbst[d]: I'm more interested in the shifts
06:57airlied[d]: there is one of those in the loop though
06:57karolherbst[d]: I have this pattern inside the loop as well
06:58karolherbst[d]: div 32 %195 = iadd3 %190, %194, %39
06:58karolherbst[d]: div 32 %197 = ushr %195, %196 (0x3)
06:58karolherbst[d]: div 64 %198 = u2u64 %197
06:58karolherbst[d]: div 64 %200 = ishl %198, %199 (0x4)
06:58karolherbst[d]: div 32x4 %201 = @load_global_nv (%200, %26) (base=0, access=none, align_mul=16, align_offset=0)
06:58karolherbst[d]: if I can prove this ishl can be done in 32 bits
06:58karolherbst[d]: then the u2u64 can go
06:59airlied[d]: do we only have 32 bits of const in the load instr?
06:59karolherbst[d]: and the ldg could take a 32 bit offset
06:59karolherbst[d]: airlied[d]: 24
06:59karolherbst[d]: but...
06:59karolherbst[d]: LDG can do a 32 bit gpr + 64 bit ugpr thing
07:00karolherbst[d]: and demoting 64 bit int math to 32 does help
07:00karolherbst[d]: well could help
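The demotion hazard being described (the ishl can only be done in 32 bits if it provably doesn't wrap) can be sketched outside NIR; the values below are illustrative, not taken from the shader:

```rust
fn main() {
    // Pattern from the NIR dump: ishl(u2u64(x), 4) used as a 64-bit address.
    // Demoting the shift to 32 bits is only safe when the bits shifted out
    // of the 32-bit value are known zero (what range analysis has to prove).
    let safe: u32 = 0x0123_4567; // upper bound < 2^28, so x << 4 fits in 32 bits
    assert_eq!((safe as u64) << 4, (safe << 4) as u64);

    let wraps: u32 = 0xF000_0000; // high nibble set: the 32-bit shift loses bits
    assert_ne!((wraps as u64) << 4, wraps.wrapping_shl(4) as u64);
    println!("demotion is safe only when the shifted value still fits in 32 bits");
}
```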
07:00airlied[d]: I think pushing the const closer to the load is worth trying, it might be easier to prove range then
07:00karolherbst[d]: the iadd is also potentially interesting
07:01airlied[d]: and if you do that, some of the other maths might disappear
07:01airlied[d]: or get commoned up further
07:01karolherbst[d]: the issue is `iadds` with those
07:01karolherbst[d]: because if you extract the constant you might change the result
07:01airlied[d]: because the const isn't in there
07:01karolherbst[d]: I already do range analysis and do the opt
07:01karolherbst[d]: but not if I can't prove it's safe
07:01airlied[d]: yeah so somehow we have to work that out, and I think it might even be information you have earlier that gets lost
07:02karolherbst[d]: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/7e4529d1044db5acb36fb03b3be555e568a36233#203b86d7d0bd6811cdfd089faac4ce0552ec4151_4150_4150
07:02karolherbst[d]: that's what got me all the constants into all the other load/stores
07:02karolherbst[d]: airlied[d]: yeah... something like that might be going on
07:03karolherbst[d]: maybe we should set some `nuw` flags in lower cmat 🙃
07:19airlied[d]: do you think we need more in the descriptor lowering?
07:32karolherbst[d]: mhhhh maybe?
07:34karolherbst[d]: I have a silly idea 🙃
07:35karolherbst[d]: like the u2u64 comes from the ones inside `load_store_vec_addr`
07:35karolherbst[d]: the offsets are all 32 bits
07:35karolherbst[d]: but for the derefs it needs to be 64
07:38karolherbst[d]: mhhhh
07:38airlied[d]: oh I kinda remember hacking some of that
07:39karolherbst[d]: I was considering using `ior` instead of `iadd` in a few places
07:40karolherbst[d]: makes it easier to reason about overflows
07:43karolherbst[d]: mhhh
07:44karolherbst[d]: thing is... ior optimizes worse
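For context on why ior is easier to reason about here: when the two operands share no set bits, or and add produce the same result, and or can never wrap. A quick check with made-up values:

```rust
fn main() {
    // If the low bits of `base` are known zero (e.g. it's 16-byte aligned),
    // then `base | off == base + off` for any off < 16, and the `or` form
    // trivially cannot overflow, unlike the `iadd`.
    let base: u32 = 0x4_0000; // low 4 bits known zero
    for off in 0u32..16 {
        assert_eq!(base | off, base + off);
    }
    println!("ior == iadd when the operands share no set bits");
}
```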
07:49airlied[d]: doing all the row/col calcs in 64-bit doesn't seem to make it worse
07:49airlied[d]: and I seem to have less instructions
07:49karolherbst[d]: heh...
07:50karolherbst[d]: that's kinda odd
07:54karolherbst[d]: ohh I might have an idea...
08:05karolherbst[d]: right...
08:05karolherbst[d]: soooo
08:05karolherbst[d]: the problem was that uub can't help in all cases because some of the math depends on the loop counter
08:05karolherbst[d]: airlied[d]: maybe the issue is more obvious this way: https://gist.githubusercontent.com/karolherbst/be58a68be37a3dee2389e0191ba8a4a1/raw/7c12d9268149b55952a307b1f91853f4ad407343/gistfile1.txt
08:06karolherbst[d]: uub won't be able to give us an upper bound for `con 32 %438 = iadd.nuw %437, %87 (0x20)`
08:06karolherbst[d]: the .nuw I've set in loop analysis
08:06karolherbst[d]: loop analysis knows, but not uub
08:06karolherbst[d]: so `div 32 %446 = iadd %444, %445` _might_ wrap because we can't prove it doesn't
08:07karolherbst[d]: there is a `con 1 %439 = ult %438, %34 (0x1000)` to check when to leave the loop
08:07karolherbst[d]: `con 1 %604 = uge %438, %34 (0x1000)` actually
08:08karolherbst[d]: so the upper bound of `%438` is clearly 0xfff
08:08karolherbst[d]: mhhhh
08:09karolherbst[d]: https://gist.githubusercontent.com/karolherbst/be58a68be37a3dee2389e0191ba8a4a1/raw/b675e7370971c2bd2555cc1ba1a989ffefac149e/gistfile1.txt
08:10karolherbst[d]: ehh actually this one: https://gist.githubusercontent.com/karolherbst/be58a68be37a3dee2389e0191ba8a4a1/raw/92b1e9c25f38709315fc8d69e08b19a313567dbb/gistfile1.txt
08:10karolherbst[d]: mhhhhhhhhhhhhhhhhhh
08:10karolherbst[d]: so the block is only entered when %438 is below 0x1000
08:12karolherbst[d]: mhhh I wonder..
09:01karolherbst[d]: okay, got something that might help a bit
09:11karolherbst[d]: mhh it's not helping
09:43karolherbst[d]: okay.. got something that works
09:59karolherbst[d]: ahhhhh
09:59karolherbst[d]: I was able to tell the bound is `0x100000018` 😭
09:59karolherbst[d]: where is a frigging bit I can nuke
10:01karolherbst[d]: the issue is that the range depends on a `ldc_nv` and that's just annoying
10:02karolherbst[d]: yeah.....
10:02karolherbst[d]: it might not be provable at all
10:02karolherbst[d]: `0xfffc0018 + 0x40000 = 0x100000018` is the bound composition
10:05karolherbst[d]: ohhh.. mhhh
10:05karolherbst[d]: I found the bit
10:05karolherbst[d]: ugly...
10:07karolherbst[d]: so there is the bound of `0xfff80000 + 0x1000 = 0xfff81000` where `(0x1000)` is the loop variable, but the value is only used when it's `< 0x1000` but we don't know it there...
10:14karolherbst[d]: I think the loop can be optimized a bit and that's gonna help with that
10:14karolherbst[d]: what a nasty shader
10:16karolherbst[d]: https://gist.githubusercontent.com/karolherbst/f2de647c44148f7c8146e5c71c5f32ed/raw/d71a95f0e42510446399212d45ebdf5010118482/gistfile1.txt
10:17karolherbst[d]: so block 2 and 5 are visited in each iteration unless in the last one
10:17karolherbst[d]: *and not
10:23marysaka[d]: karolherbst[d]: hmm shouldn't those two blocks be merged trivially by nir_opt_if tho?
10:23karolherbst[d]: marysaka[d]: maybe
10:23marysaka[d]: (had something similar on panfrost last week)
10:23karolherbst[d]: there is some math in between
10:23marysaka[d]: ah so it will not merge them
10:24marysaka[d]: if there was nothing it would do it tho
10:24karolherbst[d]: https://gist.githubusercontent.com/karolherbst/f11f7482d776c6e6fb8fbd04d99a202e/raw/7864a26b44ae3c1e14c5970c798cb0b2596b6d62/gistfile1.txt is the full thing
10:24marysaka[d]: so maybe nir_opt_sink/nir_opt_move could help a bit?
10:24karolherbst[d]: yeah I mean.. there is almost nothing in between 🙃
10:24karolherbst[d]: but sadly it depends on the first if
10:24marysaka[d]: yeah probably var_copies and alu motion could help idk
10:24marysaka[d]: right
10:25karolherbst[d]: though ... I think we could merge them..
10:25karolherbst[d]: maybe
10:26karolherbst[d]: all those phis could be moved after the second if
10:26karolherbst[d]: and the alu before the first one
10:26karolherbst[d]: and then it's empty
10:26marysaka[d]: so I guess sink and move could help a bit there
10:26karolherbst[d]: can they move phis?
10:26karolherbst[d]: mhhh
10:26marysaka[d]: to at least move them near their actual usage
10:27karolherbst[d]: let me try that
10:27marysaka[d]: idk for phi but the alu side will be fine to move at the very least
10:27karolherbst[d]: mhh it can't move phis...
10:29karolherbst[d]: ohh wait
10:30karolherbst[d]: it's used in the second if 😢
10:30karolherbst[d]: but only there
10:30karolherbst[d]: why isn't it moved inside the if block 🙃
10:37karolherbst[d]: pain..
10:37karolherbst[d]: it doesn't do it because it could make the if divergent 🥲
10:39karolherbst[d]: marysaka[d]: https://gist.githubusercontent.com/karolherbst/7fbfa0cdc9eafe9acbcdf840a9ee9c94/raw/75caedd03b272930116b64e06ef0c4bc228f23e3/gistfile1.txt 🥲
10:58karolherbst[d]: okay.. got more perf
10:59karolherbst[d]: `Instruction count: 488` yo
11:00karolherbst[d]: so apparently using the .U32 thing helps: https://gist.github.com/karolherbst/9d542a33f46ab92ce710889077a56e3c#file-gistfile1-txt-L109
11:02karolherbst[d]: but "more perf" is really relative here 🙃 it's almost nothing
11:03karolherbst[d]: lemme do perf profiling
11:31karolherbst[d]: marysaka[d]: did you have some patches to not make submit copy the push buffer all the time or something? Dunno what we've discussed there
11:32marysaka[d]: karolherbst[d]: hmm what do you mean? :aki_thonk:
11:32karolherbst[d]: something with push buffers in VRAM or something dunno actually. I haven't checked out the code nor what it all does 🙃
11:33karolherbst[d]: ohhh
11:34karolherbst[d]: we create a new push buffer each submit?
11:34marysaka[d]: oh you mean not composing the pushbuf with the GART mapping?
11:34karolherbst[d]: marysaka[d]: right
11:34marysaka[d]: unsure I think we allocate it on demand and allocate more chunks if we are out of memory (64KiB per alloc)
11:36marysaka[d]: karolherbst[d]: I don't have the patch with me but it was basically just allocating a shadow mapping of 64KiB and copying it when flushing the pushbuf
11:36marysaka[d]: flush being nvk_cmd_buffer_flush_push
11:36marysaka[d]: had not much of a difference tho
11:38karolherbst[d]: I see
11:58karolherbst[d]: mhh the entire thing takes like `24706`us
12:07karolherbst[d]: 😮
12:08karolherbst[d]: okay okay okay
12:08karolherbst[d]: `TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 53.132931 TFlops` => `TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 93.821389 TFlops`
12:09karolherbst[d]: and what I did was to set `TARGET_SM_CONFIG_SHARED_MEM_SIZE` in the QMD to smem_max instead of smem_size
12:09karolherbst[d]: 🙃
12:18karolherbst[d]: okay my GPU has indeed 100k of shared mem 😄
12:18karolherbst[d]: interesting...
12:19karolherbst[d]: but it's only affecting one of those tests...
12:56phomes_[d]: snowycoder[d]: I did the usual game perf test on your cross-block MR. Results are in the performance sheet in https://docs.google.com/spreadsheets/d/1RuHD3Z_nBKCp618HHC5I9hOu0lqCoFYwQ4FM69M-Ajg
13:18gfxstrand[d]: karolherbst[d]: Requesting more makes it go faster?!? I guess I could see it.
13:21karolherbst[d]: gfxstrand[d]: maybe something with the cache management is broken? dunno...
13:21karolherbst[d]: it's kinda weird that it doesn't help with all the configs
13:21karolherbst[d]: actually.. is it only the first or...
13:22karolherbst[d]: mhhh
13:23karolherbst[d]: no idea.. but the 128x128 test goes faster but not the others
13:29karolherbst[d]: I should do more dumps on nvidia with all the stuff removed from the test that's not needed...
13:30gfxstrand[d]: karolherbst[d]: Those feed into the hardware occupancy calculations somehow. I really wish that weren't such a black box. 🫤
13:30karolherbst[d]: good that we also don't set the occupancy fields 🙃
13:31gfxstrand[d]: 🙃
13:31gfxstrand[d]: Another thing I'm starting to wonder. Does generic memory compression work on buffers?
13:32djdeath3483[d]: not always on Xe2
13:32djdeath3483[d]: 😉
13:33gfxstrand[d]: Using some of the compression patches and increasing alignments should at least get you bigger pages which might help, too.
13:36karolherbst[d]: ohh yeah.. that might be a good idea
13:37karolherbst[d]: but.. not sure that will give me 2x perf which the shared mem size does for weird reasons?!?
13:40gfxstrand[d]: <a:shrug_anim:1096500513106841673>
13:40gfxstrand[d]: karolherbst[d]: Maybe we should be setting the occupancy fields?
13:43karolherbst[d]: yeah... I'll play around with those
15:16karolherbst[d]: so that's the parsed QMD: https://gist.githubusercontent.com/karolherbst/effe5235135e2c2ecf6782ba902dafd6/raw/3f1cdd89d574d5390f4641f4e8cfa13cc41b4ebe/gistfile1.txt
15:42karolherbst[d]: mhhh
15:43karolherbst[d]: the hw totally doesn't like what I'm sending
16:15karolherbst[d]: okay....
17:58ermine1716[d]: Is hw mad at you?
18:19karolherbst[d]: yes
19:03leftmostcat[d]: mhenning[d]: I'm fine with just closing that MR if it looks like it's a performance penalty.
19:28mhenning[d]: I'm currently questioning my benchmarking - it's not being super consistent between runs right now
19:28mhenning[d]: so it's possible I'm just seeing noise
19:34karolherbst[d]: CPU profiling? prolly want to engage power save mode or something silly so the clocks aren't jumpy
19:34karolherbst[d]: and taskset if you are on a system with cores of different performance characteristics
19:39gfxstrand[d]: leftmostcat[d]: I'm fine with either way if it's not hurting anything. I thought boxing would be better since it means we aren't memcpying as much. But pointer chasing is also expensive. <a:shrug_anim:1096500513106841673>
20:03karolherbst[d]: how big is the Instr type anyway 🙃
20:05karolherbst[d]: `296` oof
20:11gfxstrand[d]: oof
20:11gfxstrand[d]: Maybe we should change the box threshold for SSARef to vec3+ instead of vec5+
20:11gfxstrand[d]: Then we'd still get inline for 64-bit stuff but wouldn't pay the cost every time.
20:14karolherbst[d]: there are a few nice tricks to cut down on sizes quite a bit tho...
20:15karolherbst[d]: but I kinda wish rust could tell you what's making the struct so big 😄
20:18mhenning[d]: There's a way, described in the rust perf book
20:19mhenning[d]: I've been wondering for a while about doing something like that MR + hiding the largest ops behind boxes to reduce the size penalty
20:20karolherbst[d]: ohh nightly rustc
20:21mhenning[d]: yeah it's a little annoying to set up
20:22karolherbst[d]: yeah...
20:37mhenning[d]: Okay, I managed to dump the type sizes: https://gitlab.freedesktop.org/mhenning/mesa/-/snippets/7858
20:38karolherbst[d]: `16800 bytes` yo what
20:39karolherbst[d]: IAdd3X being the biggest one 🙃
20:39karolherbst[d]: ohh arrays
20:43karolherbst[d]: yeah... changing `SMALL_SIZE` to 2 might help a lot
20:43karolherbst[d]: that probably will cut down sizes quite a bit
20:43karolherbst[d]: `SSARef::SMALL_SIZE` I mean
20:44mhenning[d]: yeah, was just looking at that
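The size effect under discussion can be illustrated with a toy version of the inline-small-array trick; these types are stand-ins, not NAK's actual `Src`/`SSARef` definitions:

```rust
use std::mem::size_of;

// Toy stand-ins: an inline array of N elements vs. a boxed slice.
// Shrinking the inline capacity shrinks every enum variant that embeds it,
// because an enum is as large as its largest variant.
struct SrcInline<const N: usize>([u64; N]);
struct SrcBoxed(Box<[u64]>);

fn main() {
    assert_eq!(size_of::<SrcInline<4>>(), 32);
    assert_eq!(size_of::<SrcInline<2>>(), 16);
    assert_eq!(size_of::<SrcBoxed>(), 16); // fat pointer: ptr + len
    println!("inline capacity drives the size; a boxed slice stays 16 bytes");
}
```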
21:10gfxstrand[d]: I'm also a little annoyed at how much enum(enum(enum))) we have going on with refs etc. Hopefully, Rust is shrinking most of those down to a u8 and not over-aligning on us too badly
21:26snowycoder[d]: mhenning[d]: In more recent versions of rust-analyzer it also tells you how much stack memory a variable uses just by hovering; it works on vscode and forks
21:28mhenning[d]: Is there any way to get that working with mesa? Last time I tried I found that meson didn't support it
21:29snowycoder[d]: You mean rust-analyzer?
21:29snowycoder[d]: It seems to work, more or less
21:29mhenning[d]: yeah, I haven't managed to get rust-analyzer to do anything
21:30snowycoder[d]: Have you tried using `rust-project.json`?
21:30snowycoder[d]: You need that for non-cargo projects
21:31mhenning[d]: I think I tried to set that up, yes
21:34karolherbst[d]: gfxstrand[d]: nah.. it's just `Src` being 32 bytes big and everything else explodes around that
21:38airlied[d]: large pages didn't affect the one coopmat benchmark I'm testing with
21:38airlied[d]: just had to fight /boot running out of space 5 times to test it 😛
21:43karolherbst[d]: yeah... sooo... it seems like messing with the shared memory allocated changes the numbers a bit
21:43karolherbst[d]: karolherbst[d]: airlied[d] ^^
21:44karolherbst[d]: I wonder if something with caches is going very wrong?
21:44karolherbst[d]: shared memory is L1 cache after all
21:44airlied[d]: that one is weird, I wonder if there's some possibility of a bank collision or something
21:45karolherbst[d]: yeah...
21:45karolherbst[d]: or that
21:45snowycoder[d]: mhenning[d]: I've set `"rust-analyzer.linkedProjects"` and `"editor.formatOnSave"` and it almost work, everything but tests and procedural macros (and sometimes complex CFGs)
21:45karolherbst[d]: I think before digging further I should just wire up all the perf counter stuff
21:45karolherbst[d]: because that's gonna tell us about those issues
21:46karolherbst[d]: somebody got to do that sooner or later anyway
21:47mhenning[d]: snowycoder[d]: okay, I'm not really trying to fix that right now but maybe I'll try to set it up again soon
21:47airlied[d]: bumping smem to max + removing cs invocations got my Turing from 19 to 20
21:47karolherbst[d]: heh
21:47karolherbst[d]: it doesn't help in all the sub-tests
21:48karolherbst[d]: just that one specifically stood out on my ampere
21:48airlied[d]: that's the first time I've gotten it above 20
21:48karolherbst[d]: 😄
21:48karolherbst[d]: cute
21:48karolherbst[d]: and here I'm hitting 120 in a couple of tests
21:49karolherbst[d]: airlied[d]: what Turing do you have anyway?
21:50airlied[d]: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] some shitty laptop one, just happened to be where I did the nvidia benchmarks
21:51karolherbst[d]: ahh..
21:59karolherbst[d]: is there something that sets up smem on the sub channel?
21:59karolherbst[d]: because some GPUs have more than the supported 48k
22:01karolherbst[d]: mhhh
22:02karolherbst[d]: I think the performance impact is biggest the less shared memory is used... but a very high target helps a lot
22:02karolherbst[d]: it's really odd
22:04mhenning[d]: karolherbst[d]: smem is allocated in nvk_device_ensure_slm
22:04karolherbst[d]: it's not
22:04karolherbst[d]: that's local memory
22:05mhenning[d]: oh, sorry right
22:05karolherbst[d]: shared memory is L1 cache, so the more you take there, the less you leave available for instruction/data caches
22:05karolherbst[d]: but...
22:05karolherbst[d]: why is taking more helping?
22:05mhenning[d]: right
22:05karolherbst[d]: unless we mismanage our caches a lot
22:06mhenning[d]: Maybe changing the smem size is expensive?
22:06karolherbst[d]: mhhhh
22:06karolherbst[d]: then why is launching the same QMD 60 times with the exact same smem values a problem?
22:07mhenning[d]: it's also worth noting that modern cuda versions always allocate a little smem for driver internal stuff, so there's no longer a cuda path that uses the zero smem case
22:07karolherbst[d]: mhhh
22:08karolherbst[d]: here is the QMD nvidia pushed to the hardware.. or rather the ones I was able to extract with envyhooks: https://gist.github.com/karolherbst/effe5235135e2c2ecf6782ba902dafd6
22:08mhenning[d]: but yeah I don't really know what's going on with that shader
22:08karolherbst[d]: was there something required on the sub channel for occupancy?
22:09karolherbst[d]: or does that only exist on 3d on the sub channel?
22:09karolherbst[d]: but it's still weird...
22:09karolherbst[d]: like
22:09karolherbst[d]: take `NVC7C0_QMDV03_00_MIN_SM_CONFIG_SHARED_MEM_SIZE` being 1
22:09karolherbst[d]: I've tried that
22:09karolherbst[d]: sadly the hw got angy with that
22:09mhenning[d]: karolherbst[d]: I'm not sure what you're talking about
22:10karolherbst[d]: like when using the occupancy QMD fields e.g. `NVC7C0_QMDV03_00_OCCUPANCY_MAX_REGISTER`
22:10mhenning[d]: Oh, we have no idea
22:10karolherbst[d]: mhhh
22:10karolherbst[d]: `NVC7C0_QMDV03_00_REGISTER_COUNT_V` being 0x10 is also weird
22:11airlied[d]: nvidia vulkan with coopmat2 also reserves some shared mem for workgroup scope matrix
22:11karolherbst[d]: maybe there are a few submissions envyhooks misses and doesn't dump
22:12karolherbst[d]: cooperativeMatrixProps = 8x8x16 A = uint8_t B = uint8_t C = uint32_t D = uint32_t scope = subgroup
22:12karolherbst[d]: TILE_M=128 TILE_N=128, TILE_K=64 BColMajor=1 workgroupSize=256 123.330001 TFlops
22:12karolherbst[d]: yay
22:13karolherbst[d]: but dunno.. maybe something isn't properly set up somewhere
22:13karolherbst[d]: could also be kernel side stuff for all I know
22:19gfxstrand[d]: airlied[d]: Yeah. that's kinda the point of cmat2. Are we not doing that yet?
22:20airlied[d]: Nope have to get nir reviews on the first few parts of it for radv 🙂
22:21airlied[d]: I just started looking at workgroup scope last week
22:22karolherbst[d]: but it's impressive how consistently the perf increases with the smem thing: https://gist.github.com/karolherbst/1af36f572aaffdd7ace836032d9e372a
22:22karolherbst[d]: at least for some of the tests there
22:22karolherbst[d]: I wonder if it has something to do with ldsm....
22:24karolherbst[d]: mhhhhhhhhhh
22:26karolherbst[d]: normal smem, ldsm off/on: 50/52 TFLOPS
22:26karolherbst[d]: max smem target, ldsm off/on: 80/95 TFLOPS
22:27mhenning[d]: Is it possible lower occupancy is actually helping us? How does it compare if you artificially increase the register count?
22:27karolherbst[d]: I'm already at like 105 regs, but...
22:28karolherbst[d]: okay..
22:28karolherbst[d]: soo
22:28karolherbst[d]: going with 252 regs: 50 TFLOPS 🙃
22:28karolherbst[d]: no matter what I set for smem
22:29karolherbst[d]: sooo.. if I burn 2.5x amount of regs the perf doesn't change
22:29karolherbst[d]: fun
22:31karolherbst[d]: and 95 with 96 regs
22:31karolherbst[d]: but 50 again if I use the normal smem target
22:32airlied[d]: so just setting .smem_size to NVK_MAX_SHARED_SIZE?
22:32karolherbst[d]: no
22:33karolherbst[d]: TARGET_SM_CONFIG_SHARED_MEM_SIZE
22:33karolherbst[d]: like you can leave the min value, otherwise some tests fail to execute
22:34karolherbst[d]: it's likely that target has some occupancy considerations or whatever
22:34karolherbst[d]: let me check something...
22:35karolherbst[d]: `128` regs -> 95 TFLOPS
22:35karolherbst[d]: `136` regs -> 50 TFLOPS
22:36karolherbst[d]: it almost feels like under certain circumstances the hw actually runs twice as many threads?
22:37karolherbst[d]: soo.. nvidia sets min to 1.. max to 0x1a and target to 1...
22:37karolherbst[d]: and occupancy to 0xff
22:38karolherbst[d]: but if I do that the hw screams at me
22:38karolherbst[d]: but 0x1a makes sense
22:39karolherbst[d]: my GPU has 100k of shared memory
22:39karolherbst[d]: so that's certainly the max
22:44karolherbst[d]: ohhh
22:44karolherbst[d]: got more perf
22:45karolherbst[d]: LOOOL
22:45karolherbst[d]: okay
22:45karolherbst[d]: I figured it out
22:45karolherbst[d]: `TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 100.423026 TFlops`
22:46karolherbst[d]: but why the hell does this help
22:46karolherbst[d]: just set `SHARED_MEMORY_SIZE` to 0 🙃
22:49karolherbst[d]: TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 104.747316 TFlops
22:49karolherbst[d]: TILE_M=128 TILE_N=256, TILE_K=32 BColMajor=0 workgroupSize=256 74.605881 TFlops
22:49karolherbst[d]: TILE_M=256 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 72.477432 TFlops
22:49karolherbst[d]: TILE_M=256 TILE_N=256, TILE_K=32 BColMajor=0 workgroupSize=256 89.315670 TFlops
22:50karolherbst[d]: somehow that occupancy stuff helps
22:51karolherbst[d]: okay..
22:51karolherbst[d]: apparently the hardware is able to dynamically allocate that stuff
22:52karolherbst[d]: there is also this `NVC7C0_QMDV03_00_SHARED_ALLOCATION_ENABLE` thing...
22:54karolherbst[d]: yeah like the hw doesn't care
22:55karolherbst[d]: yeah.............
22:55karolherbst[d]: and I guess the other tests just run over 128 regs
22:56karolherbst[d]: yeah...
22:56karolherbst[d]: oooofff
22:56karolherbst[d]: okay
22:56karolherbst[d]: I figured out the perf gap 🙃
23:05karolherbst[d]: yeah...
23:05karolherbst[d]: looks like 128 registers is a magic threshold here
23:06karolherbst[d]: anyway
23:06karolherbst[d]: now I'm very close to nvidia
23:07karolherbst[d]: and register allocation is the bigger issue with the remaining tests
23:07karolherbst[d]: good good...
23:07karolherbst[d]: wow.. the numbers are good now
23:08karolherbst[d]: anyway
23:09karolherbst[d]: that's the patch: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/882ac189b9f18cb3c52a819f5b1fdfad024fb99f#2b7cb25fc1c7e70a6c342139061c2964cd1e1c24_371_362
23:09karolherbst[d]: but the max value actually matters
23:09karolherbst[d]: so I guess we'll have to properly declare the smem limits per generation
23:09karolherbst[d]: and pipe that through properly
23:11karolherbst[d]: `TILE_M=128 TILE_N=128, TILE_K=64 BColMajor=1 workgroupSize=256 142.497619 TFlops` speed
23:13airlied[d]: nice!
23:14karolherbst[d]: https://gist.githubusercontent.com/karolherbst/464c069b6b5d3cef93be8e614db38700/raw/1b6ff2ba95f599adbefaa9dabcb4e7498d7cb21b/gistfile1.txt
23:14karolherbst[d]: I like how consistent it now is
23:14karolherbst[d]: some slower ones, but the ones I checked out were using more than 128 registers
23:15karolherbst[d]: soo now nvidia
23:15airlied[d]: did you check correctness
23:16karolherbst[d]: not all of them
23:16airlied[d]: correctness breaks when I change smem size to 0
23:16airlied[d]: but I don't have the other bits
23:16karolherbst[d]: ahh...
23:18karolherbst[d]: funny how that matters but doesn't change the performance
23:20karolherbst[d]: let's see if I still get 140 this way
23:23karolherbst[d]: I wonder if shared mem max is the amount across all currently running workgroups
23:24karolherbst[d]: but I kinda thought the hardware complains if a shared memory access gets out of bounds...
23:27mhenning[d]: didn't we disable some hardware bounds checking?
23:28karolherbst[d]: ohhh...
23:28karolherbst[d]: that explains 🙃
23:29karolherbst[d]: yep
23:34karolherbst[d]: the annoying part is that my GPU rejects the 0x19 value which is 96 * 1024 / 4096 + 1 and it needs to use 100 instead of 96
23:34karolherbst[d]: so `gv100_sm_config_smem_size` needs to be per gen as well somehow
23:34karolherbst[d]: or we need to pass in an array of legit values
23:36karolherbst[d]: guess I'll write a proper patch for that
23:36karolherbst[d]: more shared memory, yay
23:38karolherbst[d]: anyway, only getting `TILE_M=128 TILE_N=128, TILE_K=64 BColMajor=1 workgroupSize=256 127.672042 TFlops` now 😄
23:38karolherbst[d]: but I wonder if reducing register pressure would help...
23:39phomes_[d]: karolherbst[d]: I ran the games test on it. Some wins like AoE4 211->228 fps. In the game Urban Trial Playground the colors are messed up and very bright though
23:39karolherbst[d]: yeah....
23:39karolherbst[d]: what's your GPU?
23:40karolherbst[d]: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/ef5e9331b831f09b8182698fbb6cd82981f67e9c
23:40karolherbst[d]: but that's only gonna work on ampere
23:41karolherbst[d]: phomes_[d]: try https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/ef5e9331b831f09b8182698fbb6cd82981f67e9c if you are on ampere
23:41phomes_[d]: I am on ada
23:41karolherbst[d]: mhh
23:41karolherbst[d]: that's...
23:41karolherbst[d]: SM90?
23:42karolherbst[d]: SM89
23:42karolherbst[d]: phomes_[d]: same patch will do just fine on ada
23:43karolherbst[d]: it's gonna break on turing tho
23:43karolherbst[d]: turing only has 64 max
23:44karolherbst[d]: https://gist.github.com/karolherbst/464c069b6b5d3cef93be8e614db38700 proper numbers now 😄
23:44karolherbst[d]: actually passing validation
23:45karolherbst[d]: I suspect the int8 shaders need more love
23:46phomes_[d]: karolherbst[d]: that fixed the colors
23:46karolherbst[d]: the perf is somewhere in between now?
23:46karolherbst[d]: or closer to the 228?
23:47karolherbst[d]: you also have the compute MME patches, right?
23:47phomes_[d]: I will rerun the tests. I only applied the patch to main
23:47karolherbst[d]: ahh
23:47karolherbst[d]: you want the compute MME patches as well
23:48karolherbst[d]: they both together should work best
23:49phomes_[d]: I will add columns in the sheet for just your patch, compute MME, and combined
23:49karolherbst[d]: nice
23:55karolherbst[d]: nvk vs nvidia coop: https://gist.github.com/karolherbst/c5d33f61ebed6ae4fb06d65f398a0320
23:55karolherbst[d]: fp16 stuff is _super_ close
23:56karolherbst[d]: like sometimes faster even, but the test also has randomly low scores as it doesn't sample
23:56karolherbst[d]: but like.. within 90%?
23:56karolherbst[d]: fp16 16x8x16 is a bit slower for bigger matrices...
23:57karolherbst[d]: probably register pressure as mentioned above
23:57karolherbst[d]: fp16 16x16x16 is lowering... might be suboptimal
23:57karolherbst[d]: fp32 sucks but I haven't optimized it much..
23:58karolherbst[d]: but the very first one was the one I was focusing on most
23:58karolherbst[d]: and it's very close 🙃
23:59gfxstrand[d]: See! AI was useful for something! 😝