00:00gfxstrand[d]: phomes_[d]: See! AI was useful for something! 😝
00:08karolherbst[d]: the bad part is, I only have until Thursday because after Thursday I'm uhm... "busy" with "very important" things
00:08karolherbst[d]: but maybe I manage to clean up the shared mem stuff _and_ the ugpr + gpr stuff...
00:08karolherbst[d]: I need to run it through the shader db and see how big of an impact it has...
00:09karolherbst[d]: but I'm also not very sure about the encoding bits, because I just force every instruction to the UGPR+GPR form...
00:09karolherbst[d]: not sure if that comes with any drawbacks?
00:09mhenning[d]: I've actually seen the cuda compiler do the same thing
00:09karolherbst[d]: I mean it makes the code simpler
00:10karolherbst[d]: the compiler code I mean
00:10mhenning[d]: that is, on any hardware where [ugpr + gpr] exists it always uses that form
00:10karolherbst[d]: okay
00:10karolherbst[d]: then I'll just do the same and won't care any further
00:10mhenning[d]: yeah, I wouldn't worry about it
00:11karolherbst[d]: the free 64 + 32|64 addition is huge...
00:11karolherbst[d]: but I'm worried that it won't really do much without all my opt_offset changes as well
00:12karolherbst[d]: also need to teach opt_offset to fold in base for globals
00:13karolherbst[d]: at least on a 32 bit offset it can actually do so..
00:21karolherbst[d]: but man.... 50% perf gap and it was just non optimal shared memory config 🙃
00:23mhenning[d]: yeah, I have a feeling we have a few silly things like that
00:33mhenning[d]: I managed to shave off 40 bytes/instruction. that's a start https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37130
00:44HdkR: Speaking of perf, apparently the Jetson Thor's Neoverse-V3AE cores have significantly lower performance than expected. Hopefully someone figures that out, would be a shame otherwise.
00:45HdkR: Losing to Neoverse-V2 cores at roughly the same clock.
00:47HdkR: My concern is that having the coherent interconnect has introduced some overheads for the sake of not needing to do cache flushes for communication between CPU and GPU.
00:50airlied: do you think any real person is ever going to see a thor board?
00:51HdkR: They're actually shipping from Arrow now
00:51airlied: like I hear spark might be underwhelming too, but they might sell those
00:51HdkR: While Spark is delayed, Thor at least managed to start selling last Tuesday :D
00:52airlied: at the price point for thor, I doubt people are running to buy them :-)
00:52phomes_[d]: karolherbst[d]: updated results for you at https://docs.google.com/spreadsheets/d/1RuHD3Z_nBKCp618HHC5I9hOu0lqCoFYwQ4FM69M-Ajg
00:52HdkR: Yea, the price is kind of nutty, more than double the Orin...
00:55HdkR: I'm now waiting for AmpereOne to be available to DIY market so I can get PCIe+Rebar+AArch64 to test NVK. Apparently when Radxa increased the BAR size on their board, it broke a bunch of things :(
00:59phomes_[d]: karolherbst[d]: in both versions of your patch you fixed a big performance issue in the main menu for Atomic heart. Before I got <10 fps. Now it is 60 fps
01:09karolherbst[d]: impressive
01:43karolherbst[d]: I wonder if a low shared mem target size reduces parallelism somehow, or if it's just the actual shared/cache split... I'll need to run some CL benchmarks tomorrow to see if it has a big impact on shaders that use little shared mem
02:38airlied[d]: karolherbst[d]: so these illegal opts give me a few more instructions out of the loop
02:38airlied[d]: + (('ushr', ('iadd', a, '#b'), '#c'), ('iadd', ('ushr', a, c), ('ushr', b, c))),
02:38airlied[d]: + (('ishl', ('u2u64', a), '#b'), ('u2u64', ('ishl', a, b))),
02:38airlied[d]: + (('ushr', ('iadd3', a, b, c), d), ('iadd3', ('ushr', a, d), ('ushr', c, d), ('ushr', b, d))),
02:38airlied[d]: + (('iadd', ('iadd', ('iadd(no_unsigned_wrap)', a, '#b'), c), d), ('iadd', ('iadd', ('iadd', a, c), d), b)),
02:38airlied[d]: if I also drop lea
02:39airlied[d]: but there are still a few more places where the const doesn't make it into the ld opcode
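A quick worked example of why rewrites like the first one are "illegal" without extra range information: distributing a right shift over an add throws away any carry out of the low bits. A minimal sketch in plain Rust (not tied to NIR):

```rust
fn main() {
    let (a, b, c) = (1u32, 1u32, 1u32);
    // ('ushr', ('iadd', a, b), c): the carry from the low bits survives the shift.
    let exact = (a + b) >> c;            // (1 + 1) >> 1 == 1
    // ('iadd', ('ushr', a, c), ('ushr', b, c)): the carry into bit c is lost.
    let rewritten = (a >> c) + (b >> c); // 0 + 0 == 0
    assert_ne!(exact, rewritten);        // sound only when a + b has no carry out of the low c bits
}
```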
03:39airlied[d]: https://gitlab.freedesktop.org/airlied/mesa/-/commits/nak/ldsm-opts-hacks these hacks rematerialise a bunch more consts in the ld/st instructions and get rid of a lot of alu
03:40airlied[d]: the one liner to reassociate divergent and convergent adds in address calc opts seems to be the piece I was missing
04:14HdkR: 6
05:47jja2000[d]: HdkR, at least they're trying
05:50HdkR: jja2000[d]: NVIDIA?
05:50jja2000[d]: Radxa/CIX
05:50HdkR: Oh yea, It's nice that they're tinkering with it, and it's a nice upgrade over my Orin
05:51HdkR: It just has some rough edges that are worth pointing out.
07:51karolherbst[d]: airlied[d]: yeah, that helped a lot
07:52karolherbst[d]: the issue with most of the unsafe opts is that proving their correctness is really hard
07:52airlied[d]: I think a lot of them might be easier to fix earlier
07:52karolherbst[d]: I have some ideas, but...
07:52airlied[d]: Like just do all calcs in 64 bit space
07:53airlied[d]: Then prove they can fit in 32 bit if any are left
07:53karolherbst[d]: the issue is that uub is 32 bit only
07:53karolherbst[d]: so we can't prove they fit with it
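A minimal illustration of that 32-bit limitation (plain Rust, made-up values): folding 64-bit address math into a 32-bit add first is only safe when the 32-bit sum provably doesn't wrap, which is exactly what a 32-bit-only uub can't show for 64-bit chains.

```rust
fn main() {
    let base: u64 = 0x1_0000_0000;
    let a: u32 = 0xffff_fff0;
    let b: u32 = 0x20;

    // Widen each term first: this is the exact address.
    let full = base + a as u64 + b as u64;         // 0x2_0000_0010

    // Reassociate into a 32-bit add first: a + b wraps to 0x10.
    let reassoc = base + a.wrapping_add(b) as u64; // 0x1_0000_0010

    assert_ne!(full, reassoc);
}
```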
07:53airlied[d]: Though I think having load global nv take ugpr/gpr/24bit const might make it easier
07:54karolherbst[d]: yep
07:54karolherbst[d]: that's my plan to have a 32 bit gpr source and then simplify everything
07:55karolherbst[d]: the one issue is the loop variable
07:55karolherbst[d]: like we know its range; it's just that with how the CFG is laid out it overflows, even though it practically doesn't matter
07:56karolherbst[d]: the two ifs can also be merged
07:57karolherbst[d]: and I think the block at the end can also be moved into the loop
07:57karolherbst[d]: and with predication we'd get a single block for the entire loop
07:57karolherbst[d]: that will help the most probably
07:58karolherbst[d]: anyway, I should clean up what I got first
08:19karolherbst[d]: anyway.. I should clean up that smem stuff first as this is gonna help a lot on its own 😄
08:20karolherbst[d]: I don't think that optimizing the loop helps _that_ much, it's just gonna help with closing the remaining ~10% gap. All the other tests having a massive gap are probably due to occupancy, so we'll need to figure out how to use fewer registers or something
08:20karolherbst[d]: what nvidia seems to do is to rematerialize const buffer loads
08:20karolherbst[d]: we should probably do the same
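A rough sketch of the rematerialization idea over a toy IR (hypothetical types, not NAK's actual data structures): clone the cheap constant-buffer load right in front of each use, so the loaded value doesn't stay live across the whole shader and inflate register pressure.

```rust
// Toy IR, just enough to show the transform; all names are made up.
#[derive(Clone, Debug)]
enum Instr {
    LoadCbuf { dst: u32, buf: u32, offset: u32 }, // cheap and side-effect free
    Use { src: u32 },                             // any consumer of the value
}

/// Re-issue the defining cbuf load directly before every use of its result.
/// In a real SSA pass the original, far-from-its-uses load would then be
/// cleaned up by dead-code elimination.
fn rematerialize_cbuf_loads(body: &[Instr]) -> Vec<Instr> {
    let loads: Vec<Instr> = body
        .iter()
        .filter(|i| matches!(i, Instr::LoadCbuf { .. }))
        .cloned()
        .collect();

    let mut out = Vec::new();
    for instr in body {
        if let Instr::Use { src } = instr {
            if let Some(def) = loads
                .iter()
                .find(|l| matches!(l, Instr::LoadCbuf { dst, .. } if dst == src))
            {
                out.push(def.clone());
            }
        }
        out.push(instr.clone());
    }
    out
}
```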
08:42karolherbst[d]: ohh we don't have the compute headers for kepler c?
08:42karolherbst[d]: they don't have the 16, 32, 48 shared mem split, but 80, 96, 112 🙃
09:07airlied[d]: karolherbst[d]: I wonder if the const should try to be associated with the ugpr src not the gpr
09:07karolherbst[d]: nir_opt_offset can try on both
09:07karolherbst[d]: and my code in from_nir also handles both
09:07karolherbst[d]: so it doesn't really matter
09:07karolherbst[d]: or well shouldn't
09:08karolherbst[d]: just need to wire it all up. The issue with nir_opt_offset is just that it relies on uub to reorder things, and it kinda doesn't handle 64 bit very well
09:30karolherbst[d]: airlied[d]: there was an MR somewhere which tried to reorganize the entire chains on global memory which _might_ be interesting here on a higher level: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33926
09:31karolherbst[d]: ehh I think it was another one
09:36karolherbst[d]: yeah.. can't find it 🙃
11:04karolherbst[d]: okay.. this also speeds up some OpenCL benchmarks with rusticl, nice
11:04karolherbst[d]: though I should pick the compute mme stuff for benchmarking first...
12:04karolherbst[d]: phomes_[d]: proper MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37135
12:58phomes_[d]: I am testing it now. Sadly today the menu is horrible even when I retest with your patches from yesterday. Strange. The 40->48 fps speedup in-game in Atomic heart and 207->217 fps speedup in AoE4 are still there with the MR
12:59phomes_[d]: It is strange that I got even better performance from the first patch that used the unsupported size of 96
13:00phomes_[d]: I also get the same improvement going to just 64 instead of 100 btw
13:11karolherbst[d]: phomes_[d]: huh.. normally that should result in a context error...
13:14karolherbst[d]: but anyway, if it gives you more perf whatever 😄
13:23gfxstrand[d]: karolherbst[d]: Tossed you a smattering of comments, mostly cosmetic.
13:23karolherbst[d]: yeaah.. already reworking the array thing
13:23gfxstrand[d]: Where did the numbers come from?
13:25karolherbst[d]: cuda wiki which has it from the cuda docs. I should probably double check. But tesla/fermi/kepler match class headers/qmds, and volta+ I'd have to double check
13:26karolherbst[d]: ehh fermi/kepler have it in the class headers, tesla is constant
13:26karolherbst[d]: fermi has e.g. `NV90C0_SET_L1_CONFIGURATION_DIRECTLY_ADDRESSABLE_MEMORY_SIZE_16KB` and `NV90C0_SET_L1_CONFIGURATION_DIRECTLY_ADDRESSABLE_MEMORY_SIZE_48KB`
13:26karolherbst[d]: maybe I should document it..
13:26phomes_[d]: https://docs.nvidia.com/cuda/ada-tuning-guide/index.html has numbers for ada
13:27karolherbst[d]: yeah the wikipedia page pulled those all from there
13:27karolherbst[d]: the only odd thing is SM37.. we don't have the headers for it...
13:28karolherbst[d]: it's a2c0 afaik
13:28karolherbst[d]: `The NVIDIA Ada GPU architecture supports shared memory capacity of 0, 8, 16, 32, 64 or 100 KB per SM.` right.. that's for ada
13:28karolherbst[d]: anyway.. I'll double check all those numbers
13:31karolherbst[d]: 6.5 doesn't have online docs 🥲
13:31karolherbst[d]: not that it matters for tesla
13:31karolherbst[d]: `GK210 more than doubles the shared memory capacity versus Fermi and earlier Kepler GPUs. ` okay!
13:32karolherbst[d]: ohhhhhhhhhh
13:32karolherbst[d]: okay
13:32karolherbst[d]: I found something _very_ interesting
13:32karolherbst[d]: GK210 improves on this by increasing the shared memory capacity per multiprocessor for each of the configurations described above by a further 64 KB (i.e., the application can select 112 KB, 96 KB, or 80 KB of shared memory), which provides automatic occupancy improvements for kernels limited by shared memory capacity.
13:32karolherbst[d]: Note: The maximum shared memory per thread block for all Kepler GPUs, including GK210, remains 48 KB.
13:33karolherbst[d]: so my assumption that a higher shared mem target helps with occupancy was right 🙃
13:33karolherbst[d]: and it also explains why 48k is the cap per warp
13:33karolherbst[d]: (that's for kepler only tho)
13:33karolherbst[d]: mhh
13:33karolherbst[d]: guess I'll need to model it a bit differently then
13:50gfxstrand[d]: karolherbst[d]: Yeah, that bit of documentation would be awesome
13:50karolherbst[d]: I can't verify the 5.3 number 🥲
13:50karolherbst[d]: not that it matters
13:50karolherbst[d]: there is a per warp limit that's different
13:50karolherbst[d]: so the table matters for occupancy only
13:54gfxstrand[d]: Yeah, any citations you can add as comments would be great.
13:55karolherbst[d]: they had an online occupancy calculator spreadsheet at some point but it's gone 😭
14:00karolherbst[d]: `A new feature, Volta enables a single thread block to address the full 96 KB of shared memory. To maintain architectural compatibility, static shared memory allocations remain limited to 48 KB, and an explicit opt-in is also required to enable dynamic allocations above this limit. See the CUDA C++ Programming Guide for details.` ah
14:01karolherbst[d]: Sooo now I know why we don't have splits for Maxwell and Pascal... they moved shared memory to dedicated memory and merged it back with Volta
14:07marysaka[d]: karolherbst[d]: I think I have it somewhere let me see :aki_thonk:
14:08karolherbst[d]: it's funny how turing docs say "32 or 64" but somehow we've done all the other values as well and it's fine?
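A rough summary of the carve-out options that came up above, in the shape of the table the MR could document (values taken from the CUDA tuning guides and class headers cited in this discussion; the function name and SM encoding are made up, and everything should be double-checked against the QMD definitions):

```rust
/// Selectable shared-memory carve-outs per SM, in KiB, keyed by compute
/// capability * 10. Sketch only; generations not listed here (e.g. Maxwell
/// and Pascal with their dedicated shared memory) have no split to pick.
fn shared_mem_configs_kib(sm: u32) -> &'static [u32] {
    match sm {
        20..=21 => &[16, 48],               // Fermi L1/shared split
        30..=35 => &[16, 32, 48],           // Kepler
        37 => &[80, 96, 112],               // GK210: each Kepler config +64 KiB
        70..=72 => &[0, 8, 16, 32, 64, 96], // Volta
        75 => &[32, 64],                    // Turing, per the tuning guide
        89 => &[0, 8, 16, 32, 64, 100],     // Ada
        _ => &[],
    }
}
```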
14:13karolherbst[d]: ohhh nice
14:15karolherbst[d]: oh that one is awesome thanks!
14:26gfxstrand[d]: OMG, rebasing on maintenance9 is such a pain. :blobcatnotlikethis:
14:49mohamexiety[d]: ~~sorry~~
15:19gfxstrand[d]: No worries. This branch adds new stuff to `nvk_device` and refactors a bunch of descriptor bits. Lots of stuff gets modified that's now behind `if (dev->nvkmd)` checks.
15:42cubanismo[d]: gfxstrand[d]: We had two similar bugs at NV a very long time ago, back when I first started here. One was on the software side, and one was a similar case of terrible, terrible code being written that worked, but made everyone involved angry whenever it was discussed. Fortunately long gone now.
15:43cubanismo[d]: But I think that's partly why it bothered me so much this time.
15:44gfxstrand[d]: I'm glad you were bothered and went digging for us. I was flying on gut feelings and it's good to have those confirmed rather than just shoving extra stuff in and not being sure.
16:02karolherbst[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1412467720821145760/image.png?ex=68b86693&is=68b71513&hm=17042265f0c665d9c7ff74257bd046ba85b7e0cd76f30b9632319579f8f9c348&
16:02karolherbst[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1412467721366536222/image.png?ex=68b86693&is=68b71513&hm=396e39368e91d03367ea2f7961e9928c1e1735bb88368eba6cb0d568d4ad72dd&
16:02karolherbst[d]: I saw 2x perf, the occupancy calculator tells me twice as many active threads! So it all makes sense now 😄
16:03karolherbst[d]: still a mystery what exactly min/max/target mean in the QMD
16:03karolherbst[d]: maybe we should just ask and maybe somebody from nvidia is able to answer it?
16:04karolherbst[d]: but I get the feeling it might be beneficial to not set it to the max always?
16:04karolherbst[d]: dunno
16:04karolherbst[d]: selecting the 100k config doesn't make any difference
16:05karolherbst[d]: it also tells me that I'm limited by registers...
16:05karolherbst[d]: (after bumping shared memory)
16:05karolherbst[d]: I think at some point we'll have to model occupancy properly...
16:06karolherbst[d]: getting below 80 regs would give me more perf.. mhh
16:06karolherbst[d]: anyway
16:06karolherbst[d]: upstreaming first 🙃
16:12karolherbst[d]: let me try that actually..
16:12mhenning[d]: karolherbst[d]: yeah, `pub fn max_warps_per_sm(gprs: u32)` is a start. Still needs local_size and shared mem for compute
16:12karolherbst[d]: mhenning[d]: for shared mem https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37135
16:15karolherbst[d]: mhh tried to set 80 regs, but the local memory usage already kills perf enough
16:34karolherbst[d]: I suspect I might need to figure out if I can keep a lower shared mem target, because otherwise I'm taking away L1 cache for data...
16:35gfxstrand[d]: Yeah, that might come back to burn us
16:35karolherbst[d]: I suspect for workloads with just a tiny bit of shared memory this will cause more cache misses, so the question is if we want to do it properly from the start
16:36karolherbst[d]: or I just do it good enough and do requested smem size * max_warps_per_sm and set it as a limit 😄
16:37karolherbst[d]: mhh
16:37karolherbst[d]: maybe it's a bit more complicated than that
16:37karolherbst[d]: mhh should be easy enough, just some math on top
16:38mhenning[d]: lol and how hard could math possibly get?
16:39mhenning[d]: karolherbst[d]: I think that's a reasonable start, although it's ignoring the local size
16:39karolherbst[d]: yeah...
16:39karolherbst[d]: anyway, I just need to figure out how many groups can be run there
16:40karolherbst[d]: do we have max regs limits already?
16:40karolherbst[d]: mhh we don't
16:40mhenning[d]: what do you mean by "max regs limits"?
16:40karolherbst[d]: e.g. 65k threads per block
16:40karolherbst[d]: ehh
16:41karolherbst[d]: registers per block
16:41karolherbst[d]: and also 65k regs per SM
16:41karolherbst[d]: and that should be used to calculate how many threads/blocks you'll be able to run
16:41mhenning[d]: the 65k regs/sm is modeled in max_warps_per_sm
16:42mhenning[d]: but threads/block is ignored for now
16:42karolherbst[d]: ohh it's inside nak.. mhh
16:42karolherbst[d]: yeah I'll think about how I can have the info available at QMD gen time
16:43karolherbst[d]: but yeah.. I think I only need `max_warps_per_sm` here..
16:43mhenning[d]: what do you mean "oh it's inside nak"? so is qmd generation
16:43mhenning[d]: just call the function
16:43karolherbst[d]: ohh so it us
16:43karolherbst[d]: *is
16:44karolherbst[d]: I kinda thought qmd is its own crate...
16:49karolherbst[d]: `(max_warps_per_sm(regs) / (local_size / 32)) * shared_mem`?
16:49karolherbst[d]: should be right?
16:49karolherbst[d]: all divisions round down
16:50mhenning[d]: local_size / 32 sounds like something that should round up
16:50karolherbst[d]: ohh right
16:50karolherbst[d]: but yeah.. gives me `2` for my example here
16:51karolherbst[d]: `2 * 18944 == 37888` which is between 32 and 64
16:51karolherbst[d]: so target would be 64 here (and not 100)
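Putting that together, a minimal sketch of the target selection (the warp count would come from NAK's `max_warps_per_sm(gprs)` mentioned above; `configs_kib` stands in for the supported carve-outs, and the whole thing ignores further limits like threads/block):

```rust
/// Pick the smallest supported carve-out (KiB) that covers what the resident
/// blocks can actually use. The divisions behave as discussed above:
/// warps-per-block rounds up, blocks-per-SM rounds down.
fn shared_mem_target_kib(
    warps_per_sm: u32,         // from max_warps_per_sm(gprs)
    local_size: u32,           // threads per workgroup
    shared_mem_per_block: u32, // bytes requested by the workgroup
    configs_kib: &[u32],       // supported carve-outs, ascending
) -> u32 {
    let warps_per_block = local_size.div_ceil(32);
    let blocks = warps_per_sm / warps_per_block;
    let needed = blocks * shared_mem_per_block; // bytes across the whole SM

    configs_kib
        .iter()
        .copied()
        .find(|&kib| kib * 1024 >= needed)
        .unwrap_or(*configs_kib.last().unwrap())
}

// The example above: 2 blocks * 18944 B = 37888 B, which lands between the
// 32 KiB and 64 KiB configs, so the target comes out as 64 KiB rather than 100.
```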
16:52karolherbst[d]: though I'm still wondering what `max` means here unless it's really the hardware max which is a bit odd?
16:52karolherbst[d]: maybe limiting max gives me more perf...
16:52karolherbst[d]: I should benchmark this
19:40gfxstrand[d]: HdkR: Do you have any ideas on https://gitlab.freedesktop.org/mesa/mesa/-/issues/13814#note_3081204 ?
19:54jannau: https://github.com/AmpereComputing/linux-ampere-altra-erratum-pcie-65/blob/main/v6.13/0002-ampere-arm64-Work-around-Ampere-Altra-erratum-82288-.patch
19:55jannau: it's one of the busted PCIe controllers on arm platforms
19:57HdkR: Yea, if you don't get that bug on an x86 platform then it looks like Ampere Altra's busted PCIe fabric.
19:58HdkR: To 100% confirm, we'd need them to test on a platform without borked PCIe though, since technically it could still be ARM specific.
19:58gfxstrand[d]: Is there an Arm without borked PCIe?
19:59HdkR: NVIDIA Orin, Grace.
19:59gfxstrand[d]: Yeah, I assume Grace works, though I think they're betting more on nvlink than PCI for that
19:59gfxstrand[d]: Orin: Working PCIe. Busted CPU atomics. See! It half works!
19:59HdkR: :D
19:59jannau: I'm still astonished that nvidia's proprietary driver reportedly works without issues on ampere
20:00gfxstrand[d]: I'm a little "press X to doubt" as well. But also, I can believe that Ampere paid them to debug it.
20:00HdkR: I think they have PCIe quirk workarounds for the platforms. Considering they would have needed to implement a ton for X-Gene.
20:02HdkR: AmpereOne and Thor theoretically also have a working PCIe implementation but those are harder to test :D
20:02marysaka[d]: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/6387af3092a0da1a8e67d9d04b3463480756fc84/src/nvidia/src/kernel/platform/chipset/chipset_info.c#L1300
20:02marysaka[d]: For the Ampere workaround
20:04HdkR: :D
20:08jannau: nice, so nvidia proprietary will not work out of the box on apple silicon
20:08HdkR: That much is definitely expected at this point
20:10gfxstrand[d]: Ah, yes, for all those M1s with a dGPU
20:11steel01[d]: gfxstrand[d]: Is there a good writeup on that atomics thing somewhere? I saw the fex blog post, but iirc it didn't really expound. Just a 'it's broken and sucks', so I still don't know what the technical issue is.
20:11gfxstrand[d]: Though I suspect an eGPU setup could probably be cobbled together somehow.
20:11jannau: never_released keeps claiming it'll work whenever it comes up
20:13jannau: no thunderbolt support yet but I have a M2 Ultra Mac Pro with PCIe slots
20:15HdkR: steel01[d]: Look up errata 1951502 for Cortex-A78AE
20:15HdkR: Only affects the r0p0 revision that Orin ships
20:33steel01[d]: Ah, so it's more arm's fault than nvidia's. Though this makes me wonder how 'drop-in' newer revisions are. Like, can nvidia just grab r1p0 for newer silicon runs. Guessing not so simple given that they haven't.
20:50HdkR: Eh, respinning the silicon is expensive so probably not worth it. Most people probably don't even care that the CPUs don't perform as expected, not like there's another platform to jump to :P
21:55karolherbst[d]: okay.. done the shared mem size stuff properly
21:55karolherbst[d]: I got even more perf 🙃
21:58gfxstrand[d]: I'll review again tomorrow
21:59karolherbst[d]: nice
21:59karolherbst[d]: I still need to push tho
21:59karolherbst[d]: doing some testing, but it looks like I didn't mess up
21:59karolherbst[d]: I should test it on turing, because that's kinda a big change there
21:59karolherbst[d]: big as in functional
22:00karolherbst[d]: when I started I barely got over 20, and now I'm like at `TILE_M=128 TILE_N=128, TILE_K=64 BColMajor=1 workgroupSize=256 131.772726 TFlops`
22:03gfxstrand[d]: Nice!
22:04karolherbst[d]: nvidia peaks at 160, but I'll get there 😄
22:04karolherbst[d]: the one shader I focused the most on is 85 vs 95 TFlops