00:20mhenning[d]: airlied[d]: Okay, I don't know what I'm doing. I'm working on integrating review feedback from the zcull patches. dakr suggested https://lore.freedesktop.org/nouveau/Z-aCVZYcEkxJHBle@pollux/ that we need a new nvif object for nvkm_gr, but I don't know how to make that happen.
00:20mhenning[d]: Current WIP version is here: https://gitlab.freedesktop.org/mhenning/linux/-/commits/zcull3 You can see both the uapi side and the rm rpc call there, but I don't know how to use nvif to glue them together.
00:21mhenning[d]: I think I would be able to do it if I were just adding a new method call to an existing object, but since there's no nvif for gr yet I'm getting stuck
00:24mangodev[d]: mhenning[d]: i saw the zcull mr get rebased today
00:24mangodev[d]: what caused that? i recall you saying that the small perf uplift wasn't worth the kernel work and that there's other stuff that's far more important to do
00:26mhenning[d]: I don't remember saying that, but priorities change over time. I'm currently trying to get a simplified version of it working that will hopefully be easier to land
00:27mhenning[d]: For a while it was blocked on userspace review but Mary reviewed a bunch of it.
00:27airlied[d]: mhenning[d]: so nvkm_gr_units kinda hacks around this already
00:29airlied[d]: and maybe doing the same hackery is sufficient here
00:30mangodev[d]: mhenning[d]: ohhhh i see
00:31mangodev[d]: may've been someone else that said something along those lines about zcull
00:31mangodev[d]: but that was also months ago
00:31mangodev[d]: it just felt sudden that we're working to get zcull up and running after it being dormant for so long
00:32mhenning[d]: airlied[d]: okay, I'm allowed to access nvkm_gr from the uapi side like that? dakr pretty directly said he thinks it needs to be an nvif interface
00:32mhenning[d]: here: https://lore.freedesktop.org/nouveau/Z9xb5SABWcwYnV-x@pollux/ "This is bypassing the nvif layer entirely. I think you should store the contents of struct NV2080_CTRL_GR_GET_ZCULL_INFO_PARAMS in struct r535_gr and have an nvif interface to access the data."
00:33airlied[d]: I'd agree with him normally, but then I see nvkm_gr_units and realise we made a bad decision already here, but I'll dig in a little more
00:43airlied[d]: mhenning[d]: I've replied to the thread, I think as long as the nvkm side uses one structure and ioctl uses another one it should be fine
00:45airlied[d]: I'm also not sure you need to bother with version and union on the drm ioctl, but also not a problem to leave it in
00:51mhenning[d]: airlied[d]: That's also something dakr requested. I don't have a strong preference either way
00:54airlied[d]: generally we don't bother, ioctl is a multiplexer, and building multiplexers inside multiplexers is kinda pointless
01:08mhenning[d]: Feel free to respond to this message then https://lore.freedesktop.org/nouveau/Z-aDWWUTN1MBI_wl@pollux/
01:33karolherbst[d]: getting somewhere with the shared shift stuff:
01:33karolherbst[d]: LDS.128 R8, [R2.X16] ;
01:33karolherbst[d]: LDS.128 R12, [R2.X16+0x10] ;
01:33karolherbst[d]: LDS.128 R16, [R2.X16+0x20] ;
01:33karolherbst[d]: LDS.128 R20, [R2.X16+0x30] ;
01:35karolherbst[d]: this has weird interactions with nir_opt_offsets but whatever..
10:04notthatclippy[d]: mhenning[d]: airlied[d] re:ZCULL patch, but also wider - future updates to 570.xx release only maintain ABI level compatibility described here: <https://github.com/NVIDIA/open-gpu-kernel-modules/blob/570/src/nvidia/interface/gsp_abi_check.c>. I don't know if nouveau is planning on ever moving from 570.144, to a later 570 version, but if so, you should PR an addition to the `gsp_abi_check.c` file
10:04notthatclippy[d]: in `570` branch for the stuff that you want to use
11:17karolherbst[d]: I'm wondering if I want to rename the address/offset fields of load/store/atomic instructions, because with the ugpr + gpr encoding this is going to become a bit of a mess
11:19karolherbst[d]: and I'm mildly leaning towards: base_addr (ugpr), offset (gpr), const_offset (immediate), offset_stride/offset_shift (for the shared mem stride on the gpr address). But I'm also not super happy with it
11:19karolherbst[d]: any other thoughts on it?
14:05karolherbst[d]: phomes_[d]: wanna check if lego builder is impacted by https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39709 ? It's usually seeing big benefits from compute related improvements, so I was wondering if it see s a bit from that one as well
14:17phomes_[d]: I don't see a change for lego builder. I'll run the full set and see if anything else changes
14:24karolherbst[d]: I highly doubt it will matter much and everything will be below 1% of change
14:24karolherbst[d]: it's only impacting compute shaders doing shared memory accesses
14:24karolherbst[d]: which is already a niche in games
15:10karolherbst[d]: okay.. it looks like the potential benefit of ULDC might be a bit bigger than I've anticipated..
15:37karolherbst[d]: in one shader here, replacing a single ld.constant with a uldc removed two other instructions.. impressive
15:37karolherbst[d]: and that's without the ugpr + gpr encoding wired up at all
15:38karolherbst[d]: this should allow us to use the bindless alu source more often, let's see how often we hit it in fossils
15:38karolherbst[d]: or uhm.. ugpr source as well
15:40karolherbst[d]: though in my example it just got rid of 2 r2ur as expected
16:42karolherbst[d]: crashing left and right, but:
16:42karolherbst[d]: Totals from 39268 (3.90% of 1007111) affected shaders:
16:42karolherbst[d]: CodeSize: 558211488 -> 556184240 (-0.36%); split: -0.41%, +0.05%
16:42karolherbst[d]: Number of GPRs: 2156391 -> 2144551 (-0.55%); split: -0.56%, +0.01%
16:42karolherbst[d]: Static cycle count: 464095742 -> 462700055 (-0.30%); split: -0.35%, +0.05%
16:42karolherbst[d]: Max warps/SM: 1491912 -> 1495744 (+0.26%); split: +0.26%, -0.00%
16:51karolherbst[d]: mhh some of the regressions are when the constant is used in e.g. tex instructions that can't take ugprs.. mhh
16:51karolherbst[d]: I wonder how I want to do the lowering here...
16:51karolherbst[d]: maybe I legalize uldc to ld.global.constant if the dest is a gpr..
16:52karolherbst[d]: or I just hardcode instruction types 🥲
16:52karolherbst[d]: but that's going to be annoying around phis
16:54karolherbst[d]: guess I could iterate each use of the def inside `opt_uniform_instrs`
16:54karolherbst[d]: the only drawback is, that uldc can't load 128 bit... whatever tho
16:58karolherbst[d]: which might actually be fine, because nothing can consume 128 bit uniform values anyway...
17:01karolherbst[d]: uhhh.. I see what's going wrong.. oof
17:07karolherbst[d]: so tex ops can take UGPRs...
17:07karolherbst[d]: but the issue here is that legalize is too late to make good decisions about uldc..
17:08karolherbst[d]: can we get rid of `copy_ssa_ref_if_uniform`? 😄
17:09mhenning[d]: karolherbst[d]: sort of. it's a lot of work
17:09karolherbst[d]: I know..
17:09karolherbst[d]: but it's gonna not be used for ld/st/atoms in the future anyway
17:09mhenning[d]: in general our handling of ugprs is pretty naive right now
17:09karolherbst[d]: because that's what I'll be doing with the GPR + UGPR stuff
17:10karolherbst[d]: Ast/Ald seems to be one of the few remaining users then
17:10karolherbst[d]: IPA can also take ugprs
17:10karolherbst[d]: oohhh Ast/Ald as well
17:10karolherbst[d]: yeah.. soo...
17:10karolherbst[d]: I think that will get removed on its own anyway 🙃
17:12mhenning[d]: Doesn't some of that depend on hardware version?
17:12karolherbst[d]: mhh surface + tex ops can only use UGPRs for the bindless tex handle
17:12mhenning[d]: iirc there's no UGPR source for ldg on turing
17:12gfxstrand[d]: karolherbst[d]: Not really. But we can make it so it's less likely to trigger.
17:12karolherbst[d]: mhenning[d]: there is
17:12gfxstrand[d]: Or we can add more encodings so we don't need it
17:13karolherbst[d]: yeah that's what I've been working on anyway
17:13karolherbst[d]: but I might need a solution earlier than landing the GPR + UGPR encoding bits
17:13karolherbst[d]: I think at least I'd like to have the UGPR -> GPR move be made earlier than legalize...
17:14karolherbst[d]: tldr: I want to add a ldgc_nv intrinsic that is a constant pull from a global VA into a UGPR always, but the way `opt_uniform_instrs` works I can't convert that to a ld.global.constant if needed
17:14karolherbst[d]: so I end up wiht ur2r now
17:15karolherbst[d]: ldc itself is a bit of a mess, and with uldc on global VAs in the mix it's not getting any prettier
17:16karolherbst[d]: anyway.. it needs to stay for texture/surface ops outside of bindless handles
17:17karolherbst[d]: just need to be smarter so we don't see r2ur and ur2r too often..
17:19karolherbst[d]: or... I just add a fake `LDGC` instruction that either emits to `ld.global.constant` or `uldc`...
17:19karolherbst[d]: not sure I like that any better
17:20karolherbst[d]: LDG can only take 24 bit offsets, and can load 128 bits, ULDC can take 32 bit offsets but can only load 64 bits...
17:20karolherbst[d]: LDG is way more flexible as well
17:20karolherbst[d]: has the GPR + UGPR encoding, where UDLC is only UGPR
17:21karolherbst[d]: so I'd rather make that explicit
17:21karolherbst[d]: what if we make the decision about ugprs vs gprs entirely in nir ...
17:22karolherbst[d]: I'm sure that's not a lot of work for almost no gain
17:22mhenning[d]: karolherbst[d]: That sounds like something the backend should be doing, not nir
17:22karolherbst[d]: maybe I should just write an op that eliminates r2ur and ur2r that runs after legalize
17:22karolherbst[d]: *opt
17:23karolherbst[d]: and if I see uldc + ur2r + something, I just convert uldc to ld.global.constant + something
17:23karolherbst[d]: but that feels like a bit of band aid
17:23mhenning[d]: mmm, the point of legalize is sort of that it's a last effort to make things work
17:24mhenning[d]: adding an opt pass afterward to undo part of that is against the point
17:24karolherbst[d]: yeah but the UGPR -> GPR thing is done there for all the memory ops
17:24karolherbst[d]: so maybe I move that into opt_uniform_instrs...
17:24mhenning[d]: You could do that before legalize too
17:25mhenning[d]: just legalize also needs to do it in case things didn't work out
17:25karolherbst[d]: but then it feels like I should do the GPR + UGPR encoding first, because that's gonna drop it for a bunch of instructions...
17:25karolherbst[d]: mhenning[d]: maybe I just add another interface function for it...
17:25karolherbst[d]: so it's not part of `legalize`
17:25mhenning[d]: I don't know what GPR + UGPR encoding means here
17:25karolherbst[d]: load/stores/atomics/etc.. can do a UGPR + GPR + constant
17:26karolherbst[d]: so they all take 3 sources basically
17:26mhenning[d]: right, I know that
17:26mhenning[d]: I don't know what "encoding" means before legalize
17:26karolherbst[d]: ohh I meant I wire up the instruction encoding first
17:26mhenning[d]: oh, sure
17:27karolherbst[d]: kinda wanted to do uldc first, because I was seeing r2ur otherwise 🙃
17:27karolherbst[d]: but maybe it's easier to tackle that problems while doing that..
17:27karolherbst[d]: oh well
17:27karolherbst[d]: still a lot of things to do there 🥲
17:28gfxstrand[d]: karolherbst[d]: I'm a fan of `_nv` intrinsics
17:29karolherbst[d]: gfxstrand[d]: then you'll love to hear that I already added a bunch of them recently 😄
17:29karolherbst[d]: though at least I've finished the shared address stride encoding, I just don't like the name's I've chosen here: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39709
17:35gfxstrand[d]: Ok, IDK if I'm a fan of "a bunch". 😂
17:37mhenning[d]: gfxstrand[d]: these are the new ones so far https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39525/diffs?commit_id=e779538ad2d3f05a5c254b7daf1c96126b8a64bc#b36dc37ddf5b1144e08376782e22fa1b636c7423
17:39gfxstrand[d]: Those look reasonable
17:39gfxstrand[d]: I've thought for a while that it's better to have _nv versions than trying to pick out the offsets from an add
17:39gfxstrand[d]: Doubly so when we start trying to do GPR+UGPR+imm
17:39karolherbst[d]: yeah
17:40karolherbst[d]: that part already went in and showed some perf improvements in games even, so that was great
17:40karolherbst[d]: nir_opt_offsets made this so much more sane than trying to do it a from_nir time
17:40karolherbst[d]: especially with imad/lea/iadd3
17:44gfxstrand[d]: Yup
17:44gfxstrand[d]: And it lets us add stuff like predicates later if we want
17:45karolherbst[d]: also, and this is probably the biggest benefit, the nak shader prints are looking so much cleaner now, because you won't see those immediates/iadds that get folded into the load/stores 😄
17:46karolherbst[d]: ohh yeah...
17:46karolherbst[d]: ULDC with a global VA takes an input predicate as well
17:46karolherbst[d]: makes it return 0 on true
18:49gfxstrand[d]: karolherbst[d]: Oh, nice!
19:16karolherbst[d]: ULDC is really weird tbh... the most "weird" encoding it does is `URZ` + 32 bit unsigned immediate, that's for a global VA 🙂
19:16karolherbst[d]: there are some niche features in the hw, but I have no idea what's that even used for
19:17karolherbst[d]: but maybe an instance of "harder to remove than to keep it"
19:21karolherbst[d]: could maybe use it for things like a printf/shader constant buffer address that's placed in the first 32 bits of the VA and then patch the address into the shader 🙃
19:27karolherbst[d]: when I'm done with the address calc stuff, maybe I get to modeling predication.. I have a few ideas that might generally work, but I'm not sure if we want to do it as part of from_nir or deeper into nak, especially as it also impacts where we can safely use UGPRs
19:42mhenning[d]: There have been multiple starts at predication already and I don't think we've really decided on an approach
19:42mhenning[d]: I keep meaning to get back to compiler stuff but I've been mostly focused on bug fixing lately
23:44karolherbst[d]: `Static cycle count: 14985 -> 12137` yeah....
23:45karolherbst[d]: I'm sure I screwed something up 🙃
23:46karolherbst[d]: because I don't see why the shader is any better...
23:46karolherbst[d]: ohh...
23:46karolherbst[d]: mhhh
23:49karolherbst[d]: ohh..
23:49karolherbst[d]: okay, it came from something else.. interesting...
23:49karolherbst[d]: `('iadd', ('iadd', 'a(is_convergent,is_not_const)', 'b(is_divergent)'), 'c(is_divergent)'), ('iadd', ('iadd', b, c), a)`
23:50karolherbst[d]: ohh maybe my pass is broken or so...
23:52karolherbst[d]: yeah.. that opt alone drops a bunch of stuff...
23:52anholt_: sounds like you're approximating nir_opt_reassociate?
23:52karolherbst[d]: but why 🙃
23:53mhenning[d]: yeah see also https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39028
23:53karolherbst[d]: ahh...
23:54karolherbst[d]: ohh wait.. it reorders all the address calc chains, mhh
23:55karolherbst[d]: mhenning[d]: yeah, maybe I should use that instead.. let me rebase
23:58karolherbst[d]: will you have time to work on this MR or should I just take it over as it seems it benefits things I'm looking at 😄