00:17gfxstrand[d]: karolherbst[d]: Actually, it is. We need to figure out how to use the HW thing
00:20karolherbst[d]: there is a hw thing?
00:20karolherbst[d]: isn't that just a very aggressive prefetcher?
00:34gfxstrand[d]: IDK what exactly it does. I just know it's super necessary for perf according to nvidia folks
00:35airlied[d]: wow, if nvidia has to do 4 hadd2s, they do hadd2/hfma2/hadd2/hfma2
00:41airlied[d]: also their shader container has a section called BARRIERS with offsets into the shader program where there are NOPs
00:41airlied[d]: I wonder if they patch barriers somehow later
00:44airlied[d]: like maybe if you don't launch more than 1, they don't bother
03:43mhenning[d]: airlied[d]: yeah, that's on the todo list
03:45mhenning[d]: although we'd ideally have a model of the functional units and instruction throughputs in order to do so
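A model like that might be little more than a per-opcode table of issue pipe, issue rate, and result latency. A minimal sketch in C; the opcodes, units, and numbers below are all illustrative assumptions, not measured hardware values:

```c
#include <stdint.h>

/* Hypothetical opcodes and functional units -- illustrative only. */
enum sched_op   { OP_HADD2, OP_HFMA2, OP_COUNT };
enum sched_unit { UNIT_ALU, UNIT_FMA };

struct op_model {
   enum sched_unit unit; /* which pipe the op issues to */
   uint8_t issue_rate;   /* cycles between back-to-back issues on that pipe */
   uint8_t latency;      /* cycles until the result can be consumed */
};

/* With a table like this, a list scheduler can interleave ops that map to
 * different pipes (the hadd2/hfma2/hadd2/hfma2 pattern above) instead of
 * serializing four hadd2s on one pipe. All numbers are made up. */
static const struct op_model op_models[OP_COUNT] = {
   [OP_HADD2] = { UNIT_ALU, 1, 5 },
   [OP_HFMA2] = { UNIT_FMA, 1, 5 },
};
```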
04:08mangodev[d]: gfxstrand[d]: i'm appreciative of what's working
04:08mangodev[d]: the fact that most of this works *at all* is insane, even despite the sync issues
04:08mangodev[d]: love the work, i admire it ❤️
04:10mangodev[d]: although that (presumably nouveau??) cursor bug has been driving me insane and i wanna find out why it acts the way it does
04:10mangodev[d]: …but the thing that drives me more insane than the cursor shaking maniacally is navigating old nouveau source code
11:01karolherbst[d]: ahh `LOAD_ROOT_TABLE` which is 3D only...
11:02karolherbst[d]: ohhh
11:02karolherbst[d]: I see
11:02karolherbst[d]: okay
11:02karolherbst[d]: I know what it is
11:02karolherbst[d]: gfxstrand[d]: tldr: there are 32 constant memory banks
11:03karolherbst[d]: some of them are root-constants
13:22gfxstrand[d]: karolherbst[d]: Yes but the root constant ones are tiny and way fast.
13:22karolherbst[d]: ahh
13:22karolherbst[d]: anyway, it's 3D only ~~so I don't care about them :P~~
13:22karolherbst[d]: but I don't think they are hard to wire up
13:22karolherbst[d]: nothing special in the shader really
13:23karolherbst[d]: they seem to be used for d3d stuff
13:24karolherbst[d]: I think they are direct only tho
13:24karolherbst[d]: like no indirect accesses to them
13:26karolherbst[d]: `SET_PIPELINE_BINDING`
13:26karolherbst[d]: `BIND_GROUP_CONSTANT_BUFFER`
13:26karolherbst[d]: `ROOT_TABLE_VISIBILITY`
13:26karolherbst[d]: `SET_ROOT_TABLE_SELECTOR` and `LOAD_ROOT_TABLE` are all relevant I think
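By analogy with the long-standing constant-buffer selector/load pattern, driving these could plausibly look like the sketch below. Only the method names above come from the class headers; the class used here (NVC797), the operand meanings, and the macro usage are all guesses:

```c
/* Guesswork sketch: select a root table, then stream dwords into it.
 * NVC797, table_idx, data, and dw_count are hypothetical. */
P_IMMD(p, NVC797, SET_ROOT_TABLE_SELECTOR, table_idx);

P_MTHD(p, NVC797, LOAD_ROOT_TABLE(0));
P_INLINE_ARRAY(p, data, dw_count);
```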
13:28gfxstrand[d]: Yeah
13:29karolherbst[d]: might be fun to use them for push constants at index 31 and see if somebody gets them working
13:31gfxstrand[d]: Indirects is an interesting question because it affects what we can put in them.
13:32gfxstrand[d]: But the Nvidia folks made it sound like they were using one for push data and you can definitely indirect that.
13:32karolherbst[d]: ohh, I mean indirect on the index
13:33karolherbst[d]: not the offset
13:33gfxstrand[d]: Right
13:34karolherbst[d]: but I base this on a d3d partial bindless image thing where for root buffers it's a constant
13:34gfxstrand[d]: In any case, I think they're way faster for partial updates and stall less than binding regular constant buffers.
13:34karolherbst[d]: right
13:35karolherbst[d]: I'll play around with the qmd stuff later today tho.. getting rid of all the memcpys in a trivial compute dispatch should also help a bit 😄 maybe I'll look into constant buffer uploads next then
13:35gfxstrand[d]: So for stuff like base vertex and friends, we really want to be using them.
13:36karolherbst[d]: right..
13:37karolherbst[d]: it kinda sounds like you can slot them and switch between groups between draws
13:37karolherbst[d]: without having to change the shader
13:37gfxstrand[d]: IDK how useful that is
13:37karolherbst[d]: and without having to reupload new buffers
13:38gfxstrand[d]: We never upload new buffers for root constants. Haven't in a while. We do entirely inline updates.
13:38karolherbst[d]: like even for compute we upload the qmd every dispatch which isn't great
13:38karolherbst[d]: and I suspect similar happens for 3d
13:39karolherbst[d]: ohh, maybe it's just compute that sucks then
13:39karolherbst[d]: I meant the root desc thing tho
13:39karolherbst[d]: +qmd
13:40gfxstrand[d]: Compute is annoying because it's really focused on these in-memory structures which point to other in-memory structures. It's great for parallelism inside the GPU but it kinda sucks for us to have to manage.
13:41gfxstrand[d]: 3D is much more focused on blasting everything in through the command streamer.
13:41karolherbst[d]: I see
13:41karolherbst[d]: well.. nuking the memcpies for compute first then and see how well it goes...
13:41karolherbst[d]: not really sure I have a great plan so far
13:42gfxstrand[d]: Compute doesn't have inline constant buffer updates, for instance. It has QMD DMAs but I doubt those do anything special besides pre-populate the SKED cache.
13:42karolherbst[d]: the issue I'm seeing is that the QMD gets replaced by an almost identical QMD except the cb0 address is different
13:43gfxstrand[d]: Yup
13:43karolherbst[d]: so instead of uploading a new QMD 60 times, I just want one and stick with it
13:43karolherbst[d]: which also means the root desc needs to stay at the same VA
13:43karolherbst[d]: and get updated via the push buffer
13:43gfxstrand[d]: Can't. Compute doesn't have that
13:43karolherbst[d]: it does
13:44gfxstrand[d]: ?
13:44karolherbst[d]: just use NVC6C0_LAUNCH_DMA
13:44karolherbst[d]: with INLINE_DATA
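Roughly what that suggestion looks like, following the inline-upload pattern the compute classes have carried since Kepler; a sketch only, and the LAUNCH_DMA field spellings are from memory and may not match the generated headers exactly:

```c
/* Sketch: write dw_count dwords of `data` inline to the GPU VA `addr`
 * through the compute class. Field names are best-effort guesses. */
P_MTHD(p, NVC6C0, LINE_LENGTH_IN);
P_NVC6C0_LINE_LENGTH_IN(p, dw_count * 4);
P_NVC6C0_LINE_COUNT(p, 1);
P_NVC6C0_OFFSET_OUT_UPPER(p, addr >> 32);
P_NVC6C0_OFFSET_OUT_LOWER(p, addr & 0xffffffff);

P_IMMD(p, NVC6C0, LAUNCH_DMA, {
   .dst_memory_layout = DST_MEMORY_LAYOUT_PITCH,
   .completion_type   = COMPLETION_TYPE_FLUSH_ONLY,
});

P_MTHD(p, NVC6C0, LOAD_INLINE_DATA);
P_INLINE_ARRAY(p, data, dw_count);
```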
13:44gfxstrand[d]: That's not the same
13:44gfxstrand[d]: That'll cause a huge stall
13:44karolherbst[d]: why?
13:45gfxstrand[d]: Because you have to wait for any running compute shaders to finish or else they might read the wrong root data.
13:45gfxstrand[d]: The 3D constant data updates are pipelined. That's just blasting data into memory.
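For contrast, the pipelined 3D path referred to here is the inline constant-buffer update through the 3D class, which is also what NVK's inline root-constant updates ride on. A sketch; the exact macro spellings may differ from the generated headers:

```c
/* Sketch: point the constant-buffer selector at the buffer, then write a
 * few dwords inline at dst_offset. This stays pipelined: no WFI, and
 * later draws observe the new data in order. */
P_MTHD(p, NV9097, SET_CONSTANT_BUFFER_SELECTOR_A);
P_NV9097_SET_CONSTANT_BUFFER_SELECTOR_A(p, cb_size);
P_NV9097_SET_CONSTANT_BUFFER_SELECTOR_B(p, cb_addr >> 32);
P_NV9097_SET_CONSTANT_BUFFER_SELECTOR_C(p, cb_addr & 0xffffffff);
P_NV9097_LOAD_CONSTANT_BUFFER_OFFSET(p, dst_offset);

P_MTHD(p, NV9097, LOAD_CONSTANT_BUFFER(0));
P_INLINE_ARRAY(p, data, dw_count);
```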
13:47karolherbst[d]: mhhhh
13:47karolherbst[d]: that's indeed kinda annoying
13:48gfxstrand[d]: The way they get parallelism is by everything having its own data in memory so that anything can run in any order and still have access to its version of push constants or constant buffers. It's kinda like Mali or pre-SNB Intel. Or you can think of it as bindless GPU state if you wanna make it sound fancy and modern.
13:49karolherbst[d]: right....
13:49karolherbst[d]: maybe I should look at the root table and maybe it's all the same
13:49karolherbst[d]: but probably not
13:50gfxstrand[d]: If the root table and the QMD are both the same then yeah, we can just dispatch back-to-back. But I expect that's a pretty uncommon case.
13:52karolherbst[d]: but at least not switching the QMD does seem to have perf benefits, so I'll dig a bit to see if something sane could be done about it... but yeah if launch_dma is too heavy...
13:52karolherbst[d]: there is also the `SET_INLINE_QMD_ADDRESS` thing... but that alone didn't seem to do much
13:53karolherbst[d]: maybe I should dump whatever nvidia is doing there and see
13:53gfxstrand[d]: karolherbst[d]: Have you proven that?
13:55karolherbst[d]: I wouldn't say "proven", but if I use the same qmd for the 60 back to back dispatches it does seem to speed up, but who knows if I'll see the same if I don't hack it like that.
13:56karolherbst[d]: like it got me quite close to nvidias perf
13:56gfxstrand[d]: Okay
13:56karolherbst[d]: the only difference was the root desc of course, and I doubt the shader's runtime depends on the buffer's contents at all, because it's just a static loop
13:57karolherbst[d]: so let's say "there is a sign it might"
13:57gfxstrand[d]: It'd be interesting to know if Nvidia is dirty tracking. That shouldn't be too hard to do
13:58gfxstrand[d]: If we version the root table somehow and track the last QMD we should be able to avoid re-emitting.
13:58gfxstrand[d]: And maybe for some ML workloads that's worth doing.
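The versioning idea in sketch form; all names here are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: bump a serial on every root-table write and remember what the
 * last emitted QMD captured; skip re-emitting when nothing changed. */
struct dispatch_state {
   uint64_t root_table_serial; /* bumped on every root-table update */
   uint64_t emitted_serial;    /* root_table_serial at the last QMD emit */
   const void *emitted_shader; /* shader baked into the last QMD */
};

static bool
need_new_qmd(const struct dispatch_state *s, const void *shader)
{
   return s->emitted_shader != shader ||
          s->emitted_serial != s->root_table_serial;
}
```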
13:58karolherbst[d]: heh
13:58karolherbst[d]: there are... other benefits of doing that
13:59gfxstrand[d]: ?
13:59karolherbst[d]: like if it helps for ML workloads, business might get more interested in NVK, which might also mean more funding for development, etc... kinda sucks, but it might also help with getting more people to work on it as a consequence
13:59gfxstrand[d]: In most games, every CS dispatch is a different shader so it's kinda meh. But for really compute heavy stuff I can see it.
14:01karolherbst[d]: right
14:50gfxstrand[d]: Okay, kernel patch reviewed. Hopefully those will be hitting -fixes soon
15:00gfxstrand[d]: I think I need to write a crucible test to see if this subc stuff mohamexiety[d] found is real
15:01gfxstrand[d]: Shouldn't be too hard, I don't think.
15:03mohamexiety[d]: it would be nice to know how much it affects
15:03mohamexiety[d]: cuz like the compute MME MR on a 5090 produced a 10x improvement so there's clearly some sort of WFI or such being done
15:04mohamexiety[d]: granted, I'm not sure if maybe it's a nouveau thing that is being done on blackwell when it doesn't need to be
15:07gfxstrand[d]: Yeah
15:07gfxstrand[d]: It's possible there's some register we need to whack to actually get rid of them
15:07marysaka[d]: I *think* GPFIFO subchannels are handled by one of the falcon firmwares
15:07gfxstrand[d]: Actually, if there's a bit I bet it's in openrm
15:07marysaka[d]: so yeah maybe we should dump the priv register writes done on Blackwell
15:08marysaka[d]: yeah that or FALCON_04 call
15:08gfxstrand[d]: Yeah
15:08marysaka[d]: or it's actually not done there and is in the golden context setup, but idk if we have those with GSP now 🙃
15:08gfxstrand[d]: This does feel like the kind of thing you'd put a chicken bit around
15:10ermine1716[d]: Falcon != GSP?
15:22karolherbst[d]: okay... qmd hacking time
15:36cubanismo[d]: ermine1716[d]: correct
15:42karolherbst[d]: okay..
15:42karolherbst[d]: the root desc is indeed the same 🙃
15:43karolherbst[d]: mhhhh
15:44karolherbst[d]: mhhh
16:47karolherbst[d]: okay.. I think I need to dump nvidia now
16:52pac85[d]: karolherbst[d]: finally buying amd?
16:52karolherbst[d]: yes
16:54asdqueerfromeu[d]: karolherbst[d]: Did you listen to the Linus Torvalds statement on NVIDIA too much?
16:54karolherbst[d]: he appeared in my dreams and told me to fuck off for using NVIDIA GPUs
16:57pac85[d]: That's kinda rude
17:07karolherbst[d]: said the same. He took away my tux and ferris plushies and jumped out of the window, flying towards the sunset with his cape with a big tux on it fluttering in the wind
17:12gfxstrand[d]: Wait, Torvalds took your Ferris plushies with him?
17:13karolherbst[d]: I was as baffled as you are
17:24karolherbst[d]: marysaka[d]: looked into the nvidia driver 575 or so?
17:24karolherbst[d]: though maybe I just downgrade.. not sure if that's trivial tho
17:24pac85[d]: karolherbst[d]: This is terrible, you can't possibly go on without the ferris plushies.
17:25pac85[d]: OK enough shitposting
17:25karolherbst[d]: marysaka[d]: do I just need to update the external subproject to see if a newer version works, or is there more to do?
17:26marysaka[d]: is it not working? :aki_thonk:
17:26marysaka[d]: but yeah update submodule and rebuild
17:26karolherbst[d]: I haven't tried
17:26marysaka[d]: I also rewrote most of the tracking soo
17:26marysaka[d]: is 575 on rpmfusion?
17:26marysaka[d]: I could do the update if you want :aki_thonk:
17:27karolherbst[d]: I'll be busy for the next ~30 minutes with other things 😄
17:27karolherbst[d]: yeah, if you have nothing better to do then sure go ahead, otherwise I'll look into it later
18:26karolherbst[d]: yeah.. so it fails to compile against 575
18:26karolherbst[d]: "`&NVOS32_PARAMETERS__bindgen_ty_1__bindgen_ty_8`, found `&NVOS32_PARAMETERS__bindgen_ty_1__bindgen_ty_7`" ah yes
18:37karolherbst[d]: `NVOS32_FUNCTION_DUMP` mhh got removed?
18:38karolherbst[d]: okay.. but does it also work 😄
18:40karolherbst[d]: nvidia is having a day again...
18:40karolherbst[d]: fails to load the gsp 🥲
18:42karolherbst[d]: 47 push bufs... impressive
18:45karolherbst[d]: uhhh...
18:45karolherbst[d]: something isn't right
18:46karolherbst[d]: [0x00000002] HDR 2001008d subch 0 NINC
18:46karolherbst[d]: mthd 0234 unknown method
18:46karolherbst[d]: .VALUE = 0x1
18:46karolherbst[d]: why...
18:47karolherbst[d]: something is broken there
18:49karolherbst[d]: ohh.. mhh
18:50karolherbst[d]: my 3D subchan is 0 in the dumper...
18:51karolherbst[d]: ahhhhhh
18:52karolherbst[d]: `value` is 0 here: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/headers/nv_push.c?ref_type=heads#L113
18:53karolherbst[d]: yeah.. Faith broke it with 8e93a763a36fd8999aedd5cc899877f3da4b86d9
18:56karolherbst[d]: mhh subchan 6...
18:58karolherbst[d]: they use it quite a lot
19:00karolherbst[d]: so nvidia uses NVC7C0_LAUNCH_DMA 😄
19:00karolherbst[d]: I just wish I knew what subchan 6 is
19:01karolherbst[d]: they use `NVC56F_SEM_ADDR_LO` on it
19:01karolherbst[d]: and the following
19:02karolherbst[d]: ahh it's new
19:03karolherbst[d]: c36f nice
19:04kwizart: marysaka[d]> is 575 on rpmfusion? FYI we have 580xx on rpmfusion-nonfree-updates-testing but on hold since it breaks some users...
19:07marysaka[d]: I see
19:07marysaka[d]: karolherbst[d]: if you do a MR for the update let me know :P
19:07karolherbst[d]: mhhhh
19:08karolherbst[d]: I wonder if subchan 6 is host only?
19:08karolherbst[d]: nvidia does all the semaphore stuff on it
19:08marysaka[d]: isn't it just gpfifo channel
19:08karolherbst[d]: yeah well.. it might be, point is.. we don't decode it properly 😄
19:09marysaka[d]: we do I should have fixed that last week :nya_confused:
19:09karolherbst[d]: maybe faith also broke that..
19:09marysaka[d]: ... or not
19:09marysaka[d]: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/headers/nv_push.c#L85
19:09marysaka[d]: it's missing there
19:09karolherbst[d]: yeah..
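The shape of the fix being discussed, not the actual nv_push.c code: the dumper needs a class to decode each subchannel against, and subchannel 6 carries host (GPFIFO) methods like `NVC56F_SEM_ADDR_LO`. The `cls_gpfifo` field below is hypothetical; the other `cls_*` names mirror the `nv_device_info` fields in the tree:

```c
/* Sketch: map a subchannel to the class whose headers should decode it. */
switch (subchan) {
case 0:  cls = dev->cls_eng3d;   break; /* e.g. NVC597 */
case 1:  cls = dev->cls_compute; break; /* e.g. NVC6C0 */
case 2:  cls = dev->cls_m2mf;    break;
case 3:  cls = dev->cls_eng2d;   break;
case 4:  cls = dev->cls_copy;    break; /* e.g. NV90B5 */
case 6:  cls = dev->cls_gpfifo;  break; /* e.g. NVC36F -- the missing case */
default: cls = 0;                break; /* prints "unknown method" */
}
```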
19:10karolherbst[d]: okay.. the hell is nvidia doing
19:10karolherbst[d]: bunch of `[0x0000004f] HDR 0 subch 0 NINC` thingies.. 🙃
19:10karolherbst[d]: `3b18 NVC697_CALL_MME_MACRO(99)` might be interesting
19:11gfxstrand[d]: karolherbst[d]: Oops
19:12karolherbst[d]: yeah.. SET_OBJECT handling sees a value of 0 🙂
19:12gfxstrand[d]: Oops
19:12karolherbst[d]: okay.. what's macro 99..
19:14karolherbst[d]: `DWRITE $load0 $load1` mhh
19:14karolherbst[d]: I still haven't found where they launch the compute jobs...
19:14karolherbst[d]: but I did find where they update QMDs
19:15karolherbst[d]: but...
19:15karolherbst[d]: https://gist.githubusercontent.com/karolherbst/0272cd8da462c66cb59e0bc1aaf5376d/raw/c08a86ac4e3a334cf3d77b9b3e2e57ca03057116/gistfile1.txt
19:15karolherbst[d]: ?!?
19:15karolherbst[d]: and a bunch of those
19:19karolherbst[d]: `NVC7C0_SET_TRAP_HANDLER_A` wait a second...
19:19karolherbst[d]: they install a trap handler? funky
19:19karolherbst[d]: ~~I should dump the shader~~
19:22marysaka[d]: karolherbst[d]: it's a macro
19:23marysaka[d]: can't remember the number
19:23karolherbst[d]: ehh I meant launch the compute jobs
19:23karolherbst[d]: they do use inline qmd directly
19:23marysaka[d]: it does inline upload + run
19:23marysaka[d]: like that handle both
19:23marysaka[d]: in one thing
19:24marysaka[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36093/diffs?commit_id=ae6bd8f9127ac442a037aff5a3ebc2ee580475e5&diff_id=5056100#4932fd0770f3d068158d5321972c486cb45e854b_558_552
19:25karolherbst[d]: right
19:25karolherbst[d]: ohh
19:26karolherbst[d]: they execute the compute job after the send?
19:26karolherbst[d]: mhhhhhhhh
19:26marysaka[d]: yup
19:26karolherbst[d]: mhhhhhhhhhhhhhh
19:26marysaka[d]: it's automatic
19:26karolherbst[d]: that explains everything now
19:26karolherbst[d]: no wonder the hw got angy
19:26karolherbst[d]: okay thanks!
19:27karolherbst[d]: sadly, I have something urgent to do: buying ice cream 😄
19:27karolherbst[d]: oh fuck, it's late lol
19:27marysaka[d]: too late for ice cream karol :blobcatnotlikethis:
19:28karolherbst[d]: fake news
19:56karolherbst[d]: I have ice cream
20:44phomes_[d]: I have a list of games with issues that I am looking into and another list with fps measurements. Would it be okay if I add separate tabs for these in the google sheet with game reports?
20:48gfxstrand[d]: sure
20:48gfxstrand[d]: mohamexiety[d]: Left a pile of comments on the compression MR. It honestly wasn't as much of a mess as I'd feared. I think once you deal with those, we'll be very much on the right track for an initial version.
20:49gfxstrand[d]: (Not a mess because you're bad at writing code. A mess because compression is a PITA.)
20:49mohamexiety[d]: hm. so the approach of e.g. swapping PTEs is actually viable? :thonk:
20:49mohamexiety[d]: but yeah, will take a look in a bit. thanks a lot! ❤️
20:50gfxstrand[d]: Well, no... But I left comments with what to do.
20:51gfxstrand[d]: I suppose this means I should review ZCULL soon
20:51gfxstrand[d]: Not today, though. It's 5 pm on a friday
20:53mohamexiety[d]: oh woops, didn't see the last comment. sorry
20:54gfxstrand[d]: I'm starting to feel really good about where we're headed perf wise. We've got a ways to go to close the gap but we're making progress.
20:56mohamexiety[d]: yeahh
20:57mohamexiety[d]: btw pre-Turing is a complete no-go for compression
20:57mohamexiety[d]: it's not even wired up in the kernel and from what I have been told it's very annoying
20:57gfxstrand[d]: yup
20:58gfxstrand[d]: We should very explicitly not even try pre-Turing
21:30cubanismo[d]: pre-turing compression = why VK_NV_dedicated_allocation exists. Dragons.
21:34mohamexiety[d]: yeah I heard it was really bad but didn't know it was this bad :nervous:
21:35mohamexiety[d]: also going back to older messages, apparently the GPU has blocked configuration of the compression tags since gp100 (presumably firmware controlled?) so I am not sure if it could be done on nouveau even if we wanted to for pascal/volta
21:46gfxstrand[d]: cubanismo[d]: Our first pass at compression is going to rely on dedicated allocations.
21:46gfxstrand[d]: But hopefully we'll be able to relax that over time
21:53karolherbst[d]: mhhh interesting
21:53karolherbst[d]: so nvidia does upload the cb0 via LAUNCH_DMA
21:54karolherbst[d]: now I want a QMD decoder 🙃
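A QMD decoder mostly reduces to one generic helper, since the QMD headers describe every field as a bit range over the packed structure. A sketch; the real field offsets would come from the QMDV03_00 class headers:

```c
#include <stdint.h>

/* Sketch: extract bits [lo, hi] from a packed QMD, treated as an array of
 * dwords. A decoder is then just a list of (name, hi, lo) entries taken
 * from the class headers. */
static uint64_t
qmd_get_field(const uint32_t *qmd, unsigned hi, unsigned lo)
{
   uint64_t v = 0;
   for (unsigned b = hi + 1; b-- > lo;)
      v = (v << 1) | ((qmd[b / 32] >> (b % 32)) & 1);
   return v;
}
```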
21:56karolherbst[d]: mhh what's `NVC7C0_QMDV03_00_SM_GLOBAL_CACHING_ENABLE`
21:58mohamexiety[d]: gfxstrand[d]: oh yeah actually, I guess we can thank the cursed HW for giving us that extension then :KEKW:
22:00gfxstrand[d]: From what my spies tell me, they still use it for a few things, at least pre-Blackwell. Maybe Z/S?
22:00gfxstrand[d]: I kinda wonder if we can just turn on memory compression for everything on Blackwell. Like, just set it on the BOs and be careful about alignments and let the hardware go to town.
22:01mohamexiety[d]: wait what makes blackwell special here
22:01gfxstrand[d]: Z/S use generic on Blackwell as well
22:01mohamexiety[d]: like the issue with dedicated allocs is we cant have compression in sysmem
22:01gfxstrand[d]: With a smart enough kernel, I think you can do compression without pinning
22:02mohamexiety[d]: yeah I'd trust the NV driver to be like that
22:02gfxstrand[d]: But nouveau isn't smart enough for that and we had enough of a headache getting dedicated to work.
22:03mohamexiety[d]: mohamexiety[d]: but yeah for blackwell even if we could in theory do it I doubt we actually can due to this
22:06mohamexiety[d]: I wish I could have helped make the kernel smarter in this regard tbh cuz this is just sad but _I_ am not smart enough for that sadly
22:06karolherbst[d]: huh....
22:08karolherbst[d]: Nvidia launches a 0x20000 by 0x1 CTA, where we do 0x20 by 0x20...
22:09karolherbst[d]: I wonder if they merge the jobs...
22:09mohamexiety[d]: does this matter actually?
22:09karolherbst[d]: who knows
22:10karolherbst[d]: I mean..
22:10karolherbst[d]: mhh
22:11karolherbst[d]: maybe it's also because it's a different test actually.. I always forget that the benchmark behaves differently 🙃
22:12karolherbst[d]: I kinda need to hack the benchmark to launch a specific one
22:12karolherbst[d]: nvidia sets the OCCUPANCY values tho...
22:13mhenning[d]: yeah, I've been meaning to figure out how OCCUPANCY works
22:14karolherbst[d]: well...
22:29karolherbst[d]: soooo...
22:29karolherbst[d]: what nvidia is doing is to set up a semaphore in LAUNCH_DMA and then the QMD waits on completion on it
22:29karolherbst[d]: before launching
22:30karolherbst[d]: not sure if that's terribly relevant for performance in either case here
22:34gfxstrand[d]: I do suspect using compute semaphores is gonna end up being important long-term. Same with DMA. But I haven't come up with a good strategy for it yet.
22:34karolherbst[d]: could have some caching benefits dunno
22:35karolherbst[d]: it's just cb0 they LAUNCH_DMA
22:35gfxstrand[d]: <a:shrug_anim:1096500513106841673>
22:35karolherbst[d]: but the inline QMD stuff works at least
22:36karolherbst[d]: doesn't really give a perf improvement tho 😄
22:36karolherbst[d]: maybe there are some cache invalidations I missed to remove
22:37karolherbst[d]: but the hw gets super angry if you even try to reuse a QMD across dispatches, I suspect I messed up when I got the good perf numbers somehow
22:38karolherbst[d]: but...
22:38karolherbst[d]: there is something funny I figured out
22:38karolherbst[d]: I can reuse the old inline QMD across dispatches even if I use a new QMD buffer without uploading any contents to it..
22:38karolherbst[d]: not sure if that's just "it's still in the cache" or not situation here
22:39karolherbst[d]: but doing a `INVALIDATE_SKED_CACHES` still keeps it working
22:39karolherbst[d]: ehh no, I just messed up my code, nvm
23:02cubanismo[d]: gfxstrand[d]: airlied[d] That sysmem bar for fence completion thing was really bothering me, so I followed up with architecture: It is indeed required, matching your experimental findings. There's no implied flush here. I need to ask some clarifying questions, but it sounds like on Hopper+, the WFI doesn't actually prior writes land by itself either.
23:03cubanismo[d]: Damn, hit enter on accident there. Should say: The WFI's implicit sysmembar doesn't actually imply prior writes land by itself either.
23:04cubanismo[d]: I think a very careful reading of the manuals would reveal this, but I couldn't parse it out without talking to arch first.
23:05gfxstrand[d]: cubanismo[d]: Thanks for following up! Glad to know I'm not going crazy! 😅
23:08gfxstrand[d]: cubanismo[d]: Not sure exactly what you mean there. The "after all previous writes" has to mean something. Does it just mean they've all landed relative to the GPU's view of memory but not necessarily made their way all the way to system RAM? Or is this just in the case where you do a release without setting that flag?
23:10cubanismo[d]: If you have some SMs doing a bunch of memory writes, don't do any additional coherency work, then send a XX6f semaphore release op with WFI_EN set, those writes aren't necessarily visible to the rest of the GPU before that semaphore write lands.
23:11cubanismo[d]: "those writes" being the SM writes.
23:13cubanismo[d]: So even though WFI_EN idles the downstream engines (For some complex definition of idle), and inserts an implicit sysmembar that makes any prior memory writes that have made it to the point of coherence visible to everything in and external to the GPU, the downstream engine's writes haven't necessarily made it to the point of coherence that the sysmembar applies to, so they aren't necessarily
23:13cubanismo[d]: included in the sysmembar.
23:14cubanismo[d]: If you read the entire PBDMA manual with that in mind, you can get all the details. It's just spread around.
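Pulling that together: the host (*6F) semaphore release itself, even with WFI enabled, only orders writes that already reached the point of coherence, so something upstream has to flush the SM writes first (e.g. a backend release without the flush disabled, or a shader-side membar.sys). A sketch using the host methods mentioned above; the SEM_EXECUTE field spellings are from memory:

```c
/* Sketch: host semaphore release with WFI. WFI_EN idles downstream
 * engines and implies a sysmembar, but per the above it does NOT push
 * outstanding SM writes to the point of coherence by itself. */
P_MTHD(p, NVC56F, SEM_ADDR_LO);
P_NVC56F_SEM_ADDR_LO(p, sem_addr & 0xffffffff);
P_NVC56F_SEM_ADDR_HI(p, sem_addr >> 32);
P_NVC56F_SEM_PAYLOAD_LO(p, value);
P_NVC56F_SEM_PAYLOAD_HI(p, 0);
P_NVC56F_SEM_EXECUTE(p, {
   .operation    = OPERATION_RELEASE,
   .release_wfi  = RELEASE_WFI_EN,
   .payload_size = PAYLOAD_SIZE_32BIT,
});
```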
23:23gfxstrand[d]: Okay but if I've done the appropriate flush in the shader or if I do a SHADER_CACHE_FLUSH, then I'm good, right?
23:25cubanismo[d]: I haven't read all those docs yet.
23:25cubanismo[d]: But I would assume anything that says it guarantees reads/writes have made it to the global point of coherency is sufficient.
23:26gfxstrand[d]: I don't know what anything guarantees. I don't have docs. 😂
23:26cubanismo[d]: I assumed, because I don't see anything called SHADER_CACHE_FLUSH
23:54cubanismo[d]: This is too far outside my expertise. I think the thing that's intended to flush everything is a backend semaphore release with FLUSH_DISABLE_FALSE/FLUSH_ENABLE_TRUE.
23:54cubanismo[d]: I can't speak to the other options.
23:55cubanismo[d]: But I didn't know a host WFI by itself wasn't enough before today, so I have a lot more reading to do I think.