07:56notthatclippy[d]: karolherbst[d]: While the code does live on the GSP as well, I can't see a good way for nouveau to call it through an API. Much better for you to just program the CE channel directly to do this. Here's the magic pushbuffer sequence: <https://github.com/NVIDIA/open-gpu-kernel-modules/blob/555/src/nvidia/src/kernel/gpu/mem_mgr/channel_utils.c#L594-L696>
07:56notthatclippy[d]: You'll want to use the pipelined variant, which has minimal impact on other running CE work. You'll then get an IRQ for the ->payload semaphore when it's done and you can put the pages back into the free list.
07:57notthatclippy[d]: IIRC Mesa is only using a couple of CE engines overall, so there would probably be zero noticeable overhead if you reserve a full CE just for driver use like this.
07:59notthatclippy[d]: Also IIRC, HW special-cases `pattern=0`, so that will be faster than a poison pattern, even if it's tempting to use one.
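A minimal sketch of that pushbuffer sequence, assuming a hypothetical `push_method()` emit helper and placeholder method/flag values; the real offsets and flag encodings are in the DMA-copy class header (e.g. clb0b5.h) and the channel_utils.c linked above:
```c
/* Sketch of a CE constant-fill ("scrub to 0") submission. Method numbers and
 * flag bits below are placeholders, not the real class defines. */
#include <stdint.h>

struct push { uint32_t *cur; };

/* Stand-in for the driver's real "emit one method + one dword" helper. */
static void push_method(struct push *p, uint32_t method, uint32_t data)
{
    *p->cur++ = method;
    *p->cur++ = data;
}

/* Placeholder method/flag values -- see the class header for the real ones. */
enum {
    SET_REMAP_CONST_B     = 0x01, /* fill pattern (dword) */
    SET_REMAP_COMPONENTS  = 0x02, /* route all dst components from the const */
    OFFSET_OUT_UPPER      = 0x03,
    OFFSET_OUT_LOWER      = 0x04,
    LINE_LENGTH_IN        = 0x05, /* length in fill units */
    LINE_COUNT            = 0x06,
    SET_SEMAPHORE_A       = 0x07, /* semaphore GPU VA, upper 32 bits */
    SET_SEMAPHORE_B       = 0x08, /* semaphore GPU VA, lower 32 bits */
    SET_SEMAPHORE_PAYLOAD = 0x09,
    LAUNCH_DMA            = 0x0a,
    /* LAUNCH_DMA flags (placeholder): remap/fill enabled, pipelined transfer,
     * release a one-word semaphore on completion. */
    LAUNCH_FLAGS_PIPELINED_FILL_RELEASE = 0xf00,
};

static void emit_scrub(struct push *p, uint64_t dst_va, uint32_t len,
                       uint64_t sem_va, uint32_t sem_payload)
{
    push_method(p, SET_REMAP_CONST_B, 0);            /* pattern 0: HW fast path */
    push_method(p, SET_REMAP_COMPONENTS, 0);
    push_method(p, OFFSET_OUT_UPPER, (uint32_t)(dst_va >> 32));
    push_method(p, OFFSET_OUT_LOWER, (uint32_t)dst_va);
    push_method(p, LINE_LENGTH_IN, len);
    push_method(p, LINE_COUNT, 1);
    push_method(p, SET_SEMAPHORE_A, (uint32_t)(sem_va >> 32));
    push_method(p, SET_SEMAPHORE_B, (uint32_t)sem_va);
    push_method(p, SET_SEMAPHORE_PAYLOAD, sem_payload);
    push_method(p, LAUNCH_DMA, LAUNCH_FLAGS_PIPELINED_FILL_RELEASE);
}
```
When the CE retires the launch it writes `sem_payload` to the semaphore and raises the interrupt mentioned above, at which point the scrubbed pages can go back on the free list.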
08:07pac85[d]: I have a question about nvidia hw. So if workgroup shared memory is actually L1 memory and the driver needs to reconfigure things when running a CS to make that work, I wonder how it works for async compute?
08:08pixelcluster[d]: IIRC it doesn't work like amd where stuff from two different queues can run at the same time
08:09pac85[d]: Ah
08:15notthatclippy[d]: The driver shouldn't be involved in any of the context switches. So while things may not run literally _in parallel_, the HW should be able to do the switching really fast between compute and graphics just as well.
08:16notthatclippy[d]: With the exception being Maxwell (1 and 2) where some of the relevant HW was removed for power efficiency reasons. But this was just as async compute started becoming a big deal so it came back quickly.
08:16notthatclippy[d]: At least IIRC, it's been almost 10 years.
08:17karolherbst[d]: yeah... the issue is mostly that ttm is responsible for most of those things and it's kinda messy to touch
08:17karolherbst[d]: not sure if we could easily add a scrubber there
08:19notthatclippy[d]: Could also add one as a userspace service :D
08:20notthatclippy[d]: notthatclippy[d]: Feels like something one of those "enhanced security on linux" consultancies would add as an LD_PRELOAD-able thingie
08:22pac85[d]: notthatclippy[d]: Ah interesting, though in the async compute case one of the things I would expect to happen is a single CU (an SM, in NVIDIA terms) keeping track of at least one compute and one gfx context at a time and being able to switch between them to hide latency. Now in this case LDS should be preserved when switching back and forth, so even switching fast enough isn't enough
08:22pac85[d]: right? Unless gfx can also run while part of the L1 cache is not available
08:23karolherbst[d]: the hardware is a bit more dynamic than that
08:24karolherbst[d]: not the entire thing needs to run at the same time
08:24notthatclippy[d]: I think I'm gonna have to stop commenting here because I just don't know the full answer, and if you're gonna get a wrong answer I'd rather it comes from someone else :P
08:24karolherbst[d]: heh
08:24pac85[d]: notthatclippy[d]: Fair no worries
08:24karolherbst[d]: but anyway, you can have warps running from different jobs at the same time, afaik
08:25karolherbst[d]: so if one exits or yields, the scheduler just picks a new one regardless of the "job" having completed or not
08:26karolherbst[d]: the issue is mostly that warps reserve GPRs, and there aren't enough GPRs to cover all needs, so the scheduler needs to be smarter anyway
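To put rough numbers on that register pressure (the register file size and warp width below are assumptions for illustration, not figures from the chat):
```c
/* Back-of-the-envelope limit on resident warps from GPR reservation alone. */
#include <stdio.h>

int main(void)
{
    const int regs_per_sm      = 64 * 1024; /* 32-bit GPRs per SM (assumed) */
    const int regs_per_thread  = 128;       /* what a register-heavy shader might use */
    const int threads_per_warp = 32;

    /* Each resident warp keeps its registers reserved for its whole lifetime,
     * so the scheduler can only keep this many warps in flight per SM: */
    int resident_warps = regs_per_sm / (regs_per_thread * threads_per_warp);
    printf("resident warps per SM: %d\n", resident_warps); /* -> 16 */
    return 0;
}
```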
08:27pac85[d]: karolherbst[d]: So I've come across this thing where nvidia hw can only overlap compute and gfx in the same CU if they come from different queues
08:27karolherbst[d]: and I do wonder how much the driver has to configure there
08:27pac85[d]: And was wondering whether the L1 thing had anything to do with it
08:27karolherbst[d]: pac85[d]: L1 is private to the warp, so why would that matter?
08:28karolherbst[d]: or rather
08:28karolherbst[d]: private to the SM they are running on
08:28pac85[d]: karolherbst[d]: When you say warp, what do you mean?
08:28karolherbst[d]: subgroup
08:28pac85[d]: karolherbst[d]: Yeah so if you want a compute and gfx dispatch to be alive at the same time in the same SM how do you deal with the L1 partitioning?
08:29karolherbst[d]: each subgroup gets its own slice
08:29pac85[d]: I understand how it would work for compute and gfx in different SMs
08:29pac85[d]: Uhm
08:29pac85[d]: I see
08:29karolherbst[d]: though the context switcher would need to be able to also save it, so yeah..
08:29karolherbst[d]: I think
08:30pac85[d]: I'm somewhat confused
08:30karolherbst[d]: L1 on nvidia hardware is just weird
08:30pac85[d]: Like, each warp gets a private L1 cache?
08:30karolherbst[d]: well...
08:30karolherbst[d]: the thing is, that L1 is weird
08:30pac85[d]: So reads and writes are incoherent in the same workgroup?
08:30karolherbst[d]: yeah, should be
08:30pac85[d]: Wow
08:31karolherbst[d]: but workgroup is something else
08:31karolherbst[d]: but yeah
08:31karolherbst[d]: mhh wait
08:31karolherbst[d]: ehh
08:31karolherbst[d]: shared memory is allocated for the entire workgroup, not warp
08:31karolherbst[d]: but yeah, I think it's still incoherent
08:32pac85[d]: Well yeah it has to be according to the spec
08:32pac85[d]: But when you say each warp gets a slice of L1, I thought you meant that warps don't share the cache when accessing memory
08:33pac85[d]: I'm really confused now
08:33karolherbst[d]: same
08:33karolherbst[d]: I should read up on those things again, but it's not really relevant to the fact that L1 is weird
08:33karolherbst[d]: it's partitioned between "internal use" (e.g. shared memory) and "data cache" depending on the work load
08:34karolherbst[d]: and you need to tell the hardware when scheduling jobs how much is shared memory
08:34karolherbst[d]: 3D does that internally, because L1 is used for various things there
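For what it's worth, the same L1-vs-shared split is exposed to applications as a per-kernel hint in CUDA; a small sketch (the `fill_tile` kernel is just a placeholder):
```cuda
#include <cuda_runtime.h>

__global__ void fill_tile(float *out)
{
    extern __shared__ float tile[];          /* lives in the L1 carveout */
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}

int main(void)
{
    float *out;
    cudaMalloc(&out, 256 * sizeof(float));

    /* Ask for roughly half of L1 to be configured as shared memory for this
     * kernel; the driver rounds to a split the SM actually supports. */
    cudaFuncSetAttribute(fill_tile,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);

    fill_tile<<<1, 256, 256 * sizeof(float)>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```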
08:34pac85[d]: Uhm
08:34redsheep[d]: Fwiw the L2 is weird too... Just a different kind of weird. Nvidia hardware is just weird.
08:35karolherbst[d]: L2 is less weird
08:35karolherbst[d]: it's pretty straightforward actually
08:35karolherbst[d]: at least it's not used for random stuff
08:35karolherbst[d]: anyway
08:35karolherbst[d]: L1 is an integral part of the pipeline, so it needs to be saved/restored on context switches
08:36redsheep[d]: Tiled caching exists, which is pretty weird. But yeah, less weird.
08:37karolherbst[d]: but anyway
08:37karolherbst[d]: the hardware doesn't switch between "jobs"
08:37karolherbst[d]: afaik
08:37karolherbst[d]: it just schedules warps from jobs and they run pretty independently
08:45pac85[d]: karolherbst[d]: Where is it saved to?
08:47karolherbst[d]: VRAM mostly
08:47karolherbst[d]: I think...
08:51karolherbst[d]: though most of the time it doesn't really matter
09:20marysaka[d]: karolherbst[d]: I feel that's what `SCG_GRAPHICS_SCHEDULING_PARAMETERS` is supposed to be for
09:21marysaka[d]: (and the same one for compute exist)
09:50phodius[d]: hi, do wlroots based compositors work with NVK zink? i'm on latest 24.2-dev and i get an error on wayfire
09:51phodius[d]: https://pastebin.com/87P1rfb4
13:04zmike[d]: it should work
13:07zmike[d]: if you're able to debug, I'd suggest checking out `zink_get_display_device()` and see why it isn't finding your device
13:21ericonr: phodius[d]: fun, I was testing the same thing recently, and didn't manage to launch it either. I will check if the crash I got was the same. (My GPU is a 1050 Ti though, so I didn't want to pursue it before it's actually supported)
13:25phodius[d]: i think it's a bug, i think they'll have it nailed by the next release, i am using the development build after all
13:47karolherbst[d]: marysaka[d]: there are two parts to this: interleaving work within the same engine, and interleaving across engines (3D and compute), and those parameters are for the 3D + compute part
13:56marysaka[d]: hmm I see
15:06karolherbst[d]: skeggsb9778[d]: (or anybody else) any idea what `fault 01 [VIRT_WRITE] at 0000000006b84000 engine 40 [gr] client 71 [GPC4/] reason 0a [UNSUPPORTED_APERTURE] on channel 1 [01ffc8f000 sway[768]]` means? The `UNSUPPORTED_APERTURE` bit specifically
15:06karolherbst[d]: it's in regards to https://gitlab.freedesktop.org/drm/nouveau/-/issues/374
15:16notthatclippy[d]: First guess: something programmed the CE to access sysmem, but sysmem access is disabled for it.
15:19asdqueerfromeu[d]: notthatclippy[d]: So it isn't camera-related?
15:20notthatclippy[d]: Could easily be, I'm just basing this on the print.
15:21notthatclippy[d]: But, looking a bit closer, you'll also get this fault if you describe system memory with invalid PTEs, such as making the GPU think sysmem is compressed.
15:26notthatclippy[d]: Yep, looks like we had exactly that bug on Ampere, same as above, and the fix was this <https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/src/kernel/mem_mgr/virtual_mem.c#L1324-L1341>