18:34fdobridge: <airlied> @gfxstrand @karolherbst🐧 so with GSP enabled, the context gets killed when we submit the initial context draw state pushbuf at least on turing
18:34fdobridge: <karolherbst🐧🦀> what does it complain about?
18:34fdobridge: <airlied> GSP says you are dead
18:35fdobridge: <airlied> I know nothing more
18:35fdobridge: <karolherbst🐧🦀> well.. then I can't really help here either
18:35fdobridge: <airlied> the wonders of firmware 😛
18:35fdobridge: <karolherbst🐧🦀> though
18:35fdobridge: <karolherbst🐧🦀> I suspect it might be related to the channel binding
18:35fdobridge: <karolherbst🐧🦀> but...
18:35fdobridge: <airlied> GL works fine
18:35fdobridge: <karolherbst🐧🦀> in any case
18:35fdobridge: <karolherbst🐧🦀> we need errors
18:35fdobridge: <karolherbst🐧🦀> if we don't get errors, we can't use GSP
18:36fdobridge: <karolherbst🐧🦀> @airlied you could dump the buffer and see if there is anything obviously wrong
18:36fdobridge: <karolherbst🐧🦀> but...
18:37fdobridge: <karolherbst🐧🦀> I don't like guessing and I don't think we are doing anything crazy in nvk here
18:37fdobridge: <airlied> I assume at some point we can process better errors, I'd hope so
18:37fdobridge: <karolherbst🐧🦀> we should make that a requirement
18:37fdobridge: <karolherbst🐧🦀> I'm mostly done guessing what nvidia's firmware is up to
18:54fdobridge: <airlied> appears to be sw.cls init
19:04fdobridge: <airlied> ah nvc0 has if (dev->chipset < 0x140) {
19:12fdobridge: <airlied> okay 189 works around it for now for me
19:21fdobridge: <gfxstrand> That's believable
19:23fdobridge: <airlied> back to having broken sync now, which is what I wanted to dig into 😛
19:25fdobridge: <karolherbst🐧🦀> ahh yeah.. that would do it
19:25fdobridge: <karolherbst🐧🦀> ohh.. we should ask nvidia if they can give us doc on all those fancy sw methods
19:44fdobridge: <gfxstrand> Yeah, that'd be nice if they have SW methods
19:44fdobridge: <gfxstrand> Particularly, we need to make sure that we don't break the helper pixels fix
19:44fdobridge: <airlied> @gfxstrand do you see any PBDMA errors in your dmesg at all?
19:45fdobridge: <airlied> I've seen those and I think it's caused by that fix, so we might need to reconsider how it works anyways, I've no idea what GSP does here
19:45fdobridge: <gfxstrand> Yup, piles
19:45fdobridge: <airlied> yeah try commenting out the sw.cls and see do they go away
19:46fdobridge: <karolherbst🐧🦀> you remember this magic thing nvidia did where they allow userspace to set certain registers?
19:46fdobridge: <karolherbst🐧🦀> via macros and stuff
19:46fdobridge: <karolherbst🐧🦀> I suspect we need to do the same thing to be compatible with GSP
19:46fdobridge: <gfxstrand> Probably
19:46fdobridge: <gfxstrand> We just need to know how that magic works
19:46fdobridge: <karolherbst🐧🦀> and nouveau probably has to set up a buffer with bit masks of those regs (edited)
19:46fdobridge: <karolherbst🐧🦀> I checked how they did it in nvgpu
19:47fdobridge: <karolherbst🐧🦀> I should check how they do it in their open driver
19:47fdobridge: <gfxstrand> Yeah
19:47fdobridge: <karolherbst🐧🦀> anyway, if anybody throws me a branch with all the GSP stuff I can look into it, as I was planning to upstream those bits anyway
19:47fdobridge: <gfxstrand> The other option is if we can just have nouveau.ko do some stuff to the context at context creation.
19:47fdobridge: <airlied> I don't think you need GSP though, I think we already generate PBDMA errors
19:47fdobridge: <airlied> we just don't blow away the context as aggressively
19:48fdobridge: <karolherbst🐧🦀> that's not the problem
19:48fdobridge: <karolherbst🐧🦀> the core issue is, there are registers we have to mess with from userspace
19:48fdobridge: <airlied> it's not like GSP is seeing the pushbuf here, it's just reporting the hw error
19:48fdobridge: <karolherbst🐧🦀> right
19:48fdobridge: <karolherbst🐧🦀> but that will cause regressions in nvk if we don't do it
19:49fdobridge: <karolherbst🐧🦀> so we have to figure out how to do the same thing with GSP
19:49fdobridge: <airlied> yes well we should also figure out how to do it without GSP
19:49fdobridge: <karolherbst🐧🦀> there is a magic context switch mmio register with a bit which enables/disables memory loads in helper invocations
19:49fdobridge: <karolherbst🐧🦀> it's disabled by default
19:49fdobridge: <karolherbst🐧🦀> we have to enable it
19:49fdobridge: <karolherbst🐧🦀> we already do it without GSP 🙂
19:49fdobridge: <karolherbst🐧🦀> but before we upstream it, we should see how it works with GSP so it looks the same
19:49fdobridge: <airlied> no we don't do it
19:50fdobridge: <airlied> we do it, but it seems to blow up in places
19:50fdobridge: <karolherbst🐧🦀> .....
19:50fdobridge: <airlied> hence all those PBDMA errors
19:50fdobridge: <karolherbst🐧🦀> please understand what I'm writing
19:51fdobridge: <karolherbst🐧🦀> I have a patch to do it via the sw stuff
19:51fdobridge: <karolherbst🐧🦀> that isn't upstream yet
19:51fdobridge: <karolherbst🐧🦀> before upstreaming it, we should see what GSP is doing
19:51fdobridge: <karolherbst🐧🦀> and implement it the same way prior to GSP
19:52fdobridge: <airlied> I've no idea though how to work out what the GSP interface is for it
19:52fdobridge: <airlied> did nvidia ever drop any useful hints on how their userspace programs the workaround?
19:53fdobridge: <karolherbst🐧🦀> via macros
19:53fdobridge: <karolherbst🐧🦀> they interrupt the firmware
19:54fdobridge: <gfxstrand> I had a dump of all their macros at one point
19:54fdobridge: <karolherbst🐧🦀> or rather.. use a doorbell or something. Anyway, nvgpu is setting up a buffer
19:54fdobridge: <karolherbst🐧🦀> and each bit represents a mmio reg
19:54fdobridge: <karolherbst🐧🦀> and the buffer decides what userspace can mess with
19:54fdobridge: <karolherbst🐧🦀> and when bootstrapping the firmware, they pass that buffer along
19:54fdobridge: <karolherbst🐧🦀> and there are like 20? slots for doing random interactions with the firmware afaik (edited)
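
The access map described above (a buffer where each bit stands for one mmio register and decides what userspace may touch) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `AccessMap` class, the one-bit-per-32-bit-register layout, and the register offsets are hypothetical and are not taken from nvgpu, nouveau, or GSP.

```python
class AccessMap:
    """Hypothetical bitmap: bit N set means register N (offset N * 4) is writable from userspace."""

    def __init__(self, num_regs):
        self.num_regs = num_regs
        # One bit per register, rounded up to whole bytes.
        self.bits = bytearray((num_regs + 7) // 8)

    def _index(self, reg_offset):
        # Assumed layout: registers are 4 bytes wide and indexed by offset / 4.
        if reg_offset % 4 != 0:
            raise ValueError("register offset must be 4-byte aligned")
        idx = reg_offset // 4
        if not 0 <= idx < self.num_regs:
            raise ValueError("register offset outside the map")
        return idx

    def allow(self, reg_offset):
        idx = self._index(reg_offset)
        self.bits[idx // 8] |= 1 << (idx % 8)

    def is_allowed(self, reg_offset):
        idx = self._index(reg_offset)
        return bool(self.bits[idx // 8] & (1 << (idx % 8)))


# The kernel side would fill the map once and hand the buffer to the firmware.
access_map = AccessMap(num_regs=1024)
access_map.allow(0x100)  # made-up offset of a register userspace may program
```

With such a map in place, a write to `0x100` would pass the check while a write to any other offset would be rejected, without a round trip through the kernel.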
21:35fdobridge: <airlied> @karolherbst🐧 any memories or ptrs where in nvgpu to look?
21:39fdobridge: <airlied> also how did we work out the sw class fix? I don't see that in my email
21:45fdobridge: <karolherbst🐧🦀> ehh.. let me see..
21:45fdobridge: <karolherbst🐧🦀> @airlied search for "Global memory loads in helper invocations"
21:47fdobridge: <karolherbst🐧🦀> the nvgpu stuff is gr_init_get_access_map
21:47fdobridge: <karolherbst🐧🦀> and stuff using that
21:48fdobridge: <karolherbst🐧🦀> or rather `get_access_map`
21:48fdobridge: <airlied> my copy of that thread ends before anyone mentions a sw method
21:48fdobridge: <karolherbst🐧🦀> ohh
21:48fdobridge: <karolherbst🐧🦀> sw method is a software thing
21:48fdobridge: <karolherbst🐧🦀> nouveau implements it
21:48fdobridge: <karolherbst🐧🦀> there is a patch somewhere..
21:48fdobridge: <karolherbst🐧🦀> sw methods are basically interrupts on the kernel, and then the kernel handles the method from the push buffer
21:49fdobridge: <karolherbst🐧🦀> which is nice, because the mmio access is switched to the correct context then
21:50fdobridge: <airlied> oh so we just wire 604 up somewhere on the kernel side and we do the register write there?
21:50fdobridge: <karolherbst🐧🦀> yeah
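
The software-method flow discussed above (a method in the push buffer traps to the kernel, which performs the register write while the right context is active) can be sketched roughly as below. This is a toy model, not nouveau code: `Context`, `handle_sw_method`, the register offset, and the bit position are hypothetical, and the chat's "604" is read as hex `0x604`, which is also an assumption.

```python
SW_METHODS = {}  # method number -> handler, filled by the decorator below


def sw_method(mthd):
    """Register a handler for a software method number."""
    def register(fn):
        SW_METHODS[mthd] = fn
        return fn
    return register


class Context:
    """Stand-in for a channel context; regs models its context-switched mmio state."""
    def __init__(self):
        self.regs = {}


HYPOTHETICAL_REG = 0x1000   # made-up register offset
HELPER_LOAD_BIT = 1 << 0    # made-up bit position


@sw_method(0x604)
def set_helper_loads(ctx, data):
    # Set or clear the helper-invocation load bit for this context only.
    val = ctx.regs.get(HYPOTHETICAL_REG, 0)
    if data & 1:
        val |= HELPER_LOAD_BIT
    else:
        val &= ~HELPER_LOAD_BIT
    ctx.regs[HYPOTHETICAL_REG] = val
    return True


def handle_sw_method(ctx, mthd, data):
    """Called on the trap; returns False for methods nobody handles."""
    handler = SW_METHODS.get(mthd)
    if handler is None:
        return False
    return handler(ctx, data)
```

In this model an unhandled method returns `False`, which is the point where a real driver would report an error against the channel.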
21:51fdobridge: <airlied> okay I'm failing to figure out the kernel side of that
21:51fdobridge: <karolherbst🐧🦀> https://gitlab.freedesktop.org/drm/nouveau/-/commit/bfe2b42ca7de5793e4b3847ca13ef305465a9492
21:52fdobridge: <karolherbst🐧🦀> or rather
21:52fdobridge: <karolherbst🐧🦀> https://gitlab.freedesktop.org/drm/nouveau/-/commits/topic/vulkan/
21:52fdobridge: <karolherbst🐧🦀> need all of it
21:52fdobridge: <karolherbst🐧🦀> well.. the two top commits
21:53fdobridge: <karolherbst🐧🦀> the method nvidia uses for this kind of stuff obviously doesn't involve kernel roundtrips, so that's why I'd like to figure it out and do the same thing instead
23:08fdobridge: <airlied> okay there seems to be some sort of priv access map you can attach to a context in the kernel, then it lets those registers be programmed
23:10fdobridge: <karolherbst🐧🦀> yeah
23:10fdobridge: <karolherbst🐧🦀> I suspect GSP has the same thing
23:10fdobridge: <karolherbst🐧🦀> most likely even configured the exact same way
23:11fdobridge: <karolherbst🐧🦀> the annoying part will be to figure out if the gr firmware we got pre GSP even has it
23:11fdobridge: <airlied> yeah I'm trying to find the interfaces for it
23:11fdobridge: <karolherbst🐧🦀> worst case, we do SW pre GSP
23:12fdobridge: <airlied> NV0080_CTRL_FIFO_GET_ENGINE_CONTEXT_PROPERTIES_ENGINE_ID_GRAPHICS_PRIV_ACCESS_MAP is one
23:12fdobridge: <karolherbst🐧🦀> sounds like it
23:12fdobridge: <airlied> NV2080_CTRL_GPU_PROMOTE_CTX_BUFFER_ID_PRIV_ACCESS_MAP seems to be the other side of it
23:13fdobridge: <karolherbst🐧🦀> actually
23:13fdobridge: <karolherbst🐧🦀> let's do the pre GSP stuff via SW
23:14fdobridge: <karolherbst🐧🦀> we'll need to use it for perf counters anyway
23:14fdobridge: <karolherbst🐧🦀> @airlied is there a property in the new nouveau UAPI to tell if we are on GSP or not?
23:14fdobridge: <karolherbst🐧🦀> I suspect we'll want to have it for stuff like this
23:14fdobridge: <karolherbst🐧🦀> could be part of the device info stuff tho
23:16fdobridge: <airlied> so far the new uapi is only va/sparse stuff, any other new uapi should be separate
23:17fdobridge: <airlied> no point needlessly tying things together
23:17fdobridge: <airlied> if we need a GSP property it should go with the GSP patches
23:18fdobridge: <airlied> though the SW method availability should possibly be its own flag somewhere
23:18fdobridge: <airlied> I foresee more uAPI changes for GSP, but very separate to the uAPI changes for vma
23:20fdobridge: <airlied> it does seem like skeggsb code for gsp gr setup does a bit of this already so it might be easy to add there
23:55fdobridge: <🌺 ¿butterflies? 🌸> What are the odds of getting PMU fw from NV.....