00:31 fdobridge: <g​fxstrand> `60/60 sessions passed, conformance test PASSED`
00:31 fdobridge: <g​fxstrand> That's Turing
00:31 fdobridge: <g​fxstrand> I have my test box back!!!
00:32 fdobridge: <r​edsheep> Which architectures have you gotten passing now?
00:32 fdobridge: <g​fxstrand> Turing and Ampere
00:32 fdobridge: <r​edsheep> I assume you'll do ada and then be done?
00:32 fdobridge: <g​fxstrand> Ada keeps hitting random timeouts
00:32 fdobridge: <g​fxstrand> Probably because it's a laptop and power management is always more funky there
00:33 fdobridge: <g​fxstrand> I'm running `nouveau.runpm=0` but still
00:33 fdobridge: <r​edsheep> Do you have any ada desktop card?
00:34 fdobridge: <r​edsheep> If I can replicate your test conditions maybe I could run the conformance for you if you don't have a card at all
00:35 fdobridge: <g​fxstrand> No, I don't have a desktop ada
04:17 fdobridge: <g​fxstrand> bl4ckb0ne, @karolherbst https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28794
10:56 fdobridge: <a​huillet> does Nouveau try to enter RTD3? if so, there's potentially some fixes to do, as @marysaka discovered the hard way
11:09 fdobridge: <k​arolherbst🐧🦀> yeah... we need to copy&paste all those workarounds nvidia's driver is doing
11:16 fdobridge: <m​ohamexiety> what's RTD3?
11:16 fdobridge: <k​arolherbst🐧🦀> d3cold
11:17 fdobridge: <m​ohamexiety> oh
12:07 fdobridge: <m​tijanic> I don't know if nouveau really needs all the hacks that RM is doing. A lot of that was necessitated by UMDs that poke the GPU directly in the proprietary stack, which all goes through the kernel on nouveau IIUC.
12:07 fdobridge: <m​tijanic> eg I think the USERD thing that Mary hit doesn't apply at all to the mesa/nouveau stack
12:09 fdobridge: <k​arolherbst🐧🦀> yeah, but it's more about chipset specific workarounds
12:09 fdobridge: <m​tijanic> Oh, those are _fun_.
12:10 fdobridge: <k​arolherbst🐧🦀> I've added one of those: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.8.7#n736
12:10 fdobridge: <k​arolherbst🐧🦀> but uhhh..
12:10 fdobridge: <k​arolherbst🐧🦀> no idea
12:10 fdobridge: <k​arolherbst🐧🦀> 😄
12:11 fdobridge: <m​tijanic> In a weird chain of battlefield promotions I ended up owning all those workarounds in RM a while back. I have exactly the same info as you do with OpenRM for why they were needed.
12:11 fdobridge: <k​arolherbst🐧🦀> ohh, that would be super useful information
12:11 fdobridge: <k​arolherbst🐧🦀> I tried to ask for them multiple times
12:11 fdobridge: <m​tijanic> But if there's a specific thing, I can do some commit history spelunking or try to track down the original authors, etc.
12:12 fdobridge: <k​arolherbst🐧🦀> I think it would be easier if we'd have a "this system causes issues, what workarounds do we need to apply" kinda thing
12:12 fdobridge: <k​arolherbst🐧🦀> or to better understand on what's going on with the existing one I've added
12:13 fdobridge: <m​tijanic> This is the workarounds file we have: <https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/src/kernel/platform/chipset/chipset_info.c>
12:13 fdobridge: <k​arolherbst🐧🦀> `{PCI_VENDOR_ID_INTEL, DEVICE_ID_INTEL_1901_ROOT_PORT, Intel_Skylake_setupFunc},` :ferrisUpsideDown:
12:13 fdobridge: <k​arolherbst🐧🦀> that's the root port we deal with atm
12:14 fdobridge: <k​arolherbst🐧🦀> https://github.com/NVIDIA/open-gpu-kernel-modules/blob/ea4c27fad63a607e663732842221ef619156ec24/src/nvidia/src/kernel/platform/chipset/chipset_pcie.c#L3317
12:14 fdobridge: <k​arolherbst🐧🦀> ...
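The `chipset_info.c` file linked above is essentially a PCI-ID-keyed quirk table: each entry maps a root-port vendor/device pair to a setup function that flags the workarounds that platform needs. A minimal standalone sketch of that pattern (the struct name, lookup function, and the `skylake_setup` stand-in are illustrative, not the actual RM symbols; only the two ID macros mirror the entry quoted above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IDs as in the quoted RM table entry. */
#define PCI_VENDOR_ID_INTEL            0x8086
#define DEVICE_ID_INTEL_1901_ROOT_PORT 0x1901

struct chipset_quirk {
    uint16_t vendor;
    uint16_t device;
    void (*setup)(void); /* e.g. Intel_Skylake_setupFunc in RM */
};

static int skylake_setup_ran;
static void skylake_setup(void) { skylake_setup_ran = 1; }

/* Table scanned once at probe time, keyed on the root port's PCI IDs. */
static const struct chipset_quirk quirks[] = {
    { PCI_VENDOR_ID_INTEL, DEVICE_ID_INTEL_1901_ROOT_PORT, skylake_setup },
    { 0, 0, NULL }, /* sentinel */
};

static void apply_quirks(uint16_t vendor, uint16_t device)
{
    for (const struct chipset_quirk *q = quirks; q->setup; q++)
        if (q->vendor == vendor && q->device == device)
            q->setup();
}
```

The setup function is where the per-chipset property bits discussed below get set.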
12:14 fdobridge: <m​tijanic> Okay, so the one thing to note is that a lot of these need to be sent down to GSP at boot time. Best _exactly_ as OpenRM is doing it.
12:15 fdobridge: <k​arolherbst🐧🦀> mhh I see
12:16 fdobridge: <k​arolherbst🐧🦀> the frustrating part about all of this is that laptop vendors didn't know anything, and instead of fixing nouveau they added workarounds in their firmware, which Canonical then pushed for. And Intel doesn't know about anything either 🤷
12:16 fdobridge: <k​arolherbst🐧🦀> and nvidia was being nvidia
12:16 fdobridge: <k​arolherbst🐧🦀> but luckily it's open source now...
12:16 fdobridge: <m​tijanic> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/nouveau/include/nvrm/535.113.01/nvidia/inc/kernel/gpu/gsp/gsp_static_config.h#n156 this field is actually just a bunch of bits...
12:17 fdobridge: <m​tijanic> ```c
12:17 fdobridge: <m​tijanic> pCl->setProperty(pCl, PDB_PROP_CL_IS_CHIPSET_IN_ASPM_POR_LIST, NV_TRUE);
12:17 fdobridge: <m​tijanic> pCl->setProperty(pCl, PDB_PROP_CL_ROOTPORT_NEEDS_NOSNOOP_WAR, NV_TRUE);
12:17 fdobridge: <m​tijanic> ``` this sets the bit
12:17 fdobridge: <k​arolherbst🐧🦀> it took us around 3 years to come up with a reliably working workaround, because....
12:17 fdobridge: <k​arolherbst🐧🦀> this shit is a mess
12:17 fdobridge: <k​arolherbst🐧🦀> especially if nobody tells you anything
12:17 fdobridge: <m​tijanic> <https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/generated/g_chipset_nvoc.h#L263-L299> this is the order of the bits in that field.
12:17 fdobridge: <k​arolherbst🐧🦀> the best reply I got was from intel "nah, our hardware is fine"
12:18 fdobridge: <m​tijanic> I'm linking to `/main/` but you want the 535.113.01 version of that file
12:18 fdobridge: <k​arolherbst🐧🦀> I kinda wished we could just copy&paste all those workarounds
12:19 fdobridge: <m​tijanic> Actually, belay that. _this_ is the actual order of bits:
12:19 fdobridge: <m​tijanic> <https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/src/kernel/platform/chipset/chipset.c#L776-L818>
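Per the snippets above, each `PDB_PROP_CL_*` property ends up as one bit in that u32 field of the GSP static config, packed in the order the properties are enumerated in `chipset.c`. A hedged standalone sketch of the packing (the bit positions and names here are made up for illustration; the real order is the one in the linked file):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions -- the authoritative order is the
 * enumeration in chipset.c linked above, not these values. */
enum cl_pdb_prop {
    PROP_CL_IS_CHIPSET_IN_ASPM_POR_LIST = 0,
    PROP_CL_ROOTPORT_NEEDS_NOSNOOP_WAR  = 1,
    /* ...a few dozen more in the real driver... */
};

static uint32_t cl_pdb_bits;

/* Analogue of pCl->setProperty(pCl, PDB_PROP_..., NV_TRUE):
 * each property toggles exactly one bit in the packed word. */
static void set_property(enum cl_pdb_prop prop, int value)
{
    if (value)
        cl_pdb_bits |= (uint32_t)1 << prop;
    else
        cl_pdb_bits &= ~((uint32_t)1 << prop);
}
```

The resulting packed word is what would land in the field from `gsp_static_config.h` linked earlier.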
12:19 fdobridge: <k​arolherbst🐧🦀> splendid
12:19 fdobridge: <k​arolherbst🐧🦀> I'm kinda burnt out from all that RTD3 mess, so it kinda needs to be somebody else dealing with this nonsense
12:20 fdobridge: <s​amantas5855> oi you are the guy from the open kernel modules repo
12:20 fdobridge: <s​amantas5855> hi
12:20 fdobridge: <m​tijanic> You can search for a specific "PDB_PROP_CL" to see if there's any workarounds applied in the kernel.
12:20 fdobridge: <m​tijanic> (hi!)
12:22 fdobridge: <!​DodoNVK (she) 🇱🇹> The current nouveau GSP code still relies on that GSP firmware version
12:22 fdobridge: <m​tijanic> Yeah, sorry, I just had the /main/ open and forgot to switch.
12:23 fdobridge: <m​tijanic> There's not too many workarounds and you probably don't even care for all of them, so just manually copying the relevant bits should be fine. You end up with 300 lines of crap shoved in a file no one ever looks at. 200 lines if it's rust :)
12:23 fdobridge: <k​arolherbst🐧🦀> perfect
12:24 fdobridge: <m​tijanic> If there's a particular thing you need history on, ask and I'll go digging. Some have a bug ID attached even!
12:28 fdobridge: <m​tijanic> Although, I don't think RTD3 works on GSP in 535. I think we only started testing it in 545 or 550, and there's a set of bugfixes about it slated for 555
12:28 bl4ckb0ne: gfxstrand: thanks!
12:35 fdobridge: <k​arolherbst🐧🦀> ahh, good to know
12:35 fdobridge: <k​arolherbst🐧🦀> so we should probably update to 555 as soon as that's out
12:36 fdobridge: <v​alentineburley> There's Nvidia folks on #Nouveau helping out - actually crazy
12:36 karolherbst: airlied, dakr: ^^ seems like we should update to 555 for a bunch of runpm fixes
12:36 fdobridge: <v​alentineburley> Great to see!
12:39 fdobridge: <v​alentineburley> I don't think 535 supports the new 40 Super cards so it's probably a good time to update anyway
12:42 fdobridge: <k​arolherbst🐧🦀> ....
12:42 fdobridge: <k​arolherbst🐧🦀> *sigh*
12:43 fdobridge: <k​arolherbst🐧🦀> @notthatclippy sooo.. we've been trying to get more useful information from nvidia in regards to "this new hardware batch needs this new GSP version" anything you could do to help out there?
12:43 fdobridge: <k​arolherbst🐧🦀> or to be more specific, is that true that the new 40 super cards will need a version newer than 535? 😄
12:46 fdobridge: <a​huillet> I don't specifically know the answer, but you'll most likely need a firmware version that knows about the existence of the board in question
12:46 fdobridge: <m​tijanic> Let me consult the decoder ring.. <https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#RTX_40_series> (actually best resource I know of)
12:46 fdobridge: <m​tijanic> Yes, very likely. Might be hackable somehow, but I doubt it.
12:46 fdobridge: <a​huillet> lol @decoder ring, I use the same...
12:47 fdobridge: <k​arolherbst🐧🦀> same
12:47 fdobridge: <a​huillet> @karolherbst so if 40xx super is released with 555 drivers, you want 555 firmware, otherwise it's going to be a massive pain for anybody to help you
12:47 fdobridge: <k​arolherbst🐧🦀> *Sigh*
12:47 fdobridge: <k​arolherbst🐧🦀> thanks for getting a proper engineering answer though
12:48 fdobridge: <a​huillet> actually Milos is more qualified than I am to answer that authoritatively
12:48 fdobridge: <m​tijanic> Oh, wait, no, this is AD103 which wasn't supported before anywhere. Then yeah, old GSP will not know how to wire the HALs
12:48 fdobridge: <k​arolherbst🐧🦀> okay, so every new chipset absolutely needs a new firmware
12:48 fdobridge: <a​huillet> yes Sir, at least I don't see how we would ever commit to something else
12:48 fdobridge: <a​huillet> maybe you're lucky from time to time, but you can't rely on it
12:48 fdobridge: <k​arolherbst🐧🦀> it's not like it matters all too much
12:48 fdobridge: <k​arolherbst🐧🦀> mhh
12:49 fdobridge: <k​arolherbst🐧🦀> we do support ad103 though
12:49 fdobridge: <k​arolherbst🐧🦀> and I think it works
12:49 fdobridge: <m​tijanic> New chip, yes. New board/product with old chip but new PCI devID can probably be made to work with some caveats (such as generic name, etc)
12:49 fdobridge: <k​arolherbst🐧🦀> I'm more concerned about situations like where there is a new batch of hardware, but needs updated firmware for $unknown_reason like we had with pascal and turing
12:50 fdobridge: <k​arolherbst🐧🦀> it wasn't new chipsets, just uhh..
12:50 fdobridge: <k​arolherbst🐧🦀> new something
12:50 fdobridge: <k​arolherbst🐧🦀> or is the number behind the ad103 also relevant here?
12:50 fdobridge: <m​tijanic> You're right, AD103 is supported in 535. So.. ¯\_(ツ)_/¯
12:51 fdobridge: <k​arolherbst🐧🦀> well...
12:51 fdobridge: <k​arolherbst🐧🦀> maybe with GSP it's not an issue, but with the "old school" firmware we required updates, for reasons never explained to us, and it generally took a year to get nvidia to confirm we needed an update
12:52 fdobridge: <k​arolherbst🐧🦀> and afaik it's about newer batches of hardware, not even new chipsets
12:52 fdobridge: <k​arolherbst🐧🦀> so some tu102 worked, some didn't
12:53 fdobridge: <k​arolherbst🐧🦀> but anyway.. the issue with updating often is: 1. work and time and 2. the initramfs situation is a mess and atm the initramfs would ship each firmware file
12:54 fdobridge: <k​arolherbst🐧🦀> and if your distro has like 3 kernels installed, that means 3 times each 30MB file
12:54 fdobridge: <!​DodoNVK (she) 🇱🇹> Does Ada still use old firmware path on the NVIDIA proprietary driver? 🧓
12:57 fdobridge: <m​tijanic> Currently, for closed source drivers, everything is using the old paths except Turing+ (incl. Ada) _datacenter_ GPUs (so eg L40 for Ada).
12:58 fdobridge: <m​tijanic> (by default that is. There's a module param to force GSP on closed source; or just use openRM)
13:00 fdobridge: <m​arysaka> huh Ada still have some Falcons around or am I misreading this
13:03 fdobridge: <m​tijanic> "Falcon" is a highly overloaded term. I don't think Ada has any FALCON ISA cores, it's all risc-v
13:03 fdobridge: <m​arysaka> Okay so it's all gone now nice 😄
13:03 fdobridge: <m​tijanic> For a couple of gens we put both FALCON and RISC-V cores onto every microcontroller so that we had a fallback if riscv didn't work out for whatever reason, but it did.
13:04 fdobridge: <m​tijanic> But the term Falcon is still used to effectively mean "microcontroller on the GPU"
13:08 fdobridge: <m​ohamexiety> it does
13:09 fdobridge: <m​ohamexiety> I actually bought a super card specifically for this a while back and it just worked™️
13:09 fdobridge: <v​alentineburley> Ah cool
13:10 fdobridge: <m​ohamexiety> was surprised because prop NVIDIA without the newer driver just says "NVIDIA Device" or something like that. I guess GSP does some lifting here given these new cards are based on the same old chip (e.g. AD104, 103, etc)
15:40 fdobridge: <g​fxstrand> It shouldn't because I've got `nouveau.runpm=0` set.
15:42 fdobridge: <a​huillet> what does the timeout look like? who's timing out and with what message?
15:42 fdobridge: <a​huillet> there may be a way to figure out what's going on by inspecting some SuperSecret(TM) GPU registers at the hang point
15:42 fdobridge: <g​fxstrand> There's no hang message
15:43 fdobridge: <g​fxstrand> It's just a timeout
15:43 fdobridge: <a​huillet> this stuff will also be relevant for shader fault handling (working on getting the registers documented publicly)
15:49 fdobridge: <g​fxstrand> Yeah, getting shader fault handling working would be amazing!
15:49 fdobridge: <g​fxstrand> That's my biggest gripe with GSP.
15:49 fdobridge: <g​fxstrand> Among other things, I can't implement VK_KHR_device_fault right now. But also I'd like it for my own debugging.
15:50 fdobridge: <g​fxstrand> Sometimes I still swap in a Turing card and shut off GSP just so I can get debug info.
15:50 fdobridge: <g​fxstrand> Like, I'll probably do that for S8 unless it "just works"
15:58 fdobridge: <p​avlo_kozlenko> Wayland worked perfectly for me on Kepler with mesa
16:01 fdobridge: <p​avlo_kozlenko> Judging by the name of the .bin, pascal probably also has something like RISC-V...
16:03 fdobridge: <a​huillet> GSP isn't directly related to that though is it? just that the Nouveau codepaths are different?
16:03 fdobridge: <k​arolherbst🐧🦀> we get less useful error messages with GSP I think
16:03 fdobridge: <k​arolherbst🐧🦀> though Timur posted some patches which should help with that I think?
16:04 fdobridge: <k​arolherbst🐧🦀> like wiring up `NVreg_RegistryDwords` and I think that can be used to get more debugging out of GSP?
16:05 fdobridge: <a​huillet> likely. Timur or Milos would know more, I'm not a RM guy.
16:06 fdobridge: <g​fxstrand> For shader exceptions, GSP gives better messages. It gives actual string descriptions rather than whatever enum someone figured out for nouveau back in the day.
16:07 fdobridge: <g​fxstrand> For faults, we get nothing besides "mmu fault queued". For method errors, we get an inscrutable set of 3 hex values.
16:07 fdobridge: <g​fxstrand> When I say "we get" I mean that's what nouveau gives me on dmesg. IDK what the FW interface looks like.
16:08 fdobridge: <g​fxstrand> I assume the method error data we had before is probably encoded in those 3 hex values somehow but IDK how to parse them.
16:10 fdobridge: <a​huillet> I'm not equipped to reproduce that right now, but if you can share an example maybe we can get this explained
16:11 fdobridge: <g​fxstrand> I don't have a handy one at the moment but I'll toss you the next one I see
16:18 fdobridge: <!​DodoNVK (she) 🇱🇹> Is Milos the name or surname? 🇧🇷
16:20 fdobridge: <s​unrise_sky> Wow, I never realized we had another Timur in the community
16:27 fdobridge: <r​edsheep> Not sure whether this is the kind of reproducer you wanted, but doom 2016 vulkan mode is pretty consistent in hitting these mmu faults https://gitlab.freedesktop.org/mesa/mesa/-/issues/10910
16:27 fdobridge: <!​DodoNVK (she) 🇱🇹> Timur Tabi
16:32 fdobridge: <g​fxstrand> Ugh... Looks like some things have regressed. 😢 I'll get it sorted once I get a new baseline CTS run.
16:38 fdobridge: <S​id> the thing is
16:38 fdobridge: <S​id> Doom 2016 is not a vulkan conformant app
16:38 fdobridge: <S​id> afaik
16:41 fdobridge: <S​id> if I'm not wrong DOOM needed a workaround on radv as well
16:41 fdobridge: <S​id> which nv prop already had since they get access to games prior
16:44 fdobridge: <r​edsheep> That may be, but either way it does kick up the shader fault
16:45 fdobridge: <r​edsheep> For the purpose of trying to fix getting useful details out of faults it should be a usable test
16:45 fdobridge: <r​edsheep> The app having invalid behavior is kind of beside the point
16:46 fdobridge: <S​id> but what if invalid behavior is the cause of the fault in that app
16:47 fdobridge: <r​edsheep> It probably is, but that's fine because afaict ahuillet was just looking for something that would kick up a fault, and it does
16:47 fdobridge: <g​fxstrand> The thing being talked about right now is trying to get the fault information out of the kernel for when something does fault.
16:48 fdobridge: <g​fxstrand> Pre-GSP, the kernel would print an actual hex address for a fault. These days, it doesn't. That's the thing that @redsheep is talking about here.
16:48 fdobridge: <g​fxstrand> For that, a broken app is just the ticket.
16:48 fdobridge: <g​fxstrand> As for NVIDIA working around it, I suspect that has to do with them shutting off faults.
16:52 fdobridge: <S​id> ah, fair
16:52 fdobridge: <S​id> I was under the impression we're aiming to fix the bug, completely overlooking the fact we don't have enough info to do so, and that the discussion was to get said info
16:53 fdobridge: <g​fxstrand> I used to have a branch in my Mesa repo that had a bad patch that utterly destroys Intel GPUs that I kept around specifically so that the CI guys had something handy for testing CI robustness.
16:54 fdobridge: <S​id> though, I do wonder if we can hook this up in nouveau kmod to get more info out of the GSP
16:56 fdobridge: <g​fxstrand> Looks like someone broke barycentrics. 🤔 Probably a pretty simple fix once I bisect
17:03 fdobridge: <m​tijanic> RmMsg is for the NVIDIA kernel driver; it does nothing for GSP and therefore nothing for nouveau
17:04 fdobridge: <S​id> sad
17:04 fdobridge: <S​id> what about the gsp_log_XX10x.bins :P
17:04 fdobridge: <S​id> are those interal only?
17:04 fdobridge: <m​tijanic> There's prints that GSP pushes but those require a private database file to decode. I think Timur added the logic to decode it to nouveau, but we can't actually publish those.
17:04 fdobridge: <S​id> I can't type...
17:05 fdobridge: <m​tijanic> Yes, that's the database I meant
17:05 fdobridge: <S​id> fair, I imagine it reveals a bit too much of what the GSP firmware does
17:05 fdobridge: <m​tijanic> We can't just publish them without auditing every single print that's in there, and that's a lot of additional effort/$$$ that would need staffing.
17:06 fdobridge: <S​id> understandable
17:06 fdobridge: <S​id> not ideal/very fun to hear, but definitely understandable
17:07 fdobridge: <m​tijanic> That said, if someone here is debugging a fault and hits a wall, they can ask one of us to decode, and then we can paste the relevant bits back. Not scalable, but if it's 10 minutes of my time and saves a few hours for you...
17:09 fdobridge: <m​tijanic> But lack of feedback from GSP to nouveau is a known problem for us and one that we'd like to solve somehow eventually. Things just move slowly.
17:39 Lyude: hm. airlied / dakr: I'm taking a look again at the low memory issue with runtime PM where we end up failing to allocate memory to hold the vram contents of the GPU in high-memory fragmentation situations. airlied if I remember correctly you had mentioned we should make nvkm_gsp_mem_ctor() use vmalloc() + sg tables, but are we actually sure all falcons are able to boot off that?
17:40 fdobridge: <g​fxstrand> Nah, looks like new tests or something. Whatever. I filed a bug and assigned @marysaka . I'll come back to it if I have time but I'd rather get some of this backlog merged today.
17:41 Lyude: I'm not 100% sure but https://paste.centos.org/view/db84b9db I wrote up a patch to convert nvkm_gsp_mem_ctor() over to doing that, and it seems like the first falcon boot we attempt fails now - but I can see something that at least seems like it is a proper dma_address for the falcon mbox so I think it might not actually be an issue with my patches
17:41 Lyude: The values I'm seeing for context btw:
17:41 Lyude: [ 9.429884] Lyude:r535_gsp_init:2186: (mbox1) == 0
17:41 Lyude: [ 9.429898] Lyude:r535_gsp_init:2186: (mbox0) == dbdfe000
17:42 Lyude: ( ^ TimurTabi as well I guess, realizing there's a chance you might know if this is possible or not)
17:47 fdobridge: <m​tijanic> Is this the FSP falcon that's failing to load?
17:48 Lyude: do you mean the GSP falcon? since I'm pretty sure that's the falcon failing here
17:48 fdobridge: <m​tijanic> The GSP bootup logic is complicated. You first need the booter ucode loaded on FSP, then that starts the GSP and then you send gsp.bin to GSP.
17:49 Lyude: gimme a sec, I will just show you the full backtrace
17:50 fdobridge: <m​tijanic> (AFAICT the FSP is just called 'booter' in nouveau)
17:50 Lyude: https://paste.centos.org/view/b773d549
17:50 Lyude: ah yes
17:50 Lyude: it is the FSP then
17:51 fdobridge: <m​tijanic> Bah, everything got inlined in that stacktrace...
17:51 Lyude: i can translate it one sec
17:52 Lyude: mtijanic: https://paste.centos.org/view/dea0488e
17:53 Lyude: note as well this is with the patch I had mentioned before - trying to get nvkm_gsp_mem_ctor to use vmalloc() which should work a lot better than dma_alloc_coherent()
17:56 Lyude: I also wouldn't be surprised if I just made a mistake in my patch somewhere, this is not usually the stuff I work on :P
18:08 fdobridge: <!​DodoNVK (she) 🇱🇹> This is the first time I'm hearing the FSP acronym
18:19 TimurTabi: dakr: can you test your "Use vmalloc() for GSP memory allocations" patch with CONFIG_DEBUG_SG enabled?
18:19 Lyude: TimurTabi: you mean me? :P
18:19 Lyude: and yes I can
18:19 TimurTabi: Sorry, yes.
18:20 TimurTabi: I was just talking to Danilo and spaced out.
18:20 Lyude: hehe np
18:20 TimurTabi: There's already a bug in Nouveau where it tries to call sg_init_one() on a buffer that isn't supported.
18:24 TimurTabi: Lyude: also, you might want to consider incorporating this change into your patch: https://lists.freedesktop.org/archives/nouveau/2024-February/044210.html
18:24 TimurTabi: As for your Falcon boot failure, please send me a quick email about it, and I'll look into it.
18:25 Lyude: gotcha. btw TimurTabi - I assume the general idea of my patch seems pretty sound?
18:25 TimurTabi: Well, I don't know. A lot of GSP-RM assumes physically contiguous memory.
18:26 Lyude: gotcha gotcha. honestly I figured that might be the case, I've been wondering if it just makes more sense to only use vmalloc for the suspend/resume memory
18:26 TimurTabi: I'm surprised you're having problems though, because non-GSP firmware images are pretty small.
18:29 Lyude: mhhh - yeah seems like I'm hitting an sg bug, but it's from the firmware code https://paste.centos.org/view/20dade07
18:32 TimurTabi: #ifdef CONFIG_DEBUG_SG
18:32 TimurTabi: BUG_ON(!virt_addr_valid(buf));
18:32 TimurTabi: #endif
18:32 TimurTabi: That's the bug I'm talking about.
18:33 TimurTabi: Where you're getting that virtual address from is not compatible with the s/g code.
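For context on the `BUG_ON` above: `virt_addr_valid()` is false for `vmalloc()` addresses, so `sg_init_one()` (which assumes a single physically contiguous lowmem buffer) trips it. The usual kernel pattern for building a scatter-gather table over a vmalloc'd buffer maps it page by page instead; a rough, kernel-only sketch of that idea (illustrative helper, not the actual nouveau patch):

```c
/* Kernel-side sketch; only compilable inside the kernel tree. */
#include <linux/vmalloc.h>
#include <linux/scatterlist.h>

static int sg_from_vmalloc(void *buf, size_t size, struct sg_table *sgt)
{
	unsigned int npages = PAGE_ALIGN(size) >> PAGE_SHIFT;
	struct scatterlist *sg;
	int ret, i;

	ret = sg_alloc_table(sgt, npages, GFP_KERNEL);
	if (ret)
		return ret;

	/* vmalloc memory is only virtually contiguous: look up the
	 * backing page of each 4K chunk and add it as its own entry. */
	for_each_sgtable_sg(sgt, sg, i) {
		struct page *page = vmalloc_to_page(buf + ((size_t)i << PAGE_SHIFT));

		sg_set_page(sg, page, PAGE_SIZE, 0);
	}
	return 0;
}
```

(For simplicity the last entry is given a full PAGE_SIZE length; a real version would trim it when `size` isn't page-aligned.)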
18:33 Lyude: I don't think any of my changes would have touched that codepath though, let me double check
18:33 TimurTabi: It's been broken since day one.
18:34 TimurTabi: I sent an email to Ben about it, and he said he would look into it after he submitted his cleanup patches, which he did yesterday.
18:35 TimurTabi: You might want to start an email thread on it (cc me and ben)
18:35 Lyude: gotcha, will cc airlied and dakr as well
18:38 Lyude: hm. let me try one thing first
18:44 fdobridge: <m​agic_rb.> I gotta say, that aside from me having an awesome driver available, im learning soo much about graphics in general. So thank you all for giving me the opportunity to read the backlogs and learn from your discussions 💜
18:44 fdobridge: <m​agic_rb.> I will keep mostly quietly lurking and observing
18:55 Lyude: hah
18:55 Lyude: oh that makes me proud of myself :)
18:56 Lyude: TimurTabi: unsure ben will need to look at that bug
18:56 Lyude: let me send you a patch to try
18:56 TimurTabi: sure, but I think Ben has more bandwidth now than I do.
18:56 Lyude: ah gotcha :p
18:56 Lyude: i'll just start the thread then
18:57 TimurTabi: Feel free to attach a patch, I just don't know who will try it first.
20:27 Lyude: thread started
20:31 airlied: Lyude: the other option is to keep the allocation around all the time
20:35 Lyude: airlied: eeeehhh. technically, but if the allocation is the ~300-400MB I see get freed after runtime resume that's a lot of memory to keep around
20:36 airlied: what is order 7?
20:36 airlied: my brain isn't awake enough
20:36 Lyude: I wish I knew :)
20:37 Lyude: I think it just means cache coherent allocation that's contiguous? but that is entirely a guess
20:38 Lyude: I think you were the one that understood what that meant last time
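On the "order 7" question: in the kernel's buddy allocator, an order-n allocation is 2^n physically contiguous pages, so with 4 KiB pages order 7 means 128 contiguous pages, i.e. 512 KiB — big enough to fail under fragmentation even when plenty of memory is free overall. As arithmetic (the 4 KiB page size is an assumption here; the real PAGE_SIZE is arch-dependent):

```c
#include <assert.h>

/* Size in bytes of a buddy-allocator allocation of a given order,
 * assuming the common 4 KiB page size. */
#define PAGE_SIZE_4K 4096UL

static unsigned long order_to_bytes(unsigned int order)
{
    return PAGE_SIZE_4K << order; /* 2^order contiguous pages */
}
```

So a failing order-7 request means the allocator couldn't find a free 512 KiB physically contiguous block, not that the system was out of memory.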
20:47 Lyude: btw airlied, dakr - going to start writing up a basic feature matrix for what we currently want in nouveau, since I've gotten asked for one at work a few times already. will be putting it on the nouveau wiki assuming I have access
20:48 Lyude: I think I probably can just go through and update the table we already have
21:53 fdobridge: <r​hed0x> is nvk supposed to work with renderdoc (with a D3D11 DXVK game)?
21:56 airlied: Lyude: so I don't think we can fix all the allocations to use that path, but maybe we can fix the one allocation that is large?
21:56 Lyude: airlied: ok - yeah that's what I was wondering tbh
21:56 airlied: I'm still a bit confused though, I probably need to read this code closer
21:57 fdobridge: <!​DodoNVK (she) 🇱🇹> I did get it working with Overwatch 2
21:58 airlied: Lyude: need to go look at the backtrace and identify which path is doing the order 7 again
21:58 fdobridge: <r​hed0x> didn't work with the genshin impact apitrace someone posted on GitHub though
21:58 fdobridge: <r​hed0x> I'll take another look tomorrow
21:58 airlied: a radix3_sg needing order 7 is definitely a bit weird
22:02 airlied: Lyude: can you reproduce it easily btw?
22:03 Lyude: airlied: it takes me a bit tbh, but it's usually pretty certain I'll hit it after a bit of uptime on my machine
22:03 Lyude: so, yes, but slowly :P
22:04 airlied: actually it shouldn't need reproduction
22:04 airlied: but u64 len = meta->gspFwWprEnd - meta->gspFwWprStart; just printk those values
22:05 Lyude: alright, gimme a bit to get back to you on that - I want to make sure I actually get this feature matrix done today
22:06 airlied: so we allocate a big chunk of memory in nvkm_gsp_sg, then we create a 3 level page table to it, it might be possible to use the page table levels more effectively
22:09 airlied: so yeah we can't do this generically in the place you are trying to fix
22:09 airlied: it needs to be a special path in the radix3 code
22:09 airlied: which looks like the nvkm_gsp_sg path
22:10 airlied: I'll reply to your email just for posterity :)
23:12 fdobridge: <g​fxstrand> @airlied So, uh... S8 seems to be working. 🤯
23:13 fdobridge: <g​fxstrand> `Pass: 805983, Fail: 2, Crash: 4, Skip: 1039508, Timeout: 3, Duration: 31:58, Remaining: 15:31`
23:22 fdobridge: <g​fxstrand> @airlied I think @mohamexiety is pretty close to having my modifiers branch ported to the Rust-based NIL. I think that means actually landing the damn thing is next on my ToDo list.
23:22 fdobridge: <g​fxstrand> I've got a Khronos meeting next week, though, so it'll probably be a couple weeks before I can go head-down on it.
23:24 fdobridge: <m​ohamexiety> should be tomorrow. I am just fighting with some errors from Rust abuse (that poor, poor compiler) at this point. (+ also rebasing the nvk changes on current main, but this doesn't really look like it'll be problematic)
23:29 fdobridge: <g​fxstrand> Cool. No pressure. Just ping me when you've got a rebase and I'll give it a skim and see if there's anything we need to tweak.
23:32 fdobridge: <g​fxstrand> Yup. S8 is working fine. No idea why it was giving us grief before. 🤷🏻‍♀️
23:32 fdobridge: <g​fxstrand> It's possible Turing needs a workaround of some sort but Ampere is fine?
23:32 fdobridge: <g​fxstrand> I've got to pop in the Turing anyway for something else so I'll test that quick.