03:56airlied[d]: notthatclippy[d]: okay I think I figured out the ASSERTs from GSP, we were setting the msg queue sequence, but not the rpc sequence number, I think setting those gets rid of the assert, still don't get rtx6000 to resume
05:40airlied[d]: _lyude[d]: https://lore.kernel.org/nouveau/20260119053701.181329-1-airlied@gmail.com/T/#u fix for the nocat asserts at least
08:25marysaka[d]: mhenning[d]: both patches you typed to disable compression will likely not be enough, as the large page codepath will still be hit whenever both the size and the VA are aligned to 64KiB. You need to enforce that `select_page_shift` returns PAGE_SHIFT too
11:05mohamexiety[d]: why not just disable it in userspace instead?
11:05mohamexiety[d]: older mesa can't trigger the larger page size codepaths anyway. newer mesa won't trigger it if it's disabled in mesa
11:05mohamexiety[d]: bonus points is it'll be easier to re-enable when/if it's fixed
11:15marysaka[d]: mohamexiety[d]: oh wait you are right I totally forgot that we were forcing VRAM with the new logic so we could disable it in userspace for now
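The alignment point raised above can be illustrated with a small C sketch. It assumes a selection rule of the kind marysaka describes, where a 64KiB page is chosen whenever both the VA and the size are 64KiB-aligned. It is not the kernel's `select_page_shift`; `pick_page_shift`, `allow_large`, and the two shift macros are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants for this sketch: 4KiB and 64KiB pages. */
#define SMALL_PAGE_SHIFT 12u
#define LARGE_PAGE_SHIFT 16u

/* True when v is a multiple of (1 << shift). */
static bool is_aligned(uint64_t v, unsigned shift)
{
    return (v & ((UINT64_C(1) << shift) - 1)) == 0;
}

/*
 * Toy page-size selection: a large page is picked only when it is
 * allowed AND both the virtual address and the size are 64KiB-aligned.
 * Passing allow_large = false models forcing the small page size.
 */
unsigned pick_page_shift(uint64_t va, uint64_t size, bool allow_large)
{
    assert(size != 0); /* a zero-sized mapping is a caller bug in this sketch */

    if (allow_large &&
        is_aligned(va, LARGE_PAGE_SHIFT) &&
        is_aligned(size, LARGE_PAGE_SHIFT))
        return LARGE_PAGE_SHIFT;

    return SMALL_PAGE_SHIFT;
}
```

Under this assumed rule, a 64KiB mapping at a 64KiB-aligned VA takes the large page path regardless of what alignment the caller asked for, which is why reducing an alignment parameter alone would not avoid it.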
11:21notthatclippy[d]: airlied[d]: I tracked down this change in >570.144. It tries to fix the issue of:
11:21notthatclippy[d]: > If Kernel RM initiates RPC#1 but times out and then sends RPC#2, GSP RM will still process both RPCs and send two separate responses.
11:21notthatclippy[d]: > The issue is that Kernel RM might incorrectly associate the first incoming response with RPC#2, when it's actually for the timed-out RPC#1
11:21notthatclippy[d]: but unfortunately it never considered the nouveau codebase, where that stuff wasn't set at all. I totally missed the change when it went in too, though I doubt I'd have remembered to check what nouveau does.
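The stale-response problem quoted above can be illustrated with a small C sketch. This is a toy model, not nouveau or OpenRM code: `struct rpc_client`, `rpc_send`, `rpc_timeout`, and `rpc_accept_response` are hypothetical names. Each request gets a sequence number, and a response is accepted only if it carries the number of the RPC currently being waited on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical client-side state; not a real nouveau/OpenRM structure. */
struct rpc_client {
    uint32_t next_seq;    /* sequence number for the next RPC sent */
    uint32_t pending_seq; /* sequence number of the RPC being waited on */
    bool waiting;         /* true while a response is outstanding */
};

/* Send an RPC: stamp it with a fresh sequence number and wait on it. */
uint32_t rpc_send(struct rpc_client *c)
{
    assert(c != NULL);
    c->pending_seq = c->next_seq++;
    c->waiting = true;
    return c->pending_seq;
}

/* Give up on the outstanding RPC; its response may still arrive later. */
void rpc_timeout(struct rpc_client *c)
{
    assert(c != NULL);
    c->waiting = false;
}

/*
 * Accept a response only if it matches the RPC currently being waited on.
 * A late response to a timed-out RPC has an older sequence number and is
 * rejected instead of being mistaken for the answer to the newer RPC.
 */
bool rpc_accept_response(struct rpc_client *c, uint32_t resp_seq)
{
    assert(c != NULL);
    if (!c->waiting || resp_seq != c->pending_seq)
        return false;
    c->waiting = false;
    return true;
}
```

In this model, sending RPC#1, timing out, and then sending RPC#2 means the late response to RPC#1 is dropped and only the response carrying RPC#2's number is accepted. The scheme only works if the sender actually fills in the sequence number, which is the piece that was not being set here.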
11:22notthatclippy[d]: your patch seems fine from a quick glance, but I will need a few more days before I can take a closer look
11:33notthatclippy[d]: Other than that, I don't see anything _particularly_ interesting that could cause the change in behavior. There is a lot of logging rework, so it might be generating more errors than before, and possibly something bails early on a new error. There's also a lot of message queue robustness stuff in place, but it would only get triggered if you're sending totally corrupted RPCs.
11:34notthatclippy[d]: One _possibly_ interesting bit is: <https://github.com/NVIDIA/open-gpu-kernel-modules/blob/570.211.01/src/nvidia/interface/nvrm_registry.h#L2636-L2647>
11:34notthatclippy[d]: This is mostly kernel-side, but that same regkey controls whether the SEC2 engine frees up some memory on suspend or not. Might be worth sending it down and seeing if the behavior changes
14:33marysaka[d]: airlied[d]: mohamexiety[d] mhenning[d] phomes_[d] I typed a simpler and less noisy reproducer for mesa/mesa#14610 if you are interested: This happens when switching a VA range from a 4K page BO to a 64K page BO.
14:33marysaka[d]: https://gitlab.freedesktop.org/marysaka/mesa/-/commit/60901decdc36d8fb309a025bbc852ccf953f50f8
14:44marysaka[d]: (branch is https://gitlab.freedesktop.org/marysaka/mesa/-/tree/mmu-16-stress-tests?ref_type=heads, added some new stuff to support blackwell as I was restricting up to ADA_A)
17:23mhenning[d]: marysaka[d]: Oh, does the kernel automatically use large pages in that case? I thought that reducing the align parameter would be enough.
17:24marysaka[d]: So I did some digging and I think I found the precise thing racing:
17:24marysaka[d]: [31706.703076] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00100: ref: 0000003ffdf00000 0000000000010000 12 16 PTEs
17:24marysaka[d]: [31706.703085] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:xxxxx:xxxxx: PDE write PGD
17:24marysaka[d]: [31706.703093] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:xxxxx: PDE write SPT
17:24marysaka[d]: [31706.703096] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00110: flush: 2
17:24marysaka[d]: [31706.703108] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00100: map: 0000003ffdf00000 0000000000010000 12 16 PTEs
17:24marysaka[d]: [31706.703112] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00110: flush: 4
17:24marysaka[d]: [31706.703158] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00100: unmap: 0000003ffdf00000 0000000000010000 12 16 PTEs
17:24marysaka[d]: [31706.703163] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00110: flush: 4
17:24marysaka[d]: [31706.703175] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00010: ref: 0000003ffdf00000 0000000000010000 16 1 PTEs
17:24marysaka[d]: [31706.703183] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:xxxxx: PDE write LPT
17:24marysaka[d]: [31706.703186] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00011: flush: 3
17:24marysaka[d]: [31706.703198] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00010: map: 0000003ffdf00000 0000000000010000 16 1 PTEs
17:24marysaka[d]: [31706.703202] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00011: flush: 4
17:24marysaka[d]: [31706.703212] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00100: unref: 0000003ffdf00000 0000000000010000 12 16 PTEs
17:24marysaka[d]: [31706.703215] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00100: LPTE 00010: U -> I 1 PTEs
17:24marysaka[d]: [31706.703219] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:xxxxx: SPT empty
17:24marysaka[d]: [31706.703220] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:xxxxx: PDE unmap SPT
17:24marysaka[d]: [31706.703224] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:xxxxx: flush: 3
17:24marysaka[d]: [31706.703233] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:xxxxx: PDE free SPT
17:24marysaka[d]: [31706.703546] nouveau 0000:09:00.0: gsp: mmu fault queued
17:24marysaka[d]: [31706.705709] nouveau 0000:09:00.0: gsp: rc engn:00000001 chid:3 gfid:0 level:2 type:31 scope:1 part:233 fault_addr:0000003ffdf00000 fault_type:00000002
17:24marysaka[d]: [31706.705715] nouveau 0000:09:00.0: fifo:c00000:0003:0003:[mmu_tests[405897]] errored - disabling channel
17:24marysaka[d]: [31706.705719] nouveau 0000:09:00.0: mmu_tests[405897]: channel 3 killed!
17:24marysaka[d]: sometimes the unref of the 4K BO mapping is performed after the ref and map of the 64K one, and the page directory gets destroyed at that moment
17:25marysaka[d]: meaning that the unbind/bind requests are somehow out of order on the kernel side
17:26marysaka[d]: mhenning[d]: So my bad here, what matters is where you place it, if we do not pin to VRAM we will be fine... I forgot that detail
17:26marysaka[d]: marysaka[d]: going to stop on this for today, planning on poking more at it tomorrow unless someone beats me to it 😄
18:21_lyude[d]: notthatclippy[d]: I wonder if this is exactly what's happening with us
21:56_lyude[d]: airlied[d]: one other fix I wonder if we should do even though it didn't fix anything on my side (because perhaps it's one part of a number of issues...?): when we suspend the GSP we actually are missing a flag that rm sets
21:58airlied[d]: Which one?
21:58_lyude[d]: gimme a moment to login to my desktop
21:58airlied[d]: I got a patch locally for the boostclocks thing
22:00_lyude[d]: https://paste.centos.org/view/f5664b1f airlied[d] ignore the outdated notes from me (they're wrong, openrm does pass this flag - even though it doesn't actually pass it symmetrically as far as I can tell). The other thing I noticed is that on some systems including mine, suspend entails putting the GPU into S4 and not S3 mode
22:01airlied[d]: https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/nouveau-wip-fixes?ref_type=heads is what I'm running locally
22:01airlied[d]: it has 3 patches, to bridge various missing bits, nothing helps it yet
22:01_lyude[d]: yeah. I've had a lot of that so far too 🙂
22:03airlied[d]: I dug through fbsr a bit more yesterday, might keep digging in there
22:03airlied[d]: one thing I see is a reserved space at start of VRAM, that I'm not sure we see on other GPUs
22:03airlied[d]: but I haven't checked that out yet
22:03_lyude[d]: airlied[d]: do you mean MemMgrReadMmuLock or whatever it is?
22:04airlied[d]: openrm has anteriorFbSize as a newer thing
22:05airlied[d]: was going to see how that affected things, need to add more debugging to openrm
22:05_lyude[d]: oh huh, don't think I spotted that one
22:05_lyude[d]: also - let me show you the S4 thing as well. I wouldn't be surprised if it's one part of the puzzle here too
22:12_lyude[d]: airlied[d]: Here it is. It looks like it's actually that OpenRM suspends the device at level 3 but then says it's resuming from level 4 if there's no ACPI help for resuming the GPU in `gpuResumeFromStandby_IMPL` in src/nvidia/src/kernel/gpu/gpu_suspend.c
22:13airlied[d]: ah yeah tried that as well, didn't help
22:14airlied[d]: @@ -1198,7 +1207,7 @@ r535_gsp_set_rmargs(struct nvkm_gsp *gsp, bool resume)
22:14airlied[d]: args->srInitArguments.flags = 0;
22:14airlied[d]: args->srInitArguments.bInPMTransition = 0;
22:14airlied[d]: } else {
22:14airlied[d]: - args->srInitArguments.oldLevel = NV2080_CTRL_GPU_SET_POWER_STATE_GPU_LEVEL_3;
22:14airlied[d]: + args->srInitArguments.oldLevel = 4;//NV2080_CTRL_GPU_SET_POWER_STATE_GPU_LEVEL_4;
22:14airlied[d]: args->srInitArguments.flags = 0;
22:14airlied[d]: args->srInitArguments.bInPMTransition = 1;
22:14airlied[d]: }
22:15_lyude[d]: airlied[d]: perhaps also with the missing flag in srInitArguments.flags that I mentioned as well?
22:15_lyude[d]: I wonder too if we should push patches for this even if it doesn't seem like it's changing anything…
22:15airlied[d]: I pushed another change I had locally to that branch
22:16airlied[d]: but I'll look at that flag as well
23:01_lyude[d]: btw - https://paste.centos.org/view/9b166a9d log from my machine, unfortunately it didn't seem to work on my end
23:06_lyude[d]: but on my machine, unless I'm missing something, it at least looks like nouveau thinks the gpu suspends
23:07_lyude[d]: though that's with 1.44
23:31airlied[d]: is that with my seq num patch, you still see asserts?
23:37airlied[d]: but I see the same thing, no complaints on suspend, then resume just never gets the INIT_DONE