02:17airlied[d]: notthatclippy[d]: https://people.freedesktop.org/~airlied/scratch/rtx6000-sr-fail.tar.xz this is 570.207 on rtx6000, can you decode it?
02:17airlied[d]: suspend succeeds, resume never gets a GSP message
08:39marysaka[d]: marysaka[d]: Did some runs this weekend with the original patch (the one changing the VM_BIND side to handle ref/unref in the job instead of cleanup) + v3 of the patch and it never reproduced after running for ~16h, so it must still be some ordering weirdness
08:54airlied[d]: yeah it's still something in that area, I went back to debugging rtx6000 s/r today as it's got an increased priority boost from internal testing
08:54airlied[d]: I think something in the ref/unref that is writing to the PTE entries might be racing with map/unmap
08:55airlied[d]: since ref/unref are locked with one lock and map/unmap with another one
09:02marysaka[d]: I don't think it's that on my side, because before that run I cherry-picked the patch that forces taking vmm->mutex.vmm on raw ref/unref/map/unmap and still got the issue (<https://gitlab.freedesktop.org/marysaka/linux/-/commit/9328b23d329b031a7ba1a03b20145d997dd62fa0>)
09:02marysaka[d]: but I could have missed a path who knows tbh
09:44notthatclippy[d]: airlied[d]: > suspend succeeds
09:44notthatclippy[d]: [citation needed]
09:44notthatclippy[d]: T:167126468640 1970-01-01 01:02:47 GPU0 ERROR GSP-RM: gsp_rpc.c(699): UNLOADING_GUEST_DRIVER failed: 0x1f
09:44notthatclippy[d]: T:167126477952 1970-01-01 01:02:47 GPU0 ERROR GSP-RM: gsp_rpc.c(1688): rpc_result 0x1f
09:44notthatclippy[d]: Looking at the rest right now, but GSP failed during suspend already. The failure signature is similar to what we saw earlier..
09:47airlied[d]: I wonder why that error doesn't propagate
09:48airlied[d]: Invalid argument to call
09:49notthatclippy[d]: This is all it complains about for suspend:
09:49notthatclippy[d]: ERROR GSP-RM: mem_utils.c(170): Assertion failed: pSurface->offset + size <= pSurface->pMemDesc->Size
09:49notthatclippy[d]: ERROR GSP-RM: mem_utils.c(625): Assertion failed: Invalid argument to call memmgrCheckSurfaceBounds(pDstInfo, size)
09:49notthatclippy[d]: ERROR GSP-RM: fbsr_physical.c(95): FBSR 7: Save failed for region #0
09:49notthatclippy[d]: ERROR GSP-RM: mem_mgr_regions_gsp.c(247): Check failed: Invalid argument to call fbsrExecuteSaveRestore_HAL(pGpu, pFbsr)
09:49notthatclippy[d]: ERROR GSP-RM: gpu.c(3852): Failed to unload engine with descriptor index: 0x39 and descriptor: 0x174e2100
09:49notthatclippy[d]: ERROR GSP-RM: gpu.c(3857): Assertion failed: 0
09:49notthatclippy[d]: ERROR GSP-RM: rmgsp.c(1030): Assertion failed: Invalid argument to call gpuStateUnload(pGpu, IS_GPU_GC6_STATE_ENTERING(pGpu) ? GPU_STATE_FLAGS_PRESERVING | GPU_STATE_FLAGS_PM_TRANSITION | GPU_STATE_FLAGS_GC6_TRANSITION : GPU_STATE_FLAGS_PRESERVING | GPU_STATE_FLAGS_PM_TRANSITION)
09:49notthatclippy[d]: ERROR GSP-RM: rmgsp.c(1062): RmUnloadGpuInternalForPM failed
09:49notthatclippy[d]: ERROR GSP-RM: gsp_rpc.c(699): UNLOADING_GUEST_DRIVER failed: 0x1f
09:49notthatclippy[d]: ERROR GSP-RM: gsp_rpc.c(1688): rpc_result 0x1f
09:49notthatclippy[d]: NOTICE GSP-RM: task_rm_v3.c(455): GSP TASK_RM SUSPEND.
09:49notthatclippy[d]: The line numbers won't match anything in openrm, but a few of those files you do have
09:50notthatclippy[d]: `0x174e2100` is just "MemorySystem", and it fails because of the FBSR save, which fails because some of the regions given to it are invalid.
09:51notthatclippy[d]: Which should have been set up during the initial boot sequence
09:51airlied[d]: oh that fbsr info might be useful to figure it out
09:57notthatclippy[d]: Can you check if there's a difference in the payload of NV2080_CTRL_CMD_INTERNAL_FBSR_INIT between nouveau and openrm?
09:59airlied[d]: seems the same, it just contains a memory handle and a pointer to sr arguments
10:00airlied[d]: the memory handle is from ALLOC_MEMORY and seems to be the same size as openrm uses
10:01notthatclippy[d]: Do you maybe have openrm of the same version around, to run nvidia-bug-report.sh on?
10:01notthatclippy[d]: Ideally with `RmMsg=":"`
10:01airlied[d]: yes, I've been swapping between openrm and nouveau often
10:02airlied[d]: I'll dig it up a bit more tomorrow, gotta get kids to bed
10:04notthatclippy[d]: This is a really good exercise, as it tells me all the ways our current GSP logging is lacking. Now, I don't think we'll actually be able to improve it in 570 for nouveau to use, but maybe I can make an out-of-band gsp.bin that's more developer-friendly. And, definitely something to consider for Nova.
10:21airlied[d]: I do wonder why I never see the invalid value propagate from unload failures, though maybe it's one of those paths where we always report success to avoid failing suspend
10:26notthatclippy[d]: I don't think GSP is very good at recovering from a failed suspend in general. I'd be surprised if it didn't leak like crazy on every failed attempt.
10:26airlied[d]: Nouveau definitely has a WARN checking it
10:29notthatclippy[d]: Ah, because GSP has UNLOADING_GUEST_DRIVER marked as not returning any data.
10:29notthatclippy[d]: You're probably just reading the same 0 value for the status that you wrote on the way in
10:29notthatclippy[d]: That's also a bug :)
10:29notthatclippy[d]: I mean, GSP bug
13:12karolherbst[d]: thoughts on adding `CBuf::Global`? Not sure how I want to model ULDC/LDCU operating on global VA addresses..
13:17karolherbst[d]: The other interesting part is that it has a 32 bit constant offset...
13:24karolherbst[d]: mhh `CBufRef::offset` is just 16 bits mhh...
13:41marysaka[d]: airlied[d]: I tracked down the VMM pointer during channel error to filter down my logs and this is the pattern of the address faulting right before the fault (fault is received at 18980.807921)
13:41marysaka[d]: ```
13:41marysaka[d]: [18980.800138] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00100: 0000000085506c0e: unmap: 0000003ffdf00000 0000000000010000 12 16 PTEs
13:41marysaka[d]: [18980.802092] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00010: 0000000085506c0e: ref: 0000003ffdf00000 0000000000010000 16 1 PTEs
13:41marysaka[d]: [18980.802103] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00010: 0000000085506c0e: map: 0000003ffdf00000 0000000000010000 16 1 PTEs
13:41marysaka[d]: [18980.804646] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00010: 0000000085506c0e: unref: 0000003ffdf00000 0000000000010000 16 1 PTEs
13:41marysaka[d]: [18980.804855] nouveau 0000:09:00.0: mmu: user: 00000:00000:001ff:000ef:00100: 0000000085506c0e: unref: 0000003ffdf00000 0000000000010000 12 16 PTEs
13:41marysaka[d]: ```
13:49marysaka[d]: So yeah, the unref from the first unmap is so delayed that userspace had time to map and unmap again... and the unref of the second mapping is done before the unref of the first one
14:00marysaka[d]: (updated logs with a bit more of before because it wasn't the full sequence)
14:22karolherbst[d]: I should probably add an OpLdgc instruction to model LDC with a global VA...
14:25karolherbst[d]: it also comes with an input predicate that forces the result to 0 (for bounds checks)
16:43mhenning[d]: karolherbst[d]: Yeah, that might work
16:44mhenning[d]: How do these instructions work exactly? You just give them a global address rather than e.g. a bindless cbuf handle?
18:15karolherbst[d]: mhenning[d]: yeah, I suspect the opcode is different, but yeah.. it's the global address as the VA and a 32 bit immediate as the constant offset
18:15karolherbst[d]: and an input predicate that forces the result to 0 when true
18:16karolherbst[d]: anyway, the annoying part is that it can't do loads as large as the max ldg size, but I doubt that will matter
18:17karolherbst[d]: will only be used for fetching global addresses (most of the time I guess) and those should end up in UGPRs anyway for the UGPR + GPR encoding
18:17karolherbst[d]: the interesting part is that the global VA is always 64 bits
18:18mhenning[d]: Is it weaker than ldg.constant? Or are they interchangeable?
18:18karolherbst[d]: unknown
18:19karolherbst[d]: might use the constant buffer cache instead
18:19karolherbst[d]: it is always returning uniform values, so that wouldn't surprise me
18:20karolherbst[d]: (the inputs are also uniform)
18:20mhenning[d]: where are you planning to use it then?
18:20karolherbst[d]: I was looking into the UGPR + GPR forms of load/store/atomics and noticed that I got a bunch of ldcs into GPRs + R2UR as a result
18:21karolherbst[d]: so I get rid of the non uniform add, but see r2ur instead
18:21karolherbst[d]: so I think for now from a nir perspective it's mostly for `nir_intrinsic_load_global_constant` when the use is uniform
18:22karolherbst[d]: there might be more use cases
18:22karolherbst[d]: but probably not
18:22karolherbst[d]: ehh not ldcs, ldgs
18:24karolherbst[d]: I'd probably use ULDC/LDCU inside `nak_nir_lower_load_store` when I can determine that it's going to remain in a UGPR
18:24karolherbst[d]: and use ldg if not
18:25mhenning[d]: Are you sure it's valid to use for nir_intrinsic_load_global_constant? I'm worried if the memory ordering might be too weak
18:25karolherbst[d]: load_global_constant is only supposed to be used on constant memory, no?
18:26karolherbst[d]: but I've seen nvidia using it for vulkan shaders where we currently use LDG for
18:26karolherbst[d]: _maybe_ we should only use it when loading from descriptors, dunno
18:27mhenning[d]: We can infer load_global_constant if the variable is never written in the current shader. But it can still be written in other shaders
18:27karolherbst[d]: right.. maybe I should verify which cache is used there
18:28marysaka[d]: marysaka[d]: airlied[d] I know why it breaks badly with your patch: NVKM_VMM_PTE_BIG_VALID is being unset in this special combo:
18:28marysaka[d]: ```
18:28marysaka[d]: [first mapping (LPT, set NVKM_VMM_PTE_BIG_VALID)]
18:28marysaka[d]: [35135.416559] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00010: ref: 0000003ffdf00000 0000000000010000 16 1 PTEs
18:28marysaka[d]: [35135.416573] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00010: map: 0000003ffdf00000 0000000000010000 16 1 PTEs
18:28marysaka[d]: [first mapping unmap]
18:28marysaka[d]: [35135.481170] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00010: unmap: 0000003ffdf00000 0000000000010000 16 1 PTEs
18:28marysaka[d]: [missing unref]
18:28marysaka[d]: [second mapping (SPT)]
18:28marysaka[d]: [35135.481825] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00100: ref: 0000003ffdf00000 0000000000010000 12 16 PTEs
18:28marysaka[d]: [35135.481836] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00100: map: 0000003ffdf00000 0000000000010000 12 16 PTEs
18:28marysaka[d]: [second mapping unmap]
18:28marysaka[d]: [35135.488485] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00100: unmap: 0000003ffdf00000 0000000000010000 12 16 PTEs
18:28marysaka[d]: [missing unref]
18:28marysaka[d]: [third mapping (LPT, set NVKM_VMM_PTE_BIG_VALID)]
18:28marysaka[d]: [35135.489050] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00010: ref: 0000003ffdf00000 0000000000010000 16 1 PTEs
18:28marysaka[d]: [35135.489063] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00010: map: 0000003ffdf00000 0000000000010000 16 1 PTEs
18:28marysaka[d]: [unref of first mapping (clears NVKM_VMM_PTE_BIG_VALID)]
18:28marysaka[d]: [35135.490094] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00010: unref: 0000003ffdf00000 0000000000010000 16 1 PTEs
18:28marysaka[d]: [unref of second mapping, PTE count reaches 0, NVKM_VMM_PTE_BIG_VALID is not set, marking the page as invalid]
18:28marysaka[d]: [35135.490199] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00100: unref: 0000003ffdf00000 0000000000010000 12 16 PTEs
18:28marysaka[d]: [35135.490209] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00100: LPTE 00010: U -> I 1 PTEs
18:28marysaka[d]: [35135.490816] nouveau 0000:09:00.0: gsp: mmu fault queued
18:28marysaka[d]: [third mapping unmap + unref (process death/cleanup)]
18:28marysaka[d]: [35135.546041] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00010: unmap: 0000003ffdf00000 0000000000010000 16 1 PTEs
18:28marysaka[d]: [35135.546060] nouveau 0000:09:00.0: mmu: 00000000232c3a87: user: 00000:00000:001ff:000ef:00010: unref: 0000003ffdf00000 0000000000010000 16 1 PTEs
18:28marysaka[d]: ```
18:28marysaka[d]: We should ref count big pages too as a result to fix it
18:28mhenning[d]: Yeah, if you have examples where nvidia upgrades that kind of read to a ULDC that's probably also enough
18:28karolherbst[d]: to be fair, I only saw them using it to fetch buffer addresses
18:28karolherbst[d]: as part of descriptors
18:28karolherbst[d]: but I also haven't looked further
18:29karolherbst[d]: I have the same use case there anyway...
18:29karolherbst[d]: I think at least...
18:29karolherbst[d]: yeah..
18:29mhenning[d]: Yeah, I'm not too worried about us using it to fetch descriptors, I'm mostly just worried about using it for general nir_intrinsic_load_global_constant
18:30karolherbst[d]: it's fetching the address of an ssbo that's just stored in the descriptor, and it's not bindless ubo, because it's not an ubo
18:33karolherbst[d]: do we clear constant buffer caches between shader invocations? Though I guess within a 3D pipeline that could cause issues if a stage writes to the memory and it's cached in the constant cache.. mhh dunno
18:37mhenning[d]: I think it automatically invalidates if we modify one of the bound constant buffers. I don't really understand how it interacts with bindless cbufs or uldc
19:54karolherbst[d]: yeah, me neither
21:01marysaka[d]: marysaka[d]: If I do something simple like this on top of the v3 patch, I don't get any faults anymore so far (it's been running for 3h); will let it run overnight and see
21:01marysaka[d]: ```diff
21:01marysaka[d]: diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
21:01marysaka[d]: index 0fc7da02ef69..48f87a6b81ec 100644
21:01marysaka[d]: --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
21:01marysaka[d]: +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
21:01marysaka[d]: @@ -285,6 +285,7 @@ nvkm_vmm_unref_ptes(struct nvkm_vmm_iter *it, bool pfn, u32 ptei, u32 ptes)
21:01marysaka[d]: if (desc->type == LPT && (pgt->refs[0] || pgt->refs[1])) {
21:01marysaka[d]: for (u32 lpti = ptei; ptes; lpti++) {
21:01marysaka[d]: + pgt->pte[lpti]--;
21:01marysaka[d]: pgt->pte[lpti] &= ~NVKM_VMM_PTE_BIG_VALID;
21:01marysaka[d]: ptes--;
21:01marysaka[d]: }
21:01marysaka[d]: @@ -390,6 +391,7 @@ nvkm_vmm_ref_ptes(struct nvkm_vmm_iter *it, bool pfn, u32 ptei, u32 ptes)
21:01marysaka[d]: for (u32 lpti = ptei; ptes; lpti++) {
21:01marysaka[d]: pgt->pte[lpti] &= ~NVKM_VMM_PTE_VALID;
21:01marysaka[d]: pgt->pte[lpti] |= NVKM_VMM_PTE_BIG_VALID;
21:01marysaka[d]: + pgt->pte[lpti]++;
21:01marysaka[d]: ptes--;
21:01marysaka[d]: }
21:01marysaka[d]: }
21:01marysaka[d]: ```
21:03airlied[d]: I need to caffeinate myself a bit, but I think refcounting is right, it's just that this seems like a bad way to do it. I was thinking of two 16-bit refcounts
21:13marysaka[d]: yeah two counters would do the job too
21:22airlied[d]: I'm just not sure I want to shoehorn that into the current 32-bit value, might make a struct
21:31marysaka[d]: yeah, I'm not too sure what form it should take, especially with the flags around. Should we have valid and sparse flags for both counters? Have one carry them?
21:33marysaka[d]: will let my dirty patch run for the night just to see if I get another kind of MMU fault, maybe more are hidden
22:04airlied[d]: marysaka[d]: posted a v4 patch, totally untested here though, will play around a bit more later today
22:05marysaka[d]: will put that on the test bench quickly and let it run, thanks a lot