13:50Ristovski: agd5f: How come the current CG mask reports that AMD_CG_SUPPORT_JPEG_MGCG (among others) is disabled, when it's enabled in soc15.c for Renoir APUs (GC 9.3.0)? Is there something I am missing?
13:50Ristovski: These are all the discrepancies: AMD_CG_SUPPORT_MC_MGCG, AMD_CG_SUPPORT_MC_LS, AMD_CG_SUPPORT_SDMA_MGCG, AMD_CG_SUPPORT_VCN_MGCG, AMD_CG_SUPPORT_JPEG_MGCG, AMD_CG_SUPPORT_IH_CG
13:51Ristovski: xref: https://elixir.bootlin.com/linux/v6.7.9/source/drivers/gpu/drm/amd/amdgpu/soc15.c#L1114
14:17agd5f: Ristovski, looks like VCN doesn't have a get_clockgating_state() callback to query the current state
14:21Ristovski: Oh, never occurred to me to check if the get_* funcs are used
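The mismatch described above can be illustrated with a toy model: if the "current" mask is built by OR-ing in whatever each IP block reports through its query callback, a block without that callback contributes nothing and its features look disabled even when they are enabled. This is only a sketch under that assumption; the flag values, the block list, and the function names below are hypothetical and are not the kernel's real definitions.

```python
# Toy model of how a "current clockgating mask" could be aggregated.
# All names and bit values are hypothetical, for illustration only.
CG_GFX_MGCG = 1 << 0
CG_VCN_MGCG = 1 << 1


def gfx_get_clockgating_state(flags: int) -> int:
    """Pretend GFX reads its registers and reports MGCG as active."""
    return flags | CG_GFX_MGCG


# (name, query callback). VCN has no callback in this model, mirroring
# the situation described in the log.
IP_BLOCKS = [
    ("gfx", gfx_get_clockgating_state),
    ("vcn", None),
]


def current_cg_mask(blocks) -> int:
    """OR together the state reported by every block that can be queried."""
    flags = 0
    for _name, callback in blocks:
        if callback is not None:
            flags = callback(flags)
    return flags


def discrepancies(supported: int, current: int) -> int:
    """Bits that are marked supported but do not show up as currently active."""
    return supported & ~current
```

With `supported = CG_GFX_MGCG | CG_VCN_MGCG`, `discrepancies(supported, current_cg_mask(IP_BLOCKS))` leaves only the VCN bit set, which is the kind of false "disabled" report being discussed.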
15:33Ristovski: agd5f: I also see power gating is "/* todo */" on soc15. Do you have an estimate of the difference in power usage with PG+CG enabled for VCN/SDMA/MC?
15:34agd5f: Ristovski, all of the clock and powergating features are implemented
15:35agd5f: there aren't any pg features specific to soc15.c
15:35Ristovski: Oh, I assumed there are since soc15_common_set_powergating_state is "todo" and doesn't do anything
15:36agd5f: yeah, that can be removed
15:37Ristovski: Hm, then I assume just like VCN that MC/SDMA/JPEG/IH are also missing the get_clockgating_state() funcs? Or what could be causing them to return as disabled?
15:38Ristovski: hm, perhaps I could utilize umr to read the state regs to confirm
16:18Ristovski: bah, umr is missing a bunch of regs for gfx90c/green_sardine(?)
16:38Ristovski: agd5f: re mesa#10794 of mine that you moved out from drm/amd, is there any way I can debug this to make sure it's mesa that is to blame? fwiw I hit the same page fault several months ago when using ROCm compute
16:39Ristovski: ping mareko ^
16:41agd5f: Ristovski, the GPU page fault is coming from the userspace app. In most cases that is a bug in userspace. E.g., bad alignment somewhere
16:46Ristovski: agd5f: What are IH and UTCL2? Tried my best but couldn't find them in any terminology listing
16:47Ristovski: IH == interrupt handler I would assume?
16:49agd5f: IH is interrupt handler. UTCL2 is the gfx L2 cache
16:49agd5f: basically a page fault in the shader
16:54mareko: how do you get a page fault from a memcpy shader with bounds checking though
16:57mareko: the shader can't access outside buffer bounds because the hw won't let it, so the page fault is from a valid address
17:08Ristovski: mareko: Any debugging steps/ideas of how I can narrow this down further?
17:47mareko: Ristovski: how random is it?
17:47Ristovski: mareko: I can trigger it every time
17:47Ristovski: Always on the same CS x2 step under 4096K
17:47Ristovski: VRAM->VRAM, that is
17:47mareko: Ristovski: can you test Mesa before 6584088cd5e ?
17:50Ristovski: Will do. I see it uses ACCESS_COHERENT, fwiw I was able to hit a nearly identical page fault while messing with AMD_pinned_memory in OpenGL (can't recall what I did wrong to trigger it, I remember issuing a memory barrier followed by a glFinish in order to access the results of a mapped SSBO after running a compute shader)
18:03agd5f: TCP is reporting a read to an unmapped page at 0x0000800101800000.
18:14Ristovski: mareko: The bogus speeds are gone fwiw, let's see if it crashes
18:21Ristovski: mareko: No crash so far at all
18:22Ristovski: So it indeed seems like the NIR rewrite introduced this bug along with the bogus speeds reported
18:25Ristovski: mareko: Output: https://paste.debian.net/plain/1310325
18:32mareko: Ristovski: can you test that it hangs with exactly commit 6584088cd5e ?
18:33Ristovski: will do
18:42Ristovski: mareko: Yep, that commit is the culprit - the bogus speeds are back as well. This time the last logged value is for 2048kB https://paste.debian.net/plain/1310329
18:43Ristovski: dmesg: https://paste.debian.net/plain/1310330
18:48Ristovski: agd5f: what is TCP in this context btw?
18:50agd5f: Ristovski, shader's path to memory. It used to be called Texture Cache long ago, but is now more like shader cache interface
18:50Ristovski: I see
19:19mareko: it's the vector L0 cache
19:47mareko: Ristovski: does this commit fix the hang? https://gitlab.freedesktop.org/mareko/mesa/-/commit/9ff8bc3c13123c52d156bec0e33a57310e261974
19:51Ristovski: will test in a bit
19:52mareko: it should also fix the numbers
19:55Quibus: Any more people having problems with the machine not getting out of (S3) sleep? I.e., the video signal doesn't return
19:55Quibus: See https://gitlab.freedesktop.org/drm/amd/-/issues/3048
19:56mareko: it was indeed a sneaky Mesa bug that allowed any GL app to bypass SSBO bounds checking (i.e. ignore the size in the descriptor)
19:56Quibus: But one would expect this not to be some special case (there's nothing special to my setup)
20:00Ristovski: mareko: Yep, both fixed!
20:02Ristovski: Also explains how I managed to trigger it myself with my OpenGL compute prog :P
20:26mareko: Ristovski: the out-of-bounds bug will be fixed by a different commit
20:26mareko: the test just doesn't access out of bounds anymore
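The difference between a bounds-checked buffer access and one that ignores the size in the descriptor can be sketched with a toy model. Everything here is an assumption made for illustration: the `ToyGpuMemory` and `BufferDescriptor` classes, the 4096-byte page size, and the choice to return 0 for an out-of-range checked load are hypothetical and do not describe Mesa's or the hardware's actual implementation.

```python
# Toy model: checked vs. unchecked buffer loads. Hypothetical names throughout.
PAGE_SIZE = 4096


class PageFault(Exception):
    """Raised when a read touches a page that is not mapped."""


class ToyGpuMemory:
    def __init__(self):
        self.mapped_pages = set()

    def map_range(self, base: int, size: int) -> None:
        """Mark every page overlapping [base, base + size) as mapped."""
        first = base // PAGE_SIZE
        last = (base + size - 1) // PAGE_SIZE
        self.mapped_pages.update(range(first, last + 1))

    def read_byte(self, addr: int) -> int:
        if addr // PAGE_SIZE not in self.mapped_pages:
            raise PageFault(hex(addr))
        return 0xAB  # contents are irrelevant for this sketch


class BufferDescriptor:
    def __init__(self, base: int, size: int):
        self.base = base
        self.size = size


def checked_load(mem: ToyGpuMemory, desc: BufferDescriptor, offset: int) -> int:
    """Honor the descriptor size: out-of-range reads never reach memory."""
    if not 0 <= offset < desc.size:
        return 0  # assumed out-of-bounds result in this model
    return mem.read_byte(desc.base + offset)


def unchecked_load(mem: ToyGpuMemory, desc: BufferDescriptor, offset: int) -> int:
    """Ignore the descriptor size: any offset is turned into an address."""
    return mem.read_byte(desc.base + offset)
```

In this model, a checked load past the end of the buffer quietly returns 0, while the same offset through `unchecked_load` raises `PageFault` once it lands on an unmapped page, which is the shape of the failure reported in the log.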
20:26mareko: thanks for the report
20:27Ristovski: No problem, thanks for looking into this