13:50 Ristovski: agd5f: How come current CG mask reports that AMD_CG_SUPPORT_JPEG_MGCG (among others) are disabled, when it's enabled in soc15.c for Renoir APUs (GC 9.3.0)? Is there something I am missing?
13:50 Ristovski: These are all the discrepancies: AMD_CG_SUPPORT_MC_MGCG, AMD_CG_SUPPORT_MC_LS, AMD_CG_SUPPORT_SDMA_MGCG, AMD_CG_SUPPORT_VCN_MGCG, AMD_CG_SUPPORT_JPEG_MGCG, AMD_CG_SUPPORT_IH_CG
13:51 Ristovski: xref: https://elixir.bootlin.com/linux/v6.7.9/source/drivers/gpu/drm/amd/amdgpu/soc15.c#L1114
14:17 agd5f: Ristovski, looks like VCN doesn't have a get_clockgating_state() callback to query the current state
14:21 Ristovski: Oh, never occurred to me to check if the get_* funcs are used
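Note: the effect agd5f describes can be pictured with a small model. This is only a sketch in Python, not the kernel code: the `IPBlock` class, the `current_cg_mask` helper and the flag bit values are all hypothetical and chosen for illustration. It assumes the reported mask is built by OR-ing whatever each IP block's get_clockgating_state() callback returns, so a block without the callback contributes nothing even if its clock gating is enabled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical bit values for illustration only; not the kernel's definitions.
CG_GFX_MGCG = 1 << 0
CG_VCN_MGCG = 1 << 1
CG_JPEG_MGCG = 1 << 2


@dataclass
class IPBlock:
    """Toy stand-in for an IP block with an optional state-query callback."""
    name: str
    get_clockgating_state: Optional[Callable[[], int]] = None


def current_cg_mask(blocks: List[IPBlock]) -> int:
    """OR together the flags reported by every block that has a callback."""
    mask = 0
    for block in blocks:
        if block.get_clockgating_state is None:
            # No callback: this block's flags can never show up as enabled.
            continue
        mask |= block.get_clockgating_state()
    return mask


blocks = [
    IPBlock("gfx", lambda: CG_GFX_MGCG),  # reports its own state
    IPBlock("vcn"),                       # no callback, stays silent
]

supported = CG_GFX_MGCG | CG_VCN_MGCG | CG_JPEG_MGCG
reported = current_cg_mask(blocks)
# Flags that are supported but never reported back as enabled.
missing = supported & ~reported
```

In this model `missing` holds the VCN and JPEG bits: they look "disabled" only because nothing reports them, which matches the discrepancy asked about above.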
15:33 Ristovski: agd5f: I also see power gating is "/* todo */" on soc15. Do you have an estimate of the difference in power usage with PG+CG enabled for VCN/SDMA/MC?
15:34 agd5f: Ristovski, all of the clock and powergating features are implemented
15:35 agd5f: there aren't any pg features specific to soc15.c
15:35 Ristovski: Oh, I assumed there are since soc15_common_set_powergating_state is "todo" and doesn't do anything
15:36 agd5f: yeah, that can be removed
15:37 Ristovski: Hm, then I assume just like VCN that MC/SDMA/JPEG/IH are also missing the get_clockgating_state() funcs? Or what could be causing them to return as disabled
15:38 Ristovski: hm, perhaps I could utilize umr to read the state regs to confirm
16:18 Ristovski: bah, umr is missing a bunch of regs for gfx90c/green_sardine(?)
16:38 Ristovski: agd5f: re mesa#10794 of mine that you moved out from drm/amd, is there any way I can debug this to make sure it's mesa to be blamed? fwiw I've hit the same page fault several months ago when using ROCm compute
16:39 Ristovski: ping mareko ^
16:41 agd5f: Ristovski, the GPU page fault is coming from the userspace app. In most cases that is a bug in userspace. E.g., bad alignment somewhere
16:46 Ristovski: agd5f: What are IH and UTCL2? Tried my best but couldn't find them in any terminology listing
16:47 Ristovski: IH == interrupt handler I would assume?
16:49 agd5f: IH is interrupt handler. UTCL2 is the gfx L2 cache
16:49 agd5f: basically a page fault in the shader
16:54 mareko: how do you get a page fault from a memcpy shader with bounds checking though
16:57 mareko: the shader can't access outside buffer bounds because the hw won't let it, so the page fault is from a valid address
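Note: mareko's argument can be sketched as a toy model of a bounds-checked buffer load. This is an illustration under assumptions, not the hardware's or Mesa's actual behaviour: the `bounded_load` function and its `num_records` parameter are hypothetical, and returning zeros for an out-of-range read is an assumption made for the sketch.

```python
def bounded_load(memory: bytes, num_records: int, offset: int, size: int = 4) -> bytes:
    """Read `size` bytes at `offset`, limited by the descriptor size `num_records`.

    Assumption for this sketch: an out-of-range read returns zeros instead of
    touching memory, so it can never reach an unmapped address.
    """
    if offset < 0 or offset + size > num_records:
        return bytes(size)
    return memory[offset:offset + size]


# Backing memory is larger than the 8 bytes the descriptor exposes.
memory = bytes(range(16))

in_bounds = bounded_load(memory, num_records=8, offset=4)      # bytes 4..7
out_of_bounds = bounded_load(memory, num_records=8, offset=6)  # crosses the limit
```

With a check like this in place, any address that actually reaches memory lies inside the descriptor's range, which is why a page fault would point at an address the shader was allowed to use.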
17:08 Ristovski: mareko: Any debugging steps/ideas of how I can narrow this down further?
17:47 mareko: Ristovski: how random is it?
17:47 Ristovski: mareko: I can trigger it every time
17:47 Ristovski: Always on the same CS x2 step under 4096K
17:47 Ristovski: VRAM->VRAM, that is
17:47 mareko: Ristovski: can you test Mesa before 6584088cd5e ?
17:50 Ristovski: Will do. I see it uses ACCESS_COHERENT, fwiw I was able to hit a nearly identical page fault while messing with AMD_pinned_memory in OpenGL (can't recall what I did wrong to trigger it, I remember issuing a memory barrier followed by a glFinish in order to access the results of a mapped SSBO after running a compute shader)
18:03 agd5f: TCP is reporting a read to an unmapped page at 0x0000800101800000.
18:14 Ristovski: mareko: The bogus speeds are gone fwiw, let's see if it crashes
18:21 Ristovski: mareko: No crash so far at all
18:22 Ristovski: So indeed seems like NIR rewrite introduced this bug along with the bogus speeds reported
18:25 Ristovski: mareko: Output: https://paste.debian.net/plain/1310325
18:32 mareko: Ristovski: can you test that it hangs with exactly commit 6584088cd5e ?
18:33 Ristovski: will do
18:42 Ristovski: mareko: Yep, that commit is the culprit - the bogus speeds are back as well. This time the last logged value is for 2048kB https://paste.debian.net/plain/1310329
18:43 Ristovski: dmesg: https://paste.debian.net/plain/1310330
18:48 Ristovski: agd5f: what is TCP in this context btw?
18:50 agd5f: Ristovski, shader's path to memory. It used to be called Texture Cache long ago, but is now more like shader cache interface
18:50 Ristovski: I see
19:19 mareko: it's the vector L0 cache
19:47 mareko: Ristovski: does this commit fix the hang? https://gitlab.freedesktop.org/mareko/mesa/-/commit/9ff8bc3c13123c52d156bec0e33a57310e261974
19:51 Ristovski: will test in a bit
19:52 mareko: it should also fix the numbers
19:55 Quibus: Any more people having problems with the machine not getting out of (S3) sleep? I.e., the video signal doesn't return
19:55 Quibus: See https://gitlab.freedesktop.org/drm/amd/-/issues/3048
19:56 mareko: it was indeed a sneaky Mesa bug that allowed any GL app to bypass SSBO bounds checking (i.e. ignore the size in the descriptor)
19:56 Quibus: But one would expect this not to be some special case (there's nothing special to my setup)
20:00 Ristovski: mareko: Yep, both fixed!
20:02 Ristovski: Also explains how I managed to trigger it myself with my OpenGL compute prog :P
20:26 mareko: Ristovski: the out-of-bounds bug will be fixed by a different commit
20:26 mareko: the test just doesn't access out of bounds anymore
20:26 mareko: thanks for the report
20:27 Ristovski: No problem, thanks for looking into this