00:48 kode54: agd5f: oh, I see you updated the branch today
07:12 vedranm: agd5f: curious, do you have any idea why Aqua Vanjaram (and older MI cards) have VCN hardware? do users of accelerators actually require them to have video encoding/decoding functionality for some applications?
07:31 MrCooper: kode54: Xwayland can't recover from a GPU reset yet, so the compositor would need to restart it for now
07:41 pixelcluster: MrCooper: the idea with per-queue resets is that Xwayland doesn't need to recover
07:42 MrCooper: if only it was that simple
07:42 pixelcluster: well, it is as long as the patchset works :))
07:42 MrCooper: what patchset?
07:43 pixelcluster: https://lists.freedesktop.org/archives/amd-gfx/2024-July/111215.html
07:45 pixelcluster: when an app triggers a gpu hang with those patches, only the app that was executing the hanging work should get killed
07:45 pixelcluster: which means your desktop etc. stay alive
07:45 MrCooper: "If this fails, we fall back to a full adapter reset", and it might still affect other processes using the same queue (which in the case of GFX is basically all GUI apps using HW acceleration)
07:46 pixelcluster: sure a full adapter reset would kill everything
07:46 MrCooper: Xwayland will need to learn how to recover from a reset, it just can't yet
07:46 pixelcluster: but it's a fallback and not the usually intended path
07:47 pixelcluster: pretty much all GPU resets that can be caused by apps should be recoverable with the queue reset
07:47 MrCooper: I honestly hope you're right
07:48 pixelcluster: I mean I tested it, it works like this for me
07:48 MrCooper: cool
07:48 pixelcluster: that's why I'm so excited about the whole thing, it's a complete game changer for development
07:59 vedranm: pixelcluster: pardon my lack of knowledge, GPU reset is handled internally by the kernel module and does not involve module unload/load?
08:00 pixelcluster: vedranm: correct
08:02 vedranm: pixelcluster: thx
13:58 agd5f: pixelcluster, all work in the queue is lost, so it could be multiple apps. You could limit one app per queue, but that would impact performance
13:59 agd5f: we need to finish user mode queues for gfx to really limit things to one app
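The point agd5f makes about shared queues can be illustrated with a toy model. This is only a sketch: `affected_apps` and the queue and app names are hypothetical, and nothing here reflects the actual amdgpu data structures. It assumes a queue reset discards every job pending on the hung queue, so every app with work on that queue is affected.

```python
def affected_apps(queues, hung_queue):
    """Return the set of apps that lose work when `hung_queue` is reset.

    `queues` maps a queue name to the list of apps with pending jobs on it.
    Assumption: a queue reset drops all work in that queue and nothing else.
    """
    return set(queues[hung_queue])


# Shared GFX queue: several apps submit to the same queue (hypothetical names).
shared = {"gfx0": ["compositor", "xwayland", "game"]}

# Hypothetical one-app-per-queue layout, the isolation user mode queues aim for.
per_app = {
    "q_compositor": ["compositor"],
    "q_xwayland": ["xwayland"],
    "q_game": ["game"],
}

shared_loss = affected_apps(shared, "gfx0")
isolated_loss = affected_apps(per_app, "q_game")
```

With the shared layout a hang in the game's work takes the compositor and Xwayland down with it; with one app per queue only the game is affected.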
14:00 agd5f: vedranm, for feeding model training. Video and image decode to feed training models
14:32 pixelcluster: agd5f: right. still, very exciting progress
17:02 kode54: MrCooper: the problem is, kwin is stuck waiting on something on the already crashed xwayland
17:03 MrCooper: what something?
17:04 MrCooper: BTW, Xwayland crashing is likely a Mesa issue
17:05 kode54: https://bugs.kde.org/show_bug.cgi?id=459872
17:05 MrCooper: it deliberately kills processes using GL after a GPU reset if they don't use the robustness APIs, without any justification in the specs that I can see
17:05 kode54: Gpu resets, xwayland crashes, session hangs
17:06 MrCooper: mareko: ^ Mesa killing non-robust GL processes after a GPU reset is causing trouble
17:07 kode54: labwc session survives
17:07 kode54: xwayland does not
17:07 kode54: plasma just hangs
17:08 pixelcluster: MrCooper: it's impossible to keep non-robust GL processes running after gpu resets
17:10 pixelcluster: if the kernel rejects submissions, that means either vram as a whole was lost and the contents of any buffers might be completely random, or the application was killed in the middle of execution which, too, leads to inconsistent memory contents
17:11 pixelcluster: without robustness extensions it's not possible to notify apps that their buffers now have inconsistent contents and therefore the only option left is to not let the application execute any further commands
17:23 mareko: program termination due to out of bounds access is allowed by GL without robustness, as indicated by the GL_ARB_robustness specification (i.e. even VM faults are allowed to terminate the app)
17:32 mareko: apps can turn off undefined handling (e.g. app killing) of the "device lost" state by enabling robustness
17:34 mareko: and they should expect undefined behavior from non-robust processes, including program termination
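The policy pixelcluster and mareko describe can be summarized in a small model. This is a sketch, not Mesa code: `Context`, `handle_reset`, and the returned action strings are hypothetical. The status names mirror two of the reset status values defined by GL_ARB_robustness; which status a real driver reports in each case is an assumption here.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ResetStatus(Enum):
    # Named after GL_GUILTY_CONTEXT_RESET_ARB / GL_INNOCENT_CONTEXT_RESET_ARB.
    GUILTY_CONTEXT_RESET = auto()
    INNOCENT_CONTEXT_RESET = auto()


@dataclass
class Context:
    robust: bool       # app opted in to robustness / reset notification
    caused_hang: bool  # this context submitted the hanging work


def handle_reset(ctx):
    """Decide what happens to a context after a GPU reset (toy model)."""
    if not ctx.robust:
        # No channel exists to tell the app its buffers are now inconsistent,
        # so the only safe option is to stop it from executing further.
        return ("terminate", None)
    # Robust apps are told about the reset and are expected to recover,
    # for example by recreating the context and re-uploading their data.
    if ctx.caused_hang:
        return ("notify", ResetStatus.GUILTY_CONTEXT_RESET)
    return ("notify", ResetStatus.INNOCENT_CONTEXT_RESET)
```

In this model a non-robust Xwayland is terminated even when another app caused the hang, which matches the behavior kode54 reports.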
18:36 vedranm: agd5f: oh, I see, that makes a lot of sense
18:50 vedranm: agd5f: on the same topic, is the current state of the kernel driver for Aqua Vanjaram on par with older generations, specifically if I ran radeontop or nvtop, can I expect sensible GPU usage monitoring capabilities to be exposed? does that require a specific newer kernel version?
18:53 agd5f: vedranm, on MI300? you'd probably need to make changes to nvtop or radeontop to query the right registers because there are a lot of IP instances to query
19:03 vedranm: agd5f: yes, my experience thus far has been everything shown at 100% usage on radeontop and wrongly reported memory capacity/usage and crash on start with nvtop
19:03 vedranm: that's with amdgpu from kernel 6.7
19:04 agd5f: vedranm, The registers may not even be at the same offsets
19:05 vedranm: agd5f: OK, but 6.7 should be new enough kernel to work with modifying radeontop?
19:06 vedranm: I can easily change the registers queried if the numbers reported by the kernel are sensible
19:11 agd5f: vedranm, I don't know whether the read_register query in the INFO IOCTL does the right thing for MI300. That would need to be fixed
19:21 vedranm: agd5f: OK, I will look into how radeontop queries it
19:24 agd5f: vedranm, for PM4 queues, those only operate on one XCC, so on a full MI300, you'd have 8 of them. Basically 8 independent compute instances. The driver creates a separate drm device node for each XCC
19:24 agd5f: AQL queues will work across XCCs
19:27 agd5f: so it will depend on what you are trying to measure with nvtop
19:27 agd5f: might be better off with amd-smi
19:41 vedranm: agd5f: when you say full MI300, do you mean X? in that case, would A be a partial MI300?
19:42 vedranm: well, I just want real-time utilization statistics for different IP blocks
19:43 agd5f: vedranm, no, MI300A or X are full GPUs, but they can be partitioned. So for example, you might get a VM with only a slice of an MI300. E.g., 1/8 or 1/4, etc.
19:43 vedranm: agd5f: oh, OK, yeah, let's assume for the sake of simplicity that is not necessary to support at the moment
19:45 vedranm: anyhow, thanks for the pointers, I will look into it and see where I get stuck
19:45 agd5f: ok, so assuming full GPU, bare metal, you'd need to query all of the instances of GFX, VCN, JPEG, etc. to get the full picture
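The aggregation step agd5f describes could look roughly like the sketch below. It assumes the per-instance busy percentages have already been obtained by some other means; `summarize_utilization` and the sample data are hypothetical, the numbers are made up, and no real amdgpu, radeontop, nvtop, or amd-smi interface is used.

```python
from collections import defaultdict
from statistics import mean


def summarize_utilization(samples):
    """Group (ip_block, instance, busy_percent) samples by IP block.

    Returns {ip_block: {"instances": n, "mean": x, "max": y}}.
    """
    by_block = defaultdict(list)
    for ip_block, _instance, busy in samples:
        by_block[ip_block].append(busy)
    return {
        block: {"instances": len(vals), "mean": mean(vals), "max": max(vals)}
        for block, vals in by_block.items()
    }


# Hypothetical readings, one tuple per IP instance (not from real hardware).
samples = [
    ("GFX", 0, 90.0),
    ("GFX", 1, 10.0),
    ("VCN", 0, 50.0),
]
summary = summarize_utilization(samples)
```

Reporting both the mean and the maximum per block keeps a single saturated instance visible instead of averaging it away across idle ones.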
22:24 kode54: agd5f: I experienced memory leaks with your kernel branch (queue-reset) as of e6df56e7db9e