IRC Logs of #radeon on irc.freenode.net for 2023-07-04

03:02 dormito: Is there a way to prevent programs from making amdgpu (driver &| hw) enter an infinite reset loop?
03:04 airlied: not really, the reset loop should just be made stable
03:04 dormito: what do you mean by "stable"?
03:05 airlied: well, there was a time when a gpu reset would take out the card or machine; it's getting better now
03:05 dormito: meh, my machine has been "taken out"
03:05 dormito: I can only access it via ssh
03:14 dormito: so is it the design of hw accel interface that makes it impossible for the driver to prevent userland from killing the gpu?
03:15 dormito: *hardware's hardware acceleration interface.
03:17 airlied: any turing machine can be stuck into a loop
03:17 airlied: and you can't detect that in advance
03:18 airlied: turing complete machine
03:18 airlied: you just have to have a good reset mechanism
03:18 dormito: indeed. though privilege separation typically allows us to kill said loop (when it's userspace)
03:20 airlied: yeah the idea is to get the gpu resets to be a lot more streamlined, some hw is a lot better at it
03:20 airlied: it would be nice if it could get to cpu context killing level
03:21 dormito: I wonder if a PCIe bus reset would allow me to regain it
03:27 dormito: Hmmm. I guess maybe PCIe bus resets aren't defined/implemented for some/all on-chipset devices. and "unbind" is a dead end.
03:29 dormito: Hmmm: does dmesg screaming about "secure display" mean the amdgpu driver is married to the PSP?
08:31 Venemo: dormito: speaking of gpu resets, unfortunately, this is a pretty common issue, and sadly needs a lot of work to solve
13:11 dormito: Venemo: Is there anything that could give more background? I have a lot of projects going on, so my bandwidth is limited, but I also hit this issue often enough that it might be worth contributing
13:11 dormito: (I have exp with kernel drivers and PCIe devices, but not so much with drm subsystem)
13:11 Venemo: dormito: depends on which part you are interested in. this affects the entire stack
13:12 Venemo: the first problem is that desktop environments and apps are usually not robust against a gpu reset, that is why your system dies when the gpu resets
13:12 Venemo: the second problem is that gpu reset itself can be buggy in the kernel, so the kernel sometimes crashes when it resets
13:13 dormito: Venemo: in my specific case, killing all userspace apps that are directly using the gpu did not allow fbcon to give me a usable getty (even with a "chvt" via ssh), so it seems likely to me the kernel driver was stuck in an endless reset loop
13:14 Venemo: depends on your hardware
13:14 Venemo: generally speaking, APUs are somewhat more resilient to this issue because they don't have dedicated VRAM so the memory isn't lost. also, the kernel tends to handle GFX10.3 and newer better.
13:16 dormito: this is an AMD Ryzen 7 5700U, and an integrated radeon (laptop). Anyhow, that was a very roundabout way of saying "kernel drivers are where most of my interests would be"
13:43 Venemo: well, if you can ssh into the machine you can take a look at dmesg and see what is happening
13:44 Venemo: 5700U has an old vega igpu, I wouldn't place any bets on that working right with gpu resets
13:53 dormito: I've already hard powered off the machine, but I have logs for the boot, and the driver endlessly resetting the gpu, with associated warning/stacktrace (endless until I powered off the system, that is).