10:08 kode54: I appear to have a bit of a problem with 2a238b09bfd04e8155a7a323364bce1c38b28c0f
10:08 kode54: it's the only backport from 6.13 to touch the smu in 6.12.10
10:08 kode54: and I have some smu related problems since updating past 6.12.9
10:09 kode54: hwmon probing software that used to work perfectly fine is suddenly throwing the GPU into random fits where the SMU locks up
10:09 kode54: never mind
10:09 kode54: it's happening again
10:10 kode54: I guess I detected it wrong
10:11 kode54: could SMU lockups also be caused by using a riser cable?
10:22 Venemo: kode54: it depends, if it's a bad quality cable it could be responsible for all sorts of random issues
10:22 kode54: it's a cooler master riser, supposed to be good for pcie 4.0
10:23 Venemo: I'm not familiar with what issue you are having, but I'd recommend reseating the cable and setting the PCIe port to PCIe 3.0 mode, just in case, to see if that helps. if it doesn't help then it's a different issue
10:23 Venemo: kode54: do you have a SFF case?
10:24 kode54: no, a mid tower
10:24 kode54: oh wait
10:24 kode54: yes, I do have SFF
10:24 kode54: it's a QUBE 500
10:24 Venemo: nice one, I was considering that
10:25 Venemo: and you have the GPU vertically mounted?
10:25 kode54: yes
10:25 kode54: and the GPU is just sort of hanging in space by its mounting bracket and the support brace sort of holding below it
10:26 kode54: there's at least 1-2cm of gap between the riser socket and the bottom of the case
10:26 Venemo: well, it's worth trying to reseat it and fiddle with those settings, but then again it may not help.
10:26 Venemo: what GPU do you have and what is the issue?
10:26 kode54: Sapphire Pure 7700 XT
10:26 kode54: and the issue I have is that the hwmon software I've been using for months now is suddenly randomly failing
10:27 kode54: causing a flood of SMU messages in the kernel log
10:27 kode54: and the GPU locking up
10:27 Venemo: did the problems start happening when you started using the riser cable?
10:28 kode54: no, I've been using it for months now
10:28 Venemo: then it's unlikely to be connected to the issue
10:28 kode54: the problem started around the 18th
10:28 Venemo: out of curiosity, what hwmon software do you use and what do you do with it?
10:28 kode54: coolercontrold is polling fan speeds and temperatures and graphing them
10:29 kode54: and Beszel is polling GPU temperature, usage, and RAM usage
10:29 Venemo: aha
10:29 kode54: I'm also using the Games on Whales container app, Wolf
10:29 kode54: that seems to idle on the gpu the whole time it's running, even when no clients are connected to it
10:30 Venemo: and why do you think those hwmon apps are related to the SMU errors?
10:30 kode54: because the SMU errors happen whenever the hwmon files are touched by either monitoring app
10:31 Venemo: does it also happen if you touch the files manually?
10:31 kode54: Jan 28 02:02:01 copycat kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000>
10:31 kode54: Jan 28 02:02:01 copycat kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
10:31 kode54: which file should I touch?
10:31 Venemo: the same ones that the app does?
10:31 kode54: one of the monitors is using the script, rocm-smi
10:31 kode54: running the `sensors` command locks up when it hits the GPU
10:31 Venemo: or you could run 'sensors' which should also get info from the same place
10:32 Venemo: huh
10:32 Venemo: that is really weird to be honest
10:32 Venemo: I have never tried those apps, but 'sensors' always worked fine here
10:32 kode54: I can't even break out of sensors
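
A rough sketch of what "touching the files manually" could look like, bypassing `sensors` and the monitoring apps. It assumes the usual amdgpu hwmon attribute names (`temp1_input`, `fan1_input`, `power1_average`) under `/sys/class/hwmon`; the function name and the choice of attributes are illustrative only and nothing here was run in the conversation. If the SMU is already wedged, a read may block the same way `sensors` did.

```python
from pathlib import Path


def read_amdgpu_hwmon(root="/sys/class/hwmon"):
    """Return {hwmon_dir: {attribute: int value or error string}} for amdgpu devices.

    The attribute list is an assumption; cards and kernels differ in what they expose.
    """
    attrs = ("temp1_input", "fan1_input", "power1_average")
    results = {}
    for hwmon in sorted(Path(root).glob("hwmon*")):
        try:
            # Each hwmon directory has a "name" file identifying the driver.
            if (hwmon / "name").read_text().strip() != "amdgpu":
                continue
        except OSError:
            continue
        readings = {}
        for attr in attrs:
            try:
                # Values are plain integers (millidegrees C, RPM, microwatts).
                readings[attr] = int((hwmon / attr).read_text().strip())
            except (OSError, ValueError) as exc:
                # Record the failure instead of aborting, so one bad file
                # does not hide the others.
                readings[attr] = f"error: {exc}"
        results[hwmon.name] = readings
    return results
```

Calling `read_amdgpu_hwmon()` once by hand while watching the kernel log would show whether a single read is enough to trigger the SMU messages, or whether it takes the repeated polling the monitoring apps do.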
10:34 Venemo: did it never work, or did it stop working at the same time when you started getting those SMU errors?
10:34 kode54: same time
10:34 kode54: stopped working for the GPU
10:34 kode54: the GPU is essentially locked up now
10:34 Venemo: was there a kernel or firmware update?
10:34 kode54: linux-firmware updated January 9th
10:35 kode54: linux kernel updated to 6.12.10 and then 6.13 around the same time
10:35 Venemo: I suspect one of those updates is responsible
10:35 kode54: I started running this container docker thing on the 18th too
10:35 kode54: I've tried reverting the kernel, and the firmware
10:35 kode54: I could try reverting literally everything
10:36 kode54: I also tried reverting my BIOS, which I just updated on the 18th
10:36 kode54: I updated a whole bunch of things around the time it started happening
10:36 Venemo: :(
11:20 kode54: back
12:18 kode54: just downgraded my GPUs
12:18 kode54: removed the 7700 XT as it appears to be failing already
12:18 kode54: now I have a 6700 XT, and the other machine has an RX 480
13:18 Venemo: kode54: you think it's a hw defect? :(
13:18 Venemo: I'm sorry to hear that
13:18 kode54: probably
13:18 kode54: I also managed to snap off the aRGB header
13:18 Venemo: I assume it's still in warranty so you can probably get a replacement
13:18 Venemo: oh :(
13:19 kode54: I really really really hate the aRGB plug design
13:19 kode54: why do they have to use tight tube sockets for the pins
13:19 Venemo: yeah it's... yeah
13:19 Venemo: did you snap it off the motherboard or off the graphics card?
13:20 kode54: off the graphics card
13:20 kode54: the one on the graphics card was angle mounted to the board
13:20 kode54: it would bend every time I tried to seat the plug
13:20 Venemo: oof
13:20 Venemo: I'm so sorry
13:21 kode54: nah, it's okay, I have lots of other working hardware
13:21 kode54: I'll stop buying stupid shit
13:21 kode54: ah, no pq to get annoyed at me being brash
13:21 kode54: he's right though, I need to be gentler with language in professional spaces
13:22 Venemo: don't worry about it
13:22 kode54: annoying that this stupid card had to do this
13:22 kode54: and preceded by SMU failures cropping up randomly
13:22 kode54: I can try filing for warranty replacement
13:23 Venemo: worth a try
13:23 kode54: it's a Sapphire card
13:23 kode54: the only problem is, I'll have to pay for two Uber trips to take it to the appropriate postage depot
13:23 kode54: and back again