IRC Logs of #radeon on irc.freenode.net for 2025-02-03

04:45 Venemo: iive: DRM (Direct Rendering Manager) is a framework for creating graphics drivers in the kernel
07:29 kode54: heck yea
07:29 kode54: my GPU survived prodding the recovery file
07:30 kode54: and my DE, labwc, recovered instantly
07:30 kode54: only losses were xwayland restarting and taking all my X apps with it, Discord restarting its GPU process, and Ghostty dying
07:31 kode54: Firefox didn't show any noticeable effects from the GPU restarting
07:31 kode54: oh, right
07:31 kode54: swaybg died too
07:31 kode54: https://github.com/SeryogaBrigada/linux/commits/v6.13-amdgpu-testing all it took was this patch series on top of linux-zen
07:32 HdkR: Meanwhile, I just had a fault, the kernel claimed that the GPU soft-recovered, but it actually didn't :D
07:32 kode54: hope this work sees more testing
07:32 kode54: HdkR: oops!
07:32 HdkR: The recovery is consistently inconsistent
07:32 kode54: yeah, I got used to the one RT related hang in Stray causing only soft recovery
07:33 kode54: it used to cause full resets
07:33 kode54: whereas poking the recovery file in debugfs causes a full reset
07:33 kode54: amdgpu_gpu_recover, don't read it unless you want a fun time
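For reference, the behaviour kode54 describes (reading the recovery file forces a reset) can be sketched as follows. This is a minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and that the file lives in a per-card dri/<N>/ directory; the helper names and the glob pattern are hypothetical. The sketch only lists candidate files, and nothing is read unless trigger_recovery is called explicitly, which normally needs root.

```python
import os
from pathlib import Path


def find_recovery_files(debugfs_dri="/sys/kernel/debug/dri",
                        pattern="amdgpu_gpu_recover*"):
    """List candidate recovery files without reading them."""
    # os.path.isdir returns False (rather than raising) if the path is
    # missing or not accessible to the current user.
    if not os.path.isdir(debugfs_dri):
        return []
    return sorted(p for p in Path(debugfs_dri).glob(f"*/{pattern}") if p.is_file())


def trigger_recovery(path):
    """Read the file; per the chat, the read itself is what forces the reset."""
    return Path(path).read_text()
```

Keeping the lookup separate from the read means a script can show what it found before doing anything disruptive.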
07:34 HdkR: It feels like video decoding in the background raises the frequency of faulting, but hard to tell
07:34 kode54: I have video decoding randomly
07:34 kode54: from chat clients
07:34 kode54: and I also have Beszel poking my hwmon files every few seconds
07:35 kode54: without those patches, that used to hang after a few hours
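A monitoring agent that polls hwmon files every few seconds is doing roughly the following. This is a minimal sketch of that kind of polling, not Beszel's actual implementation; the function name and the choice to read every `*_input` file are assumptions. Per the kernel's hwmon sysfs convention, values are integers in scaled units (for example millidegrees Celsius for `temp*_input`).

```python
import os
from pathlib import Path


def read_hwmon(driver="amdgpu", hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN/sensor_input": int_value} for hwmon devices named `driver`."""
    readings = {}
    if not os.path.isdir(hwmon_root):
        return readings
    for dev in sorted(Path(hwmon_root).iterdir()):
        name_file = dev / "name"
        if not name_file.is_file() or name_file.read_text().strip() != driver:
            continue
        for sensor in sorted(dev.glob("*_input")):
            try:
                readings[f"{dev.name}/{sensor.name}"] = int(sensor.read_text().strip())
            except (OSError, ValueError):
                # A read can fail or return non-numeric data; skip that sensor.
                continue
    return readings
```

Calling `read_hwmon()` in a loop with a sleep of a few seconds approximates the access pattern described above, where each poll touches the driver.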
07:35 HdkR: Quirky
07:36 kode54: started around when I had things running either Mesa 24.2.8 or 24.3.4 on this system
07:37 kode54: also started when I installed a 6.12.10 kernel, but then I downgraded and it still did it
07:37 kode54: so I ruled out kernel version
07:37 kode54: or it's a long standing bug that just became a problem
07:37 kode54: this developer is experimenting with two patch series related to GFXOFF states
07:38 kode54: the "stable" series removes the 100us delay on switching the state, and puts some guard on one state start/end function set
07:38 kode54: the "testing" set I'm currently using removes a bunch of GFXOFF workarounds, then distills them to a single workaround that puts GFXOFF disable around job queue processing
07:39 kode54: seems to make sense that it shouldn't be halting processing in the middle of a job, though
07:39 kode54: unless that points to a bug in state pausing
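The idea behind the "testing" set as kode54 describes it, holding GFXOFF disabled for the duration of job processing, can be modelled as a reference-counted guard. This is a toy Python model under that reading of the chat, not the kernel code or the patch series itself; the class and method names are hypothetical.

```python
from contextlib import contextmanager


class GfxOffModel:
    """Toy model: GFXOFF may be entered only when no disable request is held."""

    def __init__(self):
        self.disable_requests = 0
        self.log = []

    @property
    def gfxoff_allowed(self):
        return self.disable_requests == 0

    @contextmanager
    def gfxoff_disabled(self):
        # Take a disable request on entry and always drop it on exit,
        # even if the job raises.
        self.disable_requests += 1
        try:
            yield
        finally:
            self.disable_requests -= 1

    def run_job(self, name):
        with self.gfxoff_disabled():
            # While the job is in flight, power gating is not permitted.
            self.log.append((name, self.gfxoff_allowed))


gpu = GfxOffModel()
gpu.run_job("draw")
```

The counter rather than a boolean lets overlapping jobs each hold their own request, so GFXOFF is only permitted again once the last one finishes.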
08:35 Venemo: kode54: what series is that?
08:36 Venemo: there is 0 explanation in the commit messages
08:36 kode54: it's the author's own series, I guess
08:36 kode54: I literally have no idea why they're not involving the dri-devel or amd-gfx mailing lists
08:37 kode54: they've been exclusively conversing with a handful of people on the Arch forums
08:37 kode54: well, and they popped into my issues on the FDO gitlab
08:37 Venemo: someone should tell them to work with upstream
08:37 kode54: would be nice
08:37 Venemo: removing random workarounds from the code... seems shady
08:38 kode54: indeed
08:38 kode54: though it's only for testing
08:38 kode54: the testing is trying to determine if the workarounds are being incorrectly applied because GFXOFF gating was missing from somewhere else
08:38 kode54: near as I can figure
08:39 kode54: I'd love to help them get upstream involved in testing this
08:39 kode54: they also patched it backwards
08:39 kode54: they should apply the gating before removing the workarounds
08:39 kode54: otherwise a bisect could hit a broken kernel
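The ordering concern can be illustrated with a small model: a bisect may check out any intermediate commit, so every prefix of the series has to be a working tree. This is a toy sketch with hypothetical commit labels and a simplified rule (a tree counts as broken when the workarounds are gone but the replacement gating is not yet in); it does not inspect the real series.

```python
def broken_prefixes(commits):
    """Return the prefix lengths at which a bisect would land on a broken tree."""
    has_workarounds, has_gating = True, False
    broken = []
    for i, commit in enumerate(commits, start=1):
        if commit == "remove_workarounds":
            has_workarounds = False
        elif commit == "add_gating":
            has_gating = True
        # Neither the old workarounds nor the new gating protects this tree.
        if not has_workarounds and not has_gating:
            broken.append(i)
    return broken


as_posted = ["remove_workarounds", "add_gating"]
reordered = ["add_gating", "remove_workarounds"]
```

Under this rule, `broken_prefixes(as_posted)` flags the first commit, while `broken_prefixes(reordered)` flags nothing.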
08:40 kode54: they have an FDO account, but a private profile
08:41 kode54: really, those workarounds should only be removed if it's determined that the new additional gating fixes the exact cases those workarounds were added to solve
08:45 kode54: until upstreaming happens, I have something that works around my machine hanging regularly since a given install date
08:45 kode54: and Arch continues to be a playground for testing bleeding edge things to find bugs before they filter down to LTS distributions
09:00 kode54: cool
09:00 kode54: I experienced more fun debugging this
09:00 kode54: a forced gpu reset caused labwc to hang with flip_done timeouts
09:01 kode54: got back to GDM by kill -9'ing labwc and resetting the GPU again
09:35 Venemo: kode54: yeah you expressed my concerns quite well
09:36 Venemo: kode54: what are you expecting from that series?
09:39 kode54: well, applying it fixes my problems, and I knew my problem was gfxoff related
09:39 kode54: so I expect gfxoff related code needs more testing
09:40 kode54: now I'll be going to bed for the night
09:42 Venemo: what is the problem that it fixes?
11:29 fililip: kode54: displayport hotplugs and resets are currently borked like that, i remember reporting it a few months ago, maybe you hit an issue with that
22:15 kode54: dang, there's yet another workaround in test
22:44 kode54: I'm giving up on these weird "test" commits
22:44 kode54: I'll stick to stock kernels
22:44 kode54: leave the testing to the people who have test hardware
22:46 kode54: if necessary, I'll ship my video card to someone to test it
22:49 kode54: or I can just give up on running any recent generation and switch back to RDNA 2