IRC Logs of #radeon on irc.freenode.net for 2025-02-23

01:13 DemiMarie: mupuf: Does AMD allow any shader to access the global counters?
01:37 mareko: DemiMarie: GPU recovery is unrelated to user queues and is a completely different problem
01:40 mareko: it's also impossible to not affect other processes when the only way to recover it is to do a full reset while 1 bad process and 2 good processes are using the GPU at the same time
01:51 soreau: mareko: I thought this was about innocent resets
01:52 soreau: not sure exactly what innocent means in this context (no pun intended) but if an app is the offender, it won't bring down the compositor if you can kill the context without full gpu reset IIUC
01:54 mareko: soreau: the problem is a full GPU reset is usually needed
02:04 DemiMarie: soreau: that is what gives AMD its bad reputation for reliability
02:11 mareko: any compositor design that assumes that a full GPU power cycling will never be needed is not a good design
02:19 soreau: mareko: sure, but all bets are off if clients using the gpu don't play along
02:19 soreau: I mean what's the difference between a reboot, compositor crash, gpu reset (where compositor does not handle it) and logging out?
02:20 soreau: ideally, gpu reset would be a quickish flicker and back to business as usual
02:20 soreau: but as DemiMarie said, gtk (at least) treats textures as immutable
02:22 soreau: sure you could have some session-save functionality, but it will only restart clients and not save any client data that was only in memory
02:22 DemiMarie: mareko: not every GPU user is able to handle the GPU going away
02:23 DemiMarie: that requires treating GPU-side state as a cache that can always be regenerated if necessary, and not everyone does that.
02:23 soreau: yea but it's possible, just not all (if any) clients do it
02:24 DemiMarie: the other question is what makes a GPU less reliable than a CPU
02:24 soreau: I think compositor surviving while clients die is just as bad as any of the other aforementioned fates
02:25 DemiMarie: the answer, of course, is lack of instruction-level preemption for graphics (as opposed to just compute), but this is apparently very difficult to fix at a hardware level
02:25 DemiMarie: and impossible for existing hardware
02:25 soreau: but at the core, gpu resets should be taken more seriously, not 'welp let us just reset now and dust our hands clean of this failure'
02:25 DemiMarie: yup
02:25 soreau: would be nice to assume compositors and clients both will pick up the slack, but it's just not reality
02:26 mareko: soreau: switching all non-GL non-Vulkan processes to swrast after a reset and continuing is the safest recovery; toolkits that use GL could switch to swrast manually if the compositor told them so
02:26 DemiMarie: mareko: GTK considers its non-GL, non-Vulkan paths to be deprecated
02:26 soreau: mareko: I tried to implement this in wayfire and failed miserably (though admittedly that's probably due to my lack of fundamental understanding)
02:26 mareko: read the part after ;
02:27 DemiMarie: there are no non-deprecated swrast paths in GTK
02:27 DemiMarie: it's “use a GPU or use llvmpipe/lavapipe, and the latter is sloooooow”
02:27 mareko: GTK using GL is still a non-GL app in essence because it controls how it uses GL
02:28 DemiMarie: they have no interest in supporting SW rendering except as a slow fallback
02:28 soreau: but if you can switch between hw -> sw, would the other bit be possible? 'This might cause your world to end. Try anyway?'
02:28 soreau: sw -> hw
02:28 mareko: a lost context is lost, so you might as well recreate it as swrast
02:29 soreau: I mean what do you do, put the system on life-support because the gpu is up in arms?
02:29 DemiMarie: mareko: they consider context lost to be a HW or driver bug
02:29 mareko: that's not what it is
02:29 DemiMarie: or a bug in the app that lost the context
02:29 DemiMarie: mareko: why?
02:29 DemiMarie: fundamentally, the GPU is just another processing unit in the system
02:29 DemiMarie: nobody expects the OS to recover from a CPU being yanked out
02:30 DemiMarie: so why is everyone expected to recover from the GPU going away?
02:30 mareko: irrelevant
02:30 DemiMarie: I think this is what you need to persuade app & toolkit developers of
02:30 DemiMarie: that a GPU can never be made reliable
02:31 mareko: malware can hang the GPU
02:31 soreau: If it were the case that compositors and clients handled resets gracefully, there'd be an instant lack of motivation to *not* reset the gpu from the driver dev community. Then you have black flickers at times, and then stuff runs slowly?
02:31 DemiMarie: but why is a hang not recoverable without losing other processes' state?
02:31 DemiMarie: why can't you kill the context at fault and keep going?
02:32 mareko: because it's not implemented in hw
02:32 soreau: It would be like windows xp all over again
02:32 soreau: explorer.exe has crashed
02:32 DemiMarie: mareko: why is it not implemented in hw?
02:33 mareko: DemiMarie: some recovery cases are, some aren't
02:33 DemiMarie: mareko: which ones are not?
02:33 DemiMarie: or if the GPU does need to be reset, why does that kill anything not currently running on it?
02:34 DemiMarie: that seems like a HW bug to me
02:34 DemiMarie: possibly one that cannot be fixed on current HW, but still a HW bug
02:35 mareko: that's not what a bug is
02:35 DemiMarie: but why?
02:36 DemiMarie: I can state that Intel GPUs guarantee that a single bad context cannot bring down others unless there is a hardware, firmware, or kernel driver bug
02:36 mareko: since hw is exposed to userspace, anybody can hang it and affect all apps
02:36 DemiMarie: non sequitur
02:37 mareko: like I said, not all recovery cases are implemented in hw
02:37 DemiMarie: why is that?
02:37 mareko: why is water wet?
02:37 DemiMarie: it is
02:38 DemiMarie: Also, for APUs a GPU reset should only affect stuff currently running on the GPU at the time of the reset, because RAM is not affected.
02:39 mareko: that's what happens
02:39 DemiMarie: Is the problem that with dGPUs a reset wipes VRAM?
02:39 DemiMarie: That seems like it should be fixable in future HW versions
02:41 soreau: I think this is the reason why Xgl wasn't widely adopted in its time, because gpu docs were not widely available and writing drivers was a lot of guesswork, and folks were like 'are you sure we should rely on gpu drivers to run the desktop display server?'
02:42 DemiMarie: Some failures (like "HW corrupted its own internal state") will always be unrecoverable, but it should not be possible for unprivileged software to trigger those failures.
02:43 soreau: now fast forward to the future, many compositors and clients expect the gpu drivers to work and if they were working at the start of the process, they don't really expect that contract to end
02:44 soreau: but piping context failure states back to userspace and hoping userspace will do The Right Thing?
02:44 soreau: when has that assumption ever worked
02:45 soreau: if anything, the drivers should be the one ready to handle context lost and fix the brokenness, before userspace even notices
02:45 DemiMarie: bingo
02:46 DemiMarie: unless the app did something undefined, but an app is always allowed to crash itself
02:46 mareko: lost context/device was added to APIs for a reason, if you don't handle it, it's your fault
02:47 soreau: that's an easy argument to make, when you're on the driver side perspective
02:47 soreau: but from consumer/user POV, it doesn't really make sense
02:48 mareko: the only argument that works in your favor is that robustness was added into GL too late and was implemented too late
02:48 soreau: true, it would be nice to see it work for maybe simple-egl.c
02:48 soreau: and for any gl(es) compositor to get it right
02:48 soreau: but so far, we got nothin'
02:49 mareko: had it been in GL 1.0 (30 years ago), the SW would have evolved to handle it from the beginning
02:49 soreau: supposedly sway does it, but I haven't seen it in action
02:50 soreau: mareko: but there are so many details in 'wild' software, and not many developers are going to reset their gpu's (on purpose) just to test the path
02:50 soreau: you'd have to have had the context lost and gpu reset trigger in place
02:50 mareko: if you don't handle lost context/devices today, it's still your fault, but that being added and implemented too late is a good excuse
02:50 soreau: and some CI to figure it all out for you
02:51 soreau: mareko: well, you know the real case but I can understand your position is what it is :P
02:52 soreau: as I said before, I'd hate to see driver devs get comfortable with resetting just because it's convenient
02:53 mareko: a GPU reset is really a solution to a halting problem that CPUs solved with preemption instead of resetting
02:54 DemiMarie: mareko: then GPUs need preemption too
02:54 mareko: definitely
02:54 DemiMarie: it might be destructive to the current context, but no other context should be affected
02:55 DemiMarie: If AMD GPUs cannot make this guarantee even with process isolation, I consider that to be a fairly severe hardware limitation
02:57 soreau: mareko: is this not something that can be simulated or otherwise handled by the drivers?
02:58 DemiMarie: soreau: no
02:58 DemiMarie: specifically, the user-mode driver is not a security boundary, and the kernel-mode driver doesn’t validate commands passed by userspace
02:59 mareko: the solution that exists today is lost contexts/devices in APIs
02:59 DemiMarie: mareko: and the point that soreau and I are making is that this is a bad solution
02:59 mareko: it's the only one though
02:59 soreau: according to google's AI: "GPU preemption can be simulated by using a kernel thread or IOCTL-based approach to execute GPU segments. These approaches can be used to prioritize real-time tasks on a GPU."
02:59 DemiMarie: only one on current hardware, perhaps
02:59 mareko: you can argue that OpenGL is also a bad solution for year 2025
02:59 DemiMarie: not on future hardware
03:00 DemiMarie: and not on some hardware that exists today
03:00 soreau: mareko: are you suggesting that gpu resets are somehow different with vulkan?
03:00 DemiMarie: If AMD GPUs can’t deal with certain faults without wiping VRAM, and unpriv userspace can trigger those faults, that’s a hardware problem
03:00 mareko: soreau: no
03:00 DemiMarie: and it needs to be fixed in future hardware
03:01 DemiMarie: because at least one of your competitors already makes a “no innocent context crash” guarantee
03:01 mareko: that's that compositor's problem
03:01 DemiMarie: and I would be highly surprised if Nvidia doesn’t make it as well
03:01 soreau: mareko: well I don't really understand what you mean by choosing opengl is 'a bad solution' right now
03:01 DemiMarie: mareko: I think you are expecting SW writers to work around what is ultimately a HW/FW problem
03:01 soreau: yes
03:01 DemiMarie: and needs to be solved on the HW/FW side
03:02 soreau: and SW writers shouldn't have to care what happens 'behind the scenes'
03:02 DemiMarie: and which other vendors do solve in HW/FW
03:02 DemiMarie: or the kernel driver
03:03 mareko: the API solution exists and it's the only one that will work for some number of years
03:04 DemiMarie: It currently does not work
03:04 soreau: well it 'works' for some values of 'works' :P
03:04 DemiMarie: It probably works if you are on KDE and using only KDE applications
03:05 DemiMarie: mareko: is future HW better?
03:05 mareko: DemiMarie: we don't talk about future HW publicly
03:05 DemiMarie: mareko: please tell your HW engineers that future HW needs to be able to guarantee no innocent context losses
03:06 DemiMarie: (assuming the HW is not completely broken due to e.g. overheating)
03:06 soreau: and have preemption :P
03:06 DemiMarie: soreau: current HW does have that
03:06 soreau: then why doesn't the driver take advantage of it?
03:07 DemiMarie: soreau: GPUs don’t have instruction-level preemption for graphics workloads, and I believe only some have it for compute workloads
03:07 DemiMarie: The fixed function hardware is not preemptable
03:07 mareko: if an app only uses compute, I think that can be killed without affecting other apps
03:07 DemiMarie: There is simply too much state that cannot be saved and restored
03:07 soreau: niche case
03:08 DemiMarie: What should happen, though, is that the context is blown away and other contexts are not affected
03:08 DemiMarie: That is what happens on at least Intel
03:08 DemiMarie: If you hang while using the fixed-function hardware
03:09 mareko: compute is easily killable and preemptible, so the other solution is to use only compute queues in compositors and toolkits
03:09 DemiMarie: mareko: why isn’t graphics killable?
03:09 DemiMarie: I understand why it can’t be preempted
03:09 DemiMarie: but why can’t it be killed?
03:09 DemiMarie: mareko: only using compute queues is not a realistic option
03:09 DemiMarie: and expecting every program to handle GPU resets is not realistic either
03:10 DemiMarie: Innocent context GPU resets are a security vulnerability
03:10 mareko: those are the tools you have
03:10 DemiMarie: mareko: the tools you have on current hardware
03:10 DemiMarie: hence why I am saying that if you want AMD HW to no longer have a reputation for unreliability, you need future HW to do better
03:11 mareko: I think the future of drawing GUI and blitting is compute shaders
03:11 DemiMarie: why?
03:11 mareko: simpler
03:11 mareko: good preemption, good QoS
03:12 soreau: pixdecor uses gles compute quite heavily, with decent results
03:12 mareko: non-3D use cases should probably switch to compute
03:14 DemiMarie: 3D is very much a thing
03:14 soreau: I ported the weston smoke (shm) to gles compute and of course, night and day https://www.youtube.com/watch?v=MX7xPQBZYE4
03:14 DemiMarie: mareko: are you saying that the fixed function HW should only be used for games?
03:14 DemiMarie: also, the fixed function HW still needs to be resettable without tearing down all the compute queues
03:15 mareko: if you don't need to draw triangles projected on the screen (rasterization), why use gfx
03:15 DemiMarie: Does AMD have any protection against performance counter reads from unprivileged userspace?
03:15 mareko: DemiMarie: yes, they can be disabled
03:16 soreau: mareko: so ousting the vertex shader is optimal?
03:16 DemiMarie: soreau: the trend is towards Amplification/Mesh in Vulkan, but those still have fixed function hardware involved
03:16 DemiMarie: also, doing everything with compute is a huge perf hit on mobile, because mobile uses tiled rendering
03:16 DemiMarie: tiled rendering massively reduces memory bandwidth
03:17 DemiMarie: mareko: which fault cases are not handled without loss of VRAM right now?
03:17 DemiMarie: and which cannot be handled without loss of VRAM due to hardware (not firmware, that’s updateable) limitations?
03:18 mareko: the gfx pipeline is not needed for 2D, originally it wasn't used for 2D, it's used now, and in the future having compute as another option would be more robust
03:19 mareko: also if the gfx pipeline hangs, you can just keep it hung because compute queues don't need it
03:20 DemiMarie: mareko: why can AMD not recover while Intel can? Is that something you can't discuss?
03:20 DemiMarie: also "cannot play any other games until reboot" is not a good experience either
03:22 mareko: DemiMarie: it's not implemented, it's as simple as that
03:22 DemiMarie: mareko: that is a severe limitation then
03:22 DemiMarie: and one AMD should fix
03:22 DemiMarie: in future HW
03:22 mareko: it can recover, but innocent contexts will likely not continue
03:23 DemiMarie: because VRAM is gone?
03:23 mareko: because it's sometimes not possible to kill a draw that is stuck in the gfx pipeline
03:23 DemiMarie: I'm only interested in the process isolated case, so there should not be other contexts in the pipeline.
03:24 DemiMarie: mareko: why are there any other contexts resident on the GPU at all?
03:24 DemiMarie: I thought that the only GPU-resident state that is not per-context is VRAM.
03:25 DemiMarie: mareko: is this specific to AMD HW or do other vendors’ dGPUs behave the same?
03:26 mareko: any app that has submitted jobs to hw queues that haven't started executing is potentially considered resident
03:27 DemiMarie: I thought that with process isolation there will only ever be one such app
03:27 mareko: and a full reset can kill those too because the kernel driver doesn't know whether any of them have started executing
03:27 mareko: process isolation kills perf so much that nobody will ever use it outside of niche use cases
03:28 DemiMarie: How much?
03:28 mareko: that's not true
03:28 DemiMarie: I believe Google will be turning process isolation on in Chromebooks
03:29 DemiMarie: that?
03:31 mareko: if you only allow 1 app on the GPU, windowed benchmarks will be much slower and my above statement is true in that case
03:32 mareko: if you only allow 1 app per queue, then you still have N queues that can execute in parallel
03:33 DemiMarie: I’m thinking of the 1 app at a time case
03:33 DemiMarie: In my use-case (Qubes OS) security >> performance
03:34 mareko: that should be fine, but windowed app perf will suck
03:34 DemiMarie: right now everything is using software rendering, so any GPU acceleration will beat that
03:34 DemiMarie: how much suck?
03:34 mareko: it's FPS-dependent
03:34 DemiMarie: when does one notice it?
03:34 DemiMarie: high or low FPS?
03:34 mareko: let's say 50% slower
03:49 soreau: all that I know is mareko made r300g run (without graphical corruption) Amnesia on RV350 when the minimum requirements per the game devs was RV530, with some mesa patches when r300g was in its inception
03:49 soreau: something about those limited ALU's
03:51 soreau: it's encouraging to hear that compute might be the way forward
03:57 soreau: mareko: so for the purposes of a (gles) wayland compositor, would it be feasible to use (imported client) GL textures as inputs for compute shaders? (porting a gles vert/frag shader compositor to compute shaders only)
03:57 soreau: or would you be the person to ask that question
04:35 mupuf: DemiMarie: FYI, NVIDIA has a 32-cycle worst-case latency for context preemption... but they don't make use of it because the context is HUUUGE (was ~100MB around the Kepler time IIRC)
04:36 mupuf: !00
04:46 mupuf: mareko: VRAM is not magically lost upon GPU reset though, but you are absolutely right: The hardware doesn't make it easy, and features to increase reliability around hangs have just not been implemented or prioritized, unlike on Intel GPUs or NVIDIA. Most of the problems stem from the differences between Linux and Windows's WDDM. AMD's solution works for Windows, not easily for Linux
11:27 MrCooper: mupuf DemiMarie: amdgpu does intend to always zero VRAM before handing it out to user space now, there's a bug though: https://gitlab.freedesktop.org/drm/amd/-/issues/3812
11:29 MrCooper: looks like there might be a fix for that
16:47 DottorLeo: hi! @agd5f_ I've seen the PR for 6.15 about the i2c, will it cover all GPU manufacturers or are there non-standard interfaces for RGB?
20:14 DemiMarie: mupuf: what makes the solution work well for WDDM but not Linux?
20:16 mupuf: MrCooper: thanks, I was not aware it was a bug
20:17 mupuf: DemiMarie: On Windows, all GPU command submission is performed by a Microsoft-provided component which handles hang detection, resets, and resubmission
20:17 mupuf: all apps use the same GPU context
20:18 mupuf: context switching is done in the component
20:18 mupuf: disclaimer: I'm not an expert and likely am saying utter nonsense