03:02 DemiMarie: mareko: why is gfx11 unlikely to be useful?
03:04 DemiMarie: To me, this solves a very serious stability problem: on AMD GPUs, any process that times out typically kills the whole desktop session, as most compositors cannot recover from a context loss.
03:20 DemiMarie: Is the problem that these devices cannot handle preemption when a page fault is pending?
03:20 DemiMarie: That can be solved with "no page faults"
03:21 DemiMarie: For context, GTK developers consider GPU resets to be a hardware/driver bug and aren't interested in supporting recovery.
08:40 MrCooper: colour me skeptical that's a tenable position though
16:38 DemiMarie: MrCooper: Complain to the GTK developers, not me. They want to treat GPU buffers as immutable.
16:40 DemiMarie: MrCooper: Personally I think that innocent-context resets *are* a driver bug.
16:40 DemiMarie: Or a hardware bug.
16:42 DemiMarie: The GPU and hardware should ensure that contexts cannot interfere with each other.
16:43 DemiMarie: Actually, I think in my use-case (Qubes OS) this might not matter, because Qubes OS will only support AMD GPUs with process isolation for security reasons. With process isolation, there is only ever one context running on the GPU at a time. Of course, dGPUs losing VRAM is still a problem but I think that is just a hardware/firmware/etc bug.
17:00 MrCooper: I did express my opinion to them
17:36 DemiMarie: MrCooper: My hope is that a Red Hat customer pays Red Hat to make GTK and Mutter robust to device loss.
18:21 mupuf: DemiMarie: innocent context resets will be greatly diminished in the near future
18:23 DemiMarie: mupuf: is that the user queue work?
18:23 mupuf: we had two meetings this week with Christian, pepp, and more AMD and RADV people on this very topic
18:24 DemiMarie: nice
18:24 mupuf: userspace command submission is orthogonal to the handling of resets
18:24 DemiMarie: Ah, okay
18:24 DemiMarie: I thought userspace command submission was needed for per-queue reset. I’m glad to know I was wrong.
18:24 mupuf: well, not fuuuuuullly
18:24 mupuf: yeah, no, we can and will do much better across the board
18:25 DemiMarie: That is great.
18:25 mupuf: well, not sure how far back it will go, but we'll see
18:26 DemiMarie: RDNA1 and later would be enough I think
18:27 mupuf: it will never be perfect because the hardware is deeply pipelined and more than one context is in flight at all times, but there is plenty we can do to kill the hung context and keep the rest running
18:27 mupuf: I would prefer Polaris10+, but we'll see
18:29 mupuf: we have a plan to make it happen, and it will take many hands to succeed
18:30 mupuf: IGT tests, CI, continuous bug filing, and bug fixing
18:30 DemiMarie: mupuf: actually, in my use-case there should not be more than one context in-flight
18:31 mupuf: on top of the first implementation
18:31 DemiMarie: Qubes OS will be forcing process isolation on for security reasons
18:31 DemiMarie: hopefully that will make things better
18:31 DemiMarie: still, I am very glad that you want to improve this.
18:31 mupuf: DemiMarie: oh, that's a separate feature, yeah
18:32 DemiMarie: mupuf: LeftoverLocals and Whispering Pixels are very in-scope for the Qubes OS threat model, which is why I was so glad to see that the process isolation feature arrived.
18:34 mupuf: yeah, for desktops, isolation is a must. Screw the bubble in the pipeline
18:35 mupuf: hardware should evolve in this direction though, so the perf hit is only temporary... unlike the IOMMU's...
18:35 DemiMarie: how big a hit is the IOMMU?
18:36 mupuf: depends on the app. Nothing to 50%
18:36 DemiMarie: What is it determined by?
18:36 DemiMarie: Also, do huge pages on the CPU side change anything?
18:36 mupuf: the size of the dataset and access pattern
18:36 mupuf: this is mostly problematic for APUs
18:37 mupuf: discrete GPUs shouldn't be so affected... although with resizeable bar 🫣
18:38 mupuf: yeah, huge pages could help, but GPUs don't support them AFAIK... unless your name is broadcom
18:38 DemiMarie: Christian wrote that AMD GPUs do have huge pages in their own MMUs, and the IOMMU also supports them I believe.
18:39 DemiMarie: Whether the driver supports them is obviously another matter
18:39 mupuf: oooooooh, that is interesting
18:39 DemiMarie: That is one reason that Xen PV + GPU = sadness
18:39 mupuf: we disabled it for the deck because of the perf hit... and the limited security concern
18:39 DemiMarie: no huge pages in Xen PV
18:39 DemiMarie: Security concern?
18:40 DemiMarie: Ah, “it = IOMMU”, not “it = huge pages”
18:41 mupuf: well, a gaming console doesn't need much confidentiality between apps... especially when they all run with the same UID
18:41 mupuf: IOMMU
18:41 mupuf: just sucks that we can't catch driver bugs
18:41 DemiMarie: mupuf: A gaming console should have confidentiality between applications, and this should be enforced by a sandbox (perhaps provided by Steam)
18:42 DemiMarie: whether or not they do is another question
18:42 DemiMarie: Also firmware bugs
18:43 mupuf: but FYI, we run in CI with the IOMMU enabled, IIRC
18:43 mupuf: to catch some of these bugs
18:43 DemiMarie: nice
18:43 DemiMarie: My favorite thing about GPUs is probably that they do not execute speculatively, because they have no need to.
18:44 DemiMarie: So no Meltdown and no Spectre outside the firmware and kernel driver.
18:45 DemiMarie: I'm really glad that AMD GPUs will be losing their reputation for bringing down desktop environments.
18:47 mupuf: there are plenty more side channel attacks, but at least no speculation
18:47 mupuf: perf counters are insanely powerful on GPUs
18:48 mupuf: but, what you should worry about is leaking memory between processes
18:49 mupuf: amdgpu doesn't zero out VRAM pages before handing them to another process, IIRC
18:50 mupuf: same with registers, and others. They've been designed for throughput, not sharing resources
18:53 DemiMarie: mupuf: that is what process isolation is for
18:53 DemiMarie: not zeroing memory is something that is going to be dealt with
18:53 DemiMarie: on the kernel side
18:53 DemiMarie: is there a way to just disable the perf counters?
18:54 DemiMarie: or wipe them at context switch?
19:45 mupuf: On Intel and NVIDIA GPUs, the global counters are mediated by the kernel. Intel adds some random noise and limits the refresh rate to mitigate the problem. For per-context counters, yeah, I'm sure they are reset in the preambles