IRC Logs of #dri-devel on irc.freenode.net for 2024-02-24

01:33 robclark: DemiMarie: more like different use-cases.. if it takes too long to render a single frame you might as well give up and declare context-lost.. whereas there are legit use-cases for long-running compute shaders, and there is vastly less state to save/restore.. so different trade-offs lead to different results
01:34 DemiMarie: robclark: I do wish that context loss was much less disruptive.
01:34 DemiMarie: Why do you need to reset the whole GPU, instead of just aborting the computation?
01:34 robclark: that is mostly an app problem
01:35 DemiMarie: yes, but unfortunately there are tons of broken apps
01:35 robclark: aborting the computation == undefined state... resetting gpu == defined state
01:35 DemiMarie: robclark: only partially undefined
01:36 DemiMarie: And resetting the GPU tears down lots of other processes
01:36 DemiMarie: The huge blast radius of a problem is my main complaint.
01:36 robclark: if done correctly, it should affect other processes
01:36 robclark: shouldn't
01:36 DemiMarie: It means that shared GPUs are vulnerable to a denial of service attack.
01:37 DemiMarie: Which I consider to be a (possibly unfixable) security problem with current hardware.
01:37 robclark: so, things like browsers take steps.. like disallowing webgl/webgpu to create new ctx if the tab (or origin?) has already triggered a gpu crash
01:38 DemiMarie: Can virtio-GPU native contexts do the same thing?
01:39 DemiMarie: Right now, it seems that GPU resets tend to cause a system-wide hang
01:39 DemiMarie: Which is really bad
01:40 robclark: re: systemwide hang, I guess that depends on gpu.. and things with vram sitting off pci bus have a harder time at it.. that is certainly not a problem on qc things and probably not on other igpu things
01:41 DemiMarie: I mostly hear people with AMD GPUs complaining of systemwide hangs
01:41 DemiMarie: It seems that a buggy VAAPI implementation causes the GPU to reset, which in turn causes the Wayland compositor to get a context lost error
01:42 robclark: re: nctx doing the same thing.. I don't think we could out of the box, and still pass deqp/cts (there are tests for robustness which intentionally trigger gpu hangs).. but maybe it would be an option if you didn't care about conformance
01:42 DemiMarie: It would indeed be a useful option
01:43 DemiMarie: Even more useful would be to log what happened in a way that allows the user to be asked.
01:43 robclark: I'd have to think about it.. it might need to be implemented in guest kernel (since host doesn't know if the same process simply closed and re-opened the guest drm device)
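The browser-style mitigation robclark describes at 01:37 (refusing new contexts for a tab or origin that has already triggered a GPU crash) could be sketched roughly as below. This is a minimal illustration, not how any browser or virtio-gpu implementation actually does it: `ContextPolicy`, `record_hang`, `may_create_context`, and the `max_hangs` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ContextPolicy:
    """Hypothetical per-origin gate for creating new GPU contexts."""

    # Assumed threshold: block after the first hang blamed on an origin.
    max_hangs: int = 1
    # Number of GPU hangs attributed to each origin so far.
    hangs: Dict[str, int] = field(default_factory=dict)

    def record_hang(self, origin: str) -> None:
        # Called when a context owned by `origin` is blamed for a GPU hang.
        self.hangs[origin] = self.hangs.get(origin, 0) + 1

    def may_create_context(self, origin: str) -> bool:
        # Refuse new contexts once the origin has reached the threshold.
        return self.hangs.get(origin, 0) < self.max_hangs


policy = ContextPolicy()
policy.record_hang("https://bad.example")
print(policy.may_create_context("https://bad.example"))   # False
print(policy.may_create_context("https://good.example"))  # True
```

As robclark notes, the hard part for native contexts is choosing the key: the host cannot tell whether the same guest process simply closed and re-opened the guest drm device, so the bookkeeping might have to live in the guest kernel.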
02:38 pixelcluster: yes, system-wide hangs are a pretty huge problem especially for dGPUs
02:40 pixelcluster: resetting the GPU for AMD GPUs means quite literally "rebooting" at least parts of the GPU (which on dGPUs also means *all* of VRAM is lost)
02:42 pixelcluster: if you get lucky and the hang is inside shaders, you can try killing these shaders instead of having to go for a full reset, but way too often that doesn't work
02:46 tleydxdy: > that kind of stuff
02:46 tleydxdy: what kind of stuff?
02:47 tleydxdy: also partial gpu reset is already a thing https://www.phoronix.com/news/AMDGPU-Linux-6.1-Mode2-RDNA2 it just needs more work still
03:35 pixelcluster: MODE2 reset is a thing but I'm pretty sure only on APUs
03:36 pixelcluster: at least I haven't seen it on any dGPU I own, but all of my APUs have it
04:11 pixelcluster: MODE2 also still has the big problem that it reboots all command processors etc., so if any innocent workload is running on a different queue while some other workload is hanging, that innocent workload will be killed too
04:11 pixelcluster: it's better than MODE1 but definitely not perfect
04:22 DemiMarie: robclark: for me, _any_ guest process hanging the GPU would be enough justification to trigger the prompt.
04:27 robclark: yeah, one way to approach it is for trusted host to prompt users to nerf badly behaving guest... it won't really be practical for conformance testing (since infinite loops is a thing that is basically allowed by the standard) but it puts the user in control
04:33 DemiMarie: That is what I want indeed. Qubes OS doesn’t consider denial of service a security issue, but it’s still not great.
04:35 DemiMarie: pixelcluster: this explains why AMD GPUs have a bad reputation for problem recovery.
04:38 DemiMarie: robclark: one idea I had was some sort of software fault isolation a la Native Client, where the shader compiler is patched to produce code that can statically be checked to not hang the GPU in a not-easily-recoverable way.
04:40 psykose: reminds me of the halting problem
13:12 DodoGTA: Why does the Mesa repository occasionally throw a `fatal: bad object refs/remotes/origin/staging/20.3` error?
13:23 kisak: Older staging branches don't exist anymore https://cgit.freedesktop.org/mesa/mesa/refs/heads
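If the error comes from a stale remote-tracking ref for a branch that was deleted upstream (an assumption; the log does not confirm the cause), one possible cleanup is to delete the local copy of that ref and then fetch with pruning. The sketch below only builds the git commands; `stale_ref_cleanup_commands` and `run_cleanup` are hypothetical helper names, while `git update-ref -d` and `git fetch --prune` are standard git commands.

```python
import subprocess
from typing import List


def stale_ref_cleanup_commands(remote: str, ref: str) -> List[List[str]]:
    """Return git commands that drop one stale ref, then prune the remote."""
    return [
        # Delete the local remote-tracking ref that git complains about.
        ["git", "update-ref", "-d", ref],
        # Re-fetch and drop any other refs that no longer exist upstream.
        ["git", "fetch", "--prune", remote],
    ]


def run_cleanup(repo_dir: str, remote: str, ref: str) -> None:
    """Run the cleanup commands inside `repo_dir`, stopping on failure."""
    for cmd in stale_ref_cleanup_commands(remote, ref):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

For the ref in the question this would be called as `run_cleanup(path_to_mesa_checkout, "origin", "refs/remotes/origin/staging/20.3")`, where `path_to_mesa_checkout` is a placeholder for the local clone. Whether this resolves the error depends on what actually corrupted the ref.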
17:26 maybebaby: The best language for doing the needed is scala language and it's programmatic paradigm under the hood to mean to deal with compressed sequences and accessor transformer wild toggles. As well as rich iterations specification for or over collections
17:26 maybebaby: Sequences and sets
17:27 maybebaby: And the relevant approach is to derive an interface for ordered sets backed up by compressed bitsets.
17:28 maybebaby: The cherry on the cake is that they allow predicates to be built too to collections iterators and everyone.
17:28 maybebaby: It's fused paradigm correct for such purposes.
17:29 dcbaker: karolherbst: I got your bindgen static functions MR landed right before the feature freeze tomorrow
17:29 maybebaby: EU investments or funded language for hw
17:30 maybebaby: And the method is that every sequence has a min max data structure, where the size is determined by the biggest element in it.
17:31 maybebaby: You can make this in rust too through crates, core does not have it unlike in scala.
17:41 maybebaby: It's because it's two directional information passing during the compilation as example building the whole lists per or during iteration building
17:41 maybebaby: The hash cause of counting zeros and ones keeps the state in hw structures
17:42 maybebaby: It says how big is the remainder from whatever modulo
17:42 maybebaby: Relative to that
17:44 maybebaby: Whereas compilation provides the info to what range, hash provides to compiler or builder the difference of modulo, by or after the remainder calculated.
17:44 karolherbst: dcbaker: cool, thanks
17:46 karolherbst: dcbaker: btw, I think those "1.3.0" need to be replaced by "1.4.0"
17:47 maybebaby: You have transformers if you want to convert the type
17:57 dcbaker: karolherbst: guess that’ll happen after the feature freeze.