00:17 Company: people just need to learn that there's 3 different names for each gpu: kernel driver, GL driver, Vulkan driver
00:17 Company: and those names are creatively chosen, not because they make sense
00:22 Company: but I can confirm from Gnome that people who aren't aware of this get very confused by those names
00:22 airlied: in some cases they made sense when they were chosen, but the meanings have shifted
00:23 alyssa: for us, kernel=gl=asahi, vk=honeykrisp which.. could be worse
00:23 Company: also, Gnome lost its creativity because we named everything gnome-thing
00:24 feaneron: you can't say that with a straight face
00:24 Company: which worked for a few years but now that we want to replace older apps with newer ones, we don't have names available anymore
00:24 feaneron:named his app Boatswain, and all types start with Bs
00:25 feaneron: BsWindow, BsStreamDeck, it's all bs
00:26 Company: i'm still salty that it's called gnome-text-editor because typing that in a terminal takes way too long
00:27 Company: and autocomplete doesn't work either, because everything starts with "gnome-"
00:28 zf: KDE had the right idea :-)
00:45 alyssa: Company: rip gedit
00:46 alyssa: wait is gedit not gnome-text-editor
00:52 feaneron: it is not
00:52 mattst88: nope, separate thing
00:52 mattst88: gnome-text-editor is the new thing
00:54 alyssa: joy
01:01 Company: maintainer issues caused a fork
01:01 Company: so gedit still exists
01:06 Company: bunch of projects had some reckoning when the gtk4 transition happened, both because gtk4 nudged very hard towards design changes, from menu + toolbar towards "touch" design
01:06 Company: and because backends that weren't reasonably clean and operating under an X model suddenly had to deal with a toolkit that didn't bend over backwards to make that work on Wayland
01:07 Company: so you can no longer do updates via XCopyArea()
01:08 pac85: I thought gnome-text-editor was done from scratch, it looks very different than gedit (at least how I remember it)
01:09 Company: not sure - but 90% of it is GtkSourceView and that remained a thing
01:10 Company: it's either gedit with all the plugin stuff deleted, or something redone from scratch around the GtkSourceView port to GTK4
01:12 Company: https://gitlab.gnome.org/GNOME/gnome-text-editor/-/commit/bdf472712b995ec737a4913ac2f57cf89bf7bc4a
01:13 Company: it's a case of "what do I do now that there's a lockdown?"
01:24 alyssa: i mean go back far enough and krita is a gimp fork right? ;P
01:33 DemiMarie: alyssa: why is the userspace code not named Asahi too?
01:34 DemiMarie: Company: Yup, GTK4 very very much pushes one towards a certain design style, which makes porting some applications almost impossible. I have no idea how I would port Horizon EDA.
02:21 Company: DemiMarie: there's 2 answers to that. One of them is to find some modern UI designers to take on that topic, and change the parts that don't fit well anymore
02:21 DemiMarie: Company: fit well with *what*?
02:21 Company: DemiMarie: and the other option is to do an alternative/companion to libadwaita that focuses on that older style of application design
02:22 DemiMarie: Is the existing UI objectively bad in some way, even for applications that do not need touch support?
02:23 Company: none of that design is about touch support, that's just how people call it
02:23 DemiMarie: Is there scientific evidence that the newer design is objectively (as opposed to subjectively) superior?
02:23 Company: I have no idea
02:24 Company: though I'd be interested how anyone would quantify "better" for design
02:24 DemiMarie: “How long does X take?” studies would be the most obvious one I can think of, but there are people (not me) who actually do work on this stuff.
02:24 Company: but people like developing apps this way, so that's what's happening
02:25 DemiMarie: Is GTK no longer intended to be used without libadwaita or another platform library?
02:26 Company: I always compare it to the web for the answer
02:26 Company: can you make a webpage without some framework? Sure. Should you? Probably not.
02:27 Company: GTK is trying to push the widgets that imply some sort of UI design out of the platform
02:28 Company: and focus on the core building blocks
02:29 Company: but that leaves you without the base widgets that make up the UI - sidebars, headerbars, toolbars, statusbars, etc (why are those all bars?)
02:30 Company: and you also have no design language, i.e. no consistent spacing, no good color/contrast choices, all the theming stuff is missing
02:31 Company: and that's basically what a framework/platform lib gives you
02:34 Company: if someone made a library with a toolbar widget, a menu and some MDI docking widget, so that you could implement Gimp's and Inkscape's UI with it, then you could port those and the whole Cinnamon/Mate apps to it and you could probably find a bunch more
02:34 Company: gedit!
02:34 Company: but you need to find someone who wants to create that library, and there's been a distinct lack of interest for years
02:35 DemiMarie: At least some are just leaving GTK instead.
02:39 Company: that's also an option - depends on how much UI you have and how well other stuff fits
02:41 Company: all I know is that the Gnome community is not gonna make it happen
03:00 Lynne: is there some database of the throughput of GPUs in terms of 32/16/8-bit (non-matrix) integer ops?
08:39 MrCooper: RAOF: my recommendation is to just signal the release point when the atomic commit completion event arrives, not materialize any fence before
08:40 MrCooper: zamundaaa[m]: even a client which prefers signalled release points might reasonably say "the release point has a fence, I can re-use this buffer, don't need to allocate another one"
09:03 emersion: sounds like a broken client
09:07 MrCooper: why? That's how I'd implement it for dynamic number of buffers
09:08 MrCooper: if there being a fence doesn't imply the buffer can be re-used, what does?
09:38 emersion: MrCooper: it's easy to wait for the timeline point to be signalled
09:38 emersion: as opposed to materialized
09:39 MrCooper: right, then there's no point materializing the release point from OUT_FENCE_PTR though?
09:41 MrCooper: materializing the release point with a compositor GPU work fence and the client re-using that buffer does make sense though
09:43 emersion: MrCooper: hm, what's the difference between a GPU work fence and a KMS work fence? why does it make sense to use one but not the other?
09:43 emersion: i suppose because one will happen "sooner" than the other?
09:43 MrCooper: GPU fence is guaranteed to signal ASAP, OUT_FENCE_PTR might miss a display refresh cycle
09:44 MrCooper: and can't signal before the next cycle in the first place
09:44 emersion: right
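(For illustration: a minimal sketch in C of the OUT_FENCE_PTR mechanism under discussion, using the libdrm atomic API. The property and object IDs are assumed to have been looked up beforehand; the kernel returns the sync_file fd synchronously from the commit ioctl, and the fence signals when the commit's flip actually happens.)

    /* Sketch: request an out-fence for an atomic commit. Assumes drm_fd,
     * crtc_id and the CRTC's OUT_FENCE_PTR property id (out_fence_ptr_prop)
     * were already looked up via drmModeObjectGetProperties(). */
    #include <stdint.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    int commit_with_out_fence(int drm_fd, drmModeAtomicReq *req,
                              uint32_t crtc_id, uint32_t out_fence_ptr_prop)
    {
        int out_fence_fd = -1;

        /* The kernel writes a sync_file fd into out_fence_fd during the
         * commit ioctl; the fence signals when this commit's flip happens. */
        if (drmModeAtomicAddProperty(req, crtc_id, out_fence_ptr_prop,
                                     (uint64_t)(uintptr_t)&out_fence_fd) < 0)
            return -1;

        if (drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL))
            return -1;

        return out_fence_fd; /* caller can poll this fd or import it */
    }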
09:44 lynxeye:seems to be as confused as emersion and RAOF
09:44 emersion: i agree
09:44 emersion: now i'm kinda wondering why OUT_FENCE_PTR exists in the first place
09:44 lynxeye: What's the point of the out fence if you can't use it to wait for the GPU (scanout) to be done with the buffer?
09:45 MrCooper: emersion: as a trap? ;)
09:45 emersion: :D
09:45 emersion: it would be useful for writeback, maybe
09:45 MrCooper: it could be used instead of completion events in some cases
09:46 MrCooper: e.g. when turning off a CRTC?
09:47 lynxeye: Wasn't the point of the fence that you could use it to pass back to whoever is waiting for the buffer to be free again instead of waiting for the atomic commit completion event and signaling that back to waiters?
09:50 MrCooper: lynxeye: if it was, it was ill-conceived
09:57 sima: MrCooper, emersion lynxeye so for some tiling gpus it actually makes sense to start rendering while the flip is scheduled but hasn't happened yet
09:58 sima: when you're extremely limited on memory you can kinda get 3 buffers but only allocate 2 by max pipelining everything
09:58 sima: and with tilers you can do the tiling preprocessing while the buffer isn't needed yet
09:58 sima: android apparently does that for some platforms, which is why that out fence exists
09:59 sima: there's also an entire can of worms on the kernel side around that out fence breaking the kernel-internal rules to avoid dma_fence deadlocks
09:59 lynxeye: MrCooper: agreed. I mean the point of in and out fence FDs was to allow explicit fencing. But the out fence is useless for that, as scanout is potentially still using the FB after the out fence has signaled if no flip away from this FB is scheduled
09:59 sima: and I haven't found out a good way to fix it yet
09:59 sima: so, it's a bit a mess
10:00 sima: lynxeye, that sounds like compositor bug to me
10:01 lynxeye: sima: Huh? Unless your tile buffer covers the whole FB (which I guess is pretty unlikely) you can't start rendering even on a tiler until the buffer is no longer scheduled for scanout.
10:02 sima: lynxeye, yeah hence the fence, because that allows you to queue up the entire batch to the gpu already and the gpu to start with vertex shader and putting the vertices into the right buckets
10:03 sima: hwcomposer v1 went even further and made that out fence a full future fence iirc
10:04 sima: and just assuming that both surface flinger and the app would get around to scheduling the flip
10:04 lynxeye: sima: Okay, so OUT_FENCE_PTR is really just a footgun for everyone not reading the docs carefully in the general (non-android) case.
10:05 sima: yeah I think unless your use-case is "extremely memory limited machine where you realistically can only ever allocate 2 buffers and are willing to trade latency for throughput by pipelining everything as deeply as possible" you do not want it
10:05 sima: maybe we should add that to the property docs
10:06 sima: if someone volunteers to type this patch I could upfront r-b it?
10:17 lynxeye: sima: I guess I can add some words of caution there. I still don't see how signaling that fence without a follow-up flip queued does improve pipelining. I do see how making this fence wait for queued flips before signaling would add lots of state tracking to the kernel, as the fence is on an atomic commit level, not the much simpler FB level.
10:20 sima: lynxeye, hm maybe we're talking past each other still
10:20 sima: so you have two buffers, A currently being scanned out, B finished rendering, flip queued up
10:21 sima: now in the old days you'd need to stall the opengl stack until A becomes available
10:21 sima: with the out-fence userspace/app can start to queue up rendering and push it to the kernel
10:21 sima: and the gpu even start with vertex processing
10:22 sima: even while the flip is still queued up
10:22 sima: and if you miss the next vblank then sure there's going to be a stall, but the assumption is that you simply do not have memory for buffer 3, so there's no way to avoid it
10:23 sima: also with a more modern app/driver stack like vk you could do this yourself mostly
10:24 sima: but this was designed back when command parsing and relocations in the kernel were all the rage, and gl drivers were a lot dumber
10:24 sima: so queuing up in userspace means you still had to do that kernel work when the buffer was finally available
10:24 lynxeye: sima: Yea, I guess it makes sense for the case where you expect rendering and flips to be busy. It's just that it doesn't make a lot of sense to signal that fence when there is no B flip queued up, which is what causes the footgun trap.
10:25 sima: lynxeye, yeah that's just compositor being broken
10:26 sima: but that's the same for gpu compositing, if you just hand back a random out-fence for a buffer that has no relationship with when the buffer isn't in use anymore, then yeah that's a bug
10:27 sima: like gpu compositing might also need to later on recomposite, and if you've already signalled the out_fence for the only current buffer you have from the app, you're broken
10:27 sima: exact same thing applies to kms, except with kms direct plane scanout the recompositing is guaranteed
10:28 lynxeye: sima: It means the compositor can't implement a fenced release of the buffer by using the KMS out fence. With GPU rendering you can do that: if you know the client has a new buffer lined up for the next time you render, you can use the GPU render fence to pass back to the client for fenced release. If you do the same with the KMS out fence you might miss the vblank for scheduling the next flip and the fence signals too early, so the compositor/client can't use that for fenced release.
10:28 sima: oh I forgot: on some android they actually used the manual scanout buffer to hold the frame, so they could signal the buffer much earlier
10:28 sima: but that's again future fence semantics
10:29 sima: lynxeye, you only get the out_fence when you've submitted the flip to the kernel
10:29 sima: at which point if you miss the vblank your entire screen eats the miss and there's nothing you can do
10:29 sima: if you try to hand the fence back earlier it's a future fence, with all the perils that entails
10:30 MrCooper: lynxeye: the fence doesn't signal before the atomic commit completes
10:30 sima: and if you drop a frame because it's not ready but still hand out the out_fence for that flip back to the app, you're just broken
10:31 lynxeye: sima: You get the out fence when you submit the flip _to_ this buffer and it will signal when the buffer has been scanned out once. What you want for fenced release is a fence that signals when a next flip _away_ from that buffer is scheduled and scanout is done.
10:31 sima: lynxeye, yeah that's just busted
10:31 MrCooper: which is indeed what RAOF's plan was presumably, it's still problematic for the reasons I described
10:32 sima: and unless you're ok with trading latency for more throughput with deeper pipelining, you shouldn't even do that 2nd approach, like MrCooper explained
10:32 sima: but the first one you've described is just plain broken
10:33 sima: and it would also be broken on the gpu compositor path, if your compositor ever needs that frame again for a recomposition
10:34 sima: unless you go with the X11 school of "surely background color is an acceptable fallback to a damage event"
10:37 lynxeye: sima: Again, it works for the render composition path if you know the client has lined up a new buffer for the next composition cycle, as the next render composition will use the new buffer for sure. So you can pass the render fence to the client for fenced release (which is what weston does today). With the KMS fence that's not possible, even if you have a flip to another buffer scheduled it might still miss the vblank and you end up with scanout reusing the buffer after the out fence has signaled.
10:38 sima: lynxeye, how?
10:38 sima: like buffer A is currently being scanned out, B is queued up with a kms flip
10:38 sima: you tell the app that it can render into A as soon as the out_fence for B has signalled
10:38 sima: the kernel misses a vblank
10:39 MrCooper: lynxeye: again, the fence doesn't signal before the commit completes
10:39 sima: the out_fence is delayed appropriately, it will not signal on the next vblank, but the next vblank after the flip actually happened
10:39 MrCooper: so what you describe can't happen
10:39 sima: so how does the app manage to render into A while it's still being scanned out?
10:39 sima: the docs should be really clear on this, if they're not we need to fix them
10:40 MrCooper: lynxeye: the commit missing a refresh cycle is a problem in itself
10:40 sima: but that's not a "latency/throughput tradeoff" but a "it's just broken" thing
10:42 lynxeye: sima: Now I'm confused again. Why would the app want to wait until the out fence for B signals if it wants to render into A? That's a full scanout cycle of latency. Surely it would want to start rendering into A as soon as the display engine has flipped away from A?
10:43 sima: lynxeye, that's what will happen
10:43 sima: where do you see an additional vblank happening?
10:43 sima: the out_fence for B signals the moment the hw stops scanning out A and starts scanning out B
10:44 sima: it does _not_ signal when the hw has finished scanning out B for the first time
10:44 lynxeye: argh, seems I read the docs wrong _again_
10:44 sima: it's an out_fence for the flip itself, not for buffer B
10:44 sima: same applies to gpu compositing
10:45 sima: you don't need to wait with the out_fence for buffer A until you've finished rendering with B
10:45 sima: all you need is wait until you've committed to rendering with B and don't need A anymore
10:45 sima: so the out_fence for A would be whatever the end-of-batch fence for the last render job that used A is
10:45 sima: not the end-of-batch fence of the first render job that uses B
10:46 sima: otherwise you have a notch too much latency in your signalling
10:46 sima: exact same story with kms
10:49 lynxeye: right, so it's totally fine to plug the out fence from commit with buffer B into the fenced release for buffer A.
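(Concretely, a hedged sketch in C of that plumbing, using libdrm syncobjs; the handle and point names are illustrative, and the release timeline is assumed to come from something like the client's wp_linux_drm_syncobj release request. The out-fence fd from the commit that flips to buffer B is transferred onto buffer A's release timeline point.)

    /* Sketch: materialize buffer A's release point from the out-fence of
     * the commit that flipped to buffer B. */
    #include <stdint.h>
    #include <xf86drm.h>

    int signal_release_from_kms_fence(int drm_fd, int out_fence_fd,
                                      uint32_t release_timeline,
                                      uint64_t release_point)
    {
        uint32_t tmp;

        /* sync_file fds can only be imported into binary syncobjs... */
        if (drmSyncobjCreate(drm_fd, 0, &tmp))
            return -1;
        if (drmSyncobjImportSyncFile(drm_fd, tmp, out_fence_fd))
            goto err;
        /* ...then transferred onto the timeline point the client waits on */
        if (drmSyncobjTransfer(drm_fd, release_timeline, release_point,
                               tmp, 0, 0))
            goto err;
        drmSyncobjDestroy(drm_fd, tmp);
        return 0;
    err:
        drmSyncobjDestroy(drm_fd, tmp);
        return -1;
    }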
10:54 MrCooper: it works correctly, so it's "fine" if you don't care about the issues I described
11:04 lynxeye: MrCooper: But isn't this a policy decision you would want the client to make? If it doesn't want to allocate more buffers for any reason, it can pick a buffer with an unsignaled release fence for the next rendering, potentially waiting for a missed vblank or whatever. If the client cares about avoiding that, it should pick a buffer with the fence already signaled or potentially allocate a new one. How is this different from a GPU render fence received from the compositor?
11:05 MrCooper: how can the client know if it's a KMS fence or a GPU fence?
11:05 MrCooper: (warning, rhetorical question :)
11:06 MrCooper: if it can't, it can't make that choice
11:07 lynxeye: MrCooper: Why does the client care? If the buffer release fence is unsignaled it may stay in that state for quite a while, regardless if it's a KMS or GPU render fence. If you want to avoid blocking at any cost, you must use a buffer with a signaled fence.
11:08 MrCooper: as explained before above: GPU fences are guaranteed to signal ASAP, KMS ones aren't
11:09 MrCooper: look, feel free not to trust me on this, you can't say I didn't warn you though :)
11:12 lynxeye: MrCooper: I still don't get the difference. The flip for the KMS fence is queued, so it's ASAP as-in whatever the next reachable vblank is. The job signaling the compositor GPU render fence might be delayed in the same way by another job hogging the GPU queue.
11:13 lynxeye: If the client want to avoid blocking on buffer availability it must choose a buffer where the release fence is already signaled.
11:13 MrCooper: to put it differently, it's very unlikely that any future GPU work by the client could start before a GPU fence from the compositor signals anyway, this isn't true for a KMS fence though
11:14 MrCooper: a job which blocks the compositor GPU work also blocks future client GPU work
11:15 lynxeye: MrCooper: Agreed, if you talk about a single GPU with one execution queue. In a hybrid setup you might run the composition on a different GPU than the client.
11:15 MrCooper: true
11:16 MrCooper: (having a déjà vu right now :)
11:18 MrCooper: that just makes GPU fences more problematic though, not KMS ones less so
11:18 lynxeye: right
11:19 lynxeye: I guess what I'm saying is that the client should always expect the fences to be problematic ;)
11:20 MrCooper: indeed, so clients should probably only re-use a buffer with unsignalled release point if they can't allocate another one
11:25 lynxeye: yep, randomly picking a buffer just because you received a fenced release for it is a recipe for hurt, maybe just more pronounced right now if the release fence happens to be a KMS one.
11:30 sima: MrCooper, there's also multi ctx scheduling in pretty much all hw
11:31 sima: or most at least
11:31 sima: plus if someone hangs on a ringbuffer gpu you look at a multi-second timeout
11:31 sima: so not sure why the gpu fence is less problematic than the kms one
11:32 sima: my take is this is down to how much memory you can waste, and where you are on the latency/perf tradeoff
11:32 sima: which is really tricky
11:32 MrCooper: right (amdgpu being a notable exception, it's getting there though :), still unlikely that future client work can preempt already-flushed compositor work though
11:32 sima: yeah you generally don't win against the compositor
11:33 sima: but the latency/throughput still applies
11:33 sima: like maybe app wants to queue up the next frame as soon as possible, because it's cpu intensive to do that
11:33 sima: or it wants lowest latency and does everything super late close to next vblank
11:33 sima: or whatever the app feels like
11:34 sima: so I'm with lynxeye that I'm not seeing why returning a buffer with an unsignalled out_fence is harmful
11:34 sima: no matter which one
11:34 sima: if you don't care about memory, just allocate more winsys if you'd block otherwise
11:34 sima: if you don't, pick the right choice according to your latency/throughput goals
11:35 sima: ofc if the app is dumb and just blindly starts rendering the moment it gets a frame back
11:35 sima: you get to keep all the pieces
11:35 MrCooper: my point is that assuming the client uses the same GPU as the compositor, it can use a buffer with an unsignalled release GPU fence without penalty, whereas this isn't true with a KMS fence
11:35 sima: unless the goal was intentionally to not ever waste memory and prioritize throughput
11:35 sima: MrCooper, but does that case exist?
11:36 sima: like with a reasonable compositor you don't start the next frame before the previous one finished
11:36 sima: so if the compositor then picks a new buffer for that rendering, the old buffer is already not in use
11:37 sima: because the out-fence for the buffer is when it was last used for rendering, _not_ the out-fence for the first rendering of the next buffer
11:37 MrCooper: not sure what you're asking, it's like the majority of GL and possibly Vulkan apps
11:37 sima: I'm asking whether the compositor in the gpu path actually ever hands back a buffer with a non-signalled outfence
11:37 sima: unless it's kinda busted
11:38 MrCooper: mutter does
11:38 sima: or the compositor already decided to toss latency overboard, at which point the app trying is pointless
11:38 sima: how does that happen?
11:39 MrCooper: it may set the GPU fence of its last compositing work on the release point as soon as the client has attached another buffer, which may be before that work has finished
11:40 lynxeye: same for weston iirc
11:40 sima: but that's the "we tossed latency already" case
11:41 sima: if the compositor doesn't toss latency, it waits with picking which buffer when it composites the next frame
11:41 sima: at which point the previous has finished
11:41 MrCooper: not sure what you mean by "toss latency"
11:41 sima: unless your app renders so fast that the new app frame finishes faster than the gpu compositing of the desktop
11:41 MrCooper: if anything this helps latency, doesn't hurt it
11:42 sima: if the compositor already commits to using B while A is still being used it might miss a frame because B isn't done yet
11:42 MrCooper: not what I'm saying
11:42 sima: so I'm assuming we have a compositor which does a late decision about which frame, shortly before the point of no return for the next vblank
11:42 MrCooper: commits to using B while GPU work using A is still in flight
11:43 sima: then quickly queues up gpu work and issues the kms flip
11:43 MrCooper: gamescope maybe
11:43 sima: MrCooper, at that point, is B finished rendering or fence still pending?
11:44 MrCooper: finished (in mutter and most other compositors)
11:45 MrCooper: actually mutter also does something like what you describe (though the mechanics work a bit differently)
11:46 sima: so if B is finished, but A isn't yet, the app is rendering much faster than your compositor
11:46 sima: you're not going to have any problem at all
11:46 MrCooper: unless you care about benchmark numbers ;)
11:46 sima: but if the compositor commits to B before it's finished, it's not prioritizing latency
11:47 daniels: fwiw OUT_FENCE_PTR was indeed written to support the case where people wanted to queue up deeper pipelines of work without necessarily caring about immediate latency or hitches
11:47 sima: MrCooper, the app allocates more frames to keep the benchmark people happy and the compositor hopefully does mbox semantics for flips?
11:47 daniels: if you are gunning for the absolute minimal possible latency and getting some kind of new content on every refresh no matter what, then that is not the hammer for you
11:48 sima: unless you're super constrained and want to limit to just 2 buffers, at which point you'll block until the previous one is available no matter what
11:48 sima: and would much prefer you can block on a dma_fence since that block point is later
11:48 MrCooper: sima: the point is that if the client re-uses a buffer with an unsignaled KMS fence, its frame rate will be capped to the display refresh rate
11:48 sima: MrCooper, isnt' that the point?
11:49 sima: if you want free-wheeling mbox winsys flips, you need to make sure those happen
11:49 MrCooper: not if the client wants to go as fast as possible, which it can with unsignalled GPU fences
11:49 sima: and sure usually the kms flip fence takes a bit longer than the gpu flip fence, but it would still not be mbox
11:50 sima: MrCooper, who says your compositor is not super dense and queued up that gpu fence behind a kms flip out_fence?
11:50 sima: if you want free-wheel, you need to do that
11:50 MrCooper: k, this is getting too hypothetical
11:50 daniels: sima: it's not just about being super-constrained, but by the time you've allocated a buffer with all the disruption that ensues (mmu etc), you're probably too late anyway
11:52 sima: I'm still not sure what's the use-case beyond benchmark numbers
11:52 sima: like if you have something like gamescope you still want to free-wheel and absolutely ignore every buffer with unsignalled out-fence
11:53 sima: daniels, yeah but aside from startup that shouldn't happen during runtime
11:53 daniels: yeah. if you want to build something like kmscube, then use OUT_FENCE_PTR and it'll be useful to you. if you want to build something like gamescope, build gamescope instead and don't use OUT_FENCE_PTR because it's not useful to you.
11:53 daniels: I don't think there's anything controversial there
11:53 lynxeye: I guess the takeaway is simple: if you don't want to block unexpectedly, don't use random buffers with unsignaled release fences, in which case you don't care if it's a render or kms fence and also don't care about compositor policy regarding latency vs. throughput.
11:53 MrCooper: I never claimed anything else; if you're fine with your compositor potentially producing orders of magnitude lower benchmark numbers than others, go for it! ;)
11:54 daniels:shrugs
11:56 sima: daniels, there at least was more than just kmscube that wanted out_fence for deeper queues
11:56 sima: but it's really only "I can allocate 2 buffers but not 3 because simply not enough memory"
12:08 daniels: or, you do have 3 or 4 buffers, but for whatever reason (ease of design, deep hardware pipelines, slow hardware, whatever), you queue work up long in advance
14:35 rgallaispou: Hi guys, regarding this patch https://lore.kernel.org/dri-devel/20241115-drm-connector-mode-valid-const-v1-3-b1b523156f71@linaro.org/ Is it better to wait for someone to merge the whole series, or can I apply it on -next independently?
15:15 alyssa: karolherbst: How do we feel about enqueue_kernel in Mesa? *sweat*
15:18 alyssa:doesn't love the syntax but
15:37 karolherbst: alyssa: I haven't thought about how to implement that one at all
15:37 karolherbst: but it's required for CL C 2.0 support
15:46 alyssa: my objection to it is that the argument passing model feels weird
15:46 alyssa: I want to just call `kernels` with arguments
15:46 alyssa: instead of this weird closure trampoline thing which will surely not translate to good code without backflips
15:50 alyssa: like I can't enqueue the same kernel from both host and device without compiling it twice, effectively
16:15 alyssa: yeah.. studying the spir-v more, this is definitely implementable but it would mean extra variants
16:15 alyssa: maybe that's ok though?
16:16 alyssa: Just feels really silly
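(For context: a minimal OpenCL C 2.0 sketch of the device-side enqueue under discussion; the kernel and variable names are illustrative. The ^{ ... } block literal is the closure that clang hoists into a separate function, with its captures packed into a struct passed as a generic pointer — the trampoline alyssa objects to.)

    /* Device-side enqueue: the block captures `out` and `bias` by value;
     * clang turns it into a separate function plus a capture struct. */
    kernel void parent(global int *out, int bias)
    {
        queue_t q = get_default_queue();
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                       ndrange_1D(64),
                       ^{ out[get_global_id(0)] = bias; });
    }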
17:38 robclark: sima, mlankhorst: any idea what this lockdep splat is about:
17:38 robclark: https://www.irccloud.com/pastebin/ZpISiGMV/
17:41 sima: robclark, huh, never seen one of those
17:42 sima: bit dinner time already, but let me try at least
17:42 robclark:neither
17:44 sima: hm, held_lock->references is a uint, at least here
17:44 sima: so overflowing that with lots of gem bo seems unlikely
17:47 robclark: tbf this is sparse/vm_bind type thing, so same bo could appear many times (but I think drm_exec should just be skipping the dups)
17:48 robclark: still, I don't think it would be 2^32
17:50 sima: robclark, we should skip the already locked ww_mutex for the EDEADLK case, those should never get to lockdep
17:51 robclark: right
17:52 sima: well for the initial trylock case, but the lock_acquired is only for the success case
17:52 sima: robclark, feels like quicksand and I'm scared
17:52 sima: can you repro this reasonably well?
17:53 robclark: so far just saw it once when running dEQP-VK.sparse_resources.buffer.\*
17:53 sima: hm
17:54 robclark: I can see if it is repeatable, but so far sample size of 1
17:54 sima: I guess first try to repro reliably or faster because I have no idea what's up here
17:54 sima: and then maybe we can try to trace ww_mutex_lock for dma-buf and see what's up
17:55 robclark: k.. just wanted to see if anyone else was familiar with that before I spent more time on it vs debugging $my_bugz
17:55 sima: nah this sounds like lockdep internals gone very wrong potentially
17:56 sima: but it's current->held_locks and if you somehow manage to corrupt that I'd expect the entire kernel to crash&burn much earlier
17:58 robclark: well, I _am_ playing with objs, locks, and mapping .. so can't rule out corrupting things, but yeah, I'd expect more of a fireball if things went wrong there
17:58 sima: robclark, before you waste time trying to repro
17:59 sima: ww_mutex_lock on the same lock in a loop, until you've gotten -EDEADLK UINT_MAX times?
17:59 sima: because I'm not entirely sure that's handled correctly, and it's about the only guess I have about what could go wrong
17:59 sima: because I'd guess you do not actually have UINT_MAX distinct ww_mutex in your machine
18:01 robclark: hmm, we could be re-using the same obj (and lock) many times..
18:02 sima: yeah
18:02 DemiMarie: MrCooper: why would a program ever want to render more than once per frame? That is just wasting the user’s GPU and electricity.
18:02 sima: and then maybe walk those a few times
18:02 sima: and then maybe an accounting bug in lockdep so that you exhaust much quicker than 2^32 attempts
18:02 sima: a stretch at best, but the only one I've come up with
18:02 robclark: seems plausible
18:07 sima: robclark, and allocate the lock from a gem_bo or so, because lockdep handles allocated locks differently from static ones
18:07 sima: just to make sure you're not chasing the wrong phantom here
18:08 robclark: yeah, will hack that up in ~5 or so.. just looking at a different bug first
18:12 sima: I'll get myself stuffed with raclette meanwhile, it's ready now
18:12 robclark: enjoy
18:24 mattst88: mmmm, raclette
18:43 karolherbst: alyssa: I didn't even check how it's represented in the spirv, but I expect that's implemented with some new nir intrinsic we'd tell drivers to implement? Don't really know if this feature is all emulated in runtimes or if there's actually hw supporting it natively
18:44 alyssa: karolherbst: ye, the spirv is a pretty straightforward translation of the cl
18:44 alyssa: clang hoists the block function into its own function, lays out a structure for the capture, and passes it in as a u8* generic pointer
18:44 karolherbst: right...
18:44 alyssa: but the heavyweight enqueue is going to be in driver runtimes
18:45 karolherbst: I was more wondering to what enqueue_kernel translates to
18:45 alyssa: similar problem space to DGC in vulkan
18:45 karolherbst: like what spirv instruction is used there
18:45 alyssa: it's just an EnqueueKernel spirv instruction
18:45 alyssa: or something
18:45 alyssa: with a function pointer thing
18:46 karolherbst: ohh "OpEnqueueKernel"
18:46 karolherbst: yeah, now I also find it in the specs 🙃
18:47 karolherbst: I'd have ideas on how to implement it on nvidia, just 0 ideas on how to do it in gallium, but maybe that's just gonna be a driver thing in its entirety
18:47 karolherbst: + whatever lowering we do in nir
18:48 karolherbst: there is also the problem of fencing or rather.. events it's called here
18:48 karolherbst: but that's just "OpGroupWaitEvents" I guess
18:49 DemiMarie: Where is the right place to request a Mesa-specific extension? WebGL and WebGPU implementations generating SPIR-V are having to resort to ugly hacks because Vulkan requires that all shaders terminate.
18:50 airlied: why would that be mesa specific?
18:53 karolherbst: at least there is no relation between API and kernel events 🙃 that would be cursed
19:07 robclark: sima: no luck with the repeated locking in a loop, so I guess it is more complicated to repro than that
19:08 robclark: it seems to be dEQP-VK.sparse_resources.buffer.ssbo.sparse_binding_aliased.buffer_size_2_24 which triggers it
19:11 robclark: hmm, well maybe that was just coincidence
19:15 pixelcluster: DemiMarie: can you clarify? what hacks are you talking about? "shaders that don't terminate" doesn't sound like something that will really work ever to me but I might be missing something
19:15 sima: robclark, hm running low on ideas then
19:20 karolherbst: DemiMarie: shaders not terminating would need an entirely new entry point to be launched and probably have to be compute only
19:20 karolherbst: if you want to do compute, don't use graphics
19:21 pixelcluster: well, what valid webgpu/webgl app would ever want non-terminating shaders in the first place?
19:21 karolherbst: vulkan probably will need an extension for long running compute jobs anyway to implement CL on top of it tho
19:21 pixelcluster: oh yeah, for actual compute tasks I totally see the use case for persistent thread-style stuff
19:21 karolherbst: pixelcluster: at some point non-terminating or shaders running for 5 minutes doesn't really make much of a difference :D
19:21 pixelcluster: just not... webgl :P
19:22 karolherbst: so yeah.. I can totally see the use case for compute
19:22 karolherbst: but it might require a special entry point to launch such tasks
19:22 karolherbst: "just make shaders run long" won't fly
19:23 pixelcluster: yeah, I'm also kind of unsure about the lower levels of the stack
19:23 karolherbst: though with vulkan it could be an extension struct passed on the enqueue for compute, having special timeout properties
19:23 karolherbst: like "vk_ext_explicit_compute_timeout" or something
19:23 DemiMarie: pixelcluster: If a non-terminating shader was guaranteed to either cause `VK_ERROR_DEVICE_LOST` or just hang, that would be fine. The problem is that compilers are allowed to assume that shaders terminate, and a malicious shader can use this to defeat compiler-inserted bounds checks.
19:24 karolherbst: and then disallow using the fences for anything not compute
19:24 karolherbst: DemiMarie: well you can't prove they'll never terminate
19:24 pixelcluster: DemiMarie: well, such bounds checks failing should never constitute a security risk
19:24 pixelcluster: should they?
19:24 DemiMarie: pixelcluster: yes, they do
19:25 karolherbst: normal drivers will nuke the context anyway
19:25 karolherbst: so where is the issue?
19:25 karolherbst: or rather.. drivers should by default set up sane timeouts for GPU jobs
19:25 pixelcluster: how is OOB access by shaders a security concern?
19:25 DemiMarie: https://github.com/gfx-rs/wgpu/issues/6528 and https://github.com/gfx-rs/wgpu/issues/6572
19:25 pixelcluster: as in, this sounds like an issue that should rather be fixed
19:25 karolherbst: DOS is also a security issue
19:26 pixelcluster: oh yeah
19:26 DemiMarie: pixelcluster: The shaders are provided by websites, which are untrusted and might be malicious.
19:26 pixelcluster: but you get those even if you don't have the bounds check problem
19:26 karolherbst: DemiMarie well it's up to the browser to ensure there are no data leaks across tabs
19:26 karolherbst: if a browser uses the same VK "context" across tabs it's a bug in the browser
19:27 karolherbst: because it opted into sharing state across tabs by doing so
19:27 pixelcluster: DemiMarie: sure, ok, let me ask another way: how can OOB accesses result in worse behavior than having your context killed (which is what non-termination will do anyway, so it must be fine)?
19:27 DemiMarie: karolherbst: Is VK OOB guaranteed to not corrupt CPU memory, and do real browser implementations actually do this?
19:27 karolherbst: DemiMarie: GPU memory uses virtual memory
19:27 DemiMarie: pixelcluster: they can allow a website to access or tamper with data that it should not be able to access, and therefore perform a security exploit.
19:27 karolherbst: so it's like one application doing OOB can't mess with other applications
19:27 karolherbst: same thing
19:28 DemiMarie: karolherbst: Web browsers must not allow arbitrary native code execution, sandboxed or otherwise.
19:28 karolherbst: if your GPU doesn't have an MMU to do virtual memory, then maybe don't use the GPU
19:28 pixelcluster: well
19:28 pixelcluster: if the GPU doesn't have a MMU then how is it going to implement Vulkan
19:28 karolherbst: DemiMarie: well they do
19:28 karolherbst: and anyway, that's not the issue
19:29 karolherbst: an OOB access can't corrupt the state of other applications or GPU contexts
19:29 DemiMarie: karolherbst: there are of course vulnerabilities, but they are vulnerabilities, and that means that they must be fixed
19:29 karolherbst: if it can in your browser, it's a browser bug
19:29 DemiMarie: the issue is that an OOB access can corrupt the state of the application performing the access
19:29 karolherbst: yeah, life is rough sometimes
19:29 karolherbst: that's why browsers should sandbox their tabs
19:30 DemiMarie: That is good enough for almost everything, but browsers need to guarantee that there are no OOB accesses at all.
19:30 karolherbst: they don't
19:30 DemiMarie: karolherbst: that's why you should uses Chromium
19:30 DemiMarie: karolherbst: they try, and when they fail, they try again :)
19:30 karolherbst: an OOB access is pretty harmless if you ignore driver bugs
19:31 karolherbst: what's the threat model here anyway? you visit a website and.. it hangs your tab?
19:31 pixelcluster: actually
19:31 karolherbst: web browsers have all the tools necessary to isolate things properly
19:31 pixelcluster: which guarantees about OOB access are we even talking about
19:32 karolherbst: I don't even know, but I hope bda isn't part of webgpu 🙃
19:32 pixelcluster: I don't think any API (Vulkan especially not) guarantees anything about what happens in OOB accesses
19:32 DemiMarie: karolherbst: If one isn’t running the UMD in-process, then yes, an OOB GPU access is harmless. That’s why native contexts work.
19:32 pixelcluster: it's UB in exactly the same way non-terminating shaders are
19:33 karolherbst: DemiMarie: OOB GPU accesses do nothing to other applications
19:33 DemiMarie: The problem is that browsers do not use kernel ioctls directly, but rather userspace APIs.
19:33 karolherbst: the security boundary is your GPU context
19:33 karolherbst: you can have many of them in vulkan
19:33 karolherbst: use one per tab
19:33 karolherbst: done
19:33 karolherbst: if browser share it between tabs it's their problem and their bug
19:34 pixelcluster: erh
19:34 pixelcluster: I think in practice you want to use a process per tab
19:34 karolherbst: sure
19:34 pixelcluster: but yes
19:34 karolherbst: and a real vulkan instance per tab and all that
19:34 zamundaaa[m]: pixelcluster: VK_EXT_robustness2 does make guarantees about OOB accesses
19:35 karolherbst: vulkan has a bit of documentation on security boundaries across objects
19:35 pixelcluster: oh right lol robustness exists
19:35 karolherbst: if those aren't enough, then one can always add an ext tightening it up more
19:36 pixelcluster: well in any case, terminating shaders really aren't the (only) problem, what you are asking for is a subset of SPIR-V that has no UB at all
19:36 karolherbst: right.. but then you have vulkan features like bda which throw away your oob checks, so we can only hope that bda isn't exposed in webgpu :D
19:36 pixelcluster: I don't think it is, for obvious reasons :D
19:36 karolherbst: it's the wrong solution for this problem anyway
19:36 karolherbst: it would be fun though (tm)
19:36 pixelcluster: yes, UB is a thing and we won't get rid of it
19:37 DemiMarie: pixelcluster: nah, the problem with infinite loops specifically is that it is very hard to get rid of them without incredibly disgusting workarounds, like explicit iteration counters.
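(A minimal GLSL-style sketch of the explicit-iteration-counter workaround Demi mentions; the buffer layout and the 65536 budget are illustrative. The guard makes the compiler's forward-progress assumption trivially true, so it can no longer use "this loop terminates" to justify dropping inserted bounds checks.)

    #version 450
    layout(local_size_x = 64) in;
    layout(std430, binding = 0) buffer Data { uint n; uint vals[]; };

    void main() {
        uint guard = 0u;
        // `n` is attacker-controlled, so this loop may never terminate;
        // the explicit counter forces an upper bound on iterations.
        for (uint i = 0u; i != n; i++) {
            if (guard++ > 65536u) break;
            vals[i % 64u] = i;
        }
    }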
19:38 karolherbst: UB isn't a security problem in applications, and it's not one in GPU programming either as long as you don't pick holes in your security boundaries
19:38 karolherbst: well
19:38 karolherbst: it's a problem in applications like sshd 🙃
19:38 karolherbst: but that's not what we are talking about here anyway
19:41 DemiMarie: My preferred solution to all of this would be to compile Mesa to WebAssembly and give it access to the kernel ioctls via a virtGPU native context interface. If the WebAssembly module gets popped, who cares, it's in the browser process which already runs arbitrary WebAssembly. The hardware protections ensure that the compiled shader code can't do harm.
19:41 DemiMarie: Unfortunately, browser vendors don't take this approach, probably because they have to support platforms with proprietary drivers.
19:41 karolherbst: it's also another platform specific path
19:42 pixelcluster: at this point what's the difference between a webassembly module and a properly isolated renderer process?
19:43 karolherbst: the API used
19:44 karolherbst: though I fully understand browsers not wanting to isolate to such an extreme extent, because having 500 open tabs can kinda cause funky problems 🙃
19:44 karolherbst: though not sure if GPU contexts are that limited on modern GPUs
19:44 karolherbst: prolly not at all
19:44 pixelcluster: you could also restrict the fancy isolation to the fancy apis
19:44 pixelcluster: if you have 500 tabs each running webgpu shaders you have bigger problems
19:45 DemiMarie: Maybe exposing GPU access to random websites without user consent was not a great idea 😆
19:45 DemiMarie: If WebGL and WebGPU required a permission it would prevent most of the abuse.
19:45 linkmauve: karolherbst, I currently have exactly 15321 open tabs. :p
19:46 linkmauve: And Firefox still works!
19:46 Sachiel: WebGL and WebGPU requiring permissions would just be an extra click from users away from the abuse
19:46 karolherbst: though did it create 15321 GPU contexts for its per-page rendering?
19:46 karolherbst: I'm sure browsers are a bit smarter than that
19:46 linkmauve: I doubt so. ^^
19:46 psykose: a popup asking "Wanna get hacked? Y/N" wouldn't meaningfully change what browsers have to do to sandbox webgpu use at all
19:48 psykose: reminds me of cookies and privacy policies
19:48 karolherbst: reminds me of excel
19:48 psykose: hah
19:48 karolherbst: and people click accept anyway
19:49 DemiMarie: linkmauve: how many of those tabs are actual websites?
19:49 karolherbst: yeah, apparently asking users "this thing wants to do X, and if you press deny it might not work" isn't really getting users to say deny 🙃
19:55 linkmauve: DemiMarie, all of them!
19:55 linkmauve: Of course not currently loaded.
19:55 DemiMarie: linkmauve: Wow!
20:04 sima: DemiMarie, so with mesa vk if you create a vkdevice per security context nothing should ever escape that gpu box, not even to your cpu side process
20:04 DemiMarie: sima: Nice, thanks! That should be enough for browsers then.
20:04 sima: unless you just put your cpu side datastructures into cpu mmapped gpu memory ofc
20:04 sima: but don't do that
20:04 sima: at least that's what I'd say we're aiming for from the kernel side
20:04 DemiMarie: sima: does Mesa do that?
20:05 sima: for gl the situation is a mess, but with arb_robustness you should have enough isolation to also not blow up too badly
20:05 sima: DemiMarie, it would be really, really stupid
20:05 sima: I've seen one really old intel libva that abused gpu bo for cpu datastructures and that's by far not the worst thing that codebase did
20:05 DemiMarie: sima: gotcha, just checking
20:05 DemiMarie:wonders what the worst thing actually was
20:06 sima: I'm not going to uncover those nightmares
20:06 DemiMarie: Fair :)
20:06 sima: but I've watched developers who really don't shy away from reworking terrible code to fix it fold in less than a day of sifting through it
20:07 DemiMarie: yeah at that point I would just do a from-scratch rewrite
20:07 sima: anyway for gl's arb_robustness I'm less sure how solid it is everywhere, so if you're paranoid maybe just limit to vk
20:08 sima: also don't enable the hsw vk implementation in anv, but I think that might have gotten nuked meanwhile
20:11 sima: DemiMarie, oh and ofc the usual "this is all aspirational, I'm not speaking for any vendor team including intels" disclaimer, but I think as a dri-devel stance it should be pretty solid
20:12 DemiMarie: sima: that's good enough for me :)
20:12 DemiMarie: (in particular, it is good enough for Qubes OS to enable native contexts at some point, which rely on this guarantee, at least as an opt-in option)
20:12 sima: it's after all still a horribly complex kludge of stuff written in C in both kernel and fw, plus hw is disappointingly also not bug free as we learned the hard way last few years :-/
20:13 DemiMarie: Is the FW generally of good quality code-wise?
20:13 DemiMarie:wonders if it is time to write drivers and FW in Rust
20:14 sima: DemiMarie, yeah, even more so going forward if someone comes with a mesa vk driver and the kernel side doesn't just use standard vm_bind with full blown gpu mmu and hw context it's really questionable we'll consider it for merging I think
20:14 sima: just too busted design imo
20:14 sima: DemiMarie, I've never seen any fw code in my life for any gpu
20:14 DemiMarie: sima: Ah, I was thinking that since you work for Intel you would have at least talked to the people who did write it.
20:14 sima: for rust in the kernel, we'll hopefully get there, but it's going to be a while
20:15 DemiMarie: good news is that one can mix Rust and C, even in the same driver
21:05 benjaminl: how does the bot decide which labels to put on mesa MRs?
21:08 karolherbst: based on touched files
21:10 benjaminl: hmm, so I have an MR with a commit that touches src/vulkan/runtime, but it didn't get the vulkan label
21:10 benjaminl: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32127/diffs?commit_id=3ce9a5a6093c56beb32e0498f257fabdc31b402c
21:12 benjaminl: oh, is this because that commit wasn't in the initial MR?
21:12 karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/.mr-label-maker.yml?ref_type=heads
21:12 karolherbst: ahh yeah
21:12 karolherbst: the bot only does the scan once
21:12 benjaminl: that makes sense, thanks!
21:13 benjaminl: is there a way to retrigger the bot, or should I just get somebody with permissions to change the labels if I run into this again?