00:11 DemiMarie: jenatali: I had not thought of secondary command buffers.
00:12 DemiMarie: I thought that the command buffers were handled by what is essentially a hardware state machine.
04:13 DemiMarie: bnieuwenhuizen: what I really want is some evidence from Intel of the security of the firmware.
04:39 Lynne: airlied: for some reason rdna2 supports av1 at 422, maybe we should wire up support for it
04:40 Lynne: or at least vadumpcaps says it does, no idea if it lies or not, but it does say it outputs 422 with nv12 and p010, which is wrong
06:10 airlied: Lynne: not sure, vaapi always reports that; I think it'll do internal copies
06:28 Lynne: to convert pixfmts? yuck
06:29 Lynne: it does advertise p016 for all other codecs, though, which is direct
06:29 Lynne: probably a missing line
06:31 airlied: Lynne: I'd have to test vaapi with 420 and 422 to see what it actually programs the hw
06:33 airlied: Lynne: h264 might support it as well
06:39 airlied: Lynne: status queries for anv should work now
06:45 Lynne: I'll test it
06:54 Lynne: yup, it's working
06:56 Lynne: do you need a command line to generate 422 samples?
07:07 airlied: Lynne: yeah throw some at me
07:14 Lynne: ffmpeg -i <input> -c:v libx264 -pix_fmt yuv422p -y test_422.mkv
07:15 Lynne: for 10bit 422, s/yuv422p/yuv422p10
07:19 airlied: just going to try and get anv to pass cts before I get to that
07:19 Lynne: kk
07:20 Lynne: what is radv waiting for to get the patchset merged?
07:21 airlied: more review I think
07:22 airlied: so really either bnieuwenhuizen or hakzsam to throw some more criticism at it :-P
07:22 Lynne: btw you can replace libx264 with libx265 for hevc, just keep in mind it'll generate a rext file due to a bug
07:22 airlied: though in fixing anv I'm seeing some minor fixes for radv
07:23 Lynne: you can still decode them fine (probably) by passing -hwaccel_flags +allow_profile_mismatch before -i when decoding
08:05 airlied: okay anv passes all the current CTS tests
08:06 airlied: or will once I rebase/push it
08:06 airlied:is wondering if I can do separate dpb/dst on intel
08:08 airlied: ah no I just misread my code, coincide it is
08:22 dj-death: airlied: it's just on SKL?
08:27 airlied: dj-death: SKL+ though I have to test it on DG2 to make sure it works on both ends of the spectrum
08:28 airlied:doesn't have an integrated gen11/12
08:28 airlied: h265 is also going to be messy, as I think the vulkan API as specced would need HuC
08:29 airlied: I have to work out if my DG2 board loads huc at all, I fell down the twisty mei paths
08:29 airlied: dj-death: Lynne tested on an SKL, and I've tested on a whiskeylake so far
09:02 dj-death: airlied: appears to load fine on my dg2
09:02 dj-death: airlied: I'm on drm-tip 6.2.0-rc3+
09:02 dj-death: just need the right version of the blob
09:07 MrCooper: daniels danvet emersion jekstrand: if user-mode queues & fences and Vulkan wait-before-submit semantics can be handled in the kernel (which seems required for UMF to be usable by display servers), do we really need explicit sync in the display protocols?
09:14 danvet: MrCooper, can it be handled in the kernel?
09:15 MrCooper: which part specifically?
09:15 danvet: well thus far I've seen some hand-waving that we just stuff umf into dma_resv somehow on the sideline
09:15 danvet: and pretend everything keeps working
09:15 danvet: but also to handle the umf wait-before-submit in the kernel
09:16 danvet: like generally with umf you handle this in userspace by putting the right waits into the userspace queue
09:16 MrCooper: if UMF is to be usable for display servers, the kernel has to be involved somehow, doesn't it?
09:16 danvet: I also haven't seen a reasonable plan for umf vs dma_fence compat mode
09:16 danvet: the hand-waving just assumes everything is umf in your system
09:16 danvet: why?
09:16 danvet: also how?
09:17 danvet: with umf you get no guarantee it'll ever happen
09:17 MrCooper: Wayland compositors want something which can be plugged into an event loop, not "putting the right waits into the userspace queue"
09:17 danvet: so either compositor waits until it's signalled before it submits
09:17 danvet: or you put a magic queue wait with timeout into the command queue
09:17 danvet: then it's not really pure umf anymore
09:18 MrCooper: yeah, "pure UMF" seems unusable for Wayland compositors
09:18 danvet: and the trouble with augmented umf so that it kinda looks like a futex with pollable fd
09:18 danvet: you get into the entire "looks almost like dma_fence, but is entirely incompatible with that" mess
09:19 danvet: and since we need legacy dma_fence supporting mode anyway for the foreseeable future
09:19 MrCooper: entirely incompatible how?
09:19 danvet: I'd just use that and call it done
09:19 danvet: kernel deadlocks in memory reclaim
09:19 danvet: so you can do dma_fence built using umf primitives
09:19 danvet: even with userspace submit and all that
09:19 MrCooper: even with a timeout?
09:19 danvet: but then you ditch the entire nice future fences semantics of umf
09:19 danvet: mutex_lock_timeout does not fix a deadlock
09:20 MrCooper: sounds like Vulkan wait-before-submit semantics can't be safely supported at all?
09:20 danvet: there's the tdr timeout for jobs that take too long, but usually when people talk about timeout in umf context they mean essentially replacing mutex_lock with mutex_lock_timeout and having no plan for what happens when the timeout fails because you've managed to hit the architectural deadlock
09:20 danvet: yes
09:21 danvet: which is the big scary thing and the reason jekstrand wants to ditch it
09:21 MrCooper: somebody should tell James Jones that
09:21 danvet: well, wait-before-submit can be supported, that's what the entire drm_syncobj and mesa submit thread stuff is about
09:21 danvet: but there's some older wait-before-submit vk stuff which wasn't engineered this carefully (mostly könig saying no really)
09:22 danvet: and which just works mostly, but not by design
09:22 danvet: i.e. if you get a cpu fault at just the right time that gets into memory reclaim and hits just the right dma_fence, you functionally deadlock
09:22 danvet: tdr will clean up the mess, but you have a reset for something that vk spec says should work
09:22 MrCooper: submit thread won't fly either I'm afraid, per discussion on the explicit sync Wayland protocol
09:22 danvet: hm link for me to catch up?
09:23 MrCooper: sec
09:23 danvet: MrCooper, for the vk feature that's very gray on linux-dma_fence world ask jekstrand
09:23 danvet: I always forget the exact name
09:24 MrCooper: danvet: https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/90#note_1017021
09:28 MrCooper: vkQueuePresentKHR is not supposed to block either
09:38 MrCooper: danvet: FWIW, I can't seem to find the reference right now, but I understand AMD's plans for user-space queues is to signal dma_fences from user space for implicit sync
09:38 danvet: yeah you can do that
09:38 danvet: if you do it right
09:38 danvet: I'm honestly not confident from the discussions thus far
09:38 MrCooper: isn't it fundamentally the same thing as wait-before-signal though?
09:39 danvet: not if you do it right :-)
09:39 danvet: essentially what you need is a) pin memory like with ioctl submit until each request is done
09:40 danvet: b) have an in-kernel timeline to make sure your ctx is advancing monotonically and in finite time
09:40 danvet: c) nuke the entire gpu context if userspace violates this
09:40 danvet: d) which means any wait-before-signal trick you do in the command stream that doesn't work with ioctl submit will also not work with this
09:41 danvet: at least work reliably as in "guaranteed to not occasionally deadlock and end up nuking the app"
09:42 danvet: MrCooper, ok read the wl proto issue, and yes daniels is right
09:42 danvet: unless you handle the dma_fence materialization in the protocol on the compositor side
09:42 danvet: then there's no way to implement this client-side only without breaking some guarantee somewhere
09:43 MrCooper: dma_fence materialization as in once the signal has been submitted for the wait?
09:43 danvet: yeah
09:43 MrCooper: would still need something usable in an event loop for that
09:43 danvet: yeah that's why I put my POLL_PRI comment at the bottom
09:44 danvet: if we want to tack it onto the drm_syncobj fd
09:44 danvet: or we do a new one
09:44 danvet: whatever compositors like really
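(For context: "usable in an event loop" concretely means an fd a compositor can hand to poll/epoll. A minimal sketch, assuming a sync_file fd, which is already pollable today; the open question above is the equivalent for fence materialization:)

    #include <poll.h>

    /* A compositor-friendly wait: hand the fence fd to the event loop
     * and get woken when it signals.  sync_file fds support this today
     * (POLLIN once the underlying dma_fence signals). */
    int wait_fence_fd(int sync_file_fd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = sync_file_fd, .events = POLLIN };
        return poll(&pfd, 1, timeout_ms); /* >0 once signalled, 0 on timeout */
    }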
09:45 MrCooper: but that still couldn't avoid the deadlock issues in the kernel, could it?
09:45 danvet: MrCooper, I think for compositor design it's useful to distinguish dma_fence + drm_syncobj case, where we can do all kinds of things like pollable fd and clear point of "the kernel is committed and guarantees it'll happen"
09:45 danvet: and pure umf, which is a memory location with no guarantees whatsoever about what'll happen to it in the future
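(A minimal illustration of what "pure umf" means here, assuming a 64-bit seqno in shared memory; the struct layout and wait mechanism are hypothetical, not any driver's actual format:)

    #include <stdatomic.h>
    #include <stdint.h>

    /* A "pure" user-mode fence: just a monotonically increasing seqno
     * in memory shared between GPU and CPU.  Nothing guarantees the
     * value ever advances -- which is exactly danvet's point. */
    struct umf {
        _Atomic uint64_t seqno; /* written by the GPU or another process */
    };

    static void umf_wait(struct umf *f, uint64_t point)
    {
        /* Busy-wait sketch; real code would futex-wait or use a doorbell. */
        while (atomic_load_explicit(&f->seqno, memory_order_acquire) < point)
            ;
    }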
09:46 danvet: MrCooper, drm_syncobj doesn't deadlock, because you cannot submit an un-materialized fence as a dependency anywhere
09:46 danvet: the "wait for fence materialization" is why the mesa submit thread is a thing
09:47 danvet: which means the kernel doesn't die, but otoh mesa might not be able to guarantee what the vk/wl specs say
09:47 danvet: which is why this is all a bit annoying :-/
09:48 MrCooper: yeah, seems like a thread can't really work, and it's not supposed to block either
09:49 danvet: yeah I mean fundamentally the problem doesn't disappear
09:49 danvet: by moving it into userspace
09:49 danvet: so you still have the mismatch between what the spec wants and what linux delivers
09:50 MrCooper: James Jones still seems to be under the impression it can work, that's why he's insisting on explicit sync support (in X as well)
09:50 danvet: I think fundamentally you can't have a) dynamic gpu memory management b) future fences c) compositors not getting stuck (i.e. cross security boundary guarantees) d) bottomless queues everywhere (no blocking)
09:50 danvet: nvidia blob largely drops a) on the floor
09:51 danvet: then this is all trivial
09:52 MrCooper: what a mess :/
09:52 danvet: I think they defacto also drop c) and just shrug
09:53 danvet: but one of these you really have to drop or it boils down to fully abstract diagram that demonstrates the deadlock
10:02 emersion: the protocol should support submitting fences that didn't materialize yet
10:06 emersion: what's wrong with waiting for UMF in the kernel with a timeout, like a Wayland compositor would?
10:07 dolphin: emersion: that's indeed the question :)
10:08 dolphin: I guess it boils down to a lot of code having been designed assuming the fence always materializes within the next 10 seconds, so UMF would have to be a new construct compared to dma_fence
10:10 emersion: is a 10s timeout bad?
10:11 dolphin: the waiters don't have a timeout, the exporter gives a guarantee to either fulfill or fail the fence in that time
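(A kernel-side sketch of the exporter/importer contract dolphin describes, using the real dma_fence_wait_timeout helper; the surrounding function is hypothetical:)

    #include <linux/dma-fence.h>

    /* The contract: the waiter may block without a timeout of its own,
     * because the *exporter* guarantees the fence signals (or errors
     * out) in bounded time.  A UMF can make no such promise. */
    static int wait_buffer_idle(struct dma_fence *fence)
    {
        long ret = dma_fence_wait_timeout(fence, true, MAX_SCHEDULE_TIMEOUT);

        if (ret < 0)
            return ret;              /* interrupted */
        return ret ? 0 : -ETIMEDOUT; /* 0 jiffies left == timed out */
    }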
10:23 emersion: i don't really understand any of this
10:23 danvet: emersion, yeah if you wait for umf in the kernel you just rebuild the mesa submit thread in the kernel
10:23 emersion: yes
10:23 danvet: that's a pile of code on the wrong side of a security boundary
10:23 emersion: that bad?
10:23 emersion: well, it would only be for backwards compat
10:23 danvet: mesa exists already
10:24 emersion: is waiting a security issue?
10:24 danvet: nah but doing all the parsing and copying and stuffing it into a kthread :-)
10:24 emersion: oh there is parsing involved?
10:24 danvet: if you want perfect backwards compat with existing ioctl at least
10:25 MrCooper: it would waste CPU cycles too?
10:25 danvet: if not, then ... mesa submit thread and drm_syncob wait-for-materialize all exists and works
10:25 emersion: i thought it was just waiting for a bit in some chunk of memory
10:25 danvet: MrCooper, well if you don't write a fastpath
10:25 danvet: so more complexity in a legacy submit ioctl which a) tend to be complex already b) why would you touch them
10:25 emersion: i mean there is probably a way to wait for a UMF without wasting CPU cycles?
10:26 danvet: emersion, what dolphin said, you need to wait for umf at a different place than where all the current drivers wait for dma_fence
10:26 emersion: where is that place exactly?
10:26 danvet: emersion, the cpu wasting is copying all the stuff to the kthread
10:26 danvet: after drm_sched_job_submit() essentially
10:26 danvet: umf wait must be before that
10:27 danvet: which also means you get propagating umf, because the dma_fence you hand back to userspace also becomes an umf
10:27 danvet: it's an infectious property
10:27 danvet: and with drm_syncobj we have the in-kernel semantics to at least stall in the right place
10:27 danvet: with sync_file not even that is a thing
10:28 emersion: okay, then we are in a situation similar to format modifiers, where everybody needs to support it in the whole stack before they can be used?
10:28 danvet: neither in dma_resv
10:28 danvet: and if you don't patch all these and the drivers using them, it's not good for more than a quick "hey this looks easy" demo
10:28 danvet: emersion, I think so
10:29 danvet: but I think I'm also more pessimistic on this than others
10:30 danvet: I don't think it's more code in the kernel than userspace
10:30 danvet: but also not less at all, and putting all that for compat in the kernel instead of userspace just feels rather wrong
10:31 emersion: i'm not convinced about that, but i'm no kernel dev
10:31 bnieuwenhuizen: on the other hand if nobody comes up with alternatives for years ...
10:31 danvet: bnieuwenhuizen, I haven't seen more than handwaving on the kernel side either
10:33 Lynne: how would mcbp work with user-mode submits?
10:36 MrCooper: so to recap, it sounds like Vulkan wait-before-signal can't be fully supported with upstream drivers; assuming user-mode fences are handled at the kernel level in a way suitable for Wayland compositors, is explicit sync really needed in the display protocols?
10:36 MrCooper: Lynne: GPU FW needs to handle it presumably
10:37 emersion: wait-before-submit* (?)
10:37 MrCooper: wait-before-signal as in the GPU wait is submitted before the corresponding signal
10:38 danvet: MrCooper, imo yes
10:38 emersion: i mean that's what's happening in general, people wait before signal, and the wait is unblocked when the signal happens
10:38 emersion: or maybe my terminology is completely wrong
10:38 danvet: like if protocols really can't be fixed, then I guess we can add an umf slot to dma_buf as another iteration of the most hilarious ipc ever
10:38 danvet: which mesa stuffs in on one side and takes out on the other
10:39 danvet: kinda like the import/export ioctl that jekstrand landed
10:39 danvet: but I'm really not sure whether ipc-on-dmabuf is a great design
10:39 emersion: yeah, i'd rather not
10:39 MrCooper: danvet: the protocols can be fixed, I'm just wondering if there will be any tangible benefit from the churn
10:39 dolphin: right, but if kernel is just metadata carrier, still can't do dynamic memory management really
10:39 danvet: it's probably the quickest hack forward which isn't a design dumpster fire at a fundamental level though
10:40 danvet: dolphin, you'd need to use the mesa submit threads still
10:40 emersion: well, having a good design sounds like a tangible benefit to me
10:40 emersion: instead of stashing more stuff onto dmabufs
10:40 danvet: it's really just a "wl proto is immutable, let's add the missing field to the dma-buf ipc sidechannel" :-)
10:40 emersion: wl proto is not immutable
10:40 emersion: i can work on the proto side
10:40 emersion: i just need folks to fix the rest :P
10:40 danvet:firmly tongue-in-cheek :-)
10:40 dolphin: emersion: asking everyone to move away from dma-buf to new_fence will be a bit of a job
10:41 danvet: but yeah the dma-buf ipc approach would only need changes to dma-buf.c and mesa winsys
10:41 danvet: well, more or less in exactly the same places as the dma-buf fence import/export
10:41 MrCooper: wouldn't it use the dma-buf fence import/export?
10:42 emersion: it doesn't have to be a flagday migration
10:42 danvet: ok gtg now for lunch/workout/sauna, ttyl
10:42 danvet: MrCooper, it's not a dma_fence/sync_file, so it'd be new
10:42 emersion: it can be a gradual migration like format modifiers
10:42 danvet: or you're back to the "magic compat layer in the kernel" pandora's box
10:43 MrCooper: emersion: nvidia users would keep suffering from synchronization artifacts until the gradual migration completes
10:43 emersion: is nvidia migrating to UMF today?
10:44 emersion: hm, or do you mean nvidia is completely broken today?
10:44 MrCooper: emersion: re wait-before-signal, so far the kernel supports submitting GPU waits only after the corresponding signal was submitted
10:45 emersion: i'm not following
10:45 emersion: i submit a page-flip IOCTL before the GPU work for the frame has completed
10:45 dolphin: emersion: today you're not allowed to publish a fence unless you guarantee to be able to resolve it
10:45 MrCooper: emersion: yes, https://gitlab.freedesktop.org/xorg/xserver/-/issues/1317
10:45 emersion: the kernel will wait until the completion is signalled
10:45 emersion: this is wait-before-signal
10:45 MrCooper: not the same thing
10:46 MrCooper: wait-before-signal is about GPU semaphores
10:46 emersion: what is signal() then? it's not signalling completion of the fence?
10:46 emersion: is signal() the thing that materializes the fence?
10:47 MrCooper: emersion: BTW, nvidia doesn't even do implicit sync for page flips yet, so may get incomplete frames even for normal compositor drawing
10:47 MrCooper: signalling the semaphore
10:48 emersion: somehow i'm more confused at the end of the discussion than at the start
10:49 emersion: okay, so nvidia is broken, but since this is a sync issue, it only manifests itself in some edge cases
10:49 MrCooper: it is a complex topic :)
10:49 dolphin: emersion: just having a fence somewhere returned by IOCTL means it'll have to be signalled in <10 seconds
10:49 dolphin: if you have a reference to a fence, it exists and you can expect it to fulfil or fail in that time period
10:50 emersion: right, if we have UMF waits in the kernel, we can keep that guarantee
10:50 dolphin: so you can do stuff like just plain wait on it, holding the system from making forward progress until it signals
10:50 emersion: yes
10:51 dolphin: well, you'd have to fix all users of dma fences in the kernel not to do that
10:51 emersion: why?
10:51 emersion: can't the dma_fence emulation for UMF record the creation time
10:51 emersion: and time out if the wait exceeds that timestamp+10s?
10:51 dolphin: UMF fence means there's no guarantee when it will complete => potential memory management deadlocks
10:52 emersion: UMF with a timeout fixes that issue
10:52 dolphin: well, everybody is mostly after UMF for the fact that there would be no exporter enforced timeout
10:53 dolphin: so UMF with fixed timeout doesn't really sound that much better than current dma fences
10:53 emersion: i mean, if there is a timeout for backwards-compat only, could be fine?
10:54 dolphin: well, then you would essentially have two classes of dma fence
10:54 dolphin: which really is kind of equal to the new fence which danvet and I referred to
10:54 emersion: yeah, it would be a new fence, but with possible backwards compat with old dma_fence
10:55 emersion: and then we can have a new-fence wl proto, KMS uAPI, etc
10:55 dolphin: the thing is, because you can chain those fences
10:55 dolphin: you basically have to fix pretty much all of it before you get the benefit
10:55 emersion: yeah, you need to fix one driver + one compositor + mesa
10:56 emersion: to not hit the backwards compat dma_fence codepath
10:56 X512: Why do 2 types of synchronization FD exist (fence file, syncobj)? Is fence file obsolete?
10:56 emersion: X512: i'd suggest reading the kernel docs for drm_syncobj
10:57 X512: It seems that it is possible to implement everything in userland by using only syncobj FDs.
10:57 emersion: https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#drm-sync-objects
10:57 X512: fence file is redundant.
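(To illustrate the two fd types X512 asks about: a sync_file wraps a single, already-materialized dma_fence, while a syncobj is a container whose fence may not exist yet. A sketch using libdrm's real conversion helper; the wrapper function is hypothetical:)

    #include <stdint.h>
    #include <xf86drm.h>

    /* syncobj -> sync_file: only succeeds if the syncobj currently
     * holds a materialized dma_fence, which is the whole difference
     * between the two fd types. */
    int syncobj_to_sync_file(int drm_fd, uint32_t syncobj_handle)
    {
        int sync_file_fd = -1;

        if (drmSyncobjExportSyncFile(drm_fd, syncobj_handle, &sync_file_fd))
            return -1;
        return sync_file_fd; /* pollable fd wrapping the current fence */
    }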
10:58 dolphin: emersion: well, the problem is if you export the fence, it can be imported by pretty much anybody
10:58 dolphin: so you'd have to change the rules based on who imports it
10:59 dolphin: suddenly it would be your responsibility as exporter to ensure the 10 second timeout, compared to the importer deciding it depending on whether they are up for an indefinite timeout
10:59 emersion: there are multiple ways to go around this
11:00 emersion: hm, are you talking about export inside the kernel, or export as a FD?
11:00 dolphin: FD, mostly
11:00 emersion: so you could have a new FD type for new-fence
11:00 emersion: and exporting to a dma_fence would change the rules
11:01 dolphin: well, then you can loop back to what I said about getting all the userspace libraries to jump to a new fence FD type
11:01 dolphin: everyone seems to be figuring things out by doing downstream hacks on top of dma_buf for now
11:02 dolphin: in upstream, you only export immovable pinned memory
11:03 emersion: there are not that many userspace libraries
11:03 X512: I asked about fence file and syncobj accessed from userland. Why not use syncobj FD for the Wayland explicit sync protocol?
11:03 emersion: X512: because UMF may need a New Thing
11:03 X512: It has no wait-before-submit problem.
11:03 emersion: so the wl work is stalled on kernel folks figuring out what the New Thing can be
11:03 dolphin: emersion: how's that? if you include media and compute
11:04 kchibisov: How would I debug memory issues within just libEGL? Are there options to build only libEGL under ASAN?
11:05 kchibisov: I have libEGL segfaulting when handling linux dmabuf v4 with a specific client on my system due to malloc errors.
11:05 emersion: dolphin: i mostly care about KMS and Vulkan
11:05 dolphin: emersion: and with compute, I also mean the scale-out network stuff
11:06 dolphin: well, that doesn't mean the libraries are not there, just means you don't care about them :)
11:07 emersion: do you mean oneAPI crap?
11:07 emersion: VA-API?
11:07 emersion: or something else?
11:08 emersion: in any case, i don't care if these degrade to the backwards compat path
11:08 dolphin: well, the kernel backwards compatibility rules really apply to everyone
11:09 dolphin: I think libfabric is the hip thing these days for compute
11:09 MrCooper: X512: don't think you can get a syncobj before the signalling work has been submitted; anyway, that particular use cases seems a lost cause for upstream drivers
11:10 dolphin: the problem is that those libraries and the compute in general is the reason why folks want the new fence
11:10 emersion: MrCooper: you can with timelines
11:11 emersion: send a timeline drm_syncobj with a point which doesn't exist yet
11:11 dolphin: yeah, I think there is a corner case in Vulkan that will outright fail on upstream
11:11 X512: And DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT flag.
11:11 emersion: hm actually you might also be able to do it with a binary syncobj
11:11 dolphin: fixing just that corner case is probably not enough of motivation to do all the work
11:11 emersion: the syncobj is a container and the dma_fence in it can be NULL
11:12 X512: WAIT_FOR_SUBMIT will cause the wait to block until a dma_fence is inserted into the syncobj.
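(A sketch of the WAIT_FOR_SUBMIT behavior just described, using libdrm's real drmSyncobjTimelineWait; the wrapper function is hypothetical:)

    #include <stdint.h>
    #include <xf86drm.h>

    /* Wait on a timeline point that may not have been submitted yet:
     * WAIT_FOR_SUBMIT first blocks until a fence materializes at the
     * point, then until that fence signals.  Without the flag, waiting
     * on an unsubmitted point fails immediately. */
    int wait_timeline_point(int drm_fd, uint32_t syncobj, uint64_t point)
    {
        return drmSyncobjTimelineWait(drm_fd, &syncobj, &point, 1,
                                      INT64_MAX, /* absolute timeout, ns */
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                      NULL);
    }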
11:14 emersion: related: https://patchwork.freedesktop.org/patch/506761/
11:17 MrCooper: maybe it's "just" prone to deadlocks in the kernel then
11:19 emersion: X512: btw, the wl proto based on drm_sycnobj is here: https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/90
11:35 MrCooper: or maybe you can get a syncobj like that, but still can't actually submit the GPU wait work before the signal work
11:36 emersion: vulkan lacks an ext to import a drm_syncobj, so yeah you can't do that right now
11:36 emersion: you can if you stay in vulkan-land
11:37 X512: Isn't a Vulkan timeline semaphore exported as an opaque FD actually a syncobj FD?
11:38 emersion: oh, right
11:38 emersion: but there's no guarantee it will be
11:38 emersion: it's just the mesa implementation-defined behavior
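(The export being discussed, sketched with the standard VK_KHR_external_semaphore_fd entry points; error handling trimmed, and nothing here guarantees the returned fd is a drm_syncobj outside Mesa:)

    #include <vulkan/vulkan.h>

    /* Export a (timeline) semaphore as an opaque fd.  Requires
     * VK_KHR_external_semaphore_fd; that the fd happens to be a
     * drm_syncobj fd is Mesa implementation behavior, not spec. */
    int export_semaphore_fd(VkDevice dev, VkSemaphore sem)
    {
        PFN_vkGetSemaphoreFdKHR get_fd = (PFN_vkGetSemaphoreFdKHR)
            vkGetDeviceProcAddr(dev, "vkGetSemaphoreFdKHR");
        const VkSemaphoreGetFdInfoKHR info = {
            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
            .semaphore = sem,
            .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT,
        };
        int fd = -1;

        if (!get_fd || get_fd(dev, &info, &fd) != VK_SUCCESS)
            return -1;
        return fd;
    }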
11:38 X512: It can be qualified by a new Vulkan extension, like the dma_buf export FD extension.
11:39 emersion: yes, but jekstrand rejected that idea
11:39 emersion: on the basis that UMF will likely need a new fence, and we should skip directly to that
11:39 emersion: the tl;dr is that all explicit sync work is stalled on UMF being solved, as you can see
11:40 bnieuwenhuizen: I feel like we're falling into a trap of blocking everything on UMF and then being very slow to get UMF through
11:40 emersion: yes, i agree
11:41 dolphin: the real question is, is there some use case you can't implement without new fence when it comes to 3D/compositors
11:41 bnieuwenhuizen: like Jason mentioned 2-4 years for UMF yesterday, which kinda sucks to block any ecosystem improvements
11:42 dolphin: usually, new fence enters the discussion when you want to do multiple-hour compute workloads
11:42 emersion: dolphin: fix NVIDIA, and future AMD hw
11:43 emersion: also wait-before-submit for WSI
11:43 X512: What is Vulkan semaphore FD for proprietary Nvidia drivers?
11:43 emersion: X512: no idea
11:43 dolphin: emersion: nvidia can also adopt the same model everyone else is doing
11:44 emersion: dolphin: future hw won't support the current model
11:44 dolphin: why wouldn't 10 second timeout be supported?
11:45 emersion: might as well move on now, instead of staying with the old model which will become obsolete eventually
11:45 dolphin: if one failed to perform an action in 10 seconds, your compositor/desktop experience is not going to be great
11:46 emersion: i don't know the details, i just know AMD will require UMF in the future
11:47 dolphin: they probably would like to take advantage of HW features that you can achieve with UMF (like everyone else)
11:47 dolphin: doesn't make it any more required for the desktop compositing and 3D workloads
11:48 emersion: the ML has more info
11:48 dolphin: no matter how complex your hardware, you can always assert a 10 second timeout
11:49 X512: Does it mean that applications will be able to freeze the whole GUI for 10 seconds?
11:49 emersion: no
11:50 dolphin: if you do your compositor right, should just freeze single window
11:50 dolphin: but if your compositor has a bug (or the underlying userspace driver), then you might
11:51 dolphin: meaning if the compositor itself hangs, then you may get 10 second stutter
11:51 dolphin: and that is the rationale for the timeout too, any more time and the user is going to hit the power button
11:54 dolphin: emersion: long story short, you can always assert a timeout even if your hardware could do something else, and you can make/keep the fence waiting kernel code more straightforward
11:54 dolphin: as a new FD would be needed, sharing the same name in kernel internally may just add confusion
11:55 dolphin: everyone would probably just want the new fence, but nobody has enough incentive to do it as everyone would have to be on board to get the benefits
11:55 dolphin: as one can always resolve the problems just for your own driver and stack in downstream
12:07 MrCooper: emersion: future AMD HW can work fine with what we have now (and so can nvidia HW, if they want to)
12:08 emersion: AMD devs are saying otherwise
12:08 MrCooper: upstream drivers can't stop supporting dma_fence anyway, it would be a UAPI regression
12:08 dolphin: emersion: probably some miscommunication going on
12:09 dolphin: nothing stops anybody from starting the 10 second timer when creating the fence
12:09 MrCooper: AMD devs seem to understand this perfectly
12:11 emersion: alright, then sounds like a good idea to just let explicit sync work bitrot a few more years
11:41 X512: I am working on a proof-of-concept Radeon GPU driver that runs in userland as a server (daemon) process.
12:12 emersion: less work for me, i won't complain :)
12:12 X512: Targeted GPU (Radeon SI) seems too old for prototyping UMF :(
12:12 dolphin: well, as far as I know, (apart from the corner cases that Vulkan API allows) there's really no use-case for 3D/desktop compositing where you need indefinite workloads
12:13 dolphin: emersion: the answer to "why?", is really the compute libraries
12:14 X512: Haiku has no implicit sync or X11 legacy, so the whole synchronization model can be designed from scratch.
12:15 dolphin: X512: in userspace, unless you can pin All the Memory (TM), you can't do anything but "new fence"
12:15 dolphin: as you won't have any guarantees for memory being available
12:16 dolphin: and if you are doing shmem() + futex() in userspace, then it goes back to danvet's point about why make it in KMD?
12:17 X512: My driver guarantees that all buffer objects have actually allocated memory, and creating a new buffer will fail if there is no more physical memory.
12:17 X512: For simplicity.
12:18 dolphin: if the kernel does overcommit, the allocated memory is a lie
12:18 dolphin: your thread will be paused, memory taken away, and returned at later point
12:18 X512: No overcommit. GTT buffers must be locked in physical memory.
12:19 X512: Driver reject CPU memory that is not locked.
12:19 dolphin: right, if you allow userspace to lock unbounded amount of memory, then a lot of things are possible but that's an another story
12:22 dolphin: however your CPU thread may still be paused due to CPU timeslicing, so even if the GPU work submitted finishes, the CPU thread may not be there to pick it up
12:22 dolphin: so still can't give a guarantee of 10 seconds
12:24 X512: Do GPU preemption?
12:24 dolphin: hm?
12:25 dolphin: just that the userspace CPU thread can't make promises to complete anything in N seconds
12:25 X512: Then it is only a problem of one particular userland process, and other processes are unaffected?
12:26 dolphin: yes, but is the reason why you can't guarantee to signal a fence in 10 seconds
12:27 dolphin: aka. you can only do a "new fence" really if you do it in userspace
12:30 X512: What is the problem with doing it in userspace, if ignoring legacy compatibility problems?
12:31 X512: Userland processes can be fully isolated by separate GPU contexts and GPU context switching.
12:31 dolphin: doing what, exactly?
12:31 dolphin: you mean the old style dma_fence?
12:32 X512: No. Memory mapped userland fences.
12:33 dolphin: hmm, I don't think there is a problem. every waiter is responsible for specifying a timeout
12:34 dolphin: you don't have any hard guarantees of the system making forward progress, though
12:36 X512: If freeze is isolated to single GPU context and don't affect other GPU contexts then it seems no problem.
12:39 dolphin: yeah, if it's not your compositor, then probably not a problem
12:41 X512: Compositor can render old surface buffer (double buffering) if rendering new buffer by client process is frozen for some reason.
12:42 dolphin: yeah, but if your compositor itself is frozen, then user sees no new frame :)
12:43 dolphin: of course if you assign the compositor to a different cgroup and allow it to mlock memory, the chances will be lower
12:46 X512: I think that it is a good idea to prohibit overcommit for the compositor and design it so every allocation failure must be gracefully handled (fail to open window etc.).
12:46 X512: Compositor is critical real time task.
13:22 Lynne: umf isn't that far away, it's just a few patchsets away from being usable
13:52 MrCooper: danvet: hmm, can the kernel prevent wait-before-signal with user-mode queues though?
15:04 agd5f: AMD hardware going back to navi1x can do user mode queues for gfx in theory
15:05 ishitatsuyuki: but does that imply errata-free? :p
15:13 MrCooper: Lyude: FYI, https://gitlab.freedesktop.org/drm/amd/-/issues/2171#note_1734122 and its grand-parent comment indicate hwentlan's series doesn't fully fix the MST regressions
15:35 nroberts: are there any plans to make meson support generating the rustdoc documentation?
16:15 DavidHeidelberg[m]: going to merge kernel uprev from 6.0 to 6.1, any objections? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20855
16:16 DavidHeidelberg[m]: Sergi stress-tested it a lot; I also gave it a few runs, so stability turned out to be good.
16:22 agd5f: what we did to support implicit sync is to add a new secure semaphore packet which you put into the user mode queue. UMD tells the KMD what ring write pointer value is for the user queue associated with the implicit sync. The new secure packet then writes the wptr value to a location in the KMD's GPU address space. KMD can then use that value to sync against when it needs to. That memory can also be mapped RO into the GPU address space of
16:22 agd5f: other processes and they can use wait packets to sync against
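(A minimal sketch of the kernel-side validation danvet brings up below: the wptr/seqno written by the secure packet must only move forward. A hypothetical helper, not AMD's actual code:)

    #include <linux/types.h>

    /* The timeline value the secure packet writes must only advance;
     * if it ever goes backwards, userspace broke the contract and the
     * kernel nukes the GPU context. */
    static bool timeline_update(u64 *shadow_wptr, u64 new_wptr)
    {
        if (new_wptr < *shadow_wptr)
            return false; /* would walk the timeline backwards */
        *shadow_wptr = new_wptr;
        return true;
    }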
16:42 LaserEyess: agd5f: are there any circumstances where the driver, given access to PIXEL_ENCODING_RGB, *should* not use that? Assuming, of course, the display supports it
16:43 LaserEyess: for example
16:43 LaserEyess: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c#L5229
16:43 LaserEyess: shouldn't these checks on YCRCB444 and RGB be reversed?
16:43 LaserEyess: if it *cannot* use it, I understand, but I don't understand why RGB wouldn't be preferred here
16:56 danvet: MrCooper, it can't prevent that right now for most drivers either, you'd need a cmd parser to make sure userspace didn't sneak in a wait-before-submit
16:57 danvet: so all we really need is an uapi contract that userspace must not do that, and the pieces I listed on the kernel side to make sure the resulting dma_fence don't break any rules
16:57 danvet: if userspace breaks the contract it gets to keep the pieces
16:57 danvet: which results in garbage rendered into that buffer
16:57 danvet: which userspace can do anyway if it feels like
17:00 danvet: agd5f, I still think not even that was really necessary
17:00 danvet: you can also kmap that page and leave it in the gpu va to write into
17:01 danvet: and then have a little bit of validation on the kernel side to make sure the fence timeline doesn't walk backwards
17:01 agd5f: LaserEyess, not sure. question for hwentlan
17:02 agd5f: danvet, yeah, it just guarantees a monotonically increasing value
17:02 LaserEyess: ok thanks
17:02 danvet: agd5f, oh the secure write is an rmw cycle to make sure it's only ever going up?
17:03 danvet: still needs a bunch of kernel code to make sure the timeout&ctx nuking happens, but I guess you can in some cases consume them directly somewhere else
17:03 MrCooper: danvet: so if there is explicit sync in the display protocol, and user space is silly enough to attempt wait-before-signal, it may appear to work fine most of the time?
17:04 danvet: MrCooper, yeah
17:04 danvet: well, wait-before-signal even works nowadays most of the time
17:04 danvet: until you stack things deep enough and run out of luck at the right time
17:05 danvet: you don't really need a protocol to shovel them around, between engines/context in one process is good enough
17:05 danvet: or just engine/cpu
17:05 danvet: which vk allows
17:06 MrCooper: well, this is specifically about wait-before-signal drawing to a BO shared between a client and a display server
17:06 danvet: it's just wait-before-signal
17:06 danvet: no further conditions needed to go boom
17:07 MrCooper: I'm afraid I'm also starting to get more confused than I was when I started asking questions this morning :/
17:09 danvet: so the problem is that if you do wait before signal using dma-fence, then that fundamentally deadlocks
17:09 danvet: so you need to unwind it in userspace and do the entire drm_syncobj fence materialization dance
17:09 danvet: which mostly works, except protocols
17:09 danvet: and it's also a mess
17:10 danvet: but nothing is stopping userspace from just ignoring all these rules, and happily mixing wait-before-signal and dma_fence
17:10 danvet: and it will mostly work
17:10 danvet: even under memory pressure
17:10 danvet: until your luck runs out
17:10 danvet: so the protocol thing isn't really fundamental, it just forces these various gaps and mismatch more clearly into the light
18:59 agd5f: danvet, yeah the write pointer is a 64 bit monotonic number since vega
19:02 DemiMarie: bnieuwenhuizen emersion: I strongly recommend not waiting on UMF
19:06 DemiMarie: danvet: is the problem that dma-fence is a terrible API?
19:44 airlied: dj-death: testing on my dg2 throws up a tiling error, since the engine wants Y tiling; will have to figure out what it wants on dg2
21:24 danvet: jekstrand, tilers that allocate more memory while executing a batch
22:05 dcbaker: zmike: I had to pull a couple of not-nominated patches to get one of the nominated zink patches to apply. It's the top 3 on the staging/23.0 branch currently, could you let me know if you're okay with that?
22:23 zmike: dcbaker: whoops, I thought I was on top of conflicts there
22:24 zmike: dcbaker: ah yeah, that was originally one patch so the fixes tag didn't get propagated when it was split
22:24 zmike: lgtm
22:31 jekstrand: danvet: Bad tilers!
22:32 danvet: jekstrand, I just realized that maybe I shouldn't think about memory handling deadlocks that much
22:32 jekstrand: heh
22:32 danvet: I also wonder how many kinda funny-to-buggy drivers we have already
22:32 jekstrand: danvet: I'm not aware of any tilers that absolutely have to allocate mid-batch.
22:33 zmike: jekstrand: I pinged you on a zink MR, any chance you could take a look in the next couple days
22:33 jekstrand: Some of them can go faster if they up a pool size
22:33 danvet: jekstrand, hm right, so GFP_NORECLAIM is good enough
22:34 danvet: I still don't want to read all the drivers to check that :-)
22:36 bnieuwenhuizen: jekstrand: didn't Mali have issues there? (https://community.arm.com/arm-community-blogs/b/graphics-gaming-and-vr-blog/posts/memory-limits-with-vulkan-on-mali-gpus)
22:43 jekstrand: zmike: translated, more-or-less.
22:43 jekstrand: bnieuwenhuizen: On Mali, varying buffers are allocated in userspace
22:43 jekstrand: Apple and IMG have a heap that gets allocated by the kernel based on metrics but those can both handle OOM by spilling part-way through the render pass.
22:44 jekstrand: So if the kernel goes to allocate and fails, it's ok. It just needs to be careful to not throw the old buffer away until it's sure it has the new one.
22:54 danvet: lina, ^^ might also need more GFP_NORECLAIM ...
23:06 zmike: jekstrand: awesome, thanks
23:46 airlied: dj-death: okay dg2 needs some work, once I got past all the missing MOCS, will try and figure that out
23:52 dj-death: airlied: thanks