00:37 DavidHeidelberg[m]: anholt_: CI passed, time to merge new mesa-swrast machines?
00:37 gfxstrand: airlied: RE: Vulkan Beta...
00:38 gfxstrand: airlied: Yeah, part of the XML parser changes was to unify the skipping of provisional stuff.
00:38 gfxstrand: I didn't realize we had video in the tree at that time.
00:38 gfxstrand: (Might be good for CI to build-test that....)
00:39 gfxstrand: airlied: IDK what the right thing to do is. I kinda want to keep as much beta stuff disabled by default as we can. That way we're guaranteed that we never ship beta stuff accidentally.
00:39 gfxstrand: But, also, I don't want to make the build system a mess by passing an extra flag into every single python generator.
00:39 airlied: gfxstrand: not sure we can build stuff without the define at all
00:40 airlied: so I don't see how we could accidentally ship stuff
00:40 gfxstrand: Yeah
00:40 gfxstrand: Let me see what breaks when I drop the check
02:33 alyssa: my MR is waiting for sanity
02:33 alyssa: oddly poetic
04:26 alyssa: eric_engestrom: oh wow, after enough banging my head on things, I think I got clang-format CI working
04:27 alyssa: bruteforce works!
07:34 eric_engestrom: alyssa: awesome! I'll review it now :)
08:20 bbrezillon: danylo, danvet: Hi. I've been looking in more detail at the nouveau VM_BIND stuff, and there's still one thing that's not clear to me. Is drm_sched_backend_ops::run_job() considered to be part of the signaling path? IOW, can we allocate from this callback (most/all? drivers seem to allocate their fence object from there with GFP_KERNEL flags), and if we can allocate these fences with GFP_KERNEL, what makes things different for page table allocations?
08:20 danvet: yeah those are all driver bugs unfortunately
08:20 danvet: I thought amd reworked to fix this, but the rework never got out of amd into other drivers
08:20 bbrezillon: ok, so it's indeed considered to be in the signaling path
08:20 danvet: yeah
08:21 bbrezillon: is this documented?
08:21 danvet: no
08:21 danvet: I did a patch to add signalling annotations and broke everything
08:21 danvet: so I think best to do is to resurrect that, but add a knob to disable it per-driver
08:21 danvet: and then set that knob for all current drivers except amdgpu
08:22 danvet: https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/
08:22 danvet: https://lore.kernel.org/dri-devel/20200604081224.863494-15-daniel.vetter@ffwll.ch/
08:23 danvet: I think I've discussed this with thomas hellstrom or mlankhorst or mbrost in the past too
08:24 danvet: opt-out might need to be per-callback even since iirc amdgpu still has issues with console_lock in the tdr path
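(For reference, the annotations danvet mentions are the dma_fence_begin_signalling()/dma_fence_end_signalling() helpers that already exist in dma-fence.c; below is a rough sketch of how a signalling critical section gets wrapped so lockdep can flag reclaim-recursing dependencies inside it. The function name and the fence passed in are placeholders, not code from any driver.)

```c
#include <linux/dma-fence.h>

/*
 * Sketch only: wrap a fence signalling critical section so lockdep can
 * flag anything inside it that may recurse into memory reclaim
 * (GFP_KERNEL allocations, console_lock, ...).
 */
static void example_signalling_section(struct dma_fence *fence)
{
	bool cookie = dma_fence_begin_signalling();

	/* Work done here must complete without blocking on reclaim. */
	dma_fence_signal(fence);

	dma_fence_end_signalling(cookie);
}
```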
08:25 javierm: danvet, bbrezillon: I believe that topic was also discussed by Christian and Lina in the thread about the agx driver
08:25 danvet: yeah I need to catch up on that
08:25 MrCooper: DemiMarie: "kernels from distros like RHEL are often out of date" do not let the base kernel version fool you, most of the code is pretty close to upstream (in RHEL 8 & 9 ATM, older versions are more or less frozen at this point)
08:25 javierm: danvet: I read everything but have to admit that I only understand half of it :)
08:25 danvet: I also need to come up with some idea for the annotations in the display helpers
08:26 danvet: because they fire in a bunch of common but unfixable cases for something that not really anyone is using ...
08:26 danvet: javierm, yeah that's the usual end result when talking about memory reclaim
08:26 danvet: utter confusion by the people who got lost halfway through
08:27 danvet: and terminally sad faces by the leftover few :-/
08:27 javierm: I see... it's not only me then
08:27 danvet: someone I pointed at the plumbers discussion with könig, gfxstrand and me from 2 years ago summarized it with "I don't really understand, but your facial expressions really worried me"
08:28 bbrezillon: Unless I'm missing something, for simple drivers (panfrost, etnaviv, ...), it's mostly a matter of pre-allocating the fence object at submit time, and filling it at run_job() time, so no allocation happens in there
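(A minimal sketch of the pre-allocation pattern bbrezillon describes; every foo_* name is made up. The fence is allocated at submit time, where blocking allocations are fine, and only initialized in run_job(), so the signalling path itself never allocates.)

```c
#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <drm/gpu_scheduler.h>

static DEFINE_SPINLOCK(foo_fence_lock);

static const char *foo_fence_name(struct dma_fence *f)
{
	return "foo";
}

static const struct dma_fence_ops foo_fence_ops = {
	.get_driver_name   = foo_fence_name,
	.get_timeline_name = foo_fence_name,
};

struct foo_job {
	struct drm_sched_job base;
	struct dma_fence *hw_fence;	/* pre-allocated, initialized later */
};

/* Submit path (ioctl): not a signalling critical section, may block. */
static int foo_job_submit(struct foo_job *job)
{
	job->hw_fence = kzalloc(sizeof(*job->hw_fence), GFP_KERNEL);
	if (!job->hw_fence)
		return -ENOMEM;
	return 0;
}

/* drm_sched_backend_ops::run_job: signalling critical section, so only
 * fill in the fence that was allocated at submit time. */
static struct dma_fence *foo_run_job(struct drm_sched_job *sched_job)
{
	struct foo_job *job = container_of(sched_job, struct foo_job, base);

	/* context/seqno would come from the driver's fence context */
	dma_fence_init(job->hw_fence, &foo_fence_ops, &foo_fence_lock, 0, 1);
	/* ... kick the hardware ... */
	return job->hw_fence;
}
```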
08:29 bbrezillon: but no matter how we solve it, it should probably be added to the drm_sched_backend_ops doc
08:29 danvet: bbrezillon, yup
08:30 danvet: (on both actually)
08:30 danvet: hm if amdgpu switched to shadow buffer then the console lock thing is also sorted now, and they should work with the full annotations
08:30 bbrezillon: are there other callbacks that are supposed to run in the signaling path?
08:31 danvet: all of them actually :-)
08:31 danvet: but run_job and timedout_job are the ones where people usually get it wrong
08:32 danvet: bbrezillon, this is why my first patch annotates the entire scheduler thread
08:33 bbrezillon: yeah, but I'll probably leave that to you. Just wanted to update the docs, and fix panfrost :P
08:34 danvet: well with the annotations you get much better testing generally
08:34 bbrezillon: unless you want to update the docs along with the annotation
08:34 bbrezillon: sure, I'm not claiming we shouldn't add the annotation, just saying it's worth documenting it too
08:35 bbrezillon: because the current situation is, we don't have the annotation, and people keep making the same mistake
08:38 bbrezillon: actually, this came up while I was discussing the page table pre-allocation stuff with robmur01 on #panfrost, and he rightfully pointed me to all those drivers allocating stuff in the run_job() path, and this is where I got confused, not knowing if this was part of the signaling critical section and all drivers were buggy, or if it was actually allowed to allocate from there
08:43 danvet: bbrezillon, yeah probably the docs are a good first step, you're volunteering?
08:44 danvet: maybe in the main struct text a link to the dma-fence signalling doc section for further details, and then a one-liner in each callback that they're signalling critical sections and therefore very limited in what they can do, specifically can't allocate memory with GFP_KERNEL
08:45 danvet: and then commit message with the note that sadly most drivers get this wrong
08:49 bbrezillon: danvet: yep, I can do that, and fix panfrost along the way
08:49 danvet: thx
08:50 danvet: https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#dma-fence-cross-driver-contract and the section right below probably what you want to link to
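(Roughly the kind of kernel-doc addition being discussed; the wording below is only a sketch, not the actual patch.)

```c
/**
 * struct drm_sched_backend_ops - scheduler backend operations
 *
 * All callbacks are part of the DMA fence signalling critical path; see the
 * "DMA Fence Cross-Driver Contract" and "DMA Fence Signalling Annotations"
 * sections in Documentation/driver-api/dma-buf.rst for the details.
 *
 * @run_job: Runs in the fence signalling critical section and is therefore
 *	very limited in what it can do: in particular it must not allocate
 *	memory with GFP_KERNEL or take locks that are held while allocating.
 *	Pre-allocate everything (including the hardware fence) at submission
 *	time instead.
 *
 * @timedout_job: Same restrictions as @run_job.
 */
```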
13:36 daniels: this is basically like The Purge but for lavapipe/llvmpipe
13:41 zmike: haha
14:03 daniels: fwiw it's looking like our remaining showstopper - apart from missing swrast - is a GitLab bug where it just doesn't hand jobs out to runners. if you see anything stuck in pending for wild amounts of time like 30-40min it's that. if the pipeline is still running other jobs, you can cancel & retry the pending-forever one and it'll sail straight through
14:03 daniels: I've collected enough about it to figure out how to unbreak it, so we should have a monumentally stupid workaround in place later this afternoon
14:09 javierm: tzimmermann: thanks for your explanations, makes sense. Feel free to add my r-b to patch #1 too if you resend/apply
14:16 tzimmermann: javierm, thank you. i do expect that some code can be shared at some point. i simply don't want to get ahead of myself
14:20 javierm: tzimmermann: yes, makes sense
14:47 DavidHeidelberg[m]: where can I grep for the wayland-dEQP-EGL.functional.negative_api.create_pixmap_surface ? piglit/deqp?
14:49 daniels: DavidHeidelberg[m]: VK-GL-CTS
14:49 DavidHeidelberg[m]: thx
14:50 daniels: though afaict that should be fine as we return EGL_BAD_PARAMETER
14:50 daniels: is it failing?
14:52 DavidHeidelberg[m]: daniels: it's fixed when mesa is built with wayland; but when mesa is built without wayland, it should be skipped, not failed I guess?
14:52 daniels: it should still be passing on x11
14:52 DavidHeidelberg[m]: oh, then I just worked around the issue by enabling wayland :D
14:52 DavidHeidelberg[m]: see https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21786/diffs
14:55 daniels: heh yeah, I think that's just broken CTS
16:15 DemiMarie: bbrezillon: is the whole dma-fence design the actual problem?
16:16 DemiMarie: Where core MM stuff gets blocked on async GPU work instead of paging the memory out and having the GPU take an IOMMU fault.
16:17 DemiMarie: Any chance this can be fixed in drivers other than AMD?
16:21 alyssa: daniels: i wrote a CI thing and it *works*?!
16:21 alyssa: send help something must have gone terribly wrong
16:23 zmike: I'm here to help, who needs a spot
16:23 alyssa: daniels: also, re unassigning, this is what "Needs merge" helps with :D
16:24 alyssa: "backlog of MRs to assign to Marge on a rainy day"
16:24 zmike: would be nice to have a cron job to sweep that a couple times a day
16:25 bbrezillon: DemiMarie: not sure I follow. Did I say the dma-fence design was the problem?
16:28 DemiMarie: bbrezillon: no, but it seems to me that the dma-fence design will keep causing bugs until it is changed
16:30 daniels: alyssa: hrm?
16:33 alyssa: daniels: which part
16:35 daniels: alyssa: 'wrote a CI thing'?
16:36 alyssa: daniels: clang-format lint job
16:36 alyssa: i thought i would just end up waiting for eric or okias to do it :p
16:36 daniels: oh, nice
16:37 alyssa: (..next step is getting panfrost clang-format-clean so I can flip it on there because apparently it isn't right now, whoops)
16:44 bbrezillon: DemiMarie: I'm no expert on these things, but my understanding is that it's not the dma_fence design itself that's problematic, but more the memory reclaim logic. With mem shrinkers waiting on dma_fence objects to release memory, and waitable allocation happening in the job submission path, you might just deadlock.
16:47 Hazematman: Hey, any people familiar with nouveau, tegra, virgl, or svga able to look over my MR to change the default PIPE_CAP_DMABUF behavior? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21654
16:47 Hazematman: This MR adds support for the `get_screen_fd` API to all the gallium drivers, so I need some people to look over changes to those gallium drivers
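(For context, a sketch of what such a hook could look like in one gallium driver; the get_screen_fd signature and the foo_* names are assumptions based on the MR description, not copied from it.)

```c
#include "pipe/p_screen.h"

/* Hypothetical driver screen, reduced to the bits the sketch needs. */
struct foo_screen {
   struct pipe_screen base;
   int fd;   /* DRM fd the screen was created from, or -1 for swrast */
};

static int
foo_screen_get_fd(struct pipe_screen *pscreen)
{
   struct foo_screen *screen = (struct foo_screen *)pscreen;

   /* Exposing whether a real DRM device backs the screen is what lets
    * the frontend derive a sane dmabuf default instead of hardcoding
    * it per driver. */
   return screen->fd;
}

/* wired up at screen creation: screen->base.get_screen_fd = foo_screen_get_fd; */
```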
16:51 gfxstrand: The dma_fence design itself is fine. It's designed that way for very good reasons. There are problems we need to solve but they're more around how things have become tangled up inside drm than they are about dma_fence.
16:54 bbrezillon: DemiMarie: if you have a way to swapout some memory accessed by in-flight jobs, you might be able to unblock the situation, but I'm sure this 'no allocation in the scheduler path' rule is here to address the problem where a job takes too long to finish and the shrinker decides to reclaim memory anyway.
16:56 bbrezillon: I think the problem is that drm_sched exposes a fence to the outside world, and it needs a guarantee that this fence will be signaled, otherwise other parties (the shrinker) might wait for an event that's never going to happen
16:56 gfxstrand: Yup
16:57 bbrezillon: that comes from the fact that it's not the driver fence that's exposed to the outside world, but an intermediate object which is indirectly signaled by the driver fence, and that driver fence is only created later on, when the scheduler calls ->run_job()
16:58 gfxstrand: Once a fence has been exposed, even internally within the kernel, it MUST signal in finite time.
16:59 gfxstrand: If you allocate memory, that could kick off reclaim which can then have to wait on the GPU and you're stuck.
17:00 bbrezillon: so the issue most drivers have, is that they allocate this driver fence in the ->run_job() path with GFP_KERNEL (waitable allocation), which might kick the GPU driver shrinker, which in turn will wait on the fence exposed by the drm_sched, which will never be signaled because the driver is waiting for memory to allocate its driver fence :-)
17:01 bbrezillon: what gfxstrand said :-)
17:05 robclark: embedding the hw fence in the job struct is probably easy enough to avoid that.. but then when you start calling into other subsys (iommu, runpm, etc) it starts getting a bit more terrifying
17:29 bbrezillon: robclark: yeah, that's actually where the whole discussion started. I was trying to see if we could pass a custom allocator to the pgtable/iommu subsystem for page table allocation, so we can pre-allocate pages for the page table update, and avoid allocation in the run_job() path
17:31 bbrezillon: didn't really think of the runpm stuff, but if allocations can happen in the rpm_get_sync() path, that will be challenging too...
17:31 bbrezillon: I mean, blocking allocations, of course
17:32 robclark: bbrezillon: maybe spiff out iommu_iotlb_gather a bit more.. to also handle allocations
17:32 bbrezillon: yep, was pestering robmur01 with that yesterday :-)
17:32 vsyrjala: at least acpi is absolutely terrifying in terms of doing memory allocations in runtime pm paths/etc.
17:32 robclark: the gather struct is already used to defer freeing pages to optimize tlb flushing
17:33 bbrezillon: the other option would be to just re-implement the page table logic
17:34 robclark: that is something I'd prefer avoiding
17:35 bbrezillon: which we might have to do if we want to use some fancy TTM helpers and get advanced memory reclaim involving reclaim of page tables, not just of the memory backing GEM objects (didn't really check what the TTM TT abstraction looks like)
17:35 robclark: I did have a branch somewhere a while back that plumbed the gather struct more in the map path (since map can also trigger free's and tlb flushes)
17:36 robclark: we don't need to store pgtables in vram, so I don't think that is useful
17:36 robclark: (and not really sure how good ttm is in general with reclaim)
17:38 bbrezillon: don't know how good TTM reclaim is, but I'm sure panfrost reclaim is not great :-)
17:38 robclark: does panfrost have reclaim other than just madv?
17:39 bbrezillon: nope
17:40 bbrezillon: just ditched the whole reclaim stuff in pancsf, hoping someone could come up with a good reclaim implementation for new drivers :-)
17:40 bbrezillon: and then I realized TTM had some of that
17:40 robclark: so I did add common lru and lru iteration
17:40 robclark: drm_gem_lru_scan()
17:41 bbrezillon: that's a good start
17:41 robclark: use that and reclaim is mostly not too bad except random places that might allocate memory
17:41 bbrezillon: I'll have a look, thanks for the pointer
17:42 robclark: _but_ I'm not doing iommu map/unmap from scheduler like a VM_BIND impl would do.. that kinda forces the issue of hoisting allocation out of io-pgtable.. but it is a useful thing to do because we can be more clever about tlb invalidates that way
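(A rough sketch of using the drm_gem_lru helpers robclark points at above; drm_gem_lru_scan()'s exact signature has moved around between kernel versions, so the arguments below, and every foo_* name, are assumptions rather than a copy of any driver.)

```c
#include <drm/drm_gem.h>
#include <linux/shrinker.h>

/* Hypothetical per-device state: one LRU of objects that can be purged.
 * At init: drm_gem_lru_init(&foo_dev.purgeable, &foo_dev.lru_lock); and
 * objects are added with drm_gem_lru_move_tail(&foo_dev.purgeable, obj)
 * whenever they become purgeable (e.g. marked MADV_DONTNEED). */
static struct foo_device {
	struct mutex lru_lock;
	struct drm_gem_lru purgeable;
} foo_dev;

static bool foo_purge(struct drm_gem_object *obj)
{
	/* Driver-specific: drop the backing pages and return true on
	 * success. Must not wait on unsignaled fences here. */
	return true;
}

static unsigned long
foo_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
{
	unsigned long remaining = 0;

	/* Assumes the drm_gem_lru_scan() variant that takes a "remaining"
	 * out-parameter; older kernels only take lru, count and callback. */
	return drm_gem_lru_scan(&foo_dev.purgeable, sc->nr_to_scan,
				&remaining, foo_purge);
}
```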
17:43 bbrezillon: robclark: just curious, where do the page tables live if they're not in vram?
17:43 robclark: there is no vram ;-)
17:43 robclark: it is all just ram
17:44 bbrezillon: yeah, I mean, it's the same on pancsf
17:44 bbrezillon: but it's still memory that's accessed by the GPU
17:44 robclark: not _really_
17:45 bbrezillon: okay, the MMU in front of the GPU :)
17:45 robclark: it is memory accessed by the (io)mmu which might happen to be part of the gpu
17:45 robclark: right
17:45 bbrezillon: but I think I get where you were going with 'we don't need to reclaim pgtable mem'
17:46 bbrezillon: tearing down a mapping will automatically release the pgtable memory, since it's all synchronous and dealt with on the CPU side
17:48 robclark: hmm, that is kinda immaterial.. we don't need the pgtable to be in special memory that isn't system memory.. but you can still get into deadlock, because allocations that don't set GFP_ATOMIC or similar flags that allow the allocation to fail can recurse into shrinker
17:48 bbrezillon: sure, we still need to pre-allocate
17:48 robclark: yup
17:49 bbrezillon: guess the question is, should we count pgtable memory in the shrinker
17:49 robclark: (or, you do for the VM_BIND case.. if you do iommu map somewhere else where it is safe to allocate that is fine)
17:49 bbrezillon: because currently we don't
17:49 robclark: don't count.. but that isn't the problem
17:50 robclark: the problem is that fence signals allows other pages to be reclaimed
17:50 robclark: but the allocation of pages for pgtable can trigger shrinker which could depend on other gpu pages to become avail to free
17:51 robclark: so, one idea.. which 90% solves it (and at least reduces the # of pages you need to pre-allocate)
17:51 bbrezillon: sure, I'm just thinking about why TTM keeps track of page table memory in its reclaim logic. The alloc in the signalling path is completely orthogonal and needs to be addressed for async VM_BIND
17:51 robclark: hmm.. or nm.. I was going to say do iommu map synchronously but unmap from run_job(). but that doesn't quite work
17:52 robclark: TTM is probably doing that because if you have vram (dGPU) you have pgtable in vram
17:52 bbrezillon: unmap might need to allocate too
17:52 robclark: reading pgtable over pci is not going to work out great
17:53 robclark: right, both map and unmap can free and alloc... but limited cases so you probably could put an upper bound on # of pages you'd need to pre-allocate
17:56 bbrezillon: there's an upper bound for map operations too. I mean, in addition to what you'll always need for the map anyway (the more you map, the more levels of page tables you'll have to pre-allocate, but compared to what you'll need for the map operation itself, it should be negligible)
17:57 bbrezillon: so maybe allocating for the worst case is not such a big deal
17:58 bbrezillon: I guess you can run into problems if you have a lot of small async map/unmap operations, because then the number of pages you allocate for the worst case can be much bigger than the number of pages you'd allocate if you knew the state of the VA space
17:59 robclark: probably don't free pages you didn't happen to use for current map/unmap and keep them avail for next time?
18:11 bbrezillon: sure, we can keep a pool of free pages around to speed up allocation
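(A concrete sketch of that idea: reserve the worst case up front where blocking is allowed, hand pages out from the pool in the signalling path, and keep the leftovers for the next operation. All foo_* names are invented for illustration.)

```c
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>

/* Hypothetical pool of pages reserved for page-table updates. */
struct foo_pt_pool {
	struct list_head pages;	/* pages linked via page->lru */
	unsigned int count;
};

/* Called at VM_BIND ioctl time, where blocking allocations are fine:
 * top the pool up to the worst case the queued map/unmap could need. */
static int foo_pt_pool_reserve(struct foo_pt_pool *pool, unsigned int worst_case)
{
	while (pool->count < worst_case) {
		struct page *p = alloc_page(GFP_KERNEL);

		if (!p)
			return -ENOMEM;
		list_add(&p->lru, &pool->pages);
		pool->count++;
	}
	return 0;
}

/* Called from the run_job()/signalling path: never allocates, only hands
 * out pages reserved earlier. Unused pages simply stay in the pool for
 * the next operation instead of being freed. */
static struct page *foo_pt_pool_get(struct foo_pt_pool *pool)
{
	struct page *p;

	if (WARN_ON(list_empty(&pool->pages)))
		return NULL;
	p = list_first_entry(&pool->pages, struct page, lru);
	list_del(&p->lru);
	pool->count--;
	return p;
}
```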
18:26 daniels: I'm aware our shared runners are completely starved; this is exacerbated by the job distribution being unfair, which I'm typing up a patch for
18:53 daniels: ok, hopefully that's done now
19:14 DemiMarie: bbrezillon gfxstrand: the idea I had is that the shrinker could call some sort of MMU notifier callback, which is not allowed to block or fail. That callback is responsible for unmapping the needed MMU/IOMMU entries, flushing any necessary TLB, and canceling any not-yet-submitted jobs. Jobs that are already in progress will just fault when they try to access paged-out memory, at which point it is up to the GPU driver to recover.
19:14 DemiMarie: Of course, this is equivalent to requiring all GPUs to support pageable memory on the host side.
19:20 DemiMarie: More generally, having memory reclaim blocked on userspace-provided shaders seems very unsafe. I’m curious why drivers are not required to pin all memory that a shader might have access to, unless the GPU supports retryable page faults.
19:26 robclark: gpu faults because the system is overcommitted on memory isn't going to be hugely popular ;-)
19:26 robclark: I mean, your average 2GB or 4GB chromebook is under constant memory pressure ;-)
19:27 robclark: demand-paging on gpu side is something that some gpu's could do.. although compared to CPU where you are stalling one task, stalling the gpu for a fault is stalling 100's or 1000's of tasks.. so not sure if that is exactly great
19:30 DemiMarie: robclark: what about requiring memory to be pre-pinned (and marked unavailable for shrinking)? Also, can’t even `GFP_KERNEL` fail?
19:32 robclark: GFP_KERNEL in practice can't fail for small allocations, IIRC
19:34 robclark: you can in the shrinker skip over memory that has unsignaled fences.. and this is probably what you want to do for early stages of shrinking. But under more extreme memory pressure you want to be able to wait for things that will become avail to reclaim in the near future
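(A sketch of that staged behaviour using the stock dma_resv helpers; foo_try_reclaim() and the usage/timeout choices are purely illustrative.)

```c
#include <drm/drm_gem.h>
#include <linux/dma-resv.h>
#include <linux/jiffies.h>

/*
 * Early shrinker passes skip objects that are still busy on the GPU, so
 * light memory pressure never stalls on GPU work; a later, more desperate
 * pass may do a bounded wait for the fences to signal before reclaiming.
 */
static bool foo_try_reclaim(struct drm_gem_object *obj, bool can_wait)
{
	if (!can_wait)
		return dma_resv_test_signaled(obj->resv,
					      DMA_RESV_USAGE_BOOKKEEP);

	/* Heavier pass: bounded wait so we eventually make progress. */
	return dma_resv_wait_timeout(obj->resv, DMA_RESV_USAGE_BOOKKEEP,
				     false, msecs_to_jiffies(10)) > 0;
}
```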
20:18 airlied: bbrezillon: I think the latest nouveau bits have refactored that or are in the process of refactoring
20:23 airlied: DemiMarie: pin all the things isn't really a winning method, eventually everyone pins all of memory with chrome tabs
20:24 airlied: dma-fence is pretty much pin all memory until a shader has finished with it, and you know it's finished when it signals a dma-fence
20:35 airlied: dakr: ^ probably should read above just to reconfirm
21:04 dakr: airlied, bbrezillon: Yes, it's not even entirely fixed in V2.
21:04 DemiMarie: airlied: So there are a couple problems there.
21:04 DemiMarie: First is that a shader can loop forever. Timeout detection can handle that, but it runs into another problem, which is that resetting the GPU denies service to other legitimate users of the GPU.
21:05 airlied: DemiMarie: yes welcome to GPUs
21:05 DemiMarie: The second is that some workloads (notably compute) have legitimate long-running shaders.
21:05 dakr: In V2 I still have the page table cleanup in the job_free() callback. The page table cleanup needs to take the same lock as the page table allocation does. Since job_free() can stall the job's run() callback, this is still a potential deadlock.
21:05 dakr: I'll fix this up in V3.
21:06 airlied: DemiMarie: for compute to do long running you have to have page faults
21:06 airlied: at which point you don't need to wait for dma fences in theory
21:06 DemiMarie: airlied: which GPUs support page faults?
21:06 airlied: so compute jobs that are long running are not meant to use dma-fences
21:07 airlied: DemiMarie: for compute jobs, I think all current gen, most last gen
21:07 DemiMarie: the other option would be to use IOMMU tricks to remap the pages behind the GPU’s back
21:07 airlied: not sure where it falls over
21:07 DemiMarie: airlied: what about for graphics jobs?
21:07 airlied: nope, graphics jobs aren't pagefault friendly
21:07 DemiMarie: airlied: why?
21:07 airlied: and you don't really want to take a pagefault in your fragment shader
21:07 airlied: too many fixed function pieces
21:08 DemiMarie: and those cannot take page faults?
21:08 airlied: eventually they might get to where it could work, but it will always be horrible
21:08 airlied: not usually
21:08 airlied: texture engines and ROPs generally
21:08 DemiMarie: what about letting the IOMMU remap pages transparently to the GPU?
21:10 airlied: don't think we always have an iommu, and I'm not sure if you could do that in any race free way
21:11 DemiMarie: If the GPU could retry requests one could use break-before-make
21:12 DemiMarie: airlied: GPUs really need to get better at hostile multi-tenancy
21:12 airlied: they have been getting better, just not sure when they'll be finished
21:13 DemiMarie: what progress has been made?
21:16 airlied: you couldn't even pagefault a few gens ago :-)
21:18 DemiMarie: If I were designing a GPU I would be very tempted to have each shader core run its own little OS (provided by the driver) with full interrupt and exception handling support.
21:21 DemiMarie: Or at least have the host be able to migrate pages on the GPU.
21:46 robclark: then you'd have llvmpipe :-P
21:46 robclark: but I wouldn't share a gpu between hostile tenants
21:47 dottedmag: DemiMarie: don't you have _special_ GPUs for hostile multitenancy?
21:48 robclark: there is server class stuff that supports sr-iov
21:48 robclark: I'm sure there is a whole big class of u-arch info leak issues hiding there ;-)
21:55 airlied: generally they are often compute only
21:56 DemiMarie: robclark: personally, I would be fine with using LLVMpipe, but unfortunately for Qubes users, the GUI toolkit and application writers are not.
21:57 DemiMarie: dottedmag robclark airlied: Intel Gen12+ iGPUs support SR-IOV, though driver support for it is not (yet) upstreamed. That said, Qubes OS needs to work (and have decent performance) even without such hardware.
21:58 DemiMarie: Just how long does it take to reset a GPU? Because at least Apple M1+ GPUs reset so quickly that one could reset after every frame and still have a usable desktop.
22:01 robclark: probably on the order of a few ms .. so might be ok for desktop workloads but probably not for games.. I don't have #'s for reset but resuming the gpu is ~1.5-2ms for modern adrenos..
22:03 robclark: if resetting the gpu was the fastest way to do it then qcom wouldn't have this "zap" shader mechanism to take the gpu out of protected mode since same information leak concerns there
22:14 agd5f: DemiMarie, All gfx9 derived GPUs can support recoverable GPU page faults, but that was dropped in gfx10 and newer because it takes a lot of die area and the performance generally makes games unusable. If you want fast games, everything needs to be resident
22:15 agd5f: you can also preempt to deal with long shaders
22:17 agd5f: if there is memory pressure, stop the jobs, deal with the pressure, let them run again
22:36 DemiMarie: agd5f robclark: thanks for taking your time to answer my questions!
22:37 DemiMarie: airlied dottedmag too
22:38 robclark: np
23:14 anholt_: daniels: I'm supposed to be off work today, but I've taken down the swrast runners again now. We're going to need someone competent to maintain them if we're going to keep using them, so we're asking around, but we'll probably need to move the load back to equinix for a bit. :/