01:20RAOF: Hm. Poking around our multi-GPU output support code has me once again annoyed at the "import EGLImage -> render to a fullscreen rect" pipeline that is required to migrate an external dmabuf into scanoutable memory.
01:22RAOF: It looks like gbm_bo_import almost does enough to render this unnecessary, but USE_SCANOUT doesn't actually change behaviour, just checks afterwards whether the bo is scanoutable.
01:23RAOF: It looks like all the infrastructure is there to make gbm_bo_import do a blit into scanoutable memory if it needed to.
01:25RAOF: But perhaps a better API would take an in_fence and associate an out_fence with the gbm_bo?
01:30RAOF: Is that something that would be vaguely sane to implement? Something like an ALLOW_MIGRATION flag to gbm_bo_import, valid with GBM_BO_IMPORT_FD_MODIFIER (maybe _FENCE)?
02:48DemiMarie: Is Nouveau known to be buggy and crashy in general?
03:02airlied: DemiMarie: in a lot of cases yes
03:02DemiMarie: airlied: is that because the performance was so horrible nobody considered it to be worth working on?
03:03DemiMarie: and is this expected to improve in the future?
03:03airlied: maybe, and maybe? :)
03:04airlied: there's a lot of nvidia hw, and lots of quirks and lots of corner cases
03:04airlied: along with a lack of developer manpower
03:04psykose: something something 'nvk+zink is the future' etc
03:05DemiMarie: How much of that will be dealt with by the GSP firmware in the future?
03:05airlied: depends on what is causing the crashes :)
03:05airlied: but most of the lower level hw stuff should be, the GL driver not so much
03:06DemiMarie: I ask because a Qubes user just reported a nouveau crash and my recommendation was to disable the nvidia card if they could get away with it.
03:06DemiMarie: Is the plan to use NVK + Zink?
03:07psykose: i don't think there is any "plan" you can count on
03:08DemiMarie: These are kernel crashes
03:12airlied: recent kernels did see a bunch of stabilisation work, so the kernel recovered better from userspace screwing up
03:13airlied: but not sure what the crash is to classify it
03:16DemiMarie: https://pastebin.com/iPWeZrc9
03:17DemiMarie: Kernel NULL pointer dereference in interrupt context
03:18airlied: yeah that's not good one
03:18DemiMarie: what is?
03:18DemiMarie: the crash?
03:33airlied: yeah it does seems to more kernel internals problem
03:39DemiMarie: is that enough to (hopefully) get the bug fix?
03:39DemiMarie: *fixed
03:43airlied: not likely
03:45airlied: if there's a reliable reproducer then it's a bit more likely someone might look
10:24karolherbst: DemiMarie: yeah soo.. there is a memory corruption
10:25karolherbst: I'm trying to figure this one out for a week already...
10:25karolherbst: something really trashes kernel memory and I still have no idea what exactly
10:25karolherbst: all the memory debugging tools give me nothing and kmsan kernels don't boot
10:26karolherbst: I also see `kmalloc` crashing on random pointers, it's really worrying, but something is seriously wrong and I don't know what yet
12:44sgruszka: daniels: hi, could you create ssh account for me (for drm-misc access) https://gitlab.freedesktop.org/freedesktop/freedesktop/-/issues/666 ?
12:50daniels: sgruszka: sure, done
12:51sgruszka: daniels: thank you very much!
13:00daniels: np
13:15karolherbst: jenatali: ever looked at CL subgroups, specifically how one would implement the SubgroupsPerWorkgroupId part?
13:17jenatali: No, I've done Vulkan subgroups but that's it
13:17karolherbst: I'm inclined to just ignore SubgroupsPerWorkgroupId for now...
13:21karolherbst: mhhh.. uhh...
13:22karolherbst: I'm stupid... ignore me :D
13:22karolherbst: we should parse those things in vtn and not clc
13:44DemiMarie: karolherbst: this sounds like something is corrupting kernel memory via DMA. I recommend forcing everything to use shadow buffers and seeing if anything that the device should not be writing to gets written to.
13:46karolherbst: DemiMarie: I even boot with iommu enabled
13:47karolherbst: and there have been iommu related bugs at least.. not sure if a GPU could just get around it or if we map memory we shouldn't?
13:50robmur01: karolherbst: if the IOMMU is actually being used for DMA and you're not seeing any faults, I'd be inclined to suspect a bug the other way round
13:51robmur01: i.e. something in the driver on the CPU side assumes a DMA address ( == arbitrary IOMMU VA) is a physical address and ends up stomping the wrong memory that way
13:51karolherbst: yeah..... dunno really
13:52karolherbst: I just wished this would be easier to debug...
13:53robmur01: assuming the GPU supports >32-bit DMA, I'd try booting with "iommu.forcedac=1" and seeing if that makes any obvious difference
13:55robmur01: (that will move DMA addresses up to larger IOVAs that are less likely to collide with physical memory and more likely to make bugs of that type blow up visibly)
14:04DemiMarie: karolherbst: booting with IOMMU enabled isn’t enough.
14:07DemiMarie: You need to ensure that IOTLB flushes are not deferred and that sub-page regions at the start or end of a mapping are bounced via shadow buffers.
14:09DemiMarie: This is not the default because it has a severe performance penalty, but it is necessary when dealing with untrusted devices. A device that may be corrupting memory certainly qualifies, and right now that could be any device in the system.
14:10karolherbst: so what kernel parameters do I have to set to figure this out?
14:11robmur01: "iommu.strict=1" does the former, I don't think we have a way to force the full-on external-facing port behaviour in the kernel
14:13robmur01: you might need a small hack in set_pcie_untrusted() if you want to try that
14:17robmur01: (assuming that the idea of constructing entire ACPI table overrides in your initrd to do it "properly" is unappealing :D)
14:18karolherbst: well.. let's see what happens
15:05DemiMarie: karolherbst: what about adding a command-line option to treat all devices as untrusted? TrenchBoot needs this anyway.
15:06karolherbst: I just want to fix the bug I'm having
15:08DemiMarie: This should help
15:10DemiMarie: If the bug is a rogue DMA transfer, strict IOMMU invalidation will hopefully catch it.
15:11karolherbst: yeah... no idea
15:12DemiMarie: I suggest trying iommu.strict=1 and seeing if you get any faults.
15:12karolherbst: not yet at least
15:12karolherbst: but maybe I should remove iommu.forcedac=1 again
15:12karolherbst: it's a pain that it might take like 2 hours for the bug to trigger
15:14DemiMarie: Is this on i915 or Xe?
15:15karolherbst: nouveau
16:21karolherbst: mhhhhh "[ 559.500131] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP NOPTI"
16:22HdkR: That's a powerful address
16:22karolherbst: #define POISON_FREE 0x6b /* for use-after-free poisoning */
16:24karolherbst: maybe it's just a core kernel bug...
16:25karolherbst: maybe I try KASAN again, because now I have a different reproduce which might make it more reliable...
16:31mareko: I'm considering making pipe_resource* a 32-bit pointer on x86_64, using fixed high 32 bits
16:33karolherbst: what would be the expected perf increase? Kinda doesn't feel worth the effort
16:33mareko: the question is how can we reserve/allocate a 4GB range of virtual address space and then make physical allocation within that range
16:33karolherbst: mmap
16:34karolherbst: so you cut out a private hole and then manage that region yourself
16:35karolherbst: e.g. with util_vma_heap
16:36karolherbst: but it kinda doens't feel worth the effort as you will have to concat the address on the consumer side? And the memory savings don't feel like worth the effort?
16:36karolherbst: maybe a hash table would be a more robust approach here
16:36karolherbst: mhh but then comes with higher mem usages...
16:37karolherbst: potentially
16:37karolherbst: I think the main question here is, what's the problem which is solved by this
16:38mareko: we are going to have buffer lists as array of pointers in TC which are going to be large, and then we get a lot of 64-bit pointers in the gallium interface that we push into TC command buffers
16:38mareko: the goal is to save cache space
16:43karolherbst: alternatively, all pipe_resources are 0x40 aligned, with that the pointer would fit into 5 bytes on most/all consumer platforms
16:44karolherbst: ehh.. 6 I guess
16:44jenatali: Please keep in mind that Linux isn't the only Gallium consumer here
16:44karolherbst: mhh
16:46HdkR: 128-bit pointer ISAs want a word with that assumption :P
16:46karolherbst: maybe the solution here is to make all of our pointer hashing thing aware of the VM pointer size at runtime and use optimized caching/storage
16:46karolherbst: so if the VM is 48 bit, just use that instead of full 64
16:47karolherbst: alignment of the data being hashed _could_ be factored in as well
16:47karolherbst: to drop another byte if things align well
16:50karolherbst: but I don't think making the consumption of pipe_resources in general is worth the effort here of making them 32 bit pointers everywhere
16:50mareko: only in the interface
16:50mareko: you can call them 32-bit handles instead of pointers
16:52karolherbst: I think my point is rather, it makes everything more complex with gains we don't know are even worth it, because everybody will have to translate from pointer to handle and back
16:52mareko: that seems like the easy part
16:52DemiMarie: karolherbst: ugh, _that_ driver. I would not be surprised if the long-term fix is to rewrite it from scratch, preferably in Rust.
16:52karolherbst: mareko: I don't say it's hard, I say it's _annoying_
16:53mareko: I think it would be _lovely_
16:53karolherbst: I'm sure we have other places we can save more memory than that
16:54mareko: I think that's the only improvement we can make for TC
16:54karolherbst: I meant it more general
16:55robclark: mareko: alternative is big table of prsc ptrs with each prsc having a unique idx into the table?
16:55robclark: you should then be able to get away with less than 32b to id a prsc
16:55mareko: each pipe_resource already has a unique index
16:56mareko: a table adds cache usage
16:57mareko: and an indirection
16:57karolherbst: you already have that by cutting the high bits
16:57karolherbst: because you have to store and read those from somewhere anyway
16:57robclark: depends on when and where you deref it via the table
16:57mareko: that's not an indirection
16:58karolherbst: it kinda is
16:58mareko: load(load()) is an indirection
16:58karolherbst: okay.. so load(load() + offset) isn't?
16:58DemiMarie: TC = ?
16:58karolherbst: threaded context
16:58mareko: more like load(low32 | high32)
16:58karolherbst: and where do you get the high32 from?
16:58karolherbst: they are not magically there if you need them
16:59mareko: L1 cache
16:59hch12907: high32 is small enough to not need a load() I think
16:59DemiMarie: How about using relative addresses?
16:59karolherbst: so you'd pin that value into L1?
17:00mareko: I don't follow
17:00DemiMarie: Are these CPU or GPU addresses?
17:00karolherbst: CPU
17:00karolherbst: how do you make sure the high bits stay in L1?
17:00DemiMarie: I think the idea is that high32 would be stored in one place and used a lot, so it should be in the L1 cache most of the time.
17:00karolherbst: I dunno, this feels like premature optimization without even knowing the gains
17:00karolherbst: what if it's <1%
17:00mareko: there is a massive perf difference between an indirection table and a single value in memory that's equal for all resources
17:00karolherbst: this kinda feels like not worth the effort honestly
17:01karolherbst: oh sure
17:01karolherbst: but you still have to construct the actual pointer via data in some memory location
17:01mareko: no
17:01DemiMarie: Why are these large?
17:01DemiMarie: Lots of buffer objects?
17:01mareko: the compiler will CSE loading the high bits
17:01karolherbst: it still has to load the high bits...
17:02mareko: there is probably 1000 other loads that have nothing to do with this
17:03mareko: it's a win for cache usage
17:03mareko: TC braches will be smaller and TC buffer lists will be smaller
17:03mareko: *batches
17:03karolherbst: if TC wants to do anything smart internally, that's fine, I just say it's not worth the effort making life harder for everybody else
17:04karolherbst: I already suggested that high bits could be cut out generally in hash tables where pointers are keys or something
17:04karolherbst: and then we don't talk about 64 vs 32 anymore, but more like 48 vs 32
17:05karolherbst: could condense it even more if you try hard enough
17:05karolherbst: could align allocations to 0x100 or even 0x10000
17:05karolherbst: which is still better than cutting high bits
17:06mareko: sounds the same to me
17:06karolherbst: no, because you can do the magic transparent to everybody
17:06mareko: pipe_state.buffer = wrap(resource); .... unwrap(pipe_state.buffer);
17:06karolherbst: yeah, it's annoying
17:07mareko: lol
17:09karolherbst: if you'd have data showing "this saves us 10%+ memory usage" I'd reconsider, but without data I say it's not worth it
17:09mareko: only perf data
17:13robclark: mareko: if driver tells TC sizeof(struct drv_pipe_resource) then they could be suballocated out of a big array of drv_pipe_resource
17:13mareko: robclark: it needs to be an accessible pointer
17:13robclark: then you could easily map btwn idx and ptr.. and not even need special alignement of that block
17:14robclark: run but you can easily covert idx to ptr or visa versa via ptr math
17:14mareko: ok
17:15robclark: and that only touches drivers that do use TC, so at least it isn't treewide
17:15jenatali: You'd probably still want to reserve the array instead of committing it upfront and then commit bits of it as you allocate more resources
17:15robclark: just need to convert TC drivers to use new allocator for their drv_resource structs
17:15robclark: jenatali: yeah, allocate it as a big bunch and rely on unused parts not getting faulted in ?
17:16jenatali: At least on Windows, there's a difference between commit and working set. Working set is what you touch, where malloc adds to commit. You can also reserve just VA without any commit backing it
17:16jenatali: I don't know too much about Linux in this space
17:16karolherbst: kinda the same
17:17robclark: yeah, you might want to allocate it via mmap instead of malloc
17:17mareko: 32-bit pointers seem the least intrusive
17:17robclark: with table, you could probably get away w/ 16b
17:18mareko: oh we do have much more buffers than 2^16
17:18karolherbst: could also split it in two 16 bit values and if we run out of space you just allocate a second table and use the higher bits to offset the table
17:19karolherbst: but mmaping space is annoying because than you increase VM usage quite a bit
17:19mareko: no tables
17:19karolherbst: yeah...
17:19karolherbst: I'd rather aligned_alloc all pipe_resources with a suitable alignment
17:19karolherbst: so you only have 32 bits used
17:20karolherbst: and then you can handle it all inside TC
17:23karolherbst: but I guess that has other drawbacks...
17:24mareko: yes, sparsely mapped page tables
17:59karolherbst: DemiMarie: finally... https://gist.githubusercontent.com/karolherbst/6181e846a6f59dff98f3c70efe4633fb/raw/7aeba38388414e03b543f9364d01be7b46b27fbe/gistfile1.txt
18:00karolherbst: that explains so much...
18:00karolherbst: this is like 5 bugs in one I think..
18:40DavidHeidelberg[m]: MrCooper: hey! The Fedora changes. You want to keep the custom builds? I mean, they can be always added when new version is needed. What do you think?
19:36airlied: uggh is the ppc64le ci buildroot broken?
19:36airlied: oh maybe it was temporary
19:51karolherbst: wasn't there an option in the kernel to trace who was calling a kworker thing?
21:21mareko: "git log DEAD" in Mesa opens a time machine
21:32alyssa: mareko: woah, so it does, neat!
21:32alyssa: I don't think I realized Emma went that far back =D
21:42FLHerne: If you mean the name, isn't that just .mailmap ?
21:44FLHerne: if you mean the person, then I didn't either, so my previous line is kind of pointless, sorry
21:45mareko: that's because the commit sha1 begins with dead
21:45mattst88: oh, hahaha
21:56idr: SiS driver. Dang.
22:16gfxstrand: dcbaker: Little ping on meson + crates. Are we getting to where I should be trying to use a magic meson branch again and drop my mesonified crates?
22:17gfxstrand: dcbaker: I'm eyeing some other really nifty crates I'd like to use like enum_map
22:18alyssa: FLHerne: the person
22:19alyssa: idr: I don't even know what SiS is =D
22:20idr: alyssa: https://youtu.be/OB9VVTJ48gE
22:20idr: "Over 7 million souls suffered"
22:20alyssa: wild
22:21dcbaker: gfxstrand: The plan is to start landing pieces of it. I have a series that adds the core of the translation layer (lexers, parsers, some meson AST builder helpers, etc). I *think* that part will land this week, there's no serious review requests left on it
22:21gfxstrand: dcbaker: Cool!
22:22dcbaker: Then the plan is to start adding/fixing stuff piece by piece
22:22dcbaker: There a gstreamer/gnome dev who's been doing a ton of the followup work already
22:22dcbaker: so I'm hopeful by 1.2 we'll have at least some experimental support
22:24karolherbst: ohh nice..
22:24karolherbst: I want to use syn :3
22:26gfxstrand: dcbaker: When is 1.2?
22:26dcbaker: there's no set date yet
22:26dcbaker: and there's still a lot of stuff that devs want merged that's not even reviewed yet, so...
22:27gfxstrand: So, if I'm planning to merge NVK and NAK into Mesa main yet this year, will the stuff I need be in a released Meson version?
22:27dcbaker: Usually we have a release ~every two months
22:28dcbaker: and the last release was in April...
22:28dcbaker: I'd guess 1.2 will freeze by the end of the month
22:41gfxstrand: Ok, that sounds reasonable.