10:47MandiTwo: Hi! I'm having an issue with mesa since it was split into mesa and mesa-amber. I have an Intel HD Graphics 4600 in my notebook, and since the split every compositor I've tried, whether X or Wayland, refuses to start any graphical application using it. Is there another driver I could use? I think crocus is the successor, right?
10:53karolherbst: yeah, you'll need to use crocus. Wasn't it enabled/installed?
11:29MandiTwo: karolherbst: it is enabled and installed, but the compositors fail to start using it. I can make sway work on mesa 25 by using pixman, but I never got beyond that
11:58dj-death: jnoorman: about !35252, is adding BASE to load/store_ssbo something people agree on?
11:58dj-death: jnoorman: can I pull that in and drop my intel intrinsic variants? :)
13:17jnoorman: dj-death: not sure yet! I pinged in !34344, so let's see if people agree.
13:23dj-death: jnoorman: thanks!
13:24dj-death: jnoorman: on Intel we could add the BASE for load_ubo as well
14:42stsquad: I've noticed that my newer vkmark test images fail with a host libvirglrenderer 1.1.0-2 - should mesa be introducing non-backward compatible changes to venus?
14:48stsquad: it looks like the commands failing are VIRTIO_GPU_CMD_RESOURCE_UNREF, VIRTIO_GPU_CMD_RESOURCE_CREATE_BLOB, VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB and VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB
14:49digetx: the venus protocol maintains compatibility and should work with libvirglrenderer 1.1.0; I'll test with an older virglrenderer
14:49digetx: I assume you have tested with the latest libvirglrenderer and it works?
14:50stsquad: digetx: the test images used by ./pyvenv/bin/meson test --setup thorough func-aarch64-aarch64_virt_gpu still work with the system virglrenderer - the newer image does indeed work with a hand-built libvirglrenderer (07982b48d1967a - Uprev Mesa to 65e18a84944b559419aceaf2083936cf68ac3e79)
14:51stsquad: digetx: the mesa in the new test images is 25.0.6
14:53digetx: haven't seen that problem, but I'm using the latest virglrenderer all the time
14:55stsquad: digetx: the old images were mesa 25.0 with vkmark 2025.01 - the other change is that I had to build vkmark from HEAD
14:56stsquad: there isn't much in the vkmark series:
14:56stsquad: https://paste.rs/1izYK
14:56stsquad: ^ the one patch I needed was "display: Properly handle Vulkan errors during probing"
14:57digetx: the only known critical mesa/venus issue today is the gcc-15 bug https://gitlab.freedesktop.org/mesa/mesa/-/issues/13242
15:10stsquad: the image is built with gcc-14 so hopefully this isn't the trigger
15:21robclark: karolherbst: enqueueReadBuffer seems to be really bad.. and it's all memcpy within rusticl. Something you're aware of?
15:21karolherbst: yeah
15:21karolherbst: I need to special-case situations where there is just one device
15:22karolherbst: because this code is supposed to transparently migrate the contents of buffers in multi-device contexts, and there isn't really a sane way to map resources if you have more than one device
15:23karolherbst: the other issue is that directly mapping pipe_resources is a bit of a pain, because in CL mapping is inherently an asynchronous operation
15:23karolherbst: meaning you get the result pointer very early, and the content only becomes available at some later point
15:24karolherbst: so you end up with a resource that's mapped while it's still being used later
15:24karolherbst: and some drivers don't really like this
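For context, the asynchronous behaviour being described is visible in the CL API itself: a non-blocking clEnqueueMapBuffer returns the mapped pointer immediately, and its contents only become defined once the map event completes, so the mapping has to stay usable while earlier queued work is still running. A minimal sketch (the helper name is illustrative; error handling trimmed):

    #include <CL/cl.h>

    /* Non-blocking map: the pointer comes back right away, but the data it
     * points at is only defined once map_ev has completed. */
    static void *map_when_ready(cl_command_queue queue, cl_mem buf, size_t size)
    {
        cl_int err;
        cl_event map_ev;
        void *ptr = clEnqueueMapBuffer(queue, buf, CL_FALSE /* blocking_map */,
                                       CL_MAP_READ, 0, size,
                                       0, NULL, &map_ev, &err);
        if (err != CL_SUCCESS)
            return NULL;
        clWaitForEvents(1, &map_ev);   /* only now are the mapped contents valid */
        clReleaseEvent(map_ev);
        return ptr;                    /* unmap later with clEnqueueUnmapMemObject() */
    }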
15:24robclark: hmm, at least iGPU would be happy enough w/ persistent mappings
15:25karolherbst: yeah...
15:25karolherbst: I was considering being able to use different impls for certain things internally
15:25karolherbst: so I can properly abstract between those use cases
15:26karolherbst: persistent mappings are fine, but then that kinda broke with dGPU drivers and zink
15:27karolherbst: robclark: the thing is.. I don't need a persistent mapping, I just need to be able to tell a driver to map at a specific virtual address :) the mapping can go away in the meantime as long as the address stays reserved
15:27karolherbst: then I wouldn't even need the copy, because then I could just map the resource where the current content is, even in multi-device contexts
15:27karolherbst: anyway.. many ideas, not enough time
15:28karolherbst: the current code at least works 🙃
15:28robclark: karolherbst: extending transfer_map for this wouldn't be all that hard
15:28robclark: as long as rusticl takes care of reserving the VA range
15:28karolherbst: no it probably wouldn't, it's just a lot of work
15:28karolherbst: could make it opt-in for drivers
15:29karolherbst: but then I need a solution for drivers that haven't implemented it yet
15:29karolherbst: though I think all drivers do the mapping with mmap internally, and they could be given a target address
15:29robclark: is it something zink can do on vk? If so, why bother with a backwards-compat path
15:29robclark: right
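The part rusticl would own in that scheme is just the address reservation: grab a virtual address range without backing it, hand the address to the driver, and let the driver's internal mmap land there. A sketch of only that frontend piece, since the gallium hook that would consume the reserved address doesn't exist today (the function name is illustrative):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Reserve (but don't back) a VA range that a driver could later fill with
     * something like mmap(addr, ..., MAP_FIXED, drm_fd, bo_offset). The
     * reservation is dropped again with munmap(addr, size). */
    static void *reserve_va_range(size_t size)
    {
        void *addr = mmap(NULL, size, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return addr == MAP_FAILED ? NULL : addr;
    }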
15:29stsquad: digetx: is it possible to run venus with a pure llvmpipe backend? I'm just working out if there is a better way to do the buildroot testcase for vkmark so it doesn't rely on host GPU hardware
15:29karolherbst: robclark: no idea :)
15:31karolherbst: robclark: but there is also clEnqueueReadBuffer, which doesn't have to bother with this nonsense, and I do hope that nothing perf-critical relies on mapBuffer being fast :D
15:32robclark: clEnqueueReadBuffer is specifically what I was looking at right now.. just looking at things where clpeak is significantly worse than the closed CL driver
15:33robclark: (idk, maybe there are better benchmarks than clpeak.. but gotta start somewhere)
15:33karolherbst: yeah fair
15:33karolherbst: I'd start with the alu perf tho
15:33karolherbst: unless that's already all faster :D
15:34karolherbst: mhh...
15:35karolherbst: robclark: soo.. clEnqueueReadBuffer being slow is a bit of a driver problem. Rusticl only maps the resource and copies it to the destination
15:35karolherbst: _however_
15:35robclark: alu is better than I expected, even with all the extra load/store_global pointer math
15:35karolherbst: I was considering whether e.g. the target address could be imported via resource_from_user_memory, and then doing the copy on the GPU :)
15:36karolherbst: but that will require drivers to map arbitrary addresses
15:36digetx: stsquad: venus is tested with lavapipe in fdo CI, though CI is headless; I assume vkmark should just work with lavapipe
15:37karolherbst: so if freedreno fails resource_from_user_memory with anything not page-aligned, that might be a good place to start
15:37robclark: karolherbst: yeah, userpointer isn't supported on the kernel side yet, and would be a heavier lift
15:37karolherbst: ahhh
15:37karolherbst: yeah...
15:37karolherbst: it's going to be required for good perf tho
15:37karolherbst: long-term
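Roughly what the "import the target address and copy on the GPU" idea would look like on the gallium side: wrap the application's destination pointer as a buffer via resource_from_user_memory and let resource_copy_region do the transfer, falling back to the map + memcpy path whenever the driver rejects the pointer (unaligned, no userptr support, etc.). A sketch only; the helper name, format choice, and fallback policy are illustrative, not rusticl's actual code:

    #include <stdbool.h>
    #include "pipe/p_context.h"
    #include "pipe/p_screen.h"
    #include "util/u_box.h"
    #include "util/u_inlines.h"

    /* Try to copy `size` bytes at `offset` of `src` into `dst_ptr` on the GPU.
     * Returns false if the driver can't wrap the user pointer. */
    static bool
    read_buffer_on_gpu(struct pipe_context *ctx, struct pipe_resource *src,
                       unsigned offset, unsigned size, void *dst_ptr)
    {
        struct pipe_screen *screen = ctx->screen;
        struct pipe_resource templ = {
            .target = PIPE_BUFFER,
            .format = PIPE_FORMAT_R8_UINT,
            .width0 = size,
            .height0 = 1,
            .depth0 = 1,
            .array_size = 1,
        };

        struct pipe_resource *dst =
            screen->resource_from_user_memory(screen, &templ, dst_ptr);
        if (!dst)
            return false;   /* fall back to transfer_map + memcpy */

        struct pipe_box box;
        u_box_1d(offset, size, &box);
        ctx->resource_copy_region(ctx, dst, 0, 0, 0, 0, src, 0, &box);

        /* The copy still has to be fenced before the app may read dst_ptr. */
        pipe_resource_reference(&dst, NULL);
        return true;
    }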
15:38karolherbst: there are also places where rusticl shadow-buffers, because mapping from user memory is required in CL
15:38digetx: stsquad: I think I was running vkmark with lavapipe a couple of months ago, let me know if it doesn't work
15:38karolherbst: and required in the most insane way: you can't fail the API request :)
15:38karolherbst: and the API puts zero requirements on the application
15:39karolherbst: but yeah.. besides that... it also depends on how resource_map is implemented in drivers
15:40karolherbst: like if the pipe_transfer already shadow-buffers the mapping + copy, and rusticl also does a copy to the final destination, then we copy a bit too often :)
15:40karolherbst: but I'd really just like the GPU to do the copy, which would then also get rid of stalls in the queue
15:41karolherbst: I think some drivers already shadow-buffer it if the resource is busy
15:42robclark: in this case, I don't _think_ we should be hitting the shadow copy path, but I'm only just starting to look at it... perf says it is all memcpy in rusticl::core::memory::Buffer::read
15:42robclark: idk if it helps, but we could do quasi-svm (i.e. gpu-allocated, but with the gpu va and cpu va set to the same address)
15:43karolherbst: well..
15:43karolherbst: the API is supposed to copy from the GPU to application memory
15:43karolherbst: so we at least can't get rid of the main copy there unless you manage to move it to the GPU, which then requires userptrs :)
15:44karolherbst: robclark: yeah, so the thing is that CL supports multi-device contexts, and memory objects allocated in those are transparently migrated between the devices
15:44karolherbst: so we'd need to do it on all drivers
15:44karolherbst: which I'm doing with proper SVM already anyway
15:44karolherbst: and the interfaces exist for rusticl to manage the virtual GPU addresses instead
15:45robclark: this feels bad enough that it's got to be like multiple extra memcpy's ;-)
15:45karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32942/diffs?commit_id=8658529e8836fc5abed3b6df0fb059ae30abf578
15:45karolherbst: with this interface I got SVM working across drivers/vendors, which is neat
15:46karolherbst: robclark: how big is the perf difference reported by clpeak?
15:46robclark: 12x
15:46karolherbst: oof
15:47karolherbst: does the prop stack support userptr?
15:47robclark: idk, it is windows
15:47robclark: so no clue
15:47karolherbst: mhhh
15:47karolherbst: I have no idea how to make clEnqueueReadBuffer faster without userptr then
15:47robclark: but the cpu should be able to saturate memory bandwidth, and I can use cached-coherent gpu buffers
15:48karolherbst: it's just a plain copy from the mapped resource to application memory
15:48robclark: ok, well I'll poke around more
15:48karolherbst: is the hw capable of doing such copies?
15:49karolherbst: or just GPU -> GPU memory ones?
15:49karolherbst: could be that "ptr::copy" is doing something slow or so...
15:49robclark: I mean it's an iGPU, so everything is visible to the cpu and gpu
15:49karolherbst: or resource_map doing something weird, but if it's all in the `ptr::copy` one.. then no idea honestly
15:50karolherbst: could also be something else, like stalling, and that's why the number is so low, but I have no good idea how to figure that out either
15:53robclark: this test doesn't even seem to launch any grids, it's just a loop of clEnqueueReadBuffer()
15:53karolherbst: yeah
15:53karolherbst: maybe it's just memcpy not being optimized for aarch64 or something silly
15:53karolherbst: which I doubt, but...
15:54karolherbst: it's properly vectorized and stuff, no?
15:54karolherbst: on x86 it shows up as memcpy_avx2 or whatever in profiles
15:55karolherbst: robclark: uhm.. are you doing an optimized release build with all the perf things enabled? A debug build might give you bad perf there
15:55robclark: hmm, it is a good question why I get __memcpy_generic().. glibc should have an optimized memcpy/memset/etc for this core.. but I don't expect that to be a 12x thing
15:56karolherbst: mhhh
15:56karolherbst: who knows, could be 12x, could be 2x
15:56karolherbst: could also be -O0 making things slow
15:56robclark: results are basically the same between debug and release mesa builds
15:56karolherbst: but yeah.. it's probably a CPU-only test, and if the CPU side is slow, then yeah...
15:56karolherbst: I see
15:58karolherbst: but yeah.. if your core has SVE it shouldn't use generic
15:58robclark: ahh, it gets quite a bit faster if I actually set FD_BO_CACHED_COHERENT
15:58robclark: :-P
15:58karolherbst: :D
15:58karolherbst: nice
15:58robclark: this is a hack tho, rusticl should be setting some bits to tell me that it wants coherent ;-)
15:59karolherbst: yeah...
15:59karolherbst: I just focused on getting things working for now, and if there are nice flags important for perf, I can figure out how to set them
15:59karolherbst: coherent resources are just a bit of a pain in mesa, because some dGPU drivers place them in system RAM
15:59karolherbst: or map_coherent needs the coherent flag at resource creation time
16:00karolherbst: coherent also requires persistent
16:01karolherbst: but maybe it's fine if I only do it for devices with caps.uma set to true
16:02karolherbst: but then again, I have no idea what the implications are of enabling coherent + persistent for every resource on uma systems
16:03robclark: bit higher cost on gpu I guess, much lower cost for cpu reads
16:03karolherbst: could also change the gallium semantics: if PIPE_MAP_COHERENT is used without PIPE_MAP_PERSISTENT / PIPE_RESOURCE_FLAG_MAP_PERSISTENT / PIPE_RESOURCE_FLAG_MAP_COHERENT, it's best-effort and drivers might ignore the flag
16:04karolherbst: what matters more is whether creating the resource with PIPE_RESOURCE_FLAG_MAP_PERSISTENT and PIPE_RESOURCE_FLAG_MAP_COHERENT adds additional cost
16:04karolherbst: (besides dGPU drivers placing them in staging/system memory)
16:06karolherbst: the one important aspect is that I don't really need a coherent mapping; the content is synced at sync points, so a barrier should do the trick to guarantee the content is visible. What matters more is that the copy is fast
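What "setting some bits" could look like in practice, following the caps.uma gating mentioned above: only request persistent+coherent storage on UMA devices, and mirror the flags at map time since a coherent mapping requires a persistent one. A sketch under those assumptions (the flag and cap names are gallium's; the helpers, format choice, and gating policy are illustrative):

    #include "pipe/p_context.h"
    #include "pipe/p_defines.h"
    #include "pipe/p_screen.h"
    #include "pipe/p_state.h"
    #include "util/u_inlines.h"

    /* Create a buffer that can be mapped cached-coherent on UMA devices. */
    static struct pipe_resource *
    create_cl_buffer(struct pipe_screen *screen, unsigned size)
    {
        struct pipe_resource templ = {
            .target = PIPE_BUFFER,
            .format = PIPE_FORMAT_R8_UINT,
            .bind = PIPE_BIND_GLOBAL,
            .width0 = size,
            .height0 = 1,
            .depth0 = 1,
            .array_size = 1,
        };
        if (screen->caps.uma)
            templ.flags |= PIPE_RESOURCE_FLAG_MAP_PERSISTENT |
                           PIPE_RESOURCE_FLAG_MAP_COHERENT;
        return screen->resource_create(screen, &templ);
    }

    /* Coherent requires persistent, so request both together when available. */
    static void *
    map_cl_buffer(struct pipe_context *ctx, struct pipe_resource *res,
                  struct pipe_transfer **xfer)
    {
        unsigned usage = PIPE_MAP_READ | PIPE_MAP_WRITE;
        if (res->flags & PIPE_RESOURCE_FLAG_MAP_COHERENT)
            usage |= PIPE_MAP_PERSISTENT | PIPE_MAP_COHERENT;
        return pipe_buffer_map(ctx, res, usage, xfer);
    }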
16:15stsquad: digetx: can you point me at the CI job so I can crib the command line ;-)
16:31digetx: stsquad: set `VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.x86_64.json` as an env var and run qemu; I checked that vkmark works with that
16:43zmike: mareko: I'm looking at st_nir_unlower_io_to_vars; is the idea just that you unset nir_io_has_intrinsics to use it?
17:41stsquad: ➜ env VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.json vkmark
17:41stsquad: fish: Job 1, 'env VK_ICD_FILENAMES=/usr/share…' terminated by signal SIGSEGV (Address boundary error)
17:41stsquad: oops
17:41stsquad: ahh, it works with vkmark master
17:54zmike: mareko: I'd like to switch to using your common io unlower pass (and delete a lot of zink code), but there's the xfb issue I mentioned
17:54zmike: can you fix that?
18:49mareko: zmike: see my comment on the MR
19:36mareko: alyssa: any chance you can ack the invariant removal?
19:38alyssa: mareko: ack