00:14 zmike: jenatali: any ballpark on a timeline (haha) to get all that fixed? I'd like to get this merged ASAP
00:14 jenatali: zmike: I made progress today, need to sync (haha) with Sil tomorrow to finish it
00:18 zmike: jenatali: alright cool, just signal (haha) me whenever
02:44 zzyiwei: Hi, I have a question regarding ANV_IMAGE_MEMORY_BINDING_PRIVATE. Since that's a private bo to work around non-ccs modifiers, is it better to relocate that to the bound device memory instead? Then all the special rejections for image aliasing can be dropped given the VkDeviceMemory now contains both the main binding and the private binding.
04:51 Lynne: ...so glsl defines f16vec4, nice
04:52 Lynne: and also a -hf suffix so you can have native 16-bit floats (e.g. 0.0hf)
04:52 Lynne: ...but no scalar 16-bit float format
04:52 Lynne: making everything mentioned incredibly useful indeed
04:54 Lynne: right, one more for the list of crimes to charge everyone involved with glsl's evolution (I refuse to call it design, no intelligent thought was involved)
05:00 Lynne: oh, it's float16_t, and you have to enable the GLSL-era AMD half float glsl extension
05:01 Lynne: which was released after f16vec4 was defined, and is not present in Khronos' official GLSL extension list
05:24 zzyiwei: The context for my question is this MR i've just sent out: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35875
06:14 xeyler: if anyone finds the time to review these patches i sent out over a week ago, i’d be grateful: https://lore.kernel.org/all/?q=20250620180258.132160-1-me%40brighamcampbell.com+OR+20250624062728.4424-1-me%40brighamcampbell.com
06:15 xeyler: if i don’t get any response in another week or so, i’ll RESEND the patches
14:37 eric_engestrom: dcbaker: ack; I think we were all happy with the solution we went with in the end though, so while it would be better if meson provided that functionality itself, I think it's not a problem if it takes a long time to get there :)
14:39 eric_engestrom: reminder for mesa devs: 25.2 branchpoint is 2 weeks from today (https://docs.mesa3d.org/release-calendar)
14:40 robclark: daniels: are the "special" SRGB fourcc's meant to show up at egl api level? `Skipping XB2�` .. I guess that is one of the special ones
14:43 daniels: robclark: no, SARGB8 and friends must be hidden away
14:44 robclark: ok
16:15 pepp: sima: wanna take a look at the updated version of https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34447 (F_DUPFD_QUERY)?
16:26 sima: pepp, did you test this on older kernels without F_DUPFD_QUERY already?
16:26 sima: I think the fallback errno should be EINVAL, everything else is unexpected and I guess you should assert on those
16:27 sima: and if errno == 0 you can rely on the result
16:27 sima: or I'm misreading the kernel code
16:33 pepp: sima: EBADF can also be returned if the fd is invalid IIRC
16:35 sima: pepp, yeah but that's pretty bad programming mistake
16:37 sima: like from a quick read all the variants are fairly undefined for when you do that, so feels a bit silly to special case that one
16:37 sima: plus I do think you need a fallback for errno==EINVAL
16:37 sima: anyway going to drop a comment
16:39 sima: oh I misread your code
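The errno handling sima describes above can be sketched as follows. This is an illustration of the pattern, not the code from the MR; the helper name `same_file_description` is made up, and the fallback `#define` covers builds against pre-6.10 headers (F_DUPFD_QUERY is F_LINUX_SPECIFIC_BASE + 3):

```c
#include <errno.h>
#include <fcntl.h>

#ifndef F_DUPFD_QUERY
#define F_DUPFD_QUERY 1027 /* F_LINUX_SPECIFIC_BASE + 3, kernels >= 6.10 */
#endif

/* Returns 1 if fd_a and fd_b refer to the same open file description,
 * 0 if they don't, and -1 if the answer is unknown and the caller must
 * use a fallback path. */
static int
same_file_description(int fd_a, int fd_b)
{
   int ret = fcntl(fd_a, F_DUPFD_QUERY, fd_b);
   if (ret >= 0)
      return ret;          /* kernel answered: 1 = same, 0 = different */
   if (errno == EINVAL)
      return -1;           /* old kernel without F_DUPFD_QUERY: fall back */
   /* other errnos (e.g. EBADF) indicate a caller bug, not a missing
    * kernel feature; treat them as "unknown" here rather than special-
    * casing them, per the discussion above */
   return -1;
}
```

On kernels predating F_DUPFD_QUERY every call takes the EINVAL path, which is why the fallback branch matters on old kernels.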
16:57 karolherbst: zmike, mareko: one of you might know this: what's the "proper" way of compiling multiple shaders in parallel from a frontend pov? Do we have proper APIs for that or should I just create multiple contexts and compile through those? Or do we already have a wrapper for that doing it with a queue + threads?
16:58 zmike: it depends how parallel you want it
16:58 zmike: if shareable shaders is supported you can use multiple pipe_context objects
16:58 zmike: but if you want single context then you have a couple options
16:59 zmike: check out download_texture_compute in st_pbo_compute.c for state of the art
17:00 karolherbst: in CL I don't have a queue available when compiling things, so atm I use a global helper context for creating compute state objects
17:00 karolherbst: but some applications need like multiple minutes until compilation is all done, so I kinda want to parallelize this
17:01 zmike: yeah so copy what I did in compute pbo if you really want to max it out
17:01 karolherbst: ahh we have "driver_thread_add_job"
17:01 zmike: that's the super hammer
17:02 zmike: which effectively lets you add work directly to the driver's shader compiler thread
17:03 karolherbst: mhhh
17:04 pepp: sima: thx for the feedback, I'll tweak the code a bit
17:04 karolherbst: guess I need to play around with those things a little then
17:06 karolherbst: luckily CL allows async compilation, so a queue + a fence thing kinda matches that pretty well
17:06 karolherbst: though I don't need the fence really
17:06 karolherbst: I think.. maybe I do to be safe
17:06 zmike: the GL scenario that prompted it was extremely latency sensitive, so I'd imagine it should be able to do whatever you need
17:07 karolherbst: mhhh, though only zink and radeonsi support it, so I need a fallback myself... maybe I just wrap `util_queue`
17:07 zmike: well there's the base parallel shader compile stuff
17:08 zmike: which is also used in the compute pbo logic
17:08 karolherbst: yeah.. not caring about latency at all, just want to turn synchronous compilations into async ones as this is permitted by the CL API
17:10 karolherbst: mhhh
17:11 karolherbst: though I think I have to rethink compilation in a more broader sense, because I also have parts which are driver agnostic... or I just use all driver threads and just load balance between them or something...
17:11 zmike: what if the shaders compiled themselves?
17:12 karolherbst: heh
17:12 karolherbst: the expensive part is the OpenCL C to SPIR-V compilation... and that's driver independent anyway
17:13 karolherbst: but I already have all the screens, and the threads are already created, so might as well just use them for random stuff 🙃
17:13 karolherbst: though so far I only want to use them for compiling things
17:14 karolherbst: but even "set_max_shader_compiler_threads" isn't supported by all drivers...
17:15 zmike: nothing is supported by all drivers
17:16 karolherbst: but looks like if driver_thread_add_job isn't provided it's all synchronous
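The "just wrap util_queue" fallback karolherbst is considering boils down to a worker thread draining a job list. Mesa's util_queue (src/util/u_queue.h) already provides this, so the sketch below only illustrates the shape with plain pthreads; all names (`compile_queue`, `compile_queue_add`, `demo_compile`) are hypothetical, not Mesa API:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct compile_job {
   void (*execute)(void *data);
   void *data;
   struct compile_job *next;
};

struct compile_queue {
   pthread_t thread;
   pthread_mutex_t lock;
   pthread_cond_t cond;
   struct compile_job *head, *tail;
   bool shutdown;
};

/* Worker thread: pop jobs until shutdown is requested and the list is empty. */
static void *
compile_worker(void *arg)
{
   struct compile_queue *q = arg;
   pthread_mutex_lock(&q->lock);
   while (!q->shutdown || q->head) {
      while (!q->head && !q->shutdown)
         pthread_cond_wait(&q->cond, &q->lock);
      struct compile_job *job = q->head;
      if (!job)
         continue;
      q->head = job->next;
      if (!q->head)
         q->tail = NULL;
      pthread_mutex_unlock(&q->lock);
      job->execute(job->data);   /* e.g. run OpenCL C -> SPIR-V -> NIR here */
      free(job);
      pthread_mutex_lock(&q->lock);
   }
   pthread_mutex_unlock(&q->lock);
   return NULL;
}

static void
compile_queue_init(struct compile_queue *q)
{
   q->head = q->tail = NULL;
   q->shutdown = false;
   pthread_mutex_init(&q->lock, NULL);
   pthread_cond_init(&q->cond, NULL);
   pthread_create(&q->thread, NULL, compile_worker, q);
}

static void
compile_queue_add(struct compile_queue *q, void (*execute)(void *), void *data)
{
   struct compile_job *job = malloc(sizeof(*job));
   job->execute = execute;
   job->data = data;
   job->next = NULL;
   pthread_mutex_lock(&q->lock);
   if (q->tail)
      q->tail->next = job;
   else
      q->head = job;
   q->tail = job;
   pthread_cond_signal(&q->cond);
   pthread_mutex_unlock(&q->lock);
}

/* Drain remaining jobs and join the worker. */
static void
compile_queue_finish(struct compile_queue *q)
{
   pthread_mutex_lock(&q->lock);
   q->shutdown = true;
   pthread_cond_broadcast(&q->cond);
   pthread_mutex_unlock(&q->lock);
   pthread_join(q->thread, NULL);
}

/* Stand-in for an expensive compile, for demonstration only. */
static void
demo_compile(void *data)
{
   *(int *)data += 1;
}
```

This matches the CL model well: enqueue the compile, return immediately, and let clBuildProgram's callback (or a fence, as discussed above) report completion.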
17:16 alyssa: zmike: shame people keep writing gl drivers instead of using zink smh
17:16 alyssa:runs
17:17 zmike: alyssa: GET BACK HERE!!!!
17:17 zmike: damn kids
17:17 karolherbst: anyway.. I don't mind creating my own threads..
17:17 alyssa: what can i say, gallium is a nicer api than vk
17:17 alyssa: (and vk is nicer than gl, obviously)
17:17 zmike: karolherbst: pls don't NIH the wheel
17:18 karolherbst: yeah.. I should just use util_queue
17:18 zmike: I meant just use existing api
17:19 karolherbst: but then it's slow on other drivers :D
17:19 zmike: so tell those drivers to fix their shit
17:19 karolherbst: mhh
17:19 zmike: be a real frontend
17:19 karolherbst: guess adding a driver compilation thread isn't hard
17:19 karolherbst: even if the drivers don't use it themselves
17:20 zmike: if drivers want to be slow that's their problem, not yours imo
17:20 karolherbst: well.. iris already has a thread, but doesn't implement driver_thread_add_job
17:20 jenatali: zmike: This fence stuff is worse than I thought :( it's gonna take me a bit longer to untangle. Things like pipe_fence_handles not being refcounted correctly too
17:20 jenatali: Bleh
17:20 zmike: :/
17:20 karolherbst: so maybe I fix iris and move on
17:21 zmike:whispers delete iris
17:21 karolherbst: it's 5 loc at most
17:48 alyssa: karolherbst: iris? it's like 25,000 lines!
17:48 alyssa: Kayden: be like "alyssa you're not helping"
17:48 alyssa: :P
17:53 zmike: I don't think he can hear you over the sound of infinite meetings
18:11 Kayden: in fact I can :P
18:13 idr: Lol
18:15 mareko: karolherbst: pipe_context::create_compute_state (and create_xs_state) just pushes that compilation onto an async thread, and the next bind or draw waits for it
18:15 mareko: in radeonsi
18:15 mareko: so it's already parallel in that driver
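The pattern mareko describes — create_compute_state returns immediately after scheduling the compile on a worker thread, and the first bind/draw blocks on a per-shader fence — can be sketched like this. Names (`shader_cso`, `create_state`, `bind_state`) are illustrative, not radeonsi's actual internals:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct shader_cso {
   pthread_mutex_t lock;
   pthread_cond_t done_cond;
   bool compiled;
   /* the compiled binary would live here */
};

/* Runs on the compiler thread. */
static void *
async_compile(void *arg)
{
   struct shader_cso *cso = arg;
   /* ... expensive compilation happens here ... */
   pthread_mutex_lock(&cso->lock);
   cso->compiled = true;
   pthread_cond_signal(&cso->done_cond);
   pthread_mutex_unlock(&cso->lock);
   return NULL;
}

/* create_compute_state analogue: schedule the compile, return at once. */
static struct shader_cso *
create_state(void)
{
   struct shader_cso *cso = calloc(1, sizeof(*cso));
   pthread_mutex_init(&cso->lock, NULL);
   pthread_cond_init(&cso->done_cond, NULL);
   pthread_t t;
   pthread_create(&t, NULL, async_compile, cso);
   pthread_detach(t);
   return cso;
}

/* bind/draw analogue: only here do we wait for the compile to finish. */
static void
bind_state(struct shader_cso *cso)
{
   pthread_mutex_lock(&cso->lock);
   while (!cso->compiled)
      pthread_cond_wait(&cso->done_cond, &cso->lock);
   pthread_mutex_unlock(&cso->lock);
}
```

The parallelism mareko mentions falls out naturally: many create_state calls between draws each spawn work immediately, and only the eventual binds synchronize.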
18:16 karolherbst: right... I'm not too concerned about creation of the CSOs itself, but there is a lot more I do: OpenCL C to LLVM, LLVM to SPIR-V, SPIR-V to nir + a bunch of passes
18:16 karolherbst: and driver finalization
18:16 karolherbst: and I'd like to parallelize the entire thing
18:16 mareko: any driver that enables TC must also be able to accept create_(shader)_state from any thread
18:17 karolherbst: right
18:17 karolherbst: so what I'm wondering is, if I just roll out my own compilation queue handling or just use driver_thread_add_job to push jobs to driver threads
18:18 mareko: that should work
18:19 karolherbst: but I also wanted to look into reusing clang instances because atm we recreate it over and over again... there is a bit for me to figure out, but if I can just use driver_thread_add_job to schedule such jobs that would make it a lot easier for me
18:20 mareko: I don't recommend messing with set_max_shader_compiler_threads
18:21 karolherbst: it's only for that one GL extension, right?
18:21 mareko: yes
18:21 mareko: the radeonsi default is 3/4 of CPU cores are dedicated to driver_thread_add_job, and there is a reason for it
18:21 mareko: as the core count gets lower, the ratio decreases
18:22 karolherbst: mareko: right.. so the issue with that is, CL is multi-device natively, so if you have let's say 4 AMD devices you also have 4 screens, each spawning threads for 3/4 of the CPU cores
18:22 karolherbst: which might be fine regardless
18:23 mareko: let the kernel deal with it
18:23 karolherbst: but it complicates things a little if you don't want to spend all CPU cores on compiling things
18:23 karolherbst: right
18:24 mareko: radeonsi also has another queue using idle priority for low priority shader compilation (using 1/3 CPU cores)
18:25 mareko: there is an amazing synergy between TC and the shader compile queues
18:28 mareko: as draws get enqueued in TC, shaders created between draws get scheduled on compiler threads; TC is deep enough to hold ~1000 draws, so if we get 1000 new shaders between all draws, we compile them all in parallel because they are scheduled to compiler threads immediately while draws are waiting in TC
18:31 karolherbst: mhhh I see
18:31 mareko: if we get lots of shaders in 1 frame, we basically compile them in parallel even if they are compiled sequentially by GL and between different draws
18:32 karolherbst: luckily I don't have any of those issues really. though I was considering using TC at some point, but not sure that with compute only workloads there is much of a benefit
18:39 mareko: zmike: this might interest you: https://gitlab.freedesktop.org/mesa/mesa/-/commit/a42775c03ec88025f6793b71fdf3d81d91cf3926#2c8b4a9f3ead3a7c4f085bd9b156e2923c838226
18:40 zmike: mareko: removing the first point size is valid though since that's the default since maintenance5
18:40 zmike: or
18:40 zmike: hm
18:41 zmike: no, I think that should be valid
18:41 zmike: radv would need to use the default value for that case
18:41 jenatali: Right, you'd need to treat emit as a barrier, wouldn't you?
18:42 zmike: if you're passing vkcts with this then I'd think it means it's missing coverage
18:55 Sachiel: are you sure that's correct? Looking at ANV, we check if the shader writes pointSize and if so tell the HW the point size will come from the shader, otherwise it comes from HW state, but I don't think there's anything handling the "some paths will write pointSize and others won't" case
18:56 mareko: it's probably broken everywhere
18:59 Sachiel: VUID-VkGraphicsPipelineCreateInfo-maintenance5-08775
18:59 Sachiel: If the maintenance5 feature is enabled and a PointSize decorated variable is written to, all execution paths must write to a PointSize decorated variable
19:00 Sachiel: though...
19:00 Sachiel: VUID-VkGraphicsPipelineCreateInfo-shaderTessellationAndGeometryPointSize-08776
19:00 Sachiel: If the pipeline is being created with a Geometry Execution Model, uses the OutputPoints Execution Mode, and the shaderTessellationAndGeometryPointSize feature is enabled, a PointSize decorated variable must be written to for every vertex emitted if the maintenance5 feature is not enabled
19:01 Sachiel: makes it sound like for geometry shaders you can omit it sometimes? Can't tell if that's contradicting the previous VU or I'm just misunderstanding things
19:02 mareko: it's confusing
19:02 mareko: if you don't write it in all execution paths, it should be 1, right?
19:02 zmike: I think those are different cases
19:03 zmike: the first one is saying you can't do like if (x) pointsize = y; else {}
19:03 zmike: the second one is enforcing the "no default point size exists" idea
19:03 zmike: but it could be clearer
19:03 zmike: and I'd guess there's no cts coverage either
19:05 mareko: either way, pointsize must be set for every emit in radv or not at all
19:05 zmike: sounds like a radv bug according to the current spec
19:24 zmike: filed https://gitlab.khronos.org/Tracker/vk-gl-cts/-/issues/5849
20:03 alyssa: i am.. unsure that i agree with this
20:05 alyssa: but i also don't have a spec citation so.
20:34 mareko: "can be eliminated" is a weird wording, it's more like "first pointsize is optional and it defaults to 1 if it's missing"
20:34 mareko: "can be eliminated" means it's OK to break it
20:46 zmike: ricardo will know what I mean
21:29 jenatali: zmike: I'm going to stage a branch with fixes/cleanups to video stuff, and get that landed ASAP. I think the thing that makes sense is to rebase your branch on that, where the fixups to split the fence value stuff out will be more obvious
21:29 jenatali: I think the thing that makes sense is to wait to rebase your branch, but let me know if you want me to do it sooner
21:30 zmike: alrighty
21:30 zmike: thanks for prioritizing
21:30 jenatali: This found some... real bad stuff
21:30 zmike: haha I bet
21:30 jenatali: Mainly just refcounting gone missing
21:30 zmike: imo just ship zink+dozen
21:31 jenatali: Yeah but dzn's missing a ton of stuff and I don't have the prioritization in my schedule to make it work :(
21:31 zmike: oof
21:31 jenatali: Especially around video, getting vk video to map nicely would be a lot
21:31 zmike: vk video is a lot
21:31 jenatali: Video is a lot
21:32 zmike: amen