00:49kurufu: https://files.catbox.moe/qzifpv.png what triggers drm_atomic_helper_prepare_planes being called on drm commit? Normally it seems i can only find stacks in check but sometimes prepare planes ends up getting called which takes ~16ms. More precisely it seems to only be slow every other frame for some reason?
12:41Lynne: GPUs are scalar these days, so is the fact I'm seeing twice the perf from fp16 vec4 compared to scalar because of dual dispatch being able to parallelize better?
13:05pendingchaos: it's probably because GPUs are not scalar (within an invocation), just mostly scalar
13:05pendingchaos: in particular, some GPUs have vec2 fp16 instructions
13:12karolherbst: yeah.. a lot of GPUs have vec2 fp16 these days, because registers are most of the time 32 bit, so it actually makes sense to provide that
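To make the register-packing point above concrete, here is a small illustrative sketch (not Mesa or driver code; function names are made up): two IEEE fp16 values fit exactly into one 32-bit register slot, which is why hardware vec2 fp16 ALU ops are a natural fit.

```python
import struct

def pack_half2(a: float, b: float) -> int:
    """Pack two fp16 values into one 32-bit word (a in the low lane)."""
    lo = struct.unpack("<H", struct.pack("<e", a))[0]  # fp16 bits of a
    hi = struct.unpack("<H", struct.pack("<e", b))[0]  # fp16 bits of b
    return (hi << 16) | lo

def unpack_half2(word: int):
    """Split a 32-bit word back into its two fp16 lanes."""
    lo = struct.unpack("<e", struct.pack("<H", word & 0xFFFF))[0]
    hi = struct.unpack("<e", struct.pack("<H", word >> 16))[0]
    return (lo, hi)
```

A vec2 fp16 instruction then operates on both lanes of such a word at once, so it costs no more register space than a single fp32 value.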
14:58kisak: Summary of the kisak-mesa PPA update cycle to 24.2.0. No more rusticl rollbacks to get to build on Jammy. NVK enablement changes made for meson troubles. NVK is noble+, with i386 being blocked by librust-syn-dev:i386 not existing.
15:01kisak: (expected to also block on librust-paste-dev:i386 missing)
15:08glehmann: also, in mesa's case, NIR vectorization of scalar fp16 code is not great
15:13Lynne: thanks, that was good to learn
15:14soreau: kisak: good to know, thanks for the rundown
15:15Lynne: I'm surprised scalar code doesn't get vectorized, this is pretty trivial to do, but probably nothing is trivial with GPU code after at least 3 translations through various IRs
15:28glehmann: there is a NIR vectorizer, but it's only top down and requires inputs to be vectors
15:30glehmann: especially for amd hw, we would really want something a bit more aggressive, since packing two scalar fp16 values into a vec2 can often be free
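A toy sketch of the idea glehmann describes (this is not NIR and the IR shape is invented for illustration): merge two adjacent, independent scalar fp16 instructions with the same opcode into one vec2 instruction, which on hardware with vec2 fp16 ALUs costs the same as a single scalar op. A real pass would also have to check data dependencies between the pair; this sketch assumes the inputs are independent.

```python
def vectorize_pairs(ops):
    """ops: list of (opcode, dst, src_a, src_b) scalar fp16 instructions.

    Greedily fuses adjacent same-opcode pairs into a single _vec2 op.
    Dependency checking is omitted for brevity (assumption: ops are
    independent of each other).
    """
    out, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i][0] == ops[i + 1][0]:
            a, b = ops[i], ops[i + 1]
            out.append((a[0] + "_vec2",
                        (a[1], b[1]), (a[2], b[2]), (a[3], b[3])))
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out
```

For example, two scalar `fadd`s on unrelated values collapse into one `fadd_vec2`, halving the instruction count.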
15:32glehmann: but fp16 on desktop is also a bit of a chicken and egg problem: it doesn't see a lot of use, so there isn't a lot of time spent on optimizing it, meaning if you do use it, perf is also not ideal
15:33karolherbst: also.. it's always better for the programmer to write explicitly against vec2 anyway
15:33karolherbst: relying on an auto-vectorizer is a bad programming model
15:35karolherbst: load/stores can be vectorized anyway, so fetching a 32-bit vec4's worth of data is probably a good idea anyway
15:35karolherbst: even if it's scalar in the end
15:35karolherbst: but the loads/stores won't be
15:36karolherbst: and nvidia e.g. needs 128 bit alignment for 128 bit (32 bit vec4) load/stores anyway
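The alignment constraint karolherbst mentions can be sketched like this (an illustration with made-up helper names, not driver code): a 128-bit load of a 32-bit vec4 only works at a 16-byte-aligned offset; at an unaligned offset the same data has to be fetched as four scalar 32-bit loads.

```python
import struct

def load_vec4(buf: bytes, offset: int):
    """Fetch four 32-bit floats starting at offset.

    If the offset is 16-byte aligned, model it as one 128-bit load;
    otherwise fall back to four scalar 32-bit loads (the alignment rule
    here mirrors the NVIDIA constraint mentioned above).
    """
    if offset % 16 == 0:
        return struct.unpack_from("<4f", buf, offset)  # one 128-bit load
    # unaligned path: four separate scalar loads
    return tuple(struct.unpack_from("<f", buf, offset + 4 * i)[0]
                 for i in range(4))
```

Both paths return the same values; the point is that only the aligned path can be a single wide memory transaction.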
15:39glehmann: > relying on an auto-vectorizer is a bad programming model
15:39glehmann: DX12's IR only has scalar alu instructions, so we don't get to choose
15:39karolherbst: annoying
22:08DemiMarie: How bad is it (performance-wise) to copy all buffers (from all clients) from GPU buffers to CPU buffers and back?
22:09DemiMarie: Also, is this the right place to ask that question?