08:37 mlankhorst: airlied: hey I saw your cgroup patches, anything I can do to help advance them? Been running them locally without any issues since you posted them
10:48 mlankhorst: Can I get a review for https://patchwork.freedesktop.org/patch/706995/ I also added a corresponding testcase for xe, but all drivers will benefit.
12:43 alyssa: pendingchaos: thanks for the range analysis zoomies
12:44 alyssa: I will try to review both MR's today but no promises
12:56 alyssa: btw, u_sparse_array is probably the wrong data structure here
12:56 alyssa: since that's meant for multithreading
12:56 alyssa: so there's a bunch of atomics unnecessary here
12:59 pendingchaos: it's still faster, and afaik there isn't currently an alternative
12:59 pendingchaos: at least, it's faster on my machine
12:59 pendingchaos: not sure how much faster it would be if the atomics are removed
13:02 wens: anyone know what happened to the DRM_MODE_DUMB_KERNEL_MAP proposal?
13:02 pendingchaos: _mesa_hash_table_create_u32_keys: "key == 0 and key == deleted_key are not allowed"
13:02 pendingchaos: what's deleted_key? who knows, it's just the integer bitcast of a static variable's address
13:03 * pendingchaos goes over users of _mesa_hash_table_create_u32_keys
13:05 pendingchaos: we could change _mesa_hash_table_create_u32_keys to set the deleted key to something that isn't a u32, like UINT64_MAX, but that wouldn't work on 32-bit
13:06 alyssa: pendingchaos: yeah i mean im inclined to merge your MR as-is
13:06 alyssa: but More Improvements Are Possible
13:06 alyssa: anything's better than a hash table.
13:06 alyssa: BTW did you try with a dense array? is the memory usage unacceptable?
13:07 pendingchaos: memory usage was worse sometimes, for what seemed like too little of a compile time improvement
13:07 alyssa: (I'm a little surprised the array is sparse at all given how many instructions participate in range analysis but who knows)
13:07 pendingchaos: I tried doing something like a sparse_array for unsigned_upper_bound and lsb_zero, but it was faster for uub and slower for lsb_zero for some reason
13:07 alyssa: wild
13:08 alyssa: anyway I'm happy to see your MR and georg's merged this week hopefully
13:08 alyssa: I am still half asleep on the sofa which is why I haven't left comments on gitlab :p
13:08 pendingchaos: memory usage numbers: https://www.irccloud.com/pastebin/dt92Q65I/
13:13 alyssa: pendingchaos: tbh I'm not totally convinced
13:13 alyssa: the other issue here is having 32-bit results
13:13 alyssa: when less than 16 bits are used
13:14 alyssa: and if the results are halved, the dense array mem usage would be halved but the sparse/ht would be.. reduced but not by half
13:14 alyssa: unless those #s were with a 16-bit element size already
13:15 pendingchaos: these were with 32-bit array elements, I guess we can use 16-bit actually
13:15 pendingchaos: ht wouldn't be reduced, but sparse array can be reduced by half too
13:16 pendingchaos: I just blindly continued using 32-bit array elements because that's what the old code did, but with an array, there's an opportunity to save memory by using 16-bit
13:16 alyssa: right
13:16 alyssa: a "yes, and" situation then :]
13:17 alyssa: sparse array would be less-than-half though because you have other overheads
13:17 alyssa: I think?
13:17 pendingchaos: compile time change from using sparse instead of dense: https://www.irccloud.com/pastebin/8ggH673r/
13:18 pendingchaos: (still 32-bit, these are my old bench numbers before the MR was opened)
13:18 pendingchaos: and this is the cost of nir_analyze_fp_range(), not the entire process
13:18 alyssa: right, I see why you made the choice you did.
13:21 alyssa: pendingchaos: ok, how does this sound - we merge the sparse256 MR, and we merge Georg's stuff, and then we shrink to 16-bit and reeval sparse vs dense since the #s there will look different by then
13:23 pendingchaos: I think that's fine
13:23 pendingchaos: maybe we can also experiment with removing atomics
13:24 alyssa: the sparse array atomics are critical for correctness for the data structure's intended purpose
13:24 alyssa: it's a well-engineered data structure, it's just not meant for use in compilers.
13:24 pendingchaos: I meant creating a copy of the functions with atomics removed, that range analysis uses instead
13:25 alyssa: ah
17:25 mareko: I have a pass that lowers non-uniform UBO and SSBO loads to load_global instead of waterfall loops, the only requirement is that ssbo_address, ubo_address, ssbo_size, and ubo_size intrinsics are implemented, would that be useful in core NIR?
17:27 alyssa: mareko: AGX does that itself for UBOs, could switch but idk if there's a point
17:27 alyssa: AGX & Panfrost share code in common for doing that for SSBOs (nir_lower_ssbo)
17:28 alyssa: Merging your pass into nir_lower_ssbo and generalizing into a nir_lower_buffer pass seems sensible to me
17:28 mareko: well, my pass also includes bounds checking as per robustness2
17:28 alyssa: The above drivers get robustness via nir_lower_robust_access
17:29 alyssa: You run nir_lower_robust_access first to insert bounds checks, and then run nir_lower_ssbo or whatever on the resulting bounds checked code
17:29 alyssa: this is the path GL4.6 SSBOs take on Asahi for ex
17:29 alyssa: (the vulkan drivers don't use any of this code because it's redundant with nir_lower_explicit_io)
17:30 mareko: robustness2 requires that each component is bounds-checked and loaded separately if the load is partially out-of-bounds
17:30 alyssa: yes lower_explicit_io deals with that for the Vulkan drivers
17:34 mareko: I don't see it do bounds checking though
17:41 mareko: nir_lower_robust_access changes the offset to 0 for out-of-bounds, which doesn't meet robustBufferAccess2 requirements
17:54 mareko: since vectorization can merge in-bounds and out-of-bounds loads/stores, we need per-component bounds checking if the whole load isn't fully in-bounds
17:54 mareko: or never use load/store vectorization
18:11 karolherbst: gfxstrand: given you wrote (or copied?) the code initially, do you want to chime in on this MR? It's about fp16 and rounding... https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40163
18:32 alyssa: mareko: yes lower_explicit_io scalarizes loads if robust2 is enabled
18:37 alyssa: anholt: btw I've found nir_shader_bisect.py really useful, thank you for it :)
18:38 alyssa: jnoorman: and you too by extension
18:40 mareko: sweet; it could still be better by doing if (wholly_in_bounds) load as vector; else load as scalars;
18:40 alyssa: Possibly yes
18:40 alyssa: likely hardware-dependent
18:41 mareko: and for 8-16 bits, it could split the load to 32-bit segments, not full scalarization
18:42 mareko: or MIN2(alignment, robust_buffer_access_size) segment
18:42 mareko: *segments
18:58 anholt: alyssa: glad it worked for somebody else!
20:32 airlied: mlankhorst: convince ckonig they are valuable for some usecase he hasn't decided is useless in advance :-)