14:26 jenatali: karolherbst: I think I just figured out a way to do BDA in dzn, the exact same way that CLOn12 emulates pointers (idx+offset pairs), but instead of index being a locally bound array index, just make it a global resource ID which we already have for descriptor_indexing
14:27 jenatali: I don't think I would've come up with that if you hadn't suggested rusticl on zink on dzn, so thanks for that
15:03 karolherbst: jenatali: cool. But would that also work with arbitrary pointers?
15:04 karolherbst: though _might_ be fine
15:04 karolherbst: or at least for most things
15:05 karolherbst: jenatali: I think the only problem you'd need to solve is to keep the idx valid across kernel invocations for things like global variables or funky stuff kernels might do
15:05 jenatali: What do you mean arbitrary pointers? When the app asks for a buffer address, I just give back an index and offset
15:05 jenatali: Yeah, that's what I mean by a global index
15:05 karolherbst: sure, but applications can do random C nonsense
15:05 karolherbst: right..
15:05 karolherbst: yeah.. then it should be fine
15:06 jenatali: Yeah, it won't be stable for capture/replay but that's a different feature so that's fine
15:06 karolherbst: so set_global_bindings would return an index and offset packed into 64 bits, I pass this into the kernel via ubo0 (kernel arguments) and then it should be good to go
15:06 jenatali: Right
15:07 karolherbst: and gallium doesn't use load_global(_constant) and store_global for anything, so you can deal with the madness there
15:07 karolherbst: I wonder if I want to support different pointer layouts directly, but....
15:08 jenatali: Well I don't have that bindless path in the gallium driver currently, only in dozen
15:08 karolherbst: the CL path is really special sadly
15:08 karolherbst: we have this `set_global_bindings` api which is a bit funky...
15:08 karolherbst: but that's everything you'd need
15:08 jenatali: Yeah makes sense
15:09 karolherbst: luckily there are no bindless images or anything
15:09 karolherbst: and `set_global_bindings` basically means: give me the GPU address for those pipe resources, and make them available for compute dispatches
15:10 karolherbst: there is also some funky offset business going on, but iris/radeonsi/zink have it correctly implemented
15:11 karolherbst: jenatali: uhm.. there is another thing: `pipe_grid_info::variable_shared_mem`, no idea if you can support that
15:12 karolherbst: how are CL local memory kernel parameters currently implemented on your side?
15:12 jenatali: Only by recompiling shaders
15:12 karolherbst: mhhh
15:12 jenatali: Same with local group size because that's a compile-time param in D3D
15:13 karolherbst: I see, so you have to deal with pain like that already anyway
15:13 jenatali: Yeah
15:14 karolherbst: kinda sucks, but not much you or I could do about it...
15:15 jenatali: karolherbst: btw, I noticed you're computing a dynamic local size by using gcd() with the SIMD (wave) size and the global size. That's always going to return 2 for even global sizes and 1 for odd, since SIMD sizes are powers of 2
15:16 jenatali: I was looking because CLOn12's handling of odd global dimensions was... Bad
15:16 karolherbst: yeah...
15:16 karolherbst: I reworked that code tho, just never landed it as it was part of non uniform workgroup support
15:16 jenatali: Cool
15:16 karolherbst: it doesn't matter anyway as most applications aren't silly enough to run into this edge case
15:17 karolherbst: can you support non uniform work groups?
15:17 karolherbst: if so.. doesn't matter long term anyway
15:17 jenatali: Not natively
15:18 karolherbst: mhhh
15:18 jenatali: karolherbst: apparently Photoshop does
15:18 karolherbst: figures...
15:18 jenatali: At least that's what one of our teams is telling me
15:18 karolherbst: yeah.. it makes perfect sense if they use image sizes for stuff
15:20 karolherbst: but uhhh.. why do you think I'm using the SIMD size with gcd? I'm using the thread count and the grid size
15:20 karolherbst: SIMD size only as a last resort if things align really terribly
15:21 karolherbst: `optimize_local_size` is what I'm looking at
15:23 karolherbst: so if you have 512 threads and a grid of 500x1x1, you'd get 500x1x1 still
15:24 karolherbst: it just has some weirdo edge cases where it uses terrible local sizes
15:24 karolherbst: I don't like the third part of that function and it could be better, but it's not _as_ bad
15:30 jenatali: Hmm ok, I thought I saw SIMD size in there
15:30 jenatali: The gcd is still always going to be 2 or 1 though, since that thread count will also be a power of 2
15:44 karolherbst: it can be any pot number
15:44 karolherbst: if your gpu supports 1024 threads, you have 2^10 on one side, and anything else on the other one
15:44 jenatali: ... Yeah that's what I meant
15:44 jenatali: A power of 2 or 1
15:44 karolherbst: ahh yeah, fair
15:45 karolherbst: the last block is supposed to fill it up if the middle one couldn't find a pot of a SIMD size or bigger
15:46 karolherbst: so if the loop manages to set local to the SIMD size, fine, nothing else to do. I just wanted to prevent sub optimal distribution of threads
15:46 karolherbst: _however_
15:46 karolherbst: the thread count doesn't have to be pot though
15:46 karolherbst: intel is kinda weird there...
15:47 karolherbst: jenatali: https://github.com/KhronosGroup/OpenCL-CTS/issues/1716
15:48 karolherbst: there are some intel extensions to make better use of it, and I also kinda have to take that into account
15:48 jenatali: Fun
15:48 karolherbst: but I also kinda wanted to finish non uniform first
15:48 karolherbst: the intel extension e.g. allows you to set the subgroup size
15:49 karolherbst: but yeah.. that part of the code has a big TODO to take all of that into account