00:15 karolherbst: so mhhh
00:15 karolherbst: I was able to see a difference, but still no idea _what_ exactly changes
00:43 jekstrand: karolherbst: Seems like it's not working. :-(
00:46 karolherbst: I think I just figured out how that works on nvidia hardware
00:54 karolherbst: or.. maybe not :
00:55 karolherbst: weird..
00:57 karolherbst: jekstrand: what actually exists in your isa?
00:57 karolherbst: we only have the group and local thread id
00:58 jekstrand: We have a group id and subgroup id
00:59 karolherbst: it's a bit strange, but no matter what I do, I can't get those to change :/
00:59 karolherbst: there is probably some magic to it
01:01 karolherbst: it could be that for us it's just some weirdo preemption feature
01:03 karolherbst: anyway.. it's not even big enough to hold the offset...
01:05 karolherbst: curro: anyway.. clover's offset handling is broken right now... or well, incomplete. There is also the problem where the hw can't support the requested workload size, so you end up lowering internally and doing multiple invocations, in which case you have to offset the group_id :/
01:06 karolherbst: although I am not 100% sure this needs to be implemented, but some hw has quite limited capabilities here
01:20 curro: karolherbst: uhm, so you mean that the pipe driver's lowering of a too large compute dispatch call into multiple chunks would mess up the group id?
01:20 curro: wouldn't that be a bug in the driver's lowering code?
01:39 jekstrand: karolherbst: Reading the docs more closely, our thread group start fields are for context save/restore not for something actually useful and they don't do what you think they do.
01:39 jekstrand: Not that context save/restore isn't useful. :P
01:40 jekstrand: karolherbst: So, yeah. We have to push stuff and do shader math. :-(
01:41 jekstrand: In which case lowering in clover seems non-terrible.
01:41 jekstrand: Or we can lower in the driver.
01:41 jekstrand: It honestly doesn't matter much to me.
01:41 jekstrand: The amount of code in iris isn't big
08:43 karolherbst: curro: if you specify the global invoc offsets, your work group id still starts at 0 no matter what you pass in; but if you lower your clEnqueueNDRangeKernel into multiple clEnqueueNDRangeKernel calls + offsets, you also need to adjust your work group ids
08:46 karolherbst: personally I don't care much about those workloads, but there might be hw with quite limited work group sizes
08:48 karolherbst: like on tesla we only support { 512, 512, 64 } (max 512 in total) * { 65535, 65535, 1 }
08:48 karolherbst: which isn't much
15:42 jekstrand: cwabbott, jenatali, karolherbst, cmarcelo: An idea that occurred to me today... Would it be possible to build something like nir_opt_algebraic only for derefs.
15:43 jekstrand: For instance, I'd like to detect "T *x; T *y = (T *)(((char *)x) + i * sizeof(T));" and turn it into "T *y = &x[i]"
15:45 jekstrand: Another example which came up from a shader jenatali was looking at: "uint64_t x = foo(); uint y = ((uint[2])x)[0];" but we'd really like that to be "uint y = i2i32(x)"
15:45 jekstrand: Thinking about how to solve this stuff using copy-prop just makes my head hurt
15:46 jekstrand: But if we viewed it as a pattern-matching problem, maybe it's tractable?
15:47 jekstrand: Likely we'd need both search-and-replace pattern matching for regular derefs (to solve the "I'm using offsets instead of arrays" problem) as well as pattern matching on loads and stores, which would let us sort out "creative" bit-cast cases.
15:47 jekstrand: I have no idea yet what the meta-language should look like or how to go about the search-and-replace
15:52 jenatali: Interesting idea. Sounds tricky, but probably doable?
16:07 jekstrand: https://gitlab.freedesktop.org/mesa/mesa/-/issues/3317
16:07 jekstrand: Tried to write it down. If anyone has any thoughts, feel free to leave them there.
16:50 karolherbst: jekstrand: why do we have to opt this?
16:51 karolherbst: the end result should kind of look the same or get optimized already, no?
16:52 karolherbst: ohh.. for copy prop? mhhh
16:54 karolherbst: jekstrand: I think if we focus on the sizeof bits, that might be enough, no?
16:54 karolherbst: so we want to translate the array access from the inner to the outer type
16:55 karolherbst: so we could check if the inner stride can divide the outer one and adjust the array deref with that and skip the cast?
16:56 karolherbst: then cast(array(cast)) would turn into array(cast(cast)) -> array(cast)
16:56 karolherbst: and potentially the inner cast can get removed depending what the source originally was
17:07 jenatali: FYI jekstrand, the CL source that you were looking at yesterday didn't actually have a 0 as the index, it was variable (guaranteed to be either 0 or 1 though)
17:07 jenatali: Which of course makes the problem even harder, since it can't just be i2i32 or an unpack/split