00:02robclark: well this gets us part way there..
00:02robclark: https://www.irccloud.com/pastebin/xtGksy2t/
00:03robclark: I'm going to have to look at bfrev.b a bit more to see what the right thing to do there is
00:03karolherbst: I have a patch ready in a momemt
00:04karolherbst: robclark: https://gist.github.com/karolherbst/a31188c6aeb4fcc8b9ddd582a564a921
00:04karolherbst: if I didn't mess up the opcode name...
00:06karolherbst: can always improve it later, for now I don't want to break stuff
00:06robclark: no, it doesn't seem to work for short/ushort either, that doesn't change anything
00:07robclark: I'll need to poke at the instruction w/ computerator, but that will have to be tomorrow
00:07karolherbst: ahh
00:08karolherbst: then "return alu->def.bit_size < 32 ? 32 : 0;" for reverse
00:08karolherbst: but if you have native support for it, that would be better to support of course
00:11robclark: yeah ok, that worked
00:11robclark: https://www.irccloud.com/pastebin/uMcnpedh/
00:12karolherbst: okay, already pushed a patch like that to the MR
00:12karolherbst: can always improve it if the hw supports some of it, but..
00:12karolherbst: for now having it not broken is good enough :D
00:13karolherbst: thanks for testing!
00:13robclark: np
07:51tzimmermann: FYI, i just fixed a broken merge resolution in drm-tip. please update if you see any problems
12:51tzimmermann: jfalempe, thanks for reviewing the two include fixes quickly
12:52jfalempe: tzimmermann: you're welcome.
20:52robclark: karolherbst: is there something like apitrace for opencl?
20:56karolherbst: robclark: yeah: https://github.com/intel/opencl-intercept-layer
20:56robclark: thx
20:57karolherbst: it's been a while since I used it...
20:57robclark: this tflite thing appears to be trying to set an array of 128 identical pipe_image_view
20:57robclark: which doesn't work so well when the max is 64
20:57karolherbst: but yeah, there is a "cliloader" binary and a lot of flags: https://github.com/intel/opencl-intercept-layer/blob/main/docs/controls.md
20:58robclark: thx
20:58karolherbst: mhh
20:58karolherbst: FULL_PROFILE needs 128 sampler views... but 128 for image views is out of spec...
21:00karolherbst: but I was considering using bindless to support full profile on more devices
21:01zmike: I think only radeonsi and zink support bindless
21:02karolherbst: and nouveau
21:02zmike: whoveau?
21:02karolherbst: :P
21:02alyssa: asahi deliberately doesn't support bindless since there are perf tradeoffs
21:02robclark: I don't really support bindless.. I use bindless internally within driver, it would be kinda awkward to support both that and gallium bindless
21:02alyssa: i figure "bindless for vk, no bindless for gl" is good enough for everyone who isn't named mike
21:02robclark: (but I could support 128.. it would just be a bit annoying)
21:02alyssa: and anyone named mike can use zink
21:03karolherbst: robclark: I'm more confused why it's using 128 image_views ...
21:03robclark:too
21:03karolherbst: I could just reject such a kernel...
21:03karolherbst: aand maybe should
21:03robclark: well, it is because it adds that many kernel args
21:03karolherbst: yeah sure, but that's illegal
21:03robclark: yeah, you aren't checking the limits and then doing unsafe { set_shader_images() } which goes boom
21:04karolherbst: yeah...
21:04karolherbst: I should check if the kernel created actually fits within CL_DEVICE_MAX_WRITE_IMAGE_ARGS
21:04robclark: right
21:05karolherbst: looks like that's CL_OUT_OF_RESOURCES for clEnqueueNDRangeKernel
21:07karolherbst: kinda weird place for it...
21:07karolherbst: should probably just fail to compile
21:07karolherbst: but not sure if that's actually legal
21:08karolherbst: alternatively I could do indirect image operations...
21:09karolherbst: but not sure what's the perf trade off with those
21:09robclark: it defn makes it as far as set_shader_images() before anything gets rejected
21:10karolherbst: yeah, but my point is, it's an application bug anyway
21:11karolherbst: though I could try to support kernels where applications use the same image_view
21:11karolherbst: in multiple kernel args
21:11karolherbst: but it's still a problem that the application uses too many image args and it seems the spec only allows an error for that when the kernel gets launched, so that's a bit annoying
21:22robclark: https://www.irccloud.com/pastebin/TY3bJtk2/
21:22robclark: karolherbst: ^^^
21:24karolherbst: sure, but not much we can do about that
21:25karolherbst: though I guess it just requires 128 and won't run and probably will never fix it
21:26karolherbst: mut also...
21:26karolherbst: this code...
21:26karolherbst: 🙃
21:27robclark: yeah, sorry, I probably owe you a drink to help forget that shader :-P
21:27robclark: it's basically a giant demux
21:28karolherbst: so what can do really do about that one...
21:28karolherbst: could support 128 write images :)
21:29karolherbst: there are a couple of options here.. as I said: bindless would be one, but could also turn it into indirect accesses and load the index as a kernel parameter
21:29robclark: yeah, although it makes state changes more expensive.. maybe I'll do some different path for compute contexts.. idk yet
21:29karolherbst: if it's like 128 times the same image, then I'd only need to bind once and reuse the same index
21:29karolherbst: or well.. it loads the same index 128 times
21:29robclark: indirect would be better, I think
21:30karolherbst: but.. it's an additional context pull, but not sure how much it matters given that like.. kernels touch VRAM anyway
21:30karolherbst: could only do indirects either if drivers don't care or if it goes out of lmits
21:31karolherbst: indirects shouldn't be a major problem. I never got rid of the space I allocate for them in the kernel input buffer, so we already have a place to store the index..
21:31karolherbst: more concerned about the deduplication
21:32karolherbst: I'll think about it and maybe I come up with a good solution
21:33karolherbst: robclark: are there any costs with binding the same image 32 times?
21:33karolherbst: like.. is it more or less expensive over having 32 indirects
21:33robclark: I've not measured it but I wouldn't expect so
21:34karolherbst: okay
21:34karolherbst: well then I only need to focus on the case where the kernel has more args than "max_shader_images"
21:34robclark: anyways, I'm still wrapping my head around what tensorflow is doing.. I mean it seems like this case it should de-duplicate inputs to get a simpler kernel
21:34karolherbst: and then I turn to indirects and fail the invocation if it actually binds more images than max_shader_images
21:35karolherbst: yeah.. but maybe sometimes it's different iamges?
21:35karolherbst: who knows
21:35karolherbst: maybe it's just bad code
21:37karolherbst: maybe making it work with this one just means it will fail later.. who knows
21:38robclark: yeah, that is what I'm trying to figure out ;-)
21:48alyssa: robclark: "I paid for 128 images, and I'm gonna *use* 128 images!"
21:48robclark: you get what you paid for :-P