00:02 robclark: well this gets us part way there..
00:02 robclark: https://www.irccloud.com/pastebin/xtGksy2t/
00:03 robclark: I'm going to have to look at bfrev.b a bit more to see what the right thing to do there is
00:03 karolherbst: I have a patch ready in a momemt
00:04 karolherbst: robclark: https://gist.github.com/karolherbst/a31188c6aeb4fcc8b9ddd582a564a921
00:04 karolherbst: if I didn't mess up the opcode name...
00:06 karolherbst: can always improve it later, for now I don't want to break stuff
00:06 robclark: no, it doesn't seem to work for short/ushort either, that doesn't change anything
00:07 robclark: I'll need to poke at the instruction w/ computerator, but that will have to be tomorrow
00:07 karolherbst: ahh
00:08 karolherbst: then "return alu->def.bit_size < 32 ? 32 : 0;" for reverse
00:08 karolherbst: but if you have native support for it, that would be better to support of course
00:11 robclark: yeah ok, that worked
00:11 robclark: https://www.irccloud.com/pastebin/uMcnpedh/
00:12 karolherbst: okay, already pushed a patch like that to the MR
00:12 karolherbst: can always improve it if the hw supports some of it, but..
00:12 karolherbst: for now having it not broken is good enough :D
00:13 karolherbst: thanks for testing!
00:13 robclark: np
07:51 tzimmermann: FYI, i just fixed a broken merge resolution in drm-tip. please update if you see any problems
12:51 tzimmermann: jfalempe, thanks for reviewing the two include fixes quickly
12:52 jfalempe: tzimmermann: you're welcome.
20:52 robclark: karolherbst: is there something like apitrace for opencl?
20:56 karolherbst: robclark: yeah: https://github.com/intel/opencl-intercept-layer
20:56 robclark: thx
20:57 karolherbst: it's been a while since I used it...
20:57 robclark: this tflite thing appears to be trying to set an array of 128 identical pipe_image_view
20:57 robclark: which doesn't work so well when the max is 64
20:57 karolherbst: but yeah, there is a "cliloader" binary and a lot of flags: https://github.com/intel/opencl-intercept-layer/blob/main/docs/controls.md
20:58 robclark: thx
20:58 karolherbst: mhh
20:58 karolherbst: FULL_PROFILE needs 128 sampler views... but 128 for image views is out of spec...
21:00 karolherbst: but I was considering using bindless to support full profile on more devices
21:01 zmike: I think only radeonsi and zink support bindless
21:02 karolherbst: and nouveau
21:02 zmike: whoveau?
21:02 karolherbst: :P
21:02 alyssa: asahi deliberately doesn't support bindless since there are perf tradeoffs
21:02 robclark: I don't really support bindless.. I use bindless internally within driver, it would be kinda awkward to support both that and gallium bindless
21:02 alyssa: i figure "bindless for vk, no bindless for gl" is good enough for everyone who isn't named mike
21:02 robclark: (but I could support 128.. it would just be a bit annoying)
21:02 alyssa: and anyone named mike can use zink
21:03 karolherbst: robclark: I'm more confused why it's using 128 image_views ...
21:03 robclark:too
21:03 karolherbst: I could just reject such a kernel...
21:03 karolherbst: aand maybe should
21:03 robclark: well, it is because it adds that many kernel args
21:03 karolherbst: yeah sure, but that's illegal
21:03 robclark: yeah, you aren't checking the limits and then doing unsafe { set_shader_images() } which goes boom
21:04 karolherbst: yeah...
21:04 karolherbst: I should check if the kernel created actually fits within CL_DEVICE_MAX_WRITE_IMAGE_ARGS
21:04 robclark: right
21:05 karolherbst: looks like that's CL_OUT_OF_RESOURCES for clEnqueueNDRangeKernel
21:07 karolherbst: kinda weird place for it...
21:07 karolherbst: should probably just fail to compile
21:07 karolherbst: but not sure if that's actually legal
21:08 karolherbst: alternatively I could do indirect image operations...
21:09 karolherbst: but not sure what's the perf trade off with those
21:09 robclark: it defn makes it as far as set_shader_images() before anything gets rejected
21:10 karolherbst: yeah, but my point is, it's an application bug anyway
21:11 karolherbst: though I could try to support kernels where applications use the same image_view
21:11 karolherbst: in multiple kernel args
21:11 karolherbst: but it's still a problem that the application uses too many image args and it seems the spec only allows an error for that when the kernel gets launched, so that's a bit annoying
21:22 robclark: https://www.irccloud.com/pastebin/TY3bJtk2/
21:22 robclark: karolherbst: ^^^
21:24 karolherbst: sure, but not much we can do about that
21:25 karolherbst: though I guess it just requires 128 and won't run and probably will never fix it
21:26 karolherbst: mut also...
21:26 karolherbst: this code...
21:26 karolherbst: 🙃
21:27 robclark: yeah, sorry, I probably owe you a drink to help forget that shader :-P
21:27 robclark: it's basically a giant demux
21:28 karolherbst: so what can do really do about that one...
21:28 karolherbst: could support 128 write images :)
21:29 karolherbst: there are a couple of options here.. as I said: bindless would be one, but could also turn it into indirect accesses and load the index as a kernel parameter
21:29 robclark: yeah, although it makes state changes more expensive.. maybe I'll do some different path for compute contexts.. idk yet
21:29 karolherbst: if it's like 128 times the same image, then I'd only need to bind once and reuse the same index
21:29 karolherbst: or well.. it loads the same index 128 times
21:29 robclark: indirect would be better, I think
21:30 karolherbst: but.. it's an additional context pull, but not sure how much it matters given that like.. kernels touch VRAM anyway
21:30 karolherbst: could only do indirects either if drivers don't care or if it goes out of lmits
21:31 karolherbst: indirects shouldn't be a major problem. I never got rid of the space I allocate for them in the kernel input buffer, so we already have a place to store the index..
21:31 karolherbst: more concerned about the deduplication
21:32 karolherbst: I'll think about it and maybe I come up with a good solution
21:33 karolherbst: robclark: are there any costs with binding the same image 32 times?
21:33 karolherbst: like.. is it more or less expensive over having 32 indirects
21:33 robclark: I've not measured it but I wouldn't expect so
21:34 karolherbst: okay
21:34 karolherbst: well then I only need to focus on the case where the kernel has more args than "max_shader_images"
21:34 robclark: anyways, I'm still wrapping my head around what tensorflow is doing.. I mean it seems like this case it should de-duplicate inputs to get a simpler kernel
21:34 karolherbst: and then I turn to indirects and fail the invocation if it actually binds more images than max_shader_images
21:35 karolherbst: yeah.. but maybe sometimes it's different iamges?
21:35 karolherbst: who knows
21:35 karolherbst: maybe it's just bad code
21:37 karolherbst: maybe making it work with this one just means it will fail later.. who knows
21:38 robclark: yeah, that is what I'm trying to figure out ;-)
21:48 alyssa: robclark: "I paid for 128 images, and I'm gonna *use* 128 images!"
21:48 robclark: you get what you paid for :-P