12:47 pmoreau: karolherbst: Hey, could you please merge https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10256? Francisco seems to be okay with it. (Sorry if it does double-post, but it looks like it didn’t go through the first time as I wasn’t logged in.)
14:58 karolherbst: pmoreau: assigned to marge
15:00 karolherbst: soo... do we have tearing with the nouveau ddx and dri3?
15:01 karolherbst: I am under the impression that usually with dri3 there shouldn't be any...
15:17 pmoreau: karolherbst: Thanks!
15:57 karolherbst: pmoreau: you need to take care of the llvmpipe stuff
15:58 karolherbst: -api/clsetkernelarg/set kernel argument for cl_int3: fail
15:58 karolherbst: as you fixed this test, you also need to update the ci fail list
16:03 pmoreau: Ah, I’ll look at it in a bit.
16:28 pmoreau: karolherbst: Can I keep your Rb on that patch if I just modify it to remove that test from the CI fail list?
16:28 karolherbst: sure
16:29 pmoreau: Thanks; new version pushed
17:25 karolherbst: https://gist.githubusercontent.com/karolherbst/33729c7b46bdace79f247a948e852950/raw/27d72aaa3f7914d004e3ac1f975efdcbd5b9307b/gistfile1.txt :3
17:30 pmoreau: \o/
17:31 HdkR: \o/
17:31 karolherbst: https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/188
17:32 karolherbst: :)
17:32 karolherbst: not sure if we should kill the application, but better to do random shit than to spin forever
17:33 karolherbst: and this spin timeout does hurt CL probably anyway
17:33 karolherbst: I think
17:36 karolherbst: and here the mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12189 :)
22:53 pmoreau: Do we not use the actual launch block size for configuring the register allocator? `info->prop.cp.numThreads` which is used to compute the number of threads per block used by the register allocator, is initialised for both NIR and TGSI frontends based a fixed block size being specified in the shader. And if nothing was specified (which probably happens relatively often with OpenCL), we default to 512 (1024 on Kepler+) threads, even if the
22:53 pmoreau: kernel was launched with a block size of 1.
22:54 karolherbst: pmoreau: that's correct
22:54 pmoreau: :-/
22:54 karolherbst: but we should know if when actually launching the shader :/
22:54 karolherbst: but yeah...
22:55 karolherbst: but does it actually matter all that much?
22:55 karolherbst: or is that a problem in case it's a complex shader and we wouldn't have enough thrads?
22:55 karolherbst: *threads
22:58 pmoreau: It seems stupid to start using quite a bit of private memory due to spilling when we still have registers available.
22:58 karolherbst: yeah.. but we don't know how many threads are used when it gets launched...
22:59 karolherbst: CL added this hint thing for a reason
22:59 karolherbst: so one could do variants
22:59 karolherbst: so maybe we should just compile twice if the kernel contains the size hint
22:59 karolherbst: and select based on how the kernel gets enqueued
23:00 karolherbst: but it doesn't matter
23:00 karolherbst: with CL we always have the fixed size as we compile once we launch
23:00 karolherbst: or not?
23:00 pmoreau: We do know how many threads are used when it gets launched: we have the grid size and block size specified!
23:01 pmoreau: Or rather, we at the moment only compile as the kernel gets launched, so we have all the data available.
23:01 karolherbst: yeah
23:01 karolherbst: when linking we only link to the ... nir?
23:02 karolherbst: most of the work is already done by then, so... "meh"
23:02 pmoreau: Might even be only the SPIR-V modules being linked.
23:02 karolherbst: don't think so
23:03 karolherbst: compiler::link_program calls "nir::spirv_to_nir(spirv::link_program(ms, dev, opts, log), dev, log);"
23:03 karolherbst: so we even linked in libclc and DCEed everything
23:03 karolherbst: it's not perfect, but also not terrible
23:04 karolherbst: but in theory we could compile to the final ISA earlier, I just don't really see the point especially with caching
23:04 karolherbst: ohhh
23:04 karolherbst: we could add some caching to clover
23:04 karolherbst: atm we only cache libclc
23:06 pmoreau: Ignoring the actual block size: `type: 5, local: 1320, shared: 0, gpr: 16, inst: 2853, bytes: 22824`
23:06 pmoreau: Using the block size: `type: 5, local: 0, shared: 0, gpr: 47, inst: 1680, bytes: 13440`
23:06 karolherbst: ahh
23:06 karolherbst: we don't pass in this info?
23:06 pmoreau: Nope
23:06 karolherbst: pmoreau: try chaning the block size in the next launch and see what happens :D
23:07 karolherbst: the problem is, that the nir doesn't contain that info
23:07 karolherbst: but...
23:07 karolherbst: *shrug*
23:07 pmoreau: But we get that info from clover
23:08 karolherbst: pmoreau: you have to adjust kernel::exec_context::bind as well
23:08 karolherbst: the issue is changing stuff
23:08 karolherbst: atm we only recreate the state if the local or input mem reqs changed
23:09 pmoreau: Ah right
23:09 karolherbst: uhm... how can the input size even change ¯\_(ツ)_/¯
23:09 karolherbst: but yeah..
23:10 karolherbst: I think the idea was to keep the kernel generic so we don't recompile all the time
23:10 karolherbst: soo...
23:10 karolherbst: as I said... maybe we should just keep two copies and create them on demand
23:10 pmoreau: Also, if I do not relax the number of threads some of the CTS tests fail due to the register allocator giving up trying to find space for all the registers. 🙃
23:10 karolherbst: *ondemand
23:10 karolherbst: or a few more
23:10 karolherbst: yeah....
23:10 karolherbst: RA is busted
23:11 karolherbst: the spilling logic
23:11 karolherbst: it's .... just broken
23:11 karolherbst: not in code
23:11 karolherbst: the algorithm is
23:12 pmoreau: It’s interesting how some things have changed since I last looked at OpenCL: I now have 4 tests that are passing while they were previously triggering some MP_TIMEOUT before. And another one that now triggers a null pointer dereference.
23:13 karolherbst: mhh maybe I have the TGSI somewherre
23:15 pmoreau: Ideally, would we want to trigger re-creating the state if the block size changes? Or would it be better to avoid it by having a different API than the one we have currently?
23:19 karolherbst: no idea
23:19 karolherbst: the issue are really applications changing the size all the time
23:19 pmoreau: Well, I currently have the issue without even changing the size once. :-)
23:20 karolherbst: what do you mean?
23:20 karolherbst: atm we only compile a generic kernel which works with all sizes, no?
23:21 pmoreau: We compile a generic kernel that assumes the biggest block size, resulting in unnecessary spills and sometimes the register allocator even giving up.
23:21 karolherbst: pmoreau: ahh.. found it: https://gist.github.com/karolherbst/8244b9dd0c1281efc6b9aeb3f913f66e
23:22 karolherbst: so tell me why RA should even _try_ to spill there :p
23:24 pmoreau: Those are some nice moves you have there :-D
23:25 karolherbst: took me a while to reduce a shader so much it actually makes no sense anymore