06:23dolphin: airlied, sima: dim is not picking up anything into drm-intel-fixes so no pull request this week
06:41sima: ack
09:04dj-death: jnoorman: do you want to land the reviewed bits from !34344 ?
09:15karolherbst: do we have per driver documentation about queue/context priorities? Like something that describes what guarantees a low/high priority context gives to the user. I'm sure it's all per driver if at all, just wondering if there is anything I can point out to.
09:49pac85: Like, for gallium?
09:49pac85: Or uapi level?
09:50karolherbst: "developer documentation" rather
09:50karolherbst: application developer I mean
09:51karolherbst: like "if you use a high prio context through GLX/EGL you get those guarnatees on this driver: ..."
09:51pac85: I see yeah
09:54pac85: I think it falls mainly in 3 camps: GPUs that have different hw queues with different priorities (eg. Amd), GPUs that do preemption (msm, amd, Mali, agx) and GPUs that have none of those (in which case it only affects scheduling)
09:57karolherbst: right
09:57karolherbst: but I'm interested in specific documentation
09:57karolherbst: it's for https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35456
09:58karolherbst: I map the CL stuff to PIPE_CONTEXT_HIGH_PRIORITY and PIPE_CONTEXT_LOW_PRIORITY, and just wondering if there is any driver specific docs on it
09:58karolherbst: rude for an ext to require you to provide docs...
10:03pac85: I see. Personally I documented it for msm but it is very low level and probably not what you are looking for https://gitlab.freedesktop.org/drm/msm/-/blob/msm-next/Documentation/gpu/msm-preemption.rst
10:05pac85: As in, if you read it you'll know all you need but you'll also know stuff you don't care about as an app developer :p so not sure if it counts
10:08karolherbst: yeah.. I'm interested in something more high level
10:08karolherbst: maybe "it matches EGL/GLX" semantics would fly, but I wouldn't be surprised if those are similarly vague
10:09pac85: Even vulkan is vague
10:10karolherbst: I mean this is highly platform specific, so it makes sense to point towards vendor documentation
10:41jnoorman: dj-death: I think I'd prefer at least one more r-b on "nir: add BASE index to load/store_ssbo" so would you mind if I give a bit more time for people to review?
10:56dj-death: jnoorman: okay
15:24robclark: karolherbst: any idea about this? It is "fixed" by commenting out cl_khr_fp16
15:24robclark: https://www.irccloud.com/pastebin/uOoc0OQl/
15:24robclark: maybe a llvm spirv translator bug?
15:25karolherbst: mhhhh
15:25karolherbst: do you have the nir?
15:25karolherbst: there are a few bugs in the libclc as well which could cause this..
15:25robclark: looks like it didn't get as far as producing the nir.. is there a way to dump the spirv?
15:26karolherbst: CLC_DEBUG=dump_spirv
15:27robclark: heh, ok.. any tip on mapping "38072 bytes into the SPIR-V binary" to location in the spirv?
15:27robclark: output might be too big to pastebin
15:28karolherbst: yeah, just wanted to say...
15:29robclark: well, I guess opcode=SpvOpImageSampleExplicitLod gives a hint
15:29karolherbst: it's a bit weird tho..
15:30karolherbst: ohh...
15:30karolherbst: the CL CTS might not test read_imageh...
15:31karolherbst: mhh there are tests actually
15:31robclark: I think this is it.. maybe?
15:31robclark: %1397 = OpLoad %v2uint %_compoundliteral Aligned 8
15:31robclark: %TempSampledImage = OpSampledImage %1336 %1392 %1339
15:31robclark: %call15 = OpImageSampleExplicitLod %v4half %TempSampledImage %1397 Lod %float_0
15:32karolherbst: sure.. but what would convert the dest to 32 bit on the nir side
15:32karolherbst: maybe some lowering going wrong...
15:33karolherbst: mhh wait, this is before lowering
15:34karolherbst: ohh wait
15:34karolherbst: it's the components that is different?
15:35robclark: well nir def thinks it is 4 components w/ bitsize 32.. I might be looking at the wrong one
15:35robclark:doesn't completely understand
15:35karolherbst: `type->type` and `*def`
15:36karolherbst: inside vtn_push_nir_ssa
15:36karolherbst: so it does expect a f16vec4
15:36karolherbst: wondering what def is
15:36karolherbst: ohh wait.. I read the last line wrong
15:36karolherbst: yeah you are right, it expects 32 bit
15:37karolherbst: you can do something simple there.. let me figure it out
15:37robclark: thx
15:38karolherbst: robclark: "p nir_print_shader(b->shader, stdout)" inside the debugger inside the vtn_push_nir_ssa frame
15:39robclark: heh, that segfaulted
15:39karolherbst: mhhh
15:39karolherbst: sad
15:40karolherbst: ohh...
15:40karolherbst: the 32 is hardcoded
15:40robclark: :facepalm:
15:40karolherbst: "nir_def_init(&instr->instr, &instr->def, nir_tex_instr_dest_size(instr), 32);"
15:40karolherbst: so now what to do about that...
15:40karolherbst: not in the mood of fixing every driver
15:40karolherbst: maybe just insert a "f2f16"
15:41robclark: ok, let me take a look
15:43karolherbst: not sure we require 32 bit dests for tex operations
15:43karolherbst: but also annoying that I haven't hit it with the CTS...
15:46robclark: \o/
15:46robclark: https://www.irccloud.com/pastebin/GgJdFN4j/
15:46robclark: I think that looks reasonable.. it works at least
15:46karolherbst: yeah.. the image CL CTS tests don't test fp16 support 🙃
15:47karolherbst: robclark: right.. the only concern is, that it might run into issues with drivers
15:47robclark: /o\
15:47karolherbst: like if everybody assumes 32 bits, then... it's not great
15:47karolherbst: _however_
15:47karolherbst: we can just lower inside nir_lower_cl_images for now
15:48karolherbst: I also added lowering for CL_DEPTH images, because in CL they are single component dests
15:48karolherbst: and every driver expects 4 components
15:49robclark: hmm, maybe ir3 is the only one that uses nir_opt_16bit_tex_image
15:49karolherbst: ohh.. intel as well
15:49karolherbst: and radeonsi
15:49karolherbst: mhhhh
15:50karolherbst: what does that pass do?
15:50karolherbst: try to use 16 bit instead of 32?
15:50robclark: yeah
15:51karolherbst: I see..
15:51robclark: folds narrowing into tex op
15:51karolherbst: so I see two options
15:51karolherbst: 1. always lower to fp32 in rusticl and let driver optimize with nir_opt_16bit_tex_image
15:51karolherbst: 2. add a pipe_cap and rusticl lowers for drivers not supporting 16 bit tex ops
15:52karolherbst: or well.. 3. lower to fp32 and make drivers call a new lowering pass
15:53robclark: maybe 1 is better, since drivers that want it already use the nir opt pass
15:54robclark: or just ask alyssa .. since I guess she knows the other compilers using rusticl that aren't ir3/amd/intel
15:55karolherbst: at least 1. doesn't change the status quo
15:55karolherbst: so I think it's way easier to do
15:56karolherbst: if you want you can write the fix and add something to "nir_lower_cl_images", but that pass needs some cleaning up because it's starting to become a mess
15:57glehmann: didn't marek just add a glsl_16bit_load_dst pipe cap for this?
15:58robclark: so panfrost sets that to true.. (as does zink)
15:58robclark: so maybe my existing fix is fine
15:59robclark: agx does too
15:59robclark: i915 and r300 don't support it :-P
16:00robclark: oh, and I guess radeonsi pre GFX9? Not sure how old that is
16:01karolherbst: glehmann: mhh I wished those caps would be better documented :)
16:01karolherbst: GFX9 is pretty new
16:02karolherbst: well.. "pretty"
16:02karolherbst: it's the last gen before RDNA?
16:02karolherbst: something like that
16:03glehmann: anything gcn is old
16:03karolherbst: isn't GFX9 like 7 years old?
16:04robclark: I guess the more relevant question is whether rusticl is a going concern on GFX9.. I guess/assume it is?
16:04karolherbst: yeah, should just work (tm)
16:04robclark: ok, let me see if I can come up with something
16:05karolherbst: the question is rather, what does that cap do...
16:06karolherbst: seems to force high precision in a few places
16:06karolherbst: mhhh
16:07karolherbst: anyway yeah.. rusticl probably will have to do lowering inside "nir_lower_cl_images" then if it's false
16:07karolherbst: not the worst lowering
16:08robclark: why not just promote the dest type to 32b and add f2f16/etc?
16:09robclark: why would that need something in nir_lower_cl_images?
16:09glehmann: how does cl define int16 shader load for 32bit images? UB, trunc, or clamp?
16:13robclark: I assume in this case the src img is f16
16:15robclark: karolherbst: also, is cl_khr_fp16 exposed on GFX8 and earlier? If not then my fix would be fine
16:16karolherbst: yes, on GFX8
16:16robclark: :-(
16:16karolherbst: GFX8 has slow fp16, but it does have fp16
16:16karolherbst: "slow" as in "as fast as fp32"
16:17robclark: hmm, that seems mostly pointless
16:18karolherbst: saves on memory bandwidth
16:18karolherbst: though
16:18karolherbst: loading from/storing to fp16 data is always supported
16:18karolherbst: but it can also remove some pointless conversions
16:18karolherbst: anyway.. "prop stack supports it", so users asked
16:19karolherbst: I mean.. the lowering will be like 10 loc or something
16:19karolherbst: not the worst
16:23mareko: gfx8 fp16 is as fast as fp32 but consumes less power
16:23glehmann: sub dword register addressing really sucks on gfx8 though, so in practice it's not even 1:1 alu usage with fp32 (ignoring conversions)
16:25mareko: I think it was designed and expected to use only low bits of VGPRs as a power optimization
16:27karolherbst: the other question is, if changing it in vtn, what to do about vulkan drivers
16:27glehmann: I prefer the gfx11 design :)
16:27karolherbst: or maybe it's fine for everybody
16:28glehmann: karolherbst: Vulkan only allows 32bit tex instructions
16:28karolherbst: fun
16:31jenatali: D3D also only allows 32 bit sampling results
16:32karolherbst: okay, so you would need lowering inside nir_lower_cl_images then anyway
16:32jenatali: Yeah
16:34robclark: karolherbst: why does it need to be lowered in nir_lower_cl_images instead of nir_opt_16bit_tex_image?
16:34karolherbst: if vtn always uses fp32 then drivers can also use nir_opt_16bit_tex_image
16:34karolherbst: it's just that if vtn emits fp16 tex ops, then we need some lowering somewhere
16:35robclark: I am doing the munging into 32b in vtn
16:35karolherbst: okay
16:35karolherbst: should be easier to handle then
16:35robclark: is there a convenient helper for "give me the right nir alu opt for dst type"?
16:36karolherbst: "nir alu opt"?
16:38glehmann: nir_type_conversion_op
16:38karolherbst: ohhh you meant "op"
16:38robclark: ahh, thx glehmann
16:40glehmann: fwiw, doing the lowering in vtn and optimizing back is not possible under all conditions
16:41glehmann: it's going to depend on float rounding mode, and whether your hw clamps 32bit integers
16:42robclark: I _could_ make this dependent on pipe cap.. but since we are widening an already 16b tex to 32b I think we can use undef rounding mode
16:47robclark: https://www.irccloud.com/pastebin/4ZiNEL3q/
16:51glehmann: undef rounding mode in this context just means use the global one in the shader, so if that's set it's not completely undefined
16:52robclark: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35470
17:51robclark: karolherbst: sadness.. https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35447#note_2954676
17:54karolherbst: robclark: mhhhh what CL query is that based on?
17:54karolherbst: ohhw ait
17:54karolherbst: it's already a workaround for a broken driver
17:54karolherbst: pain
17:55karolherbst: robclark: add a test to the CL CTS for that 🙃
17:55karolherbst: but yeah...
17:56karolherbst: this is about CL_DEVICE_IMAGE_MAX_BUFFER_SIZE right?
17:56robclark: about the pitch alignment
17:56karolherbst: ehh CL_DEVICE_IMAGE_PITCH_ALIGNMENT
17:56robclark: looks like closed driver expects units of pixels instead of bytes
17:57karolherbst: "The row pitch alignment size in pixels for 2D images created from a buffer."
17:57robclark: so rusticl is correct here
17:57karolherbst: yeah..
17:57robclark: the cl api is a bit awkward because it mixes units of pixels and bytes
17:58karolherbst: yeah....
17:59karolherbst: the main reason I didn't want to enable drivers in rusticl from the start was, that I'm aware that others would just workaround a broken rusticl instead of reporting isses 🙃
17:59karolherbst: I was considering driconf for rusticl a few times to workaround broken applications tho...
18:02robclark: we _might_ need to driconf, but I'd prefer fixing tensorflow... the question of which of the pile of gpu_info.IsAdreno() are related to hw (which would apply to us to) vs sw
18:02karolherbst: beats me :)
18:02robclark: yeah, no worries, not your problem ;-)
18:02karolherbst: robclark: you probably want to check for the platform there as well tho
18:03robclark: right
19:06karolherbst: I think I have a GPU memory leak somewhere :')
19:58zmike: can I get an ack on https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21738 ? I think this function has actually been subtly broken all this time but the obfuscation hid it
20:44idr: zmike: LGTM.
20:44zmike: ty
20:51robclark: karolherbst: can CL_DEVICE_MAX_WORK_GROUP_SIZE (or the kernel equiv) be greater than the MAX_WORK_ITEMS_SIZES in any dimension? This isn't clear to me from the spec
20:52robclark: tensorflow seems to expect not.. but idk if it is wrong here
20:52karolherbst: "MAX_WORK_ITEMS_SIZES"?
20:53robclark: CL_DEVICE_MAX_WORK_ITEM_SIZES
20:54karolherbst: I don't see why CL_DEVICE_MAX_WORK_GROUP_SIZE couldn't be greater than CL_DEVICE_MAX_WORK_ITEM_SIZES
20:54robclark: ok.. that was my expectation
20:55karolherbst: applications shouldn't rely on those anyway
20:55karolherbst: but rather use the values from clGetKernelWorkGroupInfo
20:56robclark: it is using the kernel value.. but the # of threads is greater than max_work_item_sizes[0]..
20:56robclark: it is trying to calc a localsize and ends up picking something with too large an x dimension
20:56karolherbst: mhhh
20:57karolherbst: would have to see the code, but MAX_WORK_ITEM_SIZES simply specify the total amount of threads in a work group, and work_group_size specify the limit per dimension
20:58karolherbst: ehh wait
20:59karolherbst: the terms are so confusing
20:59karolherbst: MAX_WORK_ITEM_SIZES is per dim 🙃
20:59karolherbst: anyway
20:59karolherbst: I don't see a restriction of any sorts, and the code might just do assumptions
20:59karolherbst: _though_ I know that the CTS was also broken in this sense
21:00robclark: yeah, the situation where the max threads is greater than a single dimension, ie. you couldn't max out of dims==1
21:00robclark: but I can fix it in tensorflow
21:01karolherbst: on v3d I ran into very funky issues in the CTS
21:01karolherbst: max_grid_Size of 65535 isn't fun :D
21:02karolherbst: what was the hardware with a subgroup size of 128 again...
21:02karolherbst: that was funky
21:12jenatali: Adreno has a subgroup of 128
22:45karolherbst: robclark: if you have a bit of time: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35448 and "integer_ops/test_integer_ops extended_bit_ops_extract extended_bit_ops_insert extended_bit_ops_reverse"
22:45karolherbst: there is lowering for everything inside mesa
23:47robclark: karolherbst: hmm, I'm getting a ibitfield_extract in the backend (which you could probably repo w/ drm_shim)
23:48karolherbst: robclark nir_compiler_options has flags to lower it
23:48karolherbst: lower_bitfield_extract{,8,16}
23:48karolherbst: and a 64 bit one as well
23:49karolherbst: and "lower_bitfield_insert" or "lower_bitfield_reverse" if needed, might also need adjustement of your lower_bit_size cb depending on things
23:49karolherbst: but yeah.. I could try to write a patch blind
23:51robclark: https://www.irccloud.com/pastebin/rBff9wZ9/
23:52karolherbst: with that it all passes? Do you want to push a change or should I just copy it?
23:53robclark: hmm, maybe we shouldn't lower bitfield_reverse except for 8b
23:53karolherbst: yeah.. that should be done inside the lower_bit_size cb then