00:36 Plagman: curro: source2 only calls that function if it doesn't have focus
00:37 Plagman: if you need to override that for testing purposes, set the engine_no_focus_sleep cvar to 0
00:37 Plagman: it should also reduce that timeout to 0
04:03 airlied: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6082
04:04 airlied: if you want an initial vulkan swrast (that can run 2 demos)
05:05 curro: Plagman: whoa that's good information, thanks for looking it up
05:07 curro: craftyguy, janesma: sounds like we should fix the perf CI to set that option in order to get decent framerate in Dota2 at least
07:25 MrCooper: airlied: cool, could it run any tests in CI yet? :)
08:41 airlied: MrCooper: not yet, unless 500,000 crashes count as CI
08:41 MrCooper: no passes yet?
08:42 airlied: probably a few but crashes would drive up execution time a lot
08:43 airlied: a full run on all cores on my ryzen is 40 mins now
08:43 airlied: with crashes it goes up to 2 hrs
08:49 airlied: will look at ci more once i've dequeued all fixes
08:49 MrCooper: if there's a reasonable subset of tests with passes, could still be useful to catch regressions in those
10:32 danvet: mripard, mlankhorst, tzimmermann need a backmerge of drm-next into drm-misc-next for a typo fix in drm/xlnx
10:47 mlankhorst: ok
10:56 emersion: i feel like i've asked this before, but is it valid for a KMS driver to set possible_crtcs to more than a single CRTC for a primary plane?
10:57 emersion: the docs say "All drivers should provide one primary plane per CRTC to avoid surprising userspace too much"
10:57 emersion: rpi drivers seem to allow multiple CRTCs
11:15 dolphin: hansg: related to flicker-free boot, I noticed your patches about not initializing text mode in GRUB went upstream; I guess the rest of the patches, which actually avoid any messages during boot, are still being discussed?
12:00 mripard: emersion: it doesn't look like vc4 does that?
12:00 emersion: mripard: https://drmdb.emersion.fr/devices/01cd892b4221
12:01 emersion: all planes have "CRTCs: {0, 1, 2, 3, 4, 5, 6, 7}"
12:01 emersion: (side note, there are only 2 CRTCs)
12:02 emersion: i'm asking this question because of https://github.com/swaywm/wlroots/pull/2333#discussion_r456777535
12:02 gitbot: swaywm issue (Pull request) 2333 in wlroots "Improve DRM plane matching logic (fix #1943)" [Open]
12:02 emersion: also see: https://github.com/raspberrypi/linux/blob/rpi-5.4.y/drivers/gpu/drm/vc4/vc4_firmware_kms.c#L820
12:03 mripard: oh, with firmware kms
12:03 mripard: I thought you meant the upstream driver
12:04 hansg: dolphin, TBH I don't know; a colleague of mine is working on upstreaming all our downstream patches. The end goal is to have no downstream patches, so the rest should eventually go upstream too, but I don't know what the status is.
12:06 emersion: firmware KMS?
12:06 dolphin: hansg: right, thanks for the info :) I'll try manually stubbing out the "Booting 'XYZ'" message
12:07 mripard: emersion: well, the driver you pointed to? :)
12:07 emersion: i thought vc4 was upstream, so is this a fork?
12:07 emersion: i'm unfamiliar with the vc4 situation
12:08 mripard: it's not really a fork, more like a stop gap
12:08 emersion: (anyways, the question is: is this valid from a KMS uapi point of view, or is this a driver bug?)
12:08 emersion: hm
12:08 mripard: it's a KMS driver that only communicates with the firmware to perform whatever KMS operation was asked
12:08 mripard: (and it's out-of-tree)
12:09 mripard: whereas vc4 is a full-blown driver that doesn't rely on the firmware at all
12:09 mripard: and vc4 is upstream but doesn't yet support the RPi4
12:13 emersion: ok
12:47 daniels: emersion: I'm pretty sure that Renesas also allows primary planes to hop between CRTCs
12:47 emersion: ok, i'll submit a patch to clarify this then
12:48 daniels: the general direction of travel is that planes are getting less specialised - all of Renesas/Qualcomm/RPi expose 'virtual' planes, and virtual CRTCs (which can be ganged together or driven separately) are becoming much more of a thing as well
12:49 daniels: but yeah, that downstream driver sure does look buggy
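
For anyone who wants to reproduce what drmdb shows for their own device, here is a minimal libdrm sketch (error handling elided; /dev/dri/card0 is an assumption, and the property walk is the standard way to find a plane's type):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    int main(void)
    {
        int fd = open("/dev/dri/card0", O_RDWR);
        drmSetClientCap(fd, DRM_CLIENT_CAP_UNIVERSAL_PLANES, 1);

        drmModePlaneRes *pres = drmModeGetPlaneResources(fd);
        for (uint32_t i = 0; i < pres->count_planes; i++) {
            drmModePlane *plane = drmModeGetPlane(fd, pres->planes[i]);
            drmModeObjectProperties *props =
                drmModeObjectGetProperties(fd, plane->plane_id, DRM_MODE_OBJECT_PLANE);

            uint64_t type = DRM_PLANE_TYPE_OVERLAY;
            for (uint32_t j = 0; j < props->count_props; j++) {
                drmModePropertyRes *prop = drmModeGetProperty(fd, props->props[j]);
                if (!strcmp(prop->name, "type"))
                    type = props->prop_values[j];
                drmModeFreeProperty(prop);
            }

            /* possible_crtcs is a bitmask over the CRTC list; the question
             * above is whether a PRIMARY plane may have more than one bit set */
            printf("plane %u: %s, possible_crtcs 0x%x\n", plane->plane_id,
                   type == DRM_PLANE_TYPE_PRIMARY ? "primary" : "other",
                   plane->possible_crtcs);

            drmModeFreeObjectProperties(props);
            drmModeFreePlane(plane);
        }
        drmModeFreePlaneResources(pres);
        return 0;
    }
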
13:17 tzimmermann: danvet, I'm trying to incorporate your review on the ast fix. is it acceptable to program the crtc mode from atomic_begin()?
13:35 TheRealJohnGalt: I have a radv issue with a single game (Detroit: Become Human) only at 4k, and wonder if it's just that my hardware can't handle the game at 4k. At 4k output, the game video hangs for approximately 90% of the time. I've never seen this behavior though, so wonder if it's a bug worth reporting?
13:37 pendingchaos: TheRealJohnGalt: could be the hangs in https://gitlab.freedesktop.org/mesa/mesa/-/issues/3212 ?
13:39 TheRealJohnGalt: pendingchaos: doesn't seem like it for a few reasons. 1) it's only at 4k resolution. 2) the game video will resume for around 5 seconds after about a minute of the hang. 3) The game never crashes from this. 4) I don't see anything in dmesg. 5) my system remains responsive while the game video hangs in the background.
13:40 TheRealJohnGalt: If I run at 1440p or lower, the game is smooth except for the usual radv crashes (already reported there).
14:03 karolherbst: curro: btw.. mind reviewing 64 bit ptr support for clover? I actually hit this issue often enough with nouveau already :) so I kind of want the fix to be upstream: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6064
14:05 karolherbst: but maybe at this point we just want to rework the entire handling of mapped buffers :/
14:05 karolherbst: anyway.. I hope others have better ideas on how to solve this
14:06 karolherbst: the obvious one would be to keep track of pairs, insert the values at the end, and skip the first copy loop
15:37 bbrezillon: jekstrand: could I have your R-b on https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5682 ?
15:37 bbrezillon: anholt, robclark, mattst88: mind taking a look/reviewing https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5588
15:38 bbrezillon: mattst88: only the last patch has changed from the previous version
15:42 bbrezillon: jekstrand, karolherbst: any chance you could have a look at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5900. I'm pretty sure the implementation is not ideal, but I'd like to start the discussion and see how we can propagate alignment constraints and get packed-structs properly supported
15:44 karolherbst: bbrezillon: I'll review a few patches so we could land some of them sooner :)
15:44 karolherbst: we don't have to push everything at once as there seem to be some trivial and some more annoying patches
16:07 karolherbst: bbrezillon: wait.. when do you actually need an atomic load/store? uhhh OpenCL requires you to support it on entire structs/arrays, right?
16:07 jekstrand: bbrezillon: int64 stuff I'm happy with if mattst88 is, I think.
16:24 zmike: jekstrand: you planning to blog about the variable list change? might write something on it if you aren't
16:50 jekstrand: zmike: Wasn't planning to. I didn't think that was all that big of a change.
16:50 zmike: jekstrand: I only blog about little things, otherwise I risk having to write about things I don't understand
16:51 jekstrand: zmike: Heh. Feel free.
16:51 jekstrand: zmike: I don't blog about anything, really. :-(
16:51 zmike: 👍
16:51 jekstrand: I should blog more.
16:51 zmike: if you're doing real work, your code is the blog
17:00 jekstrand: zmike: I'll just send all my friends and family a link to an RSS feed of the Git log then. 😂
17:01 Sachiel: "I blogged about this thing a while ago, but someone went and changed it since then"
17:03 zmike: jekstrand: that's a lot of reading for them; might wanna consider a digest version
17:03 bbrezillon: karolherbst: nope, atomics are only on scalars
17:04 bbrezillon: no atomics on vectors and/or composite types
17:04 bbrezillon: AFAICT
17:04 karolherbst: uhhh
17:04 karolherbst: mhhh
17:04 karolherbst: where is the benefit then?
17:05 bbrezillon: jekstrand, robclark: thanks for reviewing those patches BTW
17:05 jekstrand: benefit of what?
17:05 karolherbst: atomic load/store on scalars?
17:05 jekstrand: It's different in some hardware
17:05 bbrezillon: that was my initial question to pendingchaos :)
17:05 jekstrand: from normal load/store
17:06 jekstrand: It also matters for the Vulkan memory model because atomics are allowed to have synchronization scopes.
17:06 karolherbst: so.. other hw has non-atomic load/stores?
17:06 bbrezillon: and apparently some HW don't guarantee atomicity on regular scalar load/store
17:06 karolherbst: :/
17:06 bbrezillon: which is unusual
17:06 karolherbst: why design broken hw? that's kind of beyond me :p
17:07 bbrezillon: anyway, even adding the semantics to regular load/store is not that simple
17:07 karolherbst: but I guess if you only have a 16 bit memory bus and do a 32 bit store/load you don't want to care all that much?
17:08 karolherbst: bbrezillon: I want to actually know what hw is broken though :D... I still doubt there is any
17:08 pendingchaos: better behaviour with data races (like two simultaneous stores don't write garbage) and I think different interactions with barriers
17:08 pendingchaos: (how atomic load/store differs from non-atomic load/store)
17:08 pendingchaos: I don't think image load/store on GCN/RDNA is atomic
17:08 karolherbst: well/.. opencl has stuff like atomic_load_explicit which define some ordering/scope
17:09 jekstrand: For 32-bit, I would be surprised if load/store isn't at least somewhat atomic in the sense that if you have a data race you get one or the other complete value.
17:09 jekstrand: However, the interaction with barriers may get weird.
17:09 karolherbst: right...
17:09 jekstrand: Also, for 64-bit types, I wouldn't make any atomicity assumptions at all.
17:10 jekstrand: On Intel, writing a vec2 is atomic, I'm pretty sure. However, I wouldn't make that assumption in general.
17:10 karolherbst: jekstrand: sure, but then we could just lower... but I guess even for that we need to add some opcodes or flag it somehow :/
17:10 karolherbst: I was just wondering if we really need any of that
17:10 jekstrand: Lower to what? Add of 0?
17:10 karolherbst: non atomic load/stores
17:11 jekstrand: Sure. And bbrezillon added such a pass which you can call if you don't want to handle the intrinsics yourself.
17:11 jekstrand: It's pretty trivial.
17:12 karolherbst: pendingchaos: do we even have atomic image load/store?
17:18 karolherbst: I could imagine that you want to limit what values you can see... like atomic_load should only return a written value from the same block or something...
17:18 karolherbst: at least I think that's what the scope parameter is all about
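
To illustrate what the explicit variants carry beyond a plain load/store, a small OpenCL C 2.0 snippet (purely illustrative: cross-work-group spinning has no forward-progress guarantee, the point is just the order and scope parameters):

    __kernel void handoff(volatile __global atomic_int *flag, __global int *data)
    {
        if (get_global_id(0) == 0) {
            data[0] = 42;
            /* release: makes the prior write visible to whoever acquires;
             * memory_scope_device: the ordering applies device-wide */
            atomic_store_explicit(flag, 1, memory_order_release,
                                  memory_scope_device);
        } else if (get_global_id(0) == 1) {
            while (atomic_load_explicit(flag, memory_order_acquire,
                                        memory_scope_device) == 0)
                ; /* spin until the flag is published */
            data[1] = data[0]; /* guaranteed to observe 42 */
        }
    }
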
17:35 jenatali: bbrezillon, daniels: Are we still planning to update https://reviews.llvm.org/D77589 with the fma changes? Or were we going to submit that separately?
17:38 curro: karolherbst: sure i'll take a look
18:59 jenatali: karolherbst: Regarding the offset patches, I'm looking at your feedback, trying to decide what to actually change - sounds like you think vtn should just emit (index + offset) rather than index_with_offset which is lowered?
18:59 jenatali: Then the question is just, should vtn skip emitting the offset if there's an option set to false, or should nir lower it to 0?
19:00 karolherbst: jenatali: good question.. I think lowering it to 0 is probably the cleaner solution.
19:02 karolherbst: jenatali: anyway, how I see it, either there is hw not supporting it (and they have to use two values) or there is hw supporting it, which I doubt, so you have just one value. either way invoc_id and invoc_id_with_offsets would be the same in the second case anyway
19:02 karolherbst: or rather.. I don't see why both intrinsics would be emitted inside the same kernel, making it pointless to differentiate between those
19:03 karolherbst: I am not strictly against adding all variants, I just think that invoc_id + invoc_id_offsets _is_ what will happen on all hw anyway
19:04 karolherbst: and maybe we can just always emit the pair depending on being CL vs GL/Vk
19:10 jenatali: Hm... so the global invocation offsets can be e.g. 1 while the thread group size is e.g. 2
19:10 jenatali: So you can't get the global offset applied via an invocation ID offset
19:11 jenatali: So if you wanted to go from 1 -> 1000000 instead of 0 -> 999999 with a group size of 2, at least for DXIL we'd have to have both to achieve that
19:14 jenatali: karolherbst: Also if we're going to lower to 0, that's still a nir option rather than a vtn option
19:21 karolherbst: jenatali: ahh, no I meant having invocation_id and invocation_id_with_offset is more or less pointless
19:21 karolherbst: I'd just assume invocation_id is what the hw delivers and invocation_id_offset is an API feature we have to add on top
19:21 karolherbst: maybe it makes more sense to look at it like this
19:21 jenatali: Ohhh, I see what you're saying. Yeah the with_offset ones were just so that I could still have vtn emit one intrinsic and lower it to two after the fact
19:22 jenatali: Because the vtn pass for system values is set up that way and having it emit alus and multiple system values would've been ugly :)
19:22 karolherbst: yeah.. but for me it's not really "lowering" what happens here, just adding a feature from a broken API :p
19:23 karolherbst: and I think having vtn add the offset is fine.. but maybe somebody else with more of a design perspective has an opinion on that?
19:24 jenatali: Makes sense. I'll remove the with_offset variants, have vtn emit two values, and keep the nir option which lowers the offset to 0
19:24 karolherbst: jenatali: I don't think we need the lowering to 0 even anymore then
19:24 karolherbst: GL/Vk will just tell vtn to not emit both
19:24 karolherbst: and then we would be done
19:24 karolherbst: if you run CL you have to support offsets
19:24 karolherbst: it's not optional
19:25 karolherbst: but if it's left to the driver's control, it becomes optional
19:25 karolherbst: (and wrong)
19:25 jenatali: Fair. I was mainly doing it as a way to not break clover, since I don't really know how to test it :)
19:25 karolherbst: which is fine :)
19:25 karolherbst: I am happy to write/test code for clover to wire things up
19:25 karolherbst: but we can also default to not emitting those for now with clover
19:26 karolherbst: as this wouldn't cause regressions at least
19:26 jenatali: Ohh I understand what you're saying, got it
19:26 karolherbst: well.. there is one thing though
19:26 karolherbst: right now with LLVM backends (amdgpu) clover adds a hidden parameter to the kernel to provide the offset
19:27 karolherbst: now that we kind of assume no hw natively supports offsets, that _might_ be a viable option as it requires less work for drivers
19:27 karolherbst: but it also steals away from the kernel input buffer
19:28 jenatali: I'd rather keep it separate and let the driver combine them if they want
19:28 karolherbst: and I am not sure if stealing away could make some devices not compliant with CL
19:28 karolherbst: 1k bytes are required for !CUSTOM devices
19:29 karolherbst: and I am not sure if there are devices which are CL capable but only have 1k const buffers
19:30 karolherbst: so I am still unsure on this.. I'd rather not mess around with llvm (as that's where the additional arg comes from) when I don't have a practical benefit of doing so
19:30 jenatali: Oof... having vtn actually emit the offset is going to be difficult I think...
19:30 karolherbst: ohh mhh :/
19:30 jenatali: Since vtn emits a variable for the sysval rather than a load intrinsic
19:30 karolherbst: ahh, true :/
19:30 jenatali: Nir converts a load of that var into the load intrinsic
19:31 karolherbst: right.. I totally forgot about this
19:31 karolherbst: curro: are you aware of devices only having a 1k const buffer?
19:31 karolherbst: or well.. storage they'd use for kernel inputs?
19:32 karolherbst: mhh.. seems like for CL1.0 it was 256 bytes even
19:32 karolherbst: changed in CL1.1 to 1k
19:34 jenatali: Think it's worth trying to hook an OpLoad of a builtin variable for these IDs into two loads + an ALU within vtn? I'm leaning towards leaving it in lower_system_values in nir
19:34 jenatali: At which point, the change you're asking for is just a vtn cap to determine whether the sysval is ID or ID_with_offset?
19:35 karolherbst: mhhh
19:35 karolherbst: maybe that's not a bad idea
19:35 karolherbst: nir could lower ID_with_offset into two intrinsics
19:36 karolherbst: and ID like before
19:36 jenatali: Right, which is what it does now
19:36 jenatali: Though I guess yeah, it lowers it first into an intrinsic for ID_with_offset, and *then* into two intrinsics
19:36 jenatali: Probably should just skip the intermediate
19:36 karolherbst: I still think we could just remove the with_offset intrinsics altogether
19:36 karolherbst: yeah
19:44 jenatali: karolherbst: Lowering global invocation ID from local ID - currently I have it so that the local ID offset is added in, so the same global offset can be used for all of the looped dispatches. If you want me to remove the nir flag for supporting offsets, I guess that means the global offset needs to increment for looped dispatches?
19:45 karolherbst: jenatali: I thought that's the case already? Or what do you mean?
19:47 jekstrand: We have this same problem with like 3 other system values.
19:47 jenatali: Right now I have it so that the global ID is computed from ((group ID + group offset) * group size) + ID within group + global offset
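
Written out per dimension in plain C (parameter names hypothetical; 64-bit because CL's get_global_id() is size_t-based):

    #include <stdint.h>

    /* what the lowered global-invocation-id load computes, per the above */
    static uint64_t global_id(uint32_t group_id,      /* hw work-group id          */
                              uint64_t group_offset,  /* added for looped dispatch */
                              uint32_t group_size,
                              uint32_t local_id,      /* id within the group       */
                              uint64_t global_offset) /* CL global work offset     */
    {
        return (group_id + group_offset) * group_size + local_id + global_offset;
    }
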
19:47 jenatali: jekstrand: Is it solved so there's a pattern I should follow? :)
19:48 jekstrand: Take VERTEX_ID, for instance. We have VERTEX_ID, BASE_VERTEX, and VERTEX_ID_ZERO_BASE where VERTEX_ID = VERTEX_ID_ZERO_BASE + BASE_VERTEX.
19:48 karolherbst: jekstrand: yeah.. I think it's fine to require that from CL users, even if they end up emitting a 0 for now
19:48 jekstrand: Then we have an options flag somewhere that says which one you want in your HW
19:48 karolherbst: jekstrand: right.. but there is hw for both ways, right?
19:49 jekstrand: yeah
19:49 karolherbst: for CL we probably don't
19:49 karolherbst: well.. for offsets
19:49 jekstrand: So it serves two purposes:
19:49 karolherbst: and I'd argue we also need the lowering for big grids as there won't be hw insane enough to support it either
19:50 jekstrand: 1. It makes a distinction between the gl_VertexID which includes the base and the hardware value which doesn't so that each intrinsic has exactly one value.
19:50 jekstrand: 2. Gives us something we can lower depending on what the driver wants.
19:50 jekstrand: I think 1 is important.
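
In miniature, the two lowerings that per-driver flag selects between (following the VERTEX_ID example; a sketch, not the actual NIR pass):

    #include <stdint.h>

    static uint32_t to_vertex_id(uint32_t zero_base, uint32_t base_vertex)
    {
        return zero_base + base_vertex; /* hw delivers the zero-based value */
    }

    static uint32_t to_zero_base(uint32_t vertex_id, uint32_t base_vertex)
    {
        return vertex_id - base_vertex; /* hw already folds the base in */
    }
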
19:51 karolherbst: right...
19:52 karolherbst: that's why jenatali added the _with_offset variants. I just argue that we can skip adding them, as there won't be hw supporting it; if we declare invoc_id to be the "hw" value, we have an additional _offset one as the API feature. but yeah...
19:52 karolherbst: I see why both ways are fine to do
19:52 karolherbst: I agree with adding both system values though, that sounds like a good idea
19:53 jekstrand: There's always some question over naming.
19:53 karolherbst: and let nir lower them to either invoc_id (INVOCATION_ID) or invoc_id + invoc_offsets (INVOCATION_ID_WITH_OFFSETS)
19:53 jekstrand: If we went along with the same convention as gl_VertexID, it would be invocation_id and invocation_id_zero_base.
19:54 jekstrand: and base_invocation_id
19:55 jenatali: Yeah, I was hesitant to change the meaning of existing things though, since invocation_id already exists but means invocation_id_zero_base
19:56 jekstrand: jenatali: Well, it really means both right now because it's in the GL world where there is no distinction.
19:56 jenatali: True :)
19:57 jenatali: Though Vulkan has vkDispatchBase
19:57 jekstrand: And if the series to add the base invocation id handling starts with a rename of invocation_id to invocation_id_zero_base and it compiles fine in CI, I don't think anyone will complain.
19:57 jekstrand: Yeah, the Vulkan feature I did really poorly in ANV.
19:57 jekstrand: I'd be happy to see that reworked in terms of a new intrinsic.
19:57 jekstrand: I almost did it a month ago or so
19:58 jekstrand: But -EBIGGERFISH
20:01 karolherbst: I still think we don't need both variants actually as invocation_id should be the 0 based hw value and we just add the offset on top.. or is there hw actually able to do some 32 bit (or even 64 bit) magic to add the offset?
20:03 jenatali: So, CL supports both global invocation ID offsets, and workgroup ID offsets, while Vulkan only supports the latter - and requires that offset to impact the global invocation ID
20:04 jenatali: That makes it seem like the load done as part of lowering for global invocation ID should include the workgroup ID offset, if it's present
20:05 jekstrand: Uh...
20:05 jekstrand: Wait, in CL, is global the "simple" global id?
20:05 jenatali: "simple"?
20:05 karolherbst: jekstrand: there is no difference in CL
20:05 karolherbst: there is just the one having all
20:06 jekstrand: On, no, there are multiple names
20:06 karolherbst: so get_global_id() and others return the value with the offset added _and_ big grids lowered
20:06 jekstrand: GL/vulkan have Global, Local, and Workgroup where Global is a combination of local and workgroup
20:06 jekstrand: https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/gl_GlobalInvocationID.xhtml
20:06 karolherbst: ohhh. I see
20:06 jenatali: Yeah that's the same naming that CL uses I think
20:07 karolherbst: yeah
20:07 jekstrand: So what we really want is work_group_id_zero_base and base_work_group_id
20:07 jenatali: And for CL we'll also need base_global_invocation_id and invocation_id_zero_base
20:07 karolherbst: sounds about right
20:07 jekstrand: No, you only need the workgroup versions
20:07 jenatali: Nope
20:07 jekstrand: Wha?
20:07 jenatali: CL supports API-specified *global* ID offsets
20:07 jekstrand: The global values are derived from workgroup and local
20:08 jenatali: Which might not be a multiple of workgroup size
20:08 karolherbst: yeah
20:08 jekstrand: Seriously? That's garbage.
20:08 karolherbst: it is :)
20:08 jenatali: And which have no impact on workgroup IDs
20:08 karolherbst: those are just added to whatever is the internal result
20:08 jekstrand: Ok, fine, we need base_global_invocation_id
20:08 karolherbst: like
20:08 jenatali: Right
20:08 karolherbst: because developers are too lazy to add another argument to the kernel :p
20:08 jekstrand: I don't think we need global_invocation_id_zero_base. We can do it all in nir_lower_system_values.
20:09 karolherbst: jekstrand: for vulkan we need both global_invocation_id_zero_base _and_ global_invocation_id?
20:09 karolherbst: uhm
20:09 karolherbst: work_group I meant
20:09 jekstrand: karolherbst: For Vulkan, we need the workgroup ones
20:09 karolherbst: I meant, do we need work_group_id_zero_base _and_ work_group_id?
20:10 karolherbst: or just work_group_id_zero_base?
20:10 jekstrand: We should probably go all the way with all of them.
20:10 jekstrand: I don't know
20:10 karolherbst: yeah...
20:10 karolherbst: I just don't think any hw supports the offseted ones natively
20:11 karolherbst: especially the CL ones as those are 64 bit
20:11 jekstrand: The reason why I'm fine with not bothering with global_invocation_id_zero_base is that global_invocation_id is a computed system value and we're just making the computation more complex. global_invocation_id_zero_base will never be generated.
20:11 jenatali: So... more or less exactly what's in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5891, minus naming?
20:12 jenatali: jekstrand: global_invocation_id_zero_base is a system value in DXIL, FYI
20:12 karolherbst: jekstrand: true
20:12 jekstrand: Random question: Are anyone else's GitLab emojis messed up?
20:12 jekstrand: jenatali: What do you mean?
20:12 jenatali: SV_DispatchThreadID
20:12 jekstrand: Oh, I see now.
20:13 jekstrand: So, yeah, you might want global_invocation_id_zero_base. That's fine.
20:13 jenatali: We could compute it in nir, but I'd prefer to forward it through to DXIL (when there's no workgroup offset)
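
For reference, the HLSL/DXIL compute system values being mapped here, with the NIR names following the naming discussed above:

    SV_GroupID          ->  work_group_id       (zero-based)
    SV_GroupThreadID    ->  local_invocation_id
    SV_DispatchThreadID ->  global_invocation_id_zero_base
                            (group_id * group_size + local_id, no CL offsets)
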
20:13 jekstrand: Just more if statements in nir_lower_system_values.
20:13 karolherbst: :/
20:13 jekstrand: NIR intrinsics are dirt cheap
20:13 jekstrand: I don't get why people get so bent out of shape about adding more.
20:13 jenatali: Yup. Again pretty much exactly what's in !5891, just named differently
20:13 karolherbst: and I hoped the code could remain a bit simpler with removing some of them
20:14 karolherbst: jenatali: so I guess you recompile for each iteration?
20:14 karolherbst: just wondering why it is optional
20:15 jenatali: karolherbst: We recompile per dispatch, if necessary. If we don't need to loop dispatches, then we'll remove that offset from the compilation
20:15 jenatali: Which is the 99% case
20:15 karolherbst: yeah....
20:15 karolherbst: uff
20:15 karolherbst: workarounding a broken API :/
20:15 karolherbst: *sigh*
20:15 jenatali: :)
20:15 karolherbst: not that it matters performance wise though.. or shouldn't all that much
20:15 karolherbst: just painful it's 64 bit math
20:15 jekstrand: One add shouldn't be noticeable
20:16 karolherbst: jekstrand: ... those are not just one :p
20:16 karolherbst: but yeah...
20:16 karolherbst: well
20:16 karolherbst: on maxwell/pascal that's already annoying to read
20:16 jekstrand: The moment you touch memory (which you have to do from a kernel), 20 adds aren't noticeable.
20:17 karolherbst: like 25 instructions to just calculate the invoc id...
20:17 karolherbst: right
20:17 jenatali: Yeah my only thought was just about passing the system value straight to DXIL vs computing it in nir - if there's workgroup offsets we have to compute it in nir
20:17 karolherbst: jenatali: I am wondering if it makes sense to always add the offset: simpler code, and skipping it probably has no perf benefit :/
20:17 karolherbst: but mhhh
20:17 jenatali: Which isn't the end of the world, and the resulting DXIL compiler is probably going to compute it anyway
20:18 jenatali: Just results in slightly nicer looking code at the end, from what I can see
20:18 karolherbst: yeah.. I agree about the nicer looking code
20:18 karolherbst: and the backend compiler _will_ do a better job lowering
20:18 karolherbst: nvidia respects value ranges so they emit a bunch of 32 bit operations
20:18 karolherbst: not 64 bit ones
20:21 jekstrand: We need range analysis in NIR
20:21 karolherbst: yep
20:21 jekstrand: For a bunch of this OpenCL address stuff, it'd be a huge win.
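
A sketch of that win, assuming range analysis can prove the dispatch fits in 32 bits:

    #include <stdint.h>

    /* CL semantics: get_global_id() is 64-bit, so naively this is a
     * 64-bit multiply-add on the GPU */
    static uint64_t gid_naive(uint64_t group, uint64_t size, uint64_t local)
    {
        return group * size + local;
    }

    /* if all operands and the result provably fit in 32 bits, the backend
     * can emit 32-bit ALU ops plus a cheap zero-extend instead */
    static uint64_t gid_ranged(uint32_t group, uint32_t size, uint32_t local)
    {
        return (uint64_t)(group * size + local);
    }
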
20:27 jenatali: jekstrand, karolherbst: There's one more system value related to this, which is number of groups. For CL, if we're lowering huge grids, we also need to return the right thing from get_num_groups(), which probably means it's a UBO lookup
20:28 jenatali: Right now my patch adds a "total" number of groups sysval, which is lowered to the pre-existing number of groups if offsets aren't supported
20:28 karolherbst: jekstrand: that one additional ubo lookup doesn't matter when you already have 10 :p
20:28 karolherbst: also.. UBO accesses are cheap
20:28 karolherbst: if not, the hw is broken
20:28 jekstrand: karolherbst: Our HW is broken then. :-P
20:29 jekstrand: karolherbst: Which is to say that we'd rather get it in the input buffer so we can push it.
20:29 karolherbst: what can I say? for us a direct ubo load equals register access :p
20:29 karolherbst: ahh push constants.. right
20:29 jenatali: I'm just wondering if different sysvals makes sense or not
20:29 karolherbst: jenatali: I guess in the end it's simply lowered everywhere
20:29 jekstrand: jenatali: Not sure what to say about the work group size.
20:29 jekstrand: That one's tricky
20:29 jekstrand: jenatali: That's a case where you really are faking it for the driver.
20:30 karolherbst: mhhhh
20:30 karolherbst: well...
20:30 jekstrand: Everything up until that point hasn't really been lies.
20:30 karolherbst: jekstrand: it's easy for us, our hw doesn't have a native sys val :)
20:30 karolherbst: for group size
20:30 jenatali: Yeah DXIL doesn't have a native sysval either
20:30 karolherbst: so it's already lowered
20:30 karolherbst: or well.. ubo lookup
20:30 jenatali: So for us either way it's a UBO lookup
20:30 karolherbst: yeah...
20:30 karolherbst: so the runtime can just provide the full value
20:31 karolherbst: and we don't need any lowering
20:31 jekstrand: karolherbst: Neither does ours
20:31 jekstrand: We put it in an SSBO
20:31 jekstrand: Which is garbage
20:31 danvet: sravn, other reason why bikesheds for simple patches aren't worth it: you don't forget to apply v2
20:31 jekstrand: We should push it
20:31 jekstrand: But we don't
20:31 karolherbst: jekstrand: yes... sounds like it
20:31 jekstrand: 'cause stupid
20:31 * jekstrand blames jljusten
20:31 danvet: sravn, re the drm_gem_object_put bugfix, I just pushed it now
20:31 jenatali: Just wondering if I should have one which *can* be a hardware-provided value, and one which definitely can't, or if we should just have one
20:31 karolherbst: jekstrand: we have a 64k driver UBO where we put all that stuff :)
20:31 karolherbst: so there is space
20:31 jekstrand: karolherbst: We have a 2K driver UBO to put that sort of stuff. It's called the push buffer. :-P
20:31 karolherbst: fun fact... tex handles are also read from a UBO by the tex instructions
20:32 jekstrand: Except we don't use it for workgroup size for stupid reasons.
20:32 jekstrand: Oh, actually, I remember why now. It's because it's annoying to push it with DispatchIndirect
20:32 karolherbst: ehhhh
20:32 jekstrand: And num_work_groups isn't that commonly used so meh.
20:32 karolherbst: jekstrand: do you know what we have for dispatchIndirect?
20:32 jekstrand: karolherbst: Do I want to?
20:32 karolherbst: maybe? :D
20:33 jekstrand: karolherbst: Does it involve writing to the batch from a compute shader?
20:33 karolherbst: jekstrand: worse but also better
20:33 karolherbst: so.. we have this macro language we can script our graph engine with
20:33 karolherbst: and... it can do stuff
20:33 karolherbst: like dispatching work
20:33 karolherbst: :)
20:33 karolherbst: so it can also write into const buffers/memory and all that shit
20:34 karolherbst: so.. we adjust values, do some prep stuff and push the grid from a macro
20:34 jekstrand: karolherbst: The difficulty there is that we'd have to COW the push buffer somehow
20:34 jekstrand: We can write stuff to memory too
20:34 karolherbst: we even have an assembler for it
20:34 jekstrand: It's maintaining the buffering that's hard. :-/
20:35 karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/tree/master/src/gallium/drivers/nouveau/nvc0/mme is where the stuff lives
20:35 danvet: mripard, mlankhorst: tzimmermann needs to do a drm-misc-fixes pull this week, some important stuff in there
20:35 karolherbst: this is the grid indirect stuff: https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/gallium/drivers/nouveau/nvc0/mme/com90c0.mme
20:35 danvet: airlied, ^^ fyi
20:37 sravn: danvet: I had totally lost track of that patch. I did not even recognize it when I read Greg's mail. Thanks for pushing
20:37 sravn: I have cleared all drm-misc-fixes patches now. Or so I hope.
20:38 jenatali: jekstrand, karolherbst: Thoughts about whether it makes sense to have a "true" count sysval vs one that's definitely coming from the driver? Based on your discussion I'm leaning towards just one now. Not sure if that's limiting which HW can support CL through Clover though
20:40 jekstrand: jenatali: Will you ever pass it on to the driver?
20:41 danvet: mripard, mlankhorst should also tell tzimmermann to at least sometimes fast-forward -fixes
20:41 danvet: it's still stuck on -rc1
20:41 jenatali: jekstrand: The way it's written right now, if there's workgroup offsets, then the driver will get the "total" count sysval. If there's no offsets, the driver gets the pre-existing count sysval
20:41 jekstrand: jenatali: Except for the case where you re-compile for a zero offset, I don't think there's ever going to be a case where the driver is going to see the work group size in the shader.
20:41 danvet: which isn't so great for bisecting
20:42 jekstrand: jenatali: You either lower it to a UBO load or an input because you've got offsetting going on, or everything fits and you pass it on to the driver.
20:42 jekstrand: Which actually raises an interesting question: How are we defining all this grid dispatch stuff?
20:43 jenatali: I've been focusing on the CLOn12 path, which doesn't have a gallium driver, so it doesn't use the existing grid dispatch stuff
20:43 jenatali: Just trying to make sure that the nir changes also align with what clover wants
20:45 jekstrand: jenatali: Yeah, I get that.
20:45 jekstrand: I'm just wondering how things are defined in clover because I suspect it may be wrong.
20:45 jenatali: Ah :)
20:45 karolherbst: jekstrand: clover adds an additional argument for the invocation id offset
20:45 karolherbst: it doesn't support lowering big grids
20:45 karolherbst: for spirv/nir we don't support any of this yet
20:46 jljusten: jekstrand: what stupid thing did I do?
20:46 jekstrand: jljusten: Oh, I remembered now why it wasn't stupid. :)
20:46 jekstrand: jljusten: You used a SSBO for work_group_size.
20:47 jekstrand: jljusten: Instead of pushing it. But then I remembered indirect dispatch and got sad.
20:47 jekstrand: karolherbst: What I mean is I'm wondering how we want to spec splitting.
20:47 jekstrand: karolherbst: I don't think we actually want dims and base.
20:48 jljusten: yeah. I mean, we could make a different program for indirect vs direct, I guess. :)
20:48 jekstrand: jljusten: Yeah. Probably not worth it. :)
20:48 jenatali: jekstrand: What would you do instead of dims and base?
20:48 jekstrand: karolherbst: I suspect we want work group dimension and then a range of sorts.
20:49 jekstrand: where work group dimension is the "real" thing and range is a 3D base+size that specifies a sub-range.
20:49 karolherbst: so.. this is what we have today: https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/gallium/include/pipe/p_state.h#L822
20:49 karolherbst: those are the things gallium tells the driver
20:49 jekstrand: Yeah, I'm looking at that now.
20:49 jekstrand: I don't get all this block and last_block stuff.
20:49 karolherbst: I added an offset[3] for local testing
20:50 karolherbst: jekstrand: is last_block even used?
20:50 jekstrand: karolherbst: I have no idea
20:50 karolherbst: ohh radeonsi uses it
20:50 jenatali: jekstrand: CL 2.0 allows non-uniform thread groups?
20:50 jekstrand: karolherbst: I assume that stuff is there for a reason.
20:50 jenatali: Not sure if VK maybe does something similar
20:51 jekstrand: jenatali: What do you mean by non-uniform thread groups?
20:51 jenatali: E.g. work group size of 4, dispatch 5 work items, the second work group only has 1 work item in it
20:51 karolherbst: jekstrand: it's used from util_compute_blit
20:51 karolherbst: and only from there
20:51 jekstrand: jenatali: Yeah, Vulkan doesn't have anything like that.
20:51 karolherbst: jekstrand: does CL even allow this?
20:51 jekstrand: jenatali: You're allowed to "if (x > 5) return;" in your shader. :-)
20:52 jenatali: karolherbst: Yep. Let me find the spec bits
20:52 karolherbst: heh? that would surprise me honestly
20:52 karolherbst: grid_size has to be a multple of local_size afaik
20:52 jekstrand: CL 2 has lots of extra stuff in it. There's a reason they pulled it all back out for CL 3 :)
20:52 jenatali: "If non-uniform work-groups are supported, any single dimension for which the global size is not divisible by the local size will be partitioned into two regions. One region will have work-groups that have the same number of work items as was specified by the local size parameter in that dimension. The other region will have work-groups with less than the number of work items specified by the local size parameter in that dimension. The global IDs and group IDs of the work items in the first region will be numerically lower than those in the second, and the second region will be at most one work-group wide in that dimension. Work-group sizes could be non-uniform in multiple dimensions, potentially producing work-groups of up to 4 different sizes in a 2D range and 8 different sizes in a 3D range."
20:53 karolherbst: I read "If non-uniform work-groups are supported" and stopped reading :p
20:53 jenatali: Yeah pretty much :P
20:53 karolherbst: how useless
20:53 karolherbst: and annoying
20:53 karolherbst: the heck...
20:53 karolherbst: what were they thinking
20:53 karolherbst: but yeah..
20:53 jenatali: It's a required 2.0 feature, which means optional in 3.0 at least
20:53 karolherbst: I kind of see the value for developers not having to deal with this themselves
20:54 karolherbst: ufff...
20:54 karolherbst: the issue is kind of: if your global size isn't a multiple, or making it one forces non-ideal local_sizes, you are screwed
20:54 karolherbst: and I guess that's what they tried to workaround
20:55 karolherbst: so you want to run with the perfect amount of threads and just do the last bits with a special launch or whatever
20:55 jekstrand: I wonder if barriers are expected to work.
20:55 jekstrand: If they are, that sounds very annoying.
20:55 karolherbst: I assume so
20:55 karolherbst: barriers are..... complicated
20:55 karolherbst: have like thousands of options on the hw...
20:56 jekstrand: I suppose it probably works ok on our HW. I think barriers are per-thread not per-SIMD-lane.
20:56 karolherbst: yeah
20:56 karolherbst: here as well
20:56 karolherbst: well...
20:56 jekstrand: So you can just halt those lanes and as long as you have one lane active, the barrier works.
20:56 karolherbst: you are supposed to have converged threads :)
20:56 karolherbst: pre volta that is
20:57 karolherbst: ahh yeah.. there are barriers with explicit thread masks as well
20:57 karolherbst: but I think threads still have to converge before executing the barrier
20:57 karolherbst: well.. they don't have to, but then they do something unexpected
20:59 jekstrand: Yeah, barriers are only allowed in uniform control-flow.
20:59 jekstrand: Which, in OpenCL means you're not allowed to have diverged ever.
20:59 karolherbst: jekstrand: you know how that's solved in cuda/ptx?
20:59 karolherbst: it's quite simple really
20:59 jekstrand: karolherbst: No idea. I don't know what the rules there are.
21:00 karolherbst: the compiler converges for you :p
21:00 jekstrand: karolherbst: "quite simple" Yes, that's what that is. :-P
21:00 karolherbst: :D
21:00 karolherbst: well
21:00 karolherbst: we have instructions to sync threads
21:00 karolherbst: and we just insert those around ifs and loops
21:00 karolherbst: :p
21:00 karolherbst: it's really not that hard
21:00 karolherbst: we have to do that for GL already anyway
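
In OpenCL C terms, the uniform-control-flow rule for barriers looks like this:

    __kernel void rotate_one(__global float *buf, __local float *tmp)
    {
        size_t lid = get_local_id(0);

        /* undefined behaviour: a barrier in divergent control flow, e.g.
         *     if (lid < 32) barrier(CLK_LOCAL_MEM_FENCE);
         */

        tmp[lid] = buf[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE); /* fine: every work item reaches it */
        buf[get_global_id(0)] = tmp[(lid + 1) % get_local_size(0)];
    }
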
21:02 jenatali: So... work group sizes... :)
21:06 karolherbst: jenatali: mhhhh
21:06 karolherbst: you can't cheat your dx runtime, can you?
21:06 jenatali: Cheat?
21:06 karolherbst: so you need to add lowering code, right?
21:06 jenatali: Right
21:07 jenatali: It doesn't need to be in core nir though, it can be in our DX-specific compiler bits
21:07 karolherbst: right.. for mesa we could just push the full value to the drivers probably
21:07 karolherbst: I don't know if there is hardware having a native sysval.. let me check
21:07 jenatali: Thanks - not sure how I'd go about checking that, but I guess that's in essence what I'm asking
21:08 karolherbst: mhhh
21:08 karolherbst: in nir we only have nir_intrinsic_load_local_group_size, right?
21:09 karolherbst: mhh.. gl_NumWorkGroups
21:09 jenatali: Yeah, load_num_work_groups
21:09 karolherbst: right
21:10 karolherbst: ahhh
21:10 karolherbst: seems like up to kepler that was supported...
21:11 karolherbst: uhhh
21:11 karolherbst: I think that's implicitly calculated even :/
21:11 karolherbst: now I am annoyed
21:12 jenatali: So, sounds like 2 - one that can be hardware or software, and one that has to be software?
21:13 karolherbst: I am still not 100% sure
21:14 karolherbst: so we do push those to the driver const buf
21:14 karolherbst: at least the grid dimension
21:14 karolherbst: and block
21:15 karolherbst: but we also specify them when launching the grid via the launch descriptor
21:15 karolherbst: imirkin: would reading out the grid size via c7 be faster/slower than using system values?
21:16 karolherbst: jenatali: anyway.. so we need kind of a few changes to the gallium API I guess
21:16 jekstrand: jenatali: So where I'm leaning on workgroup sizes is that we have load_work_group_size and that's the "real" one.
21:16 karolherbst: dispatched grid size and "total" grid size
21:16 karolherbst: invoc offsets
21:16 karolherbst: grid id offsets
21:16 karolherbst: I think...
21:17 jekstrand: Then at the gallium dispatch layer (or whatever you're doing), you invoke 3D ranges of base+size.
21:17 jekstrand: And that size never goes into shaders.
21:17 jekstrand: The one exception is if you know that you are only going to have one dispatch then you can pass num_work_groups through to the back-end (DXIL in your case)
21:18 karolherbst: yeah....
21:18 karolherbst: I kind of like the maxwell+ model of our hw
21:18 karolherbst: the hw only knows the local and global id
21:18 karolherbst: everything else is driver provided data
21:18 jenatali: Got it, one sysval, lowered into a ubo load when we're looping
21:18 jenatali: That works
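
Roughly the driver-visible contract being converged on, as a sketch (field names hypothetical):

    #include <stdint.h>

    /* uploaded per (looped) dispatch so the lowered sysvals all read
     * consistent values */
    struct cl_dispatch_params {
        uint32_t num_work_groups_total[3]; /* what get_num_groups() reports */
        uint32_t work_group_id_offset[3];  /* added to the hw work-group id */
        uint64_t global_id_offset[3];      /* CL global work offset, 64-bit */
    };

    /* when looping, the num-work-groups sysval lowers to a UBO load of
     * num_work_groups_total instead of the per-dispatch hw/driver value */
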
21:20 karolherbst: jenatali: I'll probably convince pmoreau to test some patches on that for the full lowering
21:20 karolherbst: pmoreau has a nv50 era GPU and those only have 2D grids :)
21:20 jenatali: Whoa...
21:20 jenatali: Did not know that was a thing
21:20 karolherbst: well the blocks (local) are 3D
21:20 karolherbst: but grids are 2D only
21:21 karolherbst: and they only have 512 threads :)
21:21 karolherbst: {32, 16, 2} (= 1024 threads) can't be dispatched by the hw :)
21:21 karolherbst: and needs lowering
21:21 jekstrand: karolherbst: Oh, that's "fun"
21:21 jenatali: Well-said :P
21:22 jekstrand: karolherbst: We have some HW where we have to run SIMD32 in order to get 1024 which is pretty annoying.
21:22 karolherbst: jekstrand: we don't even support GL compute on those :p
21:22 imirkin: karolherbst: no idea. the constbuf would have to be updated too...
21:22 jekstrand: But nothing where we have to actually lower.
21:22 karolherbst: imirkin: we already do
21:22 imirkin: karolherbst: and would it work with indirect?
21:22 imirkin: i guess you could have the macro update the ubo directly
21:22 karolherbst: mhhh... don't we use the same macro?
21:22 karolherbst: I mean.. it all works for maxwell anyway
21:22 imirkin: i forget how that stuff works tbh - i'm not sure we use a macro on later gens
21:22 karolherbst: and since maxwell we only have local and global ids
21:22 imirkin: or i forget how that macro works
21:22 karolherbst: nothing else
21:24 karolherbst: anyway.. compute only really became useful with fermi.. but OpenCL is the only compute you can do on nv50 :/
21:24 airlied: karolherbst: did nvidia expose gl compute on nv50?
21:24 imirkin: no
21:24 imirkin: only opencl
21:25 karolherbst: airlied: you can't do images from !compute shaders
21:25 imirkin: GL compute requires ssbo / images, and ssbo/images require being done from frag shaders
21:25 karolherbst: and also no global memory
21:25 airlied: ah lols
21:25 karolherbst: so.. technically you could do GL compute
21:25 karolherbst: you just couldn't advertise the extensions
21:25 imirkin: should be enough for ES 3.1 compute/ssbo/etc
21:25 karolherbst: imirkin: ohh really?
21:25 imirkin: can't do indirect, but that can be faked
21:25 karolherbst: yeah...
21:25 karolherbst: maybe that's something we should do actually?
21:25 karolherbst: would give us some testing :d
21:25 imirkin: only thing offhand that i can think of that won't work
21:26 imirkin: yeah
21:26 imirkin: i guess i have the hw on-hand, maybe i should give it a go
21:26 imirkin: curro did the hard work already
21:26 imirkin: of lowering images to ssbo
21:26 karolherbst: no surface ops on nv50?
21:26 imirkin: nope
21:26 karolherbst: ehhh
21:26 karolherbst: but yeah.. would be cool to be able to run the gles tests on nv50 :)
21:27 karolherbst: would help pmoreau :p
21:27 imirkin: we could do staged impl too - e.g. only support ssbo/atomic
21:27 imirkin: and leave images to later
21:27 imirkin: that sounds like a fun project actually - thanks for reminding me :)
21:27 karolherbst: yeah...
21:27 imirkin: i'll check what else we need for ES 3.1...
21:27 karolherbst: would be a good first step at least
21:28 imirkin: uff ... enhanced textureGather
21:28 karolherbst: do we even advertise 3.0 for nv50?
21:28 imirkin: yeah, but we shouldn't :)
21:28 karolherbst: :D
21:28 imirkin: pre-GT200, no support for pause/resume feedback
21:28 karolherbst: ahhh
21:28 karolherbst: oh well
21:28 karolherbst: I hope nobody uses that feature
21:29 imirkin: the "enhanced" texture gather maybe problematic. i need to check what all ES 3.1 actually enables...
21:29 imirkin: indirect draws(/compute dispatch) can be faked
21:29 imirkin: everything else should be doable
21:30 karolherbst: I wouldn't be surprised if there is already lowering for the enhanced texture gather stuff
21:30 karolherbst: somewhere
21:31 imirkin: ok, so texgather with explicit component selection is a no-go
21:31 karolherbst: nir has lower_txd_cube_map, lower_txd_3d and lower_txd_shadow _and_ lower_txd :D
21:31 imirkin: first of all, texgather is only a thing on the 10.1 GPUs (i.e. GT21x)
21:32 imirkin: secondly, it only allowed RED to be selected
21:32 karolherbst: txd was textureGather, no?
21:32 imirkin: no
21:32 karolherbst: ehhh tg4 was?
21:32 imirkin: yes
21:32 karolherbst: annoying
21:32 imirkin: txd = derivatives
21:32 karolherbst: txd was fetch?
21:32 imirkin: txf = fetch
21:32 karolherbst: what was txd then?
21:33 imirkin: see above
21:33 karolherbst: ahh derivatives
21:33 karolherbst: ehh
21:33 imirkin: textureGrad & co
21:33 karolherbst: right...
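
For anyone following along, the opcode shorthand in this exchange:

    tg4 - textureGather
    txd - sample with explicit derivatives (textureGrad)
    txf - texel fetch (texelFetch)
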
21:35 karolherbst: imirkin: well.. at least we could expose the extensions for testing
21:35 karolherbst: even if we don't expose 3.1
21:35 imirkin: no exts
21:35 karolherbst: huh?
21:35 imirkin: it's ES 3.1 or not ES 3.1
21:35 imirkin: there are no exts
21:36 karolherbst: wait...
21:36 * imirkin is waiting
21:36 karolherbst: GLES has only extensions starting with 3.1?
21:36 imirkin: GLES has extensions
21:36 imirkin: starting from 1.0
21:36 karolherbst: ohh those features are just core without extensions?
21:37 imirkin: but there are no extensions that cover the functionality additions in ES 3.1
21:37 karolherbst: annoying :/
21:37 karolherbst: heh wait... I actually have a tesla GPU here
21:37 karolherbst: I totally forgot
21:37 karolherbst: 8600 or so
21:37 imirkin: that's probably a G84 or G86
21:37 karolherbst: yeah
21:37 imirkin: won't even have texgather
21:41 imirkin: or pause/resume xfb
21:46 karolherbst: imirkin: well.. now that I managed to run the CTS with piglit, maybe we can just fix as much as possible :D dunno
21:49 imirkin: karolherbst: i did send a bunch of CTS fixes to make it work better with my board
21:49 imirkin: to my vast surprise, they were accepted
21:49 imirkin: (my board = G84)
21:51 karolherbst: for GL3.3, right?
21:51 karolherbst: I was more thinking of GLES3.1 stuff
21:51 imirkin: yes, for GL 3.3
21:52 imirkin: well, a lot of the GLES 3 CTS tests fail because they'll try to do xfb with pause/resume
21:52 karolherbst: right
21:52 karolherbst: but I was more thinking about random crashes :/
21:52 imirkin: ah. not a lot of those.
21:53 karolherbst: still annoying if you just want to run each test :p
21:53 karolherbst: anyway.. I am planning to hook that up for the CI stuff
21:53 imirkin: k
21:53 karolherbst: and would hook up tesla cards as well probably? dunno
21:53 karolherbst: still very much at the beginning with it :)
21:53 imirkin: should be no harm, other than effort
21:54 karolherbst: well.. not even effort as we do the same stuff for newer gens anyway