13:14 tzimmermann: mlankhorst, will there be another PR for drm-misc-next in this cycle?
14:28 glehmann: airlied: I think the new reduce MR is very reasonable
14:29 glehmann: I just had a few nits
14:29 glehmann: I didn't look at the radv lowering in detail again though, because I think that's what I already reviewed on the other MRs
14:41 alyssa: skimmed it, looks ok
14:41 alyssa: this is not an ack
14:41 alyssa: but it's a "no longer a nak"
15:02 glehmann: so a nnak?
15:19 sima: hwentlan__, https://lore.kernel.org/r/20251112222646.495189-1-mario.limonciello@amd.com I guess still not up to par ...
15:19 alyssa: glehmann: sure
17:24 Venemo: when you turn off a monitor, should it deassert the HPD signal?
17:40 anholt: NIR folks, would appreciate an opinion in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38387 about a proposed direction.
17:52 alyssa: anholt: about the BASE vs BYTE_OFFSET thing?
17:53 alyssa: or about adding BASE to scratch intrinsics?
17:53 anholt: BASE vs BYTE_OFFSET
17:54 alyssa: It's.. probably the right direction to go in
17:54 alyssa: on one hand this gets back to the usual problem of "address arithmetic is slightly different on every ISA, what do we do in NIR"
17:55 K900: So uh, folks, is 25.3 final happening or?
17:55 anholt: at least for a const offset, letting it just be bytes is not a pain for codegen.
17:55 alyssa: on the other hand, if we're going to do offsets at all on the common intrinsics, not overloading the same const index for both I/O locations and byte offsets is not a bad thing
17:55 K900: (trying to figure out if we should update to 25.2.7 or hold off a couple days and jump straight to 25.3)
17:55 Venemo: anholt: commented on the MR
17:55 glehmann: well I think that BASE is badly named for all memory intrinsics, but as long as we don't change all of them at once, it's better to stay with it
17:55 alyssa: glehmann: the issue is that BASE is used for both memory intrinsics and also I/O
17:55 alyssa: and means very different things for each
17:56 alyssa: which I think bit Marek recently which is why we're having this talk :p
17:56 Venemo: I think the linked MR follows an already established pattern, so it would be unfair to ask anholt to change all of that now.
17:56 Venemo: that said, I am in favour of cleaning it up eventually
17:56 alyssa: agreed
17:57 anholt: I'm hearing consensus on "we want to move to BYTE_OFFSET for memory". I'm not hearing consensus on "taking an incremental step to BYTE_OFFSET for a new intrinsic is good"
17:57 Venemo: when we change it, we should change all of them at once, otherwise we just create even more inconsistency
17:58 alyssa: anholt: consensus seems to be "status quo sucks but this is not your problem"
17:59 anholt: incremental feels like mostly a problem when passes like load/store vectorizer handle the base but not the byte offset, for example.
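For context, a minimal sketch (not Mesa code) of the overloading being discussed: on memory intrinsics BASE is a constant byte offset, while on I/O intrinsics it is a location/slot index, so a backend has to interpret the same const index two different ways. The helper name and the slot_stride_bytes parameter are made up for illustration.

    #include "nir.h"

    static int
    base_in_bytes_sketch(const nir_intrinsic_instr *intr, int slot_stride_bytes)
    {
       switch (intr->intrinsic) {
       case nir_intrinsic_load_shared:
       case nir_intrinsic_store_shared:
          /* here BASE is already a constant byte offset added to the offset src */
          return nir_intrinsic_base(intr);
       case nir_intrinsic_load_input:
          /* here BASE is an I/O location, not bytes; it only becomes a byte
           * offset after scaling by a driver-chosen slot stride */
          return nir_intrinsic_base(intr) * slot_stride_bytes;
       default:
          return 0;
       }
    }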
17:59 karolherbst: it almost feels to me like almost every driver will eventually end up with their own load/store intrinsics anyway
17:59 karolherbst: like for NV none of the existing ones are good fits
18:00 karolherbst: so I'll add another set of ~7 just for nv
18:00 Venemo: nothing wrong with that
18:00 karolherbst: my point is rather that there is no point in overthinking it
18:00 Venemo: they should be lowered late enough to not cause issues for most passes, then it's good
18:01 karolherbst: but then we'll need passes to know what's the max offset allowed and pipe that info through
18:01 karolherbst: with the byte offset thing
18:01 karolherbst: which is fine if only opt_offset folds it in
18:03 alyssa: also in terms of my own priorities at least ... as far as invasive cross-tree NIR work that needs to happen, this is not even top 10 priority
18:03 alyssa: not that I would nak if someone else did the work, of course
18:04 mareko: drivers can keep using the generic intrinsics, but they would legalize srcs and attributes, i.e. not change the semantics, but e.g. fold BYTE_OFFSET into the src if the hw can't encode an immediate byte offset in the instruction
18:05 mareko: it's not always necessary to add brand new intrinsics for every driver, but maybe only lower to BDA etc.
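A rough sketch of the legalization mareko describes, assuming a recent NIR (nir_def, nir_src_rewrite, nir_shader_intrinsics_pass naming) and using the existing BASE index on load_shared as the stand-in for a byte offset; the encoding limit is an assumed per-hw parameter, not a real field anywhere.

    #include "nir_builder.h"

    static bool
    fold_unencodable_offset(nir_builder *b, nir_intrinsic_instr *intr, void *data)
    {
       const unsigned max_imm = *(const unsigned *)data; /* assumed hw encoding limit */

       if (intr->intrinsic != nir_intrinsic_load_shared)
          return false;

       const unsigned base = nir_intrinsic_base(intr);
       if (base == 0 || base <= max_imm)
          return false;

       /* the immediate doesn't fit in the instruction encoding, so add it
        * into the offset source and clear the const index instead */
       b->cursor = nir_before_instr(&intr->instr);
       nir_def *off = nir_iadd_imm(b, intr->src[0].ssa, base);
       nir_src_rewrite(&intr->src[0], off);
       nir_intrinsic_set_base(intr, 0);
       return true;
    }

Such a callback would be run via nir_shader_intrinsics_pass with the limit passed as the user-data pointer.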
18:06 anholt: freedreno has a bit of "opt_offsets pulls out the base up to a limit, but also lowers out base beyond the limit." that may be mostly a problem of our own creation, though, I forget the details now.
18:07 karolherbst: well for nv we need 2 offset sources and a constant and a shift (for shared only), so I'm sure that no matter what we do in core nir, we'll add specific intrinsics anyway
18:09 mareko: alyssa: I guess that shuffle+bitcast thing is one of them. what are the other 9? ;)
18:09 anholt: karolherbst: the OFFSET_SHIFT we added should help you with getting shifts.
18:10 karolherbst: anholt: we only support 0, 2 or 3 as shifts and only for shared. But that doesn't solve the issue with the offset sources of which one is uniform, and you can mix 32 and 64 bit offsets
18:11 anholt: I do understand. I'm just pointing you at a solution to part of your problem.
18:11 glehmann: maybe we need something like nir_tex_instr srcs, but for intrinsic load/store
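For reference, the nir_tex_instr pattern glehmann is pointing at: each texture source carries a type tag and is looked up by tag rather than by a fixed position, roughly like this small (illustrative) helper.

    #include "nir.h"

    /* returns the offset source of a tex instruction, or NULL if it has none */
    static nir_def *
    tex_offset_src_or_null(const nir_tex_instr *tex)
    {
       int idx = nir_tex_instr_src_index(tex, nir_tex_src_offset);
       return idx >= 0 ? tex->src[idx].src.ssa : NULL;
    }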
18:11 karolherbst: it feels so target specific that we might as well have target specific intrinsics and just code you can plug into to get what you need
18:12 Venemo: btw when I wrote opt_offsets I didn't realize it would be this useful for everyone
18:13 karolherbst: yeah I'm kind of surprised about all those fancy IO ops the various ISAs have and how they are all different 🙃 but they also all seem to contain constant offsets
18:13 karolherbst: but the constant offset is a total disaster on nvidia
18:13 Venemo: how so?
18:13 karolherbst: it's either signed or unsigned depending on the register offsets
18:14 anholt: karolherbst: we also have some io ops that have two varying offsets (one shifted) plus a constant offset.
18:14 karolherbst: but can it do a 64 + 32 addition?
18:14 anholt: looks like current gen has one that's "uniform offset plus varying offset plus constant offset, no shift."
18:14 karolherbst: but yeah.. the shift on nv only exists on shared memory
18:14 anholt: it's global mem access, and I believe so.
18:14 karolherbst: ahh
18:14 karolherbst: sounds like nvidia then
18:15 Venemo: karolherbst: we have the same mess on amd, except it is also slightly different between each generation
18:15 karolherbst: though nvidia has a 24 bit _signed_ constant offset unless the base address regs are all zero regs, in which case it's unsigned
18:15 karolherbst: like we have a special register that's always zero and it's treated specially in specific cases
18:16 anholt: our const offsets are also signed. with current nir_opt_offsets we just limit to the positive range.
18:16 karolherbst: well.. for global it doesn't matter, but for shared you can do a s[RZ+$unsigned_constant] access
18:16 karolherbst: it's not a toggle
18:17 karolherbst: it's always unsigned with RZ and always signed otherwise
18:17 karolherbst: not sure how that works out with the uniform + non uniform + constant variant
18:18 karolherbst: maybe both need to be RZ or just the uniform one...
18:18 karolherbst: ohh both have to be
18:21 karolherbst: anyway.. for e.g. nir_opt_offsets it's trivial to add support for more intrinsics, and I think it's fine if we have generic IO ops but also specific IO ops, as long as both are supported by the passes we have to optimize it all
18:22 karolherbst: so we don't really have to try very hard to have a generic solution that's good, because we should always be able to plug in vendor specific intrinsics to get the optimal output
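As a concrete example of the "limit to the positive range" approach mentioned above, a driver-side call might look roughly like this. The field names follow nir_opt_offsets_options as it exists in current Mesa to the best of my knowledge, and the 23-bit limit (the non-negative half of a signed 24-bit immediate) is an assumption about one particular ISA.

    #include "nir.h"

    static void
    run_opt_offsets_sketch(nir_shader *shader)
    {
       const nir_opt_offsets_options opts = {
          /* only fold offsets that stay in the non-negative half of the
           * hw's signed immediate field */
          .shared_max = (1u << 23) - 1,
          .buffer_max = (1u << 23) - 1,
          .uniform_max = 0, /* leave uniform loads alone in this sketch */
       };
       nir_opt_offsets(shader, &opts);
    }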
19:03 alyssa: mareko: that's one. vec32+ is another, depends on removing swizzles. removing nir_alu_src is a follow on from removing swizzles. dealing with undef/poison/freeze is another. compressing nir_src use chains is another. pushing semantic I/O throughout the tree is another. establishing canonical pass orderings follows from that etc.
19:04 alyssa: i'm sure i'm forgetting things
19:15 mareko: one of the things on my list is that all code motion within a single shader should be aware of register usage and how code motion changes it
19:16 mareko: without that, GCM and LICM are unusable
19:29 glehmann: alyssa: not using linked lists, to speed up instruction iteration
19:30 glehmann: or maybe I'm overestimating how bad pointer chasing is for modern cpus
19:32 glehmann: one major thing on my list is fixing float controls, so that the dxvk devs can stop shouting at me
19:36 airlied: we do also pack the instructions afaik
19:36 airlied: so we should have okay caching properties
19:36 airlied: even if we still chase pointers
19:37 alyssa: dsNip
19:37 alyssa: mareko: ooh yeah that's a big one
19:37 alyssa: glehmann: also good ones
19:37 airlied: using the gc_alloc stuff
19:38 airlied: so I'm not sure the linked lists avoidance is as important as it once was
19:44 alyssa: I hope you're right because the linked lists are never going away (:
19:45 mareko: there are 2 cases where code motion likely improves register usage: 1) if sizeof(all srcs) > sizeof(dst), moving the instr up is likely better (it's always better if sizeof(dst) <= 4), 2) if sizeof(all srcs) < sizeof(dst), moving the instr down is always better
19:45 alyssa: Yep
19:45 alyssa: nir_opt_sink/nir_opt_move has a fairly conservative heuristic based on those ideas
19:45 alyssa: but we're leaving a lot on the table still
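mareko's heuristic above can be sketched as a simple comparison of total source size against destination size. This is illustrative only; it assumes a recent NIR (nir_def / nir_instr_def naming), and nir_opt_move / nir_opt_sink apply their own, more conservative rules.

    #include "nir.h"

    static bool
    add_src_bytes(nir_src *src, void *state)
    {
       unsigned *bytes = state;
       *bytes += nir_src_num_components(*src) * nir_src_bit_size(*src) / 8;
       return true;
    }

    /* > 0: sources outweigh the dest, so hoisting (moving up) likely helps;
     * < 0: the dest outweighs the sources, so sinking (moving down) likely helps */
    static int
    move_direction_sketch(nir_instr *instr)
    {
       nir_def *def = nir_instr_def(instr);
       if (!def)
          return 0;

       unsigned src_bytes = 0;
       nir_foreach_src(instr, add_src_bytes, &src_bytes);

       unsigned dst_bytes = def->num_components * def->bit_size / 8;
       return (int)src_bytes - (int)dst_bytes;
    }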
19:48 mareko: for loops, anything that is LICM'd extends the liveness of the dst to the whole loop; on top of that, LICM'ing vec8 descriptor loads could exceed the max register count several times over
19:49 glehmann: spilling SGPRs isn't the end of the world at least
19:50 glehmann: ofc the worst case is still bad, but in practice LICM seems to work decently for radv at the moment
19:50 mareko: because it doesn't do anything :)
19:51 mareko: I have an MR making LICM complete, but got stuck on register usage issues
19:52 alyssa: not doing anything is a great way to not hurt register usage
19:52 alyssa: have you seen the AGX scheduler? ;P
19:53 mareko: indeed
19:58 glehmann: well licm does something
19:58 glehmann: it's just conservative
19:59 mareko: given "for (a; b; c) d;", current LICM only hoists invariant code in b
20:02 mareko: "do { a; } while (b);" is the only one where current LICM hoists invariant code from both a & b
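To illustrate the two loop shapes mareko is describing, with inv standing for a loop-invariant value: in the for form only the condition runs unconditionally before the loop can exit, so a conservative LICM can safely hoist just inv * 4, whereas in the do-while form the whole body runs before the condition, so inv * 2 is hoistable too. Plain C as a stand-in for shader code, illustration only.

    void
    for_form(int *out, const int *in, int inv)
    {
       for (int i = 0; i < inv * 4; i++)   /* b: inv * 4 is hoisted today */
          out[i] = in[i] + inv * 2;        /* d: inv * 2 is not */
    }

    void
    do_while_form(int *out, const int *in, int inv)
    {
       int i = 0;
       do {
          out[i] = in[i] + inv * 2;        /* a: hoistable */
          i++;
       } while (i < inv * 4);              /* b: hoistable */
    }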