03:16mareko: fma is also ballooning register pressure over mul+add, many things do
03:21alyssa: yes. and?
07:37danvet: robclark, no ack for the syncobj uapi from gfxstrand or bnieuwenhuizen_ ?
07:45HdkR: danvet: Is this a uapi change? Where can I look at it?
07:47danvet: HdkR, https://lore.kernel.org/dri-devel/20230308155322.344664-10-robdclark@gmail.com/
07:47danvet: some ack from vk people would be good for this I think
07:55HdkR: Alright cool, looks like the msm change will just work since I was already passing through padding, syncobj_wait is also a passthrough with the correct arrangement, and... I'm not even watching the sync_file uapi, so if it breaks, whew
08:15HdkR: I should probably one day take a scroll through ioctls that are registering compat handlers
08:22danvet: HdkR, anything new that requires compat handler is kinda a misdesign, we really shouldn't need these anymore
08:23danvet: unfortunately android having been 32bit only for a while butchered a few :-/
08:23danvet: but in drm I don't think we've screwed up in years
08:23HdkR: There's been a decent number of mess ups with tail padding on the structs
08:23HdkR: Recent too
08:23danvet: HdkR, care to supply an ack on-list?
08:24danvet: HdkR, hm for arrays and stuff?
08:24danvet: drm_ioctl() auto-corrects padding both ways, so unless you don't memset the entire struct or something it should work out
08:24HdkR: nah, just ending the struct with a uint32_t, where the tail ends up 32-bit or 64-bit aligned depending on bitness
08:25HdkR: Is the most common thing I've noticed
08:25danvet: yeah I think we screw that up sometimes, but it also shouldn't hurt (unless it's an array)
08:25HdkR: Changes the ioctl number which flags my CI :P
08:26danvet: oh for seccomp filtering and stuff like that?
08:27HdkR: I didn't even think about seccomp filtering. That...might be interesting
08:27danvet: sometimes we need to patch it up like DMA_BUF_SET_NAME_A/B
08:27HdkR: But since FEX is translating 32-bit ioctls to 64-bit ioctls, any change in ioctl size is a flag for bad smell
08:27danvet: but drm_ioctl handling padding either way absorbs these fully
08:28danvet: HdkR, ah yeah if you write your own compat code then this sucks :-(
08:28HdkR: Indeed!
08:29danvet: just wondering whether a makefile check for uapi header structs would be doable
08:29danvet: would need a ton of annotations for all the current ones that are busted though
08:30HdkR: I have a ton of annotations specifically because of this. Mostly focused around drm, and I'm /definitely/ missing some
08:31HdkR: Then a clang script that compares member alignment, size, offset, etc between architectures
08:32danvet: HdkR, maybe we want gitlab first to make sure it's actually used, but this sounds like something that would be really useful in upstream
08:32danvet: maybe opt-in for drm only at first
08:33HdkR: I will love if this could get maintained upstream at some point so I don't need to constantly glare at the ML for uapi changes :P
09:14HdkR: https://github.com/FEX-Emu/FEX/blob/main/Source/Tests/LinuxSyscalls/x32/Ioctl/HelperDefines.h#L13 As a reference, I do some basic annotation of the structs by wrapping the ioctl, which works in tandem with the clang tool to pull out all of the information. It could be cleaner
09:15HdkR: Doing it this way since I don't annotate the struct definitions directly so I can use unmodified headers
09:16HdkR: Also needs to handle opaque ioctl definitions that don't have the struct information declared in the `_IO_RW`, which is really annoying to validate
12:59eric_engestrom: DavidHeidelberg[m]: not sure what you're trying to do with the CI compiler wrapper, but yes, right now it's there to pass `-Werror` because meson ignores `-werror=true` (which IMO is a meson bug), and in my WIP MR I'm also adding `--fatal-warnings` in that wrapper (arguable whether meson should do that, so we might have to keep the wrapper indefinitely if meson doesn't accept that)
13:00eric_engestrom: DavidHeidelberg[m]: but for `clang` vs `clang-15`, you can trivially add a new 2-lines wrapper that includes the version number in the executable name
13:04DavidHeidelberg[m]: My approach was to remove dash and number; ofc I can, I'm just thinking about how complex and ugly it looks :(
13:05DavidHeidelberg[m]: wish meson would fix that behaviour to get along with our needs rather than this hacking around
13:05eric_engestrom:nods
13:06eric_engestrom: the whole "`clang` sometimes has a version number in its executable name" thing is a mess, and the way meson deals with it is another mess :/
13:09MrCooper: eric_engestrom: FWIW, -Wl,--fatal-warnings could be added to c{,pp}_link_args, unless that breaks some meson configure checks as well?
13:10eric_engestrom: MrCooper: you're right, and that's what I was doing at first actually
13:10eric_engestrom: now I'm wondering why I changed this to use the wrapper instead
13:11eric_engestrom: but very good point, so the wrapper is only needed until meson's `--werror=true` enables `-Werror`
13:13MrCooper: right
13:42vandemar: I keep getting segfaults in a game, after anywhere from a few minutes to 1-2 hours, running on a radeon 6700, in mesa 23.0.0 amdgpu_cs_add_buffer:674, more details https://pastebin.com/Zkup7jaV
13:44vandemar: Is there anything else I should investigate, or should I file a bug, or is it a known issue? I didn't see any recent similar looking reports of segfaults, but I didn't look that far back.
13:45pepp: vandemar: please file a bug
13:48eric_engestrom: MrCooper, DavidHeidelberg[m]: MR updated (although it is still blocked by the bug of telling the linker we export symbols that don't exist depending on which driver we build), but it made me wonder: why do we do `-D c_args="$(echo -n $C_ARGS)"` instead of `-D c_args="$C_ARGS"`? I don't think duplicate whitespace is an issue, if that's what this was trying to remove
13:48pepp: vandemar: identifying which call to radeon_add_to_buffer_list() from si_emit_draw_packets led to the crash would help
13:49MrCooper: eric_engestrom: it's to get rid of the newlines
13:50eric_engestrom: if we're using `>` there shouldn't be any newline?
13:50eric_engestrom: in yaml I mean
13:50eric_engestrom: oh, perhaps a trailing newline, maybe we should have `>-`
13:51MrCooper: maybe, can't remember the details
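For reference, the difference between `>` and `>-` in YAML is only the trailing newline (chomping); a hypothetical `.gitlab-ci.yml` fragment:

```yaml
# '>' folds internal newlines into spaces but keeps one trailing
# newline; '>-' additionally strips the trailing newline.
variables:
  C_ARGS_KEEP: >        # value is "-Wall -Werror\n"
    -Wall
    -Werror
  C_ARGS_STRIP: >-      # value is "-Wall -Werror" (no newline)
    -Wall
    -Werror
```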
13:52zmike: mareko: for KHR-GL46.gpu_shader_fp64.fp64.state_query doesn't this mean the TES uniforms are also not referenced since the GS isn't reading gl_Position?
13:55vandemar: pepp: ok. it's line 1523 (23.0.0) in si_emit_draw_packets, in the if(index_size) branch
13:56pepp: vandemar: can you print the indexbuf variable with gdb?
13:57vandemar: no locals? should there be?
14:38gfxstrand: robclark: For your 64+32 loads, is the 32-bit offset signed or unsigned?
14:41robclark: gfxstrand: IIRC signed in most (all?) cases
14:47gfxstrand: robclark: Ok, so at least we have consistency there.
14:47gfxstrand: etnaviv is 32+32 so no signedness issues there.
14:50robclark: I think the harder thing is that there are several different encodings we can pick (ie. one option has a small const offset, but we can optionally left-shift the variable src by a PoT.. or we can choose a different option w/ larger const offset, etc)
14:51gfxstrand: For NV, we can just re-materialize the add if offset > INT32_MAX
14:52gfxstrand: I'm more concerned about hardware that can do non-const offsets.
14:53robclark: hmm, I think we'd prefer the offset to _not_ be 64b (but happy for nir not to lower to this form if offset is too big)
14:53gfxstrand: Because we can't know whether or not the offset is negative then
14:53robclark: in practice I don't think you'll see many >4GB buffers ;-)
14:53gfxstrand: Oh, the plan is for the offset to be 32bit
14:53gfxstrand: always
14:53robclark: +1
14:53gfxstrand: The question is how sign-extension is handled on the 32-bit part
14:54robclark: hmm
14:54robclark: I think I'd have to play with that
14:54gfxstrand: NVIDIA only has const offsets and they're signed. So if it's unsigned, we just check for <= INT32_MAX and re-mat the add if it would be negative.
14:54gfxstrand: For Etnaviv, it's 32+32 so there is no zero- or sign-extension
14:55gfxstrand: It's freedreno where we're going to have a problem if we get it wrong.
14:55gfxstrand: For NIR internals, I think I'd like it to be unsigned because then it matches nir_address_format_64bit_global_32bit_offset.
14:55gfxstrand: But NIR can be changed, hardware can't.
14:56gfxstrand: It also depends a bit on how we want to deal with offsets. Array indices are typically signed so for an OpenCL-like case where you have a pointer that you're treating as an array, you may have a negative offset.
14:59robclark: I'm going to hope/assume that hw handles negative offset in a sane way, but will need to do some experiments
15:04gfxstrand: k
15:30mareko: zmike: yes
15:38alyssa: HdkR: wouldn't it be less work to implement 32-bit libGL/libVK thunking and call it a day?
15:42eric_engestrom: any objection to exporting drmOpenMinor()? it's the only missing piece to implement https://gitlab.freedesktop.org/mesa/drm/-/issues/85#note_1839524
15:43eric_engestrom: (marcan's suggestion of looping over the devices once and picking the driver based on that, instead of going through all the devices for all the drivers until one loads)
15:45robclark: gfxstrand: ok, so the variable src offset is unsigned 32b
15:45gfxstrand: Venemo: What does AMD have for offsets on load/store_global?
15:45gfxstrand: robclark: Okay...
15:45gfxstrand: robclark: But some of the constant ones are signed?
15:46robclark: I _think_ so.. they are also a lot smaller than 32b
15:47robclark: or at least we have signed offsets in some other places
15:50alyssa: gfxstrand: for AGX, 64+32 either signed or unsigned, but the offset is implicitly multiplied by the element size (4 unless you're doing 8/16-bit loads)
15:51alyssa: agx_nir_lower_address sorts out the mess, not planning to use common code because that last offset-multiply thing is off the deep end
15:52Venemo: gfxstrand: I think we have a vector address, scalar address and a constant address, but pendingchaos can correct me if I'm wrong
15:55robclark: gfxstrand: yeah, immed offsets encoded in instruction are signed
15:56robclark: and yeah, offset is in units of element size as well
15:56gfxstrand: robclark: Are your non-constant offsets in element size units?
15:57robclark: yeah
15:57gfxstrand: bugger. Okay.
15:58gfxstrand: So now we have Immaginappledreno?!?
15:58robclark: yeah, I don't think unaligned access is a thing ;-)
15:59gfxstrand: I fear I may be starting to agree with Venemo that we want to just add a signed 32-bit offset const index to load/store_global.
16:00gfxstrand: nir_lower_mem_access_bit_sizes is going to hate me for it.
16:00gfxstrand: Or maybe we just assert offset == 0 in that pass
16:01robclark: hmm, I was kinda expecting that lower-derefs could give us a 32b offset that driver can consume..
16:01gfxstrand: Yeah, there's some debate to be had about where the best place to do that offsetting is
16:02robclark: if you think about array+struct dereferencing you'd design an instruction exactly like what I have ;-)
16:03gfxstrand: Right up until some app does a bit of "creative" casting. :P
16:04robclark: sure there isn't some weasel-out clause somewhere in the spec declaring that negative array indexing is UB?
16:16gfxstrand: Negative array indexing is UB assuming you get the entire array access in a single index.
16:17gfxstrand: However there's OpPtrAsArray which lets you effectively do (&arr[7])[-3] which would be arr[4] which is valid.
16:18robclark: if xyz[-3] is UB, how does it become defined if xyz=&arr[7]?
16:21pendingchaos: AMD global loads on gfx6 and gfx9+ can have a limited constant offset, which can be negative on gfx9+
16:21pendingchaos: for gfx6: it can take an sgpr or vgpr address and an unsigned 32-bit sgpr offset
16:21pendingchaos: for gfx7/8: it can take a vgpr address and offsets have to be lowered as an addition
16:21pendingchaos: for gfx9+: it can take a vgpr address, or a sgpr address and unsigned 32-bit vgpr offset
16:21gfxstrand: robclark: The UB comes in when you actually access outside arr[], not based purely on the index.
16:22gfxstrand: robclark: As long as all your OpPtrAsArray accesses end you back inside the array, you're fine.
16:22alyssa: gfxstrand: wait do we know what imagination does?
16:22gfxstrand: Well, as long as you never end up outside.
16:22gfxstrand: alyssa: No, I'm just making silly jokes. :P
16:22alyssa: the ISA on AGX is all Apple as far as we know
16:23gfxstrand: robclark: So if you have int arr[10] then arr[7] is valid as is (&arr[7])[-3] because that's just arr[4]. I'm not sure about (&arr[12])[-5]. That one might be UB.
16:25robclark: hmm, this is only a problem if part of the offset is folded into the 64b address..
16:26robclark: although maybe we can't 100% avoid that?
16:26gfxstrand: Yeah
16:26gfxstrand: I mean, it's not strictly speaking a problem. We can handle whatever. It just affects design a bit.
16:26alyssa: appledreno
16:28cwabbott: robclark: yeah, presumably the two parts could be separated by other stuff in between
16:28cwabbott: pass it as a function argument, use in a phi, you name it
16:33robclark: I guess even if we limited it to cases where we could tell at compile time that the base address is the actual start of the object, that seems fine.. if you're trying to be a jerk to the compiler I don't really care if you get horrible suboptimal code ;-)
16:37gfxstrand: At this point, I think we want to just add a signed 32-bit offset to load/store_global like Venemo said. I don't like calling it "base" but we can add a new name. Names are cheap.
16:37gfxstrand: italove: ^^
16:38gfxstrand: For the non-constant case, IDK what to do. It's also a far less important case, IMO. If there's a few adds in the shader, oh well.
16:39robclark: non-constant == any array access
16:39robclark: and a few adds means a bunch of instructions after lowering to 32b :-(
16:40gfxstrand: robclark: Yeah, but when it comes to evolving core NIR, I think we start with what we know.
16:41gfxstrand: I'm not going to suggest we take away your ir3 intrinsics :)
16:41gfxstrand: Just make the common load/store more capable.
16:43gfxstrand: The constant offset case is the one lots of hardware can do, where signed vs. unsigned matters less, and where it's easy enough / zero-cost to do `offset / (bit_size/8)` if you need to
16:44Venemo: gfxstrand: thx :)
16:45robclark: hmm, it is easy enough to check if nir src is const.. and if we _eventually_ want to support variable offset, does that mean we end up with both const and non-const offset?
16:46gfxstrand: robclark: No. The load/store_global_offset32 versions that we'd add wouldn't have a const_index offset.
16:46anholt_: gallo: a618 perf traces busted? https://gitlab.freedesktop.org/mesa/mesa/-/jobs/38738095
16:47robclark: hmm, ok
16:48gfxstrand: I wish we could practically just do load/store_global_offset32 and have the driver do nir_src_is_const() on it but there seem to be too many variables right now to figure out the right thing to do globally there.
16:48gfxstrand: signed vs. unsigned, offset units, etc.
16:49robclark: offset units I think you can punt on... a later stage pass would turn it into driver specific intrinsic and add the appropriate PoT shift (and then maybe that shift gets optimized back out again)
16:50robclark: I'd look at it this way, it is a lot easier for backend to do something like that than it is to fish out 32b'ness from 64b math
16:50gfxstrand: Oh, no arguments there.
16:51gfxstrand: As long as you have something to optimize to which you can do before you lower 64b math to 32b, I think that's the important thing.
16:51robclark: there is still the signedness issue.. but at least offset units is easy to push to driver
16:53danvet: gfxstrand, different topic, ack on the syncobj deadline from robclark?
16:53danvet: it looked just like nits you had
16:53danvet: just want to make sure since uapi
16:56gfxstrand: danvet: Yeah, I'm gonna look at that again. Been on Khronos calls and thinking about NIR pointers all morning.
16:57danvet: gfxstrand, pls ping me when you're all happy so I can pull robclark's pr
16:57danvet: or ask for respin :-)
17:22hays: what could it mean if during the boot process, I get no hdmi output until wayland/x11 loads?
17:22hays: apologize if this is the wrong channel
17:34ccr: I have an ASUS motherboard that occasionally starts acting so that there's no HDMI output until Xorg starts, it apparently suddenly decides that it prefers VGA output (board has VGA, DVI and HDMI).
17:40hays: hmm that is an interesting thing to check. there are 2 hdmi outputs on this thing
17:43hays: kernel commandline lets you do video=HDMI-1:D video=HDMI-2:D but unclear semantics of that...
17:43hays: but seems worth trying
17:45ccr: you could also check if UEFI/BIOS has some setting for preferred connector
17:46ccr: (in my case it's a weird bug in UEFI or something)
18:11gfxstrand: robclark: RE deadline defaults: What I wouldn't give for Rust's Option<T> right about now....
18:12robclark: Option<T> is nice for other reasons.. but doesn't really solve the issue that driver doesn't have enough information from app ;-)
18:13robclark: a vk extension would be a good thing.. but I think short term a 90% soln and driconf if needed are workable
18:22gfxstrand: robclark: No, but it solves the "how do I indicate that I don't care?" internal API question :)
18:23danvet: just a real type system would be nice
18:25gfxstrand: Well, yeah.
18:41jenatali: I'm curious if there's Vulkan drivers out there that are working around app bugs in descriptor indexing. Specifically out-of-bounds indexing
18:41jenatali: (Unless that's supposed to be well-defined and I just missed a bit of spec language)
18:43gfxstrand: With robustness enabled, it's supposed to be less buggy
18:50jenatali: gfxstrand: Which robustness?
18:51gfxstrand: For buffers, it's robustBufferAccess, I think. For others, I don't remember the details.
18:51jenatali: I'm not talking out-of-bounds within a buffer, I'm talking out-of-bounds within an array of buffers
18:52gfxstrand: I know
18:52gfxstrand: You're going to make me look this up, aren't you?
18:52jenatali: That's a strong way to phrase it :P
19:00jenatali: gfxstrand: https://vulkan.lunarg.com/doc/view/1.3.204.1/windows/gpu_validation.html - the validation described here would seem to declare it's invalid, and I don't see any caveats around robustness
19:03gfxstrand: jenatali: I'm not seeing it in the spec anywhere
19:03jenatali: Awesome. I'm still debugging DOOM Eternal and it looks like they do it
19:05gfxstrand: That's believable
19:06DavidHeidelberg[m]: jenatali: on which difficulty do you debug? :D
19:07jenatali: David Heidelberg: When debugging an app like that, there's only one difficulty
19:07jenatali: And it hurts
19:07DavidHeidelberg[m]: nightmare :P ?
19:08alyssa: jenatali: the natural descriptor indexing impl on some hw is going to be robust against that automatically so I could easily believe in app bugs
19:08alyssa: robclark: optimizing out the shift after it's added in is tricky
19:08alyssa: because `(a << b) >> b` is not equal to a in case of overflow
19:09gfxstrand: No, but ((a << b) >> b) << b is
19:09jenatali: alyssa: Interesting. I guess I'm confused as to what the behavior would be, if e.g. you index out-of-bounds of samplers
19:09alyssa: jenatali: on valhall, return all-0 texels, iirc
19:10alyssa: but I agree that's goofy
19:10jenatali: Huh, cool
19:10alyssa: I don't think Arm specifically designed that, more of an emergent behaviour
19:10alyssa: the texture instructions check for valid descriptors and return 0's if not present (ostensibly)
19:11alyssa: so I think it Just Works
19:11jenatali: Huh
19:11alyssa: but also I haven't tried and don't particularly care
19:11jenatali: D3D leaves it as undefined behavior if you're using the native indexing functionality, but the stuff I had to do to get VK to work make it so much worse
19:12alyssa: nod
19:13alyssa: we don't need another blog post about descriptor sets
19:13alyssa: i learned that at xdc
19:13alyssa: :p
19:39italove: gfxstrand: ack on adding a const index offset to load/store_global and handling non-const separately, but then wouldn't it make sense to just use `base` so that we can fit more cleanly into `nir_opt_offsets`? It seems other load/store ops already assume `base` is an offset. Or you think that's not a good enough reason?
19:58HdkR: alyssa: 32-bit thunking is a lot of work. Which is why we will have both. Also we still need to support all the....other 32-bit ioctl mess outside of DRM still
20:01alyssa: HdkR: fair enough
20:02alyssa: still seems like it ought to be the easier of the two but I don't know what I don't know
20:02HdkR: :P
20:02alyssa: gfxstrand: italove: Wait, are we touching core load/store_global? nak on that, not thrilled on the idea of auditing every backend
20:02alyssa: or just adding new offset versions?
20:02alyssa: IDk
20:03alyssa: TBH, we need backend specific address arithmetic opt for some backends regardless so I'm pretty meh about the whole thing
20:41gfxstrand: alyssa: The plan was to modify load/store_global but make it so non-zero offsets are optional.
20:42gfxstrand: alyssa: So the "audit" would just be running around to all the back-ends and adding `assert(nir_intrinsic_offset(intrin) == 0)` to their load/store_global handling.
20:42gfxstrand: Should be pretty mechanical and fool-proof.
21:29Kayden: has anyone had much luck running specviewperf2020?
21:30Kayden: something with the nodejs gui stuff seems to be going wrong here and it just never displays anything on the screen at all, either when trying to use the viewset downloader, or just running the benchmarks
21:34Kayden: IIRC strace just showed it sitting around blocked in a read call waiting on some socket forever
21:37alyssa: gfxstrand: meh, fair enough
21:37alyssa: is constant-only offsets really worth plumbing into NIR?
21:38alyssa: i don't care i just don't plan on using it for 3/4 of the ISAs that I work on
21:38alyssa: I'll use it for #4 i guess (Valhall)
21:39gfxstrand: alyssa: I guess it depends on how much nir_ssa_scalar crawling you want back-ends doing
21:39alyssa: shrug
21:40alyssa: I don't see this materially reducing the backend crawls if backends want competent address arithm
21:40alyssa: since most (all?) backends can do something more than just constant only which is the absolute lowest common denominator
21:41gfxstrand: Intel can't do anything. Nvidia is const-only
21:41alyssa: OK
21:41alyssa: so.. it sounds like NV is the only backend that actually is simplified by this.
21:41alyssa: since every other backend either can't do it at all or can do something more complicated
21:41alyssa: and in either case won't reduce backend crawling with a "common" crawl
21:42gfxstrand: The perfect is killing the good right now...
21:42alyssa: is it?
21:42alyssa: I'm just not sold on what the good is
21:43alyssa: We *already* have backend crawling for most of the backends that care for it, for that matter.
21:43gfxstrand: You really don't want back-end crawling if your backend wants lower_int64
21:43gfxstrand: crawling post-int64-lowering is a giant pain
21:43alyssa: yes, this is true.
21:43gfxstrand: So we need something that can get lowered at least a little earlier in the pipeline.
21:43alyssa: and the reason I haven't bothered on mali, but that's more because I keep procrastinating on wiring up the real int64 that does exist lol
21:44gfxstrand: If you haven't bothered on mali, then we don't "already" have it everywhere.
21:44alyssa: I mean
21:44alyssa: I don't plan to wire up the new thing on mali either ;-)
21:44alyssa: mostly because even doing constant-only is a huge pain on bifrost
21:44gfxstrand: That's fine. That's your choice.
21:45gfxstrand: It's optional. You can just not use it on bifrost.
21:45alyssa: For context, AGX uses lower_int64 but I call agx_nir_lower_address immediately before nir_lower_int64 iirc
21:45gfxstrand: I'm just saying there's a giant difference between "Thing X already exists everywhere, you don't need to bother" and "Think X *could* exist everywhere but I haven't bothered because it's a pain."
21:45alyssa: and agx_nir_lower_address lowers to load/store_agx intrinsics which map directly to the hw
21:45gfxstrand: The later case is exactly when we want common things.
21:45alyssa: Fair enough
21:46gfxstrand: Maybe the answer is everyone has their own load/store_hw intrinsics and a lowering pass.
21:46alyssa: gfxstrand: My opinion -- take it or leave it -- is add load/store_global_offset intrinsics (I think we already have the load) and add a NIR pass to do the crawl, it can live in src/compiler/nir/ if there's a second user of it
21:46gfxstrand: It seems like we can do better than that.
21:47gfxstrand: alyssa: Yeah, well that's the approach Venemo is NAKing.
21:47alyssa: NAK and maybe Mali can use that (and call the common crawl right before lower_int64)
21:47alyssa: ugh
21:47gfxstrand: And where I started
21:47alyssa: Venemo: what's your nak?
21:47gfxstrand: That's why the perfect hates the good here. :P
21:47alyssa: load_global_nak
21:47alyssa: then you can't nak it
21:47alyssa: or maybe you have to nak it
21:47alyssa: shit
21:47Venemo: I think load_global should be consistent with load_everything_else and have a const offset
21:48gfxstrand: Venemo: Except that's not everything else
21:48gfxstrand: That's everything else that doesn't take explicit addresses.
21:48Venemo: eg. load_shared
21:48gfxstrand: Wait, when did load_shared grow an extra offset?!?
21:49alyssa: uhh
21:49alyssa: did I miss that too?
21:49gfxstrand: Maybe it always had one? Could be an artifact of history, that one.
21:49alyssa: I did not know it had one?!
21:49alyssa: is this broken in my backends?
21:49Venemo: load_shared has a base offset
21:50Venemo: so it makes sense to add it to load_global as well
21:50Venemo: that said, there is no hard NAK from me, it's just that I don't see why we should keep the intrinsic without the offset when we add the new one with the offset
21:50gfxstrand: This goes back a long ways...
21:51gfxstrand: Like pre-2018
21:51Venemo: if you add a new one with a new offset, who would actually want to use the old one?
21:51alyssa: me
21:51Venemo: y tho?
21:51alyssa: because I'm doing all my own crawling anyway and an extra offset means one more thing to worry about
21:52zmike: Kayden: link? I can try it tomorrow and let you know
21:52Venemo: you can simply emit an add, or you can elect not to call the pass that collects the const offset, alyssa
21:52alyssa: and the "well it'll just always be zero just trust us" is a recipe for brokenness in the future when some clever nir pass comes along
21:53alyssa: and also I need to get dinner and go to class and don't care about this bikeshed, do whatever you want, i'm out, peace!
21:53gfxstrand: So, RE load/store_shared. The offset appears to go all the way back but it hasn't been used by most backends since we switched to using nir_lower_explicit_io for shared.
21:54gfxstrand: The base+offset stuff is typically associated with internal load/store ops, not explicit.
21:55gfxstrand: It looks like Intel does support it though I doubt it ever sees a non-zero base these days.
21:55gfxstrand: Intel HW doesn't have it, to be clear. The back-end emits an add.
21:58Venemo: intel implemented it once I reminded them it's missing
21:58gfxstrand: Oh
21:58Venemo: it was last year when we started using it in the task shader lowering
21:59gfxstrand: ah
21:59Venemo: there was a bug due to them not supporting it
22:00Venemo: anyway, I am sorry that this discussion offended alyssa, but I think it is good to have consistency between these intrinsics
22:00Venemo: it is somewhat a mess though
22:00gfxstrand: I'm all for consistency, I'm just not sure which way consistency favors
22:00gfxstrand: From the Intel PoV, consistency favors ripping out BASE everywhere.
22:01Venemo: we added load_global_amd so we can have the offsets we need, so dunno
22:01gfxstrand: Well, that's the problem. :) The hardware is inconsistent.
22:02Venemo: if you say we should remove the offsets from the ones that have it and we all should use backend specific ones instead, that is also fine by me
22:02gfxstrand: I'm not necessarily saying we should do that.
22:03gfxstrand: I'm not sure what the best thing to do is.
22:03gfxstrand: I went into today with a pretty clear plan and it's been shot to hell. IDK where that leaves us.
22:03Venemo: I'm sorry
22:03gfxstrand: It's okay, it's how this process works.
22:04gfxstrand: It's good to uncover all the inconsistencies
22:04jenatali: Personally, I like having 2 intrinsics, one with a base that must be respected, and one that can never have a source. Having a nir_option for which one you prefer and having some common lowering pass that removes offsets when unwanted across all ops seems right to me
22:04jenatali: That way we don't have "assert(offset == 0)" in backends that suddenly blows up because some pass started using offsets
22:04gfxstrand: Well, we actually have three cases:
22:05gfxstrand: 1) Just an address
22:05gfxstrand: 2) Address + imm (where imm has some bit size and signedness)
22:05gfxstrand: 3) Address + offset
22:05jenatali: True
22:06Venemo: maybe we should have some kind of abstract NIR address type, which could be configured by the backend what it means exactly. and then we could all use the same intrinsic
22:06jenatali: That already exists, and is consumed by (e.g.) lower_explicit_io to select the right intrinsics for the backend
22:07gfxstrand: Intel wants 1, Nvidia wants 2, and Etnaviv, Mali, and Freedreno all want 2+3, with some hand-waved details about how big the immediates can be and the stride of the offset.
22:07jenatali: Derefs are already the abstract address type
22:07Venemo: I'm not talking about derefs
22:07gfxstrand: Actually, I think Etnaviv just wants 3
22:08gfxstrand: Venemo: What you're suggesting is pretty much nir_address_mode, just way more of them than we already have.
22:08Venemo: maybe the load could take an arbitrary number of sources, but the meaning of those could be configured in nir_compiler_options
22:09gfxstrand: That's pretty much nir_address_mode
22:09Venemo: I don't see how that helps here
22:10gfxstrand: It doesn't, not without way more address modes
22:10Venemo: it could even part of the intrinsic itself
22:11gfxstrand: But how would you describe what they mean?
22:11gfxstrand: If it's all arbitrary and there isn't a machine-understandable meaning, we're back to everyeone having custom back-end intrinsics.
22:12Venemo: struct { arbitrary_offset_sources : 8; scalar_offset_sources : 8; const_offset_sources: 8 } each field a bitmask that describes how to interpret each source
22:13Venemo: and you can have up to 8 sources in total
22:13gfxstrand: Writing correct generic code is going to be near impossible.
22:13gfxstrand: We can barely do it now and that's with a way simpler system.
22:14gfxstrand: That also doesn't take into account things like strides etc.
22:14Venemo: well, then I guess we'll just have to keep using the backend specific intrinsics we already have
22:14Venemo: I'm just throwing around random ideas, of course
22:15Venemo: not necessarily saying this is any good
22:15gfxstrand: How different are the rules on AMD for scalar vs. not?
22:16Venemo: what kind of rules do you mean?
22:16Venemo: scalar means it must be an SGPR
22:17Venemo: on the NIR level it means divergence analysis must be able to prove that it's uniform
22:17gfxstrand: Yeah, I know that
22:18gfxstrand: But do you have different forms for SGPR vs. VGPR. Like VGPR can only do addr+imm but SGPR can do addr+offset or something like that.
22:18gfxstrand: Or maybe just VGPR+SGPR
22:18gfxstrand: I'm wondering how complicated the rules are and if there's any hope of getting AMD to use common code.
22:18Venemo: most (not all) loads have all three kinds of offsets
22:19gfxstrand: all supporting 64-bit?
22:21pendingchaos: https://pastebin.com/raw/w220DB73
22:22Venemo: shared only has vgpr and imm, buffer loads have base sgpr pair, vgpr index, vgpr offset, sgpr offset and imm, scalar loads have a base sgpr pair, sgpr offset and imm, global is a bit weird I think it only has an sgpr pair base and a vgpr offset and imm
22:22Venemo: thanks pendingchaos that is much more accurate for global loads than what I said
22:23gfxstrand: Ok, that's enough for me to know that a common NIR thing isn't going to model everything you want for AMD.
22:24gfxstrand: So the question is if we can do something more useful than what we have today but that's also useful for other people and not any worse for anybody.
22:33Venemo: yea
22:33gfxstrand: But, IDK, everyone doing their own thing may not be totally horrible. :shrug:
22:40Venemo: gfxstrand: in that case though, there is no need for load_global_offset either. nobody is going to want to downgrade to this when we already have a backend specific one that more accurately represents the HW
22:43gfxstrand: Yeah, if everyone's doing their own thing, no bases, just addresses.
22:54Venemo: so, shall we close the MR?
22:54gfxstrand: Let me think some more before we do that.
22:54Venemo: sure thing
23:31Venemo: cmarcelo, mslusarz: when is it going to be time to remove NV_mesh_shader from ANV?
23:32Venemo: gfxstrand: are you going to want to have NV_mesh_shader in NVK? :O
23:35gfxstrand: Venemo: No
23:35Venemo: hooray :)
23:35Venemo: that means we can remove that monstrosity from mesa without looking back
23:36cmarcelo: Venemo: I _think_ it should be fine by us, but mslusarz is more on top of mesh things, so let's wait for him to chime in :)
23:37Venemo: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22139 <- radv
23:56Venemo: gfxstrand: I am looking to change the divergence analysis a bit, so it can distinguish between workgroup-uniform and wave-uniform. so I'd like to change nir_src_is_divergent and nir_dest_is_divergent to also take a scope, what do you think about that?
23:59cmarcelo: is freedreno CI having trouble with S3 storage? e.g. https://gitlab.freedesktop.org/mesa/mesa/-/jobs/38857903