05:35DemiMarie: Is it possible to identify who is to blame for a GPU hang? In native contexts it would be useful to be able to determine which VM is at fault and ban it from using the GPU until the user says otherwise.
09:01tzimmermann: jani, thanks for reviewing my fbdev header cleanup. may i ask you to give an ack to the additional patch in v2: https://patchwork.freedesktop.org/patch/578025/?series=129765&rev=2
09:02jani: tzimmermann: ack; already looked at it but got distracted and forgot to reply
09:02tzimmermann: thanks
11:17samuelig: cmarcelo, go ahead
11:19samuelig: cmarcelo, would you like me or somebody else to review them?
12:31karolherbst: DemiMarie: it might potentially be possible for some drivers and some faults
12:33karolherbst: mhh maybe "some drivers" is too optimistic, let's say "some GPUs"
14:08zmike: alyssa kusma: I assume at least one of you will be on call today for https://gitlab.freedesktop.org/mesa/mesa/-/issues/10550 ?
14:09zmike: / https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27521
14:14alyssa: zmike: wasn't planning on it but I can be if you want
14:15zmike: I figured since it was so contentious that at least someone would show up, but maybe I didn't skim hard enough to see through the argumentation
14:25alyssa: I just don't think we should be breaking piles of apps
15:13mareko: is the mesa-commit list discontinued?
15:15mareko: zmike: I think I can file cts tickets
15:15mareko: just haven't done it
15:15daniels: mareko: yep
15:24mareko: daniels: somebody asked me about it, I didn't even know that it existed :)
15:27DemiMarie: karolherbst: which GPUs?
15:28karolherbst: unknown
15:29DemiMarie: Any guesses?
15:29karolherbst: nope
15:29DemiMarie: wishes GPUs had full preemption
15:29karolherbst: that's not the problem :)
15:29karolherbst: even if they had, how would the kernel know what crashed the GPU if the GPU suddenly gets into a weird state and returns nonsense?
15:30karolherbst: or the firmware returning nonsense
15:30karolherbst: or not responding
15:30DemiMarie: The GPU should never be able to be put in such a state unless the kernel driver is buggy.
15:30karolherbst: well..
15:31karolherbst: it's all software in the end
15:31karolherbst: and software has bugs
15:31karolherbst: GPU not getting into a weird state is like saying "computers shouldn't be able to get into a weird state"
15:32DemiMarie: karolherbst: suppose we ignore KMD and FW bugs for now
15:32karolherbst: still
15:33karolherbst: you ask for bugfree computers
15:33DemiMarie: No
15:33karolherbst: a GPU in itself can be considered a full computer.. maybe not a personal one, but definitely on the embedded level
15:33DemiMarie: Or my question is bad
15:35DemiMarie: So on the CPU, if there is a fault (such as accessing junk memory), the hardware gets control and tells the kernel enough information for the kernel to know what the fault was and what process did it.
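A tiny illustration of the CPU-side attribution being contrasted here, assuming Linux/POSIX signal delivery (demo only, not production signal-handler hygiene): the MMU traps into the kernel, which knows exactly which task faulted and at what address, and can report both.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* The kernel delivers the exact faulting address via siginfo; it also
 * knows which process/thread faulted, which is the attribution that is
 * hard to get out of a wedged GPU. */
static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    fprintf(stderr, "fault at address %p\n", si->si_addr);
    _exit(1);
}

int main(void)
{
    struct sigaction sa = { .sa_sigaction = on_segv, .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);
    *(volatile int *)0 = 0;   /* deliberately fault to demonstrate */
    return 0;
}
```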
15:35karolherbst: we all have to accept that an embedded system can and will crash in a way that you can only power cycle it to recover
15:35karolherbst: you can't compare GPUs to CPUs
15:35DemiMarie: Why?
15:35karolherbst: because GPUs are way more complex
15:35DemiMarie: Why?
15:36karolherbst: because GPUs are more like embedded devices
15:36DemiMarie: Wh
15:36DemiMarie: y?
15:36karolherbst: I got tired :) have fun
15:36sima: gpu is essentially a distributed network, there's a pile of things that send messages around, and eventually they reach a network node that talks to the memory subsystem
15:36sima: that's the point you get the hw fault
15:37DemiMarie: Am I asking questions that only the HW vendor can answer?
15:37sima: and the kernel pretty much has to be able to preempt, or things will go sideways due to priority/locking inversion issues
15:37sima: so unlike a cpu, where you can preempt a single node, for a gpu you have to preempt that entire distributed network
15:37sima: including all the in-flight messages
15:37sima: DemiMarie, ^^ and save/restore of what essentially is a cluster is just too hard
15:37mattst88: I think since GPUs are expected to be programmed by drivers, their designers can get away with not ensuring that they're impossible to wedge (unlike a CPU, where that would be basically unforgivable)
15:37DemiMarie: sima: I thought CPUs are also distributed internally. Do they just put more work into hiding it?
15:38sima: DemiMarie, they're a lot less distributed, and they have enormous amounts of magic to hide their distributed nature from the application
15:38sima: because the ISA isn't distributed
15:38sima: so you can reuse that for preempting the entire thing
15:38sima: with gpu you actually send around these messages explicitly in shaders
15:38sima: (well most of them at least)
15:43DemiMarie: mattst88: hopefully it will be less forgivable over time, given the rising use of GPUs in secure situations.
15:45DemiMarie: sima: Ah, so that is why one needs memory barriers in so many situations where a CPU would never need them.
15:45sima: yeah that's one aspect
15:46DemiMarie: How does preemption work for compute then?
15:46sima: but the overall one that makes preempt/page fault so hard really is that it's a network of nodes having a big chat, and memory i/o is just one of them - kinda like you have storage servers in a cluster
15:46cmarcelo: samuelig: thanks. just this ACK here is good for me
15:46sima: DemiMarie, badly :-)
15:47sima: no idea how it works on others, but on intel, where it does work, preempt essentially sends an interrupt to all the compute cores
15:47sima: and they run a special shader which essentially stores the entire thread state into memory somewhere
15:47sima: but that means this special preempt shader and your compute kernel need to cooperate
15:48sima: and preemption is kinda voluntary
15:48sima: and it only works for nodes which support it, and because things get messy for 3d fixed function they just dont
15:48DemiMarie: sima: why do they need to cooperate, beyond the need for a memory buffer to save the state?
15:49sima: the hw cannot actually save/restore itself, it's kinda like the kernel asking userspace to please store all its state, so that it can nuke the process
15:49sima: and then on restart it asks userspace again to recreate everything
15:50sima: it's a gigantic mess afaiui
15:50DemiMarie: Could the KMD provide the preempt shader?
15:50sima: afaiui it's tied to how you compile the compute shader
15:50DemiMarie: Or just say, “you have X amount of time before I blow you away?”
15:50sima: like if you yolo too much in-flight stuff you can't restore that on the other side
15:50sima: yeah it's a timeout
15:51sima: but the timeout is huge because register state is like a few mb
15:51DemiMarie: What is the timeout?
15:51sima: more than a frame iirc
15:51DemiMarie: Ugh
15:51sima: so forget anything remotely realtime
15:51DemiMarie: Hopefully future GPUs will support full preemption of everything.
15:51sima: it's getting less
15:52DemiMarie: That’s good
15:52sima: like both intel and amd are switching to the model where any pending page fault prevents preemption
15:52DemiMarie: What do you mean?
15:52sima: because the thing that hits the page fault is a few network hops away from the compute cores that can save/restore
15:52sima: so you cannot preempt while any fault is pending
15:52DemiMarie: So that just means no page faults allowed.
15:52DemiMarie: Pin all memory.
15:52sima: nah it just means kmd engineers are crying a lot
15:53DemiMarie: Why?
15:53DemiMarie: Because pre-pinned memory management is terrible?
15:53sima: yeah
15:53sima: so you get to pick essentially between "you get page faults, no pinning yay" and "you get preempt, no gpu hogging, yay"
15:54sima: but not at the same time
15:54sima: which is a full on inversion of the linux kernel virtual memory handling
15:54DemiMarie: For Qubes OS we will pick the latter every time
15:54DemiMarie: So make guests pin all their buffers.
15:55sima: so you don't necessarily need to pin everything, you /just/ need to guarantee there's enough memory around to fix up any faults while you try to preempt
15:55DemiMarie: Lots of room for bugs 🙂.
15:55DemiMarie: In the Qubes case, the memory will be granted by a Xen guest, so it is already pinned.
15:55sima: yeah it's one of these "great for tech demo, so much pain to ship" things
15:56DemiMarie: sima: a sysctl or module opt to just force pinning would be nice
15:56sima: plan for linux is that you get cgroups to make sure there's not memory thrashing that hurts
15:56DemiMarie: That way those who do not need page faults can avoid the attack surface.
15:56sima: but we had that for like a decade plus as a plan by now :-/
15:57DemiMarie: sima: why is pre-pinned memory management so bad?
15:57sima: people love memory overcommit
15:58sima: with pinning everything you need enough memory for everything for its worst case
15:58DemiMarie: Make that userspace’s job
15:58DemiMarie: It can pin and unpin buffers at will
15:58sima: which is a lot more than just enough for the average case and let the kernel balance usage
15:59sima: but in the end, if you want real-time then "pin it all" really is the only option
15:59sima: irrespective of gpu or not
16:00DemiMarie: Hence why PipeWire has an option for mlockall()
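For reference, the mlockall() option mentioned here is a standard POSIX call; a minimal sketch, assuming CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK:

```c
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    /* Pin every current and future page of the process so the real-time
     * path never takes a page fault. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");
    /* ... real-time work ... */
    return 0;
}
```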
16:01DemiMarie: sima: is the kernel command-line option to disable overcommit a reasonable idea?
16:01DemiMarie: Or could it be emulated in the native context code?
16:06DemiMarie: This also explains Apple’s decision to make AR fully declarative: it means they can guarantee real-time behavior, because there are no app-provided shaders in the code that updates what the user sees in response to the real-world changing.
16:10DemiMarie: sima: thanks for taking some time to explain all of this!!! It means a lot to me.
17:21abhinav__: vsyrjala jani gentle reminder on https://patchwork.kernel.org/project/dri-devel/patch/20240212173355.1857757-1-quic_abhinavk@quicinc.com/ pls
17:23DemiMarie: So for my use-case, I care less about ensuring that the GPU doesn’t crash (so long as the crash is non-exploitable) than about ensuring that there is someone to blame.
17:24DemiMarie: I want the userspace VMM to be able to distinguish between `GUILTY_CONTEXT` and `INNOCENT_CONTEXT`.
17:38alyssa: karolherbst: dj-death: forgot about the "designated initializers cause spilling because nir_opt_memcpy chokes" issue
17:38alyssa: gfxstrand: karolherbst and I talked about it back in October or so, and then I think we forgot, or at least I did
17:38karolherbst: same
17:39karolherbst: but did faith's MR help with that? I think there was more to do, like... something with opt_memcpy or something
17:40alyssa: I don't think it did but I might not have tested
17:40karolherbst: it also kinda depends when you run it
17:41karolherbst: ohh right.. there was `copy_deref` doing something similar
17:41karolherbst: and memcpy lowering could translate memcpy_deref to copy_deref or something
17:42karolherbst: in any case...
17:42jenatali: Right, you have to lower vars to explicit types, and then opt_memcpy should be able to turn it into a copy
17:42karolherbst: if the copy between derefs is huge, you can end up with tons of live values
17:42karolherbst: mhh yeah.. I might have to check again as I did reorder things again
17:43alyssa: iirc there were a bunch of related memcpy issues I hit
17:43jenatali: (But then you want to erase scratch size and lower vars to explicit types again to recompute how much scratch is actually needed after that optimization...)
17:43karolherbst: but yeah.. we need to lower those copies to loops, not unroll them directly
17:47DemiMarie: karolherbst: sorry for tiring you with the repeated “why”.
17:48karolherbst: don't worry, I was also literally tired anyway and had to finish other things
18:20DemiMarie: It now makes more sense why graphics is not preemptable: It must be so fast (to avoid user complaints) that it is faster to simply wipe out the state and force applications to recompute it.
18:21DemiMarie: Reset can be done by a broadcast signal in hardware that forces everything to a known state, irrespective of what state it had been in. This is much simpler and cheaper (both in time and in transistors) than trying to save that state for later restoration.
18:23zmike: mareko pepp: it looks like radeonsi doesn't do any kind of checking with sample counts for e.g., rgb9e5? so si_is_format_supported will return true for sample_count==8
18:24zmike: is this somehow intended? I'm skeptical that you really support this
18:25mareko: zmike: it's always supported by shader images, it's supported by render targets only if the format is supported
18:26zmike: you support multisampled rgb9e5 in shader images?
18:26mareko: yes
18:26zmike: huh
18:26zmike: and textures?
18:26mareko: texelFetch only
18:27zmike: hm
18:27mareko: it's the same for all MSAA
18:28mareko: it's just 32bpp memory layout with a 32bpp format, then the texture and render hw just needs to handle the format conversion to/from FP32 and FP16
18:29sima: DemiMarie, afaik that's also what the big compute cluster people do, with checkpoints to permanent storage thrown in often enough that you don't have to throw away too much computation time when something, anything really, goes wrong
18:29sima: so also reset and recover from known-good application state, including anything gpus do
18:32DemiMarie: sima: that indeed makes sense. IIUC it is expected to have to change at some point, because failures will be too frequent, but I’m not sure if that point has been reached yet.
18:32DemiMarie: sima: interestingly AGX can do resets so quickly that one can reset every frame and still have a usable desktop.
18:33DemiMarie: That’s what Asahi did before Lina figured out TLB flushing.
18:33sima: yeah reset tends to be fairly quick, the slow part is waiting long enough to not piss off users too much about the fallout
18:34DemiMarie: Waiting for what?
18:34DemiMarie: Also, is there any information available as to what caused the reset?
18:35DemiMarie: I want to throw up a dialog to the user saying, “VM X crashed the GPU.”
18:35sima: arb_robustness tells you why you died (i.e. guilty or innocent collateral damage)
18:35sima: but it's very driver specific, and you need to be the userspace that created the gpu ctx
18:36sima: plus I think aside from amdgpu and intel no one even implements that, you just get a "you died"
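A minimal sketch of the GL_ARB_robustness query sima describes, assuming a context created with robust access and the entry point resolved through the usual GL loader; as noted above, many drivers only ever report the unknown case:

```c
#include <GL/gl.h>
#include <GL/glext.h>

/* Poll the reset status after work appears stuck or a swap fails. */
static const char *describe_reset(void)
{
    switch (glGetGraphicsResetStatusARB()) {
    case GL_NO_ERROR:                   return "context is fine";
    case GL_GUILTY_CONTEXT_RESET_ARB:   return "our own work caused the reset";
    case GL_INNOCENT_CONTEXT_RESET_ARB: return "collateral damage from another context";
    case GL_UNKNOWN_CONTEXT_RESET_ARB:
    default:                            return "reset, cause unknown";
    }
}
```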
18:36DemiMarie: Does Vulkan have something similar, and is this information something that the native context implementation could collect?
18:36sima: if that
18:37DemiMarie: sima: thankfully AMDGPU and Intel are the ones I care about the most by far
18:38DemiMarie: sima: is this a hardware or driver limitation?
18:41sima: https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_device_fault.html only thing I've found with a bit of googling, but that doesn't give you info (at least at a glance) about issues caused by someone else
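On the Vulkan side, a hedged sketch of querying VK_EXT_device_fault after a VK_ERROR_DEVICE_LOST; as the spec link suggests, it describes the local device loss rather than naming another context as the culprit (assumes the deviceFault feature was enabled at device creation):

```c
#include <stdio.h>
#include <vulkan/vulkan.h>

/* `device` is the VkDevice that just reported VK_ERROR_DEVICE_LOST. */
static void dump_device_fault(VkDevice device)
{
    VkDeviceFaultCountsEXT counts = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_FAULT_COUNTS_EXT,
    };
    VkDeviceFaultInfoEXT info = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_FAULT_INFO_EXT,
    };

    /* Two-call idiom: first the counts, then the data.  Leaving the
     * pointer members of `info` NULL skips the per-address and vendor
     * records and only fills the human-readable description. */
    vkGetDeviceFaultInfoEXT(device, &counts, NULL);
    vkGetDeviceFaultInfoEXT(device, &counts, &info);
    printf("device fault: %s\n", info.description);
}
```

Note that vkGetDeviceFaultInfoEXT is an extension entry point; depending on the loader it may need to be fetched with vkGetDeviceProcAddr rather than called directly.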
18:41sima: DemiMarie, driver limitation generally, it's a lot of tricky corner cases
18:42sima: also if you're asking about why there's innocent ctx getting nuked, that's usually a mix of hw and driver limitations
18:42dj-death: jenatali: have you been doing much with LLVM 17?
18:42jenatali: dj-death: Nothing at all
18:43dj-death: jenatali: just hitting some interesting casts from the translation
18:43dj-death: jenatali: whereas the LLVM 16 was doing more deref_struct
18:43jenatali: At this point my goal is to stay on LLVM 15 as long as possible and let you all shake out all the problems with newer LLVM versions :)
18:43dj-death: jenatali: that's preventing a bunch of optimizations
18:43dj-death: ahaha :)
18:43dj-death: nice
18:44dj-death: I guess we could put an upper bound on the LLVM version
18:44jenatali: One of the nice benefits of being on Windows and being forced to statically link LLVM is I get to control when we update
18:44jenatali: And it takes hours to build so I'm not inclined to do it often, especially if it risks issues like this
18:44dj-death: yeah
18:44dj-death: I love linux distros
18:45Calandracas: packaging llvm is pain
18:46alyssa: jenatali: :clown:
18:46Calandracas: especially when some things like zig need ==15, while chromium>=17 and rust>=17
18:48Calandracas: I wish there was a canonical way to support parallel installations
19:14karolherbst: I mean.. it kinda works, you are just (sometimes) screwed if multiple versions end up in the same program
19:16zmike: jenatali: good news, I think I'm accidentally fixing all those xfails I added a couple weeks ago
19:17jenatali: \o/
19:17jenatali: I'll be happy to review once you've got something for me to look at
19:19zmike: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27621
19:19zmike: I tried adding better samplecount checking to zink and ended up failing the same tests as you
19:20jenatali: Hah, awesome
19:20zmike: the classic https://www.youtube.com/watch?v=AbSehcT19u0
19:21karolherbst: opaque pointers were a mistake, don't @ me
19:22HdkR: void* is as opaque as we need to be
19:43airlied: dj-death, karolherbst : I assume the changes in derefs is just opaque ptrs getting us different patterns?
19:43karolherbst: yes
19:43karolherbst: more or less
19:43karolherbst: and LLVM doing unholy optimizations due to that
19:44karolherbst: I really hope the SPIR-V backend doesn't give us the same issues...
19:44karolherbst: well.. with llvm-19 that is
19:45airlied: well it's just different patterns, apps could give them to us, we just have to keep up
19:45karolherbst: nah, in this case it's something LLVM does
19:45karolherbst: for apps that would be super unholy to do as well
19:46karolherbst: like.. if your first member of a struct is an int, nobody does this: "(int *)some_struct" instead of just "&some_struct->first_field"
19:46karolherbst: but that's what LLVM-17 is now giving us
19:46karolherbst: so yeah. in _theory_ apps could, but none actually would
19:47airlied: I admire your confidence
19:47karolherbst: I know I'm wrong, but I still want to believe
19:48karolherbst: anyway.. I think I fixed it with rusticl and maybe the same fix helps intel
19:48alyssa: i feel like i've done that in user code
19:48karolherbst: I _think_ I wrote this to fix it in rusticl: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27068/diffs?commit_id=f8966598940ad46fb1ff2cbd9c23013289ef0736
19:48karolherbst: but not 100%
19:48karolherbst: it was to fix some llvm-17 issue though
19:56DemiMarie: Calandracas: statically link LLVM in all of its callers and ensure its symbols are never exported?
20:02Calandracas: that isn't really ideal. What we ended up doing is splitting all of the shared libs into their own packages for each version (libclang15 package provides libclang.so.15, libclang17 package provides libclang.so.17) etc.
20:02karolherbst: Calandracas: have you tried to see if it also works if all of them get loaded into the same process?
20:03karolherbst: because apparently that doesn't work in all distributions
20:03karolherbst: and it's actually a cursed use case
20:03karolherbst: but a real one
20:04Calandracas: The ugly part is needing to have a full toolchain for each llvm version. What alpine does is install each version in /usr/lib/llvm$VERSION/
20:04karolherbst: oh yeah, that as well
20:04Calandracas: Fedora installs the latest release to /usr but "compat" packages go to /usr/lib/llvm$VERSION
20:04karolherbst: and making sure that things like "CLANG_RESOURCE_DIR" stay consistent with the installation directory relative to the so file
20:05Calandracas: it then turns into a mess of symlinking things from /usr/lib/llvm$VERSION to /usr
20:05karolherbst: though I only got reports of fedora messing that up and only for some packages? kinda works if you compile mesa from git
20:05karolherbst: yeah....
20:05tnt: karolherbst: and then there are some applications doing stuff with RTLD_DEEPBIND that don't help either.
20:06karolherbst: Calandracas: in mesa we need to know where the resource path of clang is.. I think we finally have something that works with that MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25568
20:07karolherbst: the "realpath" part was needed for debian having weird symlinks :D
20:07Calandracas: for context, I'm talking about void linux, which just merged llvm17 last week
20:07airlied: in theory if you build all the llvm versions with proper sym versioning it should work, in practice it's all screwed
20:07karolherbst: yeah...
20:08karolherbst: though at least on fedora I had three CL impls using a different llvm version (15/16/17) and it was fine....
20:08karolherbst: which surprised me tbh
20:09Calandracas: well it's fine when llvm is only used as a library, and applications can link whatever soname they want
20:09Calandracas: but having conflicting versions of llvm-config, cmake files, etc. is what causes issues
20:10karolherbst: yeah.... that's also painful
20:11Calandracas: + musl, armv6l, and armv7l patches
20:11Calandracas: but thats a different issue altogether
20:15dj-death: airlied: yes
20:17dj-death: karolherbst: I guess I could try to reproduce the order in which you run NIR passes
20:17dj-death: karolherbst: hopefully that works
20:17dj-death: I have doubts
20:18karolherbst: the idea behind my changes was that I need to run nir_lower_memcpy after explicit_types and some deref opt loop
20:19karolherbst: and the deref opt does more opts if explicit type information exists
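A rough sketch of that ordering in NIR terms; the pass names are real NIR passes, but the loop and its placement are paraphrased from this discussion rather than copied from rusticl:

```c
bool progress;

/* Give function_temp derefs explicit type/stride information first, so
 * the deref optimizations below can reason about struct members. */
NIR_PASS(_, nir, nir_lower_vars_to_explicit_types,
         nir_var_function_temp, glsl_get_natural_size_align_bytes);

do {
    progress = false;
    NIR_PASS(progress, nir, nir_opt_deref);
    NIR_PASS(progress, nir, nir_opt_memcpy);
    NIR_PASS(progress, nir, nir_copy_prop);
    NIR_PASS(progress, nir, nir_opt_dce);
} while (progress);

/* Only lower the surviving memcpys afterwards, rather than unrolling
 * large struct copies early and creating piles of live values. */
NIR_PASS(_, nir, nir_lower_memcpy);
```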
20:52dj-death: karolherbst: I kind of end up with the same code if I pass the arguments by value instead of by pointer :)
20:52dj-death: stuff spills everywhere
21:13DemiMarie: airlied: considering that proprietary games might not use symbol versioning I am not surprised that there are problems there.
21:14jenatali: It's not really about symbol versioning, and if it was, it'd be about LLVM using symbol versioning. The problem is the lack of symbol namespacing
21:14karolherbst: dj-death: yeah.. that's LLVM being LLVM
21:15jenatali: You'd want component A in a process to use LLVM A, and component B to use LLVM B, but since the symbols are global, you can get the components using the wrong version
21:17karolherbst: it's kinda sad that LLVM IR stopped being useful for a middle-end IR :'(
21:17DemiMarie: jenatali: I thought symbol versioning solved that by including the versions (which are different) in the symbol that the dynamic linker looked up.
21:17karolherbst: now it's just a backend one
21:17jenatali: Demi: "Solved"
21:17jenatali: Still requires LLVM to put versions on their symbols
21:17DemiMarie: and they don’t?
21:17jenatali: Not AFAIK
21:17DemiMarie: could this be forced during compilation?
21:18DemiMarie: but yeah, until that is dealt with static linking seems like the safest option
21:19jenatali: 🤷♂️ My knowledge in this space is really just based on how things are different from Windows
21:19dj-death: karolherbst: so I suppose if I try to compile the same code with rusticl, I'll end up with spills every where
21:19DemiMarie: and any distro that doesn’t like it can deal with the bug reports
21:20vsyrjala: llvm even leaks its compile time options into the abi. i once tried to turn off the nvptx (or whatever) thing in my system llvm. everything linked against llvm stopped working because of some missing global symbol
21:21karolherbst: dj-death: maybe? Wouldn't surprise me, but it also kinda depends how the codegen goes
21:21karolherbst: anyway.. do you have a dump of the code being compiled?
21:21karolherbst: like.. the thing passed into llvm
21:22karolherbst: In hindsight I should have added `dump_clc` to `CLC_DEBUG`, but uhh.. that's kinda not useful in the CL context
21:22dj-death: karolherbst: yeah : https://pastebin.com/5tdiJ7i5
21:22karolherbst: yeah..
21:22karolherbst: "__attribute__((unused)) const struct GFX125_3DSTATE_VERTEX_BUFFERS * restrict values" that's your generic pointer
21:22karolherbst: if you pass in private memory, specify it as "private const struct ...*"
21:22dj-death: I know
21:23karolherbst: though that just gets rid of the generic modes in the cast
21:23karolherbst: not the cast itself
21:23karolherbst: but should make it easier to fix the compiler pass order
21:26dj-death: yeah
21:26dj-death: will see tomorrow, thanks
21:26karolherbst: mhh.. we have this CL runner in piglit.. let's see if I can make it do something
21:27dj-death: yeah private didn't help indeed :)
21:28karolherbst: yo.. how do I make this code run.. this will be header file hell :D
21:32karolherbst: mhh it didn't crash at least
21:32karolherbst: ehh wait.. that's still on my box with llvm-16 I think? mhh
21:33karolherbst: rusticl on llvm-16 compiles that to this, which looks very sane: https://gist.githubusercontent.com/karolherbst/fc5bcbfdcc6d1a633d1e63046a3a3fb0/raw/6188dab8d21ed059a563c01c64dc0c5d5bdd9cf2/gistfile1.txt
21:33karolherbst: finishing compiling llvm-17 and testing it there shortly
21:34karolherbst: but turning it into a kernel can also lead to a lot of things happening or not happening...
21:48airlied: jenatali: pretty sure there is an llvm option to turn on symbol versions across the api
21:48jenatali: Oh cool
22:13dj-death: karolherbst: any luck with llvm-17?
22:13karolherbst: recompiling mesa atm
22:14dj-death: I can install both versions on debian
22:14karolherbst: dj-death: mhh.. with llvm-17 I indeed run into scratch being used...
22:14dj-death: it's not too bad
22:14karolherbst: but at least I have a fairly trivial reproducer now
22:14dj-death: karolherbst: ah so it's more complicated than just ordering :(
22:15karolherbst: I'll try to take a look and see if I can fix it for rusticl, because that's a real issue regardless...
22:15karolherbst: I wonder....
22:15dj-death: yeah, well let me know :)
22:16karolherbst: let me diff it...
22:19karolherbst: yeah....
22:19karolherbst: it's `nir_opt_deref` not kicking in
22:19karolherbst: well..
22:19karolherbst: somewhat
22:20karolherbst: mhhh.. interesting
22:36dj-death: there is an additional
22:36dj-death: 64 %160 = deref_cast (struct.GFX125_3DSTATE_VERTEX_BUFFERS *)%6 (constant struct.GFX125_3DSTATE_VERTEX_BUFFERS) (ptr_stride=20, align_mul=0, align_offset=0)
22:36karolherbst: yeah...
22:36karolherbst: so the thing is, that vars_to_ssa is smart and matches the struct field access back to the original source
22:36karolherbst: but that doesn't happen with the deref_struct thing missing
22:37karolherbst: https://gist.github.com/karolherbst/67bc4a4e2bd62714b92899c2ae4e8cd7
22:38karolherbst: this "32 %26 = mov %21" makes it all optimized away in the llvm-16 case
22:38karolherbst: but I don't know yet what I think would be a good solution to this issue
22:39karolherbst: this casting to the struct base instead of creating a deref of the first field is really a nonsense thing llvm is doing 🙃 I don't understand why
22:41karolherbst: I think we can detect this pattern with explicit type information
22:41jenatali: Seems like we could detect that in nir_opt_deref?
22:41karolherbst: and just... workaround it in a dirty way
22:41karolherbst: yeah
22:41karolherbst: if you cast a struct to the type of its first member... just do a deref_struct on the first field or something
22:41jenatali: Right
22:41jenatali: As long as the cast doesn't add alignment/stride info
22:41karolherbst: yeah...
22:42karolherbst: but that sounds like the most pragmatic solution here
22:42jenatali: It should probably be recursive too...
22:42karolherbst: I wonder what would be the _in a perfect world with perfect code_ solution tho
22:42karolherbst: mhhhh
22:42karolherbst: let's see...
22:42jenatali: struct outer { struct middle { struct inner { int a; }; }; };
22:42karolherbst: should be easy to try
22:42jenatali: Should be able to cast outer* to int*
22:42karolherbst: but yeah...
22:43karolherbst: we can follow inner structs until we hit a non struct/array thing
22:43karolherbst: I suspect LLVM might do the same on arrays 🙃 why stop at structs?
22:43jenatali: Right, array element 0
22:44jenatali: Seems like a worthy optimization even if it was an app that was doing it explicitly TBH
22:44karolherbst: that's the beauty of LLVM: everything is possible
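In plain C, the pattern being discussed looks roughly like this (hypothetical struct, purely to illustrate what opt_replace_struct_wrapper_cast is meant to turn back into a field deref):

```c
/* Hypothetical wrapper struct, for illustration only. */
struct wrapper {
    int   first;
    float rest;
};

int load_first(struct wrapper *w)
{
    /* What newer LLVM tends to emit with opaque pointers: cast the
     * wrapper pointer to its first member's type and load through it. */
    return *(int *)w;
    /* Equivalent form NIR's deref chains would rather see:
     *     return w->first;                                          */
}
```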
22:44dj-death: I'm checking why this doesn't get removed by opt_replace_struct_wrapper_cast :
22:44dj-death: 64 %20 = deref_cast (struct.GFX125_3DSTATE_VERTEX_BUFFERS *)%12 (function_temp struct.GFX125_3DSTATE_VERTEX_BUFFERS) (ptr_stride=0, align_mul=0, align_offset=0)
22:44karolherbst: the stride probably?
22:45dj-death: nope :
22:45dj-death: glsl_get_struct_field_offset(parent->type, 0) != 0
22:45dj-death: WTF
22:45karolherbst: ahh yeah
22:45karolherbst: do you have explicit types at that point?
22:45karolherbst: ohh also...
22:45karolherbst: OHHHH
22:45karolherbst: I remember
22:45karolherbst: I was debugging that function
22:46karolherbst: but I forgot on why it didn't work
22:46karolherbst: lemme debug this :D
22:51karolherbst: dj-death: okay.. so
22:51karolherbst: in rusticl I pass this check
22:51karolherbst: but it fails later
22:51karolherbst: `if (cast->cast.ptr_stride != glsl_get_explicit_stride(field_type))`
22:51karolherbst: ptr_stride of the cast is 4
22:52karolherbst: field_type has a stride of 0 :')
22:54karolherbst: dj-death: https://gist.github.com/karolherbst/a271bf08f8d3da54dc587e1a9e14e676
22:55jenatali: karolherbst: No, that's not the right check
22:55karolherbst: my change? I know.. I just forced to make it work
22:56jenatali: If explicit stride is 0 you'd want to compute an implicit stride
22:56karolherbst: I see...
22:56karolherbst: in any case, the struct fields have no explicit_stride and therefore we don't do this opt
22:56dj-death: karolherbst: yeah that works here too
22:57karolherbst: requires explicit_types before opt_deref tho :)
22:57jenatali: Right, a uint in a struct has an implicit stride of 4, but it gets cast to a pointer with an explicit stride of... 4
22:57karolherbst: but yeah
22:57dj-death: yeah I can work with that
22:57dj-death: especially for temp variables
22:57karolherbst: so yeah.. I think my changes were motivated by this.. but it didn't fix the issue because I forgot to deal with that opt not kicking in
22:59karolherbst: yeah... I just took the opportunity and reworked my pipeline so that I only have to call explicit types exactly once for each var type, so I won't have to reset the scratch_size at all :)
22:59jenatali: karolherbst: I don't see how you can do that
22:59karolherbst: well.. I did
22:59karolherbst: why shouldn't it be possible?
22:59jenatali: If you need explicit types before opt_deref/opt_memcpy, but then after those optimizations temp variables disappear
23:00jenatali: If you don't reset scratch size, you'll overallocate, since you still reserve space for variables that got deleted
23:00karolherbst: so here is the thing
23:00karolherbst: I run those opts once with and once without explicit types
23:00karolherbst: or multiple times even
23:00jenatali: Sure
23:00karolherbst: so in the end it all works out
23:00jenatali: But when you lower to explicit types it sets the scratch size
23:00karolherbst: yeah
23:00karolherbst: but I do that kinda late
23:00jenatali: And those opts won't actually result in variables getting deleted until after lowering to explicit types
23:01jenatali: So you still overallocate. If you reset the scratch size and rerun the explicit types pass, you'll get a (much) smaller value
23:01karolherbst: mhh? I don't think I actually overallocate scratch size, because it is able to figure out to dce some/most of the vars? but maybe I actually did miss something here
23:01jenatali: Which is a design flaw in using that pass to set scratch size, FWIW
23:01karolherbst: but I didn't encounter anything strange
23:02jenatali: What do you do based on the scratch size set in the shader info?
23:02karolherbst: nothing?
23:03jenatali: I ended up failing to compile shaders because the validator that we run downstream on the DXIL sees that we request an alloca, but then never use it, because all of the temp variables go away between setting the scratch size and emitting code
23:03karolherbst: mhhh
23:03karolherbst: maybe we should add a nir validation for this?
23:04karolherbst: scratch size set, but no scratch ops found?
23:04dj-death: yeah
23:04jenatali: It'd need to be explicitly run. Scratch size is set after lower_to_explicit_types, but you need lower_explicit_io afterwards to actually create the scratch ops
23:04karolherbst: I mean.. or a deref on temp memory
23:05dj-death: would need to be moved to io?
23:05jenatali: Then vars_to_ssa or copy_prop would start failing validation
23:05karolherbst: we could check for scratch ops or derefs on temporaries
23:05karolherbst: shouldn't be too hard
23:05jenatali: Nothing decrements the scratch size once it's set
23:05jenatali: Because they just leave holes instead. The scratch offsets are baked into the variables after lowering to explicit types
23:05karolherbst: then those passes nuking all those ops, should set scratch size to 0
23:06karolherbst: that's kinda the point here in validating it, no?
23:06jenatali: But they don't necessarily nuke everything
23:06jenatali: And scratch that looks like { empty space, var } is still bad
23:06karolherbst: fair
23:06karolherbst: or we just add a "nir_shrink_memory" pass
23:07karolherbst: and use it on shared as well
23:07jenatali: Aka "set scratch size to 0 and re-run lower_to_explicit_types" :)
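A hedged sketch of that recipe; the passes named are the real NIR ones, but the ordering is paraphrased from this discussion rather than taken from any particular driver:

```c
/* First lowering assigns scratch offsets and records nir->scratch_size. */
NIR_PASS(_, nir, nir_lower_vars_to_explicit_types,
         nir_var_function_temp, glsl_get_natural_size_align_bytes);

/* Deref/memcpy optimizations may now delete function_temp variables. */
NIR_PASS(_, nir, nir_opt_deref);
NIR_PASS(_, nir, nir_opt_memcpy);
NIR_PASS(_, nir, nir_lower_vars_to_ssa);

/* Re-run the lowering from a clean slate so the recorded scratch size
 * shrinks to what is still referenced, then emit the actual scratch ops. */
nir->scratch_size = 0;
NIR_PASS(_, nir, nir_lower_vars_to_explicit_types,
         nir_var_function_temp, glsl_get_natural_size_align_bytes);
NIR_PASS(_, nir, nir_lower_explicit_io, nir_var_function_temp,
         nir_address_format_32bit_offset);
```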
23:07karolherbst: mhhh
23:07karolherbst: sounds like pain :D
23:07jenatali: And yeah, same problem with shared, but I don't know if as much stuff can remove shared variables
23:07karolherbst: _but_...
23:07karolherbst: maybe I should validate on it in debug builds
23:07karolherbst: do explicit types again and see if it changes the sizes
23:07jenatali: Hint: It will ;)
23:08karolherbst: we'll see about that 🙃 though I think what I am doing now gets me pretty close at least
23:08jenatali: Look at the link you pasted above
23:08jenatali: scratch: 20
23:08jenatali: I don't see any load_scratch or store_scratch
23:08karolherbst: yo.. pain
23:09jenatali: Yeah
23:09karolherbst: *sigh*... guess I'll just add it back then
23:10karolherbst: so anyway...
23:10dj-death: karolherbst: still a bit confused by your current fix
23:11karolherbst: fixing opt_replace_struct_wrapper_cast comes first
23:11karolherbst: don't think too much about it
23:11dj-death: karolherbst: why uint has no explicit_stride ?
23:11karolherbst: I just hacked it
23:11jenatali: dj-death: Explicit stride is only a thing on arrays and matrices in nir
23:11jenatali: And deref_cast I guess
23:11jenatali: Since those can be treated as arrays using deref_ptr_as_array
23:11dj-death: oh
23:12dj-death: so you have to trust the deref here
23:12karolherbst: maybe we should just set explicit_stride on struct members? dunno
23:12dj-death: yeah
23:12karolherbst: though that gets funky with packed structs
23:12dj-death: I mean they have a location
23:12jenatali: I'd want to hear from gfxstrand for this particular issue
23:18karolherbst: can't we just sneak a fix past her this time? 🙃
23:26dj-death: karolherbst: not fixing all the issues though :)
23:27dj-death: struct delta64 { uint64_t v0; uint64_t v1;
23:27dj-death: } data = *((global struct delta64 *)&query[qw_offset]);
23:27dj-death: this local variable ends up to scratch
23:28jenatali: dj-death: How are you expecting that to not end up in scratch?
23:28jenatali: Range analysis on qw_offset to realize it can only be 0 or 1 and turn it into a bcsel or something?
23:31dj-death: jenatali: https://pastebin.com/rL8rfNUw
23:31dj-death: jenatali: a simplified example also scratching
23:32jenatali: dj-death: Ohh I see
23:32jenatali: I misread it at first :)
23:33karolherbst: mhhh
23:34karolherbst: my hope is that with the spirv backend we can just turn on -O2 and llvm gives us less silly code :D
23:35airlied: I admire your confidence (v2)
23:35karolherbst: I'm sure it will work out in the end
23:36jenatali: As long as the end isn't like 10 years from now
23:37dj-death: the problem here is that it's casting the first field which is uint64_t
23:37dj-death: into a uvec4
23:37dj-death: interesting
23:39dj-death: I guess we could add more special casing
23:42dj-death: if you're casting a struct's first field to a type that covers the entire struct
23:42dj-death: you actually want the initial deref to the struct
23:43jenatali: Yeah
23:43dj-death: 64 %9 = deref_var &data (function_temp struct.delta64)
23:43dj-death: 64 %38 = deref_struct &%9->field0 (function_temp uint64_t) // &data.field0
23:43dj-death: 64 %39 = deref_cast (uvec4 *)%38 (function_temp uvec4) (ptr_stride=16, align_mul=0, align_offset=0)
23:43dj-death: can just replace %39 with %9
23:43jenatali: Weird...
23:45jenatali: Yeah maybe not even having to qualify "that covers the entire struct," the only thing you need to make sure is that you're not trying to go the other way and cast to the first member of an inner struct