01:20fdobridge: <airlied> @gfxstrand also for non-turing it would help to get set_mem_order_scope fixed
01:21fdobridge: <gfxstrand> ?
01:23fdobridge: <airlied> set_mem_order_scope in NAK has two asserts for sm80
01:30fdobridge: <gfxstrand> Right. Yeah. Need to look at that at some point. It would help if I had an Ampere card...
01:32fdobridge: <gfxstrand> I really should try and pick something up. GPU prices aren't too insane, I don't think.
01:33fdobridge: <gfxstrand> $400 USD for a 3080 isn't bad
01:44fdobridge: <airlied> I think that is the only thing I've really hit on ampere, but I haven't got a full run in a while due to gsp issues
02:11fdobridge: <mhenning> I have a PR for that https://gitlab.freedesktop.org/gfxstrand/mesa/-/merge_requests/44
02:11fdobridge: <mhenning> Although I also have a bunch of other issues on ampere that I haven't solved yet
02:13airlied: oh nice
10:25fdobridge: <karolherbst🐧🦀> uhh.. now that I'm digging into all the drivers, that genxml stuff is quite a mess 🙃
11:08fdobridge: <karolherbst🐧🦀> @gfxstrand btw.. I think I figured out what a fast-GS is 😄
11:08fdobridge: <karolherbst🐧🦀> it's _probably_ a GS with a single stream
14:30fdobridge: <gfxstrand> I think there may be other restrictions such as a known number of output primitives.
14:30fdobridge: <gfxstrand> Or maybe even that number is known to be 1
14:30fdobridge: <karolherbst🐧🦀> yeah...
14:31fdobridge: <karolherbst🐧🦀> but it does have implications in the ISA
14:31fdobridge: <karolherbst🐧🦀> e.g. `AST` doesn't have the third source as in normal GS
14:31fdobridge: <karolherbst🐧🦀> and `OUT` apparently is a `NOP` in fast-GS
14:31fdobridge: <gfxstrand> Right. That would be consistent with there only being one primitive
14:32fdobridge: <karolherbst🐧🦀> yeah...
14:32fdobridge: <gfxstrand> Or maybe it's "no more than one". Using a GS right after tessellation to discard primitives is a pretty common case.
14:32fdobridge: <gfxstrand> But I'm just spit-balling here. I don't actually know.
14:32fdobridge: <karolherbst🐧🦀> ohh, that might also be it
14:33fdobridge: <karolherbst🐧🦀> I think they mostly wanted to deal with noop and passthrough GS this way
14:33fdobridge: <gfxstrand> Yeah
14:34fdobridge: <karolherbst🐧🦀> so the actual thing changing is, that in normal GS you are managing the GS state via that opaque value `OUT` produces and `AST` consumes
14:34fdobridge: <karolherbst🐧🦀> and that's gone
14:34fdobridge: <karolherbst🐧🦀> in fast-GS
14:34fdobridge: <gfxstrand> Right
14:34fdobridge: <gfxstrand> Intel has something similar called URB handles.
14:35fdobridge: <gfxstrand> It's some sort of handle to the memory allocated in the I/O ring for the given primitive.
14:35fdobridge: <karolherbst🐧🦀> yeah...
14:35fdobridge: <karolherbst🐧🦀> sounds about right
14:36fdobridge: <karolherbst🐧🦀> what's also interesting is, that the hardware inserts an implicit `.CUT` whenever the stream_id changes
14:38fdobridge: <gfxstrand> What's `.CUT`?
14:39fdobridge: <karolherbst🐧🦀> ends the current output strip
14:40fdobridge: <karolherbst🐧🦀> also discards incomplete vertices
14:40fdobridge: <gfxstrand> Right
20:49fdobridge: <georgeouzou> I think fast gs is the feature from GL_NV_geometry_shader_passthrough
20:51fdobridge: <karolherbst🐧🦀> yeah
20:51fdobridge: <georgeouzou> It requires that the output primitive type is the same as the input type and the vertex attribs are passed through from the previous stage to the fragment shader
20:51fdobridge: <karolherbst🐧🦀> would make sense
22:35gfxstrand: dakr: Where did we land on the MAX_PUSH thing? I'm about to type push splitting in NVK but I don't remember if we decided the kernel could split it or if I should.
22:42dakr: gfxstrand: Oh, I think EXEC still checks for NOUVEAU_GEM_MAX_PUSH.
22:42gfxstrand: That's fine. It's pretty trivial to split
22:43dakr: I think the kernel could split it, but it's a bit unclear what to do when we already submitted partial jobs and a subsequent submit fails.
22:43gfxstrand: That's fair.
22:43gfxstrand: It's easy enough to split in userspace.
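The userspace splitting discussed above could look roughly like the following. This is only a minimal sketch, not NVK's actual code: `example_push`, `example_submit_fn`, `example_submit_split`, `example_count_submit` and `EXAMPLE_MAX_PUSH` are hypothetical names invented for illustration, and the submit hook stands in for whatever call really issues an EXEC. The value 512 is taken from the limit mentioned in this conversation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one push (IB) entry; not the real UAPI struct. */
struct example_push {
    uint64_t addr;
    uint32_t len;
};

/* Hypothetical submit hook: returns 0 on success, non-zero on failure. */
typedef int (*example_submit_fn)(const struct example_push *pushes,
                                 uint32_t count, void *data);

/* Assumed per-submit limit, using the 512 quoted in the discussion. */
#define EXAMPLE_MAX_PUSH 512u

/* Submit `count` pushes in chunks of at most `max_per_submit` entries. */
static int
example_submit_split(const struct example_push *pushes, size_t count,
                     uint32_t max_per_submit,
                     example_submit_fn submit, void *data)
{
    assert(max_per_submit > 0);

    while (count > 0) {
        uint32_t n = count < max_per_submit ? (uint32_t)count
                                            : max_per_submit;
        int ret = submit(pushes, n, data);
        if (ret != 0)
            return ret; /* earlier chunks have already been submitted */
        pushes += n;
        count -= n;
    }
    return 0;
}

/* Illustrative hook that only records what it was asked to submit. */
struct example_stats {
    uint32_t calls;    /* number of submit calls made */
    uint32_t max_seen; /* largest chunk passed to a single call */
    size_t total;      /* total entries across all calls */
};

static int
example_count_submit(const struct example_push *pushes, uint32_t count,
                     void *data)
{
    struct example_stats *st = data;
    (void)pushes;
    st->calls++;
    st->total += count;
    if (count > st->max_seen)
        st->max_seen = count;
    return 0;
}
```

Note that the early return on failure leaves the earlier chunks submitted, which is the same partial-submission question dakr raises for the kernel-side split; in userspace the caller decides how to report it.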
22:44dakr: I just submitted https://firstname.lastname@example.org/T/#u btw.
22:45dakr: Once the scheduler patches and this are merged we should fill the ring more efficiently.
22:45dakr: Also we might need to replace the NOUVEAU_GEM_MAX_PUSH limit with an EXEC specific one.
22:46dakr: Currently NOUVEAU_GEM_MAX_PUSH is 512, but the actual amount of IBs the ring can take is 1023.
22:46gfxstrand: I think that's probably a good idea.
22:47dakr: In order to not let the ring run dry the limit should be RING_SIZE / 2.
22:47gfxstrand: I'm just going to use 512 for now
22:47gfxstrand: Or are you suggesting 511?
22:48dakr: Either that, or increasing the ring size, but I'd need to check if we have some limitations there.
22:48gfxstrand: So, we have two decisions to make here:
22:48gfxstrand: 1) What to make the hard UAPI limit. This is UAPI so if we ever lower it, that's a breaking change and we need to get that sorted ASAP.
22:49gfxstrand: 2) What should NVK do. That's kind-of a tuning thing. I can drop down to 250 or something just to ensure we can always fit 4 or whatever.
22:49airlied: we could add a patch to expose it as a getparam
22:49gfxstrand: RE 1), right now the UAPI limit as per the kernel I have is 512
22:49airlied: if we do it quick
22:50gfxstrand: That would also be an option if we're concerned about changing it in future.
22:50gfxstrand: It's kind-of annoying for NVK because I'm stack allocating right now. (-:
22:50gfxstrand: But I can have an NVK max and a kernel max.
22:50gfxstrand: Kind-of annoying but maybe safer to do that way?
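One way to read the "NVK max and a kernel max" idea is as a clamp: userspace keeps a compile-time bound so it can still stack allocate, and uses the smaller of that bound and whatever the kernel reports. The sketch below assumes this reading. All names are hypothetical, the getparam mentioned above did not exist at the time of this conversation, and treating a reported value of 0 as "not available" is an assumption made only for the example. The second helper illustrates the RING_SIZE / 2 suggestion using the 1023 figure quoted earlier.

```c
#include <stdint.h>

/* Hypothetical compile-time bound so a stack array can be sized statically. */
#define EXAMPLE_NVK_MAX_PUSH 512u

/*
 * Pick the per-submit limit userspace should use.
 * kernel_max == 0 is treated as "kernel did not report a limit"
 * (an assumption for this sketch), in which case the local bound is used.
 */
static uint32_t
example_effective_max_push(uint32_t kernel_max)
{
    if (kernel_max == 0)
        return EXAMPLE_NVK_MAX_PUSH;
    return kernel_max < EXAMPLE_NVK_MAX_PUSH ? kernel_max
                                             : EXAMPLE_NVK_MAX_PUSH;
}

/* Half the ring, rounded down, so one submit never fills the whole ring. */
static uint32_t
example_limit_from_ring(uint32_t ring_entries)
{
    return ring_entries / 2;
}
```

With a ring that can take 1023 IBs, integer division gives 511, which matches the "are you suggesting 511?" question above.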
22:50dakr: Might be better to add some flexibility. Since I don't really know how freely we're able to adjust the ring size.
22:50gfxstrand: Whatever we do, we should do it quick
22:51airlied: gfxstrand: btw for multiple queues and async compute I expect we may need some more kernel work
22:51gfxstrand: airlied: Uh... multiple queues are a problem? Why?
22:52airlied: at least async compute was on ben's list as something that needs work
22:52airlied: to setup channel groups properly
22:53gfxstrand: I would expect it works but maybe we don't actually get 3D and compute in parallel.
22:53airlied: it's an area I don't really know how it works, so we have to just figure it out from Ben's limited notes :-P
22:55fdobridge: <gfxstrand> Hey, look! I just deleted `nvk_device::mutex` 😁
22:59fdobridge: <gfxstrand> Ya'know... I bet my CTS runs would be faster if I implemented a pipeline cache... 🤔
23:00fdobridge: <gfxstrand> I compile those blit shaders A LOT
23:06dakr: gfxstrand: I will push a patch tomorrow regarding the IB limit.
23:07dakr: Good you brought this up again.
23:09gfxstrand: Thanks! Ping me here and I'll post an NVK change.
23:09gfxstrand: dakr: Do we have any hidden limits for binds?
23:14dakr: gfxstrand: Currently it's only limited by memory. But we update page tables from the CPU.
23:16dakr: Not entirely sure if we could run into some limit when we start using the DMA engine one day.
23:16dakr: But I don't think so.
23:17gfxstrand: It's mostly a matter of whether or not we want to impose one now or plan to handle that limit in the kernel later.
23:17gfxstrand: That one's a bit more annoying to deal with in userspace. Still possible, though.
23:18dakr: I really tend to think that if there is some limitation we can handle this in the kernel.
23:18gfxstrand: Okay. Works for me.
23:19gfxstrand:just deleted 1K LOC from NVK. :D
23:19gfxstrand: Or will once this CTS run is done.
23:20fdobridge: <gfxstrand> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25357