01:20 fdobridge: <a​irlied> @gfxstrand also for non-turing it would help to get set_mem_order_scope fixed
01:21 fdobridge: <g​fxstrand> ?
01:23 fdobridge: <a​irlied> set_mem_order_scope in NAK has two asserts for sm80
01:30 fdobridge: <g​fxstrand> Right. Yeah. Need to look at that at some point. It would help if I had an Ampere card...
01:32 fdobridge: <g​fxstrand> I really should try and pick something up. GPU prices aren't too insane, I don't think.
01:33 fdobridge: <g​fxstrand> $400 USD for a 3080 isn't bad
01:44 fdobridge: <a​irlied> I think that is the only thing I've really hit on ampere, but I haven't got a full run in a while due to gsp issues
02:11 fdobridge: <m​henning> I have a PR for that https://gitlab.freedesktop.org/gfxstrand/mesa/-/merge_requests/44
02:11 fdobridge: <m​henning> Although I also have a bunch of other issues on ampere that I haven't solved yet
02:13 airlied: oh nice
10:25 fdobridge: <k​arolherbst🐧🦀> uhh.. now that I'm digging into all the drivers, that genxml stuff is quite a mess 🙃
11:08 fdobridge: <k​arolherbst🐧🦀> @gfxstrand btw.. I think I figured out what a fast-GS is 😄
11:08 fdobridge: <k​arolherbst🐧🦀> it's _probably_ a GS with a single stream
14:30 fdobridge: <g​fxstrand> I think there may be other restrictions such as a known number of output primitives.
14:30 fdobridge: <g​fxstrand> Or maybe even that number is known to be 1
14:30 fdobridge: <k​arolherbst🐧🦀> yeah...
14:31 fdobridge: <k​arolherbst🐧🦀> but it does have implications in the ISA
14:31 fdobridge: <k​arolherbst🐧🦀> e.g. `AST` doesn't have the third source as in normal GS
14:31 fdobridge: <k​arolherbst🐧🦀> and `OUT` apparently is a `NOP` in fast-GS
14:31 fdobridge: <g​fxstrand> Right. That would be consistent with there only being one primitive
14:32 fdobridge: <k​arolherbst🐧🦀> yeah...
14:32 fdobridge: <g​fxstrand> Or maybe it's "no more than one". Using a GS right after tessellation to discard primitives is a pretty common case.
14:32 fdobridge: <g​fxstrand> But I'm just spit-balling here. I don't actually know.
14:32 fdobridge: <k​arolherbst🐧🦀> ohh, that might also be it
14:33 fdobridge: <k​arolherbst🐧🦀> I think they mostly wanted to deal with noop and passthrough GS this way
14:33 fdobridge: <g​fxstrand> Yeah
14:34 fdobridge: <k​arolherbst🐧🦀> so the actual thing changing is, that in normal GS you are managing the GS state via that opaque value `OUT` produces and `AST` consumes
14:34 fdobridge: <k​arolherbst🐧🦀> and that's gone
14:34 fdobridge: <k​arolherbst🐧🦀> in fast-GS
14:34 fdobridge: <g​fxstrand> Right
14:34 fdobridge: <g​fxstrand> Intel has something similar called URB handles.
14:35 fdobridge: <g​fxstrand> It's some sort of handle to the memory allocated in the I/O ring for the given primitive.
14:35 fdobridge: <k​arolherbst🐧🦀> yeah...
14:35 fdobridge: <k​arolherbst🐧🦀> sounds about right
14:36 fdobridge: <k​arolherbst🐧🦀> what's also interesting is, that the hardware inserts an implicit `.CUT` whenever the stream_id changes
14:38 fdobridge: <g​fxstrand> What's `.CUT`?
14:39 fdobridge: <k​arolherbst🐧🦀> ends the current output strip
14:40 fdobridge: <k​arolherbst🐧🦀> also discards incomplete vertices
14:40 fdobridge: <g​fxstrand> Right
20:49 fdobridge: <g​eorgeouzou> I think fast gs is the feature from GL_NV_geometry_shader_passthrough
20:51 fdobridge: <k​arolherbst🐧🦀> yeah
20:51 fdobridge: <g​eorgeouzou> It requires that the output primitive type is the same as the input type and the vertex attribs are passed through from the previous stage to the fragment shader
20:51 fdobridge: <k​arolherbst🐧🦀> would make sense
22:35 gfxstrand: dakr: Where did we land on the MAX_PUSH thing? I'm about to type push splitting in NVK but I don't remember if we decided the kernel could split it or if I should.
22:42 dakr: gfxstrand: Oh, I think EXEC still checks for NOUVEAU_GEM_MAX_PUSH.
22:42 gfxstrand: That's fine. It's pretty trivial to split
22:43 dakr: I think the kernel could split it, but it's a bit unclear what to do when we already submitted partial jobs and a subsequent submit fails.
22:43 gfxstrand: That's fair.
22:43 gfxstrand: It's easy enough to split in userspace.
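[Editor's sketch] The userspace splitting discussed above could look roughly like the following. This is a minimal sketch, not the NVK implementation: `struct push_entry`, `submit_fn`, `submit_split`, and the `record_chunk` helper are hypothetical names, and the callback merely stands in for the real EXEC submission path. It also shows the failure case dakr raises: when a later chunk fails, the earlier chunks have already been submitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one push (an address/length pair). */
struct push_entry {
    uint64_t addr;
    uint32_t range;
};

/* Hypothetical callback standing in for the real submission call. */
typedef int (*submit_fn)(const struct push_entry *pushes, size_t count,
                         void *data);

/* Submit `count` pushes in chunks of at most `max_per_submit`.
 * Stops at the first failing chunk and returns its error code;
 * chunks before the failing one have already been submitted. */
static int
submit_split(const struct push_entry *pushes, size_t count,
             size_t max_per_submit, submit_fn submit, void *data)
{
    assert(max_per_submit > 0);

    for (size_t first = 0; first < count; first += max_per_submit) {
        size_t n = count - first;
        if (n > max_per_submit)
            n = max_per_submit;

        int ret = submit(&pushes[first], n, data);
        if (ret != 0)
            return ret;
    }
    return 0;
}

/* Example callback for illustration: records the size of each chunk. */
struct chunk_record {
    size_t sizes[8];
    size_t num_chunks;
};

static int
record_chunk(const struct push_entry *pushes, size_t count, void *data)
{
    struct chunk_record *rec = data;
    (void)pushes;
    if (rec->num_chunks >= 8)
        return -1;
    rec->sizes[rec->num_chunks++] = count;
    return 0;
}
```

With a limit of 512, calling `submit_split` on 1100 pushes with `record_chunk` yields three chunks of 512, 512, and 76 entries.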
22:44 dakr: I just submitted https://lore.kernel.org/nouveau/20230924224555.15595-1-dakr@redhat.com/T/#u btw.
22:45 dakr: Once the scheduler patches and this is merged we should fill the ring more efficiently.
22:45 gfxstrand: Cool!
22:45 dakr: Also we might need to replace the NOUVEAU_GEM_MAX_PUSH limit with an EXEC specific one.
22:46 dakr: Currently NOUVEAU_GEM_MAX_PUSH is 512, but the actual amount of IBs the ring can take is 1023.
22:46 gfxstrand: Sure
22:46 gfxstrand: I think that's probably a good idea.
22:47 dakr: In order to not let the ring run dry the limit should be RING_SIZE / 2.
22:47 gfxstrand: I'm just going to use 512 for now
22:47 gfxstrand: Or are you suggesting 511?
22:48 dakr: Either that, or increasing the ring size, but I'd need to check if we have some limitations there.
22:48 gfxstrand: So, we have two decisions to make here:
22:48 gfxstrand: 1) What to make the hard UAPI limit. This is UAPI so if we ever lower it, that's a breaking change and we need to get that sorted ASAP.
22:49 gfxstrand: 2) What should NVK do. That's kind-of a tuning thing. I can drop down to 250 or something just to ensure we can always fit 4 or whatever.
22:49 airlied: we could add a patch to expose it as a getparam
22:49 gfxstrand: RE 1), right now the UAPI limit as per the kernel I have is 512
22:49 airlied: if we do it quick
22:50 gfxstrand: That would also be an option if we're concerned about changing it in future.
22:50 gfxstrand: It's kind-of annoying for NVK because I'm stack allocating right now. (-:
22:50 gfxstrand: But I can have an NVK max and a kernel max.
22:50 gfxstrand: Kind-of annoying but maybe safer to do that way?
22:50 dakr: Might be better to add some flexibility. Since I don't really know how freely we're able to adjust the ring size.
22:50 gfxstrand: Whatever we do, we should do it quick
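[Editor's sketch] The arithmetic behind the limits discussed above, and the "NVK max plus kernel max" idea, can be illustrated as follows. This is only a sketch under assumptions: the constant names, `half_ring_limit`, and `effective_push_limit` are hypothetical, and the value 256 for the userspace bound is an illustrative choice, not something decided in the log. The 1023 and 512 figures are the ones quoted by dakr.

```c
#include <stddef.h>

/* Figures quoted in the log. */
#define RING_IB_CAPACITY_SKETCH 1023u /* IBs the ring can take */
#define CURRENT_MAX_PUSH_SKETCH 512u  /* current NOUVEAU_GEM_MAX_PUSH value */

/* Hypothetical userspace bound, small enough for a stack-allocated array. */
#define NVK_MAX_PUSH_SKETCH 256u

/* Half the ring, rounded down, so one submit cannot fill the whole ring. */
static inline unsigned
half_ring_limit(unsigned ring_capacity)
{
    return ring_capacity / 2;
}

/* Effective per-submit limit: the smaller of the kernel-reported limit
 * (for example from a getparam, if one is added) and the driver's own bound. */
static inline unsigned
effective_push_limit(unsigned kernel_limit, unsigned driver_limit)
{
    return kernel_limit < driver_limit ? kernel_limit : driver_limit;
}
```

Integer division gives `half_ring_limit(1023) == 511`, which is where the 511 in the question above comes from. Keeping a fixed driver-side bound lets the array stay on the stack even if the kernel later reports a larger limit.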
22:51 airlied: gfxstrand: btw for multiple queues and async compute I expect we may need some more kernel work
22:51 gfxstrand: sure
22:51 gfxstrand: airlied: Uh... multiple queues are a problem? Why?
22:52 airlied: at least async compute was on ben's list as something that needs work
22:52 airlied: to setup channel groups properly
22:53 gfxstrand: Ah.
22:53 gfxstrand: I would expect it works but maybe we don't actually get 3D and compute in parallel.
22:53 airlied: it's an area where I don't really know how it works, so we have to just figure it out from Ben's limited notes :-P
22:53 gfxstrand: hehe
22:53 gfxstrand: Fair
22:55 fdobridge: <g​fxstrand> Hey, look! I just deleted `nvk_device::mutex` 😁
22:59 fdobridge: <g​fxstrand> Ya'know... I bet my CTS runs would be faster if I implemented a pipeline cache... 🤔
23:00 fdobridge: <g​fxstrand> I compile those blit shaders A LOT
23:06 dakr: gfxstrand: I will push a patch tomorrow regarding the IB limit.
23:07 dakr: Good you brought this up again.
23:09 gfxstrand: Thanks! Ping me here and I'll post an NVK change.
23:09 gfxstrand: dakr: Do we have any hidden limits for binds?
23:14 dakr: gfxstrand: Currently it's only limited by memory. But we update page tables from the CPU.
23:16 gfxstrand: Right
23:16 dakr: Not entirely sure if we could run into some limit when we start using the DMA engine one day.
23:16 dakr: But I don't think so.
23:17 gfxstrand: It's mostly a matter of whether or not we want to impose one now or plan to handle that limit in the kernel later.
23:17 gfxstrand: That one's a bit more annoying to deal with in userspace. Still possible, though.
23:18 dakr: I really tend to think that if there is some limitation we can handle this in the kernel.
23:18 gfxstrand: Okay. Works for me.
23:19 gfxstrand: just deleted 1K LOC from NVK. :D
23:19 gfxstrand: Or will once this CTS run is done.
23:20 fdobridge: <g​fxstrand> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25357