00:03fdobridge: <marysaka> nice catch 😅
02:28fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Did this happen in the non-serial CTS run?
03:37fdobridge: <gfxstrand> Nope
03:38fdobridge: <gfxstrand> It didn't happen in my serial run with deqp-runner, either. I only saw it in my "Let's pretend I want to submit conformance" run.
03:52fdobridge: <gfxstrand> Ugh... these timeline semaphore tests take forever. 🙄
03:59fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> How do they compare to memory model tests?
04:35fdobridge: <gfxstrand> IDK. I've not run the memory model tests
04:36fdobridge: <gfxstrand> Running @marysaka 's GS branch and seeing lots of faults. Looks like stuff is running but something's still wrong.
06:36fdobridge: <airlied> I wonder if that race is responsible for those weird ampere sked issues
06:43fdobridge: <mohamexiety> that's a really obscure bug 😮
06:43fdobridge: <mohamexiety> nice catch!
08:24fdobridge: <marysaka> Are those mmu faults?
14:26fdobridge: <gfxstrand> Yeah
14:26fdobridge: <gfxstrand> Quite possibly
14:53fdobridge: <gfxstrand> Here's the `NVK_USE_NAK=all` numbers:
14:53fdobridge: <gfxstrand> `Pass: 399394, Fail: 5810, Crash: 11, Warn: 4, Skip: 3194910, Timeout: 2, Flake: 752, Duration: 2:30:54`
14:53fdobridge: <gfxstrand> Not bad, really. Just something's still wrong with GS
14:54fdobridge: <gfxstrand> Binding model tests especially seem to go boom
15:35fdobridge: <gfxstrand> Damn... It really does smell like something to do with branching.
15:46fdobridge: <gfxstrand> Yup! It's branching alright...
15:46fdobridge: <gfxstrand> The next question is: WTF?!?
15:50fdobridge: <gfxstrand> I think the old compiler is saved by control-flow re-convergence. However, geometry shaders have no general guarantee that `emitVertex()` happens in uniform control-flow.
15:52fdobridge: <gfxstrand> I'm thinking this is a problem for Monday...
15:52fdobridge: <karolherbst🐧🦀> mhh.. you think that `OUT` needs to be executed uniformly?
15:55fdobridge: <gfxstrand> IDK. That seems off because, like I said, there's no GLSL or SPIR-V requirement that it must.
15:55fdobridge: <gfxstrand> But it's definitely breaking when it's not.
15:55fdobridge: <gfxstrand> Forcing it to be uniform would be some real compiler shenanigans.
15:55fdobridge: <gfxstrand> The other thing is that it does seem to work. It just doesn't pick up all of the output stores correctly.
15:56fdobridge: <karolherbst🐧🦀> are you sure to always execute `OUT.FINAL`?
15:56fdobridge: <gfxstrand> Yeah
15:56fdobridge: <karolherbst🐧🦀> are you sure that the state register is always passed through correctly?
15:57fdobridge: <gfxstrand> Looks like
15:58fdobridge: <karolherbst🐧🦀> also, the register has to be initialized with 0
15:59fdobridge: <karolherbst🐧🦀> mhhh... the docs are a bit confusing here.. because later it also states it has to be sourced by `AST` 🙃
15:59fdobridge: <gfxstrand> It does have to be sourced by AST but IDK that that's always happening
15:59fdobridge: <karolherbst🐧🦀> ehh wait.. I read that wrongly, yeah
16:00fdobridge: <karolherbst🐧🦀> anyway.. not really seeing any requirement that threads must be converged before executing it..
16:02fdobridge: <karolherbst🐧🦀> nvidia uses exit handler to call `OUT.FINAL`, but with proper lowering that shouldn't be required anyway
16:02fdobridge: <gfxstrand> So, we're actually using $ZERO for AST for a bunch of cases.
16:02fdobridge: <gfxstrand> Maybe that's messing something up?
16:02fdobridge: <karolherbst🐧🦀> what the docs also suggest is to allocate a register for the state thing
16:02fdobridge: <karolherbst🐧🦀> would be surprising.. but also possible
16:02fdobridge: <karolherbst🐧🦀> do you even know how $ZERO works? 😄
16:03fdobridge: <karolherbst🐧🦀> or rather, what makes it return 0?
16:03fdobridge: <karolherbst🐧🦀> it's a little trick, that any OOB register read is returning 0, just for $ZERO the hardware doesn't complain or something.
16:04fdobridge: <karolherbst🐧🦀> which... might cause some issues, and I think codegen has a few places where it forces a live register with 0 instead of RZ
16:04fdobridge: <gfxstrand> There are definitely instructions which don't like it
16:04fdobridge: <gfxstrand> Typically things that source vectors like tex or st
16:05fdobridge: <gfxstrand> Even if they're only reading a single component, they don't like ZERO
16:05fdobridge: <karolherbst🐧🦀> yeah...
16:05fdobridge: <gfxstrand> Learned that one the hard way. 🙃
16:05fdobridge: <karolherbst🐧🦀> oof
16:05fdobridge: <gfxstrand> In any case, making AST not use zero doesn't fix anything
16:05fdobridge: <karolherbst🐧🦀> anyway.. you might want to try out `OUT` without RZ
16:06fdobridge: <karolherbst🐧🦀> maybe even try to force the same register for all `OUT`s
16:06fdobridge: <karolherbst🐧🦀> not sure if Nvidia always uses a specific instruction or not
16:06fdobridge: <gfxstrand> We're currently using R0 for all outs
16:06fdobridge: <gfxstrand> By sheer RA luck but it at least shows that's not the problem
16:07fdobridge: <karolherbst🐧🦀> mhhh
16:07fdobridge: <karolherbst🐧🦀> it's even recommended to always use the same register
16:07fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> I wonder how long Karol can infodump stuff for about special NVIDIA hardware quirks/features 🤔
16:08fdobridge: <karolherbst🐧🦀> but anyway.. not really seeing any weird requirements here... could also be something else going wrong
16:08fdobridge: <karolherbst🐧🦀> anyway.. thread reconvergence isn't hard to implement anyway, so might as well try it
16:10fdobridge: <gfxstrand> Yeah, but that doesn't solve the problem. You can `EmitVertex()` in a non-uniform loop
16:10fdobridge: <gfxstrand> That just fixes the CTS tests. It'll break in the wild
16:11fdobridge: <gfxstrand> The fact that there are no CTS tests for EmitVertex in non-uniform control-flow is a little disturbing...
16:12fdobridge: <karolherbst🐧🦀> checked what nvidia emits?
16:12fdobridge: <gfxstrand> And like I said, it does work. It emits vertices. It just gets one of the colors wrong
16:12fdobridge: <karolherbst🐧🦀> sounds like something else is going wrong tho
16:13fdobridge: <karolherbst🐧🦀> there aren't really many instructions which need converged threads, mostly just barriers and subgroup things
16:14fdobridge: <marysaka> Could it be related to the fact that EXIT isn't at the end of the shader most of the time?
16:15fdobridge: <marysaka> From what I recall NVIDIA drivers put some infinite branch after each EXIT too but idk if that truly matters
16:15fdobridge: <karolherbst🐧🦀> I wonder if `OUT.FINAL` needs converged threads?
16:15fdobridge: <karolherbst🐧🦀> and I wonder if the exit handler always executes with converged threads
16:16fdobridge: <karolherbst🐧🦀> yeah...
16:16fdobridge: <karolherbst🐧🦀> all threads enter the exit handler together
16:16fdobridge: <karolherbst🐧🦀> if it matters for `OUT.FINAL`? no idea
16:17fdobridge: <karolherbst🐧🦀> but if it does, nvidia does the correct thing
16:17fdobridge: <marysaka> The fact that even for simple shaders it was doing that, maybe it does matter...?
16:17fdobridge: <karolherbst🐧🦀> maybe?
16:18fdobridge: <karolherbst🐧🦀> might also be just static GS specific code
16:18fdobridge: <karolherbst🐧🦀> it does simplify things
16:18fdobridge: <karolherbst🐧🦀> for two reasons
16:18fdobridge: <karolherbst🐧🦀> 1. you always have to execute `OUT.FINAL`, so might as well push it into the exit handler for everything
16:19fdobridge: <karolherbst🐧🦀> 2. if you allocate a register for all `OUT`s, you can just use that in the exit handler as well
16:19fdobridge: <karolherbst🐧🦀> and then you won't have to bother that CFG manipulation messes up your GS
16:20fdobridge: <gfxstrand> Ooh, I think I found the bug
16:20fdobridge: <gfxstrand> Maybe
16:20fdobridge: <karolherbst🐧🦀> anyway.. having exit handler support in NAK might also come in handy with other things 😄
16:20fdobridge: <karolherbst🐧🦀> but anyway.. they are per warp, and only executed if all threads are in KILLED state
16:21fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Hopefully NAK will be able to do ProcessVertices() now
16:21fdobridge: <karolherbst🐧🦀> huh...
16:22fdobridge: <karolherbst🐧🦀> exit acts as an implied join also for helper threads
16:22fdobridge: <gfxstrand> It's mary's opt_out
16:22fdobridge: <gfxstrand> IDK what it's doing but it's doing it wrong
16:22fdobridge: <marysaka> huh
16:23fdobridge: <marysaka> oh that may make sense... now that I think about it I only saw Maxwell/Pascal emit that with NVIDIA driver, not with Turing
16:23fdobridge: <marysaka> it was doing OUT.EMIT and OUT.CUT one after the other...
16:23fdobridge: <karolherbst🐧🦀> 🙃
16:23fdobridge: <gfxstrand> opt_out is throwing away a predicate on a BRA
16:23fdobridge: <gfxstrand> @marysaka, what is opt_out trying to do?
16:24fdobridge: <marysaka> Merging OUT.EMIT and OUT.CUT when they are one after the other and if the stream id is the same
16:24fdobridge: <gfxstrand> Ah
16:24fdobridge: <marysaka> so I suppose I forgot about predicates oops
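A minimal sketch of the merge condition described above, written against a toy IR rather than NAK's real one. Every name here (`Pred`, `OutKind`, `OutInstr`, `try_merge`) is hypothetical, and `EmitThenCut` simply stands in for whatever combined form the hardware offers. The only point is that a peephole which folds an emit followed by a cut probably needs to compare the predicates as well as the stream id:

```rust
// Toy model only: these types are NOT NAK's IR, just an illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Pred {
    Always,    // unpredicated instruction
    If(u8),    // runs when predicate register Pn is true
    IfNot(u8), // runs when predicate register Pn is false
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OutKind {
    Emit,
    Cut,
    EmitThenCut, // hypothetical name for the merged form
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct OutInstr {
    pub pred: Pred,
    pub kind: OutKind,
    pub stream: u32,
}

/// Fold `emit; cut` into one instruction, but only when both target the
/// same stream AND execute under the same predicate. The merged
/// instruction keeps that predicate.
pub fn try_merge(a: &OutInstr, b: &OutInstr) -> Option<OutInstr> {
    let mergeable = a.kind == OutKind::Emit
        && b.kind == OutKind::Cut
        && a.stream == b.stream
        && a.pred == b.pred;
    if mergeable {
        Some(OutInstr {
            pred: a.pred,
            kind: OutKind::EmitThenCut,
            stream: a.stream,
        })
    } else {
        None
    }
}
```

The sketch ignores the handle register that real `OUT` instructions thread through, so it only illustrates the predicate and stream checks.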
16:25fdobridge: <karolherbst🐧🦀> might also want to get rid of `OUT.CUT` if the stream id changes
16:26fdobridge: <karolherbst🐧🦀> I guess that merges them to `EMIT_THEN_CUT`?
16:26fdobridge: <gfxstrand> Okay, disabled opt_out and I'm going to do another CTS run.
16:26fdobridge: <karolherbst🐧🦀> anyway, that's something the hardware does. Inserting an `OUT.CUT` if the stream id changes on the transition point.
16:30fdobridge: <marysaka> yeah
16:48fdobridge: <gfxstrand> Okay, I fixed it.
16:49fdobridge: <gfxstrand> Time for a full CTS run with `NVK_USE_NAK=all` 🚀
16:57fdobridge: <marysaka> so it was merging an OUT.EMIT and OUT.CUT with different predicate conditions?
17:06fdobridge: <mhenning> @gfxstrand Was it intentional that you removed the commits from https://gitlab.freedesktop.org/gfxstrand/mesa/-/merge_requests/44 from nak/main?
17:09fdobridge: <gfxstrand> Nope
17:09fdobridge: <gfxstrand> All this rebasing...
17:10fdobridge: <karolherbst🐧🦀> just ship it already 🙃
17:12fdobridge: <gfxstrand> Yeah...
17:12fdobridge: <gfxstrand> No, it was dropping the predicates from any instruction that happened to come after an OUT.CUT
17:13fdobridge: <marysaka> oh no :blobcatnotlikethis:
17:13fdobridge: <gfxstrand> Yeah, you can't safely pull out an op and then do `Instr::new_boxed(op)` and trust that you get the same instruction.
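To illustrate the pitfall, here is a small sketch using a toy `Instr` that only borrows the `new_boxed` name from the message above. It is not NAK's actual type, and the assumption that the constructor yields an unpredicated instruction is taken from the description of the bug, not from the real code. `rebuild_lossy` and `rewrite_in_place` are hypothetical helpers:

```rust
// Toy stand-ins: the real NAK `Instr` and `Op` types look different.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Op {
    OutCut,
    Bra { target: u32 },
    Nop,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Instr {
    pub pred: Option<u8>, // None means "always execute"
    pub op: Op,
}

impl Instr {
    /// Assumed behaviour: a fresh instruction starts out unpredicated.
    pub fn new_boxed(op: Op) -> Box<Instr> {
        Box::new(Instr { pred: None, op })
    }
}

/// The buggy pattern: take the op out and rebuild the instruction.
/// Everything that lived next to the op (here, the predicate) is lost.
pub fn rebuild_lossy(instr: Box<Instr>) -> Box<Instr> {
    let old = *instr;
    Instr::new_boxed(old.op)
}

/// A safer pattern: rewrite the op in place so the rest of the
/// instruction, including its predicate, is left untouched.
pub fn rewrite_in_place(mut instr: Box<Instr>, f: impl FnOnce(Op) -> Op) -> Box<Instr> {
    let op = std::mem::replace(&mut instr.op, Op::Nop);
    instr.op = f(op);
    instr
}
```

With a predicated `Bra`, `rebuild_lossy` returns the same op with `pred: None`, which would match the "throwing away a predicate on a BRA" symptom, while `rewrite_in_place` with an identity closure returns the instruction unchanged.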
17:16fdobridge: <gfxstrand> I just need to decide how much I hate my proc macro wraps
17:16fdobridge: <gfxstrand> Xavier has a meson MR to use actual crates as wraps but it's not in meson upstream yet.
17:18fdobridge: <karolherbst🐧🦀> yeah....
17:19fdobridge: <gfxstrand> I'm going to chat with Dylan and I think we'll probably just merge the horrible wraps for now.
17:20fdobridge: <mhenning> should also remove the hacked changes to codegen
17:20fdobridge: <gfxstrand> Yeah
17:21fdobridge: <gfxstrand> There's a bit of maintenance work to do on the branch.
17:21fdobridge: <gfxstrand> I need to pull the NIR bits out into an MR (one already exists but it's way out of date)
17:21fdobridge: <gfxstrand> I need to drop the codegen hacks
17:21fdobridge: <karolherbst🐧🦀> though maybe we should wait for 1.3 because that will probably also have that static inline wrapping stuff
17:21fdobridge: <gfxstrand> Scan through for spurious NVK changes that might be breaking things.
17:22fdobridge: <gfxstrand> That's not actually hurting me too bad.
17:22fdobridge: <karolherbst🐧🦀> it's more about others though
17:23fdobridge: <gfxstrand> We can stick in meson requirements on a per-component basis
17:23fdobridge: <gfxstrand> We're not going to lock all of Mesa to 1.3
17:23fdobridge: <karolherbst🐧🦀> e.g. https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25129
17:24fdobridge: <karolherbst🐧🦀> though I think it was worse in the past
17:24fdobridge: <gfxstrand> I don't see what that has to do with NAK. We basically never use types in nak_from_nir
17:25fdobridge: <karolherbst🐧🦀> just the general thing if somebody converts functions to static inlines
17:25fdobridge: <karolherbst🐧🦀> and then having to update all the rust code using it
17:25fdobridge: <karolherbst🐧🦀> but yeah...
17:25fdobridge: <gfxstrand> Yeah, but I'm not using any of that
17:25fdobridge: <gfxstrand> All the "interesting" NIR usage is in C code
17:25fdobridge: <karolherbst🐧🦀> ahh, okay
17:28fdobridge: <gfxstrand> GS run looking good so far: `Pass: 61395, Fail: 2, Crash: 1, Skip: 481527, Flake: 75, Duration: 15:23, Remaining: 1:26:37`
17:28fdobridge: <gfxstrand> I'll report back in a few hours once it's actually done
18:21fdobridge: <mhenning> ugh. gitlab won't let me post a `Reviewed-by` tag with my email address in it any more because it thinks it's spam.
19:36fdobridge: <_maide> gfxstrand: What do the "needed" transform feedback query reports represent? E.g `REPORT_SEMAPHORE_D_REPORT_STREAMING_PRIMITIVES_NEEDED` and `REPORT_SEMAPHORE_D_REPORT_STREAMING_PRIMITIVES_NEEDED_MINUS_SUCCEEDED`
19:36fdobridge: <_maide> I saw a while ago that you implemented them somewhere, but I can't find the commits for them anymore. As I remember it wasn't clear what they meant from your code though.
21:48fdobridge: <gfxstrand> GS looks good. There's a silly XFB test failing but I suspect that's a cache flushing issue more than anything. I've got some ideas. I just need to poke about a bit.
21:48fdobridge: <gfxstrand> Figuring out caching for real is getting pretty high up my priority list.
22:21HdkR: gfxstrand: I'm curious your opinion. Now that you have experience with GS on multiple different pieces of hardware. How does the NVIDIA implementation compare? :)
22:33fdobridge: <georgeouzou> The following explains what the xfb queries report :
22:33fdobridge: <georgeouzou> https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#queries-transform-feedback