03:41fdobridge: <gfxstrand> Latest NAK numbers (CS only):
03:41fdobridge: <gfxstrand> `Pass: 311177, Fail: 1063, Crash: 1246, Skip: 1673845, Flake: 44, Duration: 1:01:48`
03:44fdobridge: <gfxstrand> I think next week's project is going to be fixing dependency analysis (I'm still running with `NAK_DEBUG=serial`) and then I should probably implement spilling.
03:56fdobridge: <gfxstrand> And.. Just fixed another 144 tests at least. (`nir_texop_txf_ms` was broken)
16:20fdobridge: <Samantas5855> why are so many skipped
16:20fdobridge: <Mohamexiety> unsupported/non advertised yet probably
16:23fdobridge: <Samantas5855> I see thx
16:23fdobridge: <Samantas5855> when is nak coming to the nvk branch
16:34fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> I wonder if NAK will be able to compile those pesky Minecraft shaders 🤔
17:38fdobridge: <gfxstrand> Soon, I think. IDK what the "It's time" point is but I think we're getting close. Probably as soon as I make dep tracking not total crap.
17:39fdobridge: <gfxstrand> `Pass: 311315, Fail: 924, Crash: 1246, Skip: 1673845, Flake: 45, Duration: 1:00:23`
17:39fdobridge: <gfxstrand> Unsupported features and formats. Formats make a lot of difference. Even a desktop driver that supports Vulkan 1.3 and all the fancy features will run at most 50% of the tests.
17:40fdobridge: <gfxstrand> So 20% is actually more like 40%
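A rough sketch of the arithmetic behind the two messages above, assuming the summary line format shown in this log. The helper name `parse_summary` and the 50% ceiling variable are illustrative only (the ceiling is the "at most 50%" estimate from the chat, not a measured value). For this particular run the exact pass ratio works out somewhat below the round 20% quoted, and the scaled figure correspondingly below 40%.

```python
import re

def parse_summary(line):
    """Parse 'Name: count' pairs from a CTS summary line (hypothetical helper).

    The negative lookahead skips 'Duration: 1:00:23', which is not a count.
    """
    return {k: int(v) for k, v in re.findall(r"(\w+): (\d+)(?![\d:])", line)}

line = "Pass: 311315, Fail: 924, Crash: 1246, Skip: 1673845, Flake: 45, Duration: 1:00:23"
counts = parse_summary(line)

total = sum(counts.values())          # every test in the run, including skips
ran = total - counts["Skip"]          # tests the driver actually attempted

pass_of_all = counts["Pass"] / total  # pass rate over the whole test list
pass_of_ran = counts["Pass"] / ran    # pass rate over attempted tests only

# Assumption taken from the chat: even a full-featured desktop driver
# runs at most about half of the tests, so scale against that ceiling.
RUNNABLE_CEILING = 0.5
pass_vs_ceiling = pass_of_all / RUNNABLE_CEILING

print(f"pass / all tests:     {pass_of_all:.1%}")
print(f"pass / attempted:     {pass_of_ran:.1%}")
print(f"pass / ~50% ceiling:  {pass_vs_ceiling:.1%}")
```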
17:42fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> How hard would it be to skip zero tests? 🍩
17:47fdobridge: <Mohamexiety> probably not a good question, sorry, but what sort of dependency tracking do you need to do? since the hardware does track and handle dependencies too from what I understand, so what does the compiler need to do vs what the HW already does?
17:51fdobridge: <Samantas5855> thx, what about flake
17:53fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> "NO FAILURES, NO FLAKES, it just passe-" - zmike, 2022
20:03fdobridge: <gfxstrand> I don't think there exists hardware on which that is possible. Lavapipe could maybe do it in theory because it's not hardware and we can implement everything.
20:04fdobridge: <gfxstrand> The hardware doesn't, not on Turing and later.
20:04fdobridge: <gfxstrand> Flakes are bugs we need to fix. They're just not 100% reproducible.
20:05fdobridge: <Mohamexiety> oh 😮 I see.
20:05fdobridge: <Mohamexiety> so all sorts of dependences need to be handled entirely by the compiler in this case.. thanks!
20:06fdobridge: <Samantas5855> I see thanks
20:23fdobridge: <gfxstrand> Yup. There are two mechanisms for it. One is a delay where an instruction can say "Wait at least N cycles before executing the next instruction." The other is a token mechanism where one instruction signals a barrier slot when it's done and other instructions can wait on those barrier slots before they execute. This is needed for things like texture instructions, which take a variable amount of time to execute depending on memory, caching, etc.
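A minimal sketch of how a compiler pass might combine the two mechanisms just described, using a toy in-order model. This is not NAK's implementation or the real hardware encoding: `Instr`, `assign_deps`, `NUM_SLOTS`, the register names, and the latencies are all hypothetical, and the model ignores write-after-read hazards. Fixed-latency results are covered by stretching the delay on the previous instruction; variable-latency results (such as a texture fetch) get a barrier slot that consumers wait on.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

NUM_SLOTS = 6  # assumed number of barrier slots in this toy model

@dataclass
class Instr:
    name: str
    dst: Optional[str]
    srcs: Tuple[str, ...]
    latency: Optional[int]        # fixed latency in cycles, None if variable
    delay: int = 1                # cycles to wait before issuing the next instr
    signal: Optional[int] = None  # barrier slot signalled on completion
    wait: Set[int] = field(default_factory=set)  # slots to wait on before issue

def assign_deps(instrs: List[Instr]) -> List[Instr]:
    ready_at = {}  # reg -> cycle when a fixed-latency result is available
    pending = {}   # reg -> barrier slot guarding a variable-latency result
    free_slots = list(range(NUM_SLOTS))
    cycle = 0      # issue cycle of the current instruction
    prev = None
    for ins in instrs:
        need = cycle
        # Treat the destination like a source so overwrites stay ordered.
        for reg in (*ins.srcs, ins.dst):
            if reg in pending:
                slot = pending.pop(reg)
                ins.wait.add(slot)       # token mechanism: wait on the slot
                free_slots.append(slot)  # slot is reusable after the wait
            need = max(need, ready_at.get(reg, 0))
        if prev is not None and need > cycle:
            prev.delay += need - cycle   # delay mechanism: stall N cycles
            cycle = need
        if ins.dst is not None:
            if ins.latency is None:
                if not free_slots:
                    raise RuntimeError("out of barrier slots")
                ins.signal = free_slots.pop(0)
                pending[ins.dst] = ins.signal
                ready_at.pop(ins.dst, None)
            else:
                ready_at[ins.dst] = cycle + ins.latency
        prev = ins
        cycle += 1
    return instrs

prog = [
    Instr("tex",  "r0", ("r1",),      None),  # variable latency
    Instr("fadd", "r2", ("r3", "r4"), 4),     # made-up 4-cycle latency
    Instr("fmul", "r5", ("r2", "r0"), 4),     # needs both results
]
for ins in assign_deps(prog):
    print(ins.name, "delay", ins.delay, "signal", ins.signal, "wait", sorted(ins.wait))
```

In this example the `fadd` has its delay stretched so the `fmul` does not issue before `r2` is ready, while the `fmul` waits on the slot the `tex` signals. A serial mode like the `NAK_DEBUG=serial` mentioned earlier presumably sidesteps this analysis by waiting conservatively after every instruction, though that is a guess about its behavior.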
20:44fdobridge: <Mohamexiety> got it, thanks a lot! interesting.. and must be quite tricky to take care of in software
20:50fdobridge: <Mohamexiety> from my understanding, AMD has a mechanism similar to the first mechanism introduced in RDNA3 -- but it's a performance thing rather than a correctness thing. `s_delay_alu` being a cooperative yield instruction which signals to the frontend that you should yield/switch to instructions on other waves till the result of <instruction specified as an argument> is ready, otherwise your ALUs stall and can't execute instructions from other waves
21:09fdobridge: <gfxstrand> Yeah, pre-RDNA2 or so, AMD didn't have any instruction delay. All ALU type things took one logical(ish) cycle. That's actually not 100% true and there's details in there but the end effect was that you could put instructions back-to-back-to-back without any delay.
21:09fdobridge: <gfxstrand> That changed with RDNA2 or RDNA3
21:31fdobridge: <gfxstrand> `Pass: 311401, Fail: 917, Crash: 1167, Skip: 1673845, Flake: 45, Duration: 1:00:31`
21:33fdobridge: <DadSchoorse> RDNA1 made instruction latency matter, but switching waves happened automatically until RDNA3. s_delay_alu is purely an optimization though, the ALU just stalls if you omit it
21:51fdobridge: <Mohamexiety> I see, thanks! is there some material somewhere that explains this in detail? I am curious how they made the latency transparent in GCN (and how RDNA diverged). my initial guess is the layout of having 4 SIMD16s, where each SIMD block would be taking on the instructions of a particular wavefront over 4 cycles, thus helping to hide both execution and decode latency (i.e., you get 4 instructions/4 cycles) but not sure I get it correctly.
21:56fdobridge: <Esdras Tarsis> What's the result with codegen?
21:59fdobridge: <gfxstrand> I think this one is recentish
22:02fdobridge: <Mohamexiety> I ran a full CTS run a week or so ago actually
22:02fdobridge: <Mohamexiety> `Pass: 301529, Fail: 2122, Crash: 164, Skip: 1608852, Timeout: 2, Flake: 68, Duration: 47:32`
22:02fdobridge: <Mohamexiety> (GTX 1660)
22:21fdobridge: <Mohamexiety> huh, looking closely at the numbers.. NAK is pretty close to codegen. that's very interesting 😮
22:40fdobridge: <Esdras Tarsis> RIP codegen 2011-2023
22:48fdobridge: <gfxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1104177817815621812/PXL_20230505_224220224.jpg
22:49fdobridge: <Mohamexiety> ooooohhh, very very niice!!!