03:41 fdobridge: <g​fxstrand> Latest NAK numbers (CS only):
03:41 fdobridge: <g​fxstrand> `Pass: 311177, Fail: 1063, Crash: 1246, Skip: 1673845, Flake: 44, Duration: 1:01:48`
03:44 fdobridge: <g​fxstrand> I think next week's project is going to be fixing dependency analysis (I'm still running with `NAK_DEBUG=serial`) and then I should probably implement spilling.
03:56 fdobridge: <g​fxstrand> And.. Just fixed another 144 tests at least. (`nir_texop_txf_ms` was broken)
16:20 fdobridge: <S​amantas5855> why are so many skipped
16:20 fdobridge: <M​ohamexiety> unsupported/not-yet-advertised features, probably
16:23 fdobridge: <S​amantas5855> I see thx
16:23 fdobridge: <S​amantas5855> when is nak coming to the nvk branch
16:34 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> I wonder if NAK will be able to compile those pesky Minecraft shaders 🤔
17:38 fdobridge: <g​fxstrand> Soon, I think. IDK what the "It's time" point is but I think we're getting close. Probably as soon as I make dep tracking not total crap.
17:39 fdobridge: <g​fxstrand> `Pass: 311315, Fail: 924, Crash: 1246, Skip: 1673845, Flake: 45, Duration: 1:00:23`
17:39 fdobridge: <g​fxstrand> Unsupported features and formats. Formats make a lot of difference. Even a desktop driver that supports Vulkan 1.3 and all the fancy features will run at most 50% of the tests.
17:40 fdobridge: <g​fxstrand> So 20% is actually more like 40%
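The arithmetic behind the "20% is more like 40%" remark can be sanity-checked against the CS-only run quoted above. The chat figures are round numbers; this sketch just shows the shape of the calculation (the ~50% ceiling is the claim from the message above, not a measured value):

```rust
// Back-of-the-envelope check using the CS-only CTS numbers quoted above.
fn main() {
    let pass: u64 = 311_177;
    let fail: u64 = 1_063;
    let crash: u64 = 1_246;
    let flake: u64 = 44;
    let skip: u64 = 1_673_845;

    let ran = pass + fail + crash + flake;
    let total = ran + skip;
    let ran_pct = 100.0 * ran as f64 / total as f64;
    // Even a desktop driver with full Vulkan 1.3 support runs at most
    // ~50% of the CTS (per the message above), so normalize against
    // that ceiling to get the "effective" coverage.
    let effective_pct = ran_pct / 0.5;
    println!(
        "ran {:.1}% of all tests, ~{:.1}% of what any driver could run",
        ran_pct, effective_pct
    );
}
```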
17:42 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> How hard would it be to skip zero tests? 🍩
17:47 fdobridge: <M​ohamexiety> probably not a good question, sorry, but what sort of dependency tracking do you need to do? since the hardware does track and handle dependencies too from what I understand, so what does the compiler need to do vs what the HW already does?
17:51 fdobridge: <S​amantas5855> thx, what about flake
17:53 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> "NO FAILURES, NO FLAKES, it just passe-" - zmike, 2022
20:03 fdobridge: <g​fxstrand> I don't think there exists hardware on which that is possible. Lavapipe could maybe do it in theory because it's not hardware and we can implement everything.
20:04 fdobridge: <g​fxstrand> The hardware doesn't, not on Turing and later.
20:04 fdobridge: <g​fxstrand> Flakes are bugs we need to fix. They're just not 100% reproducible. (edited)
20:05 fdobridge: <M​ohamexiety> oh 😮 I see.
20:05 fdobridge: <M​ohamexiety> so all sorts of dependencies need to be handled entirely by the compiler in this case.. thanks!
20:06 fdobridge: <S​amantas5855> I see thanks
20:23 fdobridge: <g​fxstrand> Yup. There's two mechanisms for it. One is a delay where an instruction can say "Wait at least N cycles before executing the next instruction." The other is a token mechanism where one instruction signals a barrier slot when it's done and other instructions can wait on those barrier slots before they execute. This is needed for things like texture instructions which take a variable amount of time to execute depending on memory, caching, etc
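The two mechanisms described above can be sketched as data structures. This is a hypothetical illustration, not NAK's actual representation; the slot numbering, instruction names, and trace format are invented for the example:

```rust
// Sketch of the two dependency mechanisms: fixed cycle delays for
// predictable-latency instructions, and barrier-slot tokens for
// variable-latency ones like texture fetches.

#[derive(Clone, Copy)]
enum Dep {
    /// Wait at least this many cycles before issuing.
    Delay(u8),
    /// Wait until a producer signals this barrier slot.
    Barrier(u8),
}

struct Instr {
    name: &'static str,
    waits: Vec<Dep>,     // dependencies to satisfy before issue
    signals: Option<u8>, // barrier slot signaled on completion, if any
}

/// Produce a human-readable issue trace for a straight-line program.
fn schedule(prog: &[Instr]) -> Vec<String> {
    let mut trace = Vec::new();
    for i in prog {
        for w in &i.waits {
            match w {
                Dep::Delay(n) => trace.push(format!("stall {} cycles", n)),
                Dep::Barrier(s) => trace.push(format!("wait slot {}", s)),
            }
        }
        match i.signals {
            Some(s) => trace.push(format!("issue {} (signals slot {})", i.name, s)),
            None => trace.push(format!("issue {}", i.name)),
        }
    }
    trace
}

fn main() {
    let prog = [
        // Texture fetch: latency depends on memory/caching, so its
        // consumer waits on a token rather than a fixed delay.
        Instr { name: "tex", waits: vec![], signals: Some(0) },
        Instr { name: "fadd", waits: vec![Dep::Delay(4)], signals: None },
        Instr { name: "fmul", waits: vec![Dep::Barrier(0)], signals: None },
    ];
    for line in schedule(&prog) {
        println!("{}", line);
    }
}
```

The key design point is that the ALU op gets a compile-time-known delay, while the texture consumer must wait on the runtime token because its latency can't be predicted.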
20:44 fdobridge: <M​ohamexiety> got it, thanks a lot! interesting.. and must be quite tricky to take care of in software
20:50 fdobridge: <M​ohamexiety> from my understanding, AMD introduced a mechanism similar to the first one in RDNA3 -- but as a performance thing rather than a correctness thing. `s_delay_alu` is a cooperative yield instruction that signals to the frontend that it should switch to instructions from other waves until the result of <instruction specified as an argument> is ready; otherwise the ALUs stall and can't execute instructions from other waves
21:09 fdobridge: <g​fxstrand> Yeah, pre-RDNA2 or so, AMD didn't have any instruction delay. All ALU-type things took one logical(ish) cycle. That's not 100% true and there are details in there, but the end effect was that you could put instructions back-to-back-to-back without any delay.
21:09 fdobridge: <g​fxstrand> That changed with RDNA2 or RDNA3
21:30 fdobridge: <D​adSchoorse> RDNA1 made instruction latency matter, but switching waves happened automatically until RDNA3. s_delay_alu is purely an optimization though; the ALU just stalls if you omit it
21:31 fdobridge: <g​fxstrand> `Pass: 311401, Fail: 917, Crash: 1167, Skip: 1673845, Flake: 45, Duration: 1:00:31`
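The contrast just described, yielding to another wave when a hint is present versus stalling the ALU when it is omitted, can be shown with a toy model. Everything here (the `Wave` struct, `pick_wave`, the cycle numbers) is invented for illustration and does not model real ISA behavior:

```rust
// Toy model of the s_delay_alu contrast: with a delay hint the frontend
// can switch to another wave while a result is pending; without it, the
// ALU simply stalls.

struct Wave {
    id: usize,
    ready_at: u64, // cycle at which this wave's pending result lands
}

/// Pick a wave to issue from at `cycle`. `hinted` models the presence
/// of a delay hint that lets the scheduler yield to other waves;
/// `None` means the ALU stalls this cycle.
fn pick_wave(waves: &[Wave], cycle: u64, hinted: bool) -> Option<usize> {
    if waves[0].ready_at <= cycle {
        return Some(waves[0].id); // current wave is ready: issue it
    }
    if hinted {
        // Hint present: yield to any wave whose operands are ready.
        waves.iter().find(|w| w.ready_at <= cycle).map(|w| w.id)
    } else {
        None // no hint: the ALU stalls until wave 0's result lands
    }
}

fn main() {
    let waves = [
        Wave { id: 0, ready_at: 10 },
        Wave { id: 1, ready_at: 0 },
    ];
    // At cycle 5, wave 0 is blocked: stall without a hint, yield with one.
    assert_eq!(pick_wave(&waves, 5, false), None);
    assert_eq!(pick_wave(&waves, 5, true), Some(1));
    println!("ok");
}
```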
21:51 fdobridge: <M​ohamexiety> I see, thanks! is there some material somewhere that explains this in detail? I am curious how they made the latency transparent in GCN (and how RDNA diverged). my initial guess is the layout of having 4 SIMD16s, where each SIMD block would be taking on the instructions of a particular wavefront over 4 cycles, thus helping to hide both execution and decode latency (i.e., you get 4 instructions/4 cycles), but I'm not sure I've got it right.
21:56 fdobridge: <E​sdras Tarsis> What's the result with codegen?
21:59 fdobridge: <g​fxstrand> I think this one is recentish
22:02 fdobridge: <M​ohamexiety> I ran a full CTS run a week or so ago actually
22:02 fdobridge: <M​ohamexiety> `Pass: 301529, Fail: 2122, Crash: 164, Skip: 1608852, Timeout: 2, Flake: 68, Duration: 47:32`
22:02 fdobridge: <M​ohamexiety> (GTX 1660)
22:21 fdobridge: <M​ohamexiety> huh, looking closely at the numbers.. NAK is pretty close to codegen. that's very interesting 😮
22:40 fdobridge: <E​sdras Tarsis> RIP codegen 2011-2023
22:48 fdobridge: <g​fxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1104177817815621812/PXL_20230505_224220224.jpg
22:49 fdobridge: <M​ohamexiety> ooooohhh, very very niice!!!