00:02fdobridge: <airlied> I assume you are trying vkcube 🙂
00:10fdobridge: <airlied> ah vkcube needs fddx 😛
00:11fdobridge: <airlied> you are slower at the vulkan samples gears than codegen 😛
00:13fdobridge: <airlied> for gears you just need https://gitlab.freedesktop.org/airlied/mesa/-/commit/85ba2b776164044dd4df2beff7968bffae453d99
00:35fdobridge: <gfxstrand> Nope
02:49fdobridge: <gfxstrand> @marysaka I'm doing a CTS run on the nvk/main branch in my gitlab right now.
02:49fdobridge: <gfxstrand> It's got a bunch of fixes that at least get meta mostly working
02:49fdobridge: <gfxstrand> A few patches from Dave are in there but not everything
03:14fdobridge: <gfxstrand> `Pass: 17973, Fail: 1017, Crash: 991, Warn: 1, Skip: 157428, Flake: 90, Duration: 29:25, Remaining: 9:27:33`
03:14fdobridge: <gfxstrand> Anyone want to take bets on whether or not my machine survives? 😂
03:23DemiMarie: gfxstrand: maybe try KASAN + KUBSAN + KCSAN next time you do that? that way if you do crash, you know if there is a kernel memory unsafety hole that needs to be fixed.
03:28fdobridge: <airlied> yeah don't do that, it would take 3 days
03:42Armanelgtron: is there no reclocking support on tesla/G86M currently?
04:11fdobridge: <gfxstrand> Fixed a few more things. I'm now getting an estimated duration of 4 hours, not 10. 🥳
04:13fdobridge: <gfxstrand> Still horrible but we're making progress.
04:14fdobridge: <gfxstrand> And, no, I haven't tried vkcube yet. Thanks for asking. 😝
08:27karolherbst: Armanelgtron: uhm.. good question. There is limited support for Tesla in general, if it doesn't work for yours, then either this helps https://gitlab.freedesktop.org/drm/nouveau/-/merge_requests/23/diffs or it's not supported
08:27karolherbst: but I think it's not enabled for G8X
12:17cwabbott: gfxstrand: I assume making sure your delay pass handles delays across blocks is on the TODO list?
14:37gfxstrand: cwabbott: Yeah, it's on my mental ToDo list
14:38gfxstrand: cwabbott: Good news is that we don't need to worry about non-uniform texture loops on Turing+
14:38gfxstrand: But, yeah, we need to do better than "stall at edges"
14:38gfxstrand: cwabbott: That hardware also badly wants a scheduler
14:41karolherbst: what's like the perf difference on turing? I think on maxwell it was like a factor of 2
14:42karolherbst: uhh.. more even
16:22fdobridge: <gfxstrand> They both have a pipeline depth of 6
16:22fdobridge: <gfxstrand> So... yeah...
16:22fdobridge: <gfxstrand> That improves some with the re-use cache
16:23fdobridge: <gfxstrand> if I can figure out how to properly use it
16:23fdobridge: <gfxstrand> Oh, and 12 or 13 for predicates
16:23fdobridge: <gfxstrand> Which is awful
16:24fdobridge: <karolherbst🐧🦀> there is some wonkiness with predicates involved
16:24fdobridge: <karolherbst🐧🦀> also
16:24fdobridge: <karolherbst🐧🦀> some instructions don't have a fixed depth
16:24fdobridge: <karolherbst🐧🦀> some instructions can also read/write each source at a different time
16:25fdobridge: <karolherbst🐧🦀> it makes sense in those exceptions, but it's still a pain 😄
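[editor's note: the delay pass being discussed can be sketched roughly as follows. This is a toy scoreboard-style model, not NVK's actual pass; the opcodes and latency numbers are hypothetical illustration values, not real NVIDIA timings.]

```python
# Toy delay-insertion pass over one basic block: given instructions with
# fixed result latencies, compute how many cycles each instruction must
# stall so that every source register is ready when it is read.
# All latencies here are HYPOTHETICAL, chosen only for illustration.

LATENCIES = {"ffma": 6, "iadd": 4, "isetp": 13}  # hypothetical cycle counts

def insert_delays(block):
    """block: [(opcode, dst_reg, [src_regs]), ...].
    Returns a per-instruction list of stall cycles, assuming one
    instruction issues per cycle and a result written at issue time c
    becomes readable at c + LATENCIES[opcode]."""
    ready = {}    # register -> cycle at which its value becomes readable
    cycle = 0
    stalls = []
    for op, dst, srcs in block:
        need = max((ready.get(s, 0) for s in srcs), default=0)
        stall = max(0, need - cycle)
        cycle += stall                    # wait until all sources are ready
        stalls.append(stall)
        ready[dst] = cycle + LATENCIES[op]
        cycle += 1                        # the instruction itself issues
    return stalls
```

A conservative "stall at edges" policy, as mentioned above, amounts to assuming the worst-case `need` for every value live-in to a block instead of tracking readiness across block boundaries.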
18:50fdobridge: <gfxstrand> For those betting, the machine remained alive but had killed a number of cores, and the GPU was dead.
18:53fdobridge: <gfxstrand> `Pass: 398246, Fail: 5239, Crash: 1676, Warn: 3, Skip: 3195084, Timeout: 2, Flake: 633, Duration: 2:33:43`
18:53fdobridge: <gfxstrand> That's the run that just completed. Last night's run was way worse.
18:54fdobridge: <gfxstrand> Like 1k of the crashes are derivatives.
18:54fdobridge: <gfxstrand> And @marysaka has a patch, I just need to look through it.
19:00DemiMarie: airlied: I still think a KASAN+KCSAN+KUBSAN would be a good idea once the other crashes have been fixed
19:02fdobridge: <airlied> So get a card, build a kernel and go for it 🙂
19:06fdobridge: <airlied> Once vkcube runs, we should just ask if it runs starfield yet
19:15DemiMarie: airlied: did I miss something?
19:27HdkR: Starfield on Tegra X1 wen?
19:32DemiMarie:needs to bow out
19:38fdobridge: <mohamexiety> I'd hate to imagine duration with all these activated
20:06Mary: HdkR: will certainly need more RAM to even boot
20:06HdkR: True, 3GB is tough to fit in
21:30fdobridge: <mhenning> I'm not sure where the 6 cycle number comes from, but it doesn't match the volta whitepaper or the turing numbers from Jia et al, which both put FFMA at a 4 cycle latency.
21:31fdobridge: <gfxstrand> It comes from codegen.
21:32fdobridge: <gfxstrand> There's a lot more detail to this mess that we need to figure out or get docs on.
21:32fdobridge: <mhenning> Oh, sorry, the codegen latencies are probably wrong after pascal. Whether an instruction is variable/fixed latency is probably correct though
21:33fdobridge: <gfxstrand> I'm reasonably confident in the functional correctness of my pass now, given accurate information. However, my model of input/output delays is likely inaccurate.
21:34fdobridge: <gfxstrand> We need to do some very carefully targeted tests to get more accurate numbers.
21:34fdobridge: <gfxstrand> Might be able to hijack the blob compiler for some of it
21:41fdobridge: <mhenning> Jia et al have their code available for determining some of this, although I haven't looked at it in enough detail to know if it's worth extending and it assumes you'll run the programs with the blob https://github.com/sjfeng1999/gpu-arch-microbenchmark/tree/master
21:42fdobridge: <mhenning> We definitely need more detail than just what they have in their paper though
21:56Armanelgtron: karolherbst: is there any particular reason it's not enabled? i see (device->chipset >= 0x94)
22:18fdobridge: <gfxstrand> Yeah, a lot of that is focused on memory, not registers. I'm also not finding any interesting information on cycle counts, scanning through the code on my phone.
22:26fdobridge: <mohamexiety> https://arxiv.org/pdf/1905.08778.pdf there's this paper for latencies
22:27fdobridge: <mohamexiety> overall microbenchmarking these could be really fun. hmm..
23:53fdobridge: <mhenning> Yeah, the one I'm looking at is https://arxiv.org/pdf/1903.07486v1.pdf but there are a few different papers along those lines
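[editor's note: the microbenchmarks in those papers typically time a long chain of back-to-back dependent instructions between two `clock()` reads; the slope of cycles versus chain length gives the per-instruction latency and the intercept the fixed measurement overhead. The arithmetic can be sketched as below; the cycle counts in the test are hypothetical stand-ins for real GPU `clock()` readings.]

```python
# Dependent-chain latency extraction, as used by Jia et al.-style
# microbenchmarks: fit measured cycle counts against chain length.
# The input numbers are assumed to come from a GPU kernel that wraps
# a chain of n dependent ops (e.g. FFMAs) in clock() reads.

def chain_latency(samples):
    """samples: [(chain_length, measured_cycles), ...], two or more
    chain lengths. Returns (latency_per_op, fixed_overhead) from a
    two-point fit on the first and last sample."""
    (n0, c0), (n1, c1) = samples[0], samples[-1]
    latency = (c1 - c0) / (n1 - n0)   # cycles added per extra dependent op
    overhead = c0 - latency * n0      # clock() and loop overhead
    return latency, overhead
```

Using several chain lengths and checking that the points actually lie on a line is what distinguishes a fixed-latency instruction from the variable-latency ones mentioned earlier.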