00:47 fdobridge: <a​irlied> finally confirming it, gsp on turing has fewer flakes than non-gsp in a cts run, now it hits other problems, but the flakes are definitely decreased
04:42 fdobridge: <g​fxstrand> Yeah, IDK if those numbers are reliable. That's the problem.
04:43 fdobridge: <g​fxstrand> I think what I want to do is improve my NAK unit test framework and add a bunch of very targeted latency tests.
04:43 fdobridge: <g​fxstrand> Basically, chain a bunch of the same instruction together in a way that we can detect if any one of them doesn't complete in time. Then we lower the latency until the test fails.
04:44 fdobridge: <g​fxstrand> By doing it in terms of NAK tests, we can guarantee complete control over the generated output.
04:51 fdobridge: <g​fxstrand> It would also make it really easy to run the tests on new hardware and verify that latencies are the same on chip X vs. chip Y.
04:51 fdobridge: <g​fxstrand> Like, I wouldn't at all be surprised if cheaper GPUs have higher latencies or something.
04:51 fdobridge: <g​fxstrand> Being able to easily verify that would be really nice.
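The probe described above could be driven by a loop like this; `run_chain` is a hypothetical callback (not part of NAK today) that would assemble and execute the chained-instruction shader with a candidate latency and report whether every result was correct:

```python
# Hedged sketch of the latency probe: start from a known-safe latency and
# keep lowering it while the chained-instruction test still passes.
# `run_chain(latency)` is a hypothetical hook; in a real NAK unit test it
# would emit N copies of the same instruction back-to-back with the given
# scheduling latency and check the final value on-GPU.
def min_passing_latency(run_chain, max_latency=15):
    """Return the smallest latency at which the chained-instruction
    test still produces correct results."""
    latency = max_latency
    while latency > 0 and run_chain(latency - 1):
        latency -= 1
    return latency
```

With a mock such as `run_chain = lambda lat: lat >= 6`, the loop walks down from 15 and stops at 6, which would be the hardware's true latency for that instruction. Running the same harness on two chips makes the chip-X-vs-chip-Y comparison a simple diff of the returned numbers.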
08:38 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> https://gitlab.freedesktop.org/nouveau/wiki/-/commit/0a04fc0391e383485b441b4da64814d0b7700a49 🧓
13:40 fdobridge: <g​fxstrand> `Pass: 402169, Fail: 1553, Crash: 1679, Warn: 3, Skip: 3195085, Timeout: 2, Flake: 392, Duration: 2:05:3`
15:02 fdobridge: <g​fxstrand> Still need to figure out why MSAA stencil resolves don't work.
15:46 fdobridge: <m​henning> CUDA's nvcc lets you generate device code for specific groups of gpus, which at least gives us an upper bound on how many different variants there are https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list
15:46 fdobridge: <m​henning> but yeah, we probably do want our own instruction latency tests
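For reference, the per-architecture grouping mhenning mentions is expressed through nvcc's `-gencode` flags; a minimal sketch (the helper name and the chosen arch numbers are illustrative, the `arch=compute_XX,code=sm_XX` syntax is from the linked nvcc docs):

```python
# Build nvcc -gencode flags for a set of SM architectures, following the
# "arch=compute_XX,code=sm_XX" syntax documented in the nvcc GPU feature
# list. The function name and arch numbers here are illustrative only.
def gencode_flags(archs):
    return [f"-gencode=arch=compute_{a},code=sm_{a}" for a in archs]

# e.g. Turing (sm_75), Ampere (sm_86), Ada (sm_89):
flags = gencode_flags([75, 86, 89])
```

Each distinct `sm_XX` target nvcc accepts is one potential set of instruction latencies to characterize, which is what bounds the number of variants.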
16:18 fdobridge: <k​arolherbst🐧🦀> mhhh, does CUDA's tooling allow optimizing shaders?
16:18 fdobridge: <k​arolherbst🐧🦀> wondering if we could let it "optimize" the sched opcodes of shaders where we currently just emit the default ones
16:21 benjaminl: are the sched opcodes emitted by the cuda tools not optimal?
16:22 benjaminl: my memory was that the people working on maxas figured out the latencies by looking at the cuda-generated stuff
16:54 HdkR: If cuda didn't generate optimal code then I would assume there would be some very angry people :P
18:24 karolherbst: it might be that maxas stuff was just "good enough" and cuda was more conservative with generating sched opcodes, but....
18:24 karolherbst: I think the differences were small
18:25 karolherbst: but my idea was mostly to just feed elf binaries with "disabled" sched opcodes to nvidia's tooling and verify we'd calculate the same sched opcodes
18:26 karolherbst: gfxstrand: ^^ definitely something we might want to try, as we could potentially run that in CI without needing a GPU
18:27 karolherbst: alternatively we could also compile ptx to SASS, but that could be... unreliable
18:27 karolherbst: and won't cover everything
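A sketch of what the comparison step could look like, assuming the 21-bit per-instruction control layout that maxas documents for Maxwell/Pascal (the field names and bit widths are from that documentation; the helper functions themselves are hypothetical):

```python
# Decode one 21-bit per-instruction control field, using the Maxwell/Pascal
# layout documented by maxas: stall(4) | yield(1) | write barrier(3) |
# read barrier(3) | wait barrier mask(6) | register reuse(4), LSB first.
def decode_ctrl(bits):
    return {
        "stall":      bits & 0xF,
        "yield":     (bits >> 4) & 0x1,
        "write_bar": (bits >> 5) & 0x7,
        "read_bar":  (bits >> 8) & 0x7,
        "wait_mask": (bits >> 11) & 0x3F,
        "reuse":     (bits >> 17) & 0xF,
    }

def diff_sched(ours, theirs):
    """Compare two instruction-aligned lists of control fields (ours vs.
    NVIDIA's tooling) and return the indices that disagree."""
    return [i for i, (a, b) in enumerate(zip(ours, theirs))
            if decode_ctrl(a) != decode_ctrl(b)]
```

The idea being: disassemble the SASS that NVIDIA's tooling produced from our "disabled"-sched binary, extract its control words, and diff them against the ones NAK would have calculated, entirely offline.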
20:40 fdobridge: <m​ohamexiety> What work would that involve? And how would it differ from e.g. a standard CUDA/PTX microbench?
20:48 gfxstrand: karolherbst: That's an interesting idea....
20:48 gfxstrand: karolherbst: IDK if we have a big enough shader corpus but it might work.
20:48 karolherbst: at least it would be a good starting point
20:48 karolherbst: even if we only use it for 80-90% of the instructions