00:47fdobridge: <airlied> finally confirming it, gsp on turing has less flakes than non-gsp in a cts run, now it hits other problems, but the flakes are definitely decreased
04:42fdobridge: <gfxstrand> Yeah, IDK if those numbers are reliable. That's the problem.
04:43fdobridge: <gfxstrand> I think what I want to do is improve my NAK unit test framework and add a bunch of very targeted latency tests.
04:43fdobridge: <gfxstrand> Basically, chain a bunch of the same instruction together in a way that we can detect if any one of them doesn't complete in time. Then we lower the latency until the test fails.
04:44fdobridge: <gfxstrand> By doing it in terms of NAK tests, we can guarantee complete control over the generated output.
04:51fdobridge: <gfxstrand> It would also make it really easy to run the tests on new hardware and verify that latencies are the same on chip X vs. chip Y.
04:51fdobridge: <gfxstrand> Like, I wouldn't at all be surprised if cheaper GPUs have higher latencies or something.
04:51fdobridge: <gfxstrand> Being able to easily verify that would be really nice.
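The probe gfxstrand describes above can be sketched in a few lines. This is a hypothetical harness, not NAK code: `run_chain(delay)` stands in for "emit a chain of the same instruction with a fixed scheduling delay, run it on the GPU, and report whether every instruction completed in time".

```python
# Sketch of the latency-probe idea (all names hypothetical): lower the
# per-instruction delay until the chain misbehaves; the smallest delay
# that still passes is the instruction's true latency.
def find_min_latency(run_chain, max_delay=15):
    """run_chain(delay) -> True if every instruction in the chain
    completed in time with that many stall cycles between them."""
    for delay in range(max_delay, 0, -1):
        if not run_chain(delay):
            return delay + 1  # last delay that still passed
    return 1  # even a single stall cycle was enough

# Stand-in for executing a NAK-generated shader: pretend this
# instruction needs 6 cycles between dependent uses.
assert find_min_latency(lambda d: d >= 6) == 6
```

Running the same harness on two chips and diffing the resulting tables is exactly the "chip X vs. chip Y" comparison mentioned above.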
08:38fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> https://gitlab.freedesktop.org/nouveau/wiki/-/commit/0a04fc0391e383485b441b4da64814d0b7700a49 🧓
13:40fdobridge: <gfxstrand> `Pass: 402169, Fail: 1553, Crash: 1679, Warn: 3, Skip: 3195085, Timeout: 2, Flake: 392, Duration: 2:05:3`
15:02fdobridge: <gfxstrand> Still need to figure out why MSAA stencil resolves don't work.
15:46fdobridge: <mhenning> CUDA's nvcc lets you generate device code for specific groups of gpus, which at least gives us an upper bound on how many different variants there are https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list
15:46fdobridge: <mhenning> but yeah, we probably do want our own instruction latency tests
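For reference, the nvcc feature the link above describes is driven by `-gencode` flags, one per targeted architecture; a small helper makes the shape of the invocation clear (the helper itself is hypothetical, the flag syntax is from the nvcc docs):

```python
# Build nvcc -gencode flags that emit SASS for specific GPU
# generations: compute_XX is the virtual (PTX) arch, sm_XX the
# real (SASS) arch.
def nvcc_gencode_args(sm_versions):
    args = []
    for sm in sm_versions:
        args += ["-gencode", f"arch=compute_{sm},code=sm_{sm}"]
    return args

# e.g. target Turing (sm_75) and Ampere (sm_86):
print(" ".join(["nvcc"] + nvcc_gencode_args([75, 86]) + ["kernel.cu"]))
```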
16:18fdobridge: <karolherbst🐧🦀> mhhh, does CUDA's tooling allow optimizing shaders?
16:18fdobridge: <karolherbst🐧🦀> wondering if we could let it "optimize" the sched opcodes of shaders where they're left at the defaults
16:21benjaminl: are the sched opcodes emitted by the cuda tools not optimal?
16:22benjaminl: my memory was that the people working on maxas figured out the latencies by looking at the cuda-generated stuff
16:54HdkR: If cuda didn't generate optimal code then I would assume there would be some very angry people :P
18:24karolherbst: it might be that the maxas stuff was just "good enough" and cuda was more conservative with generating sched opcodes, but....
18:24karolherbst: I think the differences were small
18:25karolherbst: but my idea was mostly to just feed elf binaries with "disabled" sched opcodes to nvidia's tooling and verify we'd calculate the same sched opcodes
18:26karolherbst: gfxstrand: ^^ definitely something we might want to try, as we could potentially run that in CI without needing a GPU
18:27karolherbst: alternatively we could also compile ptx to SASS, but that could be... unreliable
18:27karolherbst: and won't cover everything
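The cross-check karolherbst describes boils down to a diff over per-instruction scheduling controls. A minimal sketch, assuming the sched info has been decoded into `(stall, read_barrier, write_barrier)` tuples on both sides (the decoding itself, and the helper name, are hypothetical):

```python
# Compare our scheduler's output against NVIDIA's for the same
# shader: return every instruction index where the sched controls
# (stall count, read barrier, write barrier) disagree.
def sched_mismatches(ours, theirs):
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(ours, theirs))
        if a != b
    ]

# Mock sched tuples for a three-instruction shader:
ours = [(6, None, 0), (1, 0, None), (4, None, None)]
theirs = [(6, None, 0), (2, 0, None), (4, None, None)]
print(sched_mismatches(ours, theirs))  # instruction 1 disagrees on stall
```

An empty result means NAK's calculated sched opcodes match what NVIDIA's tooling filled in for the same instruction stream.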
20:40fdobridge: <mohamexiety> What work would that involve? And how would it differ from e.g. a standard CUDA/PTX microbench?
20:48gfxstrand: karolherbst: That's an interesting idea....
20:48gfxstrand: karolherbst: IDK if we have a big enough shader corpus but it might work.
20:48karolherbst: at least it would be a good starting point
20:48karolherbst: even if we only use it for 80-90% of the instructions