00:37gfxstrand[d]: That's fine
00:56gfxstrand[d]: I think fp64 is slightly broken on Volta. I'm seeing some odd fails. Makes sense, I guess, since I think Volta might be one of the weird ones where it's kinda fixed-latency.
01:10mohamexiety[d]: yeah Volta is all GV100 so it had the full FP64
01:43gfxstrand[d]: Yeah, but dependency tokens should still work. 🤷🏻♀️
03:05mhenning[d]: gfxstrand[d]: I thought they were explicitly ignored for fp64 on fixed-latency chips?
03:11mhenning[d]: snowycoder[d]: I'm not really sure what you're asking, but the GPUs require us to schedule instructions for them and track, e.g., which variable-latency instructions wait on others
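For readers unfamiliar with the idea, here is a heavily simplified sketch of what "track which variable-latency instructions wait on others" can look like in a compiler. It is not the NAK or nouveau implementation. The `Instr` type, `assign_tokens`, and the token count of 6 are hypothetical. The sketch only handles read-after-write and write-after-write hazards on variable-latency results. It ignores read tokens (WAR), fixed-latency stall counts, and any per-generation encoding details.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical: number of dependency tokens available per warp.
NUM_TOKENS = 6


@dataclass
class Instr:
    name: str
    dsts: List[str]
    srcs: List[str]
    variable_latency: bool = False
    wr_token: Optional[int] = None  # token signalled when this instr's result lands
    wait_mask: int = 0              # bitmask of tokens to wait on before issuing


def assign_tokens(instrs: List[Instr], num_tokens: int = NUM_TOKENS) -> None:
    pending = {}                    # register -> token of its in-flight writer
    free = set(range(num_tokens))   # tokens not currently tracking anything

    def wait_on(ins: Instr, tokens: set) -> None:
        # Make `ins` wait on these tokens; afterwards their results are ready,
        # so the registers stop being pending and the tokens can be reused.
        for tok in tokens:
            ins.wait_mask |= 1 << tok
        for reg in [r for r, t in pending.items() if t in tokens]:
            del pending[reg]
        free.update(tokens)

    for ins in instrs:
        # Reading a pending result (RAW) or overwriting it (WAW) needs a wait.
        wait_on(ins, {pending[r] for r in ins.srcs + ins.dsts if r in pending})
        if ins.variable_latency and ins.dsts:
            if not free:
                # Out of tokens: conservatively drain everything in flight.
                wait_on(ins, set(pending.values()))
            tok = min(free)
            free.remove(tok)
            ins.wr_token = tok
            for r in ins.dsts:
                pending[r] = tok
```

Usage: for `ld r0` and `tex r2` (both variable-latency) followed by `add r4 = r0 + r5` and `mul r6 = r2 * r4`, the `add` ends up waiting on the load's token and the `mul` on the texture's token. The fixed-latency `add -> mul` dependency is left to stall counts, which this sketch does not model.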
11:42snowycoder[d]: mhenning[d]: Ah, so it's Fermi that has a hardware scoreboard.
11:42snowycoder[d]: The scheduling data for each instruction (sched) has a 0x1f bitmask for the delay, but what are the higher 3 bits used for?
11:42snowycoder[d]: (They're set in `SchedDataCalculator::setDelay`, but it's not really clear what the encoding is)
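To make the question concrete: a 0x1f delay mask plus "the higher 3 bits" implies an 8-bit sched field per instruction. The helpers below (`split_sched` and `pack_sched` are hypothetical names) only split and rebuild that byte. They deliberately do not assign any meaning to the upper 3 bits, which is exactly what is being asked; see the envytools notes linked below for what is known.

```python
from typing import Tuple

DELAY_MASK = 0x1f  # low 5 bits: delay, as described above
HIGH_MASK = 0x7    # the 3 remaining upper bits; meaning not decoded here


def split_sched(sched: int) -> Tuple[int, int]:
    """Split an 8-bit sched value into (delay, upper 3 bits)."""
    if not 0 <= sched <= 0xff:
        raise ValueError("sched data is assumed to be one byte")
    return sched & DELAY_MASK, sched >> 5


def pack_sched(delay: int, high: int) -> int:
    """Inverse of split_sched: place `high` in bits 5..7 above the delay."""
    if not 0 <= delay <= DELAY_MASK:
        raise ValueError("delay must fit in 5 bits")
    if not 0 <= high <= HIGH_MASK:
        raise ValueError("upper field must fit in 3 bits")
    return (high << 5) | delay
```

For example, `split_sched(0x25)` gives a delay of 5 with the upper field set to 1, and `pack_sched(5, 1)` returns `0x25` again.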
15:47mhenning[d]: snowycoder[d]: Some of that is explained here: https://envytools.readthedocs.io/en/latest/hw/graph/fermi/cuda/isa.html#notes-about-scheduling-data-and-dual-issue-on-gk104
15:52mhenning[d]: This paper also talks about some details https://www.researchgate.net/publication/313020335_Understanding_the_GPU_Microarchitecture_to_Achieve_Bare-Metal_Performance_Tuning