00:07 airlied[d]: marysaka[d]: did you test on turing? it fails a few 8/32 linear tests
00:07 airlied[d]: dEQP-VK.compute.pipeline.cooperative_matrix.khr_a.subgroupscope.matrixmuladd_cross.uint8_uint32.buffer.rowmajor.linear,Fail
00:07 airlied[d]: dEQP-VK.compute.pipeline.cooperative_matrix.khr_a.subgroupscope.matrixmuladd_cross.uint8_uint32.buffer.colmajor.linear,Fail
00:07 airlied[d]: dEQP-VK.compute.pipeline.cooperative_matrix.khr_a.subgroupscope.matrixmuladd_cross.sint8_sint32.buffer.rowmajor.linear,Fail
00:07 airlied[d]: dEQP-VK.compute.pipeline.cooperative_matrix.khr_a.subgroupscope.matrixmuladd_cross.sint8_sint32.buffer.colmajor.linear,Fail
00:08 airlied[d]: (a bunch more as well)
00:38 airlied[d]: though we're probably going to want float16 support for this as well, aren't we?
01:10 airlied[d]: gfxstrand[d]: any reason our maxBufferSize is lower than the prop driver?
01:10 airlied[d]: theirs seems to be (1ULL << 40) - 0x10000
01:17 airlied[d]: marysaka[d]: oh I see those in the MR already, sorry for noise
01:19 airlied[d]: do we have a float16 support issue?
01:30 mhenning[d]: airlied[d]: I think there's a mesa-wide 32-bit ssbo limit
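For scale, the two limits mentioned above can be compared numerically. This is only a sketch: the chat does not state the exact value of the 32-bit SSBO limit, so `u32_max` below is an assumption (the largest unsigned 32-bit value), and both variable names are illustrative.

```python
# Value quoted in the chat for the proprietary driver's maxBufferSize.
prop_max = (1 << 40) - 0x10000

# ASSUMPTION: a "32-bit limit" is taken here as the largest unsigned 32-bit value.
u32_max = (1 << 32) - 1

print(prop_max)             # size in bytes
print(prop_max // u32_max)  # roughly how many times larger
```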
02:22 airlied[d]: ah I'm failing to read, only turing doesn't have 16-bit floats? any idea what's needed?
03:23 mhenning[d]: airlied[d]: 16-bit floats in general got disabled on turing due to some cts failures that nobody's tracked down yet
03:28 HdkR: The spicy FTZ/DAZ behaviour strikes again
03:30 mhenning[d]: see https://gitlab.freedesktop.org/mesa/mesa/-/issues/11038
03:31 mhenning[d]: HdkR: Oh? I'm not aware of it being ftz/daz related
03:32 HdkR: Making a wild guess just because the behaviour of that changes constantly on NVIDIA stuff :D
06:17 airlied[d]: don't think HdkR is entirely wrong
06:18 airlied[d]: <Text>ERROR: Sub-case #28 flushToZero:0 failed, inputs: 0xfc00;0x7c00 output: 0x0 expected output: 0x3c00</Text>
06:18 airlied[d]: <Text>ERROR: Sub-case #28 flushToZero:1 failed, inputs: 0xfc00;0x7c00 output: 0x0 expected output: 0x3c00</Text>
06:18 airlied[d]: but serial does fix it
06:25 airlied[d]: ah well might dig a bit more tomorrow, seems strange it passes for a lot of other values before it fails
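The hex values in the failing sub-case are raw IEEE 754 half-precision bit patterns. A minimal sketch for decoding them with Python's `struct` module follows; the helper name `half_from_bits` is made up for illustration, and the snippet says nothing about which operation the test actually performs.

```python
import struct

def half_from_bits(bits: int) -> float:
    """Reinterpret a 16-bit integer as an IEEE 754 half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

# Inputs, actual output and expected output from the quoted CTS error lines.
for bits in (0xfc00, 0x7c00, 0x0, 0x3c00):
    print(hex(bits), half_from_bits(bits))
```

Decoded, the inputs are -inf and +inf, the expected result is 1.0, and the reported result is 0.0.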
06:34 HdkR: :D
07:26 gfxstrand[d]: airlied[d]: Not sure. There's a limit on SSBOs but Nvidia kinda has that anyway
07:27 marysaka[d]: airlied[d]: I did reenable fp16 to test it on coop and it was working with the CTS at least but we really need to do something to fix scheduling on Turing...
07:29 marysaka[d]: airlied[d]: no worries, if you have any ideas on why those tests behave like that I would love to hear 😄
07:29 marysaka[d]: It tests mixed signed and unsigned, while I never expose this currently (but could support it)
07:29 marysaka[d]: NVIDIA plainly always uses unsigned in those cases and passes those tests, and I also pass those tests if I only respect the type of A
07:34 airlied[d]: marysaka[d]: also https://github.com/jeffbolznv/vk_cooperative_matrix_perf has a correctness mode that might provide some more coverage beyond cts
07:35 marysaka[d]: oh I should give that a try then thanks!
09:36 demonkingofsalvation[d]: Might not be the best place to ask this, but is it possible to undervolt an RTX 3050 Ti Mobile while using mesa? If not, is there anything that I could do to help the possible development of this?
21:56 airlied[d]: gfxstrand[d]: karolherbst[d] mhenning[d] is there any summary of nvidia instruction scheduling anywhere, like the delay/rd/wr/wt stuff?
21:57 gfxstrand[d]: LMAO 🤣
21:58 airlied[d]: worth a try 😛
22:00 karolherbst[d]: well
22:00 airlied[d]: so the code in codegen/nak is the current best documented summary of how it works 🙂
22:00 karolherbst[d]: do you want one from nvidia or one from somebody else?
22:01 karolherbst[d]: because there is one, just it has uhm... flaws
22:01 karolherbst[d]: but codegen has the same one 🙃
22:01 karolherbst[d]: and probably nak as well
22:01 airlied[d]: I suppose if the NDA stuff covers the basics it might be worth reading, I just want to have some idea of wtf any of it means 🙂
22:02 karolherbst[d]: somebody tried to RE that stuff here: https://github.com/NervanaSystems/maxas
22:02 karolherbst[d]: airlied[d]: actually you are under the same NDA lol
22:02 airlied[d]: and this fp16 turing stuff seems like a good place to start
22:02 karolherbst[d]: do you have that nvidia partner stuff set up?
22:02 airlied[d]: yeah maybe just fwd me it internally 😛
22:02 karolherbst[d]: yeah.. let me drop it
22:06 karolherbst[d]: done
22:11 redsheep[d]: Also, this may or may not have what you are looking for but there's also this: https://github.com/kuterd/nv_isa_solver
22:11 redsheep[d]: And this info it generated, at least for ada: https://kuterdinel.com/nv_isa_sm89/
22:12 redsheep[d]: I don't think I see instruction scheduling specifically but it's probably still useful
22:12 redsheep[d]: You could probably run it on turing and learn something
22:12 mhenning[d]: airlied[d]: Section 2.1 of https://arxiv.org/pdf/1903.07486 has some reverse-engineered info
23:36 airlied[d]: okay now I think I have sensible questions, I get the calc instr deps iterates backwards but I'm not sure how that ensures the correct delays on subsequent instructions
23:37 airlied[d]: r0 = hadd2 -rZ |ur6| // delay=1 wt=000011 wr:0
23:37 airlied[d]: r1 = hadd2 -rZ.xx |ur8.xx| // delay=6 wt=010000 wr:1
23:37 airlied[d]: r1 = prmt r1 [0x5410] r0 // delay=1 wt=000011
23:37 airlied[d]: the prmt delay is wrong there
23:37 airlied[d]: but I'm not sure how the current algorithm can tell that
23:38 airlied[d]: we don't do anything for src reguse::reads
23:42 mhenning[d]: Why do you think the prmt delay is wrong? I believe the delay happens after the instruction, so it would depend on what happens after the fragment you pasted
23:43 airlied[d]: oh maybe that is what I'm missing then, I need the delays after the hadds to be longer, thanks
23:44 airlied[d]: indeed hacking those fixes it, okay now to work out how to do it nicer 🙂
23:45 airlied[d]: okay I have hacky dEQP-VK.spirv_assembly.instruction.compute.float16.* Passed: 426/426 (100.0%)
23:46 mhenning[d]: It might be as simple as needing to increase the value returned by raw_latency for those instructions
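To make the backward iteration concrete, here is a toy sketch of a backward delay-assignment pass. It is not the NAK or codegen implementation: `assign_delays`, the `(dst, srcs)` tuple format, and the single fixed `latency` value are all assumptions for illustration, and scoreboard barriers (rd/wr/wt) are ignored. It shows why the delay ends up on the producer and why a larger latency lengthens that delay.

```python
def assign_delays(instrs, latency):
    """Toy backward pass. instrs is a list of (dst, srcs) tuples.

    Time is counted in reverse: 0 is the end of the block and values
    grow toward the start. Returns one delay per instruction.
    """
    delays = [0] * len(instrs)
    next_read = {}  # register -> reverse-time of the nearest later reader
    t = 0           # reverse-time of the instruction that follows
    for i in range(len(instrs) - 1, -1, -1):
        dst, srcs = instrs[i]
        start = t + 1  # always at least one cycle before the next instruction
        if dst in next_read:
            # The result must be ready `latency` cycles before its reader.
            start = max(start, next_read[dst] + latency)
        delays[i] = start - t  # the stall happens AFTER this instruction
        next_read.pop(dst, None)  # earlier writers of dst are hidden by this one
        for src in srcs:
            next_read[src] = start
        t = start
    return delays

# Same register shape as the pasted fragment: two writes, then a read of both.
fragment = [("r0", ()), ("r1", ()), ("r1", ("r1", "r0"))]
print(assign_delays(fragment, latency=6))  # latency value is hypothetical
print(assign_delays(fragment, latency=8))
```

With the hypothetical latency of 6 this toy model yields delays of 1, 6, 1 for the three instructions, and raising the latency to 8 changes only the delay on the second write.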
23:47 gfxstrand[d]: Yeah, delay is "wait N cycles before moving on". That took me a bit to grok too but it makes a lot of sense.
23:48 gfxstrand[d]: Essentially it lets the frontend know how much stalling needs to happen before it parses the next instruction.
23:49 airlied[d]: I've just hacked the instr_latency values higher, but I think we probably could do a bit better than that
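The "wait N cycles before moving on" description can be sketched as a small issue-time model. Everything here is hypothetical: the `Instr` fields, the latency numbers, and the helper names are made up for illustration and are not taken from NAK or from real Turing timings. The model only covers fixed-latency dependencies and ignores the scoreboard barriers.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    dst: str
    srcs: tuple
    delay: int    # cycles the frontend stalls AFTER issuing this instruction
    latency: int  # hypothetical cycles until dst can be read

def issue_cycles(instrs):
    """Cycle at which each instruction issues, given the delays before it."""
    cycles, t = [], 0
    for ins in instrs:
        cycles.append(t)
        t += ins.delay
    return cycles

def raw_hazards(instrs):
    """List (reader, register) pairs where a source is read before it is ready."""
    ready = {}  # register -> cycle at which its latest value becomes readable
    hazards = []
    for ins, t in zip(instrs, issue_cycles(instrs)):
        for src in ins.srcs:
            if ready.get(src, 0) > t:
                hazards.append((ins.name, src))
        ready[ins.dst] = t + ins.latency
    return hazards

too_short = [Instr("producer", "r0", (), 1, 6),
             Instr("consumer", "r1", ("r0",), 1, 1)]
long_enough = [Instr("producer", "r0", (), 6, 6),
               Instr("consumer", "r1", ("r0",), 1, 1)]

print(raw_hazards(too_short))
print(raw_hazards(long_enough))
```

In this model the consumer's own delay never matters for its inputs; only the delay on the producer decides whether the consumer issues too early.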