03:22gfxstrand[d]: Got 64-bit integer division. Turns out `isetp.x` on Maxwell takes the accumulator into account somehow (I haven't bothered to figure out how) and it was picking up stray accumulator values from 64-bit adds. I replaced the 3-instruction `iadd64` pattern with a different 3-instruction pattern which doesn't use `iadd3.x` and that fixed it. I've also added piles of asserts to make sure this doesn't happen again.
03:27HdkR: Doesn't the .x explicitly enable carry usage?
03:32HdkR: Weirdo arch that can mix and match CF and predicates :P
10:22karolherbst[d]: yeah.. previous gens have a carry flag which is all implicit
10:28HdkR: It was deleted and it won't be missed
16:17gfxstrand[d]: HdkR: Yes, but I didn't know that at the time. 😝
16:18gfxstrand[d]: I was just like "Ooh! Here's this flag that we have on Turing but it seems to be a tiny bit different" and ran with it.
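For readers following the carry discussion: a minimal sketch of one way to do a 64-bit add from 32-bit halves in three operations without relying on an implicit carry/accumulator flag. The chat does not say which replacement pattern NAK actually uses, so the function name and the compare-based carry recovery here are assumptions for illustration only.

```python
MASK32 = 0xFFFFFFFF

def iadd64_no_carry_flag(a_lo, a_hi, b_lo, b_hi):
    """Hypothetical helper: add two 64-bit values given as 32-bit halves.

    No hidden carry state is written or read, so nothing stale can leak
    into a later instruction that happens to consume a carry flag.
    """
    lo = (a_lo + b_lo) & MASK32          # op 1: wrapping 32-bit add of the low halves
    carry = 1 if lo < a_lo else 0        # op 2: unsigned compare recovers the carry explicitly
    hi = (a_hi + b_hi + carry) & MASK32  # op 3: add the high halves plus the explicit carry
    return lo, hi
```

The carry lives in an ordinary value (a predicate on real hardware) rather than in implicit state, which is the property the fix described above is after.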
16:39gfxstrand[d]: Ugh... `f2f.ftz.f16.f32` doesn't actually flush denorms pre-Volta. 😭
17:02gfxstrand[d]: Okay, that
17:03gfxstrand[d]: Bigger issue: `f2f16(f2f32(f64))` is not the same as `f2f16(f64)` when you care about round-even details. 😭
17:09gfxstrand[d]: Maybe `f2f16.re(f2f32.rz(f64))` will work?
17:37triang3l[d]: gfxstrand[d]: 0.5 f16ulp + 1 f64ulp would end up 0 rather than 1 in this case, I think
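The double-rounding case mentioned here can be reproduced on a CPU. This is a sketch, assuming Python's `struct` formats `"e"`, `"f"` and `"d"` as stand-ins for round-to-nearest-even conversions to f16, f32 and f64; the helper names mirror the chat's notation and are not real NAK or hardware calls. The round-toward-zero helper is only valid for values inside the normal f32 range.

```python
import struct

def f2f32(x):
    """Round a Python float (f64) to f32 with round-to-nearest-even."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def f2f16(x):
    """Round a Python float (f64) to f16 with round-to-nearest-even."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def f2f32_rz(x):
    """Truncate an f64 to f32 precision (round toward zero).

    Assumption: x is within the normal f32 range, so clearing the low
    52 - 23 = 29 mantissa bits is sufficient.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits &= ~((1 << 29) - 1)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# 1.0 plus half an f16 ULP plus one f64 ULP: just above an f16 tie.
x = 1.0 + 2.0**-11 + 2.0**-52

direct = f2f16(x)                # sees the extra f64 ULP, rounds up
via_f32 = f2f16(f2f32(x))        # f32 step drops the ULP, leaving an exact tie
via_f32_rz = f2f16(f2f32_rz(x))  # truncation drops it too, same tie

print(direct, via_f32, via_f32_rz)
```

Both two-step paths land on the even neighbour `1.0`, while the direct conversion gives `1.0009765625`, which is the "0 rather than 1" outcome described above.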
17:42triang3l[d]: Manual comparison of `mantissa[51:10] + mantissa[9]` maybe should work
17:52triang3l[d]: not comparison to be more precise, but `+ BITFIELD64_MASK(52 - 10) + ((number >> (52 - 10)) & 1)` before truncating
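A sketch of the add-then-truncate idea on the raw f64 bits. It uses the conventional round-to-nearest-even bias (half a ULP minus one, plus the kept LSB); that constant is an assumption and may not match the exact expression in the message above. The function name is hypothetical, and it only handles inputs whose exponent falls in the normal f16 range (no denorms, infinities or NaNs on input).

```python
import struct

def f64_to_f16_bits_normal(x):
    """Hypothetical helper: convert f64 to f16 bits with round-to-nearest-even."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exp = ((bits >> 52) & 0x7FF) - 1023 + 15   # rebias f64 exponent to f16
    assert 1 <= exp <= 30, "only the normal f16 range is handled in this sketch"
    mant = bits & ((1 << 52) - 1)

    shift = 52 - 10                            # mantissa bits being discarded
    lsb = (mant >> shift) & 1                  # lowest mantissa bit that is kept
    rounded = mant + ((1 << (shift - 1)) - 1) + lsb

    # A mantissa overflow carries into the exponent field, which is the
    # correct result (next binade, or infinity at the top of the range).
    mag = (exp << 10) + (rounded >> shift)
    return (sign << 15) | mag

def reference_f16_bits(x):
    """Reference result using Python's own f64 -> f16 conversion."""
    return struct.unpack("<H", struct.pack("<e", x))[0]
```

Comparing against `reference_f16_bits` on a few tie and near-tie inputs is a cheap way to sanity-check the bias constant before porting the idea to shader code.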
21:48gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30402
21:48gfxstrand[d]: Marge has been assigned
21:48gfxstrand[d]: TBD how quick I'll delete codegen support
21:56gfxstrand[d]: And... https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30403
21:56gfxstrand[d]: Not sure how quick I'll merge that one but it
22:54mohamexiety[d]: wow, that is impressive... awesome work! ❤️
22:56karolherbst[d]: the next step would be to nuke codegen from mesa altogether, but I don't think anybody wants to be that insane to add NAK support for all the older gens
22:57karolherbst[d]: especially pre kepler ISAs are pretty cursed
22:59gfxstrand[d]: Meh. I'm happy to let the old GL driver be. If people want NAK for GL, we have Zink.
22:59karolherbst[d]: yeah, fair
23:00gfxstrand[d]: I might wire up a CL driver for it just for grins at some point.
23:00karolherbst[d]: ~~and if I want to nuke code from mesa, there is always clover~~
23:00karolherbst[d]: yeah...
23:00karolherbst[d]: CL support would be my main motivation for it, because codegen is really not up to the task
23:04gfxstrand[d]: I'm debating whether or not I care about NVK+Kepler sans NAK. Most of Kepler NAK should be pretty easy, especially now that we can unit test things. The one hard part is images but that doesn't work with codegen anyway and I pity the person who tries to make it work.
23:05karolherbst[d]: the annoying part is that kepler has two different ISAs :blobcatnotlikethis:
23:06gfxstrand[d]: Ugh
23:06karolherbst[d]: and they are like... super different
23:06karolherbst[d]: the older one (fermi + 1st gen kepler) has like 63 registers max, where 2nd gen kepler has 255 :ferrisUpsideDown:
23:06karolherbst[d]: and quite some details are also different
23:06karolherbst[d]: sync is also different compared to maxwell afaik
23:07karolherbst[d]: it's a flag on the instruction
23:07karolherbst[d]: and then it syncs post execution or so
23:07karolherbst[d]: it's kinda weird
23:07mohamexiety[d]: wasn't second gen kepler exclusive to datacenter only?
23:08karolherbst[d]: at least they got rid of 32 bit instructions, which are still supported on fermi, but were removed with kepler
23:08karolherbst[d]: even though it's the same isa
23:08karolherbst[d]: mohamexiety[d]: nah
23:08karolherbst[d]: 780 is 2nd gen
23:08mohamexiety[d]: ....oh
23:08karolherbst[d]: and the gk20x
23:08karolherbst[d]: which is like 910?
23:08karolherbst[d]: ehh 710
23:09karolherbst[d]: `GK208` is quite common actually
23:09karolherbst[d]: low power GPU
23:09karolherbst[d]: GT 730 as well
23:22gfxstrand[d]: As long as the overall theory of operation isn't massively different from other gens, both should be implementable. It's just a lot of work for questionable benefit.
23:23karolherbst[d]: yeah...
23:23gfxstrand[d]: Not that there's much benefit to Maxwell given the reclocking issues
23:23karolherbst[d]: well.. 1st gen maxwell can be reclocked, but those are just mid-end (well low-end by today's standards) GPUs sadly
23:23airlied[d]: do we see zink/cl being really bad here?
23:24airlied[d]: like if rusticl/zink/nak can work, seems less effort than a gallium/nak driver
23:26gfxstrand[d]: 🤷🏻‍♀️
23:27gfxstrand[d]: The point of a direct rusticl driver would be to make it easier to implement features like SVM.
23:27airlied[d]: I think making mesa specific vulkan extensions would also be a good path there
23:27gfxstrand[d]: Yeah
23:27gfxstrand[d]: No argument there
23:28airlied[d]: like it's nice to have khronos consensus, but it's also nice to get shit done occasionally 🙂
23:28gfxstrand[d]: Yeah
23:31karolherbst[d]: I already have a passing run of the CTS on top of radv
23:32karolherbst[d]: probably should test nvk as well
23:32gfxstrand[d]: Have you ever run on NVK?
23:32karolherbst[d]: not yet
23:32karolherbst[d]: I should plug in an nvidia gpu and do that probably
23:32gfxstrand[d]: But I want to get all the CL stuff working on NAK directly
23:33karolherbst[d]: mhhh, yeah maybe
23:33karolherbst[d]: the conversion stuff might be interesting to wire up for real for once
23:34gfxstrand[d]: Yeah, Nvidia can do most of it pretty nicely
23:34airlied[d]: just expose CL spir-v over vulkan fastpath 🙂
23:35gfxstrand[d]: Yeah, that may be what I do
23:35gfxstrand[d]: vkDispatchKernel
23:36gfxstrand[d]: VkKernel
23:36airlied[d]: yeah something like the cuda launcher
23:36karolherbst[d]: the details will be a bit annoying tho
23:36karolherbst[d]: like.. the inputs need to be bound somehow
23:37karolherbst[d]: but maybe the ext would just explain how the inputs are put into push constants or so
23:37airlied[d]: https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkCudaLaunchInfoNV.html
23:37karolherbst[d]: though vulkan push constants are maybe tooooo small?
23:38karolherbst[d]: mhhh
23:38karolherbst[d]: I mean...
23:38karolherbst[d]: sure, you can just push CL objects into the params...
23:38karolherbst[d]: anyway, the interesting part, the parameters, is just "check out cuda and do that"
23:48gfxstrand[d]: Yeah, I don't mind just taking the args as an array parameter to vkDispatchKernel
23:48karolherbst[d]: the problem is, that CL args are kinda terrible
23:48gfxstrand[d]: Yes
23:49karolherbst[d]: there is a `cl_ext_bda` ext now where you just pass in pointers, but uhhh...
23:49karolherbst[d]: though maybe the Vk ext could just say "for memory objects use bda and pass in the pointer"
23:50karolherbst[d]: and then things would kinda work out maybe
23:50karolherbst[d]: maybe not, dunno
23:52gfxstrand[d]: Yup
23:52gfxstrand[d]: I'm happy to go all in on BDA
23:52karolherbst[d]: yeah, zink requires bda anyway, so....
23:52gfxstrand[d]: Images are kinda awkward, though.
23:52karolherbst[d]: not using bda is just a massive pain
23:52karolherbst[d]: how so?
23:53gfxstrand[d]: Maybe pass VkImageView?
23:53karolherbst[d]: mhhh
23:53karolherbst[d]: probably?
23:53karolherbst[d]: read_only images in CL are really sampled images tho I think?
23:53karolherbst[d]: with an explicit sampler
23:53gfxstrand[d]: Yeah
23:53karolherbst[d]: not sure how that's called in vk
23:53karolherbst[d]: and there are also samplerless images :ferrisUpsideDown:
23:54gfxstrand[d]: That should be fine
23:54karolherbst[d]: mhh... how are we handling those anyway... txl vs txf, right?
23:54gfxstrand[d]: Yup
23:54karolherbst[d]: right...
23:55karolherbst[d]: I've also finally enabled `read_write` images
23:55karolherbst[d]: ohh one thing... CL also kinda requires formatted load/stores, but I'm sure the ext could express that as well
23:56karolherbst[d]: `shaderStorageImageReadWithoutFormat` for read_write images and `shaderStorageImageWriteWithoutFormat` for write images
23:56karolherbst[d]: though
23:57karolherbst[d]: I think for nvidia it would be fine without `shaderStorageImageReadWithoutFormat`...
23:57karolherbst[d]: the lowering at least wouldn't be terrible
23:58karolherbst[d]: but yeah...
23:58karolherbst[d]: maybe passing in the objects/address directly like in CL is actually the better approach here, because then you can still dedup/nuke dead parameters under the hood without promising a specific layout for the input buffer (which most of the time will be either push constants or an ubo)
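To make the last point concrete: a sketch of a driver-private argument layout where dead kernel parameters get no slot and the backing storage is chosen afterwards. Everything here (`KernelArg`, `build_layout`, `choose_storage`) is hypothetical; no such Vulkan extension or Mesa API exists in this form. The 128-byte figure is the minimum `maxPushConstantsSize` that Vulkan guarantees; real devices may report more.

```python
from dataclasses import dataclass

@dataclass
class KernelArg:
    name: str
    size: int    # size in bytes; assumed to be a power of two
    live: bool   # False if the compiler proved the argument is unused

def build_layout(args):
    """Assign naturally aligned offsets to live arguments only."""
    layout = {}
    offset = 0
    for arg in args:
        if not arg.live:
            continue  # dead parameter: no slot, the app never needs to know
        offset = (offset + arg.size - 1) // arg.size * arg.size
        layout[arg.name] = offset
        offset += arg.size
    return layout, offset

def choose_storage(total_size, max_push_constants=128):
    """Pick push constants when the packed args fit, else fall back to a UBO."""
    return "push_constants" if total_size <= max_push_constants else "ubo"

# Memory objects are passed as 8-byte device addresses (BDA), scalars by value.
args = [
    KernelArg("src", 8, live=True),
    KernelArg("unused", 4, live=False),
    KernelArg("n", 4, live=True),
    KernelArg("dst", 8, live=True),
]
layout, total = build_layout(args)
storage = choose_storage(total)
```

Because the application only hands over per-argument values, the layout and the storage choice stay free to change between driver versions or hardware generations.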