00:01mhenning[d]: loryruta[d]: Maybe try denorm preserve?
00:01sonicadvance1[d]: If you want bit-accurate results, you're going to at some point need to reconcile spec language like `The maximum error is less than or equal to 1.5 * 2^-12 times the true reciprocal.`
00:04dwfreed: "float is a hot mess" full stop :)
00:16loryruta[d]: sonicadvance1[d]: yes, but how come _everything_ else i tested behaves roughly the same (roughly, as there's still a bit of numerical instability)
00:17loryruta[d]: and nvidia on desktop behaves that different!?
00:17loryruta[d]: 🤷‍♂️
00:17loryruta[d]: float is a hot mess but they're making it messier
00:20sonicadvance1[d]: It's a reason why online games that need to sync state between PCs all used fixed-point math (Except when they don't and we get to laugh when it breaks)
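As a minimal sketch of the fixed-point idea mentioned above: values are stored as integers scaled by a power of two, so every machine computes exactly the same bits. The Q16.16 layout and the helper names (`to_fixed`, `fx_mul`, `to_float`) are assumptions for illustration, not taken from any particular game or engine.

```python
FRAC_BITS = 16          # assumed Q16.16 layout: 16 fractional bits
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to fixed point, rounding to the nearest step."""
    return int(round(x * ONE))


def fx_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values; integer-only, so fully deterministic."""
    # Python's >> floors, so negative products round toward -infinity.
    return (a * b) >> FRAC_BITS


def to_float(a: int) -> float:
    """Convert back to float for display only."""
    return a / ONE


print(to_float(fx_mul(to_fixed(1.5), to_fixed(2.0))))
```

Only integer multiplies and shifts take part in the state update, so there is no rounding mode, denorm handling, or fused multiply-add for implementations to disagree on.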
00:45karolherbst[d]: sonicadvance1[d]: well CL defines accuracy in terms of ULPs for the builtins at least
00:46sonicadvance1[d]: I definitely enjoy ULPs over..whatever the heck the x86 manuals do
00:47karolherbst[d]: anyway, I can only recommend verifying each calculation and check where the error is significantly different and check if it's within API limits or not
00:49karolherbst[d]: those kind of inaccuracies are generally a result of more aggressive optimizations, and nvidia is pretty aggressive in that area
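One way to do the per-calculation check suggested above is to measure how many float32 ULPs apart two results are and compare that against the limit the API documents for the operation. The sketch below uses only the Python standard library; the function names are hypothetical, and it does not handle NaN or infinities.

```python
import struct


def f32_bits(x: float) -> int:
    """Round x to float32 and return its bit pattern as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ordered(bits: int) -> int:
    """Map a float32 bit pattern to an integer that grows with the float value."""
    # Negative floats have the sign bit set; mirror them below zero so that
    # -0.0 and +0.0 both map to 0 and adjacent floats differ by exactly 1.
    return -(bits & 0x7FFFFFFF) if bits & 0x80000000 else bits


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float32 values between a and b (NaN not handled)."""
    return abs(ordered(f32_bits(a)) - ordered(f32_bits(b)))


print(ulp_distance(1.0, 1.0 + 2.0 ** -23))
```

Dumping intermediate values from both drivers and running them through something like `ulp_distance` shows which operation first exceeds its allowed error, rather than only seeing the final divergence.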
09:32loryruta[d]: karolherbst[d]: i’m wondering if they can be disabled though
09:32loryruta[d]: if not through the vulkan api, by compiling the spv to ptx and sending it to the driver
10:04karolherbst[d]: yeah, they have private options for that
10:05karolherbst[d]: loryruta[d]: https://registry.khronos.org/OpenCL/extensions/nv/cl_nv_compiler_options.txt
10:06karolherbst[d]: well on the cl side that is
11:26phomes_[d]: gfxstrand[d]: I am not sure how to answer that. The swapchain image format is VK_FORMAT_B8G8R8A8_UNORM. Is that the right thing to check?
11:46pac85[d]: loryruta[d]: Well it's up to you to write algos such that rounding errors don't accumulate. You can model the error and its propagation and rearrange ops to avoid them. Unless you mean that optimizations make that impossible.
12:49karolherbst[d]: con 32 %137 = iadd %136 (0x7f), %135
12:49karolherbst[d]: con 32 %139 = ushr %137, %138 (0x7)
12:49karolherbst[d]: It's kinda impressive how many things I'm finding that could be optimized more with range analysis/uub
12:49karolherbst[d]: ehh wait.. that would be illegal to simplify..
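The log does not say which simplification was being considered. As a sketch of why a known upper bound matters for this `iadd` + `ushr` pattern in general: with 32-bit wraparound, `(x + 0x7f) >> 7` only equals "divide by 128, rounding up" while the add cannot overflow. The helper names below are hypothetical and simply emulate 32-bit unsigned arithmetic.

```python
MASK32 = 0xFFFFFFFF


def iadd32(a: int, b: int) -> int:
    """32-bit integer add with wraparound."""
    return (a + b) & MASK32


def ushr32(a: int, shift: int) -> int:
    """32-bit logical (unsigned) shift right."""
    return (a & MASK32) >> (shift & 31)


def add_then_shift(x: int) -> int:
    """The pattern from the log: ushr(iadd(0x7f, x), 7)."""
    return ushr32(iadd32(0x7F, x), 7)


def ceil_div128(x: int) -> int:
    """Exact round-up division by 128, without wraparound."""
    return -(-x // 128)


for x in (0, 128, 129, 0xFFFFFFFF - 0x7F, 0xFFFFFFFF):
    print(hex(x), add_then_shift(x), ceil_div128(x))
```

The two functions agree for every `x` up to `0xFFFFFFFF - 0x7F` and disagree once the add wraps, which is the kind of fact a range analysis or unsigned upper bound has to establish before a rewrite is legal.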
14:05gfxstrand[d]: phomes_[d]: If you can break in `wsi_drm_configure_image()`, you can follow it and see what path it takes. (Or use printf): https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/vulkan/wsi/wsi_common_drm.c?ref_type=heads#L909
15:13phomes_[d]: it takes `chain->blit.type == WSI_SWAPCHAIN_BUFFER_BLIT`
15:40gfxstrand[d]: Okay. So yeah, if that doesn't actually change whether or not we get direct flips then we're taking an extra blit for no reason.
15:41gfxstrand[d]: But also, why are we setting ignore suboptimal?
15:47phomes_[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24818
17:00loryruta[d]: pac85[d]: Yes, likely I'd have to rewrite the algorithm to make it work with relaxed float computations that are apparently legal by spir-v spec and that nv driver is using ...
17:01loryruta[d]: nv driver results are off by a 0.0001
17:02loryruta[d]: which i believe to be a lot and this is what's causing the alg to diverge
17:11loryruta[d]: ok i think i found the culprit... i have a point where values get very small
17:11loryruta[d]: and denorm preserve isn't present 🙁
17:13karolherbst[d]: that would do it
17:14karolherbst[d]: nvidia has per instruction controls to preserve denorms and you can use float_controls2 to preserve them globally in spirv
17:14loryruta[d]: mhenning[d]: ^ that was likely it 🙂
17:15loryruta[d]: karolherbst[d]: ok, have to translate this in slang terms.. shaderDenormPreserveFloat32 is false... hence I can't set them at entrypoint level in spv
17:18karolherbst[d]: loryruta[d]: ahhh.. that's annoying
17:18loryruta[d]: yes...
17:18loryruta[d]: nvidia has been way too annoying this weekend
17:18loryruta[d]: ```glsl
17:18loryruta[d]: #ifdef NVIDIA
17:18loryruta[d]: typealias my_float = double;
17:18loryruta[d]: #endif
17:18loryruta[d]: ```
17:18loryruta[d]: 😄
17:19karolherbst[d]: oof
17:19karolherbst[d]: that's gonna kill perf
17:20loryruta[d]: is there any other path i didn't try? 😅 as far as i understand it's this or my calculations are wrong
17:20loryruta[d]: oh well, i could buy an amd card
17:20karolherbst[d]: yeah, but like if slang can't preserve denorms, you probably don't want to use it, if your stuff relies on the precision there
17:21loryruta[d]: it's a general purpose kernel, the value which breaks the computation is actually an input value that is supplied externally
17:21loryruta[d]: and is on the 1e-5 scale, and is multiplied with values computed within the kernel
17:22loryruta[d]: so if i want to keep the kernel general-purpose, i'd like to have the best precision
17:22loryruta[d]: for context: it's a computation node of a backward graph in a ML training network
17:22loryruta[d]: the input is a gradient (the gradient of a mean operation, which is 1/N)
17:23loryruta[d]: what I noted is: `(1/N) * dm/dalpha * dm/dbeta` leads to a final value of -0.6
17:23loryruta[d]: and on everything else I get -0.8
17:24loryruta[d]: while `dm/dalpha * dm/dbeta` (skipping 1/N) gives me a similar value across all implementations
17:24loryruta[d]: so i'm pretty sure it's the denorm thing
17:25loryruta[d]: (I was blaming driver optimizations...)
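To illustrate the effect being discussed, here is a minimal CPU-side sketch that emulates float32 multiplication with and without flushing denormals to zero. The input values are made up for illustration and are not the ones from the kernel above. Note that `1e-5` is itself a normal float32; flushing only affects operands or results whose magnitude falls below about `1.18e-38` (`2**-126`), so it is the intermediate products that matter.

```python
import struct

F32_MIN_NORMAL = 2.0 ** -126  # smallest normal float32 magnitude


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ftz(x: float) -> float:
    """Flush a denormal float32 value to zero (sign ignored for simplicity)."""
    return 0.0 if 0.0 < abs(x) < F32_MIN_NORMAL else x


def mul(a: float, b: float, flush: bool) -> float:
    """float32 multiply; with flush=True, denormal inputs and results become 0."""
    a, b = f32(a), f32(b)
    if flush:
        a, b = ftz(a), ftz(b)
    # The product of two float32 values is exact in a double, so rounding it
    # once with f32() gives the correctly rounded float32 result.
    r = f32(a * b)
    return ftz(r) if flush else r


# Hypothetical values: a small upstream gradient times two tiny local terms.
inv_n, d_alpha, d_beta = 1e-5, 1e-20, 1e-17

preserved = mul(mul(inv_n, d_alpha, False), d_beta, False)
flushed = mul(mul(inv_n, d_alpha, True), d_beta, True)
print(preserved, flushed)
```

With denormals preserved the product survives as a tiny nonzero value; with flushing it becomes exactly zero, so any later rescaling or accumulation sees a different input. Reordering the multiplies so intermediates stay in the normal range is one way to reduce the dependence on denorm handling.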
17:38karolherbst[d]: yeah well.. denorms are flushed on use
17:39karolherbst[d]: and as a result
17:39karolherbst[d]: it kinda depends
17:39karolherbst[d]: and your only option is to disable denorm flushing then
17:45loryruta[d]: karolherbst[d]: i can't 🥲
17:46loryruta[d]: shaderDenormPreserveFloat32 is false on my device
17:48karolherbst[d]: use NVK 🙃
17:50karolherbst[d]: like the coop matrix stuff isn't _as fast_ but it's fast enough with nvk
17:51karolherbst[d]: but yeah.. dunno what to do about denorms on nvidia's driver. Ask them to enable it? dunno
18:46saancreed[d]: Fwiw, <https://github.com/HansKristian-Work/vkd3d-proton/blob/master/libs/vkd3d/device.c#L9582-L9594>
19:09karolherbst[d]: that makes it even worse
22:49karolherbst[d]: perfection: `r12 = ld.global.a64.constant.b32 [r30+ur0..2+0x4], p0 // delay=1 wr:0`
22:55karolherbst[d]: okay.. seems like most of the regressions I'm seeing with the GPR + UGPR stuff are from the bug in the sched prepass