00:05 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> @karolherbst Here's the log of the failed submission: https://pastebin.com/KLPG2SK5
00:05 fdobridge: <k​arolherbst🐧🦀> @asdqueerfromeu mind also sharing the latest stuff in dmesg?
00:09 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> https://pastebin.com/fTBembfG
00:10 fdobridge: <g​fxstrand> That may cause you real problems, actually.
00:10 fdobridge: <g​fxstrand> If someone ever writes to that image, it could lead to any number of memory corruptions
00:11 fdobridge: <g​fxstrand> I wish that assert were a joke we could just remove. 😭
00:11 fdobridge: <k​arolherbst🐧🦀> ohh, so you think the GPU trashes the command buffers?
00:12 fdobridge: <k​arolherbst🐧🦀> that would be.... bad, yes
00:12 fdobridge: <g​fxstrand> It's certainly possible.
00:12 fdobridge: <k​arolherbst🐧🦀> I have a funky debug idea
00:12 fdobridge: <g​fxstrand> NAK gets rid of it but GL depends on it.
00:12 fdobridge: <k​arolherbst🐧🦀> snapshot the command buffer and validate its content is the same 🙃
00:12 fdobridge: <g​fxstrand> Might be another good use for an API flag.
00:13 fdobridge: <k​arolherbst🐧🦀> and whenever it changes, report it
00:13 fdobridge: <k​arolherbst🐧🦀> at least if the command buffer were corrupted on the CPU side, printing it should produce random garbage
00:13 fdobridge: <k​arolherbst🐧🦀> but it doesn't
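A minimal sketch of the debug idea above: copy the push buffer at submit time, then report every dword that differs later. The function names and sample words are made up for illustration, a `bytearray` stands in for the mapped buffer, and none of this is actual nouveau or NVK code.

```python
import struct

def snapshot(pushbuf: bytearray) -> bytes:
    """Take an immutable copy of the push buffer at submit time."""
    return bytes(pushbuf)

def changed_dwords(saved: bytes, pushbuf: bytearray):
    """Return (dword_index, old, new) for every 32-bit word that differs."""
    count = min(len(saved), len(pushbuf)) // 4
    old = struct.unpack_from(f"<{count}I", saved)
    new = struct.unpack_from(f"<{count}I", pushbuf)
    return [(i, o, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]

# Arbitrary sample contents, not a real method stream.
pushbuf = bytearray(struct.pack("<4I", 0x20010000, 0x0000007F, 0x20010004, 0x1))
saved = snapshot(pushbuf)
assert changed_dwords(saved, pushbuf) == []

# Simulate something scribbling over dword 1 after submission.
struct.pack_into("<I", pushbuf, 4, 0xDEADBEEF)
for index, old, new in changed_dwords(saved, pushbuf):
    print(f"dword {index}: {old:#010x} -> {new:#010x}")
```

In a driver the comparison would presumably run once the submission has completed, which is what would separate CPU-side corruption from the GPU writing into the buffer.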
00:13 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> I don't think NAK has proper support for non-compute shaders though yet
00:14 fdobridge: <k​arolherbst🐧🦀> `subc 7 mthd 2664 data 0000007f`....
00:15 fdobridge: <k​arolherbst🐧🦀> we don't use subc 7 at all 🙂
00:15 fdobridge: <k​arolherbst🐧🦀> anyway.. whatever happens, the CPU and GPU disagree on what got submitted
00:17 fdobridge: <g​fxstrand> @asdqueerfromeu nvk/codegen-11bit-image-hack
00:17 fdobridge: <k​arolherbst🐧🦀> why didn't it ever occur to me to ever verify that the GPU didn't trash the push buffers 🙃
00:18 fdobridge: <k​arolherbst🐧🦀> I should add a runtime flag for gl as well to detect those corruptions
00:19 fdobridge: <k​arolherbst🐧🦀> and track down all the tests hitting that
00:20 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> What issues did you encounter without the codegen part?
00:22 fdobridge: <k​arolherbst🐧🦀> 2d/3d handling in codegen is terrible
00:24 fdobridge: <k​arolherbst🐧🦀> but 2d slices of 3d images are also just pure evil
00:26 fdobridge: <g​fxstrand> Oh, just that we can't upstream it without totally breaking GL.
00:27 fdobridge: <g​fxstrand> Look at how NVK does it. 😝
00:27 fdobridge: <k​arolherbst🐧🦀> ~~we could revert that stuff~~
00:27 fdobridge: <k​arolherbst🐧🦀> vulkan has an explicit flag for it
00:28 fdobridge: <k​arolherbst🐧🦀> so yeah, disabling tiling on the z axis and then just binding the slice is a proper way of dealing with that in vulkan (I didn't check, but I assume that's what nvk does 😛 )
00:28 fdobridge: <k​arolherbst🐧🦀> and I think I even wrote some of the code? dunno 😄
00:28 fdobridge: <g​fxstrand> Yup
00:28 fdobridge: <k​arolherbst🐧🦀> in gl it's all madness
00:29 fdobridge: <g​fxstrand> But also, 3D only have 11 bits of Z and you have an extra 12 bits at the top of the 32 bit handle...
00:29 fdobridge: <g​fxstrand> You can pull the same stunt without restricting to 11 bits
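A rough sketch of the packing trick being described: carry the Z slice in the spare 12 bits at the top of the 32-bit handle. The split into a 20-bit index plus 12 bits of Z is inferred from the numbers above, and `pack_handle` / `unpack_handle` are hypothetical names, not codegen or NVK functions.

```python
Z_BITS = 12                 # spare bits at the top of the 32-bit handle
INDEX_BITS = 32 - Z_BITS    # assumption: the remaining low bits hold the index

def pack_handle(index: int, z: int) -> int:
    """Combine a handle index and a Z slice into one 32-bit value."""
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("handle index does not fit in the low bits")
    if not 0 <= z < (1 << Z_BITS):
        raise ValueError("slice does not fit in the spare bits")
    return (z << INDEX_BITS) | index

def unpack_handle(handle: int):
    """Split a packed handle back into (index, z)."""
    return handle & ((1 << INDEX_BITS) - 1), handle >> INDEX_BITS

handle = pack_handle(index=0x123, z=2047)   # 2047 is the largest 11-bit Z
print(hex(handle), unpack_handle(handle))
```

Since 11 bits of Z fit inside 12 spare bits, no slice value has to be rejected for 3D images.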
00:29 fdobridge: <k​arolherbst🐧🦀> fair
00:32 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> I'll have to try DXVK v1.10.3 to see if synchronization2/vulkanMemoryModel is the problem
00:34 fdobridge: <g​fxstrand> We can almost turn on sync2. I just haven't because it increases the test count so much for not really any extra coverage for us.
00:35 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> I remember the CTS tests failing like crazy (not sure for which extension though)
00:37 fdobridge: <g​fxstrand> That's quite possible. I did say "almost". 😝
00:49 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> But anyway besides the synchr2/memmodel faux enablement and EXT_memory_budget implementation patches I have no more feature patches in my local PKGBUILD 🐸
01:42 fdobridge: <a​irlied> we do submits with no pushes, probably shouldn't have submits with 0 length pushes
01:45 fdobridge: <a​irlied> guess next week will be retarget the MRs week 😛
01:48 fdobridge: <g​fxstrand> @karolherbst I think your codegen patch is okay
01:48 fdobridge: <g​fxstrand> We don't. We already filter those out
01:52 fdobridge: <g​fxstrand> Why am I faulting inside the kernel's memory range?
01:53 fdobridge: <g​fxstrand> This sounds like not my bug. 😛
01:56 fdobridge: <g​fxstrand> @airlied Where should I file suspected kernel bugs?
01:58 fdobridge: <g​fxstrand> Hrm... It fails on the old API
02:00 fdobridge: <a​irlied> the mailing list or maybe the nouveau issue tracker if you just want a place holder to track them
02:00 fdobridge: <a​irlied> if you suspect they are with the new uapi make sure to point dakr at them
02:18 fdobridge: <g​fxstrand> I'm going to have to hack on the CTS for this one. 😕
05:29 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> I think you forgot to mark certain features that are restricted to newer NVIDIA GPU architectures as such in the features.txt file (both anv and turnip already do that)
08:07 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> Also should I remove the synchr2/memmodel hacks in the AUR package? 🐸
09:31 fdobridge: <k​arolherbst🐧🦀> https://gitlab.freedesktop.org/drm/nouveau/-/issues
09:40 fdobridge: <k​arolherbst🐧🦀> though if anybody wants to get into nouveau kernel development, there is a good list of things to start with 🙃
09:41 fdobridge: <d​adschoorse> https://github.com/doitsujin/dxvk/pull/3603 I hope nak will also support ffmaz/fmulz when it gets merged
09:42 fdobridge: <k​arolherbst🐧🦀> isn't there a better way to check?
09:43 fdobridge: <d​adschoorse> no
09:43 fdobridge: <k​arolherbst🐧🦀> or is it about knowing if it's lowered or not?
09:43 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> This is probably one of the first NVK-specific changes outside of Mesa/kernel
09:43 fdobridge: <d​adschoorse> it's about knowing if the terrible emulation pattern that dxvk emits will be optimized to something that's fast
09:44 fdobridge: <k​arolherbst🐧🦀> ahh
09:44 fdobridge: <k​arolherbst🐧🦀> so the proper answer here would be a spir-v extension or something?
09:45 fdobridge: <d​adschoorse> yeah but do you want to put d3d9 garbage into spirv?
09:45 fdobridge: <k​arolherbst🐧🦀> it's one opcode
09:46 fdobridge: <k​arolherbst🐧🦀> there are already other weirdo things in spirv, I think it would be fine 😄
09:47 fdobridge: <d​adschoorse> it's two opcodes with fma
09:48 fdobridge: <k​arolherbst🐧🦀> could do it an evil way and make it a decoration instead
09:48 fdobridge: <d​adschoorse> but from past discussions there also was the question if it shouldn't be a per shader mode that allows intel to use their d3d9 hw
09:48 fdobridge: <k​arolherbst🐧🦀> ohh.. right
09:49 fdobridge: <k​arolherbst🐧🦀> so rather an execution mode
09:49 fdobridge: <k​arolherbst🐧🦀> ehh wait
09:49 fdobridge: <k​arolherbst🐧🦀> this other thing
09:50 fdobridge: <k​arolherbst🐧🦀> ahh no, execution mode might be fine
09:50 fdobridge: <k​arolherbst🐧🦀> why did they call it execution model and mode
09:50 fdobridge: <k​arolherbst🐧🦀> it's confusing
09:51 fdobridge: <d​adschoorse> anyway, just hard coding it in dxvk isn't too terrible imo, it works well for radv
09:51 fdobridge: <d​adschoorse> the only thing we would maybe gain with a two opcode extension is nvidia prop support
09:51 fdobridge: <k​arolherbst🐧🦀> yeah.. and a more reliable solution
09:51 fdobridge: <k​arolherbst🐧🦀> and intel
09:52 fdobridge: <k​arolherbst🐧🦀> if it's an execution mode instead
09:52 fdobridge: <d​adschoorse> the problem with the intel mode is that it affects a lot more opcodes than just mul/fma afaiu
09:52 fdobridge: <k​arolherbst🐧🦀> mhhhh
09:52 fdobridge: <k​arolherbst🐧🦀> would be good to know what exactly
09:53 fdobridge: <d​adschoorse> I think it also changes things like rcp(0)
09:53 fdobridge: <k​arolherbst🐧🦀> well.. that might be fine even, because rcp is already defined in a wonky way
09:54 fdobridge: <d​adschoorse> or pow(0,0)
09:54 fdobridge: <k​arolherbst🐧🦀> same for pow 😄
09:54 fdobridge: <k​arolherbst🐧🦀> I could imagine that those return 0 a bit more often than they would normally do, but that's still all within spec limits, no?
09:54 fdobridge: <d​adschoorse> yeah maybe it's fine but I don't want to be the guy advocating for it in khronos 🐸
09:54 fdobridge: <k​arolherbst🐧🦀> but isn't pow also implemented with rcp and mul?
09:55 fdobridge: <d​adschoorse> exp and mul
09:55 fdobridge: <k​arolherbst🐧🦀> right..
09:55 fdobridge: <k​arolherbst🐧🦀> well.. fmulz technically
09:55 fdobridge: <d​adschoorse> but not on all intel hw
09:55 fdobridge: <r​hed0x> We hit that with Alan wake, didn't we?
09:55 fdobridge: <k​arolherbst🐧🦀> nouveau lowers with fmulz because it violates the spec otherwise
09:55 fdobridge: <d​adschoorse> I think only on gen12+, they actually had hw pow before
09:58 fdobridge: <d​adschoorse> I think that was nrm
09:58 fdobridge: <r​hed0x> Oh, yeah right
09:58 fdobridge: <d​adschoorse> which uses rsq internally
10:00 fdobridge: <d​adschoorse> the other question is if accommodating intel is worth it, it sounds like they might drop the d3d9 mode soon
10:00 fdobridge: <d​adschoorse> maybe they add fmulz instead (I doubt it but it would be cool 🐸)
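For readers unfamiliar with the opcodes under discussion, here is a small model of the legacy multiply rule (a zero operand always gives zero, even against infinity or NaN) and of why it matters when pow is lowered to exp and mul. This is a sketch only: signed zeros are ignored, `ffmaz` is modeled unfused, `log2_hw` is a hypothetical stand-in for a hardware log2 that returns -inf at 0, and pow(0, 0) is just one convenient case where the two multiplies differ.

```python
import math

def fmulz(a: float, b: float) -> float:
    """Legacy multiply: a zero operand always yields zero."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return a * b

def ffmaz(a: float, b: float, c: float) -> float:
    """fmulz followed by an add (modeled unfused here)."""
    return fmulz(a, b) + c

def log2_hw(x: float) -> float:
    """log2 that returns -inf at 0 instead of raising (assumed behaviour)."""
    return -math.inf if x == 0.0 else math.log2(x)

def pow_lowered(x: float, y: float, mul) -> float:
    """pow(x, y) as exp2(y * log2(x)) with a pluggable multiply."""
    return 2.0 ** mul(y, log2_hw(x))

print(pow_lowered(0.0, 0.0, lambda a, b: a * b))  # nan: 0 * -inf
print(pow_lowered(0.0, 0.0, fmulz))               # 1.0: the product is forced to 0
```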
10:05 fdobridge: <k​arolherbst🐧🦀> I'm annoyed that some parts of spir-v are super poorly speced 😢
10:08 fdobridge: <d​adschoorse> I'm annoyed that vulkan still has no real fma
10:19 fdobridge: <k​arolherbst🐧🦀> right... that's also terrible 😄
10:20 fdobridge: <k​arolherbst🐧🦀> it will be needed for CL anyway
10:20 fdobridge: <k​arolherbst🐧🦀> that reminds me.. I should work on getting proper ffma support in.. uhhh
10:20 fdobridge: <k​arolherbst🐧🦀> that can of worms
10:20 fdobridge: <k​arolherbst🐧🦀> we'd have to fix nir first
10:21 fdobridge: <k​arolherbst🐧🦀> and add a `fmad` and make `ffma` a real `ffma`
10:21 fdobridge: <k​arolherbst🐧🦀> otherwise it's all pointless
10:21 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> What does `mad` mean in this case?
10:21 fdobridge: <k​arolherbst🐧🦀> unfused
10:21 fdobridge: <d​adschoorse> muladd
10:22 fdobridge: <d​adschoorse> umad?
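To make the fused/unfused distinction concrete, here is a float32 model in which the only difference is whether the product is rounded before the add. The helper names are made up, and the model ignores double-rounding corner cases, so treat it as an illustration rather than a bit-exact reference.

```python
import struct
from fractions import Fraction

def f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def fmad(a: float, b: float, c: float) -> float:
    """Unfused: the product is rounded to float32 before the add."""
    return f32(f32(a * b) + c)

def ffma(a: float, b: float, c: float) -> float:
    """Fused: a*b + c is formed exactly and rounded at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return f32(float(exact))

a = 1.0 + 2.0 ** -23        # smallest float32 above 1.0
c = -(1.0 + 2.0 ** -22)     # cancels the rounded product exactly

print(fmad(a, a, c))        # the 2**-46 term was lost when the product was rounded
print(ffma(a, a, c))        # the 2**-46 term survives
```

Here `a * a` is exactly 1 + 2^-22 + 2^-46. The unfused path drops the last term when rounding the product, so the sum cancels to zero, while the fused path keeps it.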
10:22 fdobridge: <k​arolherbst🐧🦀> it's a bit of a pain for opt_algebraic
10:22 fdobridge: <k​arolherbst🐧🦀> and it's a pain on so many levels
10:23 fdobridge: <d​adschoorse> are there backends that use ffma for unfused mad atm?
10:23 fdobridge: <k​arolherbst🐧🦀> yes
10:23 fdobridge: <k​arolherbst🐧🦀> nouveau does
10:23 fdobridge: <k​arolherbst🐧🦀> nvidia hardware only has one of them
10:23 fdobridge: <k​arolherbst🐧🦀> never both
10:23 fdobridge: <k​arolherbst🐧🦀> they flipped from fmad to ffma at one point
10:23 fdobridge: <d​adschoorse> does your compiler not fuse in the backend?
10:24 fdobridge: <k​arolherbst🐧🦀> it does
10:24 fdobridge: <k​arolherbst🐧🦀> well.. unless it's precise
10:24 fdobridge: <d​adschoorse> then you can just lower ffma on hardware that doesn't have real fma
10:24 fdobridge: <d​adschoorse> even if the hw has unfused mad?
10:25 fdobridge: <k​arolherbst🐧🦀> good question...
10:25 fdobridge: <k​arolherbst🐧🦀> I think we never bothered?
10:25 fdobridge: <k​arolherbst🐧🦀> or maybe I did?
10:25 fdobridge: <k​arolherbst🐧🦀> it's all a bit funky
10:26 fdobridge: <d​adschoorse> I think in the end nir to tgsi might be the most problematic backend because tgsi also has the stupid definition of fma as "might be fused"
10:26 fdobridge: <k​arolherbst🐧🦀> yeah..
10:26 fdobridge: <k​arolherbst🐧🦀> the real battle is getting glsl vs cl semantics in place
10:26 fdobridge: <k​arolherbst🐧🦀> in CL fma is fma
10:27 fdobridge: <k​arolherbst🐧🦀> and fmad is whatever
10:28 fdobridge: <k​arolherbst🐧🦀> and then there is `-cl-mad-enable`
10:28 fdobridge: <k​arolherbst🐧🦀> `Allow a * b + c to be replaced by a mad instruction. The mad instruction may compute a * b + c with reduced accuracy in the embedded profile. See the OpenCL C or OpenCL SPIR-V Environment specification for accuracy details. On some hardware the mad instruction may provide better performance than the expanded computation.`
10:29 fdobridge: <d​adschoorse> that would just mean the mul could be imprecise in NIR I guess
10:29 fdobridge: <k​arolherbst🐧🦀> yeah
10:29 fdobridge: <k​arolherbst🐧🦀> I think we already handle that part
10:29 fdobridge: <k​arolherbst🐧🦀> but ffma not so much
10:29 fdobridge: <k​arolherbst🐧🦀> libclc has actual ffma lowering and we use that so far I think
10:29 fdobridge: <k​arolherbst🐧🦀> I really have to fix that nonsense
10:31 fdobridge: <k​arolherbst🐧🦀> anyway.. supporting it on hardware with just one is simple
10:31 fdobridge: <k​arolherbst🐧🦀> AMD is the weird oddball here
10:31 fdobridge: <k​arolherbst🐧🦀> having hardware with both, where fma is sometimes slower
10:32 fdobridge: <d​adschoorse> amd has hw with both at the same speed, hw with both where fma is slow, and hw with only fma
10:32 fdobridge: <k​arolherbst🐧🦀> yeah
10:32 fdobridge: <k​arolherbst🐧🦀> kinda don't get why there is hardware with both at the same speed 😄
10:32 fdobridge: <k​arolherbst🐧🦀> but whatever
10:32 fdobridge: <k​arolherbst🐧🦀> is there actually a good reason for that?
10:33 fdobridge: <d​adschoorse> because you can use mad if the mul is precise
10:33 fdobridge: <k​arolherbst🐧🦀> I question the benefits of making the architecture more complex, but maybe it pays off ...
10:33 fdobridge: <k​arolherbst🐧🦀> I just doubt it
10:34 fdobridge: <d​adschoorse> amd's unfused mad also always flushes denorms to zero, which is fun for fp16
10:34 fdobridge: <k​arolherbst🐧🦀> uhhhh
10:35 fdobridge: <k​arolherbst🐧🦀> yeah, that sounds like fun
10:35 fdobridge: <k​arolherbst🐧🦀> it's already such a corner case, but I also added that precise handling to TGSI because some games hit it 🙃
10:36 fdobridge: <d​adschoorse> radv had a ton of issues in games when rdna2 was released as the first hw without mad
10:36 fdobridge: <d​adschoorse> now we treat everything that is used to calculate position in vertex shaders as precise 🐸
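The "everything that feeds position is precise" approach amounts to a backwards walk over the value graph starting at the position output. The sketch below shows that walk on a toy dependency map. It is not radv's implementation, and all value names are invented.

```python
def mark_precise(defs, roots):
    """Return every value that (transitively) feeds one of `roots`.

    `defs` maps a value name to the list of values it is computed from.
    """
    precise, stack = set(), list(roots)
    while stack:
        value = stack.pop()
        if value in precise:
            continue
        precise.add(value)
        stack.extend(defs.get(value, []))
    return precise

# Toy vertex shader: position depends on a matrix multiply, colour does not.
shader = {
    "pos": ["mvp_mul"],
    "mvp_mul": ["in_pos", "mvp"],
    "color": ["in_color", "tint"],
}
print(sorted(mark_precise(shader, ["pos"])))
```

Only the chain behind `pos` ends up marked, so the colour math would still be free to use whatever contraction the backend prefers.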
10:37 fdobridge: <k​arolherbst🐧🦀> funky
11:19 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> Will nouveau/mesa repository still receive updates? 🐸
11:41 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> But anyway the new uAPI patches managed to apply without any conflicts on 6.4 (I had to fix one mistake though) and they work fine in some games that I've tried (the performance is too low to play them properly so I can't give a proper stability rating)
12:26 fdobridge: <c​onan_kudo> Congratulations on landing NVK!
12:51 Ermine: Congrats on merging NVK!
13:25 fdobridge: <g​fxstrand> Thanks! 🥳
13:56 fdobridge: <g​fxstrand> Should be able to. As long as it's correct and just a matter of perf if we can reduce the pattern.
14:09 fdobridge: <g​fxstrand> No. The branch there has been removed. Any still useful MRs should be re-targeted.
14:11 fdobridge: <g​fxstrand> There's a part of me that's inclined to continue using the issue tracker for stuff like tracking all the misc features just to keep them out of the public eye.
14:16 fdobridge: <g​fxstrand> Cool. I did successfully get through a run last night after I disabled iwlwifi. 🙃
14:30 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> I can still see nvk/main
14:32 fdobridge: <g​fxstrand> Probably need to push main and make it the primary branch before we can delete NVK/main
14:36 fdobridge: <g​fxstrand> There we go
14:38 fdobridge: <g​fxstrand> Really, I just renamed nvk/main to main.
14:47 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> I'm kind of considering making an Arch kernel package with new uAPI support or an option in linux-tkg to use the new uAPI patches :nouveau:
14:56 fdobridge: <g​fxstrand> Sure. Hopefully we won't need that for long. Getting something for GSP might be useful, too.
15:07 fdobridge: <k​arolherbst🐧🦀> ohh yeah.. I think we want a kernel with GSP and the new uapi so people can already try things out, but they'll be disappointed with the perf 😄
15:07 fdobridge: <k​arolherbst🐧🦀> @gfxstrand what's the pass called in anv to promote ubos to bound ones? I might look into it next week then.
15:07 fdobridge: <g​fxstrand> There isn't one
15:08 fdobridge: <g​fxstrand> Oh, in ANV? It's just part of apply_pipeline_layout
15:08 fdobridge: <k​arolherbst🐧🦀> yeah
15:09 fdobridge: <k​arolherbst🐧🦀> that `try_lower_direct_buffer_intrinsic` stuff?
15:10 fdobridge: <g​fxstrand> That's part of it but no
15:10 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> Hopefully hwmon and display parts can be eventually hooked up
15:11 fdobridge: <g​fxstrand> It's a more general part of the pass. It has some heuristics about what gets placed where. It counts uses of each resource and promotes the most commonly used ones to bound.
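The heuristic as described (count the uses of each resource, then promote the most used ones to bound slots) can be sketched in a few lines. This only illustrates the idea, not the code in ANV's apply_pipeline_layout, and the resource names are made up.

```python
from collections import Counter

def pick_bound(resource_uses, num_slots):
    """Return the `num_slots` most frequently used resources, most used first."""
    counts = Counter(resource_uses)
    return [name for name, _ in counts.most_common(num_slots)]

# One entry per use found while walking the shader.
uses = ["ubo_a", "ubo_b", "ubo_a", "ubo_c", "ubo_a", "ubo_b"]
print(pick_bound(uses, 2))
```

Everything that does not make the cut would stay on the slower, unbound path.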
15:11 fdobridge: <k​arolherbst🐧🦀> right...
15:12 fdobridge: <k​arolherbst🐧🦀> we do however have 18 slots for graphics, so we might as well just use the first n and add some smarter heuristics later if we really care enough
15:12 fdobridge: <g​fxstrand> But before we can do that we need to fix codegen so it doesn't tweak with cb indices. Right now we're binding the root descriptors to all the bind points just in case. 🙄
15:12 fdobridge: <k​arolherbst🐧🦀> not sure if it's common enough in vulkan to have more
15:12 fdobridge: <k​arolherbst🐧🦀> uhhh
15:12 fdobridge: <k​arolherbst🐧🦀> I see
15:13 fdobridge: <k​arolherbst🐧🦀> yeah, I can look into that as well while at it
15:13 fdobridge: <k​arolherbst🐧🦀> however
15:13 fdobridge: <k​arolherbst🐧🦀> we could bind the root descriptor at info.io.auxCBSlot and then use that
15:13 fdobridge: <k​arolherbst🐧🦀> and it would be the last and it _should_ mostly work
15:13 fdobridge: <k​arolherbst🐧🦀> but anyway
15:13 fdobridge: <k​arolherbst🐧🦀> I can look into that
15:14 fdobridge: <k​arolherbst🐧🦀> I guess a vulkan driver also won't need `lower_uniforms_to_ubo` actually?
15:14 fdobridge: <k​arolherbst🐧🦀> where do push constants land atm?
15:15 fdobridge: <g​fxstrand> I think we are? But at least for a time it was tweaking UBOn indices so load_ubo(n, offset) might load from n+1
15:15 fdobridge: <g​fxstrand> Maybe it isn't anymore?
15:15 fdobridge: <g​fxstrand> 🤷🏻‍♀️
15:15 fdobridge: <k​arolherbst🐧🦀> I've ported to `lower_uniforms_to_ubo` at some point
15:15 fdobridge: <k​arolherbst🐧🦀> so codegen itself shouldn't do any n+1 anymore
15:15 fdobridge: <g​fxstrand> Okay then maybe we're okay
15:15 fdobridge: <k​arolherbst🐧🦀> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22815
15:16 fdobridge: <k​arolherbst🐧🦀> maybe it was @mhenning 🙃
15:17 fdobridge: <k​arolherbst🐧🦀> anyway.. we have to decide where to put push constants, though I guess the best thing would be to reserve space in the root descriptor and that's probably already done
15:17 fdobridge: <k​arolherbst🐧🦀> so we need 1 ubo (the last?) for driver internal + root descriptor fun, and the other ones are free for ubos
15:17 fdobridge: <k​arolherbst🐧🦀> should make it easier to have it the last one, as then ubos are 0 to n - 1, where n is 7/16/18 depending on the stage/gpu
15:18 fdobridge: <k​arolherbst🐧🦀> *7/15/17
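The proposed layout (user UBOs in slots 0 to n - 1, the last hardware slot reserved for the driver's root descriptor buffer) works out as follows for 8, 16 and 18 hardware slots. `cb_layout` is a hypothetical helper, not something that exists in NVK.

```python
def cb_layout(hw_slots: int) -> dict:
    """Reserve the last constant buffer slot for the driver, the rest for UBOs."""
    if hw_slots < 2:
        raise ValueError("need at least one UBO slot plus the driver slot")
    return {
        "ubo_slots": list(range(hw_slots - 1)),
        "driver_slot": hw_slots - 1,
    }

for hw_slots in (8, 16, 18):
    layout = cb_layout(hw_slots)
    print(hw_slots, len(layout["ubo_slots"]), layout["driver_slot"])
```

That gives 7, 15 and 17 usable UBO slots, matching the corrected numbers above.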
15:18 fdobridge: <k​arolherbst🐧🦀> mhh
15:18 fdobridge: <k​arolherbst🐧🦀> what about shader constants
15:19 fdobridge: <k​arolherbst🐧🦀> maybe 32k from the root descriptor as well? Not sure how big that one can become
15:20 fdobridge: <k​arolherbst🐧🦀> we kinda need to extract indirectly accessed constants from shaders, because perf
15:20 fdobridge: <g​fxstrand> Uh... We've had push constants since forever.
15:20 fdobridge: <k​arolherbst🐧🦀> like .. all of them
15:20 fdobridge: <k​arolherbst🐧🦀> sure, but where do they go atm? 😄
15:21 fdobridge: <g​fxstrand> In the root descriptor table.
15:21 fdobridge: <k​arolherbst🐧🦀> okay
15:21 fdobridge: <k​arolherbst🐧🦀> how big is the root descriptor in the worst case?
15:21 fdobridge: <k​arolherbst🐧🦀> or do we have some kind of limit?
15:21 fdobridge: <g​fxstrand> Like 1K maybe
15:22 fdobridge: <k​arolherbst🐧🦀> ahh, that's not much
15:22 fdobridge: <g​fxstrand> Maybe 512
15:22 fdobridge: <k​arolherbst🐧🦀> okay, so we can put all that driver stuff into one const buffer
15:22 fdobridge: <g​fxstrand> Yup
15:22 fdobridge: <k​arolherbst🐧🦀> root descriptor + push constants + nir_shader.constant_data
15:22 fdobridge: <g​fxstrand> And it already has a bunch of stuff in it.
15:22 fdobridge: <k​arolherbst🐧🦀> okay
15:23 fdobridge: <k​arolherbst🐧🦀> do we already have a pass to lower _all_ indirectly accessed constants to `constant_data`?
15:23 fdobridge: <g​fxstrand> nvk_cmd_buffer.h
15:23 fdobridge: <g​fxstrand> IDK what you mean by that.
15:23 fdobridge: <k​arolherbst🐧🦀> like if you have an array of constants in the shader, and the shader accesses them indirectly
15:23 fdobridge: <k​arolherbst🐧🦀> we need them lowered
15:24 fdobridge: <k​arolherbst🐧🦀> all of them 😄
15:24 fdobridge: <g​fxstrand> Right. nir_lower_large_constants(), I think
15:24 fdobridge: <k​arolherbst🐧🦀> except we need it for all constants
15:24 fdobridge: <k​arolherbst🐧🦀> of any size
15:24 fdobridge: <g​fxstrand> It has a threshold
15:24 fdobridge: <k​arolherbst🐧🦀> yeah.. and we need it to be 1
15:25 fdobridge: <k​arolherbst🐧🦀> probably
15:25 fdobridge: <k​arolherbst🐧🦀> codegen lowers it to local mem in all cases
15:25 fdobridge: <k​arolherbst🐧🦀> and that hurts a lot
15:25 fdobridge: <g​fxstrand> Well, you can bcsel a couple if needed but yeah, pulling them into a UBO.
15:25 fdobridge: <k​arolherbst🐧🦀> yeah..
15:26 fdobridge: <k​arolherbst🐧🦀> I'd rather put them all into an ubo 😄
15:26 fdobridge: <k​arolherbst🐧🦀> it's easier
15:26 fdobridge: <k​arolherbst🐧🦀> so we have about 62k of space in the "driver" ubo, should be plenty
15:30 fdobridge: <k​arolherbst🐧🦀> anyway.. I had a funky idea once to have a pre baked constant table shared across all shaders, so we don't have to reupload buffers all the time, but we can also just pull out all constants as it does improve codegen especially for older gpus.. but that needs to be done in the backend compiler, because not all instructions can load from ubos directly.... anyway, later we should lower all indirectly accessed constants in nir, and d
15:30 fdobridge: <k​arolherbst🐧🦀> but first that ubo stuff
15:31 fdobridge: <k​arolherbst🐧🦀> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4945
15:31 fdobridge: <k​arolherbst🐧🦀> that's what I've played around with in the past
15:37 fdobridge: <k​arolherbst🐧🦀> I also had a pass to pull in-shader constants into the buffer, but some people didn't like the idea of reuploading per shader constants on each bind
16:03 dakr: airlied, gfxstrand: sent out the nouveau svm fix (https://lore.kernel.org/dri-devel/20230805160027.88116-1-dakr@redhat.com/T/#u)
16:04 DodoGTA: dakr: I'm kind of surprised that mistake wasn't noticed
16:15 fdobridge: <g​fxstrand> Yeah, and because of how constants work, we can put multiple things in the same cb. IDK how that relates to recycling of the backing storage, though. It can't be infinite so there have to be stalls sometime. I'm still getting used to how Nvidia does things...
16:18 karolherbst: nobody builds with SVM enabled :')
16:20 fdobridge: <k​arolherbst🐧🦀> yeah... I think it doesn't make sense to have a shader with a constant buffer of like 8 bytes, and `opt_large_constants` with a big enough threshold should be useful. But then there are other considerations like, if you have a constant table with tons of indirect accesses you might want to do it regardless
16:20 fdobridge: <k​arolherbst🐧🦀> but
16:20 fdobridge: <k​arolherbst🐧🦀> we upload the driver constant buffer _anyway_ so we might as well put constants there
16:23 fdobridge: <k​arolherbst🐧🦀> but anyway, nvidia has up to 18 slots and they all exist on hardware afaik
16:24 fdobridge: <k​arolherbst🐧🦀> I'm mostly curious on why compute only has 7 and what they use the space for? maybe it's some huge cache for other things?
16:24 fdobridge: <k​arolherbst🐧🦀> mhh.. apparently ptx has 11 slots..
16:26 fdobridge: <k​arolherbst🐧🦀> ohh interesting.. so apparently each SM has a 4/8kb constant memory cache on top of the backing storage
16:29 fdobridge: <g​fxstrand> Yeah, if we have 18 slots, we can burn one for shader constants. We just need to make lower_descriptors rewrite load_constant accordingly.
16:29 fdobridge: <k​arolherbst🐧🦀> well.. we have only 8 in compute, but maybe they changed it recently...
16:29 fdobridge: <k​arolherbst🐧🦀> it's weird that ptx has 11
16:30 fdobridge: <k​arolherbst🐧🦀> maybe it was done for random reasons
16:30 fdobridge: <g​fxstrand> In ANV, we do a trick with shader relocations to put them at the end of the shader binary and not even have to worry about binding. 🙃
16:31 fdobridge: <k​arolherbst🐧🦀> heh
16:31 fdobridge: <g​fxstrand> We could do the same with NVK with bindless UBOs
16:31 fdobridge: <k​arolherbst🐧🦀> `NVC7C0_QMDV02_03_CONSTANT_BUFFER_VALID(i) MW((640+(i)*1):(640+(i)*1))`
16:31 fdobridge: <k​arolherbst🐧🦀> `NVC7C0_QMDV02_03_REGISTER_COUNT_V MW(656:648)`
16:31 fdobridge: <k​arolherbst🐧🦀> looks like 8 for compute still
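The "8 for compute" reading follows from the two header lines quoted above: the per-buffer valid bits start at bit 640, and the next field quoted starts at bit 648. The sketch below sets such `MW(hi:lo)` fields in an array of 32-bit words. `mw_set` is a made-up helper, and the 64-dword QMD size is an assumption used only to have something to write into.

```python
def mw_set(words, hi, lo, value):
    """Return a copy of `words` with bits hi:lo (inclusive) set to `value`."""
    width = hi - lo + 1
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the field")
    big = sum(w << (32 * i) for i, w in enumerate(words))
    mask = ((1 << width) - 1) << lo
    big = (big & ~mask) | (value << lo)
    return [(big >> (32 * i)) & 0xFFFFFFFF for i in range(len(words))]

def constant_buffer_valid(i):
    # From the quoted header line: MW((640+(i)*1):(640+(i)*1))
    return 640 + i, 640 + i

REGISTER_COUNT_V = (656, 648)   # from the quoted header line: MW(656:648)

qmd = [0] * 64                  # assumption: QMD modeled as 64 dwords
hi, lo = constant_buffer_valid(7)
qmd = mw_set(qmd, hi, lo, 1)
qmd = mw_set(qmd, *REGISTER_COUNT_V, 32)

print(hex(qmd[20]))                 # both fields land in dword 20
print(REGISTER_COUNT_V[1] - 640)    # valid bits available before the next field
```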
16:32 fdobridge: <k​arolherbst🐧🦀> whatever...
16:32 fdobridge: <k​arolherbst🐧🦀> having 17 in the graphics pipeline is more important 😄
16:32 fdobridge: <k​arolherbst🐧🦀> (+1 for our internal stuff)
16:33 fdobridge: <k​arolherbst🐧🦀> I have to check which arch added two more, but I think it was 2nd gen maxwell
18:40 fdobridge: <g​fxstrand> I thought reading phoronix comments would be amusing but no... It's just depressing. 😢
18:41 fdobridge: <k​arolherbst🐧🦀> depends on the topic, nvidia is one to avoid
18:41 fdobridge: <g​fxstrand> Yeah, lots of hatred and fighting
18:42 fdobridge: <k​arolherbst🐧🦀> asahi is also such a topic :/
18:43 fdobridge: <k​arolherbst🐧🦀> sometimes I'm in the mood of "discussing" with trolls there, but often I'm not 🙃
18:47 fdobridge: <g​eorgeouzou> I encountered a weird thing. Generated primitive queries are correct if I use the push_sync debug variable. If not, I get some errors on the queries that also use xfb.
18:58 fdobridge: <k​arolherbst🐧🦀> sounds like a sync problem then
19:13 fdobridge: <g​fxstrand> Ugh... codegen optimizer is screwing up again. 🙄
19:14 fdobridge: <g​fxstrand> In this case, it's somehow turning a u2u64(load_ubo()) into just loading a 64-bit value.
19:15 fdobridge: <g​fxstrand> dEQP-VK.binding_model.buffer_device_address.set0.depth1.basessbo.convertuvec2.nostore.multi.std140.comp
19:16 fdobridge: <g​fxstrand> I guess I could just lower 64-bit arithmetic in NIR
19:16 fdobridge: <g​fxstrand> Or I could not care
19:16 fdobridge: <g​fxstrand> But also this is the source of a lot of faults
19:17 fdobridge: <m​henning> I'll take a look at that one - it might just be peephole messing up again
19:17 fdobridge: <g​fxstrand> Thanks!
19:17 fdobridge: <k​arolherbst🐧🦀> creative
19:18 fdobridge: <g​fxstrand> Yeah, when that value goes into a (x << 4) and then gets added to an address it turns out those top bits not being zero can screw things up. 🙃
19:19 fdobridge: <k​arolherbst🐧🦀> yeah...
19:19 fdobridge: <k​arolherbst🐧🦀> u2u64 is just a move, so it's kinda funky why that even messes things up...
19:19 fdobridge: <k​arolherbst🐧🦀> well.. cvt actually mhh
19:20 fdobridge: <k​arolherbst🐧🦀> and that becomes a merge(src0, 0)
19:20 fdobridge: <k​arolherbst🐧🦀> ohhh....
19:22 fdobridge: <k​arolherbst🐧🦀> I wouldn't be surprised if LoadPropagation messes that up
19:28 fdobridge: <g​fxstrand> I suspect something like that.
19:29 fdobridge: <g​fxstrand> Because AFAICT, the second UBO element never gets loaded by the NIR. That load is invented by codegen from whole cloth.
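A small model of why the invented 64-bit load is harmful: if the upper half is taken from the neighbouring UBO element instead of being zero, the shifted index corrupts the high bits of the final address. The UBO contents and base address are invented, and `address` only mimics the `base + (index << 4)` shape described above.

```python
import struct

MASK64 = (1 << 64) - 1

# Two adjacent 32-bit UBO elements (made-up values).
ubo = struct.pack("<2I", 0x10, 0xFFFFFFFF)

def load_ubo32(offset: int) -> int:
    """What the NIR asks for: one 32-bit element."""
    return struct.unpack_from("<I", ubo, offset)[0]

def load_ubo64(offset: int) -> int:
    """What the miscompiled code does: one 64-bit load."""
    return struct.unpack_from("<Q", ubo, offset)[0]

def address(base: int, index64: int) -> int:
    """base + (index << 4), wrapped to 64 bits."""
    return (base + ((index64 << 4) & MASK64)) & MASK64

base = 0x0000_7FFF_0000_0000
good = address(base, load_ubo32(0))   # zero-extended: upper half is 0
bad = address(base, load_ubo64(0))    # neighbour leaks into the top bits

print(hex(good))
print(hex(bad))
```

With a zero neighbour the two results would coincide, which fits a bug that only shows up for some data and some address ranges.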
19:35 fdobridge: <k​arolherbst🐧🦀> yeah...
19:36 fdobridge: <k​arolherbst🐧🦀> I suspect it's something not checking if it's a 64 bit thing 🙂
19:36 fdobridge: <k​arolherbst🐧🦀> I think there is still some code in codegen which assumes that only 32 bit values exist
19:37 fdobridge: <k​arolherbst🐧🦀> but I'll let @mhenning figure it out 😄 Just wanted to share my wild guess here
19:38 fdobridge: <m​henning> yeah, that wouldn't surprise me
19:49 fdobridge: <a​irlied> @karolherbst there are v3 qmds maybe they expanded it
19:49 fdobridge: <k​arolherbst🐧🦀> nah, they didn't. This PTX limit also existed since forever
19:50 fdobridge: <k​arolherbst🐧🦀> I should check what nvidia does if you actually use all 11
19:50 fdobridge: <k​arolherbst🐧🦀> let's see...
20:01 fdobridge: <p​rop_energy_ball> If I want to contribute an ext implementation to NVK, should I be using the tree in mesa/mesa now or nouveau/mesa
20:02 fdobridge: <a​irlied> The former
20:05 fdobridge: <m​henning> Which load is it? What generation are you on? On kepler the only ld u64 I see is at ubo 0 offset 0xa0 and that's already 64-bit in the nir
20:08 fdobridge: <g​fxstrand> Turing
20:09 fdobridge: <g​fxstrand> It's a little over half-way through the test.
20:09 fdobridge: <k​arolherbst🐧🦀> ```
20:09 fdobridge: <k​arolherbst🐧🦀> /*00f0*/ UMOV UR4, '(cb10) ; /* 0x0000000000047882 */
20:09 fdobridge: <k​arolherbst🐧🦀> /* 0x000fc60000000000 */
20:09 fdobridge: <k​arolherbst🐧🦀> ```
20:09 fdobridge: <k​arolherbst🐧🦀> the hell is this
20:10 fdobridge: <k​arolherbst🐧🦀> they deprecated that banked const buffer feature in ptx 2.2
20:10 fdobridge: <k​arolherbst🐧🦀> and it seems like they live patch the location at runtime 🙃
20:10 fdobridge: <g​fxstrand> A mov into a uniform register
20:10 fdobridge: <k​arolherbst🐧🦀> what's that `(cb10)` thing tho 😛
20:10 fdobridge: <k​arolherbst🐧🦀> it's not in the encoding
20:11 fdobridge: <k​arolherbst🐧🦀> I'm sure they play weirdo tricks at runtime, like checking what you actually bind on launch time and then live patch the locations of the kernels
20:11 fdobridge: <k​arolherbst🐧🦀> *instructions
20:12 fdobridge: <k​arolherbst🐧🦀> however, I also see stuff like this generated: `IADD3 R0, R0, c[0xe][0x0], R11 ;`
20:12 fdobridge: <g​fxstrand> In NIR, you'll see `iadd(x, ushl(i2i64(load_ubo(0, n)), 4))` and a load that uses the result. I think it's the only 64-bit shift in the NIR.
20:13 fdobridge: <k​arolherbst🐧🦀> even on 2nd gen kepler I see `MOV R0, c[0xe][0x0];` mhhhh
20:13 fdobridge: <k​arolherbst🐧🦀> this is.... interesting
20:13 fdobridge: <k​arolherbst🐧🦀> maybe there is a way to use those const buffers in compute?
20:13 fdobridge: <g​fxstrand> 🤷🏻‍♀️
20:17 fdobridge: <k​arolherbst🐧🦀> mhhh no idea
20:38 fdobridge: <m​henning> hmm. I switched to ampere and I still can't reproduce this. The test actually passes for me on ampere, and the only i2i64 I see in the nir is consumed by an ishl, not a ushl
20:39 fdobridge: <m​henning> I don't think we set any lowering options differently on ampere vs turing, so I'm puzzled why I have different nir than you
20:41 fdobridge: <m​henning> maybe I need a newer cts or something? my cts on this machine is from february
20:49 fdobridge: <m​henning> No difference with latest cts
20:52 fdobridge: <k​arolherbst🐧🦀> @gfxstrand I know you'll hate me for it, but I'll need derefs in spir-v constant ops 🙃
20:52 fdobridge: <k​arolherbst🐧🦀> maybe it's a memory corruption thing?
21:02 fdobridge: <g​fxstrand> Could be ishl. I'm not looking at it right now.
21:02 fdobridge: <g​fxstrand> ?
21:02 fdobridge: <k​arolherbst🐧🦀> `%17 = OpSpecConstantOp %_ptr_CrossWorkgroup_uchar InBoundsPtrAccessChain %a_var %uint_0 %ulong_1`
21:03 fdobridge: <g​fxstrand> Oh, right, that... 🙄
21:04 fdobridge: <k​arolherbst🐧🦀> I'm just confused why it's doing an `OpSpecConstantOp` at all.... because there are no spec constants, but maybe I don't fully understand that part of spir-v yet
21:05 fdobridge: <k​arolherbst🐧🦀> https://gist.githubusercontent.com/karolherbst/a60b4d92701f60d0c82cf69ce64a6082/raw/42b2cf3198071c75e71582e3022fdad7c226d01e/gistfile1.txt
21:07 fdobridge: <k​arolherbst🐧🦀> ohh, seems like `OpSpecConstantOp` can't be updated and it's simply the result of the inputs
21:07 fdobridge: <g​fxstrand> It's horrible. You can, like, initialize global variables with pointers to other global things.
21:07 fdobridge: <k​arolherbst🐧🦀> yeah, I can see that 🙃
21:08 fdobridge: <g​fxstrand> Full relocation madness
21:08 fdobridge: <k​arolherbst🐧🦀> indeed
21:10 fdobridge: <k​arolherbst🐧🦀> I'm not even sure I want to support that Initializer/Finalizer part, because it's so poorly specced that I don't even have any idea if you can have 100 of those and then in what order they'll have to be executed
21:10 fdobridge: <k​arolherbst🐧🦀> but that part is also deprecated, so maybe I just hope we'll never need it
21:11 fdobridge: <g​fxstrand> 🤷🏻‍♀️
21:12 fdobridge: <g​fxstrand> But, like, you can theoretically size an array with the difference between two global pointers.
21:13 fdobridge: <g​fxstrand> I think they were trying to do C++ constexpr but they really flubbed it.
21:13 fdobridge: <k​arolherbst🐧🦀> yeah...
21:13 fdobridge: <g​fxstrand> IDK what we're going to do to deal with all the insanity
21:13 fdobridge: <k​arolherbst🐧🦀> it's deprecated
21:13 fdobridge: <k​arolherbst🐧🦀> well
21:13 fdobridge: <k​arolherbst🐧🦀> part of it
21:14 fdobridge: <k​arolherbst🐧🦀> Initializers and Finalizers are
21:14 fdobridge: <k​arolherbst🐧🦀> and for the rest? dunno.. care if something needs it?
21:27 fdobridge: <g​fxstrand> Are you running on the new UAPI or old? For some reason, the fault only reproduces on new. Something to do with higher buffer addresses, I think.
21:28 fdobridge: <m​henning> Oh, old. I haven't built a new kernel yet
21:28 fdobridge: <k​arolherbst🐧🦀> ohh right... nouveau usually only sees buffer addresses within the 32 bit range 🙃
21:29 fdobridge: <g​fxstrand> Yeah, lots of fun bugs you find when you place buffers at the top of the space. There's a reason util_vma_heap defaults to that. 😝
21:29 fdobridge: <k​arolherbst🐧🦀> yeah...
21:30 fdobridge: <k​arolherbst🐧🦀> huh.. isn't then the bug that codegen simply cuts off high bits?
21:30 fdobridge: <k​arolherbst🐧🦀> there... might be places where codegen assumes 32 bit pointers.... I think
21:31 fdobridge: <k​arolherbst🐧🦀> do we even emit the `.E` field?
21:31 fdobridge: <k​arolherbst🐧🦀> *flag
21:31 fdobridge: <k​arolherbst🐧🦀> `.E` makes an address 64 bit instead of 32
21:32 fdobridge: <k​arolherbst🐧🦀> `emitField(72, 1, insn->src(0).getIndirect(0)->getSize() == 8);` mhh. that's `.E` for `OP_LOAD`
21:33 fdobridge: <k​arolherbst🐧🦀> but yeah.. things might assume 32 bit where they shouldn't
21:36 fdobridge: <g​fxstrand> No, the problem is that codegen adds stuff in. IDK why that only faults on the new API
21:37 fdobridge: <g​fxstrand> The compile is probably bad either way. The ishl should have 0 in the top 32 bits but it doesn't.
21:46 fdobridge: <k​arolherbst🐧🦀> well.. the only reasonable explanation for why it faults on the new API is if the high bits are now non-zero and they make it to where they shouldn't. And the shifts are a bit wonky, but it also depends on how they are emitted. They are 64 bit operations, always, just the value is split across two registers
21:47 fdobridge: <g​fxstrand> Actually, I suspect it has to do with where the kernel places context memory.
21:47 fdobridge: <g​fxstrand> But whatever
21:47 fdobridge: <g​fxstrand> I kinda don't care (edited)
21:50 fdobridge: <g​fxstrand> Yeah, it's LoadPropagation
21:51 fdobridge: <k​arolherbst🐧🦀> I wouldn't be surprised if there is a "high bits of addresses are 0, therefore it's safe to..." code around somewhere
21:52 fdobridge: <k​arolherbst🐧🦀> `TargetGV100::insnCanLoad` is probably at fault here
21:52 fdobridge: <g​fxstrand> No, it really is just this shift getting messed up
21:56 fdobridge: <g​fxstrand> Wait... this is more subtle than I thought
21:58 fdobridge: <g​fxstrand> Yeah, it's not just load propagation going bananas
22:04 fdobridge: <g​fxstrand> Maybe 64-bit shifts can't take immediates?
22:06 fdobridge: <k​arolherbst🐧🦀> they can
22:06 fdobridge: <k​arolherbst🐧🦀> and if not, the hardware would complain in dmesg
22:07 fdobridge: <k​arolherbst🐧🦀> however, it can only be a 32 bit immediate
22:08 fdobridge: <k​arolherbst🐧🦀> could dump the binary and see if nvdisasm agrees
22:15 fdobridge: <g​fxstrand> I did
22:15 fdobridge: <g​fxstrand> nvdisasm seems okay
22:18 fdobridge: <k​arolherbst🐧🦀> mhh.. could be something subtle, like pulling 64 bit from a const buffer instead of just 32. It all kinda depends on the instruction they are used in. But can also be just something super subtle, like an instruction writing to a 64 bit reg, even though only 32 was intended. nvdisasm is kinda terrible at showing that
22:18 fdobridge: <k​arolherbst🐧🦀> I... kinda remember having fixed a rando bug like that before...
22:20 fdobridge: <k​arolherbst🐧🦀> I also hope that I actually upstreamed such a fix
22:23 fdobridge: <k​arolherbst🐧🦀> uhhhhmmmm
22:23 fdobridge: <k​arolherbst🐧🦀> what kinda shift do you have there?
22:23 fdobridge: <g​fxstrand> I've pretty much got it narrowed down
22:26 fdobridge: <g​fxstrand> Works:
22:26 fdobridge: <g​fxstrand> ```
22:26 fdobridge: <g​fxstrand> 10: mov u32 $r8 c0[0x24] (16)
22:26 fdobridge: <g​fxstrand> ...
22:26 fdobridge: <g​fxstrand> 68: shf (SUBOP:3) s32 $r3 $r255 31 $r8 (16)
22:26 fdobridge: <g​fxstrand> ```
22:26 fdobridge: <g​fxstrand> Doesn't:
22:26 fdobridge: <g​fxstrand> ```
22:26 fdobridge: <g​fxstrand> 68: mov u32 $r2 0x0000001f (16)
22:26 fdobridge: <g​fxstrand> 69: shf (SUBOP:3) s32 $r7 $r255 $r2 c0[0x24] (16)
22:26 fdobridge: <g​fxstrand> ```
22:26 fdobridge: <k​arolherbst🐧🦀> I should skim through my local git tree more often... I don't even remember having ever written this: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/4dad0e35cf29d2d28df878cf13b5dba64ac2db79
22:26 fdobridge: <g​fxstrand> It's the 32-bit shift that's borked
22:27 fdobridge: <k​arolherbst🐧🦀> mhhh
22:27 fdobridge: <k​arolherbst🐧🦀> at first it looks equal...
22:28 fdobridge: <k​arolherbst🐧🦀> mhhh
22:28 fdobridge: <k​arolherbst🐧🦀> it would be weird to pull 64 bits from `c0[0x24]`
22:28 fdobridge: <k​arolherbst🐧🦀> _but_...
22:29 fdobridge: <k​arolherbst🐧🦀> let me read some docs
22:29 fdobridge: <g​fxstrand> It's just a sign-extend
22:29 fdobridge: <g​fxstrand> Forget the 64 bits (edited)
22:29 fdobridge: <k​arolherbst🐧🦀> uhhhh
22:29 fdobridge: <g​fxstrand> But it's ending up with garbage somewhere
22:29 fdobridge: <k​arolherbst🐧🦀> :painpeko:
22:29 fdobridge: <k​arolherbst🐧🦀> mhhh
22:29 fdobridge: <k​arolherbst🐧🦀> yeah well...
22:30 fdobridge: <k​arolherbst🐧🦀> that's kinda funky
22:30 fdobridge: <k​arolherbst🐧🦀> sooo
22:30 fdobridge: <k​arolherbst🐧🦀> shf is kinda weird
22:30 fdobridge: <g​fxstrand> Yeah, I know
22:30 fdobridge: <k​arolherbst🐧🦀> let's see...
22:31 fdobridge: <k​arolherbst🐧🦀> what's the thing nvdisasm prints for that shf?
22:33 fdobridge: <g​fxstrand> Good:
22:33 fdobridge: <g​fxstrand> ```
22:33 fdobridge: <g​fxstrand> /*01e0*/ SHF.R.S32.HI R3, RZ, 0x1f, R8 ; /* 0x0000001fff037819 */
22:33 fdobridge: <g​fxstrand> ```
22:33 fdobridge: <g​fxstrand> Bad:
22:33 fdobridge: <g​fxstrand> ```
22:33 fdobridge: <g​fxstrand> /*01a0*/ MOV R2, 0x1f ; /* 0x0000001f00027802 */
22:33 fdobridge: <g​fxstrand> /* 0x000fcc0000000f00 */
22:33 fdobridge: <g​fxstrand> /*01b0*/ SHF.R.S32.HI R7, RZ, R2, c[0x0][0x24] ; /* 0x00000900ff077619 */
22:33 fdobridge: <g​fxstrand> /* 0x000fde0000011402 */
22:33 fdobridge: <g​fxstrand> ```
22:33 fdobridge: <k​arolherbst🐧🦀> okay, so those are at least clamped shifts...
22:34 fdobridge: <k​arolherbst🐧🦀> yeah so that should be `signed(c[0x0][0x24]) >> R2`
22:37 fdobridge: <k​arolherbst🐧🦀> heh..
22:37 fdobridge: <k​arolherbst🐧🦀> I think my docs are buggy
22:38 fdobridge: <k​arolherbst🐧🦀> there are two examples of `SHF.R.S32.HI` and they show a different operation each 🙃
22:38 fdobridge: <g​fxstrand> heh
22:39 fdobridge: <k​arolherbst🐧🦀> once `signed(c[0x0][0x24]) >> R2` and the other is `signed(c[0x0][0x24]) >> R2 >> 32`
22:39 fdobridge: <g​fxstrand> right
22:40 fdobridge: <k​arolherbst🐧🦀> I suspect the first should have been `.LO`
22:41 fdobridge: <k​arolherbst🐧🦀> ehh.. or maybe that's fine, because of where the value went
22:42 fdobridge: <k​arolherbst🐧🦀> yeah.. nvm
22:42 fdobridge: <k​arolherbst🐧🦀> it's just weirdly explained
22:42 fdobridge: <k​arolherbst🐧🦀> anyway.. yes, the value would be sign extended
22:42 fdobridge: <k​arolherbst🐧🦀> even if the other reg is RZ
22:43 fdobridge: <k​arolherbst🐧🦀> which kinda makes sense
22:44 fdobridge: <g​fxstrand> Yeah, lo doesn't help
22:45 fdobridge: <k​arolherbst🐧🦀> dest, low, shift, high is the encoding
22:46 fdobridge: <g​fxstrand> Right
22:46 fdobridge: <k​arolherbst🐧🦀> I still don't see why the two are different...
22:47 fdobridge: <g​fxstrand> I suspect it doesn't support all the encodings we think it does
22:47 fdobridge: <g​fxstrand> But that also seems weird
22:47 fdobridge: <k​arolherbst🐧🦀> nah, the encodings are both valid
22:47 fdobridge: <k​arolherbst🐧🦀> and again, the hardware would shout unless we encode random nonsense
22:47 fdobridge: <k​arolherbst🐧🦀> but then nvdisasm would shout
22:48 fdobridge: <k​arolherbst🐧🦀> It kinda feels like the bug might not be in that instruction, it just shows up because it's writing to a different reg now or something...
22:48 fdobridge: <k​arolherbst🐧🦀> to me both of them, good and bad, look identical
22:49 fdobridge: <k​arolherbst🐧🦀> mhhhhh
22:50 fdobridge: <k​arolherbst🐧🦀> @gfxstrand mind running with `NV50_PROG_SCHED=0` just in case it's something with the sched stuff?
22:50 fdobridge: <k​arolherbst🐧🦀> the last thing I want is for it to be a bug with the scheduling 🙃
22:51 fdobridge: <g​fxstrand> Yeah, doesn't help
22:51 fdobridge: <k​arolherbst🐧🦀> I hope you also made sure it recompiled? Not sure if caching is set up already
22:51 fdobridge: <g​fxstrand> we have no caching
22:51 fdobridge: <k​arolherbst🐧🦀> mhhhhh
22:53 fdobridge: <g​fxstrand> How has this never shown up?!?
22:53 fdobridge: <k​arolherbst🐧🦀> some of those random weirdo bugs are terribly hard to track down
22:54 fdobridge: <k​arolherbst🐧🦀> but it still feels like it's something else
22:54 fdobridge: <k​arolherbst🐧🦀> sooo
22:54 fdobridge: <k​arolherbst🐧🦀> `c[0x0][0x24]` is aligned for a 4 byte load
22:54 fdobridge: <k​arolherbst🐧🦀> _if_ the shift would load 8 bytes, it wouldn't load them at `c[0x0][0x24]`, but `c[0x0][0x20]` instead
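The alignment argument above can be checked with a little arithmetic. This is a minimal sketch, assuming (as the message suggests) that a wider constant-buffer read would start at the offset rounded down to its own size; `aligned_base` is a hypothetical helper for illustration, not something from codegen.

```python
def aligned_base(offset: int, size: int) -> int:
    """Round offset down to a multiple of size (size must be a power of two)."""
    return offset & ~(size - 1)

offset = 0x24
# A 4-byte read at 0x24 is already aligned and stays where it is.
print(hex(aligned_base(offset, 4)))
# An 8-byte read cannot start at 0x24; it rounds down to the previous 8-byte boundary.
print(hex(aligned_base(offset, 8)))
```

So if the shift really pulled 8 bytes, it would be reading a neighbouring 32-bit value along with the intended one.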
22:55 fdobridge: <g​fxstrand> The broken shift is the 32-bit one. The whole 64-bit thing was a red herring
22:55 fdobridge: <k​arolherbst🐧🦀> yeah.. but I don't see how it's broken
22:56 fdobridge: <g​fxstrand> Okay, so this is interesting...
22:56 fdobridge: <g​fxstrand> handleShifts handles SHL and SHR differently
22:58 fdobridge: <k​arolherbst🐧🦀> yeah, it has to
22:59 fdobridge: <k​arolherbst🐧🦀> the ISA doc states "for 32 bit left shifts, do that:" and that's the code you see
22:59 fdobridge: <k​arolherbst🐧🦀> anyway, I think the shift is alright, and it's something else going wrong
23:00 fdobridge: <g​fxstrand> Maybe
23:00 fdobridge: <g​fxstrand> Could be RA, I guess, but it seems pretty stable
23:00 fdobridge: <k​arolherbst🐧🦀> might sharing the good and bad shader?
23:01 fdobridge: <k​arolherbst🐧🦀> *mind
23:02 fdobridge: <g​fxstrand> Good
23:02 fdobridge: <g​fxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1137521073609506877/message.txt
23:02 fdobridge: <g​fxstrand> Bad
23:02 fdobridge: <g​fxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1137521151283822692/message.txt
23:05 fdobridge: <k​arolherbst🐧🦀> ehhh...
23:07 fdobridge: <k​arolherbst🐧🦀> ehh no, that's fine
23:08 fdobridge: <g​fxstrand> The value in `c[0x0][0x24]` is 1 for whatever it's worth
23:09 fdobridge: <g​fxstrand> The output I'm seeing seems consistent with it doing `0x1f >> 1` instead of `1 >> 0x1f`
23:09 fdobridge: <k​arolherbst🐧🦀> yeah...
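The two candidate results can be compared with a small model of the funnel shift. This is only a sketch built from what is said in this conversation: operands in the order dest, low, shift, high, a clamped shift amount, and a sign-extended high word. The function name and the exact semantics are assumptions, not taken from the ISA documentation.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x: int) -> int:
    """Reinterpret the low 32 bits of x as a signed 32-bit integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def shf_r_s32_hi(low: int, shift: int, high: int) -> int:
    """Assumed model of `SHF.R.S32.HI dest, low, shift, high`.

    Treats high:low as one signed 64-bit value, shifts it right
    arithmetically by a shift amount clamped to 32, and returns the
    upper 32 bits of the result.
    """
    shift = min(shift & MASK32, 32)
    value = (to_signed32(high) << 32) | (low & MASK32)
    return ((value >> shift) >> 32) & MASK32

c0_0x24 = 1  # value reported for c[0x0][0x24] in the chat

# Intended operation: 1 >> 0x1f
print(shf_r_s32_hi(0, 0x1f, c0_0x24))  # prints 0
# Operands swapped: 0x1f >> 1
print(shf_r_s32_hi(0, c0_0x24, 0x1f))  # prints 15
```

Under this model the intended shift yields 0 and the swapped one yields 15, which is the kind of difference described above.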
23:11 fdobridge: <g​fxstrand> But there's plenty of other SHFs with a constbuf argument that seem fine
23:12 fdobridge: <g​fxstrand> I mean, maybe our dependency tracker is busted?
23:19 fdobridge: <k​arolherbst🐧🦀> okay.. I think I got it..
23:20 fdobridge: <k​arolherbst🐧🦀> mind sharing the codegen ir output? 😄
23:20 fdobridge: <k​arolherbst🐧🦀> but yeah... it's busted
23:20 fdobridge: <k​arolherbst🐧🦀> but in the most annoying way
23:20 fdobridge: <k​arolherbst🐧🦀> in the bad one you have `SHF.L.W.S64 R2, R6, 0x4, RZ ;`
23:20 fdobridge: <k​arolherbst🐧🦀> and `SHF.R.S32.HI R7, RZ, R2, c[0x0][0x24] ;` before that
23:20 fdobridge: <k​arolherbst🐧🦀> but.. who actually reads `R7`?
23:21 fdobridge: <k​arolherbst🐧🦀> can't be that `SHF.L.W.S64 R2, R6, 0x4, RZ ;`, because it uses 32 bit regs, not 64 bit ones
23:21 fdobridge: <k​arolherbst🐧🦀> but besides that?
23:22 fdobridge: <g​fxstrand> No one
23:22 fdobridge: <g​fxstrand> But that's a 64-bit shift
23:22 fdobridge: <k​arolherbst🐧🦀> so?
23:22 fdobridge: <g​fxstrand> Or do 64-bit shifts not actually exist?
23:22 fdobridge: <k​arolherbst🐧🦀> again
23:22 fdobridge: <k​arolherbst🐧🦀> the encoding is: dest, low, shift, high
23:23 fdobridge: <g​fxstrand> Right, so we need to put both regs in there for a 64-bit shift?
23:23 fdobridge: <k​arolherbst🐧🦀> yes
23:23 fdobridge: <k​arolherbst🐧🦀> I suspect codegen got a 64 bit source
23:23 fdobridge: <k​arolherbst🐧🦀> and never bothered splitting it
23:24 fdobridge: <g​fxstrand> Yeah, handleShifts doesn't take that into account
23:24 fdobridge: <k​arolherbst🐧🦀> which... with 32 bit addresses doesn't matter one bit 🙃
23:24 fdobridge: <k​arolherbst🐧🦀> or well...
23:24 fdobridge: <k​arolherbst🐧🦀> 64 bit values with their high bits being 0
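The hypothesis in the last few messages (a 64-bit source that never gets split, so the high operand ends up as RZ) can be illustrated as follows. This is a sketch of that hypothesis only; the conversation does not confirm it as the actual bug, and both function names are made up for the example.

```python
MASK32 = 0xFFFFFFFF

def shl64_hi(low: int, high: int, shift: int) -> int:
    """Upper 32 bits of (high:low) << shift, for 0 <= shift < 32."""
    value = ((high & MASK32) << 32) | (low & MASK32)
    return ((value << shift) >> 32) & MASK32

def shl64_hi_unsplit(low: int, high: int, shift: int) -> int:
    """Same shift, but the high source is dropped (treated as RZ, i.e. zero)."""
    return shl64_hi(low, 0, shift)

# High bits are zero: dropping the high word makes no difference.
print(hex(shl64_hi(0x1000, 0x0, 4)), hex(shl64_hi_unsplit(0x1000, 0x0, 4)))
# High bits are non-zero: the unsplit version loses them.
print(hex(shl64_hi(0x1000, 0x7f, 4)), hex(shl64_hi_unsplit(0x1000, 0x7f, 4)))
```

This is why such a mistake could stay hidden as long as every 64-bit value has zero in its upper half, and only show up once buffers are placed at higher addresses.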
23:25 fdobridge: <g​fxstrand> but `c[0x0][0x24] >> 0x1f` should be 0
23:26 fdobridge: <k​arolherbst🐧🦀> the right shift never makes it out of the shader
23:26 fdobridge: <k​arolherbst🐧🦀> it's the left shift following that having a different result
23:26 fdobridge: <k​arolherbst🐧🦀> ehh wait.. shouldn't
23:27 fdobridge: <k​arolherbst🐧🦀> that `IADD3.X R3, R5, R3, RZ, P0, !PT ; ` is wonky
23:27 fdobridge: <g​fxstrand> But I think maybe I see your point
23:27 fdobridge: <k​arolherbst🐧🦀> that `R3` is `SHF.R.S32.HI R3, RZ, 0x1f, R8 ;` in the good version
23:27 fdobridge: <k​arolherbst🐧🦀> or uhhh..
23:28 fdobridge: <k​arolherbst🐧🦀> SHF.L.W should write R3.. nevermind me
23:29 fdobridge: <k​arolherbst🐧🦀> it's still kinda weird.. all of that
23:29 fdobridge: <g​fxstrand> Still, your explanation of low and high makes sense
23:29 fdobridge: <g​fxstrand> I'm just not quite sure how to modify the code
23:29 fdobridge: <k​arolherbst🐧🦀> yeah.. it's definitely one bug
23:29 fdobridge: <k​arolherbst🐧🦀> but there might be another one
23:33 fdobridge: <k​arolherbst🐧🦀> I have to figure out how to use mksplit again..
23:34 fdobridge: <k​arolherbst🐧🦀> totally untested https://gist.github.com/karolherbst/93b137a962334aff0b27d209e4a52c9a
23:36 fdobridge: <g​fxstrand> I typed something similar and no dice
23:36 fdobridge: <g​fxstrand> I've got to go now (edited)
23:36 fdobridge: <g​fxstrand> We can look more on Monday
23:36 fdobridge: <k​arolherbst🐧🦀> yeah..
23:36 fdobridge: <g​fxstrand> Or you can play with the test yourself
23:36 fdobridge: <k​arolherbst🐧🦀> I suspect there are more bugs
23:37 fdobridge: <k​arolherbst🐧🦀> did something change at least?
23:40 fdobridge: <k​arolherbst🐧🦀> it _could_ also be, that `.S64` selects the source to be 64 bit for real, but still only writes a 32 bit result
23:40 fdobridge: <k​arolherbst🐧🦀> and then `R3` is whatever random garbage it was before
23:40 fdobridge: <k​arolherbst🐧🦀> maybe it's that instead
23:42 fdobridge: <k​arolherbst🐧🦀> the docs aren't... as great as I wished they were here
23:43 fdobridge: <k​arolherbst🐧🦀> I also think I have a patch for that somewhere....
23:43 fdobridge: <k​arolherbst🐧🦀> yeah.. let's talk on Monday