18:43hatfielde: I am enabling `NAK_DEBUG=print` but not seeing any output. Do I have to set other env vars or some build flags or something? It could also be that my test program, which creates a compute pipeline and then shuts everything down, doesn't trigger compilation.
18:55karolherbst: hatfielde: MESA_SHADER_CACHE_DISABLE=1
18:55karolherbst: if the binary is already cached, it won't compile it again and hence you won't see any prints
19:08hatfielde: karolherbst: Any option that will be very noisy? I'm not seeing any output with `MESA_LOG_LEVEL=debug`, `NIR_DEBUG=help`, or the flag you suggested. I'm questioning whether I can log anything whatsoever...
19:09marysaka[d]: NIR_DEBUG needs a debug build. Not sure about MESA_LOG_LEVEL, but it might also need one for the debug level
19:12karolherbst: right.. could be that not everything actually does work in a release build..
19:19hatfielde: Seems like the output that I'm looking for appears with vkcube on my current release build, so it must be something wrong with my test program. I never dispatch the compute shader because I didn't want to create descriptor buffers and command buffers, but I'll have to eventually anyway, so I may as well write it now. Probably the shader doesn't compile until it absolutely has to
19:19hatfielde: Of course re-running vkcube requires that I enable `MESA_SHADER_CACHE_DISABLE=1` to see the same compilation output.
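(As an aside, the setup described in the messages above boils down to a one-liner; this is a sketch assuming a Mesa build with NVK and a vkcube binary on the path, not a command quoted from the chat:)

```shell
# Disable the shader cache so compilation actually happens on every run,
# then enable NAK's debug printing and run a known-good Vulkan app.
MESA_SHADER_CACHE_DISABLE=1 NAK_DEBUG=print vkcube
```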
19:19hatfielde: I am trying to observe the current behavior for: https://gitlab.freedesktop.org/mesa/mesa/-/work_items/13774
19:52mhenning[d]: hatfielde: Make sure your test program is using the correct device
19:52mhenning[d]: e.g. if you get llvmpipe by accident then you won't hit nak and nothing will be printed
19:55hatfielde: mhenning[d]: good point, I am indeed using the correct device (verified by printing VkPhysicalDeviceProperties::deviceName)
20:12chiku: huh
20:12chikuwad[d]: I thought my znc was on my server
21:23hatfielde: To close the loop above, I was able to get the NAK_DEBUG output after dispatching the compute shader in my test program. And I'm able to reproduce #13774
21:33phomes_[d]: most games improved with compression, occupancy, or other performance MRs. X4 Foundations is an example of a game that did not improve at all
21:33phomes_[d]: I took a closer look at the scene I use in the perf sheet
21:33phomes_[d]: Frame time on the proprietary 590 driver is 20 ms. On NVK it takes 44 ms
21:34phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1474519530960064615/Screenshot_From_2026-02-20_21-40-32.png?ex=699a24d3&is=6998d353&hm=2e892762bc94d220d4a2c28a387b78deafe389d4a0f20f077341acdd7b4f21fc&
21:34phomes_[d]: Out of the 44 ms we spend 31 ms in vkWaitForFences after queue submit/present.
21:35phomes_[d]: The frame is mostly CmdDrawIndexed: 2090 calls to be exact, plus 70 CmdDraws and 53 CmdDispatches
21:35phomes_[d]: Comparing the durations in Renderdoc's event browser for both drivers shows that
21:36phomes_[d]: for what renderdoc calls Depth-only passes, nvk performs 50x to 180x worse
21:36mhenning[d]: unfortunately waiting in vkWaitForFences might not tell us much - it just means we're waiting for the gpu to do work not what work it is
21:36phomes_[d]: the relative performance on nvk gets worse as the index count grows
21:37phomes_[d]: yes, that is what I tried to dive into with renderdoc. I don't know how else to do it 🙂
21:38mhenning[d]: phomes_[d]: oh, that's interesting. there's a scene in the witcher 3 that's really bad on nvk (like 2 fps while the rest of the game is playable) that also has a lot of depth-only passes
21:38mhenning[d]: I tried a few things that didn't fix it
21:39phomes_[d]: Compute and Copy/Clear passes are only 2x as slow. Color is bad, but not as bad as depth-only
21:40phomes_[d]: now I am not sure what to go for next. I could try to compare the shader compiles, but that is not really my comfort zone 🙂
21:41mhenning[d]: Is renderdoc working well for you right now? For most stuff I run into https://gitlab.freedesktop.org/mesa/mesa/-/issues/14518 and can't trace at the moment
21:42mhenning[d]: also, do the slow depth-only calls have fragment shaders specified?
21:42phomes_[d]: I had to use an earlier version of nvk. But performance had not changed, so I thought the results would still be useful
21:43mhenning[d]: The witcher 3 example had no fragment shader which is a bit odd and I was wondering if that was triggering the issue or if it's just a red herring
21:45mhenning[d]: phomes_[d]: okay, still need to fix the renderdoc issue then
21:45phomes_[d]: it does have a fragment shader
21:46phomes_[d]: decompile shows:
21:46phomes_[d]: ```glsl
#version 460
layout(set = 3, binding = 0, std140) uniform BUFFER_WORLD
{
    mat4 M_worldviewprojection;
    mat4 M_world;
    mat4 M_prevworldviewprojection;
    mat4 M_shadowCSM0;
    mat4 M_shadowCSM1;
    mat4 M_shadowCSM2;
    mat4 M_shadowCSM3;
    mat4 M_shadowCSM4;
    vec4 V_blendcolor;
    float F_alphascale;
    uint B_packedtangentframe;
    uint B_vertexdata0;
    uint B_vertexdata1;
    uint B_vertexdata2;
    uint B_useskinning;
} _90;

void main()
{
}
```
21:47mhenning[d]: okay, that's not the issue then
22:15marysaka[d]: phomes_[d]: Is there any change in terms of descriptor sets in between?
22:18marysaka[d]: One theory I have is that rebinding cbufs causes a WFI; the NVIDIA proprietary driver seems to use `LOAD_ROOT_TABLE` (and `LOAD_CONSTANT_BUFFER` before Turing) to update the ones that are bound
22:19marysaka[d]: I know that cbufs have a concept of versioning that would at least allow a new version of a cbuf to be used without invalidation
22:21mhenning[d]: Yeah, I have some initial work on using LOAD_ROOT_TABLE and friends
22:22mhenning[d]: or at least I figured out roughly how they work. I haven't actually adapted the root descriptor to use it yet
22:22marysaka[d]: Yeah would be nice to see
22:23marysaka[d]: but I wonder if we could rework how we handle things a bit to be able to use LOAD_CONSTANT_BUFFER on other cbufs than 0
22:27mhenning[d]: maybe for data from opt_large_constants?
22:29marysaka[d]: That could probably be a nice easy one
22:30mhenning[d]: In the general case I guess you could LOAD_CONSTANT_BUFFER and then stream in some data with the command streamer but I'd be a little surprised if that were more efficient
22:33marysaka[d]: Yeah I'm not sure.... I kind of still want to write some tests to see if `BIND_GROUP_CONSTANT_BUFFER` can cause a WFI at some point anyway
22:33marysaka[d]: and from the bit I gathered, the NVIDIA blob seems to stream descriptor set updates under certain conditions that I'm not sure about
22:33mhenning[d]: yeah, that's worth testing
22:51airlied[d]: Are we missing some early depth handling or anything like that?
22:53mhenning[d]: There's a branch for zcull that doesn't fix this issue
22:56phomes_[d]: marysaka[d]: There are a bunch of CmdBindDescriptorSets before each CmdDrawIndexed. For 1 frame there are 11k total calls to CmdBindDescriptorSets
23:04phomes_[d]: if anyone wants to look at the perfetto trace, renderdoc captures etc then I put them in folder here: https://drive.google.com/drive/folders/1oBGfd4gYiFfC8SKJS20oaN_BCkubQfHq?usp=drive_link