05:41phomes_[d]: karolherbst[d]: +1 fps in Recipe for disaster
16:20karolherbst[d]: I've noticed we don't use `nir_opt_fp_math_ctrl` yet and wondering how much that matters, given `Applications like DXVK spam float control restriction flags on all ALU instructions.` which is stated in the opt
16:25karolherbst[d]: okay.. who is working on `u_trace`, mohamexiety[d] was that you?
16:26mohamexiety[d]: yeah
16:31mohamexiety[d]: what's up
16:32karolherbst[d]: well I want to know which shaders are used the most in games 😄
16:33karolherbst[d]: and I've heard this could help with that
16:35mhenning[d]: karolherbst[d]: nir_opt_fp_math_ctrl is probably worth trying
16:36karolherbst[d]: yeah... I had the same thought yesterday
16:36karolherbst[d]: maybe I get to it today, maybe I won't
16:39mohamexiety[d]: i am deep trying to solve some nasty sync bugs sadly :/
16:39mohamexiety[d]: but might have found a few not too terrible solutions
16:40mohamexiety[d]: i dont think it can help you find which shaders are used the most though? :thonk:
16:40karolherbst[d]: mohamexiety[d]: `Then integrating u_trace + perfetto would give you a nice overview of game rendering. You can add which shaders draw call is using to drawcall's tracepoint and you e.g. would be able to focus on high impact ones.` is what I got told
16:41mohamexiety[d]: ahhh
16:41mohamexiety[d]: yeah that can work
16:43marysaka[d]: oh yeah totally yeah
16:45karolherbst[d]: but yeah.. I think we have some low hanging fruits with x-com 2.. my GPU is screaming when running it, but I get 1/3 of the blob perf in the main menu 🙃
16:45karolherbst[d]: so wondering if I could figure out what's up there
16:45marysaka[d]: Is there some compute dispatches?
16:45marysaka[d]: because we use MME also on Turing after all
16:45karolherbst[d]: I don't know 🙂
16:45karolherbst[d]: ohhhh right...
16:46marysaka[d]: so that could be terrible for your poor TU102
16:46karolherbst[d]: I tested on my TU102 🥲
16:46marysaka[d]: yeah
16:46karolherbst[d]: I should check with my GA102 🙃
16:46marysaka[d]: tho considering phomes testing it might be different to that and all
16:49karolherbst[d]: yeah.. I was just surprised it was way worse.. but if it's compute, then maybe it's "trivial" to improve the perf there..
16:51marysaka[d]: Yeah... my whole MRs about merging barriers and having DMA utils was to start having different path to stay as much as possible on the last subchannel
16:51karolherbst[d]: I should try that one with it...
16:51marysaka[d]: One thing I want to type is query pools support on 3D for copy and clear
16:51karolherbst[d]: mhhhh
16:51karolherbst[d]: soo
16:51karolherbst[d]: what I did see was, that with x-com2 the app is hammering events and queries
16:53marysaka[d]: well if it's query, we use compute and meta for copy sooo
16:53karolherbst[d]: like a lot of CPU cycles spent on `nvk_GetEventStatus` and `nvk_GetQueryPoolResults`
16:53karolherbst[d]: almost as if it busy waits on those 🙃
16:53karolherbst[d]: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/c53cbf555a8d63e609b199373b5a637bb457d548 made `nvk_GetEventStatus` almost 0 overhead
16:54karolherbst[d]: and it showed up as 10% CPU cycles without that
16:54marysaka[d]: karolherbst[d]: hmm you should probably keep the NVKMD_MEM_COHERENT but yeah I think it should be on GART
16:54marysaka[d]: query pools should be on gart already at least
16:54karolherbst[d]: yeah... the GPU only writes to it once, right?
16:55karolherbst[d]: and polling over PCIe sounds worse 🙃
16:55karolherbst[d]: than the GPU writing a value once
16:55marysaka[d]: karolherbst[d]: yes but I'm not too sure how it reads it with acquire
16:55karolherbst[d]: mhhhh... right...
16:55marysaka[d]: not that it should be that terrible anyway
16:55karolherbst[d]: atomics over pcie also suck
16:57mhenning[d]: I'm wondering if it should be in vram but do an l2 cache flush along with the signal or similar
16:57karolherbst[d]: sadly it doesn't seem to help with perf much...
16:57marysaka[d]: karolherbst[d]: Hmm did you try with my host barrier patches btw
16:58karolherbst[d]: yeah... dunno.. I'm more concerned about the CPU busy looping around `nvk_GetEventStatus` 🙃
16:58karolherbst[d]: marysaka[d]: nope
16:58karolherbst[d]: the issue was, that it wasn't the only CPU expensive thing
16:58karolherbst[d]: and the same thread was also hammering on `nvk_GetQueryPoolResults`
16:59karolherbst[d]: and `nvk_sync_queries_from_gpu` was quite expensive
16:59marysaka[d]: karolherbst[d]: that shouldn't be possible
16:59marysaka[d]: there is a problem then because all mapping should be considered coherent on x86
17:00karolherbst[d]: or maybe it was `nvk_query_is_available`...
17:00karolherbst[d]: I should check again
17:00marysaka[d]: that might be the case but yeah that's "normal" if the value is visible but way too late
17:02marysaka[d]: also unrelated but shouldn't we be checking for device lost state in `nvk_GetEventStatus`?
17:25mhenning[d]: marysaka[d]: not sure if we need to or not. A full check to see if it's lost since the last operation is a uapi call which would make that function a lot more expensive
17:32marysaka[d]: mhenning[d]: yeah I was mostly wondering because I'm seeing most driver in mesa do a check with `vk_device_is_lost`...
17:56karolherbst[d]: okay.. `nir_opt_fp_math_ctrl` time
18:02karolherbst[d]: oh yeah.. that's looking good
18:02karolherbst[d]: but not as good as it would matter with anything much 😄
18:03karolherbst[d]: phomes_[d]: if you want to give it a try: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/2ca49faa9693df102c97f0f35ff4229a12d5a4da
18:04karolherbst[d]: some games seem to be more heavily impacted than others it seems
18:04karolherbst[d]: batman e.g.
18:04airlied[d]: Radv puts query buffer in gart
18:05karolherbst[d]: yeah, we do as well. But the event status is in VRAM
18:06esdrastarsis[d]: karolherbst[d]: which batman
18:15zmike[d]: I'm batman
18:16karolherbst[d]: oh I found more passes.. maybe...
18:17karolherbst[d]: mhhh another one: `nir_opt_move_discards_to_top` maybe?
18:21mhenning[d]: karolherbst[d]: I thought that was something that hurt some games and helped others?
18:22karolherbst[d]: mhhhh
18:22mhenning[d]: also if anyone wants to review a small opt loop improvement: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41118
18:22karolherbst[d]: in stats or in perf?
18:22mhenning[d]: in perf, I thought, although I forget where I read that
18:22karolherbst[d]: my hope would be that if invocations are terminated earlier that that's better
18:23karolherbst[d]: though it also only helps if the entire quad gets terminated
18:23karolherbst[d]: but anyway.. will focus on the fp_math thing first, and also checking if `nir_opt_algebraic_distribute_src_mods` helps at all
18:23karolherbst[d]: it _looks_ like it would help
18:24karolherbst[d]: a little bit
18:25phomes_[d]: karolherbst[d]: XCOM 2 is same fps as before. Will do the full tests in a bit. Also Mel's MR
18:25karolherbst[d]: phomes_[d]: oh maybe wait a bit then
18:25karolherbst[d]: I'll add another opt
18:25karolherbst[d]: maybe
18:25karolherbst[d]: stats:
18:25karolherbst[d]: Totals from 77360 (6.38% of 1212873) affected shaders:
18:25karolherbst[d]: CodeSize: 1255332672 -> 1250129888 (-0.41%); split: -0.44%, +0.03%
18:25karolherbst[d]: Number of GPRs: 4233257 -> 4226625 (-0.16%); split: -0.20%, +0.05%
18:25karolherbst[d]: Static cycle count: 937314398 -> 935865851 (-0.15%); split: -0.22%, +0.07%
18:25karolherbst[d]: Spills to memory: 11371 -> 11373 (+0.02%)
18:25karolherbst[d]: Fills from memory: 11371 -> 11373 (+0.02%)
18:25karolherbst[d]: Spills to reg: 24245 -> 24262 (+0.07%); split: -0.65%, +0.72%
18:25karolherbst[d]: Fills from reg: 23689 -> 23742 (+0.22%); split: -0.55%, +0.77%
18:25karolherbst[d]: Max warps/SM: 2912604 -> 2916096 (+0.12%); split: +0.15%, -0.03%
18:25karolherbst[d]: a bit all over the place
18:26phomes_[d]: DXVK_HUD=full gives us a lot of info like number of compute shaders for XCOM 2 in case there is something to find there
18:27mhenning[d]: phomes_[d]: fwiw, https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41118 has identical stats on shaderdb. It's only meant to improve shader compile times, shouldn't affect anything else unless something's going wrong
18:27karolherbst[d]: phomes_[d]: we really need async compute I feel :ferrisUpsideDown: but yeah.. any compute heavy game will suffer from WFI as things are today, so that's not great
18:29karolherbst[d]: ohhh yeah
18:29karolherbst[d]: `nir_opt_algebraic_distribute_src_mods` helps even more
18:29karolherbst[d]: but not sure if it's a good trade of 😄
18:30karolherbst[d]: phomes_[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41214 if there are regressions with both I'd be curious which one of the passes is causing it
18:33karolherbst[d]: I'm kinda curious though.. I thought I looked into `nir_opt_algebraic_distribute_src_mods` before, but maybe now that copy prop is more useful it actually helps
18:37karolherbst[d]: but yeah.. both passes are in the "yeah might improve things in the sub 1%" area..
20:30phomes_[d]: karolherbst[d]: they were
20:31karolherbst[d]: pain 🥲
20:59zmike[d]: karolherbst[d]: re: gpu profiling, renderdoc timings aren't totally bogus if you're just trying to find outliers (assuming you have semi-credible timestamps)
21:00karolherbst[d]: mhhhh, maybe?
21:00zmike[d]: from there you can examine/fossilize the shader
21:01karolherbst[d]: yeah... I think my issue is mostly I have never done any of that yet..
21:01zmike[d]: if you e.g., bookmark a specific draw in a rdoc capture, you can then relaunch rdoc with fossilize layer, play that draw, and then you'll have a fossilize file with just that pipeline
21:01karolherbst[d]: so I'm kinda lost where to even start
21:02zmike[d]: capture a frame, click the clock icon above the event list, and it'll run timers for all the commands
21:02karolherbst[d]: I think the issue is that the way we do that skewes the results, because our 3D vs compute situation is a bit odd
21:02zmike[d]: you can also view perf counters from the dedicated tab on the right side if you have that view active
21:03zmike[d]: yeah it won't be super accurate
21:03zmike[d]: but if you do have any outliers perf-wise those are usually reflected and obvious
21:03karolherbst[d]: but yeah.. if it can tell me "looks, this draw takes like 5 seconds, you wanna do something about that" that might already help 🙃
21:03zmike[d]: this is caveman-level profiling if you have nothing else available but timestamps
21:04phomes_[d]: I ran some of those for XCOM 2 and put them in my visualizer for comparing prop driver to mesa
21:04phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1498429664342048959/image.png?ex=69f120df&is=69efcf5f&hm=576afed099d0db4cab8c7f95b375c3aeedd392ec52540aedaf977ad884e150e0&
21:04karolherbst[d]: that uhm.. that big one looks like a problem
21:05karolherbst[d]: but yeah..
21:05karolherbst[d]: the issue is that I think something about compute is messed up
21:05karolherbst[d]: like we do certain things on compute and that causes WFIs
21:05karolherbst[d]: but yeah.. the big one would be interesting to look at
21:05airlied[d]: should be easier to write vkoverhead tests for that
21:06zmike[d]: probably marek's thing and not vkoverhead since vkoverhead doesn't execute anything on the gpu
21:06zmike[d]: I assume you're already doing this, but the ideal scenario is isolating a frame using a gfxreconstruct trace/trim so you can profile back and forth vs blob
21:07karolherbst[d]: phomes_[d]: didn't you share the shader of the draw call with me?
21:07mhenning[d]: karolherbst[d]: uh, yeah nvk_CmdWriteTimestamp2 always hits the 3d engine, which forces a wfi if you've been doing compute stuff
21:08mhenning[d]: we might need to decrease the overhead of timestamps a bit
21:08karolherbst[d]: yeah....
21:08phomes_[d]: karolherbst[d]: That was for one of the later passes. There was for the draw call that was slowest compared to prop
21:09karolherbst[d]: phomes_[d]: ahh.. okay. I want this big orange one 😄
21:10karolherbst[d]: but yeah.. definitely something odd going on there
21:11karolherbst[d]: like that feels like something is wrongly configured even
21:12karolherbst[d]: no matter what WFI silliness we do, I doubt it would cause a x10? decrease in performance in a single draw
21:13karolherbst[d]: I suspect this bug is somewhere in the "we are doing more work than necessary" area
21:13karolherbst[d]: or "we run less threads than we could"
21:15phomes_[d]: karolherbst[d]: the render pass is a group of many draw calls that use several different shaders. I could group them by shader instead
21:16karolherbst[d]: ohh, mhhh
21:16karolherbst[d]: might showing a list of all the commands in that render pass?
21:16karolherbst[d]: *mind
21:19phomes_[d]: It is 229 calls of vkCmdDrawIndexed wrapped in a vkCmdBeginRendering/vkCmdEndRendering
21:19karolherbst[d]: nothing else?
21:19karolherbst[d]: like...
21:19karolherbst[d]: vkCmdBeginRendering
21:19karolherbst[d]: vkCmdDrawIndexed
21:19karolherbst[d]: ... repeate 228 times
21:19karolherbst[d]: vkCmdEndRendering
21:19karolherbst[d]: ?
21:22phomes_[d]: yes
21:23karolherbst[d]: mhhhhhhh
21:23phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1498434537787625612/image.png?ex=69f12569&is=69efd3e9&hm=d50d418719e8872a399faf8e578d483ddc77327b54bd28369f150e587fee19e5&
21:24karolherbst[d]: any of them standing out in terms of duration, or are they all kinda similar slow?
21:25karolherbst[d]: but also...
21:25karolherbst[d]: those on the prop side are looking....
21:26karolherbst[d]: all I could imagine is that `nvk_cmd_flush_gfx_state` doing some heavy flushing
21:26phomes_[d]: if duration is very short then renderdoc cannot measure and just give a 0 value
21:27karolherbst[d]: let me check if I find something weird
21:28karolherbst[d]: can you run only that renderpass?
21:29karolherbst[d]: or up to that?
21:29karolherbst[d]: with `NVK_DEBUG=push_dump`
21:29karolherbst[d]: kinda curious what's happening there in real
21:30phomes_[d]: not sure if I can run just the renderpass. But let me check
21:30karolherbst[d]: well
21:30karolherbst[d]: running the things prior would be fine
21:31karolherbst[d]: so everything included that render pass
21:38karolherbst[d]: but it's kinda weird.. I don't see anything there being a problem unless the indirect draw macro is doing something silly...
21:40karolherbst[d]: or we just suck at back to back draws
21:41mhenning[d]: Is it possible it's just taking some time to clock up in that trace? I've seen a lot of variability between runs with renderdoc timestamps in the past.
21:41karolherbst[d]: ~~maybe we spend most time in the macro spinning on the loop...~~
21:41marysaka[d]: phomes_[d]: there is like really no change of descriptor sets?
21:42marysaka[d]: karolherbst[d]: we do emit some writes to cb0 when changes...
21:42karolherbst[d]: mhenning[d]: so you suggest running glxgears in the background and redo the timestamps? 🙃
21:42karolherbst[d]: marysaka[d]: mhhhh
21:43zmike[d]: yeah typically you want to peg your gpu freq when you're profiling
21:43zmike[d]: same as cpu
21:43marysaka[d]: karolherbst[d]: SET_GLOBAL_BASE_VERTEX_INDEX and SET_VERTEX_ID_BASE are not guarded by any MME scratch too
21:43karolherbst[d]: I wonder if we are thrashing some state and the GPU goes constantly into WFI
21:44mhenning[d]: karolherbst[d]: not really sure what to do about it. I guess ideally we'd hook up some of the frequency control on the kernel side
21:44karolherbst[d]: yeah...
21:44karolherbst[d]: Mary is on it, or so I've heard
21:44marysaka[d]: problem about freq control, it's using private GSP commands btw
21:44karolherbst[d]: (it also sounded like she being in pain working on it)
21:44marysaka[d]: I tried to use the interface to lock frequency to base that is public but it seems to not behaves properly so I sitll have to dig more into it
21:45karolherbst[d]: what's nvidia doing with their "prefer performance" or wahtever thing? but that's also not capping, it's just making it go vroooom all the time
21:46zmike[d]: mhenning[d]: you can mitigate this somewhat by just spamming the timestamp button a bunch of times
21:46phomes_[d]: timestamps change when rerunning it. But it is roughly is the same time summed for the render pass.
21:46karolherbst[d]: phomes_[d]: tried the spamming method? 😄
21:46mhenning[d]: zmike[d]: yeah, I've been there before 😛
21:47karolherbst[d]: anyway.. we know for certain that this game runs like really bad
21:47karolherbst[d]: but it's also kinda odd, that like...
21:47karolherbst[d]: doing draws back to back like this is causing issues
21:47karolherbst[d]: so I'm wondering if this is some funky case of nvidia outsmarting us
21:48karolherbst[d]: or we just doing something silly
21:48marysaka[d]: we should probably test to hack around and not do what we do in nvk_mme_build_set_draw_params
21:48zmike[d]: I would make a gfxr capture, trim to a single frame, and then dump your cmdbuf packets against blob for easy compare
21:48zmike[d]: should be easy to find the problem renderpass
21:48karolherbst[d]: ~~is it reallly fast if you skip the P_1INC(p, NV9097, CALL_MME_MACRO(NVK_MME_DRAW_INDIRECT)); part~~
21:48marysaka[d]: those could very well be tracked outside of the macro for anything non-indirect
21:49phomes_[d]: marysaka[d]: yes sorry. I had to set the filter to `$action() Descriptor` and they showed up in renderdoc
21:50marysaka[d]: okay so it totally could be the rebinding too
21:50karolherbst[d]: though that opens up the question.. what if some games are slow even if we skip all draws?
21:50phomes_[d]: I will go get the gfxr capture now
21:50mhenning[d]: zmike[d]: are gfxr captures portable across drivers? with renderdoc we can't run the same capture on prop and nvk
21:51zmike[d]: yes
21:51zmike[d]: across completely different hardware too
21:51mhenning[d]: okay good to know
21:51zmike[d]: you may need `-m rebind`
21:51zmike[d]: also for profiling run with `--wsi headless`
21:51zmike[d]: and you'll want to build the dev branch of gfxr because main is just for sdk releases
21:52zmike[d]: and obviously `--pcj -1` for unlimited async shader compile (which somehow isn't default)
21:52mhenning[d]: obviously
21:53marysaka[d]: zmike[d]: ... you are a lifesaver :blobcatnotlikethis:
21:53karolherbst[d]: I should build that at some point...
21:54zmike[d]: I may or may not be the world's biggest gfxr user so feel free to ping if you have issues
22:07phomes_[d]: gfxr capture: https://drive.google.com/file/d/1_IQftSrQTMhWORKEpAUcoiqa1_J60Un7/view?usp=drive_link
22:20karolherbst[d]: `Replay has encountered a fatal error and cannot continue: std::bad_alloc` 🥲
22:20karolherbst[d]: do I need to be like on the identical version or something?
22:24phomes_[d]: I was on this MR https://github.com/LunarG/gfxreconstruct/pull/2568 back when I built it
22:28karolherbst[d]: funny, getting the same error..
22:30karolherbst[d]: just checking.. what's the sha512sum of that file? It's `c7722bece985e6a015d7036d8ee394aef3108235200ca3f3b446bcf52fa67e8578bf18fb4a91d9b20f183215a0967e4488f9663b96f83cc34d90ea4ea7bfd5bc` here
22:32phomes_[d]: same sum here
22:34phomes_[d]: not sure if it helps, but here is the same file but run trough gfxrecon-optimize https://drive.google.com/file/d/1ndf4CTaEURg0KR-ReBDUHuO7q6DdfCoe/view?usp=sharing
22:35karolherbst[d]: ahh building with debug helps: `gfxrecon-replay: /home/kherbst/git/gfxreconstruct/framework/decode/struct_pointer_decoder.h:133: size_t gfxrecon::decode::StructPointerDecoder<T>::Decode(const uint8_t*, size_t) [with T = gfxrecon::decode::Decoded_VkApplicationInfo; size_t = long unsigned int; uint8_t = unsigned char]: Assertion `(GetAttributeMask() & format::PointerAttributes::kIsStruct) ==
22:35karolherbst[d]: format::PointerAttributes::kIsStruct' failed.`
22:35karolherbst[d]: whatever that means
22:36karolherbst[d]: zmike[d]: I'm having issues
22:37mhenning[d]: have you tried turning it off and back on? 😛
22:39zmike[d]: never seen that before
22:39zmike[d]: file a ticket I guess and I'll try to chase it
22:42karolherbst[d]: I can't rule out it's not related to the captures I've gotten here tho...
22:43karolherbst[d]: maybe I should create my own and see if that works
22:46phomes_[d]: I grabbed mine with this (I turned off the in-game option that was set for F9) : `MESA_SHADER_CACHE_DISABLE=true VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_gfxreconstruct GFXRECON_CAPTURE_TRIGGER="F9" GFXRECON_CAPTURE_TRIGGER_FRAMES=1 GFXRECON_CAPTURE_FILE="/home/phomes/new-tests/traces-db/trace.gfxr" %command%`
22:49karolherbst[d]: yeah...
22:49karolherbst[d]: so the gfxr file I created myself works
22:50karolherbst[d]: phomes_[d]: what's the `gfxrecon-info` output of one of those files?
22:51phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1498456538279575702/message.txt?ex=69f139e6&is=69efe866&hm=d3d93d31480722f12a7d419a060041cbbdd67d40c11ee8e4c3f668a068952120&
22:53karolherbst[d]: phomes_[d]: I wonder if that PR branch was kinda fucked and if doing a new trace on the newest dev branch would be better
22:53karolherbst[d]: let me see if I can get on the exact same commit
22:53phomes_[d]: I will do that. Also... "Engine name: mesa zink"
22:56karolherbst[d]: yeah.. getting the same assert even on the PR branch
22:56karolherbst[d]: but it's a different commit hash
22:56karolherbst[d]: ohh `810ec2d` exists in the fork..
22:56karolherbst[d]: let's check that
23:05karolherbst[d]: you won't believe this
23:06karolherbst[d]: on the exact commit it indeed works
23:07karolherbst[d]: [gfxrecon] ERROR - /home/kherbst/git/gfxreconstruct/external/VulkanMemoryAllocator/include/vk_mem_alloc.h:14840 pCreateInfo->vulkanApiVersion == 0 || (VK_VERSION_MAJOR(pCreateInfo->vulkanApiVersion) == 1 && VK_VERSION_MINOR(pCreateInfo->vulkanApiVersion) <= 3)
23:07karolherbst[d]: [gfxrecon] INFO - Replay adjusted the vkGetPhysicalDeviceSurfaceFormatsKHR array count: capture count = 2, replay count = 0
23:07karolherbst[d]: mhh might be better that I do my own traces anyway