00:02 Timmy[m]: Greetings to all great minds in the room 🤗... (full message at <https://matrix.org/oftc/media/v1/media/download/AaaOdxh4cjJrlVLRLXvzP2Pyrz3SLqmMsEHTNHDHU33OYaimTdhGzHaD2pS-JGkfA70b2m1qUHtgCSJHe3ixhdFCeYjCKGeAAG1hdHJpeC5vcmcvTWFXRHhvSENTSEFUUklieXVWY1FaYVpL>)
00:02 sonicadvance1[d]: Oh, they aren’t talking to me then. Perfect.
00:06 karolherbst[d]: how convenient that the matrix bridge does this so everybody has to click the link just to get disappointed
00:17 swee: lmao
00:20 swee: this is why there will never be a telegram bridge
00:20 swee: to irc
00:35 karolherbst[d]: ah yes.. those membars are all very important: https://gist.githubusercontent.com/karolherbst/39b601b990c8bd1a263c475369ca80c2/raw/0e0b13502c33a56068af1509d0baa06b5e6be7c5/gistfile1.txt
12:48 snowycoder[d]: What does this warning mean in dEQP runs?
12:48 snowycoder[d]: ERROR - dEQP error: SPIR-V WARNING:
12:48 snowycoder[d]: ERROR - dEQP error: In file ../src/compiler/spirv/spirv_to_nir.c:1459
12:48 snowycoder[d]: ERROR - dEQP error: A pointer to a structure decorated with *Block* or *BufferBlock* must not have an *ArrayStride* decoration
12:48 snowycoder[d]: ERROR - dEQP error: 2212 bytes into the SPIR-V binary
12:59 mohamexiety[d]: ignore it
13:00 mohamexiety[d]: export MESA_SPIRV_LOG_LEVEL=error
13:00 mohamexiety[d]: export MESA_VK_ABORT_ON_DEVICE_LOSS=1
13:00 mohamexiety[d]: pass in those for full CTS runs to minimize spam etc
14:46 karolherbst[d]: what's a bit scary is that NIR is sometimes able to see through those shared_store/shared_load dances the CTS tests are doing, which sometimes means the shared ops are eliminated entirely 🙃
15:12 karolherbst[d]: but also... there is still stuff to DCE before going into NAK, so it kinda feels like we're missing a dce call at the end, right before from_nir 🙃
15:13 karolherbst[d]: ohh yeah.. maybe nak_postprocess_nir needs to call dce?
15:13 gfxstrand[d]: We can call dce again. It's safe to call pretty much anywhere.
15:13 karolherbst[d]: not that it matters.. NAK DCEs it as well, might just speed up the compiler a bit tho
15:14 karolherbst[d]: but it's a bit strange
15:15 karolherbst[d]: though all the opt_dces are conditionally called inside nak_postprocess_nir, so maybe I got unlucky in some of the CTS tests 😄
15:15 karolherbst[d]: I created ldsm for a couple of tests and on the nir level they never got dced, but NAK did so later
15:15 karolherbst[d]: tests were passing and I was like: great, seems to work 🙃
15:18 snowycoder[d]: Latest rustc (version `1.88.0`) throws a lot of warnings around elided lifetimes during compilation.
15:18 snowycoder[d]: It's not critical but it would be nice to fix them
15:18 karolherbst[d]: it's the edition 2024 compat stuff, no?
15:19 snowycoder[d]: I don't know, there's just a `warning: hiding a lifetime that's elided elsewhere is confusing`.
15:19 snowycoder[d]: Doesn't seem like compatibility issues, just confusion issues
15:20 karolherbst[d]: ahh
15:22 karolherbst[d]: ldsm is quite funky.... I like it, I really want to use it outside of coop matrix stuff. It's not _that_ special and I think it's actually quite easy to use it elsewhere
15:23 karolherbst[d]: anything that does shared memory ops based on quads should be able to benefit from it
15:23 karolherbst[d]: like all it does is load a contiguous range of 16 bytes per quad
15:26 karolherbst[d]: so if each thread loads at base + thread_id * 4, ldsm can be used for it, with the base address being base + quad_id * 16. However, in this trivial case the address calculation barely matters; it's more interesting for more complex data layouts which are inherently quad-based
15:26 karolherbst[d]: though maybe nothing really benefits from it...
15:27 karolherbst[d]: if each quad loads from a different 16 byte aligned base address, then it would be super useful
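A small illustrative C check of the addressing equivalence described above; this is not NAK code, and the function names are made up. It only demonstrates that the four per-thread addresses base + thread_id * 4 of a quad fall inside one contiguous 16-byte window starting at base + quad_id * 16:

```c
/* Illustrative only (not NAK code): each thread's 4-byte shared load lands
 * inside its quad's 16-byte window, which is the access pattern a single
 * quad-wide ldsm-style load can cover. */
#include <assert.h>
#include <stdint.h>

static uint32_t per_thread_addr(uint32_t base, uint32_t thread_id)
{
   return base + thread_id * 4;
}

static uint32_t quad_base_addr(uint32_t base, uint32_t thread_id)
{
   uint32_t quad_id = thread_id / 4;
   return base + quad_id * 16;
}

int main(void)
{
   for (uint32_t tid = 0; tid < 32; tid++) {
      uint32_t qbase = quad_base_addr(0x100, tid);
      uint32_t taddr = per_thread_addr(0x100, tid);
      /* The per-thread load stays within [qbase, qbase + 16). */
      assert(taddr >= qbase && taddr + 4 <= qbase + 16);
   }
   return 0;
}
```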
17:19 gfxstrand[d]: karolherbst[d]: NIR's is better. Let's add more DCE.
17:20 karolherbst[d]: okay. will open an MR later then
17:20 karolherbst[d]: kinda want shader stats and see how much it missed 😄
17:20 gfxstrand[d]: And NIR's DCE is one of the few NIR passes that's actually unstructured-safe so we can run it at the very end.
17:21 karolherbst[d]: yeah.. maybe that's better than doing all the progress checks there
17:21 karolherbst[d]: mhhh
17:22 karolherbst[d]: it could impact divergence analysis, no?
17:22 karolherbst[d]: does NAK make use of the block convergence tags?
17:22 gfxstrand[d]: karolherbst[d]: No. It just deletes stuff
17:22 gfxstrand[d]: Don't run dead control flow. Just DCE.
17:23 karolherbst[d]: right..
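A minimal sketch of the idea, assuming it is enough to tack one unconditional pass call onto the end of nak_postprocess_nir; the real function takes more parameters than shown, and only NIR_PASS and nir_opt_dce are existing NIR names:

```c
#include "nir.h"

/* Sketch only: run NIR's DCE once more, unconditionally, at the very end of
 * nak_postprocess_nir() so no dead instructions are handed to the backend.
 * The real function signature has more parameters; they are elided here. */
void
nak_postprocess_nir(nir_shader *nir /* , ... */)
{
   /* ... existing, partly conditional, optimization passes ... */

   /* NIR's DCE is unstructured-safe, so it is fine to call it this late. */
   NIR_PASS(_, nir, nir_opt_dce);
}
```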
18:01 karolherbst[d]: build_txq_size asserts 🥲
18:02 karolherbst[d]: num_components being 0
18:02 karolherbst[d]: ohhhh
18:02 karolherbst[d]: duh....
18:03 karolherbst[d]: gfxstrand[d]: do you have a fix for the wrong argument order of build_txq_size?
18:04 karolherbst[d]: or MR rather...
18:06 gfxstrand[d]: Is a test failing?
18:07 karolherbst[d]: hits an assert in a fossil
18:07 gfxstrand[d]: Oh
18:07 karolherbst[d]: like the argument order is clearly wrong 😄
18:07 gfxstrand[d]: Nope. No idea. Feel free to make an MR.
18:07 karolherbst[d]: wondering how the CTS didn't catch it
18:08 gfxstrand[d]: <a:shrug_anim:1096500513106841673>
18:08 gfxstrand[d]: Just make sure you add a `Fixes:`
18:12 karolherbst[d]: luckily the breakage was merged after the branch point, but yeah: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36394
18:13 karolherbst[d]: uhh.. there is another crash
18:18 mhenning[d]: karolherbst[d]: has anyone actually run cts on that? my most recent run is earlier than that commit
18:20 karolherbst[d]: probably not
18:21 mhenning[d]: as a side note, I really hate the argument order in the nak change in https://gitlab.freedesktop.org/mesa/mesa/-/commit/688a639117e4fbf5d33261be5d41bcb798d593b9
18:22 mhenning[d]: can_speculate could have, for example, always gone last. but instead it's sprinkled either first or second with no rhyme or reason
18:23 karolherbst[d]: well the core issue is C just treating a bool as an int and an int as a bool 😛
18:23 mhenning[d]: sure, that sucks too
18:23 mhenning[d]: I wonder if there's a warning we could turn on for that
18:23 karolherbst[d]: anyway, the other crash I'm running into is on the 25.2 release 😢
18:24 karolherbst[d]: mhenning[d]: I'm sure it's impossible, because `if (some_int)` is such a common pattern, even sometimes for function args
18:24 karolherbst[d]: though
18:24 karolherbst[d]: using a bool as an int argument might be reliable
18:26 karolherbst[d]: though... I kinda want to see the MR converting them all to explicit checks
18:37 karolherbst[d]: apparently C23 did a thing on this
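For illustration, a small self-contained C program showing the hazard; lower_tex and its parameters are hypothetical and loosely modeled on the can_speculate discussion above, not the actual Mesa signature. Because C converts freely between bool and int, swapping the two arguments typically compiles without any diagnostic at default warning levels:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper, not the real Mesa function. */
static void lower_tex(int num_components, bool can_speculate)
{
   printf("num_components=%d can_speculate=%d\n",
          num_components, can_speculate);
}

int main(void)
{
   lower_tex(4, true);   /* intended call */
   lower_tex(true, 4);   /* arguments swapped: still compiles; num_components
                          * silently becomes 1 and can_speculate becomes true */
   return 0;
}
```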
18:38 karolherbst[d]: okay, the other regression is caused by https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36105/diffs?commit_id=d5037a34bb05f4304b1ccae70635f77612c3ada9
18:39 karolherbst[d]: nir validation error tho: error: def->loop_invariant == BITSET_TEST(loop_invariance, def->index) (../src/compiler/nir/nir_validate.c:1890)
18:40 karolherbst[d]: "con 32 %6958 = phi b23: %868, b78: %868"
18:41 karolherbst[d]: ehh, the error is on con 32 %6959 = @as_uniform (%6958) rather
18:41 karolherbst[d]: anyway... might need somebody who knows more about it to figure out what's wrong there
22:12 gfxstrand[d]: karolherbst[d]: File a bug with reproduction instructions
22:12 karolherbst[d]: run the radv fossils, but yeah, I should file a proper bug 🙃
22:16 gfxstrand[d]: mhenning[d]: Ugh... 😩
22:36 gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36396
22:38 karolherbst[d]: let's hope that doesn't cause similar issues 😛
22:39 gfxstrand[d]: review it
22:39 gfxstrand[d]: I doubt anyone really reviewed Marek's change
22:45 karolherbst[d]: though kinda cool we've reached the point where using nouveau is indeed faster than using an iGPU for gaming, and that's despite Intel having improved their GPUs quite a lot actually 😄
22:45 karolherbst[d]: though there are a lot of micro stutters
22:45 karolherbst[d]: 120 fps gaming :ferrisUwU:
22:46 karolherbst[d]: my iGPU only manages roughly 40 fps in that game 😄
22:46 karolherbst[d]: and I'm running my RTX a6000
22:46 karolherbst[d]: which is like almost the fastest ampere out there
22:48 karolherbst[d]: maybe I should recheck with main, because that's on 25.1
22:49 karolherbst[d]: but yeah.. Mary wants to fix the PCIe overhead issues, and I think that's gonna be quite important for perf
22:50 karolherbst[d]: Mary found some cases where changing the PCIe link from 4x to 8x gives like a 10% perf speedup, but 0% on nvidia
22:51 gfxstrand[d]: karolherbst[d]: Yeah, that's a good experiment to run.
22:52 karolherbst[d]: nvk should use the LAUNCH_DMA stuff to upload data to the gpu 😄
22:52 karolherbst[d]: and not memcpy
22:52 karolherbst[d]: but there might be more
22:52 karolherbst[d]: oh and prolly memory compression
22:52 gfxstrand[d]: I doubt that's making a massive difference
22:52 karolherbst[d]: I'm sure it does
22:53 karolherbst[d]: you copy the entire push buffer once and you _could_ pin it to VRAM if you never change it anymore, but random access writes over pcie? gonna suck
22:53 gfxstrand[d]: Oh, DMA for pushbufs? Yeah, maybe.
22:54 karolherbst[d]: the good thing about having the data uploads in the push buffer is that you only upload it once (e.g. if a game changes small pieces of data constantly between draws)
22:54 karolherbst[d]: if it's baked into the command buffers.. that's gonna help
22:54 gfxstrand[d]: Wait... I have no clue what you're getting at. What data should we be DMAing?
22:55 karolherbst[d]: the LAUNCH_DMA methods
22:55 karolherbst[d]: you can literally upload constant data with it
22:55 mohamexiety[d]: I did start typing up doing QMDs this way but focusing more on compression stuff first
22:55 karolherbst[d]: it's been there forever
22:55 gfxstrand[d]: What constant data do you think isn't in VRAM?
22:55 karolherbst[d]: `NVA097_LOAD_INLINE_DATA`
22:55 karolherbst[d]: nvk doesn't use it
22:55 karolherbst[d]: it should
22:55 gfxstrand[d]: Yes, I know what LOAD_INLINE_DATA is.
22:56 gfxstrand[d]: But I'm not sure what you think we should use it for
22:56 karolherbst[d]: instead of memcpy into gpu buffers directly
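A heavily hedged sketch of the proposal, assuming the inline-to-memory path goes through LAUNCH_DMA and NVA097_LOAD_INLINE_DATA as named above; struct nv_push is NVK's pushbuf builder, and every emit_* helper below is a hypothetical placeholder for the actual method macros, not real NVK API:

```c
/* Sketch only: push small CPU-side data into the command stream and let the
 * GPU write it to its destination, instead of memcpy'ing into a mapped
 * buffer. The emit_* helpers are hypothetical placeholders. */
static void
push_inline_upload(struct nv_push *p, uint64_t dst_va,
                   const uint32_t *dw, unsigned dw_count)
{
   /* 1. Destination address and length of the 1D "copy". (hypothetical) */
   emit_inline_dma_setup(p, dst_va, dw_count * 4);

   /* 2. Kick LAUNCH_DMA in its "payload follows inline" mode. (hypothetical) */
   emit_launch_dma_inline(p);

   /* 3. The payload itself, as LOAD_INLINE_DATA methods. (hypothetical) */
   for (unsigned i = 0; i < dw_count; i++)
      emit_load_inline_data(p, dw[i]);
}
```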
22:56 gfxstrand[d]: We're not in control of 90% of GPU buffers. The app is
22:56 gfxstrand[d]: There is no BufferSubData in Vulkan
22:56 karolherbst[d]: not that
22:57 karolherbst[d]: for big data uploads it doesn't matter
22:57 karolherbst[d]: it's more about all the little things
22:57 karolherbst[d]: e.g. uploading shaders
22:57 gfxstrand[d]: We're already DMAing shaders
22:57 gfxstrand[d]: shaders live in VRAM
22:57 karolherbst[d]: setting up vertex buffers
22:57 gfxstrand[d]: The client provides the vertex buffer. There's no `vkCmdVertex4f()`
22:58 karolherbst[d]: uploading descriptors
22:58 karolherbst[d]: whatever meta is doing
22:58 gfxstrand[d]: meta is barely used
22:58 gfxstrand[d]: For descriptors, we have no way to upload them. They don't get written from a command buffer. They get written from `vkUpdateDescriptorSets()`.
22:59 gfxstrand[d]: Also, they already live in VRAM with a WC map
23:00 karolherbst[d]: `nvk_cmd_buffer_flush_push_descriptors` might be a good place as well
23:00 karolherbst[d]: or `nvk_cmd_buffer_upload_data` generally depending on how often and how much data
23:01 karolherbst[d]: though I guess the latter uploads a cmd buffer?
23:01 gfxstrand[d]: The one place where I think it might matter is QMDs and CS root constants. I'm not sure those are in VRAM today.
23:01 mohamexiety[d]: QMDs aren’t at least
23:02 mohamexiety[d]: Root constants I don’t remember but we memcpy QMDs currently
23:02 gfxstrand[d]: I did a bunch of testing and determined that pushbufs being in VRAM didn't matter.
23:03 karolherbst[d]: mhhh
23:03 gfxstrand[d]: But QMDs and CS root constants might matter some. IDK.
23:03 karolherbst[d]: any internal UBOs being used?
23:04 gfxstrand[d]: Only CS root constants
23:04 gfxstrand[d]: For 3D, root constants live in memory in VRAM and we use constant uploads for them
23:04 karolherbst[d]: though updating through the push buffer might also mean that buffers would need to be reused instead of swapping to a different one with different content; I haven't looked super closely to see where the good places would be
23:05 karolherbst[d]: right.. constant uploads do also help
23:06 karolherbst[d]: I wish there were a good tool to simply tell us if the PCIe bus is busier than it should be...
23:06 karolherbst[d]: rather than guessing what's potentially causing issues
23:06 gfxstrand[d]: Yeah
23:07 gfxstrand[d]: If we have high PCI traffic, we might have another place where the CPU is reading back. I got rid of a bunch of those but there might still be some lying around.
23:07 karolherbst[d]: possibly
23:08 karolherbst[d]: I wonder if something could be done about UpdateDescriptorSets tho
23:08 gfxstrand[d]: We're already using WC maps of VRAM there
23:09 karolherbst[d]: yeah, but I suspect that API gets called a lot
23:09 karolherbst[d]: could have push buffers doing the update instead
23:09 gfxstrand[d]: We could maybe batch things more but there are limits
23:09 gfxstrand[d]: No we can't
23:09 karolherbst[d]: and just execute those buffers instead of memcpying
23:09 gfxstrand[d]: I mean maybe
23:09 karolherbst[d]: like async.. on the next submission, just add those internal ones
23:09 karolherbst[d]: or something
23:10 gfxstrand[d]: That's all kinds of fragile
23:10 karolherbst[d]: yeah...
23:10 gfxstrand[d]: Like the app could update then delete before we submit our push
23:10 karolherbst[d]: sure
23:10 gfxstrand[d]: The reality is that Vulkan just doesn't have many places where inline data is useful
23:11 karolherbst[d]: yeah.. though the descriptor set stuff does potentially look like a place if you rework the entire thing 🙃
23:11 gfxstrand[d]: Not really
23:12 gfxstrand[d]: Half the point of descriptor sets is that they live in memory and get accessed directly on the GPU rather than being put through a CPU upload path.
23:15 karolherbst[d]: I kinda hate how it's all a guessing game with PCIe
23:16 gfxstrand[d]: Yeah
23:21 gfxstrand[d]: We could also look at how those numbers change if we pin more stuff to VRAM rather than `VRAM | GART`
23:25 x512[m]: But according to my tests, some demos and applications start failing with out-of-mappable-range errors if GART is excluded.
23:26 x512[m]: VRAM mappable range is sometimes only 256 MB.
23:26 gfxstrand[d]: Yeah
23:26 gfxstrand[d]: I'm not saying we should force too much VRAM in production, just that it's a useful experiment
23:57 mhenning[d]: nvk_upload_queue_upload could probably pick inline for small upload sizes. I have no idea how much of a difference that would make though
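A minimal sketch of that idea, assuming a simple size cutoff decides between an inline (pushbuf) path and the existing DMA path; the threshold, helper names, and signature are all illustrative and do not match nvk_upload_queue_upload's real prototype:

```c
/* Sketch only: route tiny uploads through an inline/pushbuf path and keep the
 * existing DMA path for everything else. All names are illustrative. */
#define INLINE_UPLOAD_MAX 256u  /* bytes; an arbitrary, untuned threshold */

static void
upload(struct upload_ctx *ctx, uint64_t dst_va, const void *data, size_t size)
{
   if (size <= INLINE_UPLOAD_MAX)
      upload_inline(ctx, dst_va, data, size);  /* hypothetical pushbuf path */
   else
      upload_dma(ctx, dst_va, data, size);     /* hypothetical existing path */
}
```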