00:05 airlied[d]: marysaka[d]: did you have any ideas on how best to use LDSM?
00:08 marysaka[d]: airlied[d]: you can use it to load certain sizes of matrices from shared memory; I can pull out an example, probably tomorrow
00:09 marysaka[d]: or actually I might have some dumps on usami I forgot to push, let me see
00:10 airlied[d]: oh a dump of nvidia using it would be good, I was going to go make one at some point
00:10 airlied[d]: is the plan to add cmat_load_nv and lower cmat_loads to use it when we can pull 8x8 shared mem loads?
00:12 marysaka[d]: airlied[d]: yes that's the idea
00:12 marysaka[d]: there is also an 8x16 form
00:12 marysaka[d]: and we have MOVM to transpose matrices too
00:13 marysaka[d]: Hmm it seems I didn't push my codegen for load/muladd/store so I guess that will have to wait tomorrow
00:14 marysaka[d]: but for codegen I basically have this script that I used to grab all the layouts: <https://github.com/marysaka/usami/blob/master/scripts/generate_test_shaders.py> (output being <https://github.com/marysaka/usami/tree/master/coop_matrix_layout_store_shaders>)
00:15 marysaka[d]: and I have this over-engineered runner to test my layout prototype against the shader ASM (by emulating it) <https://github.com/marysaka/usami/blob/master/scripts/mat_store_index.py>
00:15 airlied[d]: okay, Turing only has 8x8, so I expect for 8x16 I would issue two
00:18 marysaka[d]: airlied[d]: Found some early notes I did https://github.com/marysaka/usami/blob/92326d30b34b7e2142f98d4cccf6e403aa0920e5/tests/compute/coop_mat/testing.comp.glsl#L184
00:19 marysaka[d]: but anyway I will generate the full set tomorrow and give the full dump for Turing and Ampere 👍
00:20 airlied[d]: thanks! this is great stuff!
10:14 marysaka[d]: airlied[d]: Seems I forgot to write the generator for that, so I just did and pushed the codegen for SM75 (it will be under the full_muladd subdirectory as it was quite verbose) https://github.com/marysaka/usami/tree/master/coop_matrix_layout_store_shaders
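A minimal sketch of the tiling arithmetic discussed above (the helper is hypothetical, not NAK code), showing why an 8x16 cooperative-matrix load would decompose into two 8x8 LDSM issues on Turing:

```rust
/// Hypothetical helper: count how many 8x8 LDSM tiles cover a
/// `rows` x `cols` cooperative-matrix load when only the 8x8 form exists.
fn ldsm_8x8_tiles(rows: u32, cols: u32) -> u32 {
    // Round each dimension up to a multiple of 8, then count tiles.
    rows.div_ceil(8) * cols.div_ceil(8)
}

fn main() {
    assert_eq!(ldsm_8x8_tiles(8, 8), 1);
    assert_eq!(ldsm_8x8_tiles(8, 16), 2); // two LDSM 8x8 issues on Turing
    assert_eq!(ldsm_8x8_tiles(16, 16), 4);
}
```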
14:11 snowycoder[d]: For NAK it could be useful to have unit tests for passes, to ensure that we don't have regressions in NAK-to-NAK lowerings or optimizations
14:32 gfxstrand[d]: Yeah. It would be nice if we had a decent framework for that.
14:35 gfxstrand[d]: I kinda wonder if we could do something with a trait to autogenerate the print routines and also autogenerate an assembler of sorts. 🤔
14:36 snowycoder[d]: That would be awesome! Seems something a proc-macro could handle
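A rough sketch of that trait idea (names and shapes are made up for illustration, not NAK's actual IR types): a per-op description trait that a derive-style proc-macro could implement, where the same metadata drives the printer and could also drive a small round-trip assembler for pass unit tests.

```rust
/// Illustrative only: a per-op description trait that a proc-macro could
/// derive from the op struct's fields.
trait OpDescribe {
    fn name(&self) -> &'static str;
    fn srcs(&self) -> Vec<String>;
}

/// A stand-in op with register-number sources (purely illustrative).
struct OpIAdd3 {
    srcs: [u32; 3],
}

impl OpDescribe for OpIAdd3 {
    fn name(&self) -> &'static str {
        "iadd3"
    }
    fn srcs(&self) -> Vec<String> {
        self.srcs.iter().map(|r| format!("r{r}")).collect()
    }
}

/// The printer is generic over the trait; a matching parser over the same
/// metadata would give the "assembler of sorts" for writing test inputs.
fn print_op(op: &dyn OpDescribe) -> String {
    format!("{} {}", op.name(), op.srcs().join(", "))
}

fn main() {
    let op = OpIAdd3 { srcs: [0, 1, 2] };
    assert_eq!(print_op(&op), "iadd3 r0, r1, r2");
}
```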
15:01 snowycoder[d]: gfxstrand[d]: Commit `f18483d2657 ("nak: Use suld.constant when ACCESS_CAN_REORDER is set")` is causing DeviceLost on some tests as `dEQP-VK.binding_model.shader_access.primary_cmd_buf.bind2.storage_image.vertex.multiple_contiguous_descriptors.1d_array`
15:01 snowycoder[d]: Maybe turing does not support that op?
15:03 karolherbst[d]: yeah, it's ampere and newer
15:09 gfxstrand[d]: Damn...
15:26 gfxstrand[d]: I've got a fix. Doing a bit of testing now.
15:28 karolherbst[d]: I have an uhm.. idea what to do about those
15:29 karolherbst[d]: we have this to convert read only images to textures and... I wonder if that would help
15:29 gfxstrand[d]: The problem is that I just converted textures to read-only images. :blobcatnotlikethis:
15:30 gfxstrand[d]: This is a good excuse to figure out the `strong.sys` vs. `strong.cta` perf
15:30 gfxstrand[d]: I'm testing now
15:30 gfxstrand[d]: Veilguard is really damn sensitive to texel buffer perf
15:30 karolherbst[d]: I don't see why read-only images would be any faster
15:30 karolherbst[d]: or is it just for perf testing?
15:30 gfxstrand[d]: Oh, they're not. But `suld` handles massive descriptors and `tld` doesn't.
15:31 karolherbst[d]: I see
15:31 gfxstrand[d]: I had to switch all buffers (even uniform) to `suld` to make my crazy EDB hacks work.
15:31 gfxstrand[d]: Then perf was terrible, so I switched to `suld.constant` but that's busted
15:31 snowycoder[d]: Just to catch up, what's the difference between memorder constant and weak? The PTX ISA doc doesn't document a constant order
15:31 gfxstrand[d]: So now I'm testing `suld.strong.cta`
15:32 karolherbst[d]: I see
15:32 gfxstrand[d]: I'm not sure exactly how different constant is vs. weak
15:32 gfxstrand[d]: I suspect constant is able to optimize more because it will never dirty anything ever
15:32 gfxstrand[d]: Weak has to land eventually
15:32 snowycoder[d]: Ah, so const means the data we are accessing is constant
15:34 karolherbst[d]: constant has unpredictable results if you write to it
15:35 gfxstrand[d]: I think we can probably fall back to weak if we don't have constant
15:35 gfxstrand[d]: But I'm also taking this as an opportunity to figure out how different cta and sys are in practice
15:37 gfxstrand[d]: Also, playing around with this yesterday I realized that even the loading screen of *Dragon Age: The Veilguard* only runs at 50 FPS and that drops to 2 with slow texel buffers. It's literally got one tiny thing animating on it. I'm going to get a RenderDoc grab of it today because it might give me something really targeted to look at. There's no way a progress bar should only run at 50 FPS.
15:57 magic_rb[d]: It must be a *really* fancy progress bar damn
16:00 gfxstrand[d]: Yeah
16:01 gfxstrand[d]: Okay, `.strong.sys` is 2 FPS, `.strong.cta` is 50 FPS
16:02 gfxstrand[d]: Now let's try .weak just to sanity check.
16:02 gfxstrand[d]: I think this gives more than enough evidence to go ahead with mhenning[d] 's memory order MR.
16:03 karolherbst[d]: .weak is SC-DRF btw
16:03 gfxstrand[d]: SC-DRF?
16:04 karolherbst[d]: mhh maybe I need to rephrase
16:05 karolherbst[d]: it's just weak as in memory model, kinda like arm
16:05 gfxstrand[d]: Yeah, that's fine.
16:05 gfxstrand[d]: I'm planning to use it instead of `.constant` on Turing and Volta.
16:05 gfxstrand[d]: Because it should be more efficient than `.strong.cta`
16:05 karolherbst[d]: do you even need strong?
16:06 gfxstrand[d]: No I don't. That's the point
16:06 karolherbst[d]: I see
16:06 gfxstrand[d]: I don't have `.constant` and I need something
16:06 gfxstrand[d]: So I'm going for the weakest possible thing
16:06 gfxstrand[d]: So `.weak`
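The fallback being described, as a minimal sketch (the enum and SM-number check are illustrative, not NAK's actual definitions; the grounded facts are that `.constant` is Ampere-and-newer and that `.weak` is the substitute on Turing/Volta):

```rust
/// Illustrative enum, not NAK's actual definition.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum LoadOrder {
    Constant, // reorderable, data assumed never written (Ampere+ for suld)
    Weak,     // plain weak load, the weakest thing available everywhere
}

/// Pick the weakest order available for an ACCESS_CAN_REORDER image load.
fn reorderable_load_order(sm: u32) -> LoadOrder {
    if sm >= 80 {
        LoadOrder::Constant
    } else {
        // Turing (SM75) / Volta (SM70): no .constant, fall back to .weak.
        LoadOrder::Weak
    }
}

fn main() {
    assert_eq!(reorderable_load_order(75), LoadOrder::Weak);
    assert_eq!(reorderable_load_order(86), LoadOrder::Constant);
}
```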
16:07 karolherbst[d]: but impressive that it makes that much of a difference
16:08 gfxstrand[d]: I don't know that it does
16:08 gfxstrand[d]: It kinda looks like it doesn't, actually
16:08 gfxstrand[d]: But I'm going to wait for the game to finish loading before I declare victory or not
16:09 karolherbst[d]: I suspect .sys is the bigger part killing perf, because that's gonna nuke all caches
16:09 gfxstrand[d]: Oh, yeah
16:09 gfxstrand[d]: I'm sure it's the worst part
16:09 gfxstrand[d]: But I may as well give the hardware as much help as we can
16:09 gfxstrand[d]: gfxstrand[d]: In fact, I know it is. Perf testing confirms it.
16:10 gfxstrand[d]: gfxstrand[d]: See above
16:10 karolherbst[d]: I still can't get over that nvidia added .MMIO
16:10 gfxstrand[d]: https://tenor.com/view/i-don%27t-know-idk-idk-about-that-gif-7336250873949261946
16:10 karolherbst[d]: though I hope the main purpose is to interact with files
16:11 karolherbst[d]: or network...
16:12 gfxstrand[d]: It's -6C outside and I just cracked a window in my office because it's getting that warm in here. 🥵
16:12 gfxstrand[d]: Too many space heaters. Not enough space. 😂
16:13 gfxstrand[d]: And I don't even have my 4090 yet!
16:14 gfxstrand[d]: Okay, .weak seems to work well enough. Now I need to swap in the Turing to make sure that's okay.
16:16 magic_rb[d]: gfxstrand[d]: have you broken a card by swapping it too much? I'd imagine PCIe isn't made to be swapped constantly
16:17 gfxstrand[d]: Not yet
16:42 snowycoder[d]: `i2b(b2i(x))` folding in the copy-prop pass implemented! https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33646
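The identity behind that fold, sketched on plain values rather than NAK IR (so just the reasoning, not the pass itself): b2i maps a boolean to 0/1 and i2b maps nonzero back to true, so the round trip is the original boolean and copy-prop can substitute `x` directly.

```rust
// Plain-value model of the fold; the real pass works on IR, not on bools.
fn b2i(b: bool) -> i32 {
    if b { 1 } else { 0 }
}

fn i2b(i: i32) -> bool {
    i != 0
}

fn main() {
    for x in [false, true] {
        // i2b(b2i(x)) == x for every boolean, which justifies the fold.
        assert_eq!(i2b(b2i(x)), x);
    }
}
```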
17:03 mohamexiety[d]: magic_rb[d]: from what reviewers mention, it's usually fine actually
17:04 mohamexiety[d]: although recently one vendor messed up on their new motherboards, and the PCIe slot scratches the pins of the device plugged in if you swap too many times. (ASUS on 800 series boards)
17:06 mhenning[d]: gfxstrand[d]: sequentially consistent data race free
17:07 mhenning[d]: A lot of the memory ordering stuff is actually well specified in the PTX docs, but I get a headache whenever I open that and the vulkan memory model side by side and try to compare
17:07 mhenning[d]: which is why I tried to just mimic whatever the prop driver does for those settings
17:08 mhenning[d]: Also, I guess that the 25.0.0 release is just completely broken on turing? That's bad
17:13 gfxstrand[d]: Yeah, I've got a fix
17:14 gfxstrand[d]: I'm just trying to put a few things together before I make the MR.
17:14 gfxstrand[d]: And perf test the shit out of it
17:19 gfxstrand[d]: At least with what I'm seeing with DA:TV, almost all of the perf loss comes from `.sys`. Even `.gpu` is massively faster.
17:19 gfxstrand[d]: So I think we can make everything `.gpu` and get our perf back and keep the memory model happy.
17:20 mhenning[d]: Yeah, that sounds plausible
17:20 gfxstrand[d]: That feels like a nice middle ground for now.
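What "make everything .gpu" amounts to, as a hedged sketch (the enum is illustrative; the grounded points are that `.sys` snoops out past the GPU's caches and that Vulkan's explicit host/device barriers let a GPU-scope default suffice, with a cache flush at command-buffer boundaries as the worst case):

```rust
/// Illustrative scope enum, not NAK's actual type.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Scope {
    Cta,
    Gpu,
    System,
}

/// Clamp system scope down to GPU scope; host visibility is instead
/// handled by explicit barriers / flushes at submission boundaries.
fn clamp_scope(requested: Scope) -> Scope {
    match requested {
        Scope::System => Scope::Gpu,
        s => s,
    }
}

fn main() {
    assert_eq!(clamp_scope(Scope::System), Scope::Gpu);
    assert_eq!(clamp_scope(Scope::Cta), Scope::Cta);
}
```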
17:23 karolherbst[d]: tried `.vc`? though I doubt it makes any difference
17:24 mhenning[d]: What is .vc?
17:24 karolherbst[d]: apparently cuda doesn't have it
17:24 karolherbst[d]: mhhh
17:24 mhenning[d]: I don't think I've ever seen nvcc emit that
17:24 karolherbst[d]: I don't really know what .VC is.. but it's between SM and GPU
17:24 karolherbst[d]: I suspect it's virtual channel
17:25 karolherbst[d]: so might be good enough for memory within the same gpu context
17:25 mhenning[d]: That sounds wrong? You can encode `CONSTANT.VC.` for example https://kuterdinel.com/nv_isa/ATOMG.html
17:25 karolherbst[d]: but that won't work with external memory maybe
17:25 karolherbst[d]: constant is semantic, not scope
17:26 karolherbst[d]: and .VC is just another scope
17:27 karolherbst[d]: I _think_ .GPU synchronizes for the entire GPU, where .VC just restricts it within a channel
17:27 karolherbst[d]: but no clue if that changes anything
17:27 mhenning[d]: On ampere+ there's only one field for both, so e.g. there is no encoding difference between .weak.gpu and .weak.sm because they're the same thing
17:27 mhenning[d]: scopes only matter for .strong
17:27 karolherbst[d]: both probably operate on L2 caches
17:27 karolherbst[d]: mhhh interesting
17:28 karolherbst[d]: ohhh
17:28 karolherbst[d]: you are right
17:28 karolherbst[d]: scope is ignored for .WEAK
17:28 mhenning[d]: Oh, I guess we do have .constant.sm and .constant.cta
17:28 mhenning[d]: karolherbst[d]: It's not just ignored, it's unencodable on ampere+
17:29 karolherbst[d]: fair enough
17:29 karolherbst[d]: but yeah... there is .VC but I have no idea if that matters for perf, but if it didn't, why would nvidia add it?
17:30 gfxstrand[d]: I doubt .VC is going to matter much on desktop cards.
17:30 gfxstrand[d]: But also, .GPU seems to be good enough at least for what I'm looking at at the moment
17:30 gfxstrand[d]: Time to see if the memory model is happy with it.
17:30 karolherbst[d]: I hope some weird dma-buf stuff isn't failing with it
17:31 gfxstrand[d]: nah. Vulkan requires pretty heavy barriers between GPU and CPU
17:31 karolherbst[d]: sounds fine then
17:31 gfxstrand[d]: Worst case, we have to insert a DC flush at the end of the command buffer
17:31 karolherbst[d]: yeah..
17:44 gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33649
17:45 gfxstrand[d]: That fixes Turing and gets rid of `MemScope::System` everywhere.
17:45 gfxstrand[d]: So far, the big perf lesson for everyone to learn is that PCI is evil.
17:58 gfxstrand[d]: Maybe now that we're not completely destroying the PCI bus, I can actually start to see perf improvements from things.
17:58 gfxstrand[d]: Or maybe we're still destroying the PCI bus and just need to find another culprit
18:01 gfxstrand[d]: mhenning[d]: RE the backport. I'm pretty sure `.gpu` is safe. Going all the way to CTA would make me more nervous but `.gpu` should be okay and `.sys` is really bad.
18:04 gfxstrand[d]: But yeah, backporting perf fixes is always a little sketchy
18:06 mhenning[d]: Right, I get that it's significant for perf, it's just that we used the same reasoning to backport https://gitlab.freedesktop.org/gfxstrand/mesa/-/commit/ffdc0d8e98eeb68abcfff3c48b3691b999305004 and we have a broken 25.0.0 as a result
18:07 mhenning[d]: It would be unfortunate if we fixed that commit and then immediately broke something else so we have a broken 25.0.1 too
18:10 gfxstrand[d]: Yeah.
18:11 gfxstrand[d]: That's fair.
18:11 gfxstrand[d]: I'll drop the backport tag
18:11 karolherbst[d]: mhhhhhh
18:11 gfxstrand[d]: I just want 25.0 to not suck. 😢
18:11 karolherbst[d]: was the game using host visible memory in the loading screen?
18:11 gfxstrand[d]: I'm going to figure that out in a bit.
18:12 gfxstrand[d]: I'm grabbing a RenderDoc trace of just the loading screen. Given that just the progress bar is only 50 FPS, there's gotta be some low hanging fruit there.
18:15 mhenning[d]: gfxstrand[d]: To be honest, I'd feel okay about it if we let people test the patch out on main for a few weeks and then backported it. It's mostly just doing a backport without much time for people to report issues that I'm worried about.
18:15 gfxstrand[d]: That's fair
18:16 gfxstrand[d]: And I did think about that.
18:16 gfxstrand[d]: We can always ask eric_engestrom to backport stuff later if it's super important
18:16 gfxstrand[d]: Okay, once this second CTS run is done. I'll land it without the backport tag.
18:22 gfxstrand[d]: gfxstrand[d]: That was less instructive than I'd hoped. Frostbite is really dumb. It renders the entire menu scene and then renders a big purple rectangle over it with the loading bar.
18:24 zmike[d]: that's definitely not gonna beat csgo rendering a fullscreen A8_UNORM over the menu
18:30 asdqueerfromeu[d]: zmike[d]: "Fullscreen alpha format downloads EVERY frame before EVEN reaching the title screen"
18:32 magic_rb[d]: mohamexiety[d]: that is exactly the thing I had in mind, I'd assume the traces will eventually wear down, there is always some rubbing on insertion of the GPU or there wouldn't be a connection
18:39 gfxstrand[d]: We can never be less dumb than that app...
18:40 asdqueerfromeu[d]: mhenning[d]: What is broken in Mesa 25.0.0 NVK?
18:41 gfxstrand[d]: Storage buffers on Turing
18:47 gfxstrand[d]: Just assigned the fix to Marge.
18:50 gfxstrand[d]: I'm a little annoyed by this one. There aren't many differences between Turing and Ampere on the shader side so I usually figure my 3060 CTS runs are enough. 😕
19:42 airlied[d]: I've had PCIe slots wear out, not sure a GPU connector ever has, but I've also accidentally hot-unplugged cards and I once blew up the motherboard ethernet swapping GPUs
19:44 mohamexiety[d]: yeah the connectors are usually pretty sturdy unless the motherboard does something weird (e.g. https://videocardz.com/newz/asus-pcie-slot-q-release-slim-mechanism-may-scratch-your-gpu-first-rtx-5090-affected)
19:47 tiredchiku[d]: <a:catscratch:968620761923342396>
19:48 tiredchiku[d]: 👆 asus mechanism
19:49 mhenning[d]: magic_rb[d]: Did you ever get around to testing the deep rock galactic workaround?
19:52 magic_rb[d]: mhenning[d]: No not yet sorry
19:53 magic_rb[d]: I havent forgotten
20:11 mhenning[d]: No worries, just wanted to check in about it
20:51 redsheep[d]: mhenning[d]: What workaround? The game worked for me just the other day on main
20:54 mhenning[d]: redsheep[d]: This: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33502
20:54 mhenning[d]: We accidentally landed it without testing, so if it works on main that's a good sign
20:55 redsheep[d]: Yeah, that makes sense. That merged before I tested
20:56 redsheep[d]: Has anyone tested other games besides Veilguard now that the .GPU memory ordering is merged? I'm super curious
20:56 redsheep[d]: I'll test it myself in a few hours
20:56 mhenning[d]: redsheep[d]: Not that I'm aware of
21:07 phomes_[d]: I am testing games right now. The first 4 games are all still working. 3 vkd3d games improved fps but only by 1
21:08 redsheep[d]: Only by 1... Like 1 fps?
21:09 redsheep[d]: That's not a ton
21:12 phomes_[d]: 144->145, 62->63, and 30->31
21:13 phomes_[d]: compared to main yesterday
21:14 gfxstrand[d]: +1 fps is fine
21:14 gfxstrand[d]: I honestly don't expect EDB to make a huge difference
21:14 gfxstrand[d]: I just want to be sure it's working
21:31 phomes_[d]: gfxstrand[d]: do you still prefer not to have issue reports for games? Some problems seem interesting. Like Sniper elite 5 working well in vulkan mode, but dx12 mode hangs and triggers the annoying "error fencing pushbuf: -19" that I hit once or twice per week
21:33 gfxstrand[d]: I think it's okay to have bugs in games if they're known to work on RADV. So anything steam deck certified is okay.
21:49 magic_rb[d]: Can I ask why specifically if it works on radv?
21:51 dwfreed: works on radv is a pretty good indicator that it's a driver bug and not a game bug
21:56 magic_rb[d]: Right, makes sense, I wasn't sure if there isn't like a common part between nvk and radv
22:15 asdqueerfromeu[d]: magic_rb[d]: Both use the Vulkan runtime (but to different degrees)
22:19 gfxstrand[d]: Mostly I don't want to be chasing ghosts where a game doesn't work but it's a Wine problem or a DXVK/VKD3D bug.
22:20 gfxstrand[d]: Because there are still a lot of games that are pretty sketch on Linux for reasons totally unrelated to the Vulkan driver
22:21 magic_rb[d]: Makes sense, it just didn't occur to me initially
22:29 gfxstrand[d]: I'm wondering what all other ridiculous sources of PCIe traffic we have lying around.
22:30 gfxstrand[d]: Once the game boots, I'm going to get a snapshot of heap usage.
22:30 magic_rb[d]: Btw can I help with fixing up suspend/resume? It seems to consistently break nouveau, requiring a reboot. Also, if I stumble upon a kernel panic, what do I gather to help you?
22:31 gfxstrand[d]: airlied[d]: skeggsb9778[d] ^^
22:31 gfxstrand[d]: I'm not sure what's the most useful there. The NVIDIA GPU in my laptop goes out for lunch pretty regularly, unfortunately.
22:32 gfxstrand[d]: 570 may fix some of it. IDK. The firmware we're currently using wasn't really meant for anything but server cards.
22:32 gfxstrand[d]: I haven't pulled the 570 branch to my laptop
22:32 magic_rb[d]: Ah right that's a thing, i forgot. Then probably should wait
22:33 magic_rb[d]: I'll dedicate some time tomorrow to try the deep rock patch and also 570, is there a branch?
22:33 magic_rb[d]: (Hopefully it'll build on NixOS)
22:57 airlied[d]: I keep getting a bit into trying to work out my turing laptop problems and then forget, I should also try 570 on it
23:04 mhenning[d]: magic_rb[d]: I think the 570 branch is https://gitlab.freedesktop.org/bskeggs/nouveau/-/tree/03.00-r570?ref_type=heads
23:04 rinlovesyou[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1342270779940802641/image.png?ex=67b90694&is=67b7b514&hm=e3b2ea1bb61c2463fa03513b0e4d657733fad1a0ee499be70acdfb8a1f33699c&
23:04 rinlovesyou[d]: oh wow
23:05 mhenning[d]: uh, that's probably not a reliable commit count
23:05 rinlovesyou[d]: yeah something has broken there
23:05 mhenning[d]: I think drm/nouveau is way behind linux mainline
23:06 gfxstrand[d]: Yeah. The main branch doesn't always get propagated to subtrees
23:07 rinlovesyou[d]: ah right so that's >100k linux kernel commits merged into this fork of drm/nouveau?
23:10 rinlovesyou[d]: yeah that's what's going on, that's 2 years of linux kernel development alright
23:10 airlied[d]: yeah that's just gitlab creating information from places that don't make sense for kernel
23:10 gfxstrand[d]: It just tells you how old the drm/nouveau master branch is
23:10 rinlovesyou[d]: yeye
23:25 gfxstrand[d]: Okay, looks like the game is using about 5.5 GB of VRAM and 1 GB of system RAM. That seems like too much system RAM. 🤔
23:26 gfxstrand[d]: No idea how much memory traffic is on the system RAM
23:26 gfxstrand[d]: I wish I had counters
23:32 gfxstrand[d]: Something funky is happening with memory budgets
23:33 gfxstrand[d]:
    memoryHeaps: count = 2
    memoryHeaps[0]:
        size = 8585740288 (0x1ffc00000) (8.00 GiB)
        budget = 392167424 (0x17600000) (374.00 MiB)
        usage = 0 (0x00000000) (0.00 B)
        flags: count = 1
            MEMORY_HEAP_DEVICE_LOCAL_BIT
    memoryHeaps[1]:
        size = 75642175488 (0x119ca00000) (70.45 GiB)
        budget = 42083549184 (0x9cc600000) (39.19 GiB)
        usage = 0 (0x00000000) (0.00 B)
        flags: None
23:34 gfxstrand[d]: I guess I do have RenderDoc running so maybe that's okay
23:38 gfxstrand[d]: Looks like lots of UBOs and similar are living in VRAM
23:38 gfxstrand[d]: Depending on caching that could be okay or that could be terrible.
23:44 gfxstrand[d]: Let's throw everything in VRAM and see what that does
23:45 gfxstrand[d]: gfxstrand[d]: This makes the whole texture buffer thing make more sense. They use a bunch of texture buffers for *something* and those texture buffers live in system RAM, and if we're doing `suld.sys`, uh... Yeah, that'll sync with the universe.
23:46 gfxstrand[d]: That explains why we haven't seen as much perf difference with regular images. There's some but they typically live in VRAM so it's not nearly as bad.
23:46 gfxstrand[d]: But with texture buffers, yeah....
23:55 gfxstrand[d]: Okay, shoving everything to VRAM drops the game to 8 fps. 😢