00:03karolherbst[d]: airlied[d]: we could just honor that flag when doing the lowering
00:03karolherbst[d]: e.g. inside `nak_nir_lower_load_store`
00:03karolherbst[d]: mhh maybe not there
00:03karolherbst[d]: lower descriptors I think
00:04karolherbst[d]: mhh
00:04karolherbst[d]: maybe not there either...
00:04karolherbst[d]: mhhh
00:04karolherbst[d]: yeah no idea 🙂
00:13mhenning[d]: something like this perhaps https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38874/diffs?commit_id=680d684783c562e9ab2bfb81e444d2a70e6b5016
00:15karolherbst[d]: I think we need to change explicit_ios interface to allow for two address modes to be set: for bound checked and non bound checked accesses
00:15karolherbst[d]: not sure if that's a good idea tho
00:28cosmicemotion[d]: mohamexiety[d]: Well, it seems to be working. Cyberpunk definitely works better on my 5070M and FF VII Rebirth now gives a VK Exception error instead of out of VRAM. How do I actually check the VRAM being used on NVK though?
00:38airlied[d]: karolherbst[d]: it might be once we lower ssbo loads that the compiler just sorts out the extra stuff
00:40mohamexiety[d]: cosmicemotion[d]: The memory accounting stuff should be all part of ttm and all that but I don’t know if it’s exposed anywhere :thonk:
00:41mohamexiety[d]: It’s not part of the rusd stuff for nouveau at least
00:41karolherbst[d]: for loads that's trivial, but store instructions don't know they are bound checked
00:47esdrastarsis[d]: cosmicemotion[d]: you can test capcom games, they support vram usage
00:55cosmicemotion[d]: esdrastarsis[d]: I have just recorded a video of RE3 on Max Settings. It theoretically says it's using 13.5 GB of VRAM. Well, I was getting close to 100+ FPS most of the time. I think that's more than fine. XD
00:55cosmicemotion[d]: I will launch again and see if I can find the VRAM option
00:58cosmicemotion[d]: I don't see a VRAM reporting option so I'm just uploading the video and you can watch it. In the end it crashed with nouveau reporting `[...] stallled at ffffffff`. I have no idea if that has something to do with VRAM or not though.
01:02cosmicemotion[d]: Here it is -> https://youtu.be/z9nwvAIzNHI
01:04cosmicemotion[d]: I'm trying Monster Hunter Rise next
01:28cosmicemotion[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1491973139142283446/Screenshot_2026-04-10_04-27-14.jpg?ex=69d9a3c3&is=69d85243&hm=65364544fdf563a87d7fbadda0b6f011508766a30c7d6cf27bf8d6a7d12776ba&
01:28cosmicemotion[d]: Ok this is weird. MHR reports that I have 21 GBs of VRAM lol
01:36esdrastarsis[d]: cosmicemotion[d]: I think dxvk shows mem usage
01:36cosmicemotion[d]: I see thanks! 🙂
03:00cosmicemotion[d]: Is there any way to get the VRAM used using the Mesa Vulkan Overlay?
03:01HdkR: Does nvtop not work with nouveau/nvk yet?
03:01HdkR: That's always what I use on my various platforms :D
03:36esdrastarsis[d]: cosmicemotion[d]: Maybe you can use https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39898 made by phomes_[d]
08:20pixelcluster[d]: https://lore.kernel.org/dri-devel/20260410081322.5577-1-natalie.vock@gmx.de/T/#u :)
08:47notthatclippy[d]: How is dmem cg controller supposed to work with 0FB chips?
08:49notthatclippy[d]: I was looking at it for the prop driver a bit, but as is customary it's full of square peg + round hole issues.
08:53pixelcluster[d]: notthatclippy[d]: i'm not that familiar with nv terminology, can you give a tl;dr on what about 0FB chips could cause issues? (do you mean chips like APUs that don't have any VRAM of their own?)
08:54notthatclippy[d]: Yeah, sorry, context switch fail here. Yeah, for APUs - do you account memory in mem, dmem or both?
08:54notthatclippy[d]: Similar for NUMA stuff, which again I have zero clue how it works with TTM (because I have about 0.3 clue about TTM in general)
08:56pixelcluster[d]: so for APUs, I'm pretty sure the driver still manages two regions - one "VRAM" region which is really just the BIOS carveout, and one "GTT"/"sysmem" region that gets placed in the rest of RAM like usual
08:56pixelcluster[d]: allocations in "VRAM" get charged to dmem as usual, sysmem ones don't get accounted via dmem
08:57notthatclippy[d]: Ah, so carved out ahead of time. I don't think we do that.
08:58pixelcluster[d]: whether or not to account host memory allocations in dmem or memcg (or both) is generally a bit in flux and especially arm platforms have somewhat different needs there afaiu
08:58rhed0x[d]: esdrastarsis[d]: DXVK also does its own vram optimizations
08:58mohamexiety[d]: pixelcluster[d]: The problem is NV APUs don’t have carveout
08:58mohamexiety[d]: It’s all one truly unified pool
08:58pixelcluster[d]: are there x86 NV APUs?
08:59mohamexiety[d]: Oh, no
08:59mohamexiety[d]: (And none work with nouveau yet but Spark might soon)
08:59pixelcluster[d]: yeah like i said ARM platforms have funny considerations and that stuff is still shaping out as we speak
09:00notthatclippy[d]: Do you have some reading material there for me handy perchance?
09:01mohamexiety[d]: Yeah I am curious what’s different with arm platforms :thonk:
09:02pixelcluster[d]: I’ll dig out the mailing list threads for this once I’m done making breakfast 😋
09:03pixelcluster[d]: but fun fact that stuff is basically the question that blocks the systemd implementation :D
09:15cosmicemotion[d]: esdrastarsis[d]: Trying it right now! 🙂
10:12cosmicemotion[d]: Ok, I don't know if everything is working properly. Sometimes, in VKD3D, the currently used VRAM is above budget and there's not a single hitch, but I have had a persistent crash at a single point in RE3 when VRAM is over budget. I don't know if it's about VRAM though or if it has to do with something else. I'll upload the video so you can see for yourselves. The crash is at the end of the video.
10:16cosmicemotion[d]: I applied the patch with `git apply 0001-WIP-drm-nouveau-wire-up-dmem-cgroups.patch`. That's the proper way to do it right?
10:17cosmicemotion[d]: Here's the video -> https://youtu.be/QiqXLmhj2eg
11:11cosmicemotion[d]: DXVK seems to lag much more when VRAM is close to full but doesn't crash -> https://youtu.be/SsV9IsubZJA
11:11cosmicemotion[d]: I also verified manually that the patch was indeed applied correctly.
11:12cosmicemotion[d]: Also tested Control, I'll post on DXVK and VKD3D and I'll stop spamming. XD
11:12cosmicemotion[d]: VKD3D seems to not lag but when new shaders (?) need to be compiled it freezes. DXVK lags more but doesn't crash.
11:28cosmicemotion[d]: Here's DXVK in Control -> https://youtu.be/HgXG3C6R0Hw
11:29cosmicemotion[d]: And here's VKD3D in Control -> https://youtu.be/6llcxJosnxY
11:29cosmicemotion[d]: OK, I'm done! XD
11:35pixelcluster[d]: pixelcluster[d]: mohamexiety[d] notthatclippy[d] https://lore.kernel.org/dri-devel/CABdmKX0LpKJ9tw48oQh7=3CF0UR5uFtgo0OMwQhHBB40LnijyQ@mail.gmail.com/
11:40mohamexiety[d]: thank you!
12:09blisto[d]: I should become a pixel
12:11pixelcluster[d]: join the cluster
12:56chikuwad[d]: mhenning[d]: I finally got around to addressing your comments on !37888, ready for review again :)
17:16phomes_[d]: what do you think about landing the video MR behind a NVK_EXPERIMENTAL=video flag? It is a pain to rebase it when I test it
17:16phomes_[d]: Faith mentioned in the MR that she would like to add bug fixes on top of the tree anyway
17:18phomes_[d]: The MR description says that it should be good enough for general use, so it would be nice to make it easy to test
17:32chikuwad[d]: that sounds neat
18:07mohamexiety[d]: https://pastebin.com/8AiV0Z6v utrace working (i think)
18:08esdrastarsis[d]: phomes_[d]: Someone needs to send the nvdec patch to the kernel, but this is a good idea
21:18karolherbst[d]: mhenning[d]: do you have thoughts on dealing with modifier propagation in NAK? Seeing some concerning patterns:
21:18karolherbst[d]: r99 = fadd.ftz -rZ |r99| // delay=1
21:18karolherbst[d]: r100 = fadd.ftz r99 0xbf19999a // delay=1
21:18karolherbst[d]: and wondering if anybody has good thoughts on how to solve that?
21:20karolherbst[d]: this should really compile to `r100 = fadd.ftz |r99| 0xbf19999a` instead
21:22karolherbst[d]: maybe we should make `copy` allowed to have modifier on the source and legalize to `fadd`/`iadd` later if not copy propped?
21:25karolherbst[d]: I have a 5800 instruction shader that has `fadd.ftz -rZ |...|` as a pattern 450 times 🙃
21:25karolherbst[d]: and many of them could be folded
21:28karolherbst[d]: I wish we could do easy pattern matching in NAK...
21:28karolherbst[d]: `CopyPropEntry::Modifer` 🥲
21:28karolherbst[d]: I could add.. maybe I try that
21:29karolherbst[d]: or reuse Copy? mhh
21:29karolherbst[d]: yeah... that should work...
21:29karolherbst[d]: heh wait.. it already tries this...
21:41karolherbst[d]: same happens with negative sources and even more often...
21:41karolherbst[d]: ~~guess I found one perf issue~~
21:46karolherbst[d]: okay.. we don't do this when .ftz is enabled...
21:46karolherbst[d]: why?
21:48karolherbst[d]: phomes_[d]: mind benchmarking this branch? https://gitlab.freedesktop.org/karolherbst/mesa/-/tree/nak/opt/fabs_fneg
21:48karolherbst[d]: I see some 3% perf improvements in pixmark piano with that..
21:48karolherbst[d]: wondering how other games are doing
21:53karolherbst[d]: I'm surprised it's only 3% tho because it cuts the shader in half 🥲
21:53karolherbst[d]: but it does tons of mufu
21:54karolherbst[d]: and it's a massive loop
21:54karolherbst[d]: maybe we need to have `contract` as part of NAK so we can fold those in a correct way?
21:54karolherbst[d]: like it does change the result...
22:11karolherbst[d]: okay..
22:11karolherbst[d]: soo
22:11karolherbst[d]: that shader uses like 182 regs
22:11karolherbst[d]: if I cap it to 168 (to run 12 warps instead of 8) I get more perf
22:11karolherbst[d]: `Score: 1968 points (FPS: 98)` vs `Score: 2060 points (FPS: 102)` for nvidia
22:12karolherbst[d]: from original: `Score: 1642 points (FPS: 82)`
22:15mhenning[d]: karolherbst[d]: I think you need both adds to have .ftz enabled for the folding to return an identical result. So we'd need to add logic to check for that
22:15karolherbst[d]: mhenning[d]: mhhhhhh... yeah maybe I should read up on that
22:15karolherbst[d]: at least that would work without changing the IR
22:15karolherbst[d]: I think we can be more aggressive with that on any `contract` alu op
22:16mhenning[d]: yeah, we probably can, although we don't have the contract flags in nak right now
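A minimal sketch of the folding being discussed, written against a toy IR rather than NAK's real one. Every name here (`Src`, `FAdd`, `fold_modifiers`, `compose`) is hypothetical, and the rule "only fold when producer and consumer carry the same `.ftz` flag" is a conservative reading of the condition mentioned above, not a statement of what NAK does. It assumes SSA form (each destination is written once) and treats `fadd -rZ x` as a copy of `x`, since adding -0.0 leaves the value unchanged under round-to-nearest.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Src:
    reg: Optional[str] = None  # register name; None means an immediate
    imm: Optional[int] = None  # raw 32-bit immediate bits
    fabs: bool = False
    fneg: bool = False  # applied after fabs


@dataclass(frozen=True)
class FAdd:
    dst: str
    a: Src
    b: Src
    ftz: bool


NEG_ZERO = Src(reg="rZ", fneg=True)

# Maps a destination register to (copied source, producer's ftz flag).
CopyMap = Dict[str, Tuple[Src, bool]]


def compose(outer: Src, inner: Src) -> Src:
    """Stack the modifiers of `outer` on top of the ones in `inner`."""
    if outer.fabs:
        # |±x| drops any inner negation; only the outer fneg survives.
        return replace(inner, fabs=True, fneg=outer.fneg)
    return replace(inner, fneg=inner.fneg != outer.fneg)


def prop(src: Src, user_ftz: bool, copies: CopyMap) -> Src:
    """Replace `src` by the value it copies, if the ftz flags agree."""
    entry = copies.get(src.reg) if src.reg is not None else None
    if entry is None or entry[1] != user_ftz:
        return src
    return compose(src, entry[0])


def fold_modifiers(instrs: List[FAdd]) -> List[FAdd]:
    copies: CopyMap = {}
    out: List[FAdd] = []
    for ins in instrs:
        ins = replace(ins,
                      a=prop(ins.a, ins.ftz, copies),
                      b=prop(ins.b, ins.ftz, copies))
        if ins.a == NEG_ZERO:
            # `dst = fadd -rZ src` only applies src's modifiers.
            copies[ins.dst] = (ins.b, ins.ftz)
        out.append(ins)
    return out


shader = [
    FAdd("r99", NEG_ZERO, Src(reg="r98", fabs=True), ftz=True),
    FAdd("r100", Src(reg="r99"), Src(imm=0xBF19999A), ftz=True),
]
folded = fold_modifiers(shader)
```

After folding, the second instruction reads `|r98|` directly; the first `fadd` is left in place and would be removed by a later dead-code pass once it has no remaining users.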
22:17karolherbst[d]: yeah..
22:17karolherbst[d]: anyway it halves the huge FP shader in pixmark piano...
22:17karolherbst[d]: but it doesn't do much for cycle counts because of loops and tons of mufus
22:18karolherbst[d]: (out of ~3000 instructions it does like 200 mufus)
22:18karolherbst[d]: anyway, I hope it helps with actual games 🙃
22:18karolherbst[d]: I should run some stats
22:19mhenning[d]: karolherbst[d]: is that mostly trig functions? sounds like an odd shader
22:19karolherbst[d]: yeah.. and sqrt
22:19karolherbst[d]: exp2
22:19karolherbst[d]: rcp
22:19karolherbst[d]: sin
22:19karolherbst[d]: I think it's all of it 😄
22:19karolherbst[d]: .cos
22:19karolherbst[d]: .rsq
22:19mhenning[d]: a bit of everything
22:19karolherbst[d]: it renders a piano in a fp
22:19karolherbst[d]: it's insane
22:20karolherbst[d]: sqrt seems more often
22:20karolherbst[d]: anyway...
22:20karolherbst[d]: with the 12 warps and some spilling instead of 8 SM warps and no spilling I get super close to nvidia blob perf
22:20karolherbst[d]: that's like instead of 182 regs I use 168 regs
22:20karolherbst[d]: ehh instead of 184
22:21mhenning[d]: yeah, in general that's a hard tradeoff to make in the compiler
22:21karolherbst[d]: and I wonder if for such massive shaders I should tell assign_regs to do that...
22:21karolherbst[d]: but yeah...
22:21karolherbst[d]: it's difficult to judge at this point
22:21mhenning[d]: right now we pretend less spilling is always better but that's not really true
22:21karolherbst[d]: I only would risk spilling if we are at 8 warps and 8 or 16 regs close to bumping to 12
22:22karolherbst[d]: that's +50% threads
22:22karolherbst[d]: and local memory is aggressively cached...
22:22karolherbst[d]: Spills to mem: 16
22:22karolherbst[d]: Spills to reg: 42
22:22karolherbst[d]: SLM size: 64
22:22karolherbst[d]: in this shader
22:22karolherbst[d]: maybe we should just "try" with spilling and see how bad it would be?
22:23karolherbst[d]: and if it's too bad, we just use the original amount of gprs?
22:23karolherbst[d]: and only try if we are at 8 warps and 8/16 above able to use 12?
22:23karolherbst[d]: should benchmark it maybe..
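A rough sketch of the occupancy arithmetic behind the 8 vs 12 warps numbers and the "only try spilling when close to the next tier" idea. The hardware parameters are assumptions, not taken from the chat: a 64K-entry register file per SM, per-thread registers allocated in units of 8, and warps allocated in groups of 4. Other occupancy limits (shared memory, per-SM warp caps) are ignored, and `max_warps` / `spill_candidate` are hypothetical helper names.

```python
from typing import Optional


def max_warps(regs_per_thread: int,
              regfile: int = 65536,   # assumed registers per SM
              warp_size: int = 32,
              reg_granule: int = 8,   # assumed allocation unit per thread
              warp_granule: int = 4) -> int:
    """Warps that fit in the register file for a given per-thread count."""
    # Round the per-thread count up to the allocation unit.
    regs = -(-regs_per_thread // reg_granule) * reg_granule
    warps = regfile // (regs * warp_size)
    # Round down to the warp allocation granularity.
    return warps // warp_granule * warp_granule


def spill_candidate(regs_per_thread: int, max_slack: int = 16) -> Optional[int]:
    """Largest register cap within `max_slack` that unlocks more warps."""
    base = max_warps(regs_per_thread)
    lowest = max(regs_per_thread - max_slack, 1)
    for cap in range(regs_per_thread - 1, lowest - 1, -1):
        if max_warps(cap) > base:
            return cap
    return None  # no nearby tier: keep the original register count


print(max_warps(182), max_warps(168))  # warps before and after the cap
print(spill_candidate(182))            # cap worth trying, if any
```

Under these assumptions a shader at 182 or 184 registers lands on 8 warps, and 168 is the first cap that reaches 12, which matches the numbers quoted above. Whether the extra spills pay off still has to be measured, as the discussion says.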
22:49karolherbst[d]: huh okay.. so the fabs/fneg folding doesn't halve the shader, I'm curious where I got the old stat from.. prolly messed up somewhere
22:50phomes_[d]: karolherbst[d]: I will run them now
22:50karolherbst[d]: Instruction count: 3069 -> 2830
22:50karolherbst[d]: Static cycle count: 172892 -> 165617
22:51karolherbst[d]: instr count: +8.4%, static cycle: +4.4% perf, actual: 3.1%
22:52karolherbst[d]: so _stats_ kinda agree on the perf increase actually
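The percentages quoted above work out if they are read as old/new ratios (a speedup) rather than as the fraction of instructions removed. A small check, using only the counts posted in the chat; the helper name `speedup_pct` is made up for illustration.

```python
def speedup_pct(old: float, new: float) -> float:
    """Improvement expressed as old/new - 1, in percent."""
    return (old / new - 1.0) * 100.0


instr = speedup_pct(3069, 2830)        # instruction count
cycles = speedup_pct(172892, 165617)   # static cycle count

print(f"instr count: +{instr:.1f}%  static cycle: +{cycles:.1f}%")
```

Read the other way round, as a reduction relative to the old count, the instruction figure would be about 7.8% instead.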
22:54karolherbst[d]: phomes_[d]: cool thanks! If that shows nice improvements across the board, we can probably get it merged quickly, so far it looks promising, just need to properly type in the patch
23:04karolherbst[d]: phomes_[d]: though might have to do some new baseline, because it's been two weeks 🙃 but yeah...
23:04karolherbst[d]: or do you rebase on top of the baseline?
23:06phomes_[d]: I should redo the baseline. I used to keep the baseline very close to main. I must have been slacking off 🙂
23:06karolherbst[d]: I have no doubt that this patch makes every game faster, but.. 😄
23:18phomes_[d]: almost all games improve with 1-4 fps
23:19karolherbst[d]: which is good to know actually
23:19phomes_[d]: as is usual X4 Foundations is not affected
23:19karolherbst[d]: X4 had something weird going on with depth buffer or something?
23:19karolherbst[d]: but also.. 22 fps is quite low to see a ~3% perf increase
23:20phomes_[d]: it does. But we see that in other games as well
23:20karolherbst[d]: yeah anyway.. shader improvement -> more speed across the board
23:20karolherbst[d]: which is something I am not surprised at
23:21phomes_[d]: yes it could be that. It is just something that seems to always be the case. Everything improves but X4 does not
23:28karolherbst[d]: okay.. my cleaned up patch leaves out cases where the fabs/fneg value is used across loop iterations, but.. better play it safe for now
23:48karolherbst[d]: phomes_[d]: but I'm always skeptical of +1 fps changes 🙃 , because that could also be random inaccuracy, but at least it's across the board, so...
23:51phomes_[d]: yes. I agree. +1 fps could also be just a drop that happened to cause it to increase by 1
23:51karolherbst[d]: okay.. fossil stats are going to be impressive...
23:51karolherbst[d]: phomes_[d]: or thermals
23:52karolherbst[d]: maybe it's hotter outside and your room is hotter and...
23:52karolherbst[d]: okay.. the biggest impact I see in vulkan shaders from gtk 😄
23:54phomes_[d]: yes. These tests are not so suitable for smaller fps gains
23:55karolherbst[d]: you might want to add pixmark_piano to your testing.. or gputest in general, it can be fully automated and they print csv files with the results
23:55karolherbst[d]: e.g. `/GpuTest /test=pixmark_piano /width=1920 /height=1080 /benchmark_duration_ms=20000 /benchmark /no_scorebox /print_score`
23:56karolherbst[d]: and you can use the score, which is just the number of rendered frames in the duration
23:57karolherbst[d]: and with that MR I go from ~1650 to ~1700
23:57karolherbst[d]: and due to thermals the variance is like +-10
23:58karolherbst[d]: not sure how stable the other tests are..
23:58karolherbst[d]: but pixmark_piano is pretty reliable there
23:58karolherbst[d]: with disabled boosting the variance will be like +- 1