16:01 eueumesmo: hi
16:02 eueumesmo: Any idea why the nouveau driver still doesn't scale like open-gpu-kernel-modules and the nvidia proprietary drivers?
16:03 karolherbst: "scale"?
16:03 eueumesmo: maybe something early about linux and open-gpu-kernel-modules and nvidia-proprietary drivers ...
16:04 eueumesmo: karolherbst, 3d performance
16:04 karolherbst: it's a lot of work
16:04 karolherbst: nvidia has lots of money to hire lots of people, nouveau doesn't
16:04 karolherbst: that's basically the reason
16:07 eueumesmo: karolherbst, but nvidia published base source code too ( open-gpu-kernel-modules )
16:07 karolherbst: and?
16:07 eueumesmo: Now we have the opportunity to discuss things more than before
16:07 karolherbst: sure, but you still need people to do the work
16:07 karolherbst: and the kernel driver isn't the only reason why things might be slow
16:08 eueumesmo: And nouveau is an old project already
16:08 karolherbst: what's your point?
16:09 karolherbst: if you don't like the answer to your question, maybe don't ask. Not here to debate it. There is constantly new hardware released and it's a lot of work. If you don't like the situation can always help or pay somebody to help
16:09 eueumesmo: karolherbst, I don't know why nvidia's proprietary and closed parts have a strict relation to performance in other pieces of code, including the proprietary UEFI firmware structure.
16:10 karolherbst: what you are saying makes little sense
16:11 eueumesmo: karolherbst, some apple products have had magic numbers for a while, making only apple products recognize their gadgets. I have heard this on the web.
16:11 karolherbst: what?
16:12 karolherbst: how is that related
16:13 eueumesmo: Gemini answer: AI-generated overview
16:13 eueumesmo: "Magic numbers" in device drivers are specific, hard-coded numeric constants (often hexadecimal) used within software code to uniquely identify data structures, verify integrity, or signify specific states. In Linux kernel drivers, they are frequently placed at the start of a struct to detect debugging errors, such as memory corruption or invalid pointers.
16:13 karolherbst: Maybe stop asking an AI about why nouveau is slow
16:14 karolherbst: It appears you were successfully confused about many things
16:14 eueumesmo: karolherbst, maybe it is strictly related to sometimes badly supported ACPI
16:14 karolherbst: it has nothing to do with ACPI
16:14 eueumesmo: Or some other feature
16:15 karolherbst: yeah, but an AI won't be able to tell you either way
16:16 eueumesmo: karolherbst, do you know if nvidia has UEFI firmware files, or only those needed by nouveau in /lib/firmware
16:16 eueumesmo: !?
16:16 karolherbst: stop asking questions based on AI output
16:16 eueumesmo: karolherbst, I stopped already
16:17 karolherbst: it has nothing to do with any system firmware related stuff
16:17 eueumesmo: I am studying for a Computer Science degree
16:17 eueumesmo: I have worked at Accenture as an analyst
16:18 eueumesmo: karolherbst, nor g-sync stuff
16:18 eueumesmo: ?
16:18 eueumesmo: freesync/g-sync
16:18 karolherbst: I don't see how that's relevant to performance either...
16:19 eueumesmo: karolherbst, Because the driver does not perform as well as open-gpu-kernel-modules and the nvidia proprietary drivers. Even with NVK.
16:20 karolherbst: like people are happy to answer questions, but I'd recommend forgetting about everything an AI "told" you, because it seems you are heavily confused about many things here
16:21 eueumesmo: karolherbst, As I said, I was an Accenture programmer, and only occasionally do I search for something on google ( AI related )...
16:21 eueumesmo: karolherbst, I appreciate your attention with my questions!
16:21 karolherbst: most performance gaps are from userspace implementations of APIs and the compiler we have there
16:23 eueumesmo: karolherbst, hum. You mean about compiler flags ...
16:24 eueumesmo: karolherbst, I am not an expert with C/C++, but if you need someone to run some pre-defined, monitored configuration I can help
16:25 eueumesmo: karolherbst, To test something else
16:26 karolherbst: No, I meant the compiler we wrote for nvidia hardware in Mesa. It's used to compile shaders and compute kernels
16:29 eueumesmo: karolherbst, OK!
16:38 eueumesmo: karolherbst, I will start debugging the Linux kernel and eventually specific modules ( like nvk ), and if in the meantime you need any help please contact me. I will keep joining #nouveau frequently.
16:38 eueumesmo: karolherbst, Thanks for your attention and your explanations!!
18:58 cosmicemotion[d]: Hello! 🙂 Is there any chance we can get this on Nouveau/Nova and NVK? -> https://pixelcluster.github.io/VRAM-Mgmt-fixed/
19:40 airlied[d]: karolherbst[d]: you are also making an assumption that it is the compiler and userspace 🙂 I don't think we have any great proof that we aren't just missing some random 2080 ctrl call that will jump perf :-p
20:19 karolherbst[d]: airlied[d]: heh, that might be the case for 3D indeed
20:20 karolherbst[d]: and there are things that need kernel side support like compression, but from what I've seen most of the work still needs to be on userspace side
20:20 karolherbst[d]: like how long would it take to fix those things on the kernel side compared to all the things we still need to do in mesa
20:23 karolherbst[d]: though I think marysaka[d] looked into this at some point?
20:24 marysaka[d]: yeah I did start some patches for some weirdness and also attempt to get channel stop/idle implemented for GSP but nothing more than that
20:24 karolherbst[d]: ohh I meant, I think you also compared performance, no?
20:24 marysaka[d]: Oh you mean comparing with the blobs as a kernel stack
20:25 karolherbst[d]: yeah
20:25 marysaka[d]: well I still have WSI to figure out with it sadly, but CTS times were identical to nouveau in headless đŸĢ 
20:26 karolherbst[d]: well okay, but that ain't a good perf benchmark, mhh I kinda thought you did test real games, seems I was mistaken
20:28 marysaka[d]: yeah no I didn't go that far yet sadly
20:29 marysaka[d]: I still have no ideas how sync fds work with nvidia-drm I should probably figure that out at some point...
20:29 marysaka[d]: it's the only blocker now
20:29 karolherbst[d]: I still think our problems are mostly in mesa though 🙃
20:29 marysaka[d]: yeah I believe that too
20:29 karolherbst[d]: but yeah, would be good to verify
20:29 marysaka[d]: one thing that would help a lot would be ZBC
20:30 karolherbst[d]: our compiler is still big sad sometimes
20:30 karolherbst[d]: marysaka[d]: ahh yeah
20:30 karolherbst[d]: maybe
20:30 karolherbst[d]: that needs new UAPI?
20:30 marysaka[d]: tbh I'm more concerned about the ~900 WFIs that shadow of the tomb raider ends up with on NVK in a single frame 🙃
20:30 marysaka[d]: karolherbst[d]: yes and integration on GSP side
20:30 karolherbst[d]: marysaka[d]: pain
20:31 marysaka[d]: we had an old uAPI for the old way of doing things; it was never wired up to gallium anyway...
20:31 karolherbst[d]: 3D <-> compute a lot?
20:31 marysaka[d]: yeah
20:31 karolherbst[d]: what about copy? 🙃
20:31 mhenning[d]: marysaka[d]: oh, do you have counters for wfis?
20:31 marysaka[d]: it's basically only that and a ton of dispatch + indirect
20:31 karolherbst[d]: marysaka[d]: ahhh
20:32 karolherbst[d]: marysaka[d]: I wonder if async compute could help, because that's not doing WFI
20:32 karolherbst[d]: in theory
20:33 marysaka[d]: mhenning[d]: no I just used NVK_DEBUG=push_dump and added some prints around for each queue submit, and then did a capture with gfxreconstruct and replayed it to get a more "clean" test env
20:33 mhenning[d]: fair
20:34 marysaka[d]: ... might want to wire something to make that more usable tho like maybe the overlay that phomes made could be useful for that?
20:34 karolherbst[d]: don't we have real counters for that I wonder?
20:34 marysaka[d]: Nsight has stuff for that I think
20:34 karolherbst[d]: mhhh yeah we need those counters as well 🙃
20:34 marysaka[d]: but it might be command injection
20:35 marysaka[d]: they have a whole infra for aftermath to add markers from what I understood
20:35 karolherbst[d]: mhhh
20:35 karolherbst[d]: so you mean they just count if the commands they execute are doing a WFI?
20:36 marysaka[d]: yeah but unsure, I should do some capture with nsight hooked....
20:36 karolherbst[d]: I really want those shader counters, maybe I look into this next month lol
20:36 marysaka[d]: envyhooks on top of that was... not very stable last time I tried
20:37 marysaka[d]: karolherbst[d]: you should but I think those are different in term of how you record them 😄
20:38 marysaka[d]: tho hmm if you have a headless benchmark that you could poke at we could maybe wire that on nvrm in userspace for testing at first?
20:38 karolherbst[d]: vk_coop_matrix?
20:38 marysaka[d]: after all it's just some GSP calls to wire up from userspace in that case...
20:38 karolherbst[d]: https://github.com/jeffbolznv/vk_cooperative_matrix_perf
20:38 marysaka[d]: oh that could be it then yeah
20:40 karolherbst[d]: that benchmark is so trivial it's actually great tho
20:40 karolherbst[d]: but
20:40 karolherbst[d]: we are already 80% close compared to nvidia there soo ... 😄
20:41 karolherbst[d]: (except for fp32)
20:41 karolherbst[d]: needs my GPR+UGPR, the phi vec shit and other random stuff to pump that up to 90%
20:43 karolherbst[d]: but yeah.. would be interesting to see if there is a huge difference there with nvrm
20:50 airlied[d]: any idea how many WFIs the prop driver does though?
20:58 karolherbst[d]: knowing nvidia, probably 1/10 that much. They do command reordering and I would be hugely surprised if they don't optimize for reducing WFI by reordering compute and 3D into chunks, or just do SCG for async compute tasks
20:58 karolherbst[d]: I should properly setup nsights at some point
21:00 marysaka[d]: I need to redo a gfxrecon capture as I cannot find the proprietary one again but yeah it was less than 100 when I was grepping the envyhooks output
21:07 airlied[d]: back when I started working on AMD we had all these ideas of what the AMD drivers did to optimise the command stream, and in the end they never showed more than a 0.005% type of improvement
21:08 airlied[d]: like they were in the long tail of things to close the final 1% not the 50%
21:08 airlied[d]: same with compiler ALU 😛
21:09 pixelcluster[d]: my honest experience with catching up to proprietary drivers is that the first 80% are the dumbest stuff you could ever imagine
21:09 pixelcluster[d]: very much "flip one config register and double performance" levels of dumb
21:09 airlied[d]: yes totally that 🙂
21:11 karolherbst[d]: airlied[d]: yeah but it's AMD, not nvidia 🙃
21:11 karolherbst[d]: no offense
21:12 karolherbst[d]: but I do know a lot of things we aren't doing that do matter for perf
21:15 karolherbst[d]: pixelcluster[d]: yeah so I had that with compute where we configured shared memory in a way that only allowed us to run one workgroup in parallel instead of like 8 or so 🙃
21:16 pixelcluster[d]: amd rt bringup had like 6 of these occurrences
21:16 karolherbst[d]: but we run the same "kernel" driver as nvidia these days, it's just a big blob of firmware
21:16 pixelcluster[d]: radv rt, i should say
21:16 karolherbst[d]: so I still believe it's mostly just userspace at this point
21:16 karolherbst[d]: mostly
21:16 karolherbst[d]: we should predicate global memory writes đŸĨ˛
21:18 airlied[d]: well userspace command submission might affect some things, but also userspace might need to configure the firmware with magic we aren't doing
21:18 karolherbst[d]: yeah maybe
21:19 karolherbst[d]: but I kinda doubt that
21:19 karolherbst[d]: like there are a few hw features we still want to wire up
21:19 karolherbst[d]: but our perf isn't _that_ terrible
21:20 marysaka[d]: We are missing TSG, ZBC and pinning of queues to a specific engine, that's for sure, but I wouldn't say this is the biggest thing
21:20 karolherbst[d]: like for compute we can get 90% close or even closer
21:20 airlied[d]: it would be good to run the coop matrix perf test on nvk on openrm
21:20 airlied[d]: since it won't need WSI
21:20 karolherbst[d]: it's just a lot of compiler work to get the other stuff close as well
21:20 marysaka[d]: TSG might have an impact at the very least to avoid some WFI and switch between contexts that are grouped together
21:20 karolherbst[d]: ahh yeah..
21:21 karolherbst[d]: there are a couple of compute benchmarks where we perform way worse
21:21 karolherbst[d]: but whenever I looked at it, it was always "our shaders surely suck"
21:23 karolherbst[d]: I know that llama-bench PP sucks a lot for us: https://www.phoronix.com/review/linux-70-nouveau/5
21:35 karolherbst[d]: gfxstrand[d]: oh yeah, what's the status on https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40465 ?
21:38 karolherbst[d]: oh yeah, if somebody wants to review some patches: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39384#note_3418460 I think those 5 would be nice to land to move stuff out of that MR
22:12 gfxstrand[d]: karolherbst[d]: I think it might be moot now that the loop stuff landed
22:12 karolherbst[d]: ahh so it did
22:12 karolherbst[d]: I should test my MR on top of main then
22:32 karolherbst[d]: did some pixmark_piano benchmarking: nvk+zink: 1642, nvidia: 2137, nvidia+zink: 2060
22:32 mangodev[d]: we're maybe getting closer?
22:32 karolherbst[d]: well it's just a massive fragment shader
22:33 mangodev[d]: zink is looking pretty good though
22:33 mangodev[d]: would that be almost margin of error for pixmark scores?
22:33 karolherbst[d]: nah, the margin of error is like +-1 with stable clocks
22:33 karolherbst[d]: +-10 with boosting
22:33 mangodev[d]: ah
22:33 mangodev[d]: even still though
22:34 karolherbst[d]: the good thing about that benchmark is, that you'll notice the impact pretty reliably for any optimization you are doing
22:34 karolherbst[d]: even for micro optimizations
22:34 mangodev[d]: a really small difference for running through a conversion layer
22:34 mangodev[d]: it's less than 100 points of difference between nvidia+zink and plain nvidia gl
22:34 karolherbst[d]: even doing FADD over F2F for implementing fabs/fneg gave me reliable improvements there
22:56 airlied[d]: karolherbst[d]: one other hack might be to try and rewrite the nvidia coop mat shader to be nvk compatible and run it
22:56 karolherbst[d]: airlied[d]: well I know that their 2nd ext makes a 33% difference
22:56 karolherbst[d]: and I got more patches to improve perf quite a bit more
22:57 karolherbst[d]: but anyway.. better shaders -> more perf is a pretty stable pattern with those so far
22:57 airlied[d]: I was more thinking to see if there are any other hidden non-shader diffs
22:57 karolherbst[d]: yeah.....
22:57 karolherbst[d]: I _think_ it's shared memory
22:57 airlied[d]: but yeah coop mat tests are severe memory bw tests
22:57 karolherbst[d]: I did the math
22:57 karolherbst[d]: in theory you could run 5 workgroups with some of them, where we currently run 4
22:57 karolherbst[d]: if you alias shared memory
22:57 karolherbst[d]: that's +25% more perf
22:58 karolherbst[d]: so that's something I wanted to look into
22:58 karolherbst[d]: and then I got 40% * 25% ~ 50%, + 33% => 85%
22:59 karolherbst[d]: anyway.. my point is: I have no idea where 15pp perf is lost 🙃
22:59 karolherbst[d]: but I have a good idea for most of the other pp
22:59 karolherbst[d]: running more workgroups will help for sure, but not quite sure _how_ much
23:11 airlied[d]: should probably switch back to cm2, spark is blowing up for me on golden context creation, no idea why yet
23:16 karolherbst[d]: heh..
23:16 karolherbst[d]: though I'm more interested in the aliased shared and scratch memory stuff
23:17 mhenning[d]: cosmicemotion[d]: pixelcluster[d] I am kind of curious how much driver-specific work there is for the cgroup stuff. Did you need amdgpu changes? At a glance I only saw you changing ttm in the patches
23:17 mohamexiety[d]: she actually has a kinda proto untested patch for nouveau
23:17 mhenning[d]: ooh cool
23:18 mohamexiety[d]: it was not even compiled, so it's like untested untested
23:18 mohamexiety[d]: but if it's that simple then it could be worth looking into
23:19 airlied[d]: karolherbst[d]: the nir branch that was posted previously should be testable without many backend changes
23:20 karolherbst[d]: yeah.. I'm supposed to do the drm backport this month, so not sure when I'm getting to it, but yeah.. it's something I wanted to look at
23:20 karolherbst[d]: and also landing: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39384 because that one helps a bit
23:21 karolherbst[d]: and also predicated global stores help 🙃 so much to do
23:23 karolherbst[d]: airlied[d]: ohh also doing no bound checks at all helps a bit still.. though disabling all gave me +50% perf, doing the predicated global load thing already gave me +35%, so "no bound checks" doesn't help _as much_ anymore, but that was something you said we could do for coop mat there?
23:23 karolherbst[d]: but... I suspect that doing predicated global stores closes the gap almost entirely already, soo...
23:23 karolherbst[d]: and that helps with tons of other things
23:25 airlied[d]: yes there is a separate robustness bit for coopmat
23:25 airlied[d]: but nir really isn't ready for it
23:25 airlied[d]: since we decide the form for ssbo loads in advance
23:25 karolherbst[d]: yeah... guess we should figure out predication instead then given it will help with a bunch of other things
23:27 airlied: like maybe we could add robustness flag to ssbo load/store to pick a different addr format
23:32 karolherbst[d]: per instruction? yeah that might work.. make it an ACCESS thing?
23:32 karolherbst[d]: `ACCESS_IN_BOUNDS`?
23:32 karolherbst[d]: just set that?
23:32 karolherbst[d]: we already have it
23:33 karolherbst[d]: though we don't make use of it in nak...
23:33 karolherbst[d]: maybe we should, lol
23:33 karolherbst[d]: I should add it to my todo list
23:34 cosmicemotion[d]: mohamexiety[d]: I'm about to compile the kernel with this patch right now. I'll offer some insight as to if it compiles first of all and if it works secondly. 🙂
23:36 cosmicemotion[d]: Also, if I might ask, does anyone have any idea what's needed for Final Fantasy Rebirth to not freak out when launching with Fragment Shader interlock merged on NVK? It launches only on Feature Level 12.2 and proceeds to compile shaders. After that it's an out of Video Memory error and it closes.
23:39 mohamexiety[d]: we don't support fl12_2 quite yet. missing RT for that
23:40 mohamexiety[d]: that said, an out of video memory error during shader compile/startup usually indicates something failed somewhere early on and the error reporting thinks it's an out of memory thing
23:42 mhenning[d]: karolherbst[d]: I also have some work for controlling bounds checking in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38874 which might be related
23:42 cosmicemotion[d]: mohamexiety[d]: Tbh this game is very problematic even on the proprietary driver. After 1-2 mins, even on the lowest possible settings, it runs out of VRAM. I've filed a bug report for a memory leak (?) on the VKD3D github tracker.
23:43 mhenning[d]: I'm not sure ACCESS_IN_BOUNDS is used in common code at all yet so it might not be exactly what you're looking for
23:44 karolherbst[d]: yeah.. but if we need a per instruction flag, might as well make use of it
23:44 karolherbst[d]: and it's safe to ignore, soo..
23:45 saancreed[d]: mohamexiety[d]: Unreal Engine likes to report all sort of things as OOM errors. I've seen it claim oodle integrity check failures (caused by Intel 13 gen CPU bugs) as one 🙃
23:46 mhenning[d]: cosmicemotion[d]: Not sure if this will help but in the past some games needed `VKD3D_CONFIG=no_upload_hvv`
23:46 mhenning[d]: I think that's supposed to be fixed now though
23:47 cosmicemotion[d]: mhenning[d]: Yeah I'm using `VKD3D_CONFIG=force_static_cbv,no_upload_hvv,force_host_cache`. No dice. Same issue after 5 mins tops
23:54 airlied[d]: ACCESS_IN_BOUNDS looks useful alright, just in nak/nir we pick different ssbo addr formats depending on robustness which needs to handle it