00:02fdobridge: <redsheep> 1. NVK+Zink plasma wayland session is just a black screen and cursor. You can force applications to open in it from ssh, but nothing comes up naturally.
00:02fdobridge: <redsheep> 2. NVK+Zink plasma x11 session flickers between the last few frames across steam and various kde apps, and seems to get more intense under load
00:02fdobridge: <redsheep> 3. NVK+Zink plasma x11 session crashes kwin in libvulkan_nouveau.so on launch, and every few hours after.
00:02fdobridge: <redsheep> 4. NVK+Zink causes Chromium to randomly hit Xid 13, i.e. a crash due to a shader issue (this one is almost certainly not zink's fault)
00:02fdobridge: <redsheep> 5. Zink on discord has corruption with the UI, especially under high gpu load. From what I understand this happens on radv as well.
00:03fdobridge: <gfxstrand> There are also still some fence bugs we need to sort out in nouveau.ko.
00:04fdobridge: <zmike.> Zink has been tested and is used regularly with plasma sessions on other platforms without issues
00:05fdobridge: <zmike.> I'm more suspicious of nvk in most of those issues
00:05fdobridge: <zmike.> Is this the discord client?
00:06fdobridge: <zmike.> Or browser?
00:06fdobridge: <redsheep> Right, since they will open as zink issues though I thought I would ask
00:06fdobridge: <gfxstrand> I suspect at least some of the plasma issues are something to do with non-modifiers paths somewhere.
00:06fdobridge: <redsheep> Discord client, the electron app
00:06fdobridge: <zmike.> Probably
00:06fdobridge: <zmike.> Ah
00:06fdobridge: <zmike.> Are you sure that's using zink
00:06fdobridge: <zmike.> And not ANGLE
00:07fdobridge: <redsheep> It only happens when I enable NOUVEAU_USE_ZINK or the loader override, and it doesn't matter whether that's set for the entire session or not
00:07fdobridge: <redsheep> I bring it up with the others because if it's set for your whole session, suddenly all these non-games are obviously running on zink
00:08fdobridge: <zmike.> Still not conclusive, there could be issues with dmabufs shared across GL contexts
00:08fdobridge: <gfxstrand> I'm starting to suspect that these have something to do with explicit implicit sync.
00:08fdobridge: <zmike.> Zink has a driver workaround to force that on Nvidia blob
00:10fdobridge: <zmike.> Open tickets I suppose, can't hurt
00:10fdobridge: <redsheep> Okay, sounds good
00:14fdobridge: <redsheep> Turnip is the platform where zink sessions have been tested the most, right? Could the difference be that the other well tested platforms all have descriptor buffer?
00:14fdobridge: <redsheep> You had mentioned that influences quite a lot of how well zink works
00:20fdobridge: <zmike.> They're pretty well tested on AMD by now too
00:20fdobridge: <zmike.> Zink is reasonably well tested without DB because of venus
00:26fdobridge: <zmike.> ...also the non-DB code was the default/only codepath for years before DB existed and hasn't changed since
00:30fdobridge: <airlied> a fedora kernel with the modifiers fix would be a few weeks away
00:30fdobridge: <airlied> I can accelerate the process but it's not usually worth the pain
00:31fdobridge: <zmike.> Guess it's a few weeks until I need to roll up my boots then
00:31fdobridge: <zmike.> :sweatytowelguy:
01:09fdobridge: <redsheep> I am no longer seeing the plasma x11 session crash... Either that got fixed, or it's harder to replicate now because it doesn't happen on first load
01:09fdobridge: <gfxstrand> @redsheep If you want to play around, my nvk/cbuf0 branch might help. I suggest trying it with `NVK_DEBUG=no_cbuf`. I'm going to try and improve the cbuf situation tomorrow.
01:09fdobridge: <gfxstrand> (With perf, that is.)
01:10fdobridge: <redsheep> https://tenor.com/view/yyyess-yes-banbert-excited-gif-20963567
01:12fdobridge: <zmike.> Improving with perf is my favorite kind of improving
01:12fdobridge: <redsheep> Yeah to be honest I don't actually care about modifiers for the sake of modifiers like... at all... I have just been testing it to try and speed things along so we can go back to fun stuff
01:13fdobridge: <gfxstrand> hehe
01:13fdobridge: <gfxstrand> Yeah, modifiers are the most important least interesting thing.
01:13fdobridge: <gfxstrand> Hence why I've been putting them off. 😅
01:15fdobridge: <gfxstrand> Right now that branch is very not great, but I think it should barely accomplish the objective. I need to give more thought to how I want it to look long-term.
01:15fdobridge: <redsheep> What kind of not great? Like, you don't like something with quality or maintainability, or you think it will crash?
01:15fdobridge: <gfxstrand> And it seems to mostly pass CTS
01:16fdobridge: <gfxstrand> I don't expect crashes though it's possible with a change this major
01:16fdobridge: <gfxstrand> It's more that it's a really hacky way of doing it.
01:16fdobridge: <gfxstrand> I still don't have a plan for a clean way
01:18fdobridge: <redsheep> What is this branch actually doing? What does CB0 mean?
01:26fdobridge: <redsheep> Hmm I see now why you were talking about the mme
01:38fdobridge: <redsheep> Yeah, the branch definitely isn't quite as solid as main, zink session went from mostly okay to barely working lol
01:55fdobridge: <redsheep> Unfortunately that forced me to change two variables at once: I had to go back to an nvc0 session. Switching that, and going from fresh main to cbuf0 without setting the environment variable, resulted in:
01:55fdobridge: <redsheep> Talos: 80 > 76 fps (probably session type, this is similar to previous tests I did along these lines)
01:55fdobridge: <redsheep> Doom Eternal: 22 > broken, black screen on launch
01:55fdobridge: <redsheep> The Witness: 56 > 56 fps
01:55fdobridge: <redsheep> I add NVK_DEBUG=no_cbuf and I get:
01:55fdobridge: <redsheep> Talos: 76 > 68 fps, bizarrely a performance regression, and I have tested back and forth several times to confirm
01:55fdobridge: <redsheep> Doom Eternal: Remains broken
01:55fdobridge: <redsheep> At this point it was not looking so great but I had one more game to test:
01:55fdobridge: <redsheep> The Witness: 56 > 99 fps, a *77% improvement*
01:56fdobridge: <redsheep> @gfxstrand It's not all roses, but there's a pretty huge bright spot there at the end
02:00fdobridge: <redsheep> 99 fps is the highest I have ever seen in The Witness on NVK from that test location, and it's not even close
02:08fdobridge: <gfxstrand> Yeah. I'm not surprised it's a bit broken. The branch needs serious work. Getting a good bump in The Witness makes me happy, though. I'll keep going. I'm sure I can get it solid. This is the very hacky "let's try to use the inline path" version. It's gonna take more work to be sure.
02:08fdobridge: <gfxstrand> But if we're getting a bump, that tells me that maybe I'm on to something.
02:08fdobridge: <gfxstrand> More work to do.
02:11fdobridge: <redsheep> I think the Doom Eternal issue is really just that it starts rendering really, really slowly when it goes to render the menu background. The game draws black over it for a few frames so you don't see things popping in, but it gets so slow the black frames never end. That's similar to what was happening with the zink session, it just got really slow
02:11fdobridge: <airlied> filed an f40 update dropping nvk
02:14fdobridge: <redsheep> Also it doesn't seem the cbuf0 branch outright breaks zink performance, it's still running minecraft just fine
02:44fdobridge: <airlied> okay found the bug with autosuspend, now to find the right fix
03:07fdobridge: <airlied> https://patchwork.freedesktop.org/patch/594067/
03:07fdobridge: <airlied> @gfxstrand probably should add that to your tree
03:08fdobridge: <gfxstrand> Yeah, the CTS also seems slow. IDK why yet. I'm sure it's fixable.
03:09fdobridge: <gfxstrand> Oh, nice! I'm definitely going to pull that in and give it a spin. Does it affect the GPU sleep issues?
03:09fdobridge: <!DodoNVK (she) 🇱🇹> Only jadahl wanted a code change (2 other people already ACKed it)
03:10fdobridge: <airlied> yes it should fix the gpu sleep
03:12fdobridge: <gfxstrand> Sweet
03:12fdobridge: <gfxstrand> Yeah, that should help a lot. Maybe I can actually GDB things on my laptop now. 😂
03:13fdobridge: <redsheep> Alright, one issue filing down, 4 to go lol
03:21fdobridge: <gfxstrand> Now if only we could figure out why I sometimes get "VMM allocation failed" and my firmware takes a hike into the woods indefinitely.
03:25fdobridge: <airlied> that usually means GSP has fatally crashed, need to use the logging patch from timur and send him the output after it dies
03:27fdobridge: <!DodoNVK (she) 🇱🇹> Have you done actual hikes in the woods/forests though? 🌳
03:36fdobridge: <redsheep> In the course of 4 hours I haven't yet hit the Xid 13 in Chromium, so clearly that one is not common.
03:45fdobridge: <redsheep> Since I can't easily replicate that one I won't torture somebody with an issue for the xid error, for all I know it is fixed.
03:45fdobridge: <redsheep> If somebody wants to label these others NVK and Zink, that would be great:
03:45fdobridge: <redsheep> https://gitlab.freedesktop.org/mesa/mesa/-/issues/11160
03:45fdobridge: <redsheep> https://gitlab.freedesktop.org/mesa/mesa/-/issues/11161
03:45fdobridge: <redsheep> https://gitlab.freedesktop.org/mesa/mesa/-/issues/11162
03:45fdobridge: <redsheep> https://gitlab.freedesktop.org/mesa/mesa/-/issues/11163
03:55fdobridge: <redsheep> Thanks @asdqueerfromeu
04:27fdobridge: <gfxstrand> Branch should be updated. I'll let you know if it blows up
05:18fdobridge: <marysaka> *note down to redirect her urges to write a register allocator to that*
08:32fdobridge: <mtijanic> Does anyone here regularly trace/profile/etc OpenRM or other nvidia kmods? I expect it's mostly just code reading for OpenRM and then runtime mmio tracing the UMDs for work submission and other UMD->GPU communication, and no one really cares about the UMD->KMD interface?
08:32fdobridge: <mtijanic>
08:32fdobridge: <mtijanic> But if you do, I'd like to hear how you do it and what the painpoints are.
08:50fdobridge: <airlied> I usually just stick printfs in the kernel code to trace what paths are used
08:51fdobridge: <airlied> Dumping gsp msgs also sometimes useful
08:57fdobridge: <mtijanic> We mostly use bpftrace, plus a few other specialized tools, as you often want to trace something without reloading (and/or recompiling) the kmod. But the whole thing is pretty diffuse, with multiple sets of scripts doing the same thing, etc. I've recently taken to consolidating the bits and was just curious if it's something external people would find useful.
08:58fdobridge: <mtijanic> Well, useful _enough_ to justify going through the process of getting it published. If there's only a specific issue you have here or there, we can just work to fix that or share one of the many hacks we have to work around it.
09:01fdobridge: <mtijanic> As a fun fact, on the more extreme end of the spectrum, I have a script I can `curl | sh` that builds and installs a kmod which disables write protections on the kernel and then hotpatches some BPF functionality :D. Not something I'd share in good conscience though.
09:12fdobridge: <airlied> With the open module, for most of us just rebuilding with specific printfs is probably easier for the few times we need to use it
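(For reference, the "just rebuild with printfs" approach looks roughly like the sketch below; the function name and arguments are hypothetical, purely to illustrate the pattern.)

```c
/* Hedged sketch of printk-style tracing in a kernel module: drop a
 * pr_info() at the entry of the path you care about, rebuild the
 * module, and read the output with dmesg. demo_gsp_rpc_send() is a
 * made-up function name for illustration, not real nouveau code. */
#include <linux/printk.h>
#include <linux/types.h>

static int demo_gsp_rpc_send(u32 fn, u32 len)
{
	pr_info("gsp rpc: fn=0x%x len=%u\n", fn, len);
	/* ... the real RPC send would happen here ... */
	return 0;
}
```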
09:18fdobridge: <karolherbst🐧🦀> I just found out about `nir_lower_wrmasks` and it looks like nak and codegen could benefit from it by not having to do that stuff themselves
09:19fdobridge: <karolherbst🐧🦀> though not sure how much nak would actually benefit there
14:39fdobridge: <mhenning> @karolherbst I've known about that pass for a while. I forget why I don't use it in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24984 - I think a different pass also lowers them for us, but I forget which one
14:40fdobridge: <mhenning> Anyway, NAK's lowering is similar to that MR, so I don't think it's necessary
14:48fdobridge: <gfxstrand> Meh. It's not that hard to lower in NAK and we have to deal with alignments anyway so we may as well do both at the same time.
14:49fdobridge: <gfxstrand> I considered using it but I had to handle alignments anyway and it was the same loop
14:50fdobridge: <karolherbst🐧🦀> yeah, fair
14:50fdobridge: <karolherbst🐧🦀> hot take: we should get rid of wrmask altogether :ferrisUpsideDown:
14:50fdobridge: <karolherbst🐧🦀> well.. maybe just for global load/stores
14:51fdobridge: <karolherbst🐧🦀> I fear that some hardware can actually make use of that mask? And I suspect it's intel
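(For context on `nir_lower_wrmasks`: the pass splits store intrinsics whose write mask has holes into contiguous stores, so a backend never has to handle sparse masks itself. Hooking it up looks roughly like the sketch below; the filter and function names are illustrative, not NAK's or codegen's actual code, and the prototype should be double-checked against nir.h.)

```c
#include "nir.h"

/* Only bother with global stores in this example; a real backend would
 * filter for whatever store intrinsics its ISA can't mask natively. */
static bool
lower_wrmask_filter(const nir_instr *instr, const void *data)
{
   if (instr->type != nir_instr_type_intrinsic)
      return false;

   const nir_intrinsic_instr *intr =
      nir_instr_as_intrinsic((nir_instr *)instr);
   return intr->intrinsic == nir_intrinsic_store_global;
}

static void
lower_sparse_write_masks(nir_shader *nir)
{
   bool progress = false;
   NIR_PASS(progress, nir, nir_lower_wrmasks, lower_wrmask_filter, NULL);
}
```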
14:55fdobridge: <zmike.> maybe dumb question but why not use nir_opt_load_store_vectorize
14:56fdobridge: <mhenning> That's exactly what that MR is doing
14:56fdobridge: <zmike.> ok so dumb question after all
14:57fdobridge: <karolherbst🐧🦀> maybe we should just merge it :ferrisUpsideDown:
14:59fdobridge: <mhenning> It has merge conflicts at this point, and I don't have much time right now to hack on stuff
15:00fdobridge: <mhenning> If you want to rebase and re-run cts, go ahead
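(For anyone following along, calling the vectorizer from a backend's NIR optimization loop looks roughly like this. The callback signature and option fields have changed across Mesa releases, so treat this as a sketch of the general shape rather than a drop-in; check nir.h for the exact prototype.)

```c
#include "nir.h"

/* The callback answers "may these two adjacent loads/stores be merged
 * into one wider access?". A real driver would also inspect align_mul /
 * align_offset against what its load/store instructions support; here
 * we just cap the merged width at 128 bits as an example policy. */
static bool
mem_vectorize_cb(unsigned align_mul, unsigned align_offset,
                 unsigned bit_size, unsigned num_components,
                 nir_intrinsic_instr *low, nir_intrinsic_instr *high,
                 void *data)
{
   return bit_size * num_components <= 128;
}

static void
vectorize_mem_access(nir_shader *nir)
{
   const nir_load_store_vectorize_options opts = {
      .modes = nir_var_mem_global | nir_var_mem_ssbo | nir_var_mem_shared,
      .callback = mem_vectorize_cb,
   };

   bool progress = false;
   NIR_PASS(progress, nir, nir_opt_load_store_vectorize, &opts);
}
```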
15:00fdobridge: <ahuillet> MME stands for "macro method expander" btw which gives you an idea of what it was originally meant to be, more of a preprocessor than a real CPU as I understand it
15:01fdobridge: <gfxstrand> Very much so
15:02fdobridge: <gfxstrand> It took a bit for that difference to click in my head but once it did, the MME design makes a lot of sense.
15:20fdobridge: <!DodoNVK (she) 🇱🇹> So it isn't "Memory Management Engine", right?
15:23fdobridge: <ahuillet> well no?
15:53fdobridge: <redsheep> I believe you're thinking of MMU, memory management unit
15:57fdobridge: <redsheep> @zmike. For the backtrace you're wanting I just need to run a debug build of mesa and wait for the next crash to get that in dmesg, right?
15:58fdobridge: <zmike.> probably get it from something like `coredumpctl gdb -1` after the crash
15:58fdobridge: <redsheep> Does that require a debug build? My current one isn't debug
15:58fdobridge: <zmike.> yes
15:59fdobridge: <zmike.> or at least debug symbols
19:26Lyude: gfxstrand: once I get through cleaning up the git history for rvkms I might be able to look at that
19:26Lyude: I'm not sure how much actually needs to change for HDR but I don't think it's much tbh
19:27airlied: I think how NVIDIA expose HDR is quite different to other HDR vendors, so I expect it's a lot
19:33Lyude: ah, huh
19:37fdobridge: <Misyl with Max-Q Design> In terms of color pipelines for scanout, NV's is remarkably over-engineered and over-specialized, which results in it being overly useless
19:37fdobridge: <Misyl with Max-Q Design> 3D LUTs >>>> whatever they have going on
19:42airlied: I think the NVIDIA evo API is fair bit abstracted from the underlying hw
19:42airlied: but it's definitely a big job no matter what
19:43HdkR: Oh no, they aren't just 3D LUTs? D:
19:45fdobridge: <Misyl with Max-Q Design> It's like some conversions, then 1D LUTs in ICtCp space
19:46HdkR: Wacky
19:46fdobridge: <Misyl with Max-Q Design> Very limiting compared to AMD's 17x17x17 tetrahedral 3D LUTs
19:46fdobridge: <Misyl with Max-Q Design> I think QCom might have something similar?
19:47fdobridge: <Misyl with Max-Q Design> No idea tho
21:33fdobridge: <gfxstrand> @redsheep I reproduced your Witness numbers on my 4060. no_cbuf gives me 20%, too.
21:33fdobridge: <gfxstrand> I think I very badly need to implement bindless cbufs.
21:34fdobridge: <redsheep> Hmm? It gave me 77% more performance, but I guess yours won't be as drastic with a narrower gpu
21:34fdobridge: <redsheep> Do you have msaa on?
21:34fdobridge: <gfxstrand> Yeah, 66 FPS vs. 80 FPS with no_cbuf, no MSAA.
21:34fdobridge: <redsheep> I have been testing maxed msaa, wonder if that also changes things
21:35fdobridge: <redsheep> At the least it makes it heavier
21:35fdobridge: <gfxstrand> Yeah
21:36fdobridge: <redsheep> I really like the witness as a test because it loads so fast, and it's very consistent if you can find a way to anchor looking at exactly the same thing from the same place
21:36fdobridge: <redsheep> The world has nothing dynamic, like at all
21:37fdobridge: <redsheep> Does this cbuf work count as fixes that can go into 24.1 point releases, or is it 24.2 stuff at this point?
21:40fdobridge: <gfxstrand> 24.2
21:41fdobridge: <gfxstrand> Yeah, it varies by like 1 FPS once everything is loaded
21:44fdobridge: <redsheep> Oh btw if you do use it for testing, something that really helps is to get positioned and then exit and relaunch before taking down numbers
21:44fdobridge: <redsheep> That way when you load in next with a change it's more precisely comparable
21:55fdobridge: <redsheep> Anyway, let me know if you reach another point where you want more testing.
21:56fdobridge: <gfxstrand> I'm going to go on a bindless cbuf quest for a bit. Once I've got those working and The Witness going even faster, I'll throw it at you again.
21:57fdobridge: <redsheep> Sounds good. Have you found what part is making some stuff really slow?
21:57fdobridge: <gfxstrand> I've got theories but I need bindless cbufs to fix them.
21:57fdobridge: <gfxstrand> Which is going to be a bit of a quest because it's a lot of NAK work.
21:57fdobridge: <karolherbst🐧🦀> so it's finally going to happen
21:58fdobridge: <gfxstrand> It's got to
21:58fdobridge: <karolherbst🐧🦀> the funky part is using them as source
21:58fdobridge: <gfxstrand> That's not too bad but I'll have to figure out the encoding
21:59fdobridge: <karolherbst🐧🦀> ohh
21:59fdobridge: <karolherbst🐧🦀> does NAK support uniform regs?
21:59fdobridge: <karolherbst🐧🦀> because if not, you might want to do that first
21:59fdobridge: <gfxstrand> Not really
22:00fdobridge: <gfxstrand> And they're kinda dumb
22:00fdobridge: <gfxstrand> But I'm going to have to do that
22:00fdobridge: <karolherbst🐧🦀> yeah..
22:00fdobridge: <gfxstrand> At least a little bit
22:00fdobridge: <karolherbst🐧🦀> using bindless ubos as sources requires the address to be in a uniform reg
22:00fdobridge: <gfxstrand> I know
22:01fdobridge: <karolherbst🐧🦀> anyway, fun project
22:01fdobridge: <karolherbst🐧🦀> what's the bad part about them though?
22:01fdobridge: <gfxstrand> You can only use them in uniform control flow and there are pretty much only integer instructions for them.
22:02fdobridge: <karolherbst🐧🦀> ah yeah, fair
22:02fdobridge: <gfxstrand> So they're not something where we can convert all non-divergent things to them.
22:02fdobridge: <gfxstrand> We can pretty much only use them for a few cbuf things.
22:02fdobridge: <gfxstrand> Which is kinda fine
22:02fdobridge: <karolherbst🐧🦀> yeah..
22:02fdobridge: <karolherbst🐧🦀> I think that's the main idea
22:04fdobridge: <karolherbst🐧🦀> but like.. you can also just always write a uniform value into a uniform register if you need it
22:04fdobridge: <gfxstrand> Sure
22:05fdobridge: <gfxstrand> And some LDC might want to do that
22:06fdobridge: <redsheep> Just curious, given the need for some bigger NAK changes here, is this a situation where you're liking or disliking the CTS-focused path you've been on so far?
22:06fdobridge: <redsheep> I know it's kind of an abstract question
22:06fdobridge: <redsheep> As in, would this have been easier earlier, or is it easier now because you know that everything else is basically working?
22:07fdobridge: <karolherbst🐧🦀> what's interesting is that tex instructions can also get the handle directly from a uniform register instead of doing the cbuf pull
22:08fdobridge: <gfxstrand> Oh, it was absolutely the right choice. I can regression test my changes.
22:09fdobridge: <karolherbst🐧🦀> what's more interesting is that it's a 64-bit read, not 32 as normal
22:33fdobridge: <mohamexiety> is this more limited than other scalar implementations? like e.g. AMD's SALU, or Intel's equivalent if they have something similar
22:36fdobridge: <gfxstrand> Yeah, much more limited
22:43fdobridge: <mohamexiety> 😮 I see. I was under the impression that the others were also more like dedicated units for address generation and such, and you realistically couldn't do that much with them (barring some scalarization tricks on AMD)