09:07OftenTimeConsuming: phomes_[d]: Ah yes, the search failure. Version 1.4-r1 on the computer that has good performance.
09:37OftenTimeConsuming: Updated to 1.5 and got better performance - I get 120fps with mesa 25.1.8
11:44notthatclippy[d]: marysaka[d]: did you maybe keep track (via envyhooks) of which software methods the NV UMDs invoke?
11:45marysaka[d]: notthatclippy[d]: not quite sure what you mean? Do you mean the DRM side of things or actual methods sent (via subchannel 6) ?
11:46notthatclippy[d]: Yeah, actual methods. SET_OBJECT and then the method
11:46notthatclippy[d]: Context: <https://github.com/NVIDIA/open-gpu-kernel-modules/discussions/157#discussioncomment-15252434>
11:47marysaka[d]: notthatclippy[d]: I dump all GP entries from GPGet to GPPut when GPPut is written
11:48marysaka[d]: (I need to rework it for Blackwell and to not be at the mercy of the host's in-flight processing of them)
11:48marysaka[d]: so it should be able to dump those as long as a mapping exists in the current process address space
11:49notthatclippy[d]: I just imagined myself in your shoes for a moment, and if it were me, I would totally have kept a list of all the methods ever encountered, mapped to what was documented in the NV docs, and then a big fat red label on the ones that were not documented :)
11:51notthatclippy[d]: (I would also probably build a markov chain (or finetune an LLM) on the sequences as encountered in the wild, but that's not really relevant right now)
11:52marysaka[d]: notthatclippy[d]: could probably build that for sure, I'm trying to keep it quite simple but it should be possible after execution to use the dumped blobs to check what is undocumented
11:53marysaka[d]: we have a tool to translate raw blobs to human-readable dumps so it should really be trivial
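(For anyone curious what such a checker could look like: a minimal C sketch that walks a dumped pushbuffer and flags methods missing from a known-methods table. The header layout follows the Fermi+ format documented in nouveau/envytools; `known_method()` is a hypothetical stand-in for a lookup generated from the published class headers, not anything from envyhooks.)
```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in: a real tool would consult a per-class table keyed
 * on whatever class SET_OBJECT bound to this subchannel. */
static bool known_method(int subc, uint32_t mthd)
{
    (void)subc;
    switch (mthd) {
    case 0x0000: /* SET_OBJECT */
    case 0x0200:
    case 0x0204:
        return true;
    default:
        return false;
    }
}

/* Fermi+ pushbuffer header, per nouveau/envytools:
 *   [31:29] opcode: 1 = incrementing, 3 = non-incrementing,
 *           4 = immediate data, 5 = increment-once
 *   [28:16] method count (payload for opcode 4)
 *   [15:13] subchannel
 *   [12:0]  method address >> 2
 */
static void scan_push(const uint32_t *buf, size_t ndw)
{
    for (size_t i = 0; i < ndw;) {
        uint32_t hdr  = buf[i++];
        uint32_t op   = (hdr >> 29) & 0x7;
        uint32_t cnt  = (hdr >> 16) & 0x1fff;
        int      subc = (hdr >> 13) & 0x7;
        uint32_t mthd = (hdr & 0x1fff) << 2;

        if (op == 4) { /* immediate: single method, data lives in the header */
            if (!known_method(subc, mthd))
                printf("UNDOCUMENTED: subc %d mthd 0x%04x\n", subc, mthd);
            continue;
        }
        for (uint32_t j = 0; j < cnt && i < ndw; j++, i++) {
            uint32_t m = mthd;
            if (op == 1)      m += 4 * j;     /* incrementing */
            else if (op == 5) m += j ? 4 : 0; /* increment-once */
            /* op == 3: non-incrementing, method stays put */
            if (!known_method(subc, m))
                printf("UNDOCUMENTED: subc %d mthd 0x%04x\n", subc, m);
        }
    }
}

int main(void)
{
    /* Incrementing run of two dwords at method 0x200 on subchannel 0. */
    uint32_t push[] = { (1u << 29) | (2u << 16) | (0u << 13) | (0x200 >> 2),
                        0xdeadbeef, 0xcafebabe };
    scan_push(push, sizeof(push) / sizeof(push[0]));
    return 0;
}
```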
11:54notthatclippy[d]: I mean, don't bother. Was just curious if you already had something handy to give to that person in the link. Otherwise, they can build it themselves. And you can do it whenever you actually need it, or curiosity gets the better of you.
11:56marysaka[d]: That's on my todo list kinda, I was planning to have my test bench crunch all VKCTS tests with envyhooks to aggregate all unknown commands (and also RM control calls that currently aren't pretty-printed) 😄
11:57notthatclippy[d]: Please send me the results if you do.
11:59notthatclippy[d]: Ideally, nothing used in production would be undocumented, but I'm sure that is not the case. There's probably a bunch that are instrumentation only and can just be skipped, but there's probably also some that do something useful and should be implemented in a documented way.
12:40x512[m]: I am also making a C++ RAII NVRM and NVKMS wrapper for Haiku.
15:23karolherbst[d]: oh also.. ping me if you want to become nouveau ML moderators... it's not using a secret pw anymore, but is using our gitlab as a SSO provider
15:23karolherbst[d]: _lyude[d]: should also be able to add one
18:40_lyude[d]: following openrm code is really hard sometimes, geez
18:40_lyude[d]: I hate how quite often it feels nearly impossible to actually follow where a function in this driver is being called from
18:42notthatclippy[d]: Trust me, it's easier with openrm than the codebase we have internally. At least it's actually C code and not a custom language that no tools understand.
18:43_lyude[d]: oh i can believe it, considering how much trouble I have reading amdgpu
18:43_lyude[d]: fwiw btw: i've been trying to figure out where kfifoStateLoad gets called. Conveniently, there appear to be 0 instances of this function getting called that I can find, even though it seems like it needs to be called to actually resume active channel scheduling
18:51_lyude[d]: is there any chance there's anyone from nvidia here who might be able to help me out with this? if I can't get logs decoded anymore this is the next best way of trying to fix the suspend issue on my desktop's gpu but i'm genuinely getting lost trying to understand what is going on here
18:52_lyude[d]: really just having someone actually familiar with this driver would be a huge help
18:56notthatclippy[d]: In general, feel free to ping me for any explanation of openrm code. I'm not very useful otherwise, but at least this I mostly know offhand.
18:58notthatclippy[d]: More specifically, there is a certain kind of C++-style OOP happening in the code, in a few places. One is that all "children" of `OBJGPU` (as defined here: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/generated/g_gpu_nvoc.h#L984-L1021) inherit from a class called `OBJENGSTATE`. `KernelFifo` is one of them. `OBJENGSTATE` itself defines a bunch of virtual methods,
18:58notthatclippy[d]: here: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/generated/g_eng_state_nvoc.h#L141-L156
19:00notthatclippy[d]: Then, `OBJGPU` as part of its state transitions will iterate over all engines and call the appropriate engine function. For that particular case, you'll have this code:
19:00notthatclippy[d]: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/src/kernel/gpu/gpu.c#L2559
19:00notthatclippy[d]: which calls `xxxStateLoad` for each value of `xxx`, including `kfifo`
19:03notthatclippy[d]: A bit more generally, OBJGPU goes through about half a dozen states in its lifetime, and each time it switches a state, it runs the appropriate function on each engine child, plus a bunch of special handling before/after. This logic is way more complicated than it needs to be right now because it was supposed to handle SLI in the Old Way it was implemented.
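(A condensed sketch of that pattern in plain C, to show why grep can't find the call site. The names mimic OpenRM's but the code below is illustrative, not the real thing:)
```c
#include <stddef.h>

typedef int NV_STATUS;
#define NV_OK 0

struct OBJGPU;

/* "Virtual method table" of the OBJENGSTATE base class: each engine child
 * fills in the hooks it cares about; NULL means "nothing to do". */
typedef struct OBJENGSTATE {
    NV_STATUS (*stateInit)(struct OBJGPU *, struct OBJENGSTATE *);
    NV_STATUS (*stateLoad)(struct OBJGPU *, struct OBJENGSTATE *, unsigned flags);
    NV_STATUS (*stateUnload)(struct OBJGPU *, struct OBJENGSTATE *, unsigned flags);
} OBJENGSTATE;

/* KernelFifo "inherits" by embedding the base as its first member. */
typedef struct KernelFifo {
    OBJENGSTATE base;
    /* ... fifo-specific state ... */
} KernelFifo;

static NV_STATUS kfifoStateLoad(struct OBJGPU *gpu, OBJENGSTATE *eng, unsigned flags)
{
    (void)gpu; (void)eng; (void)flags;
    /* restore channel scheduling etc. */
    return NV_OK;
}

static void kfifoConstruct(KernelFifo *kfifo)
{
    kfifo->base.stateLoad = kfifoStateLoad; /* the only "call site" grep sees */
}

/* OBJGPU's state transition walks every engine child and invokes the hook;
 * this indirect loop is what makes kfifoStateLoad look uncalled to clangd. */
typedef struct OBJGPU {
    OBJENGSTATE *engines[8];
    size_t numEngines;
} OBJGPU;

static NV_STATUS gpuStateLoad(OBJGPU *gpu, unsigned flags)
{
    for (size_t i = 0; i < gpu->numEngines; i++) {
        OBJENGSTATE *eng = gpu->engines[i];
        if (eng->stateLoad) {
            NV_STATUS st = eng->stateLoad(gpu, eng, flags);
            if (st != NV_OK)
                return st;
        }
    }
    return NV_OK;
}
```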
19:04cubanismo[d]: They haven't built an AI that can understand the resman code structure yet, so job security. On the other hand, they haven't built an engineer that can understand it either, so ¯\_(ツ)_/¯
19:05cubanismo[d]: Though I hear if you configure ctags properly, it can follow it to some extent.
19:05_lyude[d]: I've actually got clangd setup and working as far as it can for openrm I believe
19:05_lyude[d]: but sometimes even it seems to get lost
19:05cubanismo[d]: Yeah, I'm referring to the un-demangled internal version
19:06cubanismo[d]: These days I always have a build of the open version sitting around as a build artifact, so I just grep in there.
19:06cubanismo[d]: But it used to be really hard and involve a lot of cursing trying to figure out how something worked for the first time.
19:06cubanismo[d]: (I'm a userspace engineer who dabbles in RM sometimes)
19:07notthatclippy[d]: The states are:
19:07notthatclippy[d]: - Construct, PostConstruct - Creates OBJGPU and all child objects, doesn't really touch HW other than to figure out which GPU it is and wire up the `_HAL` function pointers. After these are done, GSP is initialized.
19:07notthatclippy[d]: - StatePreInit, StateInit, StatePostInit - Run once after GSP is initialized
19:07notthatclippy[d]: - StatePreLoad, StateLoad, StatePostLoad - Run once at init, then also each time GPU wakes from S/R
19:07notthatclippy[d]: - StatePreUnload, StateUnload, StatePostUnload - Each time GPU is suspended, and on shutdown
19:07notthatclippy[d]: - StateDestroy - Once on full GPU teardown
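(For orientation, a rough rendering of that ordering as a C enum; illustrative only, not actual OpenRM definitions:)
```c
/* Rough OBJGPU lifecycle, per the description above. */
enum gpu_lifecycle_phase {
    GPU_CONSTRUCT,         /* object graph built, HALs wired; GSP init follows */
    GPU_STATE_PRE_INIT,    /* } run once, after GSP is initialized            */
    GPU_STATE_INIT,        /* }                                               */
    GPU_STATE_POST_INIT,   /* }                                               */
    GPU_STATE_PRE_LOAD,    /* } at init, then again on every resume from S/R  */
    GPU_STATE_LOAD,        /* }                                               */
    GPU_STATE_POST_LOAD,   /* }                                               */
    GPU_STATE_PRE_UNLOAD,  /* } on every suspend, and on shutdown             */
    GPU_STATE_UNLOAD,      /* }                                               */
    GPU_STATE_POST_UNLOAD, /* }                                               */
    GPU_STATE_DESTROY,     /* once, on full GPU teardown                      */
};
```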
19:07_lyude[d]: there's another openrm?
19:07_lyude[d]: notthatclippy[d]: OK - this is actually a lot like nouveau
19:07_lyude[d]: i guess that's where ben got his inspiration from
19:07cubanismo[d]: I think Ben said that was on purpose
19:07cubanismo[d]: Right
19:09notthatclippy[d]: _lyude[d]: Really, all the `g_`-prefixed files are generated from some "higher level" thing. In some ways it's easier to understand and edit, but it's not C and so comes with a ton of drawbacks. It used to be tens of thousands of lines of perl to generate them, but now we have a clang fork that accepts some custom syntax
19:10notthatclippy[d]: We've published a few of those before, mostly by accident. There's nothing really interesting there, and no one really wants to run a special compiler just to compile openrm, so shipping pure C code seemed like the saner option for everyone involved
19:12cubanismo[d]: Yeah, unless you want to develop major features in OpenRM code, it's not terribly helpful.
19:12cubanismo[d]: And even then, debatable.
19:13hatfielde: is there anything in envytools for viewing what a user space program like nvidia-smi is writing to shared memory? referenced by this github comment: https://github.com/NVIDIA/open-gpu-kernel-modules/discussions/157#discussioncomment-10772369
19:13cubanismo[d]: I think if you're fully ramped up on the framework, it's probably beneficial in the way C++ is a productivity improvement over pure C if you're fully ramped up on the latest C++ stuff.
19:16_lyude[d]: notthatclippy[d]: that is horrifying and fascinating
19:16_lyude[d]: thank you for the info haha
19:16_lyude[d]: (I say that lovingly btw!)
19:17notthatclippy[d]: Most of RM's codegen only becomes beneficial when it's time to add a new chip. The syntax sugar is there to provide saner mapping for all the HAL functions than updating everything manually. Kinda like rust's match+macros, but NIH'd.
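(To make that concrete, here is a hypothetical sketch of the kind of per-chip HAL wiring such codegen emits; every name and arch value below is invented for illustration, not taken from the real generated `g_*` files:)
```c
#include <stdint.h>

typedef struct KernelFifo KernelFifo;

/* Per-object HAL table: one function pointer per HAL'd method. */
typedef struct {
    void (*kfifoRestoreSchedState)(KernelFifo *);
} KernelFifoHal;

static void kfifoRestoreSchedState_GA102(KernelFifo *k) { (void)k; /* Ampere path */ }
static void kfifoRestoreSchedState_AD102(KernelFifo *k) { (void)k; /* Ada path */ }

struct KernelFifo {
    KernelFifoHal hal;
};

/* The codegen's value: adding a new chip is mostly "add one row" in the
 * higher-level source, instead of touching every call site by hand. */
static void kfifoWireHal(KernelFifo *k, uint32_t chipArch)
{
    switch (chipArch) {
    case 0x170: k->hal.kfifoRestoreSchedState = kfifoRestoreSchedState_GA102; break;
    case 0x190: k->hal.kfifoRestoreSchedState = kfifoRestoreSchedState_AD102; break;
    }
}
```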
19:17_lyude[d]: yeah
19:17_lyude[d]: I assume there's usually a good reason for this sort of thing, AMD works somewhat similarly with code generation iirc
19:18_lyude[d]: though theirs isn't as compact as NVs
19:18cubanismo[d]: I can't fully hate on the codegen. It's helped me out before.
19:18notthatclippy[d]: I can't hate it because I still remember The Perl.
19:18cubanismo[d]: It's just really, really hard to build an understanding of a code flow from scratch with it.
19:19hatfielde: Yeah I'm actually going through that now. Enabling logging has helped somewhat. The issue for me is the virtual dispatch
19:19cubanismo[d]: Right
19:19notthatclippy[d]: Enable logging, and use bpftrace to dump callstacks as you see something interesting.
19:20cubanismo[d]: You haven't built complex enough software if you haven't sorta-kinda invented a new programming language to write it in.
19:20cubanismo[d]: Would be sort of the backwards way to justify it 🙂
19:20notthatclippy[d]: At least the init paths are super easy to work with if you have a GPU that you can init/teardown on a whim. If you have to restart your whole session, or reboot.. then... ouch.
19:22_lyude[d]: cubanismo[d]: oh yeah I can't hate on it either, I honestly completely see why you guys would do it
19:23cubanismo[d]: Nova will have the same thing, but it'll just be Rust macros.
19:26_lyude[d]: nouveau had it as well
19:26_lyude[d]: Honestly it was a bit tough for me to understand at first but I got it after a while. The bigger issue was helping other people understand it
19:26_lyude[d]: (understandably, as it was quite confusing)
19:32_lyude[d]: btw, for some context if anyone's wondering what I'm doing: I've been trying to figure out a suspend/resume issue that's been happening on the desktop that I just built with an RTX 6000 (AD102), where it seems that we (supposedly) boot GSP properly but then fail to re-enable channel scheduling with `NV2080_CTRL_CMD_INTERNAL_FIFO_TOGGLE_ACTIVE_CHANNEL_SCHEDULING` (GSP just times out with that being
19:32_lyude[d]: the last unread message in the queue).
19:35notthatclippy[d]: How many channels do you have?
19:36notthatclippy[d]: I mean, is it some 'normal' value or did you get into the thousands?
19:36_lyude[d]: notthatclippy[d]: what type of channels?
19:37_lyude[d]: (if you know how to check this in nouveau too that would be helpful, but I might also be able to figure it out here as well)
19:38notthatclippy[d]: Put another way - this happens always, even on a fresh boot?
19:39_lyude[d]: notthatclippy[d]: No, it happens after the first suspend/resume cycle on this machine always. On the initial boot it's fine
19:39_lyude[d]: suspend works, resume fails with the message timeouts
19:40notthatclippy[d]: Actually, can you just send me the debugfs logs from gsp?
19:41_lyude[d]: woo yeah totally! I should note: I've been having issues getting the actual openrm driver to load here (it runs into some missing kernel symbols, didn't look too much past that). notthatclippy[d] I assume you want me to try grabbing logs after resume fails, or before?
19:42notthatclippy[d]: Yeah, after it hangs. The kernel should still be able to extract those from the memory buffers
19:42notthatclippy[d]: And I'll hunt down the decoder ring for that version somewhere...
19:43_lyude[d]: yep, I can do that!
19:43x512[m]: notthatclippy[d]: cubanismo[d] Any idea why with NVKMS the display output doesn't work until the screen is connected to a different port than the one used at boot time? This happens on a RISC-V machine that has no VESA/GOP support in firmware, so the GPU is started by the OS for the first time. From the API point of view the display output pretends to be working, but there is no output on screen until connecting to a different port.
19:43x512[m]: Could it be some RISC-V memory ordering issue?
19:44x512[m]: Interestingly, the issue is also present on the Nouveau KMD.
19:44cubanismo[d]: x512[m]: No idea. Display really isn't my specialty.
19:44airlied[d]: My favourite thing about all those neat sequenced operations is when there is some engine ordering requirement or dependency that won't fit the abstraction
19:44airlied[d]: Then the hacks make it nearly impossible to reason about
19:46cubanismo[d]: I think then you're just supposed to append a MidPreInit() function to all objects
19:46cubanismo[d]: And split every object's previous PreInit() function in two
19:46airlied[d]: I think nouveau has a number of places where engine init ordering is interdependent but can't express that so adds another set of hooks
19:46cubanismo[d]: And so on
19:47airlied[d]: Yes it just gets more horrible, for nova I'm trying to avoid that mistake and have explicit order, but will let others figure it out a bit
19:47notthatclippy[d]: airlied[d]: Yeah, OpenRM has things where the order is A->B for StateInit, but B->A for StateLoad, but only on HW X and Y, not on Z
19:48airlied[d]: I feel the abstraction becomes a midlayer at that point and you are better just explicitly sequencing stuff even if it's more code
19:48notthatclippy[d]: Fortunately, most of that is not relevant with the GSP offload, since GSP self-initializes in its own way. You just have to manage your own dependencies
19:49airlied[d]: Like instmem/bar/vm boot is also its own nightmare
19:50_lyude[d]: https://lyude.net/~lyudess/tmp/goldenwind-s3-fail-gsp-logs/ notthatclippy[d]
19:55notthatclippy[d]: _lyude[d]: GSP version 570.144, right? For some reason it's failing to decode. Probably user error on my end. I'll try a bit later, got a meeting now..
19:57_lyude[d]: notthatclippy[d]: correct. Keep in mind though, I made the assumption GSP booted based off the fact that we get past `ret = r535_gsp_rpc_poll(gsp, NV_VGPU_MSG_EVENT_GSP_INIT_DONE);` in nouveau. so, I would be disappointed but unsurprised if it ended up being that GSP didn't actually load
19:59notthatclippy[d]: Okay, the init logs decode. It's just `logrm` that seems corrupt.
20:14airlied[d]: There aren't any signs the card just didn't wake up? Like 0xff regs etc
20:15_lyude[d]: airlied[d]: any registers you generally check? I didn't realize that could be indicative of the GPU not starting up (especially if we get as far as actually receiving a GSP event)
20:16airlied[d]: Oh if you got a gsp event at all then that is usually good
20:17_lyude[d]: yeah we do get an event saying GSP started, it's just that starting the channels back up times out
20:19_lyude[d]: I should also note: this card ominously has a label on it, added by someone else, with the vbios version (which, I assume, means it's possible there is a later version of the vbios for this card somewhere). While I am optimistic this isn't related, I suppose it's worth mentioning just in case.
20:23airlied[d]: I only have the rtx6000 Ada, but I'm not sure I ever did a suspend/resume cycle on it
20:24_lyude[d]: that's the same card I've got here
20:24_lyude[d]: i guess it couldn't hurt if you could let me know what vbios version is on it
20:26_lyude[d]: ...i'm glad I'm running this as a main gpu because i am just finding all of the bugs wow
20:27airlied[d]: On the laptops I think I do sometimes see a failure to exit D3, but I need to reproduce it more
20:28_lyude[d]: (hit a wndw lockup out of nowhere, this time it was actually kind enough to give me an error though)
20:31_lyude[d]: airlied[d]: any chance you remember how to decode these?
20:31_lyude[d]: `[ 1949.398142] nouveau 0000:c1:00.0: gsp: Xid:56 CMDre 00000000 00000218 00102680 00000004 00800003`
20:35airlied[d]: the 56 maps to a ROBUST_CHANNEL define but that one isn't documented in public openrm
20:36airlied[d]: src/common/sdk/nvidia/inc/nverror.h
20:38_lyude[d]: yeah there's a gap between 48 and 58 in that file for me
20:38_lyude[d]: OH there it is, I think?
20:39_lyude[d]: `#define ROBUST_CHANNEL_NVENC1_ERROR (65)`
20:39airlied[d]: CMDre seems like cmd restart or reset or something 😛
20:39airlied[d]: I saw that one error recently but vague on what caused it or where it was
20:42_lyude[d]: I can see some error reporting code for this in openrm so maybe I can figure out a bit more on what's going on
20:43_lyude[d]: it seems like there's a heck of a lot of things that can be a robust channel error
20:44_lyude[d]: oh hey - it's a channel watchdog
20:53_lyude[d]: cool - 218 = NVCA7D_SET_INTERLOCK_FLAGS
20:53_lyude[d]: 21c = NVCA7D_SET_WINDOW_INTERLOCK_FLAGS
21:01_lyude[d]: i'm remembering now too - I think normal nvdisplay exceptions actually get funneled through this reporting mechanism if I remember my discussions with ben correctly
22:29_lyude[d]: I wonder what the easiest way to artificially hang the display channels is
22:30_lyude[d]: (wrote up a patch for dumping the entire contents of all of the display channels involved in an atomic commit, so the next time I get a hang like that on my desktop I can actually figure out what's going on)
22:30_lyude[d]: also notthatclippy[d] any chance you made any progress with the gsp log?
22:35notthatclippy[d]: _lyude[d]: Sorry, my meeting ran long and it's bedtime now. I'll take another look tomorrow, but I'm not hopeful that I'll decode logrm, seems thoroughly corrupt. I'll see if there's enough useful info from the others to diagnose.
22:37_lyude[d]: notthatclippy[d]: sgtm, thank you!