01:00 Hawk777: Hello, I'm looking for somewhere to request technical support with Nouveau. I think my request would be best written as an e-mail, but the nouveau mailing list archive appears to be all patches and RFCs. Is there a separate user-facing mailing list or something like that that I haven’t found?
01:11 esdrastarsis[d]: _lyude[d]: I have this error on gnome too
01:11 esdrastarsis[d]: I thought my GPU was dying
01:17 _lyude[d]: Hawk777: It would be that mailing list and/or dri-devel.
01:20 Hawk777: OK, it’s quite nouveau-specific so I will use the nouveau list, thanks.
01:20 Hawk777: Is it necessary to subscribe before posting?
02:44 karolherbst: yes
02:44 karolherbst: Hawk777: ^^
02:44 karolherbst: though your message will just end up in the moderation queue otherwise
02:51 Hawk777: OK, got it. Thanks.
09:21 marysaka[d]: _lyude[d]: We also never stop channels with GSP at the moment, I have some patches I forgot to upstream about that, not sure if that would contribute to it?
09:26 marysaka[d]: Pushed them here but will try to send them by the end of the week https://gitlab.freedesktop.org/marysaka/linux/-/commits/gsp-dev
13:33 notthatclippy[d]: _lyude[d]: Sorry, no useful info. I think the loginit one might also be broken, but in a way that it still decodes, just into garbage. All I can tell from it is that certain asserts failed on resume. So yes, GSP did wake up for a bit and then died inside. But no insight whatsoever into _why_, sorry.
16:36 _lyude[d]: marysaka[d]: oh maybe! I've got a fast desktop so I'll test a kernel right now
16:57 _lyude[d]: notthatclippy[d]: Do you think there's anything else I might be able to grab that would give us a better chance of figuring out why it's dying?
16:57 _lyude[d]: Also marysaka[d] no dice unfortunately, thanks for trying though!
17:13 _lyude[d]: marysaka[d]: actually, hm. Going through this code - does this actually implement stopping channels on r570 as well? I only see code on r535
17:16 marysaka[d]: _lyude[d]: r535_fifo is actually always constructed for both versions see ga102_fifo_new for example
17:16 _lyude[d]: ah gotcha
17:16 _lyude[d]: must have missed it
17:17 marysaka[d]: navigating the maze is always an adventure yeah...
17:58 _lyude[d]: hmmmmmmmm. I just found a flag for suspend we are not setting in r570_gsp_set_rmargs owo
18:00 mhenning[d]: I think https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38913/commits is ready for review if anyone wants to look at it. It's mostly piping some data through the compiler.
18:35 _lyude[d]: _lyude[d]: No dice. this issue is weird
18:44 esdrastarsis[d]: _lyude[d]: Is it a regression?
18:45 _lyude[d]: esdrastarsis[d]: unless you know of a kernel version where this problem didn't happen, no clue. I only just built the computer
18:45 _lyude[d]: if I don't figure out any possible leads though I might stick this card in another machine and see what happens, I'm wondering if there might be something about my desktop that is causing this to fail
18:46 karolherbst[d]: mhenning[d]: ping me next year when I'm not on PTO if nobody reviews it until then
18:52 esdrastarsis[d]: _lyude[d]: 6.18 works fine and 6.18.1 doesn't, which is very strange
18:52 _lyude[d]: esdrastarsis[d]: to be clear - we're talking about suspend/resume, right?
18:53 esdrastarsis[d]: I was talking about the Xid 56 issue
18:54 _lyude[d]: ahhh
18:54 _lyude[d]: that one should hopefully be easy to fix, I'll take a closer look at it the next time my system freezes or after I figure out what is going on with suspend/resume here
19:05 _lyude[d]: mhenning[d]: so, feel free to let me know if I should direct questions at someone else 🙂
19:05 _lyude[d]: in gpuPowerManagementResume, I'm noticing that there's actually some extra setup here for setting up the PMU log that seems to happen before booting up GSP. And I don't think we have an equivalent to this in nouveau, do you think we might be able to get some more information if I'm able to hook that up?
19:12 airlied[d]: _lyude[d]: so it seems display related rather than accel?
19:12 _lyude[d]: airlied[d]: You mean the Xid message from yesterday? I believe so, yes. I was able to get a pretty reasonable looking explanation for decoding those values from nvidia
19:13 mhenning[d]: Not sure if I'm the person you meant to tag, but I don't know much about kernel side power management
19:13 airlied[d]: No I mean the channels not resuming?
19:13 _lyude[d]: oops sorry! it was the wrong person, meant to tag notthatclippy[d]
19:14 _lyude[d]: airlied[d]: The channels not resuming I don't have any idea. Unfortunately all I know right now is that GSP supposedly tells us it's booted, we try resuming channels and it fails. And it seems like the GSP logs that we get from there are corrupted
19:14 karolherbst[d]: maybe something with memory management going wrong?
19:15 _lyude[d]: I've been thinking about that possibility, because this is a 32 core 64 GB ram system. But I haven't been able to find anything obvious yet
19:16 karolherbst[d]: I was more thinking GPU memory management, but the host doing weird things with that many cores wouldn't be surprising either
19:16 _lyude[d]: yeah.
19:16 _lyude[d]: I may stick the card in another machine to try to figure out if it behaves more normally or not, though I'm trying to exhaust my other options before opening my desktop up again because lazy
19:16 karolherbst[d]: there is `maxcpus` on the kernel command line
19:17 _lyude[d]: Good point. do we have one for limiting ram availability as well?
19:17 karolherbst[d]: yes
19:18 karolherbst[d]: well.. or rather.. I use it for persistent dmesg logs across boots: `ramoops.dump_oops=1 pstore.backend=ramoops ramoops.record_size=65536 ramoops.console_size=65536 ramoops.ftrace_size=65536 log_buf_len=4M mem_sleep_default=deep mem=0xff8000000 ramoops.mem_address=0xff8000000 ramoops.mem_size=0x80000 memmap=0x8000000!0xff8000000`
19:18 karolherbst[d]: something there limits memory 🙃
19:18 _lyude[d]: I do really wonder though if it's possible to get any kind of additional logging from the gpu in general for situations like this.
19:18 _lyude[d]: It's pretty difficult trying to troubleshoot issues like this because almost every time I've seen a suspend/resume issue there's been no useful gsp logs
19:18 karolherbst[d]: `mem=` probably and `memmap` reserves ranges or so
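For reference, a minimal sketch of how the size-related parameters in the command line quoted above can be decoded. It assumes the documented kernel-parameter meanings (`mem=` caps usable memory, `memmap=nn!ss` marks `nn` bytes starting at `ss` as protected memory, `maxcpus=` limits the CPUs brought up at boot). The helper names `parse_size` and `summarize` are hypothetical, and the `maxcpus=1` value is added only for illustration.

```python
# Hypothetical helpers for illustration; not part of any kernel tooling.
SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Parse a kernel-style size such as '4M', '65536' or '0xff8000000'."""
    scale = 1
    # K/M/G are not hex digits, so a trailing suffix is unambiguous.
    if text[-1].upper() in SUFFIXES:
        scale = SUFFIXES[text[-1].upper()]
        text = text[:-1]
    return int(text, 0) * scale


def summarize(cmdline):
    """Return the memory/CPU limits found in a kernel command line."""
    out = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "mem":
            out["mem_limit"] = parse_size(value)
        elif key == "maxcpus":
            out["maxcpus"] = int(value)
        elif key == "memmap" and "!" in value:
            # memmap=<size>!<start>: region marked as protected memory
            size, _, start = value.partition("!")
            out["protected"] = (parse_size(start), parse_size(size))
    return out


CMDLINE = "log_buf_len=4M mem=0xff8000000 memmap=0x8000000!0xff8000000 maxcpus=1"
info = summarize(CMDLINE)
start, size = info["protected"]
print(f"mem limit: {info['mem_limit'] / 2**30:.3f} GiB")
print(f"protected: {size // 2**20} MiB at {start:#x}")
print(f"maxcpus:   {info['maxcpus']}")
```

With the values from the quoted command line, the `mem=` limit sits just below 64 GiB and the protected region is the 128 MiB directly above it, which matches the `ramoops.mem_address` given there.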
19:19 karolherbst[d]: _lyude[d]: the logs are written into "memory" right? Can you... map host memory into the GSP to use?
19:22 _lyude[d]: karolherbst[d]: How do you mean, in addition to what we already do? or something earlier like I was looking at in nvidia's driver
19:23 karolherbst[d]: _lyude[d]: it kinda sounds like you get like corrupted memory, no? Or is it simply stopping printing things
19:23 karolherbst[d]: uhm... wait.. you say you suspend/resume?
19:23 _lyude[d]: yes - resume is where the fail happens
19:23 _lyude[d]: channel rescheduling in specific is the message gsp gets stuck on
19:24 karolherbst[d]: all VRAM is moved to system RAM, right?
19:24 karolherbst[d]: prior to suspend I mean
19:24 karolherbst[d]: mhhh.. though I think we only do that for hibernation...
19:25 karolherbst[d]: I do wonder if something goes funky on the system level where it shuts off power to the GPU and it loses VRAM or something?
19:25 _lyude[d]: JFYI - limited the ram to like 5GiB, one CPU - no difference unfortunately
19:25 karolherbst[d]: not really sure about the details there and what to expect
19:26 _lyude[d]: karolherbst[d]: typically before suspend, we use something called frts in order to move the VRAM contents into system memory, then we use frts after GSP boot on resume in order to restore the memory
19:26 karolherbst[d]: is it real suspend or s2idle or whatever?
19:26 _lyude[d]: Real s3
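As an aside, the suspend variant a system will use can be read from `/sys/power/mem_sleep`, where the active mode is shown in square brackets (`deep` corresponds to S3 on typical ACPI systems). Below is a minimal sketch of parsing that file's format. The function name `parse_mem_sleep` is hypothetical, and the sample string stands in for the real file contents.

```python
def parse_mem_sleep(text):
    """Split /sys/power/mem_sleep contents into (active, available)."""
    available, active = [], None
    for word in text.split():
        # The currently selected mode is wrapped in square brackets.
        if word.startswith("[") and word.endswith("]"):
            word = word[1:-1]
            active = word
        available.append(word)
    return active, available


# Sample contents (an assumption for illustration, not read from a real machine).
sample = "s2idle [deep]\n"
active, available = parse_mem_sleep(sample)
print(f"active: {active}, available: {available}")
```

On a real machine, the contents of `/sys/power/mem_sleep` would be passed in instead of `sample`.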
19:26 karolherbst[d]: mhhh
19:27 karolherbst[d]: so it's a GSP RPC call I guess
19:27 _lyude[d]: yeah
19:28 _lyude[d]: to be honest though I wouldn't be surprised if something funky is still happening with the GPU memory. I'm not sure but it seems like the suspend/resume path we follow looks slightly different from OpenRM, I don't see openrm doing any manual management of BARs and VMM flushes but it's also possible I'm not looking in the right place yet
19:29 karolherbst[d]: and I guess the GSP log is random nonsense even after restoring VRAM?
19:29 _lyude[d]: Supposedly, notthatclippy[d] attempted decoding it yesterday without success
19:29 karolherbst[d]: oh right.. it's just a binary blob...
19:29 _lyude[d]: Well yeah but there is a decoder nvidia has for it, the problem is sometimes you have situations like this
19:30 _lyude[d]: airlied[d]: by the way - you said you had an RTX6000. I highly doubt this is the issue but just because it couldn't hurt, could you grab the VBIOS version on yours?
19:30 _lyude[d]: I want to confirm I'm not somehow running a GPU with a preprod vbios without realizing, since that's entirely possible…
19:33 airlied[d]: It's on the shelf at the moment, but I doubt bios can influence suspend/resume at all, it doesn't get called on that path usually
19:34 _lyude[d]: it would be very nice if there was a nice little debug port somewhere that just always spat out GSP logging....
20:18 marysaka[d]: mhenning[d]: Will look at it tomorrow
20:40 loanselot[d]: hey, i decided to look more into NVK since nvidia announced ending driver support to older gpus, so i am beginner here 👋
20:40 loanselot[d]: im trying to understand the relationship between nir and nak. after looking into it for a while this is my understanding: nir is lowered into nak through C API, nak does SM specific magic (opt, backwards compat stuff, etc..) then produces nvidia shader bin through Rust API, is this roughly correct?
20:40 loanselot[d]: and lets say i want to implement a basic feature and that feature is supported by nir but not by NVK, so i should naturally be looking into nak to implement it, right?
20:43 _lyude[d]: airlied[d]: karolherbst[d] any chance you guys have any idea what src/nvidia/src/kernel/diagnostics/nv_debug_dump.c is all about?
20:44 mhenning[d]: loanselot[d]: yeah, roughly nir has parts of the compiler that are common across multiple drivers and nak has parts of the compiler that are nvidia-specific
20:45 mhenning[d]: and yes, compiler features will normally end up being implemented mostly in nak if nir already has support
20:45 airlied[d]: _lyude[d]: looks like some sort of engine core dump type mechanism, but no idea what goes in it
20:46 loanselot[d]: mhenning[d]: got it, thanks
21:00 asdqueerfromeu[d]: loanselot[d]: I think the main issues for those architectures now are relatively low performance and kernel driver stability issues (because Vulkan conformance is surprisingly good even down to Kepler)
21:01 notthatclippy[d]: _lyude[d]: It captures per-engine (module, subsystem) state and stores it as protobuf. There's a GSP side of it that nouveau can also pipe through to debugfs that would give it a tiny bit more visibility into the dumps. Part of the format is documented, most of it isn't, but usually for no good reason other than "I wrote a script to undocument a large portion so I don't have to read and approve
21:01 notthatclippy[d]: for publishing".
21:02 notthatclippy[d]: _lyude[d]: If you really want to dig deeper, send me the logs while the thing is still functioning. I'll see if there's any errors reported early on that might cause it; and also see if the log decoding actually works at all.
21:03 loanselot[d]: asdqueerfromeu[d]: yeeaaah, the performance is really horrible
21:03 loanselot[d]: from some chat searches i believe its due to gpu clock issue related to pascal (and others)
21:03 loanselot[d]: i think my gpu uses boot clock
21:05 gfxstrand[d]: Yeah, those GPUs are a bit stuck, I'm afraid
21:06 notthatclippy[d]: I mean, if you were okay with using the proprietary blob driver so far, that will continue to work for the foreseeable future. Historically the last supported branch for an arch got fixes and kernel compat patches backported for >5 years.
21:06 loanselot[d]: at least it is able to run my renderer that does some vulkan 1.3+ stuff like bindless, descriptor indexing :D
21:07 notthatclippy[d]: (and after that, the part that needs kernel compat changes is available as source so others can patch it too)
21:07 _lyude[d]: notthatclippy[d]: I mean I do get paid for this so I'm happy to dig deeper! I'll grab you the logs in a moment after this kernel build finishes in a few mins
21:08 _lyude[d]: notthatclippy[d]: i hope someday maybe we can get a firmware... even though it'll likely never happen :C
21:08 loanselot[d]: notthatclippy[d]: yeah i will keep using proprietary drivers when i need low framerate, my main intent is just contributing to NVK, nvidia cutting the support is just an excuse
21:09 _lyude[d]: it's a shame though, I'm fairly certain nvidia could just upload a blob of the last firmware they used on pascal with no documentation and people would be happy to RE it for them and make it work
21:13 airlied[d]: it's not impossible to pull out of the binary,
21:18 notthatclippy[d]: Right, it's purely a licensing issue IIRC. Which probably makes it that much more expensive to solve.
21:18 _lyude[d]: which is why i'm kind of curious if a "can we just get an OK we can hand the file around"
21:18 airlied[d]: that and I'm not sure anyone wants to support it for ever if say someone finds security issues
21:19 _lyude[d]: i feel like you could still do that with or without nouveau though
21:20 airlied[d]: indeed, but it's less likely and they have a procedure for deprecating drivers and producing new ones
21:22 _lyude[d]: did we ever try a bcm-fwcutter (or whatever it's called) type solution? I feel like I remember us talking about that a while back
21:22 airlied[d]: there used to be one when you could find the gzip headers, but that was a long time ago
21:23 _lyude[d]: the openrm source we have works with pascal doesn't it?
21:24 airlied[d]: nope turing+
21:25 notthatclippy[d]: _lyude[d]: I don't know if the license lets us do that even. Which means that whoever needs to sign off on it also doesn't know, which means someone will have to investigate it, lawyers will have to get involved and those are famously risk-averse, so the question will be what is the benefit.. And then after someone spends two hours explaining to them what nouveau even is and that this is about 10
21:25 notthatclippy[d]: year old products and something as intangible as "community goodwill"... Well, I don't see a way that meeting ends with a stamp of approval.
21:26 notthatclippy[d]: And every year that goes by, it is less and less likely and more and more expensive
21:26 _lyude[d]: hmmmm
21:27 _lyude[d]: well. easier to ask forgiveness than permission. how different even is the boot process between nvidia's old firmware and current gsp
21:28 notthatclippy[d]: Significantly. There is no way to use gsp stuff on older GPUs.
21:28 _lyude[d]: well that part I know
21:29 _lyude[d]: I'm mostly curious what portions of it we understand because all we would need to do is figure out where it writes the images.
21:29 _lyude[d]: and there is definitely at least some hints to how some of this works i've seen in openrm....
21:29 steel01[d]: This sounds similar to what happened when I asked about t186 (the tegra pascal arch) stuff. Not worth the trouble asking the people that would have to be asked to make stuff happen. ><
21:30 _lyude[d]: i'm just thinking about this because literally if we could get a fw cutter I am somewhat convinced the rest would just sort of happen.
21:30 _lyude[d]: karolherbst[d] didn't we get some of the earlier firmwares that we cut booting up on pascal?
21:30 _lyude[d]: "the rest" as in "people get interested in making the thing work"
21:31 notthatclippy[d]: I'm sure you can extract the fw from the blob and patch up nouveau to use it
21:32 _lyude[d]: also - thing just finished, will grab the gsp logs you asked for in just a moment
21:32 notthatclippy[d]: I imagine the fw will never get accepted to linux-firmware and I don't know if Dave would even accept the patches that use it in a BYOFW fashion, because it's unclear who even has a license claim to that stuff.
21:33 _lyude[d]: *gives dave the puppy eyes*
21:33 notthatclippy[d]: But, bootleg floating nouveau patches... Totally.
21:34 _lyude[d]: actually though airlied[d] - I am kind of curious if you'd be ok with this or not
21:37 steel01[d]: Relevant thought: google won the api lawsuit from oracle. If nouveau would just be using the firmware api, there's... kinda precedent there. Independent of the 'where a user gets the firmware' part.
21:39 _lyude[d]: notthatclippy[d]: https://lyude.net/~lyudess/tmp/goldenwind-normal-gsp-logs/
21:40 notthatclippy[d]: FWIW, back when these were current gen GPUs someone looked into shipping the fw to nouveau and for reasons I don't remember decided that it is not something we can do and that we should instead provide bespoke firmware that does just what nouveau needs. But that never materialized due to priorities and resource allocation.
21:40 _lyude[d]: i remember
21:40 _lyude[d]: i just couldn't say until you did 🙂
21:40 notthatclippy[d]: It's ancient history.
21:42 steel01[d]: I'm rather curious as to the reasoning there. And even *more* interested as to why the api changed in the nouveau variants. Like, if some super special feature had to be removed, why does that change the api? Just stub the functions and save the pain of drivers having to chase moving targets.
21:44 notthatclippy[d]: Could be as simple as a 3rd party bit that doesn't allow for this. Or doesn't *obviously* allow for it.
21:45 notthatclippy[d]: No one is going to sign off on probably millions in expenses to work out the issue with that, and then hit another layer of the onion after that.
21:46 notthatclippy[d]: They will if given a good business justification. But "Steel01 asked for it" unfortunately isn't that.
21:46 steel01[d]: Wouldn't it be *more* work to have a new api that then has to be verified?
21:47 notthatclippy[d]: steel01[d]: Arguable. But even if less work, it was evidently too much work.
21:49 notthatclippy[d]: Even if I find the actual specific reasoning I probably couldn't share it. But for whatever it's worth you have me saying that it wasn't "we don't want nouveau to get this feature". It's "we don't want nouveau to get this feature..badly enough compared to everything else we want currently"
21:50 notthatclippy[d]: _lyude[d]: Keep it there for 12 hours please and I'll download in the morning
21:51 _lyude[d]: notthatclippy[d]: sgtm
21:52 karolherbst[d]: _lyude[d]: yeah we could, and because the drivers are mostly EoL, might even be sustainable
21:53 _lyude[d]: yeah that's kind of why i've been thinking about it
21:53 karolherbst[d]: just needs somebody to RE all the interfaces
21:53 karolherbst[d]: it's a bit painful
21:53 _lyude[d]: I mean, just having the thing out there so people can do that would let people who are interested actually try to work on it
21:53 karolherbst[d]: like.. on pascal it should be trivial (tm), but maxwell used the PMU for the ACR
21:54 karolherbst[d]: and I really don't know how that all interacts with each other
21:54 karolherbst[d]: given that the firmware nvidia gave us is nouveau custom afaik
21:54 karolherbst[d]: I'm sure getting the PMU to boot and do the thing is like 5% of the work
21:54 _lyude[d]: i can at least say there are a surprising number of things I'm finding that still haven't changed since maxwell
21:54 karolherbst[d]: and everything else is to figure out how to use the firmware in the first place
21:54 _lyude[d]: in this driver at least
21:55 _lyude[d]: ...anyone know what KernelBif is
21:55 karolherbst[d]: I meant it more from a secboot perspective
21:55 karolherbst[d]: maxwell hands over to the real PMU image _somehow_
21:56 karolherbst[d]: and I had a hack to boot our own PMU firmware at some point, but that also basically meant resetting the PMU and load it up and then the secboot stuff is gone (tm)
21:56 _lyude[d]: right, and then the acr firmware we have never loads anything on the pmu right
21:56 karolherbst[d]: like that allowed for memory reclocking on maxwell2, but sadly your spins stayed slow
21:56 karolherbst[d]: *fans
21:56 karolherbst[d]: _lyude[d]: who knows if our acr is even capable of booting the PMU firmware nvidia uses
21:57 karolherbst[d]: but same concern for Pascal anyway
21:57 karolherbst[d]: might just not support it
22:05 airlied[d]: I think we have cutter support for the video decode firmwares
22:05 airlied[d]: and patches upstream, but I'm not 90% sure even if you cut the fw's out of the image for PMU, booting it is a nightmare, before you even get to how do I ask it to reclock
22:13 karolherbst[d]: yeah but video decode doesn't require a LS mode
22:13 karolherbst[d]: the PMU needs it for fan management on Maxwell and voltage control on Pascal
23:01 notthatclippy[d]: _lyude[d]: Good news: I can't decode that either! This means likely that our decoder regressed for this version or something similar, and not that the original logs were corrupt. So the next step will be to figure out what broke and get the decode working and then maybe we'll get something useful from the repro-state logs.