15:17danyspin97_: Hi there :)
15:18danyspin97_: I am having an issue with a GTX965M card that has stopped working since kernel 5.2
15:18danyspin97_: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/335#note_371709 The problem is exactly as reported here
15:20danyspin97_: I have tried with linux master version and the problem persists. I have also searched through the kernel nouveau module and found that that address is only used 4 times, so it could be a good starting point
15:20danyspin97_: however, I could not go further :/
15:34RSpliet: danyspin97_: the specific post you shared does not contain any errors in its logs.
15:35danyspin97_: RSpliet: these are the 2 important lines
15:35danyspin97_: bus: MMIO read of 00000000 FAULT at 022554 [ IBUS ]
15:35danyspin97_: DRM: failed to create encoder 1/8/0: -19
15:35RSpliet: danyspin97_: yes I did find that in other posts on the same bug
15:36RSpliet: The "nouveau 0000:01:00.0: bus: MMIO read of 00000000 FAULT at 022554" sounds like the device is not powered up, most likely a runpm issue. Did you actually try this with a 5.8 or 5.9 kernel?
15:36RSpliet: some bugfixes went in since, I believe
15:37danyspin97_: I have tried with master
15:37RSpliet: *kernel* master? or xf86-video-nouveau master?
15:37danyspin97_: kernel master, latest versions of libdrm and mesa
15:38danyspin97_: I don't know about xf86, maybe I only installed it recently
15:38danyspin97_: isn't it only for X11?
15:38RSpliet: ah ok, and it still persists. karolherbst: this sounds like the stuff you know more about
15:39RSpliet: yes, the xf86-video-nouveau module is for Xorg only. It's also hardly updated, as newer GPUs use a generic modesetting module and glamor for accel
15:39karolherbst: that one can be ignored
15:39danyspin97_: the weird thing is that before kernel version 5.4 it worked fine. Then with >5.5 it stopped working
15:40danyspin97_: now booting kernel 5.4 or older results in hangs
15:40danyspin97_: karolherbst: I see, so the error code is the relevant thing?
15:40karolherbst: no, I mean the error can be ignored, it's not really hinting towards the problem
15:41danyspin97_: btw, I don't have xf86-video-nouveau installed
15:42karolherbst: doesn't matter
15:42RSpliet: karolherbst: oh I read that error wrong actually. I thought it was trying to read register 0x0 :-P but it does get beyond identification, so probs not runpm
15:42karolherbst: danyspin97_: do you have a crash log or something?
15:42RSpliet: I'll leave it with you
15:42karolherbst: and the hang you talk about was fixed recently and backported to some stable branches
15:42danyspin97_: karolherbst: there isn't any for now
15:43danyspin97_: it just doesn't create the DRM outputs
15:43karolherbst: what kernel are you on
15:43danyspin97_: /dev/dri/card1 exists and is the nouveau one
15:43danyspin97_: currently I have master installed, but I also have 5.9.1
15:43karolherbst: what does /sys/class/drm contain?
15:44danyspin97_: I reboot and send you the output
15:45danyspin97_: here I am
15:46danyspin97_: karolherbst: http://ix.io/2DN9
15:47karolherbst: okay.. you have two HDMI ports? Although I assume it's just one and the intel one is just fake or something
15:48danyspin97_: the kernel is started with "nouveau.runpm=0 nouveau.noaccel=1" as arguments
15:48karolherbst: remove both of those
15:48danyspin97_: otherwise I get an INVALID_STATE error in dmesg about nouveau
15:48danyspin97_: just one HDMI attached to the nvidia card
15:48karolherbst: having noaccel set will make the external screen not work
15:49danyspin97_: ah, I see
15:49karolherbst: and the runpm stuff should be fixed now
15:51karolherbst: mhhh "DRM: failed to create kernel channel, -22"
15:51karolherbst: that is critical
15:51karolherbst: you don't have the firmware installed
15:51karolherbst: you need linux-firmware
15:51danyspin97_: it is installed though
15:52danyspin97_: no, plain one
15:53danyspin97_: what is the name of the firmware file?
15:53karolherbst: the directory is /lib/firmware/nvidia/gm206/acr/
15:53karolherbst: and it contains a bunch of files
15:54karolherbst: but there is more stuff
15:54karolherbst: it's the first one failing though
15:54karolherbst: maybe you need to regenerate your initramfs or so?
15:54karolherbst: anyway, make sure it's included there
15:54karolherbst: this is all initramfs land as the drivers are loaded super early
15:54danyspin97_: karolherbst: do I need an initramfs?
15:54karolherbst: but if you have one, you need the firmware in there
15:54danyspin97_: I am not using it
15:55karolherbst: danyspin97_: mind booting with nouveau.debug=debug?
15:56danyspin97_: one thing to note, I use the nvidia drivers on another distribution on the same machine
15:59danyspin97_: http://ix.io/2DNg output with debug enabled
16:00karolherbst: danyspin97_: yeah.. seems like something goes wrong with the firmware stuff
16:00karolherbst: not much I can do about it atm, but it does print the files now
16:00karolherbst: maybe that will help you figure it out
16:00karolherbst: there is probably also a way to debug the linux firmware loading stuff
16:01danyspin97_: karolherbst: thanks for your help :)
16:02karolherbst: well.. in case it still doesn't work after you verified the files are indeed there we can take a deeper look
16:17danyspin97_: it seems that the firmware loading doesn't want to work. I am adding all the files manually (there were already files I have put there last year)
16:25karolherbst: danyspin97_: ohh.. there is something I forgot.. when you don't have initramfs, you have to include all required firmware files into the linux binary
16:26karolherbst: there is a linux config for that where you can list them
16:26karolherbst: and they get picked up when building the kernel
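The kernel options meant here are presumably `CONFIG_EXTRA_FIRMWARE` and `CONFIG_EXTRA_FIRMWARE_DIR` (the log does not name them). A minimal sketch, assuming the firmware is stored uncompressed under /lib/firmware and that everything below nvidia/gm206 is wanted; the helper name `firmware_list` is made up for illustration:

```shell
# Hypothetical helper: print every firmware file below "$2", relative to the
# firmware root "$1", as one space-separated line (the list format that
# CONFIG_EXTRA_FIRMWARE takes). -L follows symlinked firmware files.
firmware_list() {
    (cd "$1" && find -L "$2" -type f | sort | paste -s -d ' ' -)
}

# Usage from the top of a kernel source tree (not run here):
#   scripts/config --set-str EXTRA_FIRMWARE "$(firmware_list /lib/firmware nvidia/gm206)" \
#                  --set-str EXTRA_FIRMWARE_DIR /lib/firmware
```

The kernel then has to be rebuilt so the listed files are linked into the image.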
16:29danyspin97_: karolherbst: oh, so firmware loading isn't broken
16:29danyspin97_: one less thing to investigate, at least
16:30danyspin97_: now it just hangs at boot time :(
16:30danyspin97_: just to be sure I am recompiling it with gcc (was compiled with clang previously)
16:30danyspin97_: I didn't think it makes a difference, but it's worth a try
16:34karolherbst: yeah... not sure.. there can be many reasons it is hanging
16:34karolherbst: and getting the log would be helpful
16:34karolherbst: there are a few other workarounds which might affect your machine
16:34danyspin97_: how could I get the log when it hangs?
16:35karolherbst: that's an annoying issue indeed, but there are a few options: netconsole, ramoops.. every one of them sucks to set up
16:35karolherbst: sometimes distributions have packages to preconfigure stuff
16:35karolherbst: there is one thing you could try
16:36karolherbst: that might help
16:36karolherbst: but that shouldn't prevent booting afaik :/
16:36danyspin97_: thanks, I'll try that too!
16:37karolherbst: another option: blacklist nouveau and modprobe it later once you have something setup to retrieve logs
16:37karolherbst: like ssh or whatever
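A sketch of the blacklist-then-modprobe approach, assuming nouveau is built as a module and the distribution reads /etc/modprobe.d; the helper name and the log file name are illustrative only:

```shell
# Hypothetical helper: write a modprobe.d snippet that stops module "$2" from
# being auto-loaded. "$1" is the target directory (normally /etc/modprobe.d).
write_blacklist() {
    printf 'blacklist %s\n' "$2" > "$1/blacklist-$2.conf"
}

# Usage (as root, not run here):
#   write_blacklist /etc/modprobe.d nouveau
#   ...reboot, log in over ssh, then:
#   modprobe nouveau; dmesg > nouveau-dmesg.log
```

Passing `modprobe.blacklist=nouveau` on the kernel command line should have the same effect for a single boot.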
16:46danyspin97_: I see, I'll try that too
16:48danyspin97_: argh, it is compiled as built-in xD
16:57karolherbst: danyspin97_: ahh, might be that builtin modules require firmware to be embedded... so maybe with a normal module it would work without having to include firmware :D but yeah, need to compile as module in order to blacklist it
16:58imirkin: firmware has to be available at the time of probe
16:58imirkin: you could manually reprobe the GPU after boot if you wanted
17:05karolherbst: ohh, also an idea
17:05karolherbst: so unbind/bind.. but the interface is annoying and I always get it wrong
17:05imirkin: you echo the pci address
17:06imirkin: which you know since it's in the symlink
17:06imirkin: easier to unbind than bind
17:06imirkin: (since when you unbind, the symlink is no longer there)
17:06karolherbst: sure, but I think you need to prefix with pci:... or something
17:06karolherbst: when binding
17:06imirkin: er, after you unbind
17:06imirkin: didn't use to
17:06karolherbst: yeah.. not sure either
17:06imirkin: maybe depends on the driver?
17:06karolherbst: might be?
17:06imirkin: i think you have to do that for platform:
17:06karolherbst: maybe only relevant on platform devices?
17:07imirkin: but it's assumed pci
17:07karolherbst: ahh, might be that pci is assumed
17:07imirkin: (although pci: likely works)
17:07karolherbst: on the tegra the GPU is in the dtb tree and is not a PCI device
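The unbind/bind sequence discussed above can be sketched as follows, assuming a PCI GPU and the plain address form without a `pci:` prefix. The helper name and the `SYSFS_ROOT` override are illustrative, not part of any tool:

```shell
# Root of sysfs; overridable so the function can be tried against a fake tree.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Hypothetical helper: detach PCI device "$2" from driver "$1", then re-attach it.
rebind_pci() {
    drv_dir="$SYSFS_ROOT/bus/pci/drivers/$1"
    echo "$2" > "$drv_dir/unbind" && echo "$2" > "$drv_dir/bind"
}

# Usage (as root, not run here); 0000:01:00.0 is the address from the dmesg
# line quoted earlier in this log:
#   rebind_pci nouveau 0000:01:00.0
```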
17:10ignapk: imirkin: hey, I just managed to build and boot 4.16.3 kernel, (on fedora 28 by following https://fedoraproject.org/wiki/Building_a_custom_kernel), could you give me some pointers on how I would go about patching it with https://github.com/skeggsb/nouveau/tree/devel-clk?
17:11danyspin97_: YEA, it works!
17:12danyspin97_: i blacklisted it and then used modprobe
17:12karolherbst: and no error this way?
17:13imirkin: ignapk: don't patch it. just grab the repo, check out the branch, "cd drm; make"
17:14imirkin: this should result in a nouveau.ko which you can then try to use
17:14ignapk: ok, git it :D
17:14danyspin97_: karolherbst: I think the error still happens, but wlroots picks up the output anyway
17:14RSpliet: imirkin: doesn't that repo require envytools to be set up (as make will compile our firmwares too)?
17:14karolherbst: danyspin97_: interesting
17:14imirkin: RSpliet: i don't _think_ so
17:15imirkin: the compiled firmware is checked in
17:15RSpliet: it's been too long for me
17:15imirkin: there are special make incantations you can use to regenerate it
17:16RSpliet: ah, in that case ignore me :-)
17:17ignapk: um, this looks like a problem with my installation
17:18imirkin: seems fine
17:18imirkin: what's the problem?
17:18ignapk: oh, so this output means that make finished successfully?
17:18imirkin: it just means it's running make
17:18imirkin: there should be many lines after that
17:19imirkin: which say compiling this or that file
17:19ignapk: Oh it wasn't pasted correctly
17:19ignapk: Or rather, only stdout was piped to paste.rs, without stderr
17:20imirkin: probably not great then
17:20imirkin: stderr means things failed
17:20imirkin: you can use |& to pipe everything
17:21ignapk: nice, that's handy
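`|&` is bash (and zsh) shorthand for `2>&1 |`. A portable sketch of the same idea that also keeps a copy of the output on disk; `run_logged` and `build.log` are illustrative names:

```shell
# Hypothetical helper: run a command and send both its stdout and stderr
# through the pipe, showing them on the terminal and saving them to "$1".
run_logged() {
    log=$1; shift
    "$@" 2>&1 | tee "$log"
}

# Usage (not run here):
#   run_logged build.log make
```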
17:22imirkin: ah yeah
17:22ignapk: 4.16.3-300.local is the kernel version i just compiled and booted
17:22imirkin: it can't find your build dir
17:22imirkin: so ... you need to help it out
17:22imirkin: i don't know how
17:23imirkin: maybe that build-a-custom-kernel thing has a way of including the source too
17:23imirkin: (along with .config and whatnot)
17:23ignapk: hmm maybe it's because I installed it as rpm, and didn't install it/build it with make install etc
17:23imirkin: yeah, i don't know anything about rpm
17:23imirkin: except that it's difficult to use
17:23ignapk: oh so I can just install other rpms it generated, thanks for the clue
17:26RSpliet: ignapk: whenever I roll a kernel RPM (which is not very often these days), I invoke rpmbuild with
17:26RSpliet: rpmbuild -bb --target `uname -m` --with baseonly --without debuginfo --with firmware
17:26RSpliet: that gives header and module rpms too
17:27RSpliet: in fact, I made an alias for that because I keep forgetting the flags :-P
17:28RSpliet: It's definitely more tedious than the standard kernel make, make install routine... but I prefer to have the ability to uninstall my kernel again without worrying about where the files went
17:43ignapk: yeah I forgot to install kernel-headers rpm :p
17:51imirkin: ignapk: ok, so it's working now?
17:52ignapk: yup, still compiling
17:52imirkin: but like i said ... probability of working isn't so high
17:52imirkin: i tested on precisely the same board as skeggsb had been testing on, and it didn't work for me :)
17:53imirkin: (although i had display running, whereas i'm guessing he didn't ... if your display is off another GPU, then i think chances go up slightly)
17:54ignapk: Yeah I'm aware of that, there's also a chance that I chose a wrong, incompatible kernel version, but I enjoy the tinkering :p i do have display off of intel integrated graphics, so yay!
17:54ignapk: ok it finished
17:55imirkin: so now if you do modules_install then i think that module will get used (next boot)
17:56imirkin: although when i've used it, i think i've just run insmod directly
17:56imirkin: but given that you like the rpm approach...
17:56ignapk: hmm it seems now that it would have been easier if I followed another guide
18:10ignapk: is there a way to confirm using different nouveau kernel module?
18:10imirkin: reclocking should no longer give an error like that
18:18ignapk: still the same error but I suspect it's because I didn't use the patched module correctly https://paste.rs/M9q
18:19imirkin: well you don't just install it
18:19imirkin: you have to reboot :)
18:19imirkin: or at least reload it
18:20imirkin: also i assume you did the install as root?
18:20imirkin: and finally, i don't know anything about those errors ... sounds like some sort of module verification thing?
18:20ignapk: yeah I rebooted, and ran make install as root
18:20imirkin: check the timestamp on nouveau.ko
18:20imirkin: see if it's new or old?
18:21imirkin: it actually sounds like the install failed
18:21ignapk: in the cloned nouveau tree?
18:21imirkin: i know nothing about this signature junk though
18:21imirkin: in /lib/modules
18:22ignapk: Ok so nouveau.ko.xz was last touched at 17:17
18:23imirkin: and you live in the UK, so that's recent?
18:23ignapk: so nope, it wasn't installed (it's now 19:23 in my timezone)
18:23imirkin: as opposed to living in, say, australia, where that's many hours old
18:23imirkin: aha ok
18:23imirkin: so yeah, modules_install failed
18:24imirkin: probably because it couldn't sign
18:24imirkin: perhaps there's another rpm which has the signing keys?
18:24imirkin: iirc fedora randomly generates them for every build or something like that
18:24ignapk: after last fiasco I installed all the remaining rpms, i reached out at #fedora-kernel about it
18:25imirkin: yeah, sorry, i just don't know that much about it. i always do stuff directly, so these problems never come up
18:26imirkin: but don't worry, i have other problems to take their place :)
18:26ignapk: yeah np, ill figure it out, you've been great help with that crazy quest to re-clock fermi card already :)
20:48ignapk: imirkin: ok, so what exactly did you mean by not working? Did the echo $a > /sys/kernel/debug/dri/$b/pstate return the usual error, or did it hang?
21:06RSpliet: ignapk: you'll have to manually xz the .ko and copy it over. Probably strip -S the module before xz'ing as well. Make doesn't do any of that for you
21:07RSpliet: oh and then dracut --force to rebuild your initramfs, nouveau is snuck in there to load before your root is mounted
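The manual steps RSpliet lists can be written down roughly as below. This only prints the commands rather than running them; the helper name is made up, and the destination directory is an assumption about where Fedora keeps the nouveau module:

```shell
# Hypothetical helper: print (not execute) the manual install steps for a
# freshly built module file "$1" on kernel release "$2".
print_install_steps() {
    # Assumed module location; check it against the installed nouveau.ko.xz.
    dest="/lib/modules/$2/kernel/drivers/gpu/drm/nouveau"
    echo "strip -S $1"
    echo "xz -f $1"
    echo "cp $1.xz $dest/"
    echo "dracut --force"
}

# Usage (not run here):
#   print_install_steps nouveau.ko "$(uname -r)"
```

Reviewing the printed commands before running them as root makes it easier to spot a wrong path.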
21:11ignapk: RSpliet: I already figured out the signing keys problem, and confirmed that the patched nouveau module was installed and used by... at least the fact that cpu gets stuck in a soft lockup when trying to re-clock now :)
21:12ignapk: exactly \o/
21:12ignapk: now I guess I should extract the interesting log
21:23RSpliet: ignapk: I guess this is the time I need to inform you of two kernel parameters that will prove useful
21:23RSpliet: nouveau.config=NvMemExec=0 nouveau.debug=pmu=debug
21:24RSpliet: together what will happen is that if you write something to pstate, the driver will generate and print the script used for changing the DRAM clock (a sequence of register writes and pauses and some other bits)
21:24RSpliet: you can compare these register writes against a trace from the blob to find differences
21:25RSpliet: Also, with nouveau.config=NvMemExec=0 you can try and fix the code for changing all the other clocks first
21:25RSpliet: Memory is by far the most difficult ;-) Would already be quite an achievement if you get the other clocks reconfigured
21:27RSpliet: oh, sorry, I should clarify: if you set both those params, nouveau will generate and print the script (the debug half), but not actually execute it (the NvMemExec half)
21:28RSpliet: And as a final tip: focus on the lower clocks first. There's less that can go wrong there
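With both parameters on the kernel command line, a reclock attempt is triggered by writing a performance level to the debugfs pstate file, as in the `echo $a > /sys/kernel/debug/dri/$b/pstate` command quoted earlier. A sketch; the helper name, the `DEBUGFS_ROOT` override, the card index 0 and the level 07 are placeholders, and the valid levels are whatever reading the pstate file lists:

```shell
# Root of debugfs; overridable so the function can be tried against a fake tree.
DEBUGFS_ROOT=${DEBUGFS_ROOT:-/sys/kernel/debug}

# Hypothetical helper: request performance level "$2" on DRM card index "$1".
set_pstate() {
    echo "$2" > "$DEBUGFS_ROOT/dri/$1/pstate"
}

# Usage (as root, not run here):
#   cat /sys/kernel/debug/dri/0/pstate   # list the available levels
#   set_pstate 0 07                      # placeholder level
#   dmesg                                # look for the printed script
```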
22:05ignapk: hmm, so I still got the cpu soft lockup, but this time system seemed more responsive? and `journalctl -k -b -1` showed what hopefully you meant by the script: https://paste.rs/OuH
23:10RSpliet: ignapk: no the DRAM clock change scripts are not in there, your GPU seems to die before that already
23:11RSpliet: interestingly it dies in gt215_pmu_send+0x10e/0x2d0 [nouveau]. Shows it gets all the way there though. Don't remember the code well enough to tell you what that means
23:12ignapk: could the kernel version be at fault? should I try some earlier ones?
23:12RSpliet: if anything you want a kernel that's way later...
23:13RSpliet: But that'll take effort forward-porting the patches.
23:16imirkin: ignapk: no, all the nouveau stuff is contained inside nouveau
23:17imirkin: so ... yeah. you're hitting some condition that's not covered by the existing logic. sorry
23:17ignapk: ok so I can at least assume that you guessed the right kernel version to try the branch out
23:18imirkin: if it compiles, then we guessed right
23:18ignapk: so maybe I could look in the nouveau code to find what happens in gt215_pmu_send+0x10e/0x2d0