00:19karolherbst: cool
06:51x001: Hi, guys - does this line in dmesg signify an error? nouveau 0000:02:00.0: fifo: intr 008000000
06:52x001: Sorry, typo: nouveau 0000:01:00.0: fifo: intr 008000000
07:27yepthatssomerobotass[d]: Hi - I'm getting display hang on antiX 32.2, on a Dell E6420 XFR with an Nvidia NVS 4200M.
07:27yepthatssomerobotass[d]: This seems to only occur when scrolling or browsing. I was able to set up SSH on another machine, and capture the dmesg:
07:27yepthatssomerobotass[d]: ```[ 1229.425281] nouveau 0000:01:00.0: fifo: INTR 00800000
07:27yepthatssomerobotass[d]: [ 1278.767074] nouveau 0000:01:00.0: fifo: INTR 01000000: 00000005
07:27yepthatssomerobotass[d]: [ 1285.538913] nouveau 0000:01:00.0: fifo: INTR 01000000: 00000005
07:27yepthatssomerobotass[d]: [ 1286.438851] nouveau 0000:01:00.0: fifo: INTR 01000000: 00000005
07:27yepthatssomerobotass[d]: [ 1346.405555] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:27yepthatssomerobotass[d]: [ 1346.459035] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:27yepthatssomerobotass[d]: [ 1346.515303] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:27yepthatssomerobotass[d]: [ 1346.570919] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:27yepthatssomerobotass[d]: [ 1346.625951] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:27yepthatssomerobotass[d]: [ 1346.679319] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:27yepthatssomerobotass[d]: [ 1346.732672] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:27yepthatssomerobotass[d]: [ 1346.786075] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:27yepthatssomerobotass[d]: [ 1346.839453] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:28yepthatssomerobotass[d]: [ 1346.892831] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:28yepthatssomerobotass[d]: [ 1346.946096] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:28yepthatssomerobotass[d]: [ 1346.999277] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:28yepthatssomerobotass[d]: [ 1347.052581] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:28yepthatssomerobotass[d]: [ 1347.106096] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:28yepthatssomerobotass[d]: [ 1347.159820] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:28yepthatssomerobotass[d]: [ 1347.213417] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:28yepthatssomerobotass[d]: [ 1347.266612] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:28yepthatssomerobotass[d]: [ 1347.319960] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:28yepthatssomerobotass[d]: [ 1347.373261] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16
07:28yepthatssomerobotass[d]: [ 1347.427029] nouveau 0000:01:00.0: Xorg[1813]: nv50cal_space: -16```
07:29yepthatssomerobotass[d]: The machine is still accessible via SSH, I can navigate the filesystem and such.
07:29yepthatssomerobotass[d]: But display is totally frozen.
07:34yepthatssomerobotass[d]: Found packages list - here is everything regarding Nouveau:
07:34yepthatssomerobotass[d]: ```libdrm-nouveau2:amd64 2.4.114-1+b1 amd64 Userspace interface to nouveau-specific kernel DRM services -- runtime
07:34yepthatssomerobotass[d]: xserver-xorg-video-nouveau 1:1.0.17-2 amd64 X.Org X server -- Nouveau display driver```
07:35yepthatssomerobotass[d]: Installed packages, that is.
07:40yepthatssomerobotass[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1306886639695695984/nouveau-error-11-15-2024.txt?ex=67384c87&is=6736fb07&hm=afa229257f7bf5cc071f5f0337de8b83be1df757c03bf17b6e28c30f64f9452c&
07:40yepthatssomerobotass[d]: Okay, there was way more than that in dmesg. Uploading a .txt
07:47yepthatssomerobotass[d]: I spotted this:
07:47yepthatssomerobotass[d]: ```nouveau 0000:01:00.0: DRM: allocated 1366x2304 fb: 0x60000, bo 0000000072db3713```
07:47yepthatssomerobotass[d]: My display is 1366x768 - could this have any effect?
07:52yepthatssomerobotass[d]: Rebooted machine and was able to run xrandr, looks like it's detecting 1366x768, but I don't know if there is still somehow a conflict with Nouveau.
11:45karolherbst: yepthatssomerobotass[d]: "nv50cal_space: -16" usually means that you ran out of VRAM or something
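The trailing number in a message like "nv50cal_space: -16" is a negated Linux errno value, so it can be looked up instead of guessed. A minimal sketch of that lookup follows; `decode_kernel_errno` is a hypothetical helper name, and the result reflects the errno table of the machine running the script (on Linux, 16 is EBUSY). It only names the error code and does not say why the driver returned it.

```python
import errno
import os


def decode_kernel_errno(code):
    """Map a negative kernel return value such as -16 to (symbolic name, description).

    Hypothetical helper for reading dmesg output; uses the local errno table.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    name, text = decode_kernel_errno(-16)
    print(f"-16 -> {name}: {text}")
```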
12:09lru: so the open source drivers slowly taking over the binary blobs... where is the source found for them? kernel? libdrm? mesa? xf86?
12:09karolherbst: all of those places
12:09lru: interesting...thanks
14:42DodoGTA: lru: xf86 isn't really important in the days of Wayland though
14:52yepthatssomerobotass[d]: karolherbst: Interesting...do you have a recommended top to monitor?
14:52tiredchiku[d]: nouveau currently does not have any sort of hw monitoring set up
15:05yepthatssomerobotass[d]: Ah, so no type of Nvidia top will work.
15:24asdqueerfromeu[d]: yepthatssomerobotass[d]: Without GSP you at least have temperature monitoring with hwmon (but otherwise there's nothing)
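For the hwmon temperature reading mentioned above, a rough sketch that polls the standard sysfs hwmon layout. It assumes the GPU shows up as a hwmon device whose `name` file reads `nouveau` and that it exposes `temp1_input` in millidegrees Celsius; `read_hwmon_temps` and its parameters are hypothetical names for illustration, and the function returns an empty dict if nothing matches.

```python
from pathlib import Path


def read_hwmon_temps(driver="nouveau", root="/sys/class/hwmon"):
    """Return {hwmon_dir_name: temperature_in_celsius} for matching hwmon devices."""
    temps = {}
    base = Path(root)
    if not base.is_dir():
        return temps
    for dev in sorted(base.iterdir()):
        try:
            # Each hwmon device directory carries a "name" file with the driver name.
            if (dev / "name").read_text().strip() != driver:
                continue
            # temp1_input is reported in millidegrees Celsius.
            millideg = int((dev / "temp1_input").read_text().strip())
        except (OSError, ValueError):
            # Missing or unreadable files: skip this device.
            continue
        temps[dev.name] = millideg / 1000.0
    return temps


if __name__ == "__main__":
    for dev, celsius in read_hwmon_temps().items():
        print(f"{dev}: {celsius:.1f} C")
```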
17:46awilfox[d]: Okay, current status: I have found why the CE channel doesn't work. The -22 (EINVAL) is coming from nvkm_chan_new
17:47awilfox[d]: `[ 9421.888706] nouveau 0000:01:00.0: fifo:000000: args runq:0:0 vmm:1:00000000ff3f0be6 userd:0:0000000000000000 push:0:0000000000000000 devm:00000fff:00000001 priv:0:1` unfortunately, I'm not yet sure what I'm looking at… trying to figure out the conditions that are not being met
17:47awilfox[d]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/nouveau/nvkm/engine/fifo/chan.c#n359
18:12awilfox[d]: I _think_ the issue is that `priv` is mismatched. The thing is, I don't know where the value comes from. The only `ramfc` I could find defined for Fermi was https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gf100.c#n111 which does leave `.priv` unset, but `nouveau_accel_ce_init` definitely hardcodes the `priv` argument to `true`:
18:12awilfox[d]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c#n334
18:18karolherbst[d]: soooooo
18:18karolherbst[d]: there is something you could try
18:18karolherbst[d]: well actually
18:18karolherbst[d]: ...
18:18karolherbst[d]: you might have to change the endian of read stuff
18:19karolherbst[d]: I think the global endian switch does flip the mmio region and the likes over
18:19karolherbst[d]: but what I think it doesn't change much is read VRAM and other things
18:19karolherbst[d]: and the context initialization stuff is kinda full of those things
18:19awilfox[d]: I'll note - and I don't know how much value it has - that the drmfb works fine, and has all the right colours and stuff, so it seems to be fine? But I have no idea really.
18:19karolherbst[d]: those `nvkm_kmap` things are mapping VRAM
18:20karolherbst[d]: and then read from or write to it
18:20karolherbst[d]: yeah.. but like the thing is, nvidia got sloppy with big endian support on newer GPUs
18:20karolherbst[d]: so it might be that random things are not endian transparent
18:21karolherbst[d]: though the other thing you could try out is to create an mmiotrace with the same hardware on little and big endian
18:21karolherbst[d]: and see if any reads are different
18:21karolherbst[d]: though not sure if mmiotrace works on anything not x86
18:24awilfox[d]: It doesn't seem to `depends on` any specific arch… hmm
18:24karolherbst[d]: yeah... but it has some arch specific code, like it needs to deal with page sizes and the likes. But maybe somebody made it work everywhere, dunno
18:26awilfox[d]: Ah, no, there is even a patch to make it work on PPC: https://lkml.org/lkml/2024/6/20/449
18:26awilfox[d]: …very very recent, too
18:44awilfox[d]: It's not all there and I am not intimately familiar enough with Linux kernel mm to immediately fix it myself
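If traces from both a little-endian and a big-endian machine can be captured, the comparison suggested above (checking whether any reads differ) could be sketched like this. It assumes the text layout described in the kernel's mmiotrace documentation, i.e. `MAP timestamp map_id phys virt len pc pid` and `R width timestamp map_id phys value pc pid`, and it assumes both runs issue the same sequence of reads; `parse_reads`, `byteswap` and `diff_reads` are hypothetical helper names. Addresses are turned into offsets from each mapping's base so that differing BAR addresses between the two machines do not show up as differences.

```python
def parse_reads(lines):
    """Return a list of (width, offset_in_mapping, value) for every read in a trace."""
    bases = {}   # map_id -> physical base address of that mapping
    reads = []
    for line in lines:
        f = line.split()
        if len(f) >= 4 and f[0] == "MAP":
            bases[f[2]] = int(f[3], 16)
        elif len(f) >= 6 and f[0] == "R":
            width = int(f[1])
            offset = int(f[4], 16) - bases.get(f[3], 0)
            reads.append((width, offset, int(f[5], 16)))
    return reads


def byteswap(value, width):
    """Reverse the byte order of a width-byte unsigned value."""
    value &= (1 << (8 * width)) - 1
    return int.from_bytes(value.to_bytes(width, "little"), "big")


def diff_reads(le_lines, be_lines):
    """Compare reads pairwise; report (offset, le_value, be_value, kind) for mismatches."""
    diffs = []
    for (w1, o1, v1), (w2, o2, v2) in zip(parse_reads(le_lines), parse_reads(be_lines)):
        if (w1, o1) != (w2, o2):
            # The two runs no longer access the same registers; stop comparing.
            diffs.append((o1, v1, v2, "sequence diverged"))
            break
        if v1 != v2:
            kind = "byte-swapped" if byteswap(v1, w1) == v2 else "different"
            diffs.append((o1, v1, v2, kind))
    return diffs


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as le, open(sys.argv[2]) as be:
        for off, a, b, kind in diff_reads(le, be):
            print(f"offset {off:#x}: LE {a:#x} vs BE {b:#x} ({kind})")
```

Reads flagged as "byte-swapped" would be candidates for values that are not endian transparent; values flagged as "different" may simply be volatile registers such as timers or status bits.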