05:15Cyberhuman: Hi! I have a Quadro P600 card (GP107) and my PC crashes quite often. Sometimes only the picture freezes and I can SSH and check logs (sysrq also works), sometimes it hangs completely. How can I investigate the crash?
05:17Cyberhuman: I have 2 x 4K monitors and use sway/wayland if that matters. Usually before the crash the screen starts flickering when switching between windows. After 5-10 minutes it freezes. Also sometimes I have 10-30 days uptime without any hiccup, but otherwise this system does not survive more than a few hours :-(
05:19Cyberhuman: I get this before the freeze:
05:19Cyberhuman: [ 3061.526221] nouveau 0000:03:00.0: gr: TRAP ch 2 [007fb91000 systemd-logind]
05:19Cyberhuman: [ 3061.526229] nouveau 0000:03:00.0: gr: GPC0/TPC0/TEX: 80000041
05:19Cyberhuman: [ 3061.526232] nouveau 0000:03:00.0: gr: GPC0/TPC1/TEX: 80000041
05:19Cyberhuman: [ 3061.526236] nouveau 0000:03:00.0: gr: GPC0/TPC2/TEX: 80000041
05:19Cyberhuman: [ 3061.526245] nouveau 0000:03:00.0: fifo: fault 00 [READ] at 000000001ff5f000 engine 00 [GR] client 0a [GPC0/T1_3] reason 00 [PDE] on channel 2 [007fb91000 systemd-logind]
05:19Cyberhuman: [ 3061.526253] nouveau 0000:03:00.0: fifo: channel 2: killed
05:19Cyberhuman: [ 3061.526255] nouveau 0000:03:00.0: fifo: runlist 0: scheduled for recovery
05:19Cyberhuman: [ 3061.526259] nouveau 0000:03:00.0: fifo: engine 0: scheduled for recovery
05:19Cyberhuman: [ 3061.526264] nouveau 0000:03:00.0: fifo: engine 7: scheduled for recovery
05:19Cyberhuman: [ 3061.526274] nouveau 0000:03:00.0: systemd-logind: channel 2 killed!
05:19Cyberhuman: [ 3061.530026] nouveau 0000:03:00.0: bus: MMIO read of 00000000 FAULT at 122124 [ IBUS ]
05:22Cyberhuman: I also collected some logs with nouveau.debug=trace and there is a lot of output even after the FAULT line above. But it starts to show some hung tasks in syslog only after I try to kill sway or to reboot the system.
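(Editor's sketch, not part of the conversation.) When comparing many such crashes, it can help to pull the fields out of the "fifo: fault" line so they can be grouped by client, reason, or process. The following is a minimal Python sketch that assumes the fault line has been rejoined into a single line and matches the layout pasted above; the function name `parse_fault` and the field names are hypothetical, and other kernel versions may format the line differently.

```python
import re

# Pattern modelled only on the single "fifo: fault" line pasted in this log.
# It is an assumption that other faults follow the same layout.
FAULT_RE = re.compile(
    r"fifo: fault [0-9a-f]+ \[(?P<access>[^\]]+)\] at (?P<addr>[0-9a-f]+) "
    r"engine [0-9a-f]+ \[(?P<engine>[^\]]+)\] "
    r"client [0-9a-f]+ \[(?P<client>[^\]]+)\] "
    r"reason [0-9a-f]+ \[(?P<reason>[^\]]+)\] "
    r"on channel (?P<channel>\d+) \[(?P<inst>[0-9a-f]+) (?P<process>[^\]]+)\]"
)


def parse_fault(line):
    """Return a dict of fault fields, or None if the line is not a fault line."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["addr"] = int(fields["addr"], 16)    # faulting address as an integer
    fields["channel"] = int(fields["channel"])  # channel id as an integer
    return fields


# The fault line from the log above, rejoined into one line.
SAMPLE = (
    "[ 3061.526245] nouveau 0000:03:00.0: fifo: fault 00 [READ] at "
    "000000001ff5f000 engine 00 [GR] client 0a [GPC0/T1_3] reason 00 [PDE] "
    "on channel 2 [007fb91000 systemd-logind]"
)

if __name__ == "__main__":
    print(parse_fault(SAMPLE))
```

Feeding saved kernel log output through `parse_fault` line by line would show whether every freeze reports the same client and reason, which is useful detail for a bug report.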
05:58fling: Cyberhuman: can you kill xdm and X and try reloading the module?
05:58fling: I'm having a similar situation here
05:58fling: the first time it hung I successfully killed X and xdm and everything worked fine after I reloaded the module
05:59fling: after the second hang X could not be killed and got stuck in the defunct state, but force-reloading the module killed X and solved it again
05:59fling: not rebooted since then, X works
06:41Cyberhuman: I have neither xdm nor X. I use sway on wayland and launch it directly from the tty.
06:43Cyberhuman: trying to kill sway or reboot usually leads to complete hang
10:52cosurgi: Cyberhuman: if you want stability, then go back to Xorg and drop wayland.
10:53cosurgi:experiences a crash about once every two weeks. Yesterday we nailed the source of the problem. Hopes we will fix it soon.
11:18Cyberhuman57: cosurgi by "we nailed the source of the problem" do you mean that it's a known issue? Where can I read more about it?
11:20cosurgi:pastes part of email
11:20cosurgi: while working through more old TTM functionality I stumbled over the io_reserve_lru. Basic idea is that when this flag is set the driver->io_mem_reserve() callback can return -EAGAIN resulting in unmapping of other BOs. But Nouveau doesn't seem to return -EAGAIN in the call path of io_mem_reserve anywhere. I believe this is a bug in Nouveau. We *should* be returning -EAGAIN if we fail to find space
11:21cosurgi: in BAR1 to map the BO into.
11:21cosurgi: Yes, I would say that's exactly what would happen.
11:21cosurgi: A user has been experiencing this in a tricky-to-reproduce scenario with a ton of vram dedicated to framebuffers and so on (3x 4K), and the nouveau ddx falls back to memcpy in certain cases.
11:21cosurgi: Could this lead to SIGBUS in userspace, esp related to resume and similar situations?
11:21cosurgi: --- end
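(Editor's sketch, not part of the conversation.) The email excerpt above describes a contract: the reserve callback returns -EAGAIN when there is no room in BAR1, and the caller reacts by unmapping other BOs and retrying. The toy Python model below illustrates only that control flow. It is not TTM or nouveau code; `Bar1Window`, `evict_lru`, and `reserve_with_eviction` are hypothetical names, and sizes are arbitrary units.

```python
import errno
from collections import OrderedDict


class Bar1Window:
    """Toy stand-in for a fixed-size BAR1 aperture (hypothetical model)."""

    def __init__(self, size):
        self.size = size
        self.mapped = OrderedDict()  # BO name -> size, least recently used first

    def used(self):
        return sum(self.mapped.values())

    def io_mem_reserve(self, name, size):
        """Map a BO, or return -EAGAIN if the window has no room for it."""
        if name in self.mapped:
            self.mapped.move_to_end(name)  # already mapped: refresh LRU position
            return 0
        if self.used() + size > self.size:
            return -errno.EAGAIN  # the signal the email says is missing
        self.mapped[name] = size
        return 0

    def evict_lru(self):
        """Unmap the least recently used BO; return False if nothing is mapped."""
        if not self.mapped:
            return False
        self.mapped.popitem(last=False)
        return True


def reserve_with_eviction(window, name, size):
    """Caller side: on -EAGAIN, unmap other BOs and retry."""
    while True:
        ret = window.io_mem_reserve(name, size)
        if ret != -errno.EAGAIN:
            return ret
        if not window.evict_lru():
            return -errno.ENOMEM  # nothing left to unmap, request cannot fit


if __name__ == "__main__":
    win = Bar1Window(256)
    for bo in ("fb0", "fb1", "fb2"):
        print(bo, reserve_with_eviction(win, bo, 128), list(win.mapped))
```

In this model, a reserve callback that never returns -EAGAIN would leave the caller no chance to free space, which is the behaviour the email identifies as a suspected bug.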
11:21cosurgi: but my crashes look different from what you described.
11:22Cyberhuman57: Ok, but at least there is some clue! I may take a look in my free time. Thank you!
11:24Cyberhuman57: And if there is any patch or anything I can help with testing, feel free to ping me.
16:03Lyude: skeggsb: from the CRC spec, what did the RG/SF sources correspond to? Raster graphics/symbol formatter?
18:09skeggsb: Lyude: Raster Generator, Symbol Formatter (I think)
18:48Lyude: skeggsb: sweet, andy confirmed it as well
18:49Lyude: seems it goes RG -> SF -> (SOR|PIOR)
18:49Lyude: (not sure if that's also how it's routed with DACs, but I can figure that out later and focus on just getting plain SOR/PIOR CRCs for now)