04:35rhyskidd: i know it's always fun to RE organizations internal opaque structures, so I noticed nvidia has a "OpenSource-Approval" mailing list/forum/meeting/thing
04:36rhyskidd: see commit on: https://github.com/NVIDIA/open-gpu-doc/commit/09307cb4f9b0352b3045fde1d3f058197b01d018
04:36rhyskidd: good if there's some additional time/effort invested in the processes by which we might be getting more documentation, going forward
04:38rhyskidd: karolherbst: i'll send some (friendly) review comments on your intern's patches. good first attempt, with a few constructive areas to improve upon
06:46karolherbst: rhyskidd: cool :)
06:47karolherbst: rhyskidd: one thing though, as those are patches from the out-of-tree source tree, we omit the "drm/nouveau" prefix
06:48karolherbst: that gets added by the nvkm-am script ben uses to convert those to the linux tree though
12:17Tom^: is there some tool you guys use to actually reflash the vbios on gpus? or do you need to boot windows for that?
12:20RSpliet: Tom^: we generally don't flash BIOSes
12:21Tom^: RSpliet: yeah i assumed as much, the case was if i were to do some external gpu "modding" for a laptop. some gpus require a hacked vbios
12:42PaulePanter: Hi. Restarting a system with the monitor turned off, and then after a while turning the monitor on, (lightdm) X hangs, and the messages below are seen in the Linux messages.
12:42PaulePanter: Changing to tty1 works though.
12:42PaulePanter: [619634.065347] nouveau 0000:01:00.0: DRM: core notifier timeout
12:42PaulePanter: [619636.065200] nouveau 0000:01:00.0: DRM: base-0: timeout
12:43PaulePanter: Could this be a regression in the Linux 4.19.y series?
12:43PaulePanter: I do not have a reproducer yet, as users just come to us in case of the problem.
12:55karolherbst: PaulePanter: maybe? maybe not?
12:56karolherbst: if this didn't happen with previous kernels, it makes sense to bisect it
12:56karolherbst: otherwise testing with master could help
12:56karolherbst: I think we had something like this recently
12:56karolherbst: imirkin: ?
13:22PaulePanter: Any idea, if these messages point to a driver problem? Or could it still be some lightdm issue?
13:22PaulePanter: Anyway, I also remember that these systems were rebooted using kexec.
13:23PaulePanter: I am quite confused how the tty’s should work, but X does not.
13:28karolherbst: driver problem
13:28karolherbst: kexec might mess something up though
13:28karolherbst: but in any case that's something we have to fix on the kernel side
13:29karolherbst: PaulePanter: do you think you might be able to retrieve more information? like GPU and what port/resolution used for the display, etc...?
13:29PaulePanter: I'll try to do some more tests on how to reproduce it. Might take a while though.
13:29PaulePanter: karolherbst: Sure. lspci?
13:30karolherbst: ohh, wait, you posted dmesg already
13:30karolherbst: that's good enough for the gpu
13:30karolherbst: GF108.. oh well
13:32PaulePanter: Old Dell OptiPlex 7010. Xorg.0.log has also some information: https://paste.debian.net/1091339/
13:32karolherbst: ohh, it's a laptop?
13:33karolherbst: but nvidia gpu only I see
13:45PaulePanter: karolherbst: No, desktop system.
13:45karolherbst: ohhh, okay
14:01imirkin: i mean, basically that timeout is the gpu saying "you did something wrong, so i'm going to hang"
14:02imirkin: try running with drm.debug=0x14 to see what the previous submit was
14:02imirkin: PaulePanter: --^
14:03PaulePanter: imirkin: Can I change that in the running system?
14:04imirkin: echo that into /sys/module/drm/parameters/debug or something like that
14:04imirkin: it won't give you retroactive info, of course :)
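A minimal sketch of the runtime change imirkin describes, assuming the parameter lives at /sys/module/drm/parameters/debug and that the writer runs as root. The bit names are recalled from the kernel's drm_print.h and should be checked against the running kernel; `decode_drm_debug` and `set_drm_debug` are hypothetical helper names, not part of any existing tool.

```python
from pathlib import Path

# Assumed sysfs location of the drm.debug module parameter.
DRM_DEBUG_PARAM = Path("/sys/module/drm/parameters/debug")

# Debug category bits as recalled from include/drm/drm_print.h (verify locally).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
}


def decode_drm_debug(mask: int) -> list:
    """Return the names of the debug categories enabled in mask, lowest bit first."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


def set_drm_debug(mask: int, param: Path = DRM_DEBUG_PARAM) -> None:
    """Write mask to the parameter file; only affects messages logged afterwards."""
    param.write_text(f"{mask:#x}\n")


if __name__ == "__main__":
    print(decode_drm_debug(0x14))
    # set_drm_debug(0x14)  # uncomment on a real system, needs root
```

Under those assumptions, 0x14 selects the KMS and atomic categories, which is where the display submits would be logged.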
14:23karolherbst: imirkin: but I think evo was already messed up from the beginning though :/ "[ 35.983877] nouveau 0000:01:00.0: DRM: base-0: timeout" that's a little odd
14:23karolherbst: especially if the display was working
14:24karolherbst: or is that fine if it's just for fetching some info?
14:24imirkin: no, it's not fine
14:24imirkin: but it becomes not-fine in response to something dumb nouveau does
14:25imirkin: i think we make some effort to print when there's a failing submit
14:25karolherbst: wished we could kind of dump the transactions we did before the timeout
14:25karolherbst: but not if we submitted successfully and wait for the result or something weird?
14:26imirkin: i dunno tbh
14:26karolherbst: wished all of that would be easier to debug :/
14:26PaulePanter: imirkin: https://paste.debian.net/1091346/ Is that excerpt enough?
14:26imirkin: the stuff ben has already is actually quite good
14:26imirkin: it takes a bit to wrap your head around it all
14:26imirkin: in large part because, well, it's complicated :)
14:26karolherbst: ohh, we dump stuff if we enable the drm debugging.. right
14:26karolherbst: always forget about that
14:27imirkin: PaulePanter: erm ... yes? what?
14:27imirkin: we didn't even do anything
14:27PaulePanter: imirkin: Should I upload the whole log?
14:28imirkin: well ideally we'd get the submit before the FIRST Timeout
14:28imirkin: perhaps something's permanently fucked
14:28PaulePanter: imirkin: Ok.
14:28imirkin: i haven't debugged this stuff a lot
14:28imirkin: usually my debugging revolves around me changing code and then things hanging
14:28imirkin: so that generally narrows down the parameters :)
14:29imirkin: here we just do an image set, and it times out
14:29imirkin: that means something has gone horribly wrong
14:30karolherbst: might make sense to boot with drm.debug
14:30karolherbst: and make it so that the display doesn't come back
14:30karolherbst: and give the full log
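Once a full log exists, the interesting part is whatever was logged just before the first timeout. A rough sketch of pulling that out, assuming dmesg lines shaped like the ones pasted earlier in this log; `first_timeout` and the `context` parameter are hypothetical names chosen for illustration.

```python
import re
from typing import List, Optional, Tuple

# Matches lines such as:
# [619634.065347] nouveau 0000:01:00.0: DRM: core notifier timeout
TIMEOUT_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+nouveau\s+(\S+):\s+DRM:\s+(.*\btimeout)\s*$"
)


def first_timeout(lines: List[str], context: int = 20) -> Optional[Tuple[int, List[str]]]:
    """Return (index, preceding lines) for the first nouveau DRM timeout, or None."""
    for i, line in enumerate(lines):
        if TIMEOUT_RE.match(line):
            return i, lines[max(0, i - context):i]
    return None


if __name__ == "__main__":
    import sys

    hit = first_timeout(sys.stdin.read().splitlines())
    if hit is None:
        print("no nouveau DRM timeout found")
    else:
        index, before = hit
        print("\n".join(before))
        print(f"first timeout at log line {index + 1}")
```

It could be fed with something like `dmesg | python3 first_timeout.py`, where the script name is only a placeholder.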
14:32PaulePanter: Hopefully it’s in there.
14:32imirkin: wait, this shit happens right on boot?
14:33imirkin: there's something fucked here. you have to do a cold power off
14:33karolherbst: imirkin: ... guess why I copy&pasted that one extract
14:33imirkin: let it sit for a minute, then turn it back on.
14:33imirkin: oh, now i see mentions of kexec above ...
14:34imirkin: kexec + nouveau = no go.
14:34imirkin: maybe if you force vbios execution it has a chance
14:34PaulePanter: imirkin: Understood.
14:34imirkin: i dunno how other drivers handle kexec
14:35karolherbst: by luck
14:35imirkin: there was more to it than that
14:35imirkin: iirc someone added a shutdown handler
14:35imirkin: or something
14:35karolherbst: a drm driver?
14:35imirkin: so that drivers could safen up devices? dunno
14:35imirkin: no, like to the kernel device core
14:35karolherbst: ohh, sure
14:36karolherbst: I meant only drm drivers suck here normally
14:36karolherbst: but core stuff like networking are fine
14:36imirkin: oh, well that may be
14:36karolherbst: simpler devices
14:36karolherbst: simpler to reset...
14:36karolherbst: the issue is that kexec doesn't trigger a POST
14:37karolherbst: maybe nouveau.config=NvForcePost=1 could help
14:37karolherbst: worth trying out
14:37imirkin: that's what i said above :p
14:37imirkin: except not as explicitly
14:37karolherbst: PaulePanter: mind checking if adding "nouveau.config=NvForcePost=1" to the new kernel helps?
14:37karolherbst: this should revert the GPU to a saneish state
14:38karolherbst: or at least a state nouveau might be able to handle better
14:38karolherbst: or worse
14:38imirkin: sometimes yes, sometimes no.
14:38imirkin: sometimes it just makes things worse.
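For reference, a hedged sketch of how the option could be passed to the kexec'd kernel: it builds a `kexec -l` command line with `nouveau.config=NvForcePost=1` appended to an existing command line. The helper name `kexec_load_argv` and the kernel and initrd paths in the usage lines are placeholders, not taken from this conversation.

```python
from typing import List

FORCE_POST = "nouveau.config=NvForcePost=1"


def kexec_load_argv(kernel: str, initrd: str, cmdline: str,
                    extra: str = FORCE_POST) -> List[str]:
    """Build an argv for `kexec -l` with `extra` added to the kernel command line once."""
    params = cmdline.split()
    if extra not in params:
        params.append(extra)
    return [
        "kexec", "-l", kernel,
        f"--initrd={initrd}",
        f"--append={' '.join(params)}",
    ]


if __name__ == "__main__":
    # Placeholder paths; the current command line would normally come from /proc/cmdline.
    argv = kexec_load_argv("/boot/vmlinuz", "/boot/initrd.img", "root=/dev/sda1 ro quiet")
    print(" ".join(argv))
```

As noted above, forcing a POST does not always help and can make things worse, so this is something to test rather than to rely on.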
14:39karolherbst: I think not even the binary driver survives a kexec...
14:42PaulePanter: Normal reboot seems to have brought the card back.
14:42PaulePanter: Let’s try the parameter.
14:50PaulePanter: The parameter seems to have helped.
14:50PaulePanter: No timeout messages in the configuration anymore.
14:50PaulePanter: Only error I see is
14:50PaulePanter: [ 23.770504] nouveau 0000:01:00.0: disp: ERROR 1  02  chid 1 mthd 0000 data 00000400
14:51PaulePanter: Anyway, we will reboot the desktop systems without kexec in the future.
14:52imirkin: PaulePanter: i'd like to see that one
14:52imirkin: was there any other info around that?
14:54PaulePanter: Yeah, there was. One second.
15:00PaulePanter: imirkin: https://paste.debian.net/1091360/
15:00PaulePanter: imirkin: It was run with nouveau.config=NvForcePost=1 and kexec.
19:21pmoreau: karolherbst: I’ll check that out during the weekend. I was at a conference this week, and the two weeks before were busy with organisation for that conference and preparing my presentation.
19:21pmoreau: I’ll be home tomorrow in the early afternoon.
20:50pmoreau: I should take the opportunity to push some more linking fixes, and investigate a segfault I ran into.
23:34Lyude: agh, imirkin: I'll try to take a look at that MST bug you showed me earlier next monday
23:35Lyude: got distracted fixing a very large amount of broken amdgpu code so that I could make mst suspend/resume reprobe work on it :|