04:35 rhyskidd: i know it's always fun to RE organizations' internal opaque structures, so I noticed nvidia has an "OpenSource-Approval" mailing list/forum/meeting/thing
04:36 rhyskidd: see commit on: https://github.com/NVIDIA/open-gpu-doc/commit/09307cb4f9b0352b3045fde1d3f058197b01d018
04:36 rhyskidd: good if there's some additional time/effort invested in the processes by which we might be getting more documentation, going forward
04:38 rhyskidd: karolherbst: i'll send some (friendly) review comments on your intern's patches. good first attempt, with a few constructive areas to improve upon
05:03 rhyskidd: done
06:46 karolherbst: rhyskidd: cool :)
06:47 karolherbst: rhyskidd: one thing though, as those are patches from the out of tree source tree, we omit the "drm/nouveau"
06:48 karolherbst: that gets added by the nvkm-am script ben uses to convert those to the linux tree though
12:17 Tom^: is there some tool you guys use to actually reflash the vbios on gpus? or do you need to boot windows for that?
12:20 RSpliet: Tom^: we generally don't flash BIOSes
12:21 Tom^: RSpliet: yeah i assumed as much, the case was if i were to do some external gpu "modding" for a laptop. some gpus require a hacked vbios
12:42 PaulePanter: Hi. Restarting a system with the monitor turned off, and then after a while turning the monitor on, (lightdm) X hangs, and the messages below are seen in the Linux messages.
12:42 PaulePanter: Changing to tty1 works though.
12:42 PaulePanter: https://paste.debian.net/1091332/
12:42 PaulePanter: [619634.065347] nouveau 0000:01:00.0: DRM: core notifier timeout
12:42 PaulePanter: [619636.065200] nouveau 0000:01:00.0: DRM: base-0: timeout
12:43 PaulePanter: Could this be a regression in the Linux 4.19.y series?
12:43 PaulePanter: I do not have a reproducer yet, as users just come to us in case of the problem.
12:55 karolherbst: uff
12:55 karolherbst: PaulePanter: maybe? maybe not?
12:56 karolherbst: if this didn't happen with previous kernels, it makes sense to bisect it
12:56 karolherbst: otherwise testing with master could help
12:56 karolherbst: but...
12:56 karolherbst: I think we had something like this recently
12:56 karolherbst: imirkin: ?
13:22 PaulePanter: Any idea, if these messages point to a driver problem? Or could it still be some lightdm issue?
13:22 PaulePanter: Anyway, I also remember that these systems were rebooted using kexec.
13:23 PaulePanter: I am quite confused how the tty’s should work, but X does not.
13:28 karolherbst: driver problem
13:28 karolherbst: uh..
13:28 karolherbst: kexec might mess something up though
13:28 karolherbst: but in any case that's something we have to fix on the kernel side
13:29 karolherbst: PaulePanter: do you think you might be able to retrieve more information? like GPU and what port/resolution used for the display, etc...?
13:29 PaulePanter: I'll try to do some more tests on how to reproduce it. Might take a while though.
13:29 PaulePanter: karolherbst: Sure. lspci?
13:30 karolherbst: ohh, wait, you posted dmesg already
13:30 karolherbst: that's good enough for the gpu
13:30 karolherbst: GF108.. oh well
13:32 PaulePanter: Old Dell OptiPlex 7010. Xorg.0.log has also some information: https://paste.debian.net/1091339/
13:32 karolherbst: ohh, it's a laptop?
13:33 karolherbst: but nvidia gpu only I see
13:45 PaulePanter: karolherbst: No, desktop system.
13:45 karolherbst: ohhh, okay
14:01 imirkin: i mean, basically that timeout is the gpu saying "you did something wrong, so i'm going to hang"
14:02 imirkin: try running with drm.debug=0x14 to see what the previous submit was
14:02 imirkin: PaulePanter: --^
14:03 PaulePanter: imirkin: Can I change that in the running system?
14:03 imirkin: sure
14:04 imirkin: echo that into /sys/module/drm/parameters/debug or something like that
14:04 PaulePanter: Nice.
14:04 imirkin: it won't give you retroactive info, of course :)
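For reference, drm.debug is a bitmask of message categories. A minimal sketch of decoding a value such as 0x14 follows; the bit table is an assumption based on the kernel's include/drm/drm_print.h and should be checked against the kernel actually running (newer kernels define additional bits), and decode_drm_debug is a hypothetical helper name, not part of any tool.

```python
# Category bits of the drm.debug module parameter.
# Assumption: values as listed in include/drm/drm_print.h; verify
# against your kernel version, newer kernels define more bits.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


if __name__ == "__main__":
    print(decode_drm_debug(0x14))  # ['KMS', 'ATOMIC']
```

Under that assumption, 0x14 turns on the KMS and atomic modesetting messages, which is where the display pushes that precede a timeout would show up.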
14:23 karolherbst: imirkin: but I think evo was already messed up from the beginning though :/ "[ 35.983877] nouveau 0000:01:00.0: DRM: base-0: timeout" that's a little odd
14:23 karolherbst: especially if the display was working
14:24 karolherbst: or is that fine if it's just for fetching some info?
14:24 imirkin: no, it's not fine
14:24 imirkin: but it becomes not-fine in response to something dumb nouveau does
14:24 karolherbst: :)
14:24 karolherbst: yeah..
14:25 imirkin: i think we make some effort to print when there's a failing submit
14:25 karolherbst: wish we could kind of dump the transactions we did before the timeout
14:25 karolherbst: ahh
14:25 karolherbst: right
14:25 karolherbst: but not if we submitted successfully and wait for the result or something weird?
14:26 imirkin: i dunno tbh
14:26 karolherbst: wished all of that would be easier to debug :/
14:26 PaulePanter: imirkin: https://paste.debian.net/1091346/ Is that excerpt enough?
14:26 imirkin: the stuff ben has already is actually quite good
14:26 imirkin: it takes a bit to wrap your head around it all
14:26 imirkin: in large part because, well, it's complicated :)
14:26 karolherbst: ohh, we dump stuff if we enable the drm debugging.. right
14:26 karolherbst: always forget about that
14:27 imirkin: PaulePanter: erm ... yes? what?
14:27 imirkin: we didn't even do anything
14:27 PaulePanter: imirkin: Should I upload the whole log?
14:28 imirkin: well ideally we'd get the submit before the FIRST Timeout
14:28 imirkin: perhaps something's permanently fucked
14:28 PaulePanter: imirkin: Ok.
14:28 imirkin: i haven't debugged this stuff a lot
14:28 imirkin: usually my debugging revolves around me changing code and then things hanging
14:28 imirkin: so that generally narrows down the parameters :)
14:29 imirkin: here we just do an image set, and it times out
14:29 imirkin: that means something has gone horribly wrong
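Since the interesting part of the log is whatever precedes the first timeout, a small sketch for locating it in saved dmesg output may help. It assumes the default "[seconds.micros]" timestamp prefix seen in the lines quoted in this log; first_nouveau_timeout is a hypothetical helper name.

```python
import re

# Matches a dmesg line with a "[ 35.983877]" style timestamp followed
# by a nouveau message that contains the word "timeout".
TIMEOUT_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(nouveau\b.*\btimeout\b.*)")


def first_nouveau_timeout(log_text):
    """Return (timestamp, message) of the earliest nouveau timeout, or None."""
    hits = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append((float(match.group(1)), match.group(2).strip()))
    # Tuples sort by timestamp first, so min() picks the earliest hit.
    return min(hits) if hits else None


if __name__ == "__main__":
    sample = (
        "[619634.065347] nouveau 0000:01:00.0: DRM: core notifier timeout\n"
        "[   35.983877] nouveau 0000:01:00.0: DRM: base-0: timeout\n"
    )
    print(first_nouveau_timeout(sample))
```

With drm.debug enabled from boot, the lines just above the reported timestamp are the ones worth pasting.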
14:30 karolherbst: might make sense to boot with drm.debug
14:30 karolherbst: and make it so that the display doesn't come back
14:30 karolherbst: and give the full log
14:32 PaulePanter: https://paste.debian.net/1091347/
14:32 PaulePanter: Hopefully it’s in there.
14:32 imirkin: wait, this shit happens right on boot?
14:33 imirkin: there's something fucked here. you have to do a cold power off
14:33 karolherbst: imirkin: ... guess why I copy&pasted that one extract
14:33 imirkin: let it sit for a minute, then turn it back on.
14:33 imirkin: oh, now i see mentions of kexec above ...
14:34 imirkin: kexec + nouveau = no go.
14:34 imirkin: maybe if you force vbios execution it has a chance
14:34 PaulePanter: imirkin: Understood.
14:34 imirkin: i dunno how other drivers handle kexec
14:35 karolherbst: by luck
14:35 imirkin: there was more to it than that
14:35 imirkin: iirc someone added a shutdown handler
14:35 imirkin: or something
14:35 karolherbst: a drm driver?
14:35 imirkin: so that drivers could safen up devices? dunno
14:35 imirkin: no, like to the kernel device core
14:35 karolherbst: ohh, sure
14:36 karolherbst: I meant only drm drivers suck here normally
14:36 karolherbst: but core stuff like networking are fine
14:36 imirkin: oh, well that may be
14:36 karolherbst: simpler devices
14:36 karolherbst: simpler to reset...
14:36 karolherbst: the issue is that kexec doesn't trigger a POST
14:36 karolherbst: so....
14:37 karolherbst: maybe nouveau.config=NvForcePost=1 could help
14:37 karolherbst: worth trying out
14:37 imirkin: that's what i said above :p
14:37 imirkin: except not as explicitly
14:37 karolherbst: :)
14:37 karolherbst: PaulePanter: mind checking if adding "nouveau.config=NvForcePost=1" to the new kernel helps?
14:37 karolherbst: this should revert the GPU to a saneish state
14:38 karolherbst: or at least a state nouveau might be able to handle better
14:38 karolherbst: or worse
14:38 imirkin: sometimes yes, sometimes no.
14:38 imirkin: sometimes it just makes things worse.
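As a rough sketch of what "adding the option to the new kernel" could look like when the kexec command line is built by a script: the helper below appends nouveau.config=NvForcePost=1 to an existing command line string if it is not already there. The function name is hypothetical, and it assumes the resulting string is what gets handed to the kexec'd kernel. A command line that already carries a different nouveau.config= value is not merged here, and quoted arguments containing spaces are not handled.

```python
FORCE_POST = "nouveau.config=NvForcePost=1"


def with_force_post(cmdline):
    """Return cmdline with the NvForcePost option appended if missing."""
    args = cmdline.split()
    if FORCE_POST not in args:
        args.append(FORCE_POST)
    return " ".join(args)


if __name__ == "__main__":
    # Example input only; on a real system this would typically be the
    # contents of /proc/cmdline.
    print(with_force_post("root=/dev/sda1 ro quiet"))
```

Calling it twice leaves the string unchanged, so it is safe to apply on every kexec reboot.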
14:39 karolherbst: I think not even the binary driver survives a kexec...
14:42 PaulePanter: ;-)
14:42 PaulePanter: Normal reboot seems to have brought the card back.
14:42 PaulePanter: Let’s try the parameter.
14:50 PaulePanter: The parameter seems to have helped.
14:50 PaulePanter: No timeout messages in the configuration anymore.
14:50 PaulePanter: Only error I see is
14:50 PaulePanter: [ 23.770504] nouveau 0000:01:00.0: disp: ERROR 1 [] 02 [] chid 1 mthd 0000 data 00000400
14:51 PaulePanter: Anyway, we will reboot the desktop systems without kexec in the future.
14:52 imirkin: PaulePanter: i'd like to see that one
14:52 imirkin: was there any other info around that?
14:54 PaulePanter: Yeah, there was. One second.
15:00 PaulePanter: imirkin: https://paste.debian.net/1091360/
15:00 PaulePanter: imirkin: It was run with nouveau.config=NvForcePost=1 and kexec.
19:21 pmoreau: karolherbst: I’ll check that out during the weekend. I was at a conference this week, and the two weeks before were busy with organisation for that conference and preparing my presentation.
19:21 pmoreau: I’ll be home tomorrow in the early afternoon.
20:50 pmoreau: I should take the opportunity to push some more linking fixes, and investigate a segfault I ran into.
23:34 Lyude: agh, imirkin: I'll try to take a look at that MST bug you showed me earlier next monday
23:35 Lyude: got distracted fixing a very large amount of broken amdgpu code so that I could make mst suspend/resume reprobe work on it :|