22:50 ttkay: hiya .. I'm trying to get Xorg running under a 5.4.0-rc6 kernel on my Lenovo p73, which has a Quadro P620. Unfortunately when I switch into graphics mode the nouveau kernel module detects that the device is 511 C and shuts down the system
22:50 ttkay: looking at the module sources, I didn't see an obvious way to disable this behavior, and I'm not sure if that's the best redress anyway
22:50 imirkin_: that sounds a lot like a -1
22:50 ttkay: this is my first time dealing with nouveau, so any advice would be welcome
22:51 imirkin_: 511 = 0x1ff
22:51 ttkay: *nod* it's clearly not detecting the temperature correctly
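
The "-1" guess can be checked with a little arithmetic: if a value of -1 (all bits set) is truncated to a 9-bit field, the result is 0x1ff, which is 511. A minimal Python sketch follows; the 9-bit field width is an assumption chosen to match the reported value, not something confirmed from the nouveau sources in this log.

```python
# Assumption: the temperature ends up in a 9-bit unsigned field.
TEMP_FIELD_BITS = 9
TEMP_FIELD_MASK = (1 << TEMP_FIELD_BITS) - 1  # 0x1ff


def truncate_to_field(value: int) -> int:
    """Keep only the low TEMP_FIELD_BITS bits, as an unsigned value."""
    return value & TEMP_FIELD_MASK


if __name__ == "__main__":
    reading = truncate_to_field(-1)
    # Shows the truncated value in hex and decimal.
    print(hex(reading), reading)
```

So a "511 C" reading is consistent with an error value or an all-ones read rather than a real temperature.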
22:51 imirkin_: pretty sure we don't do anything too drastic beyond trying to spin up fans on higher temp
22:51 imirkin_: and P620 sounds like pascal
22:51 ttkay: it is pascal, yeah
22:51 imirkin_: in which case we don't control the fans at all - it's done by the firmware
22:51 imirkin_: do you have a dmesg?
22:52 ttkay: imirkin_ - yeah, let me upload that, just a minute .. in the meantime here's a snippet: http://ciar.org/h/517fd4.txt
22:52 imirkin_: right ... i just don't think those messages have any teeth
22:52 ttkay: http://ciar.org/h/messages.nov_debug.txt
22:52 imirkin_: they're annoying, but not much beyond that
22:52 ttkay: orly? okay
22:53 imirkin_: i mean, i guess i dunno - maybe shutdown is For Real (tm)
22:53 ttkay: the laptop rebooted immediately afterwards, so I just put two and two together
22:53 imirkin_: lol, looks like something does run 'shutdown'
22:53 imirkin_: which means you are getting some acpi event
22:53 imirkin_: which is telling your distro to shut down
22:53 imirkin_: yay for user friendly!
22:54 imirkin_: maybe the temp really does get high btw (but maybe not the full 511 degrees...)
22:54 imirkin_: Nov 8 05:25:31 kirov smartd[1374]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 119
22:54 imirkin_: that's fairly toasty
22:54 ttkay: sequence of events: (1) running the laptop in text mode for about an hour, (2) tried to start Xorg, (3) laptop shut down
22:55 imirkin_: it really sounds like you're having a thermal emergency here
22:56 karolherbst: ttkay: there is a fix upstream already for that
22:56 karolherbst: or should be...
22:56 imirkin_: for which part?
22:56 ttkay: karolherbst - that's good news :-) thanks
22:56 karolherbst: imirkin_, ttkay: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm/nouveau?id=68bf8b577977a2804af82ffe03ce25d952f82fd2
22:57 karolherbst: but..
22:57 karolherbst: that's already in 5.4
22:57 karolherbst: so I am wondering why ttkay is still hitting that
22:57 ttkay: hrm
22:57 ttkay: thanks for the pointer, at least .. it's something else to be looking at
22:57 karolherbst: sensors shouldn't report 511 C, but N/A instead e.g.
22:57 ttkay: if it comes down to it, I can edit the sources and recompile
22:58 karolherbst: so I really don't know where that 511 is coming from.. maybe bad timing?
22:58 karolherbst: but also doubtful
22:58 ttkay: when I run lmsensors, it does show N/A
22:58 karolherbst: mhhh
22:58 karolherbst: then I guess it's a bit racy
22:58 imirkin_: i see some messages like "refused to change power state, currently in D3"
22:58 karolherbst: yeah.. that's runpm being broken
22:58 imirkin_: so perhaps we're trying to bring it up, but it fails
22:58 karolherbst: I have a patch
22:58 imirkin_: but we think we brought it up
22:59 ttkay: imirkin_ - I think that's because I'm already running the system in its lowest power mode
22:59 karolherbst: https://github.com/karolherbst/linux/commits/runpm_fixes
22:59 karolherbst: top two commits
22:59 karolherbst: one fixes a common secboot issue
22:59 karolherbst: the other fixes runpm
22:59 imirkin_: ttkay: check your fans though ... it kinda feels like you might have legit temp issues
22:59 ttkay: lmsensors shows reasonable temperatures and nothing too warm
22:59 karolherbst: ohh wait..
22:59 karolherbst: ttkay: can I have your lspci -nn please?
23:00 karolherbst: "lspci -nntvv" actually
23:00 ttkay: this "pci: prevent putting pcie devices into lower device states" commit looks likely
23:00 ttkay: karolherbst - coming right up
23:01 karolherbst: ohh, wait.. that will crash the system.. ups
23:01 ttkay: yep, it just did
23:01 karolherbst: boot with nouveau.runpm=0 please
23:01 ttkay: ok
23:01 karolherbst: then you can safely run lspci
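
After rebooting, it can be worth confirming that the parameter actually reached the kernel before running lspci again. A small Python sketch that looks for a parameter on the kernel command line; the helper name `kernel_param` and the sample command line in the comment are hypothetical, only `/proc/cmdline` and `nouveau.runpm=0` come from standard Linux behaviour and the conversation.

```python
from typing import Optional


def kernel_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of `name` on a kernel command line.

    Returns "" for a bare flag without "=", and None if it is absent.
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            return value if sep else ""
    return None


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel booted with.
    with open("/proc/cmdline") as f:
        booted_with = f.read()
    value = kernel_param(booted_with, "nouveau.runpm")
    if value == "0":
        print("runtime PM for nouveau is disabled")
    else:
        print("nouveau.runpm=0 not found on the command line")
```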
23:02 karolherbst: the patch checks against certain Intel bridge controllers, so yours might not be on that list yet
23:02 karolherbst: and then you can check if that patch helps
23:02 karolherbst: (which would be super helpful)
23:04 ttkay: http://ciar.org/h/lspci-nntvv.txt
23:05 ttkay: I'm a bit out of my element, no idea which of those is a bridge controller
23:05 ttkay: "Intel Corporation UHD Graphics 630" perhaps? but that looks like the integrated gpu maybe?
23:06 imirkin_: yes
23:06 imirkin_: unless you actually want to make use of that nvidia gpu, just boot with nouveau.modeset=0 and live a happy life
23:07 karolherbst: ttkay: lspci -s 00:01.0 -nn
23:07 karolherbst: imirkin_: well.. runpm should work though, otherwise battery runtime is crap and all of that
23:08 imirkin_: is it though? with the fancy new stuff?
23:08 karolherbst: and I have a patch, but upstream isn't really in the mood of accepting it yet
23:08 karolherbst: imirkin_: well, if the nv gpu is on, the battery won't hold as long
23:08 karolherbst: on my system it cuts 66%
23:08 karolherbst: roughly
23:08 imirkin_: i get that
23:08 imirkin_: but i mean with _PR3 and whatnot
23:08 imirkin_: do you still need nouveau to power it off?
23:08 karolherbst: sure, because if there is a driver, it's the driver's responsibility to do that
23:08 karolherbst: yes
23:09 karolherbst: although.. with modeset=0 it might not...
23:09 imirkin_: ok, so the pci subsystem doesn't do this automatically?
23:09 karolherbst: we don't bind to the device, or do we?
23:09 imirkin_: we don't
23:09 karolherbst: it won't get automatically enabled in either case
23:09 imirkin_: but it might start out off? :)
23:09 karolherbst: no
23:09 karolherbst: it starts out as on
23:10 ttkay: sorry, had to afk for a moment .. catching up
23:10 ttkay: http://ciar.org/h/lspci-s-nn.out.txt
23:10 karolherbst: imirkin_: anyway, I have two patches which should at least solve all those runpm+secboot issues and I'd prefer people testing them so I am sure those work for real :)
23:10 imirkin_: sure
23:10 karolherbst: ahh, same as mine
23:11 karolherbst: ttkay: forgot to push my newest versions of the patches
23:11 karolherbst: https://github.com/karolherbst/linux/commits/runpm_fixes
23:11 karolherbst: if you want you can try them out
23:11 karolherbst: and with both applied runpm should just work
23:11 ttkay: cool .. that's what you're referring to by '<karolherbst> and I have a patch, but upstream isn't really in the mood of accepting it yet' ?
23:11 karolherbst: and using the GPU as well
23:11 karolherbst: ttkay: not enough evidence of blaming intel
23:12 ttkay: heh
23:12 ttkay: thanks :-) I'll apply the patch and rebuild, see how it goes
23:12 imirkin_: i think several patches
23:12 karolherbst: the patch itself is fine, it's just that everybody would like to see more data points and information
23:12 ttkay: looking at them now
23:12 karolherbst: yeah, two
23:12 karolherbst: one is a secboot workaround which skeggsb is already working on making obsolete
23:12 karolherbst: so we might get that covered soon as well
23:13 karolherbst: ohhh, wait
23:13 karolherbst: I know where the 511 comes from
23:13 ttkay: I really appreciate your help :-) never been to this channel, and didn't know what to expect, but you folks are awesome
23:13 karolherbst: runpm fails to resume, so then all reads are garbage
23:13 karolherbst: and our error checking is bad...
23:13 karolherbst: so we just think the device is there.. because the pci subsystem just continues on its merry way
23:14 karolherbst: because... if the device is stuck at D3, the best thing to do is to assume it's functional....
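
The failure mode described here can be illustrated in a few lines: reads from a PCI device that is not responding typically come back as all ones (0xffffffff), and decoding that as a temperature without a sanity check yields a bogus value. A hedged Python sketch, where the 9-bit field mask and both helper names are assumptions for illustration and not nouveau's actual register layout or code:

```python
from typing import Optional

ALL_ONES_32 = 0xFFFFFFFF  # typical result of reading an unresponsive PCI device
TEMP_MASK = 0x1FF         # assumption: 9-bit temperature field in the low bits


def decode_temp_unchecked(raw: int) -> int:
    """Decode without any error checking: garbage in, garbage out."""
    return raw & TEMP_MASK


def decode_temp_checked(raw: int) -> Optional[int]:
    """Treat an all-ones read as 'device not reachable' instead of data."""
    if raw == ALL_ONES_32:
        return None
    return raw & TEMP_MASK


if __name__ == "__main__":
    print(decode_temp_unchecked(ALL_ONES_32))  # bogus temperature
    print(decode_temp_checked(ALL_ONES_32))    # no reading available
```

With the check in place, a GPU stuck in D3 would show up as "no reading" rather than as a critical temperature.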
23:17 karolherbst: ttkay: it seems like your HDD reports a temperature 64ºC too high
23:17 karolherbst: you might want to report that to the relevant subsystem as well (open a bug on bugzilla or something)
23:17 karolherbst: or... well
23:17 karolherbst: that might be a smart only thing
23:18 karolherbst: then whatever www.smartmontools.org says
23:19 ttkay: karolherbst - okie-doke
23:21 ttkay: hrm, smartctl -a /dev/nvme0n1 shows a sane-looking temperature (40degC and 38degC on the two sensors) so it might be something lost in translation
23:23 karolherbst: ttkay: /dev/sda
23:23 ttkay: oh oops
23:23 ttkay: root@kirov:~# smartctl -a /dev/sda | grep Temperature_Celsius
23:23 ttkay: 194 Temperature_Celsius 0x0022 111 096 000 Old_age Always - 36 (Min/Max 24/51)
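
To make the columns of that line easier to read, here is a small Python sketch that splits it into named fields, assuming the standard `smartctl -A` column layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The class and function names are hypothetical. Note that the normalized VALUE (111) and the raw value (36) are different numbers; the 118 and 119 in the smartd log line may likewise be normalized values rather than degrees, though this log does not confirm that.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    flag: str
    value: int       # normalized value, not a physical unit
    worst: int
    thresh: int
    attr_type: str
    updated: str
    when_failed: str
    raw_value: str   # vendor-specific raw field, kept as text


def parse_attribute_line(line: str) -> SmartAttribute:
    """Split one attribute row; the raw field may itself contain spaces."""
    parts = line.split(None, 9)
    if len(parts) != 10:
        raise ValueError(f"unexpected attribute line: {line!r}")
    return SmartAttribute(
        int(parts[0]), parts[1], parts[2],
        int(parts[3]), int(parts[4]), int(parts[5]),
        parts[6], parts[7], parts[8], parts[9],
    )


if __name__ == "__main__":
    line = ("194 Temperature_Celsius 0x0022 111 096 000 "
            "Old_age Always - 36 (Min/Max 24/51)")
    attr = parse_attribute_line(line)
    raw_temp = int(attr.raw_value.split()[0])
    print(attr.name, "normalized:", attr.value, "raw:", raw_temp)
```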
23:24 ttkay: how odd
23:24 karolherbst: heh
23:24 karolherbst: why doesn't smartd get it then
23:24 karolherbst: uff
23:24 karolherbst: ohh, wait
23:25 ttkay: smartd is what's logging it, so yeah, regardless it's smartmontools' problem .. will file a bug
23:25 karolherbst: the min/max might be stored in a different format
23:26 karolherbst: but the 64ºC offset looks plausible
23:28 * ttkay needs to focus on work for a bit, but will report back when the p73's running nouveau with your patches
23:53 imirkin_: pfft, work. who needs that.