22:50ttkay: hiya .. I'm trying to get Xorg running under a 5.4.0-rc6 kernel on my Lenovo p73, which has a Quadro P620. Unfortunately when I switch into graphics mode the nouveau kernel module detects that the device is 511 C and shuts down the system
22:50ttkay: looking at the module sources, I didn't see an obvious way to disable this behavior, and I'm not sure if that's the best redress anyway
22:50imirkin_: that sounds a lot like a -1
22:50ttkay: this is my first time dealing with nouveau, so any advice would be welcome
22:51imirkin_: 511 = 0x1ff
22:51ttkay: *nod* it's clearly not detecting the temperature correctly
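
[Note] The "-1" and "0x1ff" remarks above refer to the same bit pattern: a field of all ones reads as 511 when treated as unsigned and as -1 when treated as signed. A minimal sketch of that arithmetic follows. The 9-bit width is an assumption inferred from 0x1ff; it is not taken from the nouveau sources, and the helper names are hypothetical.

```python
FIELD_BITS = 9  # assumption: width inferred from 0x1ff, not from nouveau's code


def to_unsigned(value: int, bits: int) -> int:
    """Reinterpret an integer as an unsigned field that is `bits` wide."""
    return value & ((1 << bits) - 1)


def to_signed(value: int, bits: int) -> int:
    """Reinterpret the low `bits` bits of an integer as a two's-complement value."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


# -1 stored in a 9-bit field has every bit set: 0x1ff, i.e. 511.
print(hex(to_unsigned(-1, FIELD_BITS)), to_unsigned(-1, FIELD_BITS))
# Going the other way, 0x1ff read as a signed 9-bit value is -1.
print(to_signed(0x1FF, FIELD_BITS))
```
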
22:51imirkin_: pretty sure we don't do anything too drastic beyond trying to spin up fans on higher temp
22:51imirkin_: and P620 sounds like pascal
22:51ttkay: it is pascal, yeah
22:51imirkin_: in which case we don't control the fans at all - it's done by the firmware
22:51imirkin_: do you have a dmesg?
22:52ttkay: imirkin_ - yeah, let me upload that, just a minute .. in the meantime here's a snippet: http://ciar.org/h/517fd4.txt
22:52imirkin_: right ... i just don't think those messages have any teeth
22:52imirkin_: they're annoying, but not much beyond that
22:52ttkay: orly? okay
22:53imirkin_: i mean, i guess i dunno - maybe shutdown is For Real (tm)
22:53ttkay: the laptop rebooted immediately afterwards, so I just put two and two together
22:53imirkin_: lol, looks like something does run 'shutdown'
22:53imirkin_: which means you are getting some acpi event
22:53imirkin_: which is telling your distro to shut down
22:53imirkin_: yay for user friendly!
22:54imirkin_: maybe the temp really does get high btw (but maybe not the full 511 degrees...)
22:54imirkin_: Nov 8 05:25:31 kirov smartd: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 119
22:54imirkin_: that's fairly toasty
22:54ttkay: sequence of events: (1) running the laptop in text mode for about an hour, (2) tried to start Xorg, (3) laptop shut down
22:55imirkin_: it really sounds like you're having a thermal emergency here
22:56karolherbst: ttkay: there is a fix upstream already for that
22:56karolherbst: or should be...
22:56imirkin_: for which part?
22:56ttkay: karolherbst - that's good news :-) thanks
22:56karolherbst: imirkin_, ttkay: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm/nouveau?id=68bf8b577977a2804af82ffe03ce25d952f82fd2
22:57karolherbst: that's already in 5.4
22:57karolherbst: so I am wondering why ttkay is still hitting that
22:57ttkay: thanks for the pointer, at least .. it's something else to be looking at
22:57karolherbst: sensors shouldn't report 511 C, but N/A instead e.g.
22:57ttkay: if it comes down to it, I can edit the sources and recompile
22:58karolherbst: so I really don't know where that 511 is coming from.. maybe bad timing?
22:58karolherbst: but also doubtful
22:58ttkay: when I run lmsensors, it does show N/A
22:58karolherbst: then I guess it's a bit racy
22:58imirkin_: i see some messages like "refused to change power state, currently in D3"
22:58karolherbst: yeah.. that's runpm being broken
22:58imirkin_: so perhaps we're trying to bring it up, but it fails
22:58karolherbst: I have a patch
22:58imirkin_: but we think we brought it up
22:59ttkay: imirkin_ - I think that's because I'm already running the system in its lowest power mode
22:59karolherbst: top two commits
22:59karolherbst: one fixes a common secboot issue
22:59karolherbst: the other fixes runpm
22:59imirkin_: ttkay: check your fans though ... it kinda feels like you might have legit temp issues
22:59ttkay: lmsensors shows reasonable temperatures and nothing too warm
22:59karolherbst: ohh wait..
22:59karolherbst: ttkay: can I have your lspci -nn please?
23:00karolherbst: "lspci -nntvv" actually
23:00ttkay: this "pci: prevent putting pcie devices into lower device states" commit looks likely
23:00ttkay: karolherbst - coming right up
23:01karolherbst: ohh, wait.. that will crash the system.. oops
23:01ttkay: yep, it just did
23:01karolherbst: boot with nouveau.runpm=0 please
23:01karolherbst: then you can safely run lspci
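
[Note] Before running lspci again, it may be worth confirming that the parameter actually reached the kernel. A rough sketch that looks for a parameter on the kernel command line is below. The sample command line is hypothetical, `kernel_param` and `read_cmdline` are illustrative names, and the simple whitespace split does not handle quoted values.

```python
def kernel_param(cmdline: str, name: str):
    """Return the last value given for `name` on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name and sep:
            value = val  # keep scanning so the last occurrence is returned
    return value


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz-5.4.0-rc6 root=/dev/sda1 ro nouveau.runpm=0"
print(kernel_param(sample, "nouveau.runpm"))
```

On the affected machine, `kernel_param(read_cmdline(), "nouveau.runpm")` would be the call to make.
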
23:02karolherbst: the patch checks against certain Intel bridge controllers, so yours might not be on that list yet
23:02karolherbst: and then you can check if that patch helps
23:02karolherbst: (which would be super helpful)
23:05ttkay: I'm a bit out of my element, no idea which of those is a bridge controller
23:05ttkay: "Intel Corporation UHD Graphics 630" perhaps? but that looks like the integrated gpu maybe?
23:06imirkin_: unless you actually want to make use of that nvidia gpu, just boot with nouveau.modeset=0 and live a happy life
23:07karolherbst: ttkay: lspci -s 00:01.0 -nn
23:07karolherbst: imirkin_: well.. runpm should work though, otherwise battery runtime is crap and all of that
23:08imirkin_: is it though? with the fancy new stuff?
23:08karolherbst: and I have a patch, but upstream isn't really in the mood of accepting it yet
23:08karolherbst: imirkin_: well, if the nv gpu is on, the battery won't hold as long
23:08karolherbst: on my system it cuts 66%
23:08imirkin_: i get that
23:08imirkin_: but i mean with _PR3 and whatnot
23:08imirkin_: do you still need nouveau to power it off?
23:08karolherbst: sure, because if there is a driver, it's the driver's responsibility to do that
23:09karolherbst: although.. with modeset=0 it might not...
23:09imirkin_: ok, so the pci subsystem doesn't do this automatically?
23:09karolherbst: we don't bind to the device, or do we?
23:09imirkin_: we don't
23:09karolherbst: it won't get automatically enabled in either case
23:09imirkin_: but it might start out off? :)
23:09karolherbst: it starts out as on
23:10ttkay: sorry, had to afk for a moment .. catching up
23:10karolherbst: imirkin_: anyway, I have two patches which should at least solve all those runpm+secboot issues and I'd prefer people testing them so I am sure they work for real :)
23:10karolherbst: ahh, same as mine
23:11karolherbst: ttkay: forgot to push my newest versions of the patches
23:11karolherbst: if you want you can try them out
23:11karolherbst: and with both applied runpm should just work
23:11ttkay: cool .. that's what you're referring to by '<karolherbst> and I have a patch, but upstream isn't really in the mood of accepting it yet' ?
23:11karolherbst: and using the GPU as well
23:11karolherbst: ttkay: not enough evidence for blaming Intel
23:12ttkay: thanks :-) I'll apply the patch and rebuild, see how it goes
23:12imirkin_: i think several patches
23:12karolherbst: the patch itself is fine, it's just that everybody would like to see more data points and information
23:12ttkay: looking at them now
23:12karolherbst: yeah, two
23:12karolherbst: one is a secboot workaround which skeggsb is already working on making obsolete
23:12karolherbst: so we might get that covered soon as well
23:13karolherbst: ohhh, wait
23:13karolherbst: I know where the 511 comes from
23:13ttkay: I really appreciate your help :-) never been to this channel, and didn't know what to expect, but you folks are awesome
23:13karolherbst: runpm fails to resume, so then all reads are garbage
23:13karolherbst: and our error checking is bad...
23:13karolherbst: so we just think the device is there.. because the pci subsystem just continues on its merry way
23:14karolherbst: because... if the device is stuck at D3, the best thing to do is to assume it's functional....
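
[Note] The failure mode described above can be sketched in a few lines. Reads from a PCI device that is not responding typically come back as all ones (0xffffffff), so masking out a temperature field without a validity check yields the field's maximum value. The mask and function names below are hypothetical and do not reflect nouveau's real register layout or code.

```python
ALL_ONES = 0xFFFFFFFF  # what a 32-bit read from an unresponsive PCI device returns
TEMP_MASK = 0x1FF      # assumption: a 9-bit temperature field, chosen to match 511


def read_register_unresponsive() -> int:
    """Stand-in for an MMIO read from a GPU that failed to leave D3."""
    return ALL_ONES


def decode_temp_naive(raw: int) -> int:
    """Mask out the temperature field with no sanity check."""
    return raw & TEMP_MASK


def decode_temp_checked(raw: int):
    """Treat an all-ones read as 'device not there' instead of a temperature."""
    if raw == ALL_ONES:
        return None
    return raw & TEMP_MASK


raw = read_register_unresponsive()
print(decode_temp_naive(raw))    # the bogus maximum of the field
print(decode_temp_checked(raw))  # None, which a caller could report as N/A
```
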
23:17karolherbst: ttkay: it seems like your HDD reports a temperature 64ºC too high
23:17karolherbst: you might want to report that to the relevant subsystem as well (open a bug on bugzilla or something)
23:17karolherbst: or... well
23:17karolherbst: that might be a smart only thing
23:18karolherbst: then whatever www.smartmontools.org says
23:19ttkay: karolherbst - okie-doke
23:21ttkay: hrm, smartctl -a /dev/nvme0n1 shows a sane-looking temperature (40degC and 38degC on the two sensors) so it might be something lost in translation
23:23karolherbst: ttkay: /dev/sda
23:23ttkay: oh oops
23:23ttkay: root@kirov:~# smartctl -a /dev/sda | grep Temperature_Celsius
23:23ttkay: 194 Temperature_Celsius 0x0022 111 096 000 Old_age Always - 36 (Min/Max 24/51)
23:24ttkay: how odd
23:24karolherbst: why doesn't smartd get it then
23:24karolherbst: ohh, wait
23:25ttkay: smartd is what's logging it, so yeah, regardless it's smartmontools' problem .. will file a bug
23:25karolherbst: the min/max might be stored in a different format
23:26karolherbst: but the 64ºC offset looks plausible
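
[Note] The smartctl attribute line pasted above has the usual columns ID, name, flag, normalized VALUE, WORST, THRESH, type, updated, when-failed, and raw value. A small sketch that splits that exact line into its fields is below. It shows that the raw reading is 36 while the normalized value is 111. One possible explanation for the smartd log entry, offered only as an assumption, is that smartd was reporting the normalized column rather than degrees Celsius. The `SmartAttr` type and `parse_attr_line` helper are illustrative names.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    flag: str
    value: int      # normalized value (vendor-defined scale, not degrees)
    worst: int      # worst normalized value seen
    thresh: int     # failure threshold for the normalized value
    type: str
    updated: str
    when_failed: str
    raw: str        # raw value column, may carry extra text such as Min/Max


def parse_attr_line(line: str) -> SmartAttr:
    """Split one row of the smartctl attribute table into named fields."""
    p = line.split(None, 9)  # the tenth field keeps the rest of the line intact
    return SmartAttr(int(p[0]), p[1], p[2], int(p[3]), int(p[4]), int(p[5]),
                     p[6], p[7], p[8], p[9])


line = "194 Temperature_Celsius 0x0022 111 096 000 Old_age Always - 36 (Min/Max 24/51)"
attr = parse_attr_line(line)
raw_celsius = int(attr.raw.split()[0])  # leading number of the raw column
print(attr.value, attr.worst, raw_celsius)
```
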
23:28ttkay:needs to focus on work for a bit, but will report back when the p73's running nouveau with your patches
23:53imirkin_: pfft, work. who needs that.