00:51 kisak: I've got a near-miss question... I'm trying to move from linux 5.2 to 5.4, and hit multiple regressions with my laptop. I just finished bisecting the first nouveau issue, where the secondary gpu (low end fermi) is not powering down when idle, to "PCI: Enable NVIDIA HDA controllers".
00:52 kisak: So it's fairly easy to conclude that the sound subcomponent is keeping the chip from powering down. Anyone happen to know how I can power it down?
00:53 imirkin: that's a known issue
00:53 imirkin: and has been discussed
00:54 imirkin: unfortunately i don't have the specifics
00:54 imirkin: let me see if i can track down the conversations about this
00:54 imirkin: basically you have to enable PM somewhere
00:55 imirkin: kisak: echo on > /sys/bus/pci/devices/0000\:01\:00.1/power/control
00:56 imirkin: or auto, dunno
00:56 kisak: the other 5.2->5.3 regression is the loss of the chipset temp sensor, not bisected yet
00:57 imirkin: kisak: look for this patch from Lukas Wunner: ALSA: hda - Force runtime PM on Nvidia HDMI codecs
00:57 kisak: looks like that was set to auto
00:57 imirkin: and relevant info at https://bugs.freedesktop.org/show_bug.cgi?id=75985#c81
00:57 imirkin: mmmm ... temp sensor, like the one attached to the nvidia board? that's surprising.
00:58 kisak: yeah, I've had it in MATE's system tray, when the chipset is off it reads 0C, which makes it really quick and easy to know it's powered down
00:58 imirkin: ah, we did change it to return -ENODEV now
00:58 imirkin: (when it's powered down)
00:59 imirkin: (or whatever the return code for "sensor gone" is)
01:00 kisak: thanks for pointing to that patch, testing now
01:00 imirkin: that patch should be in 5.4-rc4 though
01:00 kisak: oh hmm
01:00 imirkin: if you look at the bz link, there's some additional info in there
01:01 imirkin: apparently some userspace likes to randomly disable runpm too
01:06 karolherbst: kisak: yeah.. in the past we returned some bs value
01:06 karolherbst: a 14 bit -1 or something
01:06 karolherbst: kisak: but your GPU should power down if sensors reports an error, so I am wondering why you think it doesn't?
01:07 kisak: I know the nvidia chipset is active because idle temps are up 6-10C
01:07 karolherbst: so you don't know
01:07 karolherbst: it could be anything
01:07 karolherbst: probably your iGPU not entering power savings state
01:07 karolherbst: there was a CVE and the mitigation was to never idle the intel GPU
01:07 karolherbst: it will be fixed with 5.5
01:08 karolherbst: if you check the igpu power states, it should never reach rc6
01:08 karolherbst: on my system I get like 30% higher idle power consumption due to that
01:09 kisak: hmm? okay, a quick revert of "PCI: Enable NVIDIA HDA controllers" should prove me wrong on the temps, was seeing the elevated temps with 5.3
01:11 karolherbst: it might be you are not hit by that CVE, but I am not aware of anything inside nouveau breaking runpm and I am aware of that intel thing hitting 5.3 :)
01:15 kisak: the raised temps was backed by low delay DRI_PRIME=1 glxinfo -B (has to wait a few seconds to power up the chip) and temp readout on the nvidia chip during the tail end of the bisect, but I didn't document those indicators in my bisect log
01:17 karolherbst: ? how does that make sense? If you power on the GPU of course your system gets hotter
01:17 karolherbst: or what issue are you exactly trying to bisect
01:19 karolherbst: kisak: anyway, if sensors reports N/A, the GPU is most likely off, unless something odd went on.. but there is no knowing unless you poke ACPI and even that could lie. Anyway you probably want to check your intel SoC state with turbostat or so
01:19 karolherbst: that usually gives you a better indication of what's happening
01:20 kisak: the issue I'm troubleshooting is the secondary gpu not powering down when idle, just rebooted onto 5.4.13 with that commit reverted, waiting for temps to settle
01:20 karolherbst: why do you think it's not powering down?
01:23 kisak: so, I reached 40C, which this hardware can't do with the nvidia gpu awake
01:24 kisak: cat /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_status now gives me suspended instead of active
01:24 karolherbst: ohh, maybe imirkin's hint was actually right..
01:24 karolherbst: kisak: did you check /sys/bus/pci/devices/0000\:01\:00.1/power/control?
01:24 kisak: doesn't exist anymore
01:24 karolherbst: uff.. ehh, right
01:25 karolherbst: your HDMI audio is then broken
01:25 karolherbst: which was what the patch in question tried to fix
01:26 karolherbst: kisak: mind polling the /sys/bus/pci/devices/0000\:01\:00.1/power/runtime_status and /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_status every second on the broken kernel?
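The once-a-second polling asked for here could be sketched as a small shell function. This is only an illustration: the function name poll_runtime_status and its arguments are hypothetical, and it assumes each device directory exposes power/runtime_status as in the paths quoted above.

```shell
# poll_runtime_status COUNT DEVICE_DIR...
# (hypothetical helper) Prints one line per second, COUNT times, showing
# the runtime PM status of every device directory given.
poll_runtime_status() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        line=""
        for dev in "$@"; do
            # A missing file (e.g. the HDA function is not enabled) shows as "n/a".
            status=$(cat "$dev/power/runtime_status" 2>/dev/null || echo "n/a")
            line="$line$(basename "$dev")=$status "
        done
        echo "${line% }"
        i=$((i + 1))
        # Sleep between samples, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep 1
        fi
    done
    return 0
}
```

For the devices in this log it might be called as poll_runtime_status 60 /sys/bus/pci/devices/0000:01:00.0 /sys/bus/pci/devices/0000:01:00.1, watching for the audio function (01:00.1) staying "active" while nothing uses it.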
01:27 karolherbst: I am actually wondering if something does some nasty audio stuff
01:27 kisak: I'll need a hint as to what you're looking for
01:27 karolherbst: but uhm...
01:27 karolherbst: if the GPU gets woken up, nouveau should start reporting GPU temperatures as well
01:27 karolherbst: but it sounded like that never happened, right?
01:28 karolherbst: kisak: ohh wait.. you mean that mate stops reporting the GPU temperature altogether?
01:28 karolherbst: like completely?
01:28 karolherbst: even if the GPU is turned on?
01:30 kisak: okay, so now that HDA is disabled, I've discovered the sensor in mate's sensor applet. works with glxgears, reads as ERROR when suspended
01:30 karolherbst: kisak: that's on 5.4, right?
01:30 karolherbst: or with that audio stuff reverted?
01:31 kisak: 5.4.13, HDA enable reverted
01:31 karolherbst: and what happened when the commit wasn't reverted?
01:31 kisak: the nvidia chipset doesn't power down
01:31 karolherbst: I meant the temp reading
01:32 kisak: 46-50 idle temp instead of 40
01:32 karolherbst: I meant the GPU temp
01:32 kisak: sorry, I couldn't identify it at the time
01:32 karolherbst: mhhhh
01:32 kisak: rebooting to the kernel
01:32 karolherbst: k
01:33 karolherbst: my guess is that userspace keeps polling on the audio device or something, which lets the gpu be woken up regularly
01:36 kisak: laptop's back up on 5.4.13, hda enabled. temp sensor is fine.
01:37 karolherbst: so it constantly shows a GPU temperature?
01:38 karolherbst: cat /sys/bus/pci/devices/0000\:01\:00.1/power/control returns "auto"? what does /sys/bus/pci/devices/0000\:01\:00.1/power/runtime_status return?
01:38 kisak: so, it just changed identity somewhere between 5.2 to 5.3, and with there being 3 "temp1"'s I couldn't tell them apart properly until it was powered down
01:38 kisak: yes, live temp off the gpu
01:39 kisak: auto / active
01:40 karolherbst: ahh
01:40 karolherbst: okay, so something does use the audio device
01:40 karolherbst: does /sys/bus/pci/devices/0000\:01\:00.1/power/runtime_active_time increase over time?
01:41 karolherbst: well.. obviously it will
01:41 karolherbst: kisak: do you have anything connected on the GPU? like HDMI or DP?
01:41 kisak: checking the system config, this is a pulseaudio system. GF High Definition Audio Controller is set to disabled in pulse's config
01:42 kisak: no external connections
01:42 karolherbst: kisak: you should have several audio devices, right?
01:42 kisak: runtime_active_time is increasing
01:43 kisak: intel-hda / Realtek ALC269VB and the nvidia hdmi
01:43 karolherbst: can you disable the nvidia hdmi one?
01:43 karolherbst: usually there is some HDMI stuff configured on that device
01:44 kisak: that's exactly where I tried to start asking for help
01:45 karolherbst: right.. but now I am sure on what's happening
01:45 karolherbst: but.. well
01:45 karolherbst: it's not a kernel bug but userspace being silly
01:45 karolherbst: question is: what exactly
01:47 karolherbst: kisak: do you use pavucontrol to check the pulse stuff or something else?
01:48 karolherbst: I am wondering if there is some application using the HDMI audio device for odd reasons
01:48 kisak: fast and dirty check was with MATE's utility, grabbing it now
01:49 karolherbst: uff.. I hope it's not pulse itself being silly
01:49 karolherbst: Lyude: do you know if pulse is being silly?
01:52 kisak: nothing interesting in pavucontrol, nvidia chipset set to off
01:52 karolherbst: mhhhhh
01:55 kisak: disabled pulseaudio's autorespawn and killed it, so that's removed from the equation
01:55 kisak: no effect
01:56 karolherbst: there is probably something else using it
01:56 karolherbst: I stopped pulse here and I still have a reference on the audio card (intel one)
01:57 karolherbst: kisak: lsof | grep -i snd
01:58 karolherbst: or maybe rather lsof | grep -i /dev/snd
01:58 karolherbst: I have an alsactl doing stuff
01:58 karolherbst: alsa-state.service
01:59 karolherbst: mhh, now my intel audio device got suspended
01:59 kisak: no output from that (perk of openrc)
01:59 karolherbst: you have lsof installed, right?
02:00 kisak: just compiled it before using it
02:00 karolherbst: mhhhh
02:00 karolherbst: openrc has this audio sound level restore/save service, right?
02:00 karolherbst: but uhm...
02:00 kisak: yes, it only runs during startup and shutdown, no monitoring in the middle
02:01 karolherbst: lsof /dev/snd/controlC0
02:01 karolherbst: does that show anything
02:01 karolherbst: ?
02:01 karolherbst: same for lsof /dev/snd/controlC1
02:01 kisak: just checked both, no output
02:01 karolherbst: or maybe just do "lsof /dev/snd/*"
02:02 kisak: it rejected the quotes, without the quotes it's an empty result
02:02 karolherbst: mhhhh
02:03 karolherbst: cat /sys/bus/pci/devices/0000\:01\:00.1/power/runtime_status still shows active?
02:03 kisak: yes
02:03 karolherbst: now it's getting weird
02:06 karolherbst: ohhh.. I have an idea
02:06 karolherbst: as root
02:06 karolherbst: echo Y > /sys/module/snd_hda_intel/parameters/power_save_controller
02:06 karolherbst: echo 5 > /sys/module/snd_hda_intel/parameters/power_save
02:07 karolherbst: and if that doesn't change anything, try starting pulseaudio for a moment and kill it again
02:08 karolherbst: okay.. I can confirm that audio device should get suspended even when pulseaudio is running
02:09 kisak: it just powered down
02:09 kisak: with starting pulseaudio
02:09 karolherbst: \o/
02:10 karolherbst: you might want to create a modprobe.d file then
02:10 karolherbst: with "options snd_hda_intel power_save_controller=Y power_save=5" as the content
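A minimal sketch of creating that modprobe.d file, assuming root access. The function name write_power_save_conf and the file name used in the note below are hypothetical; only the options line itself comes from the conversation.

```shell
# write_power_save_conf PATH
# (hypothetical helper) Writes the snd_hda_intel options line quoted above
# to PATH, replacing any existing file at that location.
write_power_save_conf() {
    printf '%s\n' 'options snd_hda_intel power_save_controller=Y power_save=5' > "$1"
}
```

As root this might be run as write_power_save_conf /etc/modprobe.d/snd-hda-power-save.conf (an assumed file name). The options only take effect the next time snd_hda_intel is loaded, so the runtime echo commands above are still needed for the current boot.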
02:11 * karolherbst is wondering why the alsa people still don't enable it by default... what year is it, 1980?
02:12 karolherbst: oh wow..
02:12 karolherbst: but that even works while pausing playback of videos
02:12 karolherbst: runtime suspending that is
02:14 kisak: oh, that looks like a kernel config issue, I had that set to 0 in my config
02:14 karolherbst: what exactly?
02:14 karolherbst: the delay?
02:14 karolherbst: it is 1 for me
02:15 karolherbst: CONFIG_SND_HDA_POWER_SAVE_DEFAULT.. but the doc says that 0 means disable
02:15 karolherbst: and it was set to 0 for me at runtime
02:15 karolherbst: something is weird
02:15 karolherbst: ohh.. tlp
02:15 kisak: yes, that was set to 0
02:16 karolherbst: okay
02:16 karolherbst: yeah.. if you set it to 1 or something then the issue should also get fixed
02:16 karolherbst: I wasn't aware that I've misconfigured tlp here :D
02:16 karolherbst: audio runpm was disabled when on AC
02:17 kisak: the setting never had any practical meaning to me before because the chipset uses negligible power by itself on a desktop
02:17 karolherbst: true
02:18 karolherbst: imirkin: the next time something like that happens, let users check CONFIG_SND_HDA_POWER_SAVE_DEFAULT and /sys/module/snd_hda_intel/parameters/power_save*
02:18 karolherbst: apparently it's easy to configure it in a way audio devices never suspend :)
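The check suggested here could look roughly like the sketch below. The function name show_power_save_params is hypothetical, and the default directory is the sysfs path mentioned above.

```shell
# show_power_save_params [DIR]
# (hypothetical helper) Prints every power_save* module parameter in DIR
# as name=value, defaulting to the snd_hda_intel parameters directory.
show_power_save_params() {
    dir=${1:-/sys/module/snd_hda_intel/parameters}
    for f in "$dir"/power_save*; do
        # Skip the unexpanded glob when the module is not loaded.
        [ -e "$f" ] || continue
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f")"
    done
    return 0
}
```

A power_save value of 0 means the audio device never runtime-suspends. The compiled-in default can be looked up with something like zgrep CONFIG_SND_HDA_POWER_SAVE_DEFAULT /proc/config.gz, assuming the kernel exposes its config there, though tools such as tlp or laptop-mode-tools may override it at runtime.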
02:19 kisak: sorry about the roundabout trouble
02:20 karolherbst: no worries, next time we know what to look for :)
02:40 kisak: so, it looks like I was misconfigured in laptop-mode-tools as well
02:43 kisak: and now that I'm hands off on mate-sensors-applet, it successfully transitioned from gpu temp -> error readout -> gpu temp -> error readout
02:43 kisak: that'll do nicely
02:46 kisak: rebuilt the kernel, CONFIG_SND_HDA_POWER_SAVE_DEFAULT is irrelevant when tlp or laptop-mode-tools overrides it anyway
19:40 imirkin: skeggsb: if you get a chance, take a look at https://bugzilla.kernel.org/show_bug.cgi?id=206225 -- someone having issues with cipher on a G96 for resume.