00:09karolherbst: mhh, okay, well the thing is, we don't really balance the pci_enable_device and pci_disable_device calls
00:09karolherbst: at least that isn't the cause of the runpm screwup
09:52karolherbst: Lekensteyn: I think I found something yesterday, which kind of worries me
10:06karolherbst: airlied, Lekensteyn: so it seems like I was able to remove the pci device after the resume fail. And even after cycling between off and on states I wasn't able to re-add the device through a bus scan until I suspended my entire laptop
10:06karolherbst: and after resume it appeared again after a rescan
10:07karolherbst: any ideas why that might be or how we could mess up the GPU that even after powering down the device, it doesn't correctly power up?
10:15Lekensteyn: system suspend is quite a big hammer, it also toggles the Power Resources (PG00._OFF/_ON). How exactly did you "cycle between off and on states"? By toggling runpm on the pci port or manually invoking ACPI methods?
10:16karolherbst: Lekensteyn: manually invoking _OFF/_ON through ACPI, yes
10:16karolherbst: normally this helps to repair a broken GPU state
10:16karolherbst: as you can load nouveau again, and it does a full repost
10:17karolherbst: helps if you break the GPU even after a failed secboot and whatnot
10:17karolherbst: but maybe something remains still?
10:17karolherbst: or maybe the GPU disconnects in a way, that even ACPI can't reach it?
10:18karolherbst: Lekensteyn: mhh, I could check the power consumption though when that happens
10:18Lekensteyn: good idea
10:19Lekensteyn: also, the ACPI SSDT/DSDT might execute some black magic during system suspend/resume that might have side-effects
10:20karolherbst: anyway, I plan to fix this stupid issue for real now, because it really started to get annoying
10:21karolherbst: I will try to ask nvidia about it, maybe they provide some answers
10:21Lekensteyn: fixing it would be awesome
10:22karolherbst: well, I start to kind of get down to what the situation is on the hw level
10:22karolherbst: or at least I think so
10:33Noxbru: I have a little question
10:34Noxbru: I have the nvidia drivers installed in a dual gpu laptop, which I use through bumblebee
10:34Noxbru: so far, I don't use the nvidia card much, so I was thinking of switching to nouveau
10:35Noxbru: is there any wiki that could help me through the process of uninstalling nvidia and installing nouveau and having something like bumblebee to use the nvidia card 'on demand' ?
10:40Lekensteyn: Noxbru: what do you run bumblebee for? if you purge bumblebee and nvidia, you should be good to go. What distro?
10:40Noxbru: Lekensteyn: mainly for games, although lately I just use the intel card, and as the games are usually CPU bound there's no need for a 'good' gpu
10:40Noxbru: and I am using Arch
10:42Lekensteyn: in that case, why not just leave the nvidia GPU unused? that will likely increase battery life and reduce heat production
10:42Lekensteyn: you can optionally install xf86-video-nouveau
10:42Lekensteyn: dual monitor used to work, but broke with xorg 1.20.0 (at least, on Arch)
10:49Noxbru: I'm not worried much about dual monitor, I use that at work, but supposedly it should default to the intel card, right?
10:55gnarface: Noxbru: it depends on the hardware but usually it defaults to the Intel with these types of laptops, yes. most distros have nouveau installed by default already. *probably* all that it takes to enable it would be to blacklist the nvidia driver and un-blacklist nouveau. actually making use of it is another story...
12:13karolherbst: Lekensteyn: seems like with all the changes I've made it is kind of reliable to fail resuming the GPU, remove the PCI device and rmmod nouveau
12:14karolherbst: so, nvidia GPU is removed and the bus is suspended and I get around 12W, but the power source of the GPU is actually set to D0
12:16karolherbst: when I put it to d3, I get around 10W, which is still more than what I usually get
12:16karolherbst: mhh, actually no
12:16karolherbst: that doesn't change a thing
12:17karolherbst: talking about the \_SB.PCI0.PEG0.PEGP._PSC handle
12:17karolherbst: C is current, but you can select a state with PS0/PS3...
12:17karolherbst: "\_SB.PCI0.PEG0.PG00._STA" returns 0
12:18karolherbst: so, let me try to reset that bus
12:18karolherbst: mhh, 17W with the bus put into D0
12:23karolherbst: Lekensteyn: mhh, calling \_SB.PCI0.PEG0.PG00._OFF gets me to <9W
12:23karolherbst: which is kind of the "correct" power consumption
12:31karolherbst: Lekensteyn: do you know how to turn on debugging for "rescan"?
12:31karolherbst: the call to it takes a while
12:31karolherbst: and I am wondering if it tries to detect the GPU but fails
12:31Lekensteyn: dunno if there is much additional debugging there
12:31karolherbst: ohh no, that is because the bus was suspended
12:32karolherbst: anyway, the GPU seems to be off for real as my laptop consumes around 9W of power
12:32Lekensteyn: when you forcefully invoke PG00._OFF by hand?
12:32karolherbst: "only" 14W if the bus is powered on
12:33karolherbst: Lekensteyn: no, it was just random userspace applications doing random stuff
12:33karolherbst: will check how much power it consumes after resuming the system
12:34Lekensteyn: that was a vague description :P Do you mean that you are testing this with changes to nouveau in combination with runpm transitions?
12:35karolherbst: Lekensteyn: mhh it seems to consume a little more, but I am not confident enough
12:36karolherbst: anyway, rescan after system suspend/resume brings the GPU back
12:38karolherbst: Lekensteyn: maybe there is some magic ACPI method we could call to "revive" the device?
12:39karolherbst: might be better to use that than to let the system die
12:42karolherbst: mhh, PEGR is just the pci config stuff
12:42karolherbst: ohh wait, pegp is the thing
12:45Lekensteyn: I doubt that such a magic method exists, it seems very firmware-specific. Only _OFF/_ON (and some others) are standardized
12:45Lekensteyn: maybe you can find a magic register to write though, but I am not aware of one
12:49karolherbst: Lekensteyn: do you know what the WKEN field is for? sounds like it should specify if the device is actually awake
12:50karolherbst: but that one is always 0
12:50karolherbst: although _DSW with 3 args sets it to 1
12:50karolherbst: still stays 0 though
12:50karolherbst: maybe my acpi files are too old as I updated the firmware after extracting
12:52karolherbst: ahh yeah, just extracted it again and a few things are different
12:56karolherbst: Lekensteyn: mhhhh, interesting
12:56karolherbst: Lekensteyn: you know that "\_SB_.PCI0.PEG0.VEID" field?
12:56karolherbst: guess what value it has
12:57karolherbst: NVID is also just 0xffffffff
12:58karolherbst: that explains _a_lot_
12:58Lekensteyn: Vendor ID?
13:00Lekensteyn: VEID is a 16-bit field, seems to be the vendor + product id
13:00Lekensteyn: *just vendor id
13:03karolherbst: veid is 0xffff
13:03karolherbst: it seems like the call to PGON doesn't really wake the device
13:03karolherbst: which is called before checking that stuff
13:05karolherbst: Lekensteyn: soo, let's see where pgon messes up
13:06Lekensteyn: for me PGON hangs in While ((\_SB.PCI0.PEG0.LNKS < 0x07))
13:07karolherbst: LNKS is 0 for me
13:07karolherbst: but it doesn't get used inside PGON
13:08karolherbst: and there is no loop
13:08Lekensteyn: you had an XPS 9560?
13:08karolherbst: loops inside firmware code are evil anyway
13:09karolherbst: mhh PGON and PGOF return 0
14:16pendingchaos: imirkin: in case you missed it: can you elaborate on "that won't end well ..."?
14:19pendingchaos: in response to https://patchwork.freedesktop.org/patch/240931/, you wrote "That won't end well. Bump the allocation to 6?"
14:19pendingchaos: it was about some code in nvc0_compute_validate_constbufs()
14:20imirkin: oh yes.
14:20imirkin: you have an array that's like foo
14:20imirkin: and then you access foo
14:20imirkin: that won't end well :)
14:21imirkin: + struct nvc0_cb_binding cb_bindings;
14:21imirkin: oh, but you never access it for compute?
14:21imirkin: ok, that's my bad then. ignore.
14:25pendingchaos: an updated patch has been sent
15:45Lyude: karolherbst: poke-do you know what nouveau_switcheroo_optimus_/red
15:45Lyude: sorry-i both got the answer to that question, and did not mean to send it but did by accident
15:46karolherbst: I think I figured out _something_
15:47karolherbst: when we load nouveau
15:47karolherbst: and I simply invoke the ACPI stuff to suspend the GPU, it works, kind of
15:48karolherbst: okay, next step: let the pci driver suspend the GPU and see if I am able to wake it up without it interfering
15:56karolherbst: so if we load nouveau, unload it and let the GPU be suspended, it resumes without problems
15:57karolherbst: when I suspend the device while nouveau is loaded, it breaks
15:57karolherbst: skeggsb_: do you know of any messups in the nouveau suspend code? I think you said something about the general suspending stuff not working all that great
15:57karolherbst: Lyude: do you know about some issues?
16:05karolherbst: uhm.... I get a bad feeling about that
16:05karolherbst: I hope that isn't some of the secure falcons getting all pissed at nouveau and messing up the GPU
16:07karolherbst: oh wow
16:08karolherbst: Lekensteyn: if I let nouveau_pmops_runtime_suspend return -EBUSY
16:08karolherbst: I am able to turn the GPU off via ACPI and if it wakes up the ACPI state isn't messed up at all
16:10karolherbst: Lekensteyn: if you would have to guess, what part of the pci suspend path may break it?
16:11karolherbst: the things touching PCI_PM_CTRL look kind of fishy to me
16:16Lyude: karolherbst: most of the issues I've been dealing with have to do with us deadlocking due to not disabling various hotplugging stuff in a way that won't lock
16:17Lyude: karolherbst: btw, have you compared how we suspend nouveau pci-wise to amdgpu or similar?
16:17karolherbst: Lyude: I am sure it is also broken for amdgpu
16:18Lyude: ...are you sure? I've been working with multiple amdgpu laptops and they've been fine
16:18Lyude: i just submitted some hotplugging fixes for one even
16:18karolherbst: agd5f mentioned something
16:18karolherbst: Lyude: which of those do d3cold actually?
16:18Lyude: i actually don't know
16:19karolherbst: laptops with AMD gpus are kind of rare these days
16:19karolherbst: especially the new ones
16:19Lyude: karolherbst: i've got a bunch if you ever need access
16:19Lyude: HP and Dell sell them in their XPS lines
16:19Lyude: *Zbook and XPS
16:19Lyude: lenovo has some too, but not in the good thinkpads for some dumb reason
16:20nyef: Probably because the "good" thinkpads are for "business users", who don't need fancy graphics.
16:21Lyude: no... there's the P5x and P7x lines, and T4xx that all have nv gpus
16:21Lyude: but that's probably all for cuda stuff
16:22karolherbst: "pcieport 0000:00:01.0: ASPM: current common clock configuration is broken, reconfiguring"
16:22karolherbst: I don't know
16:22karolherbst: this happens when I rescan the bus
16:23Lyude: which laptop is this with btw
16:24karolherbst: I am more and more sure that the issue is somewhere inside the pci subsystem
16:25Lyude: or maybe we're not using the pci subsystem correctly
16:25karolherbst: I removed all that code
16:25karolherbst: we don't have to call any code into the pci subsystem on suspend/resume
16:25Lyude: aaaah, I suspected as much!
16:25karolherbst: we only need it if we do that _DSM stuff
16:26karolherbst: otherwise the pci subsystem takes care of putting devices into d3cold
16:26karolherbst: or d3hot
16:26karolherbst: but it feels broken nonetheless
16:26Lyude: karolherbst: btw, we have a pci expert somewhere at RH who might have some interesting insight on this
16:26karolherbst: Lyude: I just need a name
16:26Lyude: i've dealt with bugs with them before, let me try to think
16:30Lyude: karolherbst: found their info, check rh irc
16:31karolherbst: will do after rebooting
16:31Lekensteyn: karolherbst: if I had to guess, perhaps pci_disable_device? in all traces I have from Windows, the BusMaster bit is still on
16:32karolherbst: Lekensteyn: that is a noop inside nouveau anyway
16:32karolherbst: we call enable inside nouveau _and_ inside drm
16:33karolherbst: and because that thing is refcounted, we never actually get to disabling anything
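A toy illustration of the refcounting point above, not kernel code: the class and method names are hypothetical and only model the idea that two enable calls followed by a single disable never bring the count back to zero, so the disable step is never actually reached.

```python
class ToyPciDevice:
    """Toy model of a refcounted enable/disable pair (hypothetical, not the kernel API)."""

    def __init__(self):
        self.enable_cnt = 0      # number of outstanding enable calls
        self.hw_enabled = False  # whether the device is actually enabled

    def enable(self):
        self.enable_cnt += 1
        if self.enable_cnt == 1:
            # Only the first enable touches the hardware.
            self.hw_enabled = True

    def disable(self):
        if self.enable_cnt <= 0:
            raise RuntimeError("disabling an already-disabled device")
        self.enable_cnt -= 1
        if self.enable_cnt == 0:
            # Only the last disable touches the hardware.
            self.hw_enabled = False


dev = ToyPciDevice()
dev.enable()   # assumed: one enable from the driver
dev.enable()   # assumed: a second enable from another layer
dev.disable()  # a single disable on suspend
print(dev.enable_cnt, dev.hw_enabled)
```

With unbalanced calls like these the count stays at 1, which matches the description of the disable being effectively a no-op.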
16:33karolherbst: anyway, balancing that out didn't help either
16:34karolherbst: also, I do it inside my stub module and there it works
16:34karolherbst: mhhh which is still weird
16:34karolherbst: weird actually...
16:37karolherbst: so this is fun now
16:41karolherbst: Lekensteyn: .... mhhh
16:41karolherbst: now pci doesn't even try to suspend the gpu and its parent
16:42karolherbst: which makes like no sense really?
16:44Lekensteyn: what is your runtime_usage count?
16:44Lekensteyn: I had refcount underflows at some point, but that also involved the audio function device
16:45karolherbst: I just rebooted
16:45karolherbst: but not even the parent device gets dropped... it is a bit weird
16:47karolherbst: Lekensteyn: do you know if something needs to be put into d3 state before the bus the gpu is on?
16:51Lekensteyn: not sure
16:52karolherbst: mhh, well, then I try to make pci happy
16:53Lekensteyn: huh okay..
16:54Lekensteyn: in a trace from Windows, I see PEGP._PS3 being invoked before PG00._OFF (ok)
16:54Lekensteyn: but I also see PEGP._PS0 being invoked *before* PG00._ON
16:54Lekensteyn: that is not really symmetric
16:59Lyude: Lekensteyn: by the way, how are you doing those traces?
16:59Lyude: I have many things I would like to trace from windows regarding acpi and some other stuff
17:02Lekensteyn: Lyude: I am using the AMLi debugger with a Checked/Debug build of Windows 10, let me try to find some notes
17:06Lekensteyn: Lyude: the official MS docs are here: https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/introduction-to-the-amli-debugger
17:07Lekensteyn: once you have a kernel debugger session running, you can use "!amli ..." commands, see for example https://lekensteyn.nl/files/p651ra-acpi-debug/amli.log
17:16Lyude: Lekensteyn: oooh, that is useful, thanks!
17:21karolherbst: Lekensteyn: not really and shouldn't matter that much
17:24karolherbst: that code makes no sense
17:43karolherbst: Lekensteyn: so you say calling that PS3 thing before resuming might help actually? I kind of doubt that because that thing seems to have literally no effect
17:55Lekensteyn: karolherbst: I don't think it has any effect, but it might suggest that Windows writes to PCI regs in a different order
17:56Lekensteyn: unfortunately I don't know how to trace PCI accesses on Windows, so there are no logs for that
18:00karolherbst: Lekensteyn: would I be able to set breakpoints with the acpi debugger or can I simply invoke methods?
18:01Lekensteyn: Linux? yes, I believe so
18:07Lyude: yes you can
18:08Lyude: there might be some stuff you have to enable in .config
18:15pmoreau: pendingchaos: Btw, feel free to mention the bug report (https://bugs.freedesktop.org/show_bug.cgi?id=100177) in your patch (“nvc0: serialize before updating some constant buffer bindings“); the syntax is “Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100177”.
18:16pendingchaos: not "Fixes: https://..."?
18:16pendingchaos: I put that in the second and third versions
18:16pmoreau: That’s when fixing a regression introduced by a patch.
18:17pmoreau: (See docs/submittingpatches.html)
18:17pendingchaos: I just saw it used in some random patch, so I used Fixes:
18:17pmoreau: My first thought was “Fixes: ” as well :-)
18:18pmoreau: Ah right, I missed it in the 2nd and 3rd version, sorry.
19:19agd5f: Lyude, https://bugs.freedesktop.org/show_bug.cgi?id=105760 for example
19:20agd5f: Lyude, and basically add the platforms in the amdgpu_px_quirk_list in amdgpu_atpx_handler.c. however on those platforms the ATPX method still works so we use that as a workaround
19:21Lyude: karolherbst: agd5f linked to https://bugs.freedesktop.org/show_bug.cgi?id=105760 while you were disconnected for a moment
19:21karolherbst: yeah, I already checked
19:28karolherbst: Lekensteyn: the heck... the actual heck
19:28karolherbst: Lekensteyn: guess what "fixed" it
19:29karolherbst: this is my patch to the kernel: https://gist.github.com/karolherbst/3cde7028a6b885ca42863b6f6320658c
19:29karolherbst: Lyude: ^^
19:29karolherbst: pls test...
19:30Lyude: woah what
19:30karolherbst: I called it :p
19:30Lyude: karolherbst: also-can't test right this moment unfortunately but I can get to the machine next time I'm in the office
19:30Lyude: all of my current prime machines mostly work actually
19:30karolherbst: agd5f: mind trying that patch as well
19:31karolherbst: I should still verify with powertop
19:31karolherbst: but I am quite sure
19:31Lyude: also hooray-the simplest way for me to fix some of the more exotic deadlocks with hotplugging in nouveau is to clean up the hotplugging code and let it probe connectors in parallel
19:31agd5f: karolherbst, I can ask the user on bug 105760 to try it. I don't have any systems that exhibit the problem unfortunately
19:32karolherbst: Lyude: the thing is, I also cleaned up some of the nouveau code, but basically just removing those pci_* calls
19:32Lyude: karolherbst: i haven't touched anything pci related so i wouldn't worry about that
19:33Lyude: basically every problem I'm fixing right now is related in some way shape or form to a display connector :P
19:33karolherbst: I meant in regards to my patch
19:33karolherbst: it might be that this will mess up stuff again
19:33karolherbst: need to test
19:33karolherbst: okay so... 10W
19:34karolherbst: 9.5W now
19:34Lyude: karolherbst: i can test it on machines that don't exhibit the issue if you want though
19:34karolherbst: so I guess on a 15" laptop with an i7, that means the GPU is gone :)
19:34karolherbst: and lspci shows this: 01:00.0 3D controller: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile] (rev a1)
19:34karolherbst: normally rev is ff in the broken case ;)
19:34karolherbst: secboot fails...
19:34karolherbst: silly secboot
19:35agd5f: karolherbst, reading back all ones is normal if the power is off
19:35agd5f: I.e., if you do lspci when the dGPU is in runtime suspend it will return rev ff
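A minimal sketch of the "all ones" check discussed here, assuming the raw register or ACPI field value has already been read by some other means. The helper name is hypothetical; 0x10de is NVIDIA's PCI vendor ID and is used only as an example of a sane readback.

```python
def reads_all_ones(value: int, width_bits: int) -> bool:
    """Return True if every bit of a width_bits-wide readback is set.

    A readback of all ones (0xffff for a 16-bit field, 0xffffffff for a
    32-bit one) typically means the device did not respond, for example
    because it is powered off.
    """
    mask = (1 << width_bits) - 1
    return (value & mask) == mask


print(reads_all_ones(0xffff, 16))      # 16-bit vendor ID field, no response
print(reads_all_ones(0xffffffff, 32))  # 32-bit field, no response
print(reads_all_ones(0x10de, 16))      # a valid vendor ID
```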
19:36karolherbst: not anymore
19:36karolherbst: access to the pci device wakes the device up
19:36agd5f: oh, that's new
19:36karolherbst: agd5f: it only returns ff for the non _PR3 devices
19:36agd5f: ah, ok
19:36karolherbst: or maybe this changed as well
19:37karolherbst: anyway, nvapeek returns a sane value
19:37karolherbst: which basically means the gpu was resumed correctly
19:46karolherbst: yep... besides the secboot issues, that patch fixes the runpm issues on my laptop
20:24Lekensteyn: karolherbst: wow, that's an odd fix. I wonder what Rafael thinks of such an approach
20:40karolherbst: Lekensteyn: yeah, uhm
20:40karolherbst: Lekensteyn: I guess putting the device into D3 state via the pci config space might break it
21:10karolherbst: Lekensteyn: I mean, it could make sense, couldn't it?
21:15Lekensteyn: karolherbst: I have a hypothesis that trying to access (even read?) the register might cause interference (why?) It sounds very weird though
21:15Lekensteyn: can you print the values that would be written/read?
21:15karolherbst: why should the read matter?
21:16karolherbst: Lekensteyn: there is nothing odd about the written values
21:16karolherbst: 0xb in the d3 case
21:16karolherbst: 0x8 otherwise I think
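For reference, a small sketch of how the two values quoted above decode if they are read as the PCI power management control/status word. The mask values mirror PCI_PM_CTRL_STATE_MASK (0x0003) and PCI_PM_CTRL_NO_SOFT_RESET (0x0008) from the Linux PCI register definitions; the function name is hypothetical.

```python
PM_CTRL_STATE_MASK = 0x0003     # bits 1:0 select the power state
PM_CTRL_NO_SOFT_RESET = 0x0008  # bit 3: No_Soft_Reset

STATE_NAMES = {0: "D0", 1: "D1", 2: "D2", 3: "D3hot"}


def decode_pmcsr(value: int):
    """Split a PM control/status value into (power state, no_soft_reset flag)."""
    state = STATE_NAMES[value & PM_CTRL_STATE_MASK]
    no_soft_reset = bool(value & PM_CTRL_NO_SOFT_RESET)
    return state, no_soft_reset


print(decode_pmcsr(0xb))  # the value written in the D3 case
print(decode_pmcsr(0x8))  # the value otherwise
```

Under that reading, 0xb and 0x8 differ only in the two power-state bits, which fits the remark that there is nothing odd about the written values.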
21:16Lekensteyn: no idea why read would matter, in my mind I have a scenario where the device is not expecting any PCI messages (or whatever the terminology is), but reading breaks that
21:16karolherbst: Lekensteyn: ahh no, it doesn't write on resume
21:17karolherbst: as the device is already in D0
21:17karolherbst: Lekensteyn: uhm.. that would be like super weird
21:17karolherbst: Lekensteyn: but why would a write setting it to D3 affect the read?
21:17Lekensteyn: yes, it is just a theory, it is certainly possible that it makes no sense at all
21:17karolherbst: or mess up reading later on?
21:17karolherbst: mhh, I could try it, but...
21:18karolherbst: I like my theory better
21:18Lekensteyn: in my trace I also see a read of register 0x64 followed by writing 0xb to that register, followed by another read and finally _PS3
21:19karolherbst: we do it _after_ the acpi stuff though
21:20karolherbst: so first _PS3, then _OFF then D3 into pci config
21:20karolherbst: or first _OFF then _PS3?
21:20karolherbst: doesn't matter, ACPI comes first
21:20karolherbst: or... actually
21:20karolherbst: let me check that
21:21karolherbst: okay, I am wrong
21:21karolherbst: Lekensteyn: https://gist.githubusercontent.com/karolherbst/99c34b84901509b4702a139d472d4d5d/raw/c4f95823a577841348ae73b717f274f3fe5306c3/gistfile1.txt
21:23Lekensteyn: yeah, so that's an attempt from D0 -> D3
21:24Lekensteyn: why would that break :/
21:24karolherbst: who knows
21:26Lekensteyn:grabs the ACPI spec, maybe there is something in there
22:24pendingchaos: imirkin: assuming it's deserving of one, can I have your Reviewed-By for https://patchwork.freedesktop.org/patch/241141/ sometime ?
22:46karolherbst: imirkin: the fp64 CTS fix was merged :)