03:35rhyskidd: karolherbst: sent you a R-b on your hwmon temp sensor patch for runpm-managed gpus
03:36rhyskidd: thanks for your recent work, with that + other patch series lots of rough edges being fixed up for my pascal
08:22karolherbst: rhyskidd: problem is just, that the runpm fixes make no sense... sadly. They are more like a workaround fixing nothing :/
08:22karolherbst: there are a few weird things going on on the GPU
08:23karolherbst: devinit reduces the pcie link speed to 2.5 and pci core is putting the GPU into d3hot via the pci config (0x64 on nv gpus / 0x88064 in mmio)
08:23karolherbst: when both of those things happen, the bridge reports a fatal error on resuming (via ACPI) and the GPU is dead
08:24karolherbst: the most recent runpm patch I sent out just skips the entire runpm stuff for the GPU, not triggering those conditions
08:24karolherbst: the device still gets turned off, because the ACPI method is still invoked when suspending the bridge, hence you still get the power savings :(
13:08rhyskidd: so you think your most recent patch "drm: don't set the pci power state if the pci subsystem handles the ACPI bits" is more likely correct?
17:52karolherbst: Lyude: I think I know how to fix this runpm issue... just need to try something out
18:00Lyude: karolherbst: Awesome!
18:07imirkin: Lyude: did you see the i2c regression thing?
18:08Lyude: imirkin: no, what's up?
18:08imirkin: see nouveau@
18:09imirkin: the commit hash in the email is bad
18:13Lyude: imirkin: what is the subject line?
18:13imirkin: NOUVEAU_LEGACY_CTX_SUPPORT Fan Speed
18:13imirkin: (obviously related to i2c, i know)
18:15Lyude: imirkin: hm, so it's an nv50 gpu
18:15imirkin: unclear if it's true nv50 or nv50-era
18:16imirkin: there were some marked i2c changes around g94, iirc
18:16Lyude: yeah, I was about to say the same thing, I'll ask
18:16Lyude: and yeah there were, it sounds like we might be reading out the fan speed without enabling the i2c bus, but I'll have to see when I get into the office on monday
18:17imirkin: do you have a collection of gpu's available there?
18:17Lyude: hopefully I have one that can reproduce this issue
18:18imirkin: fwiw i don't have a g80 myself. ben definitely does.
18:21imirkin: you can get them fairly cheap on ebay too if needed
18:21imirkin: just look for ones with funny amounts of memory, like 320/640mb
19:05karolherbst: that reminds me, I still have to fix the fan issue on my two g84
19:05karolherbst: the fan speed is in reverse order :)
19:08imirkin: the hotter it gets, the slower it goes?
19:09imirkin: what could go wrong with *that* policy ... :)
19:09karolherbst: well, the policy is fine, it's just that 100% fan speed means the fan stops :)
19:10karolherbst: it's probably some funky vbios bit
19:10karolherbst: I just didn't get to it yet
19:11karolherbst: Lyude: mhh, I couldn't try that stuff out I wanted to :/
19:11karolherbst: I want to drop the lnkCap of the bridge to 2.5, but... well, you can't do that via the PCI config space :(
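For reference, a minimal sketch of decoding the field in question. It assumes the standard PCIe layout (Link Capabilities register at offset 0x0C of the PCI Express capability, Max Link Speed in bits 3:0, Max Link Width in bits 9:4). The function name and the sample values are made up for illustration. The field is read-only from config space, which is consistent with the complaint above.

```python
# Sketch: decode a PCIe Link Capabilities (LnkCap) register value.
# Speed encodings cover the generations up to PCIe 3.0 only.
LINK_SPEEDS = {1: "2.5GT/s", 2: "5GT/s", 3: "8GT/s"}

def decode_lnkcap(value):
    """Return (max link speed, max link width) from a raw LnkCap value."""
    speed = value & 0xF           # bits 3:0, Max Link Speed
    width = (value >> 4) & 0x3F   # bits 9:4, Max Link Width
    return LINK_SPEEDS.get(speed, "unknown"), width

# Hypothetical raw values, not taken from any device discussed here.
print(decode_lnkcap(0x103))
print(decode_lnkcap(0x041))
```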
19:21karolherbst: Lyude: maybe we just want to set PCI_DEV_FLAGS_NO_D3 on all nvidia GPUs :/
19:21karolherbst: but D2 and D1 are equally bad
19:23karolherbst: Lekensteyn: do you know if it's possible on windows to trace what happens on the PCI config with a device?
19:36Lekensteyn: karolherbst: If you can run stuff in QEMU, you can do VFIO tracing, but I don't know how to do it natively in Windows
19:53karolherbst: mhh, qemu might be enough indeed
20:24imirkin: it can be a pain to pass a laptop gpu through
20:29karolherbst: yeah.. especially all the ACPI stuff
20:53karolherbst: Lekensteyn: mhhh, okay, anyway, here is the thing: 1. set PCIe link speed to 2.5 on the GPU 2. put GPU into D3 state via PCI config (PM+0x4, 0x64 on nv GPUs) 3. invoke ACPI methods to power down (\_SB.PCI0.PEG0.PG00._OFF)
20:53karolherbst: 4. dead on powering up (\_SB.PCI0.PEG0.PG00._ON)
20:53karolherbst: that's really all there is to it
20:53karolherbst: I have a trivial script to reproduce that from userspace
20:54karolherbst: I just don't know why step 1+2 are triggering this issue
20:55karolherbst: 1. is done via devinit (some script inside the vbios) 2. is done by the kernel runpm pci core code 3. is done by the acpi code (pci invokes a platform method on the bridge to power it down, which invokes those ACPI methods, not through the GPU device)
20:55karolherbst: any thoughts on that?
20:55karolherbst:trying to read the PCIe 3.0 specification
20:58karolherbst: good that the PCIe spec is only 860 pages...
21:04karolherbst: funky, there is a "D0uninitialized" state in the spec :)
22:41karolherbst: imirkin: ohh, maybe that fan issue is something cool to start with for the intern we've got :) shouldn't be too hard? Or do you have any bug in mind which is even simpler?
22:51imirkin: i have nothing in mind
23:39Lekensteyn: karolherbst: the interaction with ACPI power resources triggers stuff, so if you decide to pass the device to QEMU, you should try a stub ACPI table as well with relevant methods
23:40Lekensteyn: I successfully traced the PCI config space before using https://github.com/Lekensteyn/acpi-stuff/blob/master/d3test/patches/qemu-trace.diff
23:41Lekensteyn: but in order to catch what devinit does you might have to catch mmio somehow. it might be worth trying, I never got further though since Windows 10 failed to initialize the GPU
23:41karolherbst: well, I already know what devinit does
23:41karolherbst: Lekensteyn: but do you have the pci trace somewhere?
23:42Lekensteyn: I have some for the PCI config at https://github.com/Lekensteyn/acpi-stuff/tree/master/d3test/XPS9560/slogs, but not mmio
23:42Lekensteyn: lspci are in the parent directory
23:42karolherbst: Lekensteyn: the logs are all from the GPU device, right?
23:43Lekensteyn: and the parent PCIe port device (ioh3420)
23:43Lekensteyn: sometimes I also write hints in commit messages
23:43karolherbst: it writes 0xfee0100c into 0x64
23:44karolherbst: that's for the parent pcie port
23:44Lekensteyn: note that my pci config space might be different than the actual device due to pci capabilities at different offsets
23:44Lekensteyn: (than the actual device -> than the device on bare metal)
23:44karolherbst: Lekensteyn: interesting
23:44karolherbst: windows 10 never puts the device into d3 via the pci config
23:45karolherbst: that would require writing the 0x3 bits into 0x64
23:45karolherbst: 1 for D1, 2 for D2...
23:45karolherbst: ohh, later :/
23:45karolherbst: "email@example.com:vfio_pci_write_config (0000:01:00.0, @0x64, 0xb, len=0x2)"
23:45karolherbst: there it is
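To make the 0xb value concrete, here is a minimal sketch of decoding a PMCSR write. It assumes the standard PCI Power Management layout: PowerState in bits 1:0, No_Soft_Reset in bit 3, PME_En in bit 8, PME_Status in bit 15. The function name is hypothetical. Bit 3 is read-only per the PCI PM spec, so the 1 written there is presumably just the value read back.

```python
# Sketch: decode a value written to the PCI PM control/status register (PMCSR).
POWER_STATES = ("D0", "D1", "D2", "D3hot")

def decode_pmcsr(value):
    """Split a raw PMCSR value into its commonly inspected fields."""
    return {
        "power_state": POWER_STATES[value & 0x3],  # bits 1:0
        "no_soft_reset": bool(value & 0x8),        # bit 3, read-only
        "pme_enable": bool(value & 0x100),         # bit 8
        "pme_status": bool(value & 0x8000),        # bit 15
    }

# The write from the trace line above: @0x64, 0xb
print(decode_pmcsr(0xB))
```

The low two bits of 0xb are 0x3, so this write does request D3hot.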
23:46Lekensteyn: see also the README in https://github.com/Lekensteyn/acpi-stuff/tree/master/d3test
23:46Lekensteyn: "Tested with Windows 10 and confirmed that the PMCSR state register is written (D3) before calling the corresponding PG00._OFF ACPI methods."
23:46karolherbst: Lekensteyn: maybe that's actually helpful... maybe some of the other writes make it work
23:46karolherbst: Lekensteyn: do you know what's 0x3c?
23:46karolherbst: ohh, that's pci core stuff
23:47karolherbst: mh... nothing of any interest?
23:48karolherbst: Lekensteyn: heh.. why is it calling _OFF so many times?
23:48karolherbst: and this _ON without touching the GPU is also weird
23:49karolherbst: and this initial ._OFF ._ON cycle is also strange
23:49Lekensteyn: no idea why _OFF is called twice at shutdown, and maybe it is called once at startup as some form of reset? I guess this is not specified by the spec though
23:51karolherbst: maybe workarounds to make it work
23:51karolherbst: I could just parse this log and run it locally and see what happens
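A rough sketch of what parsing such a log could look like. It assumes every write follows the `vfio_pci_write_config (dev, @offset, value, len=n)` shape quoted earlier and ignores whatever prefix precedes it. The function name is hypothetical, and actually replaying the writes on hardware is left out.

```python
import re

# Matches the write format quoted earlier in the log, e.g.
# "...:vfio_pci_write_config (0000:01:00.0, @0x64, 0xb, len=0x2)"
TRACE_RE = re.compile(
    r"vfio_pci_write_config \((?P<dev>[0-9a-fA-F:.]+), "
    r"@(?P<offset>0x[0-9a-fA-F]+), (?P<value>0x[0-9a-fA-F]+), "
    r"len=(?P<len>0x[0-9a-fA-F]+)\)"
)

def parse_config_writes(lines):
    """Return (device, offset, value, length) for each config write found."""
    writes = []
    for line in lines:
        m = TRACE_RE.search(line)
        if m:
            writes.append((m["dev"], int(m["offset"], 16),
                           int(m["value"], 16), int(m["len"], 16)))
    return writes

sample = ["prefix:vfio_pci_write_config (0000:01:00.0, @0x64, 0xb, len=0x2)"]
print(parse_config_writes(sample))
```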
23:52karolherbst: thing is just, I don't know when devinit is executed
23:52Lekensteyn: here is one with the names for the registers http://sprunge.us/5W6pb2
23:53karolherbst: mhh this initial ._OFF/._ON cycle looks like a workaround for something stupid
23:53Lekensteyn: <slogs/win10-rp-enable-disable.txt ../../pciconfig/pciconfig.py --config lspci-vm-vfio.txt --filter rp -s :1c | ../../pciconfig/pciconfig.py --config lspci-vm-vfio.txt --filter vfio -s 1:
23:54Lekensteyn: (that was the command to generate this annotated output from the d3test/XPS9560 directory)
23:54karolherbst: but I think I was actually seeing it on my older system as well where I had this cycle on boot
23:54karolherbst: the second call to _OFF looks like the real thing
23:55karolherbst: just... that the GPU isn't put into d3
23:55karolherbst: mhh, maybe that's before the nvidia driver was actually loaded
23:56karolherbst: that's a few seconds after booting
23:56karolherbst: so that's probably it
23:56Lekensteyn: one issue with this trace method is that the PCIe port is emulated
23:56karolherbst: the kernel just probed the device of what it is, the nvidia driver did some sanity checks, but left the device alone as nothing was using it
23:56karolherbst: and then it was just shut down
23:57karolherbst: anyway, the third call into _OFF looks like the real deal
23:57karolherbst: especially because of the _DSM call as well
23:58karolherbst: I am wondering
23:58karolherbst: Lekensteyn: do you know what that _DSM call does?
23:58karolherbst: the last one
23:58Lekensteyn: the 0x100 0x1a 0x0?
23:59karolherbst: 0x4 as the last number
23:59karolherbst: but yeah