03:35 rhyskidd: karolherbst: sent you a R-b on your hwmon temp sensor patch for runpm-managed gpus
03:36 rhyskidd: thanks for your recent work, with that + other patch series lots of rough edges being fixed up for my pascal
08:22 karolherbst: rhyskidd: problem is just, that the runpm fixes make no sense... sadly. They are more like a workaround fixing nothing :/
08:22 karolherbst: there are a few weird things going on with the GPU
08:23 karolherbst: devinit reduces the pcie link speed to 2.5 and pci core is putting the GPU into d3hot via the pci config (0x64 on nv gpus / 0x88064 in mmio)
08:23 karolherbst: when both of those things happen, the bridge is reporting a fatal error on resuming (via ACPI) and the GPU is dead
08:24 karolherbst: the most recent runpm patch I sent out just skips the entire runpm stuff for the GPU, not triggering those conditions
08:24 karolherbst: the device still gets turned off, because the ACPI method is still invoked when suspending the bridge, hence you still get the power savings :(
13:08 rhyskidd: so you think your most recent patch "drm: don't set the pci power state if the pci subsystem handles the ACPI bits" is more likely correct?
13:15 karolherbst: no
17:52 karolherbst: Lyude: I think I know how to fix this runpm issue... just need to try something out
18:00 Lyude: karolherbst: Awesome!
18:07 imirkin: Lyude: did you see the i2c regression thing?
18:08 Lyude: imirkin: no, what's up?
18:08 imirkin: see nouveau@
18:09 imirkin: the commit hash in the email is bad
18:09 imirkin: 342406e4fbba9a174125fbfe6aeac3d64ef90f76
18:13 Lyude: imirkin: what is the subject line?
18:13 imirkin: NOUVEAU_LEGACY_CTX_SUPPORT Fan Speed
18:13 imirkin: (obviously related to i2c, i know)
18:15 Lyude: imirkin: hm, so it's an nv50 gpu
18:15 imirkin: unclear if it's true nv50 or nv50-era
18:16 imirkin: there were some marked i2c changes around g94, iirc
18:16 Lyude: yeah-I was about to say the same thing, I'll asked
18:16 Lyude: *ask
18:16 Lyude: and yeah there were. it sounds like we might be reading out the fan speed without enabling the i2c bus, but I'll have to see when I get into the office on monday
18:17 imirkin: do you have a collection of gpu's available there?
18:17 Lyude: yes
18:17 imirkin: coolio
18:17 Lyude: hopefully I have one that can reproduce this issue
18:18 imirkin: fwiw i don't have a g80 myself. ben definitely does.
18:21 imirkin: you can get them fairly cheap on ebay too if needed
18:21 imirkin: just look for ones with funny amounts of memory, like 320/640mb
19:05 karolherbst: that reminds me, I still have to fix the fan issue on my two g84
19:05 karolherbst: the fan speed is in reverse order :)
19:08 imirkin: the hotter it gets, the slower it goes?
19:09 karolherbst: exactly
19:09 imirkin: what could go wrong with *that* policy ... :)
19:09 karolherbst: well, the policy is fine, it's just that 100% fan speed means the fan stops :)
19:10 karolherbst: it's probably some funky vbios bit
19:10 karolherbst: I just didn't get to it yet
19:11 karolherbst: Lyude: mhh, I couldn't try out the stuff I wanted to :/
19:11 karolherbst: I want to drop the lnkCap of the bridge to 2.5, but... well, you can't do that via the PCI config space :(
19:21 karolherbst: Lyude: maybe we just want to set PCI_DEV_FLAGS_NO_D3 on all nvidia GPUs :/
19:21 karolherbst: but D2 and D1 are equally bad
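A minimal sketch of how the link speeds discussed above can be read out of a raw PCI config space dump. It walks the capability list to the PCI Express capability and decodes the speed fields of Link Capabilities (offset 0x0C in the capability) and Link Status (offset 0x12). The helper names and the synthetic `demo_config()` dump are hypothetical and only illustrate the layout; they are not taken from nouveau or from the script mentioned in this log.

```python
import struct

PCI_STATUS = 0x06            # Status register in the config header
PCI_STATUS_CAP_LIST = 0x10   # "capability list present" bit
PCI_CAPABILITY_LIST = 0x34   # pointer to the first capability
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID
PCI_EXP_LNKCAP = 0x0C        # Link Capabilities (32 bit), relative to the capability
PCI_EXP_LNKSTA = 0x12        # Link Status (16 bit), relative to the capability

# Encoding of the 4-bit speed fields for PCIe 1.x to 3.0.
SPEEDS = {1: "2.5GT/s", 2: "5GT/s", 3: "8GT/s"}


def find_cap(cfg, cap_id):
    """Return the offset of capability `cap_id`, or None if it is absent."""
    status = struct.unpack_from("<H", cfg, PCI_STATUS)[0]
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()                      # guard against a looping list
    while pos and pos not in seen:
        seen.add(pos)
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC     # next-capability pointer
    return None


def link_speeds(cfg):
    """Return (max speed from LnkCap, current speed from LnkSta)."""
    exp = find_cap(cfg, PCI_CAP_ID_EXP)
    if exp is None:
        raise ValueError("no PCI Express capability found")
    lnkcap = struct.unpack_from("<I", cfg, exp + PCI_EXP_LNKCAP)[0]
    lnksta = struct.unpack_from("<H", cfg, exp + PCI_EXP_LNKSTA)[0]
    return (SPEEDS.get(lnkcap & 0xF, "unknown"),
            SPEEDS.get(lnksta & 0xF, "unknown"))


def demo_config():
    """Hypothetical 256-byte dump: 8GT/s capable link running at 2.5GT/s."""
    cfg = bytearray(256)
    struct.pack_into("<H", cfg, PCI_STATUS, PCI_STATUS_CAP_LIST)
    cfg[PCI_CAPABILITY_LIST] = 0x78   # made-up capability offset
    cfg[0x78] = PCI_CAP_ID_EXP
    cfg[0x79] = 0x00                  # end of capability list
    struct.pack_into("<I", cfg, 0x78 + PCI_EXP_LNKCAP, 0x3)
    struct.pack_into("<H", cfg, 0x78 + PCI_EXP_LNKSTA, 0x1)
    return bytes(cfg)


if __name__ == "__main__":
    print(link_speeds(demo_config()))
```

On a real machine the same function could be fed the bytes of the device's sysfs `config` file, read as root so that more than the first 64 bytes are returned. Link Capabilities is read-only from config space, which matches the complaint above that it cannot be lowered there.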
19:23 karolherbst: Lekensteyn: do you know if it's possible on windows to trace what happens on the PCI config with a device?
19:36 Lekensteyn: karolherbst: If you can run stuff in QEMU, you can do VFIO tracing, but I don't know how to do it natively in Windows
19:53 karolherbst: mhh, qemu might be enough indeed
20:24 imirkin: it can be a pain to pass a laptop gpu through
20:29 karolherbst: yeah.. especially all the ACPI stuff
20:53 karolherbst: Lekensteyn: mhhh, okay, anyway, here is the thing: 1. set PCIe link speed to 2.5 on the GPU 2. put GPU into D3 state via PCI config (PM+0x4, 0x64 on nv GPUs) 3. invoke ACPI methods to power down (\_SB.PCI0.PEG0.PG00._OFF)
20:53 karolherbst: 4. dead on powering up (\_SB.PCI0.PEG0.PG00._ON)
20:53 karolherbst: that's really all there is to it
20:53 karolherbst: I have a trivial script to reproduce that from userspace
20:54 karolherbst: I just don't know why step 1+2 are triggering this issue
20:55 karolherbst: 1. is done via devinit (some script inside the vbios) 2. is done by the kernel runpm pci core code 3. is done by the acpi code (pci invokes a platform method on the bridge to power it down, which invokes those ACPI methods, not through the GPU device)
20:55 karolherbst: any thoughts on that?
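Step 2 of the list above boils down to a read-modify-write of the low two bits of the PMCSR register. The following is a sketch of that single step on an in-memory copy of the config space, assuming (as stated in the chat) that the PM capability sits at 0x60 so that PMCSR is at 0x64. `set_power_state` is a hypothetical helper, not the kernel's code and not the reproduction script mentioned above, and it does not touch real hardware.

```python
import struct

PM_CAP_OFFSET = 0x60             # assumption from the chat: PMCSR at 0x64 = PM + 0x4
PCI_PM_CTRL = 0x04               # PMCSR offset inside the PM capability
PCI_PM_CTRL_STATE_MASK = 0x0003  # PowerState field, bits 1:0
D_STATES = {"D0": 0, "D1": 1, "D2": 2, "D3hot": 3}


def set_power_state(cfg, state):
    """Update the PowerState bits in a mutable config space copy.

    Only bits 1:0 are changed; all other PMCSR bits are preserved.
    Returns the new 16-bit PMCSR value.
    """
    off = PM_CAP_OFFSET + PCI_PM_CTRL
    pmcsr = struct.unpack_from("<H", cfg, off)[0]
    pmcsr = (pmcsr & ~PCI_PM_CTRL_STATE_MASK & 0xFFFF) | D_STATES[state]
    struct.pack_into("<H", cfg, off, pmcsr)
    return pmcsr


if __name__ == "__main__":
    cfg = bytearray(256)
    struct.pack_into("<H", cfg, 0x64, 0x0008)   # hypothetical starting value, D0
    print(hex(set_power_state(cfg, "D3hot")))
```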
20:55 karolherbst: trying to read the PCIe 3.0 specification
20:58 karolherbst: good that the PCIe spec is only 860 pages...
21:04 karolherbst: funky, there is a "D0uninitialized" state in the spec :)
22:41 karolherbst: imirkin: ohh, maybe that fan issue is something cool to start with for the intern we've got :) shouldn't be too hard? Or do you have any bug in mind which is even simpler?
22:51 imirkin: i have nothing in mind
23:39 Lekensteyn: karolherbst: the interaction with ACPI power resources triggers stuff, so if you decide to pass the device to QEMU, you should try a stub ACPI table as well with relevant methods
23:40 Lekensteyn: I successfully traced the PCI config space before using https://github.com/Lekensteyn/acpi-stuff/blob/master/d3test/patches/qemu-trace.diff
23:41 Lekensteyn: but in order to catch what devinit does you might have to catch mmio somehow. it might be worth trying, I never got further though since Windows 10 failed to initialize the GPU
23:41 karolherbst: well, I already know what devinit does
23:41 karolherbst: Lekensteyn: but do you have the pci trace somewhere?
23:42 Lekensteyn: I have some for the PCI config at https://github.com/Lekensteyn/acpi-stuff/tree/master/d3test/XPS9560/slogs, but not mmio
23:42 Lekensteyn: lspci are in the parent directory
23:42 karolherbst: Lekensteyn: the logs are all from the GPU device, right?
23:43 Lekensteyn: and the parent PCIe port device (ioh3420)
23:43 Lekensteyn: sometimes I also write hints in commit messages
23:43 karolherbst: mhh
23:43 karolherbst: it writes 0xfee0100c into 0x64
23:43 karolherbst: ohh
23:44 karolherbst: that's for the parent pcie port
23:44 Lekensteyn: note that my pci config space might be different than the actual device due to pci capabilities at different offsets
23:44 Lekensteyn: (than the actual device -> than the device on bare meta)
23:44 Lekensteyn: metal*
23:44 karolherbst: Lekensteyn: interesting
23:44 karolherbst: windows 10 never puts the device into d3 via the pci config
23:45 karolherbst: that would require writing the 0x3 bits into 0x64
23:45 karolherbst: 1 for D1, 2 for D2...
23:45 karolherbst: ohh, later :/
23:45 karolherbst: "2571@1535108878.397193:vfio_pci_write_config (0000:01:00.0, @0x64, 0xb, len=0x2)"
23:45 karolherbst: there it is
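A small sketch of how trace lines like the one quoted above can be filtered for power state changes: it parses a `vfio_pci_write_config` line and, for writes to offset 0x64 (PMCSR on these GPUs per the chat), decodes bits 1:0 as the requested D-state. The function name and regular expression are hypothetical and only match the line format shown in this log.

```python
import re

# Matches the format of the trace line quoted in the chat.
TRACE_RE = re.compile(
    r"vfio_pci_write_config \((?P<dev>[0-9a-fA-F:.]+), "
    r"@0x(?P<addr>[0-9a-fA-F]+), 0x(?P<val>[0-9a-fA-F]+), "
    r"len=0x(?P<len>[0-9a-fA-F]+)\)"
)

PMCSR_OFFSET = 0x64   # assumption from the chat: PM capability + 0x4 on nv GPUs
STATES = ("D0", "D1", "D2", "D3hot")


def decode_pmcsr_write(line):
    """Return (device, D-state) for a PMCSR write, or None for other lines."""
    m = TRACE_RE.search(line)
    if m is None or int(m.group("addr"), 16) != PMCSR_OFFSET:
        return None
    val = int(m.group("val"), 16)
    return m.group("dev"), STATES[val & 0x3]   # PowerState is bits 1:0


if __name__ == "__main__":
    line = ("2571@1535108878.397193:vfio_pci_write_config "
            "(0000:01:00.0, @0x64, 0xb, len=0x2)")
    print(decode_pmcsr_write(line))
```

Note that the same offset on another device (for example the emulated parent port) may belong to a different register, so the result should be filtered by device address before drawing conclusions.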
23:46 Lekensteyn: see also the README in https://github.com/Lekensteyn/acpi-stuff/tree/master/d3test
23:46 Lekensteyn: "Tested with Windows 10 and confirmed that the PMCSR state register is written (D3) before calling the corresponding PG00._OFF ACPI methods."
23:46 karolherbst: Lekensteyn: maybe that's actually helpful... maybe some of the other writes make it work
23:46 karolherbst: Lekensteyn: do you know what's 0x3c?
23:46 karolherbst: ohh, that's pci core stuff
23:47 karolherbst: mh... nothing of any interest?
23:48 karolherbst: Lekensteyn: heh.. why is it calling _OFF so many times?
23:48 karolherbst: and this _ON without touching the GPU is also weird
23:49 karolherbst: and this initial ._OFF ._ON cycle is also strange
23:49 Lekensteyn: no idea why _OFF is called twice at shutdown, and maybe it is called once at startup as some form of reset? I guess this is not specified by the spec though
23:51 karolherbst: maybe workarounds to make it work
23:51 karolherbst: I could just parse this log and run it locally and see what happens
23:52 karolherbst: thing is just, I don't know when devinit is executed
23:52 Lekensteyn: here is one with the names for the registers http://sprunge.us/5W6pb2
23:53 karolherbst: mhh this initial ._OFF/._ON cycle looks like a workaround for something stupid
23:53 Lekensteyn: <slogs/win10-rp-enable-disable.txt ../../pciconfig/pciconfig.py --config lspci-vm-vfio.txt --filter rp -s :1c | ../../pciconfig/pciconfig.py --config lspci-vm-vfio.txt --filter vfio -s 1:
23:54 Lekensteyn: (that was the command to generate this annotated output from the d3test/XPS9560 directory)
23:54 karolherbst: but I think I was actually seeing it on my older system as well where I had this cycle on boot
23:54 karolherbst: the second call to _OFF looks like the real thing
23:55 karolherbst: just... that the GPU isn't put into d3
23:55 karolherbst: mhh, maybe that's before the nvidia driver was actually loaded
23:56 karolherbst: that's a few seconds after booting
23:56 karolherbst: so that's probably it
23:56 Lekensteyn: likely
23:56 Lekensteyn: one issue with this trace method is that the PCIe port is emulated
23:56 karolherbst: the kernel just probed the device to see what it is, the nvidia driver did some sanity checks, but left the device alone as nothing was using it
23:56 karolherbst: and then it was just shut down
23:57 karolherbst: anyway, the third call into _OFF looks like the real deal
23:57 karolherbst: especially because of the _DSM call as well
23:57 karolherbst: mhhhhhhhh
23:57 karolherbst: mhhhhhhh
23:58 karolherbst: I am wondering
23:58 karolherbst: Lekensteyn: do you know what that _DSM call does?
23:58 karolherbst: the last one
23:58 Lekensteyn: the 0x100 0x1a 0x0?
23:59 karolherbst: yeah
23:59 karolherbst: uhm
23:59 karolherbst: 0x4 as the last number
23:59 karolherbst: but yeah