10:34 karolherbst: Lekensteyn: do you know if the root bus is set to D3?
10:34 karolherbst: maybe it is just that
10:37 Lekensteyn: karolherbst: on Windows? Not sure
10:40 karolherbst: although I am sure it is the pci device
10:40 karolherbst: and we upset it in some way
10:40 karolherbst: uhm
10:40 karolherbst: the nvidia gpu
10:44 karolherbst: skeggsb_: do you have any ideas?
10:44 karolherbst: why putting the nvidia GPU (or the bus it is on) into the D3 state via the pci config space doesn't allow us to resume the GPU?
12:36 karolherbst: Lekensteyn: soo, it seems like I can put the bus into D3 without causing issues
14:33 karolherbst: uhh nice, we can just use the kepler stuff to set the pcie link on pascal
14:33 karolherbst: at least something....
14:35 karolherbst: glxgears 3840x2160 51 -> 85 fps \o/
14:38 imirkin: battery life 3h -> 1h
14:41 cliff-hm: need new battery?
14:41 karolherbst: imirkin: yeah well, if you run glxgears all the time...
14:41 karolherbst: but maybe I should check the idle consumption
14:41 karolherbst: _could_ make a difference
14:41 karolherbst: thing is, I can't use the GPU sensor
14:42 karolherbst: imirkin: ohh, btw, did you see my runpm "fix"?
14:43 imirkin: no, but i've been busy
14:43 karolherbst: although, you don't know _that_ much about PCI devices in general to give me any clues?
14:43 karolherbst: https://gist.github.com/karolherbst/3cde7028a6b885ca42863b6f6320658c
14:43 karolherbst: or maybe you do, I don't know :)
14:43 imirkin: i gtg =/
14:43 karolherbst: :(
14:43 karolherbst: okay
14:44 karolherbst: well short story: setting D3 via pci config breaks runpm :)
14:50 karolherbst: ohh wow, if I am not mistaken 2.5 -> 8.0 speed increases total system power consumption by 2W
18:40 karolherbst: Lekensteyn: do you have a log from windows with all the PCI config space writes + ACPI invocations?
19:10 tertle||eltret: hi
19:11 tertle||eltret: hi
21:51 Lyude: hm. skeggsb_, https://paste.fedoraproject.org/paste/OAmqhFDY~8TBMbkZE-yt6w I think that explains the random golden context issues on this P50
21:53 Lyude: (our gr init can get triggered from outside of drm_load while drm_load() appears to still be finishing)
21:54 karolherbst: Lyude: :O
21:54 karolherbst: evil
21:54 Lyude: i think i have an idea how to fix that though!
21:54 karolherbst: Lyude: that trace looks like secboot failing though
21:54 karolherbst: or some other crazy stuff
21:54 Lyude: karolherbst: maxwell1, no sec boot
21:54 karolherbst: uhh
21:55 Lyude: ...it doesn't, right?
21:55 Lyude: i thought that started on maxwell2
21:55 karolherbst: yeah
21:56 karolherbst: mhh
21:56 karolherbst: could be that the falcon wasn't properly setup then
21:56 skeggsb_: Lyude: that can't happen, it's mutexed
21:56 karolherbst: but if gr init is triggered before we booted the entire GPU, then yeah
21:56 karolherbst: skeggsb_: what if that gets triggered before we actually run the engine init stuff?
21:56 Lyude: skeggsb_: does it make a difference then that it's getting called from a worker we schedule?
21:57 skeggsb_: nope, shouldn't do, engines are meant to be init'd on demand
21:57 karolherbst: Lyude: maybe it makes sense to increase debug level and see what engines/subdevs are up
21:57 Lyude: like, the thing I always see happen right when gr init stops working is really early DP AUX debugging output
21:57 karolherbst: skeggsb_: ohh, right
21:57 Lyude: on runs where it doesn't happen, that aux output happens a lot later
21:58 karolherbst: skeggsb_: did you see my comment about D3 and PCI config space?
21:58 Lyude: jfyi though: that branch is one where I've moved output_poll_changed and connector probing (in response to hpd) into their own workers
21:59 karolherbst: skeggsb_: apparently, this kernel workaround fixes runpm on those fancy new laptops: https://gist.github.com/karolherbst/3cde7028a6b885ca42863b6f6320658c
21:59 skeggsb_: yeah, i've been somewhat following (have been sick this week though, so not around much)
22:00 karolherbst: that pci_write_config_word basically writes the DX state into the "Power Management version 3" PCI config space thing
22:00 karolherbst: and for whatever reason, this prevents basically everything from accessing the GPU after we resume it
22:00 karolherbst: even the UEFI isn't able to access it anymore
22:00 karolherbst: or well, the ACPI firmware at least
22:01 karolherbst: any idea why that might happen?
22:01 karolherbst: (with that fixed we still have secboot failing if we performed secboot before suspending)
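For readers unfamiliar with the config space write being discussed: the D state lives in the two low bits of the PMCSR register, which sits at offset 4 inside the PCI Power Management capability. The sketch below only models that bit layout; it does not touch hardware, and the helper names `pmcsr_with_state` and `pmcsr_state` are made up for illustration. Note that D3cold has no encoding in this register, since it is the state a device is in once power has been removed.

```python
# Layout of the PCI Power Management capability's control/status register.
# The two constants match PCI_PM_CTRL and PCI_PM_CTRL_STATE_MASK in Linux's
# pci_regs.h; the helper functions below are hypothetical.
PCI_PM_CTRL = 4                  # offset of PMCSR from the capability start
PCI_PM_CTRL_STATE_MASK = 0x0003  # bits 1:0 select the power state

D_STATES = {"D0": 0, "D1": 1, "D2": 2, "D3hot": 3}
D_NAMES = {bits: name for name, bits in D_STATES.items()}


def pmcsr_with_state(pmcsr: int, state: str) -> int:
    """Return the 16-bit PMCSR value with only the power-state bits replaced."""
    kept = pmcsr & 0xFFFF & ~PCI_PM_CTRL_STATE_MASK
    return kept | D_STATES[state]


def pmcsr_state(pmcsr: int) -> str:
    """Decode the power state currently selected in a PMCSR value."""
    return D_NAMES[pmcsr & PCI_PM_CTRL_STATE_MASK]


if __name__ == "__main__":
    before = 0x0008                      # example value with the state bits at D0
    after = pmcsr_with_state(before, "D3hot")
    print(hex(after), pmcsr_state(after))
```

The write skipped by the workaround corresponds to storing a value like `after` at capability offset `PCI_PM_CTRL`; whether skipping it is correct is exactly what the discussion below is about.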
22:01 skeggsb_: not a clue really, but it might be an idea to post on lkml or something and see if the pci/acpi subsystem people have any ideas?
22:01 karolherbst: they haven't until now
22:01 karolherbst: maybe I should post to lkml
22:02 mjg59: If you're not writing the D state into the PCI config space you're not actually suspending the chip in the PCI sense
22:02 karolherbst: mjg59: well, right
22:02 mjg59: If you then call _OFF or whatever then that's D3cold, which is a kind of separate state
22:02 karolherbst: mjg59: but we put it inside D3cold from ACPI perspective
22:03 karolherbst: I mean, sure, but why would the GPU care _that_ much
22:03 mjg59: So I guess the question is whether we should be putting something in D3warm if we're actually transitioning to D3cold?
22:03 karolherbst: especially if not using nouveau, but a stub driver it works
22:03 mjg59: Or whether that should be skipped?
22:03 karolherbst: I am sure we do something wrong inside nouveau
22:03 karolherbst: because without nouveau it works
22:03 mjg59: Hrm
22:04 karolherbst: mjg59: that is my stub driver: https://gist.github.com/karolherbst/73e6d053ac38613329a75042a3c5b2af
22:05 karolherbst: kind of annoying to debug that, because it could be like everything inside nouveau
22:05 mjg59: So what's the behaviour you see? On resume the device never responds to register writes?
22:05 karolherbst: *reads
22:05 karolherbst: yes
22:05 mjg59: Ok
22:05 karolherbst: basically the pci subsystem returns -1 on each read
22:05 mjg59: Try just stubbing out the nouveau suspend method?
22:05 karolherbst: and no error
22:05 karolherbst: already done
22:05 karolherbst: doesn't work
22:05 mjg59: Heh
22:05 karolherbst: you can skip the basics ;)
22:05 mjg59: Then yeah sounds like the device is in some funky state
22:05 karolherbst: yeah
22:06 mjg59: So does lspci just give back garbage?
22:06 karolherbst: yeah
22:06 karolherbst: rev ff and whatnot
22:06 karolherbst: a bit silly because lspci actually uses cached stuff by default
22:06 mjg59: Try lspci -H1 and lspci -H2
22:06 mjg59: If you haven't
22:06 mjg59: That should force lspci to hit the config registers directly
22:06 karolherbst: I mean, I know that the stuff is messed up as our envytools tools also fail
22:06 karolherbst: which basically talk via libpciaccess to the device
22:07 karolherbst: mjg59: I even did ACPI calls which read from the device memory
22:07 mjg59: libpciaccess is probably going via the kernel, so using H1 lets you rule out that the kernel is screwing up somehow and the device is actually alive
22:07 karolherbst: and those also return -1
22:07 mjg59: But if it's really not on the bus then yeah
22:07 mjg59: Not an easy way to handle that
22:07 karolherbst: not really
22:08 karolherbst: also removing the device and rescanning doesn't bring it back
22:08 karolherbst: only system suspend/resume and a rescan after does
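The "-1 on each read" and "ff" symptoms above are the same thing seen at different widths: a read that nothing on the bus answers comes back with every bit set. A minimal sketch of that check follows, with hypothetical helper names (`looks_absent`, `as_signed`); it works on values already read and does not access any device.

```python
def looks_absent(value: int, width_bits: int = 32) -> bool:
    """True if a register read of the given width came back as all ones."""
    all_ones = (1 << width_bits) - 1
    return (value & all_ones) == all_ones


def as_signed(value: int, width_bits: int = 32) -> int:
    """Interpret an unsigned register value as a two's-complement integer."""
    value &= (1 << width_bits) - 1
    if value >= 1 << (width_bits - 1):
        value -= 1 << width_bits
    return value


if __name__ == "__main__":
    # A 32-bit all-ones read is what shows up as -1 when treated as signed.
    print(looks_absent(0xFFFFFFFF), as_signed(0xFFFFFFFF))
    # A 16-bit vendor ID read of 0x10de (NVIDIA) means the device answered.
    print(looks_absent(0x10DE, width_bits=16))
```

An all-ones value is not proof on its own, since a register can legitimately hold it, but an all-ones vendor ID is a strong hint that the device has dropped off the bus.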
22:08 mjg59: So what your hack is doing is skipping setting to D3warm when you're transitioning to D3cold
22:08 mjg59: Which doesn't sound like a bad thing?
22:08 karolherbst: yeah
22:08 karolherbst: well
22:08 karolherbst: dunno
22:08 mjg59: I mean
22:08 karolherbst: Lekensteyn mentioned on windows they do it
22:08 mjg59: Skip it, or do it?
22:08 karolherbst: set to d3hot
22:08 mjg59: Hrm
22:08 mjg59: Bleah
22:08 karolherbst: well
22:08 karolherbst: the GPU boots into D0 anyway
22:09 karolherbst: and the bus as well
22:09 mjg59: Do a register dump, write those out in your stub driver, see if it works or breaks?
22:09 karolherbst: what kind of register dump?
22:09 mjg59: All the mmio register regions
22:09 karolherbst: I mean, all the runpm stuff is handled inside the pci subsystem anyway
22:09 karolherbst: so it basically does the same thing
22:09 karolherbst: mhhh
22:10 karolherbst: they are quite huge
22:10 mjg59: Yup
22:10 karolherbst: and some things shouldn't be just read out
22:10 karolherbst: as the GPU might get messed up
22:10 mjg59: But that'd (with luck) let you know whether there's a single register that you're setting that's changing card behaviour here
22:11 karolherbst: mhhh
22:11 karolherbst: finding that single register in a 256MB mmio space, doesn't sound like that great of a plan to me
22:11 mjg59: Bisecting is fast
22:12 skeggsb_: it's "only" 16MiB anyway ;)
22:12 karolherbst: uhh, the 256MB was the aperture stuff? or something?
22:12 karolherbst: and what's the 32MB region anyway?
22:13 skeggsb_: bar1(fb) is 256MiB
22:13 karolherbst: ahh, okay
22:13 karolherbst: mjg59: I mean, what does your plan for bisecting that look like? Just stubbing out reads/writes to all regs outside a range?
22:14 skeggsb_: the 32MiB one is "PRAMIN", meant for kernel mappings of various stuff
22:14 mjg59: karolherbst: Yeah
22:14 karolherbst: mjg59: mhhhh....
22:14 mjg59: karolherbst: I fixed a *lot* of stuff this way a decade ago :)
22:14 karolherbst: mjg59: that wouldn't be so bad if you didn't kind of depend on reads from other regs inside your writes :p
22:14 mjg59: Trick is knowing when to give up trying to fix a bug intelligently…
22:14 karolherbst: but uhm, maybe it might help?
22:14 mjg59: Alternatively, log all the register writes that nouveau does and just replay those
22:15 karolherbst: I guess asking nvidia might save me a lot of time here actually
22:15 karolherbst: I mean, I can say: D3hot state write messes up, why?
22:15 mjg59: And keep cutting back until you find something
22:15 karolherbst: mjg59: ohh, yeah, that sounds a bit more promising
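The "log the writes, replay them, keep cutting back" idea can be sketched as a binary search over prefixes of the write log. This is only an illustration under strong assumptions: the log format (offset, value pairs), the function name `first_breaking_write`, and the `breaks` callback are all hypothetical, and it assumes a single culprit write, so that any replayed prefix containing it reproduces the failure. In practice each `breaks` call would mean replaying the prefix from the stub driver and running a suspend/resume cycle.

```python
from typing import Callable, Sequence, Tuple

Write = Tuple[int, int]  # (register offset, value); hypothetical log format


def first_breaking_write(
    log: Sequence[Write],
    breaks: Callable[[Sequence[Write]], bool],
) -> int:
    """Return the index of the first write whose inclusion triggers the failure.

    Assumes breaks(log) is True, breaks of the empty log is False, and that
    once a prefix breaks, every longer prefix breaks as well.
    """
    if breaks(log[:0]) or not breaks(log):
        raise ValueError("need a log that breaks and an empty log that does not")

    lo, hi = 0, len(log)  # invariant: log[:lo] is fine, log[:hi] breaks
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if breaks(log[:mid]):
            hi = mid
        else:
            lo = mid
    return hi - 1


if __name__ == "__main__":
    # Fake log of 16 writes where the write at index 11 is the culprit.
    log = [(0x100 * i, i) for i in range(16)]
    culprit = log[11]
    print(first_breaking_write(log, lambda prefix: culprit in prefix))
```

With N logged writes this needs roughly log2(N) replay cycles, which is why bisecting stays fast even for a large log. It does not handle the case raised above where a write's value depends on an earlier read, nor failures that need a combination of writes.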
22:17 karolherbst: skeggsb_: we do a full GPU rePOST on resume, right?
22:17 skeggsb_: of course
22:17 karolherbst: mhhh
22:17 karolherbst: wondering what secboot does to the GPU, so that after powering it down, secboot fails after resuming
22:18 karolherbst: because if we don't secboot, secboot doesn't fail after resume
22:18 karolherbst: only if we did it before suspending
22:18 skeggsb_: i still think that particular issue is just a symptom of the resume issues
22:19 karolherbst: yeah.. maybe
22:19 karolherbst: I mean, this is after I removed those d3hot writes
22:19 karolherbst: and the GPU resumes correctly
22:19 karolherbst: just secboot fails
22:19 karolherbst: well "correctly"
22:20 karolherbst: we don't know
22:21 skeggsb_: what gpu is this btw?
22:21 karolherbst: gp107
22:22 karolherbst: ohh right, I also have that engine off/on hack applied
22:22 karolherbst: because I kind of need that one as well
22:24 skeggsb_: do you still actually need that btw? i saw something similar on gv100 when i initially started bringing it up, and it went away
22:24 karolherbst: mhhh
22:24 skeggsb_: (after various other fixes)
22:24 karolherbst: yeah, I think so
22:24 karolherbst: but maybe I just need to update my tree
22:25 karolherbst: which I think I actually did last week or so
22:26 skeggsb_: on suspend, do you see "running HS unload blob" and "HS unload blob completed" in dmesg if you have secboot debugging turned on?
22:26 skeggsb_: you should, and that should mean it's undone whatever secboot init does to the gpu too..
22:27 skeggsb_: but, that stuff is all kinda a black-box, so who knows
22:29 karolherbst: will check tomorrow