10:34karolherbst: Lekensteyn: do you know if the root bus is set to D3?
10:34karolherbst: maybe it is just that
10:37Lekensteyn: karolherbst: on Windows? Not sure
10:40karolherbst: although I am sure it is the pci device
10:40karolherbst: and we upset it in some way
10:40karolherbst: uhm
10:40karolherbst: the nvidia gpu
10:44karolherbst: skeggsb_: do you have any ideas?
10:44karolherbst: why putting the nvidia GPU (or the bus it is on) into the D3 state via the pci config space doesn't allow us to resume the GPU?
12:36karolherbst: Lekensteyn: soo, it seems like I can put the bus into D3 without causing issues
14:33karolherbst: uhh nice, we can just use the kepler stuff to set the pcie link on pascal
14:33karolherbst: at least something....
14:35karolherbst: glxgears 3840x2160 51 -> 85 fps \o/
14:38imirkin: battery life 3h -> 1h
14:41cliff-hm: need new battery?
14:41karolherbst: imirkin: yeah well, if you run glxgears all the time...
14:41karolherbst: but maybe I should check the idle consumption
14:41karolherbst: _could_ make a difference
14:41karolherbst: thing is, I can't use the GPU sensor
14:42karolherbst: imirkin: ohh, btw, did you see my runpm "fix"?
14:43imirkin: no, but i've been busy
14:43karolherbst: although, you don't know _that_ much about PCI devices in general to give me any clues?
14:43karolherbst: https://gist.github.com/karolherbst/3cde7028a6b885ca42863b6f6320658c
14:43karolherbst: or maybe you do, I don't know :)
14:43imirkin: i gtg =/
14:43karolherbst: :(
14:43karolherbst: okay
14:44karolherbst: well short story: setting D3 via pci config breaks runpm :)
14:50karolherbst: ohh wow, if I am not mistaken 2.5 -> 8.0 speed increases total system power consumption by 2W
18:40karolherbst: Lekensteyn: do you have a log from windows with all the PCI config space writes + ACPI invocations?
19:10tertle||eltret: hi
19:11tertle||eltret: hi
21:51Lyude: hm. skeggsb_, https://paste.fedoraproject.org/paste/OAmqhFDY~8TBMbkZE-yt6w I think that explains the random golden context issues on this P50
21:53Lyude: (our gr init can get triggered from outside of drm_load while drm_load() appears to still be finishing)
21:54karolherbst: Lyude: :O
21:54karolherbst: evil
21:54Lyude: i think i have an idea how to fix that though!
21:54karolherbst: Lyude: that trace looks like secboot failing though
21:54karolherbst: or some other crazy stuff
21:54Lyude: karolherbst: maxwell1, no sec boot
21:54karolherbst: uhh
21:55Lyude: ...it doesn't, right?
21:55Lyude: i thought that started on maxwell2
21:55karolherbst: yeah
21:56karolherbst: mhh
21:56karolherbst: could be that the falcon wasn't properly set up then
21:56skeggsb_: Lyude: that can't happen, it's mutexed
21:56karolherbst: but if gr init is triggered before we booted the entire GPU, then yeah
21:56karolherbst: skeggsb_: what if that gets triggered before we actually run the engine init stuff?
21:56Lyude: skeggsb_: does it make a difference then that it's getting called from a worker we schedule?
21:57skeggsb_: nope, shouldn't do, engines are meant to be init'd on demand
21:57karolherbst: Lyude: maybe it makes sense to increase debug level and see what engines/subdevs are up
21:57Lyude: like, the thing I always see happen right when gr init stops working is really early DP AUX debugging output
21:57karolherbst: skeggsb_: ohh, right
21:57Lyude: on runs where it doesn't happen, that aux output happens a lot later
21:58karolherbst: skeggsb_: did you see my comment about D3 and PCI config space?
21:58Lyude: jfyi though: that branch is one where I've moved output_poll_changed and connector probing (in response to hpd) into their own workers
21:59karolherbst: skeggsb_: apparently, this kernel workaround fixes runpm on those fancy new laptops: https://gist.github.com/karolherbst/3cde7028a6b885ca42863b6f6320658c
21:59skeggsb_: yeah, i've been somewhat following (have been sick this week though, so not around much)
22:00karolherbst: that pci_write_config_word basically writes the DX state into the "Power Management version 3" PCI config space thing
22:00karolherbst: and for whatever reason, this prevents basically everything from accessing the GPU after we resume it
22:00karolherbst: even the UEFI isn't able to access it anymore
22:00karolherbst: or well, the ACPI firmware at least
22:01karolherbst: any idea why that might happen?
22:01karolherbst: (with that fixed we still have secboot failing if we performed secboot before suspending)
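The config-space write described above can be modelled in a few lines. This is only a sketch of the bit manipulation, not the kernel code or the gist's patch: the constant names mirror those in Linux's pci_regs.h (`PCI_PM_CTRL`, `PCI_PM_CTRL_STATE_MASK`), while the helper functions are hypothetical names made up for illustration. Note that only D0 through D3hot can be requested through this register; D3cold is reached by removing power (for example through ACPI), not by a config write.

```python
# Sketch: how a D-state is encoded in the PCI Power Management
# Control/Status Register (PMCSR). Helper names are hypothetical.

PCI_PM_CTRL = 0x04               # PMCSR offset inside the PM capability
PCI_PM_CTRL_STATE_MASK = 0x0003  # bits 1:0 select the power state

D0, D1, D2, D3HOT = 0, 1, 2, 3   # encodings of the software-visible states


def pmcsr_with_state(pmcsr: int, state: int) -> int:
    """Return the 16-bit PMCSR value with only the power-state bits changed."""
    if state not in (D0, D1, D2, D3HOT):
        raise ValueError("D3cold and other states cannot be set via PMCSR")
    return (pmcsr & ~PCI_PM_CTRL_STATE_MASK & 0xFFFF) | state


def pmcsr_state(pmcsr: int) -> int:
    """Extract the current power-state field from a PMCSR value."""
    return pmcsr & PCI_PM_CTRL_STATE_MASK
```

For example, `pmcsr_with_state(0x0008, D3HOT)` keeps the other bits and sets the state field to 3.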
22:01skeggsb_: not a clue really, but it might be an idea to post on lkml or something and see if the pci/acpi subsystem people have any ideas?
22:01karolherbst: they haven't until now
22:01karolherbst: maybe I should post to lkml
22:02mjg59: If you're not writing the D state into the PCI config space you're not actually suspending the chip in the PCI sense
22:02karolherbst: mjg59: well, right
22:02mjg59: If you then call _OFF or whatever then that's D3cold, which is a kind of separate state
22:02karolherbst: mjg59: but we put it inside D3cold from ACPI perspective
22:03karolherbst: I mean, sure, but why would the GPU care _that_ much
22:03mjg59: So I guess the question is whether we should be putting something in D3warm if we're actually transitioning to D3cold?
22:03karolherbst: especially if not using nouveau, but a stub driver it works
22:03mjg59: Or whether that should be skipped?
22:03karolherbst: I am sure we do something wrong inside nouveau
22:03karolherbst: because without nouveau it works
22:03mjg59: Hrm
22:04karolherbst: mjg59: that is my stub driver: https://gist.github.com/karolherbst/73e6d053ac38613329a75042a3c5b2af
22:05karolherbst: kind of annoying to debug that, because it could be like everything inside nouveau
22:05mjg59: So what's the behaviour you see? On resume the device never responds to register writes?
22:05karolherbst: *reads
22:05karolherbst: yes
22:05mjg59: Ok
22:05karolherbst: basically the pci subsystem returns -1 on each read
22:05mjg59: Try just stubbing out the nouveau suspend method?
22:05karolherbst: and no error
22:05karolherbst: already done
22:05karolherbst: doesn't work
22:05mjg59: Heh
22:05karolherbst: you can skip the basics ;)
22:05mjg59: Then yeah sounds like the device is in some funky state
22:05karolherbst: yeah
22:06mjg59: So does lspci just give back garbage?
22:06karolherbst: yeah
22:06karolherbst: rev ff and whatnot
22:06karolherbst: a bit silly because lspci actually uses cached stuff by default
22:06mjg59: Try lspci -H1 and lspci -H2
22:06mjg59: If you haven't
22:06mjg59: That should force lspci to hit the config registers directly
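Reads from a device that no longer responds on the bus typically come back with every bit set, which is why the reads show up as -1 and lspci prints ff everywhere. A minimal sketch of that check follows; the function names are hypothetical, and 0x10DE is NVIDIA's PCI vendor ID, used here only as a "device is alive" example.

```python
# Sketch: recognising the all-ones pattern returned when a PCI device
# does not answer. Function names are hypothetical.


def is_all_ones(value: int, width_bytes: int) -> bool:
    """True if a read of the given width returned every bit set."""
    return value == (1 << (8 * width_bytes)) - 1


def as_signed32(value: int) -> int:
    """Interpret a 32-bit read as a signed integer (0xFFFFFFFF becomes -1)."""
    return value - (1 << 32) if value & 0x80000000 else value


def device_looks_absent(vendor_id: int) -> bool:
    """A 16-bit vendor ID of 0xFFFF means nothing answered the config read."""
    return is_all_ones(vendor_id, 2)
```

This only tells you that the read failed, not why; distinguishing a kernel-side problem from a device that is really off the bus still needs a direct config access such as the lspci -H1 suggestion.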
22:06karolherbst: I mean, I know that the stuff is messed up as our envytools stuff also fails
22:06karolherbst: which basically talks via libpciaccess to the device
22:07karolherbst: mjg59: I even did ACPI calls which read from the device memory
22:07mjg59: libpciaccess is probably going via the kernel, so using H1 lets you rule out that the kernel is screwing up somehow and the device is actually alive
22:07karolherbst: and those also return -1
22:07mjg59: But if it's really not on the bus then yeah
22:07mjg59: Not an easy way to handle that
22:07karolherbst: not really
22:08karolherbst: also removing the device and rescanning doesn't bring it back
22:08karolherbst: only system suspend/resume and a rescan after does
22:08mjg59: So what your hack is doing is skipping setting to D3warm when you're transitioning to D3cold
22:08mjg59: Which doesn't sound like a bad thing?
22:08karolherbst: yeah
22:08karolherbst: well
22:08karolherbst: dunno
22:08mjg59: I mean
22:08karolherbst: Lekensteyn mentioned on windows they do it
22:08mjg59: Skip it, or do it?
22:08karolherbst: set to d3hot
22:08mjg59: Hrm
22:08mjg59: Bleah
22:08karolherbst: well
22:08karolherbst: the GPU boots into D0 anyway
22:09karolherbst: and the bus as well
22:09mjg59: Do a register dump, write those out in your stub driver, see if it works or breaks?
22:09karolherbst: what kind of register dump?
22:09mjg59: All the mmio register regions
22:09karolherbst: I mean, all the runpm stuff is handled inside the pci subsystem anyway
22:09karolherbst: so it basically does the same thing
22:09karolherbst: mhhh
22:09karolherbst: they are quite huge
22:10mjg59: Yup
22:10karolherbst: and some things shouldn't be just read out
22:10karolherbst: as the GPU might get messed up
22:10mjg59: But that'd (with luck) let you know whether there's a single register that you're setting that's changing card behaviour here
22:11karolherbst: mhhh
22:11karolherbst: finding that single register in a 256MB mmio space, doesn't sound like that great of a plan to me
22:11mjg59: Bisecting is fast
22:12skeggsb_: it's "only" 16MiB anyway ;)
22:12karolherbst: uhh, the 256MB was the aperture stuff? or something?
22:12karolherbst: and what's the 32MB region anyway?
22:13skeggsb_: bar1(fb) is 256MiB
22:13karolherbst: ahh, okay
22:13karolherbst: mjg59: I mean, what does your plan for bisecting that look like? Just stubbing out reads/writes to all regs outside a range?
22:14skeggsb_: the 32MiB one is "PRAMIN", meant for kernel mappings of various stuff
22:14mjg59: karolherbst: Yeah
22:14karolherbst: mjg59: mhhhh....
22:14mjg59: karolherbst: I fixed a *lot* of stuff this way a decade ago :)
22:14karolherbst: mjg59: that wouldn't be so bad if you kind of depend on reads from other regs inside your writes :p
22:14mjg59: Trick is knowing when to give up trying to fix a bug intelligently…
22:14karolherbst: but uhm, maybe it might help?
22:14mjg59: Alternatively, log all the register writes that nouveau does and just replay those
22:15karolherbst: I guess asking nvidia might save me a lot of time here actually
22:15karolherbst: I mean, I can say: D3hot state write messes up, why?
22:15mjg59: And keep cutting back until you find something
22:15karolherbst: mjg59: ohh, yeah, that sounds a bit more promising
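The "log the writes, replay them, keep cutting back" idea amounts to a binary search over the write log. The sketch below assumes the breakage is monotonic (once a prefix of the log breaks resume, every longer prefix does too) and that replaying nothing works, as with the stub driver. The function name, the `breaks` callback, and all register offsets in the usage example are hypothetical; in practice `breaks` would mean replaying the writes from a stub driver and doing a suspend/resume cycle by hand.

```python
# Sketch: bisecting a logged sequence of register writes to find the
# first one after which the device misbehaves. All names are hypothetical.
from typing import Callable, Sequence, Tuple

Write = Tuple[int, int]  # (register offset, value written)


def first_breaking_prefix(
    writes: Sequence[Write],
    breaks: Callable[[Sequence[Write]], bool],
) -> int:
    """Return the smallest n such that replaying writes[:n] breaks.

    Assumes the empty prefix works and that breakage is monotonic.
    The suspect write is then writes[n - 1].
    """
    if not breaks(writes):
        raise ValueError("replaying the full log does not reproduce the bug")
    lo, hi = 0, len(writes)  # writes[:lo] works, writes[:hi] breaks
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if breaks(writes[:mid]):
            hi = mid
        else:
            lo = mid
    return hi
```

With a log of a few thousand writes this needs only a dozen or so replay cycles, which is the sense in which bisecting is fast. If writes depend on values read back from other registers, the monotonicity assumption may not hold and the result is only a starting point.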
22:17karolherbst: skeggsb_: we do a full GPU rePOST on resume, right?
22:17skeggsb_: of course
22:17karolherbst: mhhh
22:17karolherbst: wondering what secboot does to the GPU, so that after powering it down, secboot fails after resuming
22:18karolherbst: because if we don't secboot, secboot doesn't fail after resume
22:18karolherbst: only if we did it before suspending
22:18skeggsb_: i still think that particular issue is just a symptom of the resume issues
22:19karolherbst: yeah.. maybe
22:19karolherbst: I mean, this is after I removed those d3hot writes
22:19karolherbst: and the GPU resumes correctly
22:19karolherbst: just secboot fails
22:19karolherbst: well "correctly"
22:20karolherbst: we don't know
22:21skeggsb_: what gpu is this btw?
22:21karolherbst: gp107
22:22karolherbst: ohh right, I also have that engine off/on hack applied
22:22karolherbst: because I kind of need that one as well
22:24skeggsb_: do you still actually need that btw? i saw something similar on gv100 when i initially started bringing it up, and it went away
22:24karolherbst: mhhh
22:24skeggsb_: (after various other fixes)
22:24karolherbst: yeah, I think so
22:24karolherbst: but maybe I just need to update my tree
22:25karolherbst: which I think I actually did last week or so
22:26skeggsb_: on suspend, do you see "running HS unload blob" and "HS unload blob completed" in dmesg if you have secboot debugging turned on?
22:26skeggsb_: you should, and that should mean it's undone whatever secboot init does to the gpu too..
22:27skeggsb_: but, that stuff is all kinda a black-box, so who knows
22:29karolherbst: will check tomorrow