02:56rhyskidd: I've sent to the ML a revised version of the initial implementation of new gp1xx temperature sensor patch
02:58rhyskidd: would appreciate some additional testing on the range of Pascal based cards
03:07imirkin: rhyskidd: some mostly trivial comments sent.
03:38rhyskidd: imirkin: thanks for the comments
03:40rhyskidd: do you suggest that I don't even need the status variable?
03:42imirkin: and i wouldn't necessarily have the other one either, but your call
03:43rhyskidd: in my view at least having the temperature temp register is helpful, as it semantically splits out what is going on
03:46imirkin: yeah, that's totally fine..
03:46imirkin: although it's odd the way you do it
03:46imirkin: u32 t_f24_8 = (tsensor & 0x0001fff8);
03:46imirkin: should that have a >> 3 at the end?
03:47imirkin: what is the f24_8 referring to?
03:47imirkin: sounds like Z24_S8 ;)
03:50rhyskidd: fixed point number 24.8
03:51imirkin: that would imply a 32-bit value
03:51imirkin: whereas this is a 14-bit value
04:22orbea: you guys ever see an error like this with mplayer when the nvidia drivers have never been installed? "Failed to open VDPAU backend libvdpau_nvidia.so: cannot open shared object file: No such file or directory"
04:22orbea: im not sure I believe the report...
04:24imirkin: libvdpau defaults to libvdpau_nvidia.so when all else fails
04:24orbea: hmm, so it can happen
04:24imirkin: it's hardcoded into the lib
04:24orbea: good to know
04:25imirkin: but it means that all else has failed :)
04:25imirkin: or e.g. the VDPAU_DRIVER name is improperly worked out
04:27orbea: thanks for the info
04:29imirkin: it means that either it can't get the nouveau name from the DRI2 X extension, or that libvdpau_nouveau.so isn't loading
04:30imirkin: actually, forget that second reason
04:30imirkin: it means that either VDPAU_DRIVER=nvidia is set, or the DRI2 X extension is not working properly to return the 'nouveau' name.
04:31orbea: according to the person who experienced it, VDPAU_DRIVER is not set at all, so it must be the latter
04:32imirkin: perhaps wayland without Xwayland?
04:32imirkin: (or perhaps it doesn't work properly with Xwayland? i really don't know that stack)
04:33orbea: maybe, he shouldn't have wayland, but it wasn't all that clear what he did or did not have...
04:35imirkin: could be nouveau with accel disabled
04:35imirkin: in which case you don't get DRI2
04:36orbea: is there a way to check that?
04:36imirkin: xorg log would be pretty revealing. or glxinfo.
04:37orbea: i'll pass that on
14:03ccaione: hey imirkin any progress on GP107? (just to be sure in case I missed anything)
14:04imirkin: (not sure that anyone's working on it with any level of activeness)
14:04ccaione: ouch, how so?
14:05imirkin: not sure i understand the question.
14:06ccaione: :D I mean, none is working because something is missing (and in that case I can provide something since I have the hw) or for some other reasons?
14:06imirkin: developer time/interest, i suspect
14:06ccaione: hoo alright, I see
14:07imirkin: it's not just GP107... it's some GP107M's, so you have to have the right person with the right hardware
14:07imirkin: er no, you had it with a desktop board i guess?
14:08imirkin: either way, the point stands
14:08ccaione: I have a laptop with the GP107
14:08imirkin: i think rhyskidd was looking at it, dunno if anything came of it
14:08ccaione: annoying that HDMI is not working since accel is disabled and eDP is on the intel controller and HDMI on the nvidia one
14:09imirkin: wellll... HDMI is working. accel is not ;)
14:09imirkin: you could run an X server on just the HDMI port
14:09ccaione: hahah yeah, probably not something people would like unfortunately :(
14:10ccaione: it's not my personal laptop, we are trying to enable this hardware
14:10RSpliet: maybe use reverse prime to have Intel render for the HDMI output?
14:10imirkin: RSpliet: requires accel on nouveau.
14:10imirkin: (to copy the buffers)
14:10RSpliet: imirkin: for DMA? Oh...
14:10imirkin: i mean, it's not a HARD requirement -- software could do it too
14:10imirkin: but that's not how it's set up now
14:11RSpliet: imirkin: so there's no gr firmware for GP107? Or is it just plumbing that's required?
14:11imirkin: RSpliet: everything's there. his just doesn't come up.
14:12imirkin: reminiscent of the GK106M issues we had a while back, where they booted into a funky state
14:13RSpliet: Right. Yeah, that probably means queueing up for requesting valuable developer time.
17:27rhyskidd_: ccaione: re GP107 i have been looking at it tangentially for the Pascal work i'm doing
17:27rhyskidd_: so far though, i'm focused on things that will improve the whole series -- so little time to look at GP107 (or a subset of them) bring up problems
17:27rhyskidd_: but not forgotten!
17:28rhyskidd_: the mmiotraces have been helpful
17:28ccaione: :) really glad about that. Let me know if you need anything else
18:41karolherbst: imirkin: anything holding this series back? https://patchwork.freedesktop.org/series/29400/
19:08tobijk: imirkin_: and this? https://patchwork.freedesktop.org/patch/171604/ :D
19:17imirkin_: seems fine
19:19karolherbst: tobijk: mind getting this to production quality? I am quite sure that how the commit currently is, doesn't break anything, but I think there is still more potential: https://github.com/karolherbst/mesa/commit/7ff8ce7c1f53843a9cc212bf74cc8e1f6660a293
19:21karolherbst: ohh, the code in nv50_ir_lowering_nv50.cpp needs adjustments as well
19:21karolherbst: (*(*it)->getSrc(1)->defs.begin())->getInsn()->setDef(0, (*it)->getSrc(0)); -> (*it)->getSrc(1)->getInsn()->setDef(0, (*it)->getSrc(0));
19:22karolherbst: tobijk: and the ++d/--d change is indeed important, because you can't do --d on a unordered_set::iterator
19:23karolherbst: tobijk: https://github.com/karolherbst/mesa/commit/37a07ded6368dc74c87f4de9880222fef9aa5fe7
19:43rhyskidd_: ccaione: how interested are you in video decoding (i.e. h.264) on the GP107?
19:44ccaione: rhyskidd_: not interested at all right now
19:45psyk: hi guys! it's me, CMEPTb. a week ago, you helped to get me a patch for gtx770 so i could set 0f pstate. welllll. something weird. what is happening is things work great if i set 0a first, then 0f. if i try to set 0f right away - jumbled multicolored squares all over screen and computer unresponsive. the patch is still applied....
19:45psyk: imirkin or karolherbst ^^
19:46tobijk: karolherbst: not sure what you are proposing in nv50_ir_lowering_nv50.cpp exactly, do you have a snippet around?
19:46karolherbst: tobijk: I've already changed it
19:46karolherbst: psyk: interesting
19:46karolherbst: psyk: you are on 4.12?
19:46psyk: Linux vmdesk 4.12.5-gentoo #3 SMP PREEMPT Wed Aug 23 17:21:41 CDT 2017 x86_64 Intel(R) Core(M) i7-2600K CPU @ 3.40GHz GenuineIntel GNU/Linux
19:47tobijk: karolherbst: i fail to see the difference between the first link and the second, might be me right now :/
19:47karolherbst: psyk: mhh, odd
19:47karolherbst: tobijk: first change in the nv50 file
19:48karolherbst: psyk: I think the only thing which would help us right now is a mmiotrace of nvidia, so that we know what nvidia is doing
19:51psyk: karolherbst, ok. how do i do mmiotrace of nvidia
19:52karolherbst: psyk: https://wiki.ubuntu.com/X/MMIOTracing something like that
19:54psyk: karolherbst, so i would need to install the nvidia blobs and trace that?
19:54karolherbst: important part is: boot without nvidia being loaded automatically, do the mmiotrace enabling things, start X and do some stuff (like running glxgears without vsync and check with nvidia-settings that the highest clocks are used) and follow those disabling steps
19:54karolherbst: psyk: basically, yes
19:56psyk: start nvidia after enabling mmiotrace things, before getting into x?
19:56psyk: buh wish i didn't install ssdm - used to startx all the time
19:56karolherbst: you can disable those services
19:56karolherbst: if you use systemd, simply do "systemctl disable sddm" and then reboot
19:57karolherbst: and you may want to blacklist nvidia as well
19:57karolherbst: and verify with lsmod that nvidia isn't loaded before starting the mmiotrace
19:57karolherbst: there is a bug inside the trace you might hit though, so maybe this all will fail regardless, but I already think about a proper patch for this problem
20:00psyk: ok karolherbst i will work on getting the logs of mmiotrace to ya
20:00psyk: sometime today i should have a couple hours to get this done
20:01tobijk: karolherbst: after rethinking the stated change in my tired state, i'm ok with it :)
20:02tobijk: just leave the --d/++d change in there i guess, its msall enough
20:02rhyskidd_: psyk: A useful timesaver I found with mmiotrace (at least the way my distro setups GRUB entries) was to boot into the recovery option of each kernel, and then set the filesystem r/w
20:03rhyskidd_: and then as karolherbst mentions do a quick lsmod before you start mmiotrace to ensure nvidia blob isn't already loaded
20:03tobijk: alternatively, pull the --d/++d change out and just put it as a dependency to the defs change in your cts tree :)
20:03tobijk: whatever you prefer :)
20:04psyk: rhyskidd, hmm interesting.. ok. maybe will try that. it's not much of a bother to just stop nvidia from loading on boot and taking sddm out of rc
20:06psyk: karolherbst, just last thing - you won't need me to do this same trace with nouveau and setting to 0f w/o first activating 0a?
20:43psyk: just realized i have this mtrr problem: reg06: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable <- going to fix that and test again before letting that evil nvidia blob on my system
21:15karolherbst: ha, nice
21:17tobijk: karolherbst: looks like a backport of the 4.13 cycle problem i had
21:17tobijk: should be fixed in 4.13
21:17karolherbst: ohh, I am sure I cuased that one locally
21:17karolherbst: mhh ohh I see
21:18karolherbst: tobijk: which commit?
21:18tobijk: karolherbst: i dot know exactly
21:19karolherbst: I already had enough fun bisecting my kernel today :D
21:19tobijk: will pick it up later
21:19karolherbst: mhh, but mhh, I think I might have caused that one
21:19tobijk: just not right now
21:19karolherbst: I need to solve this before we can even think of dyn reclocking on mobile chips
21:19karolherbst: and even way before we can think about it on desktops
21:20karolherbst: could it be the that we wait on the PMU
21:21karolherbst: it's even worse
21:21tobijk: karolherbst: try 4.13 rc should be fine
21:22karolherbst: no, this is a driver bug I start to trigger
21:22karolherbst: or I think it is
21:24karolherbst: one worker is stuck here: https://github.com/karolherbst/nouveau/blob/stable_reclocking_kepler_v7/drm/nouveau/nvkm/subdev/clk/base.c#L292
21:25karolherbst: and called __pm_runtime_resume there
21:25karolherbst: for whatever reason
21:25karolherbst: I really don't like that those call stacks make no sense most of the time
21:25karolherbst: and I think it's stuck here for real: https://github.com/karolherbst/nouveau/blob/stable_reclocking_kepler_v7/drm/nouveau/nvkm/subdev/clk/base.c#L365
21:25karolherbst: because this would make sense
21:26karolherbst: and the other worker is stuck here: https://github.com/karolherbst/nouveau/blob/stable_reclocking_kepler_v7/drm/nouveau/nvkm/subdev/clk/base.c#L642
21:26karolherbst: so classic dead lock
21:27karolherbst: no idea why I can't trust pm_runtime_suspended as much, need a more atomic operation here
21:29karolherbst: something like pm_runtime_if_nonsuspended_put and returns true on nonsuspended
21:29karolherbst: or do I have to use pm_runtime_status_suspended?
21:31tobijk: sadly i'm not familiar with that code :/
21:32karolherbst: status_suspended has weaker conditions to return true, so I guess I'll use this
21:33karolherbst: mhh, but uhm
21:33karolherbst: it shouldn't matter
21:40tobijk: karolherbst: not sure about https://github.com/karolherbst/nouveau/blob/stable_reclocking_kepler_v7/drm/nouveau/nvkm/subdev/clk/base.c#L643 and 644
21:40karolherbst: tobijk: hum?
21:41tobijk: where dies clk->func->fini point to?
21:41karolherbst: it doesn't matter
21:41karolherbst: and it points to 0
21:42karolherbst: but that isn't the point, it hangs on flushing the work queue
21:42karolherbst: and the worker is busy resuming the GPU
21:42karolherbst: allthough a suspend state was triggered
21:43karolherbst: but maybe it is that bug you were talking about
21:43tobijk: heh well, that one is a longstanding issue
21:43karolherbst: and I thought my checks here are sufficient: https://github.com/karolherbst/nouveau/blob/stable_reclocking_kepler_v7/drm/nouveau/nvkm/subdev/clk/base.c#L360
21:43tobijk: let the card sleep, hack in a performance state for the card and there you go, hangup
21:43karolherbst: check the if
21:43karolherbst: I fixed that issue, but it seems like there is a race condition somewhere
21:44karolherbst: it seems to work, except you reclock while the GPU is currently suspending
21:45tobijk: if it sleeps, it works already with that change? yay, but have a mutex for all pm work would be not too bad i guess
21:46karolherbst: the code looks fine to me though
21:46karolherbst: so I don't see why there should be even a mutex
21:47tobijk: did you change pm_runtime_suspended or can i grep a local copy?
21:47karolherbst: maybe I should do pm_runtime_get_if_in_use? But this will lead to all other kind of issues
21:47karolherbst: tobijk: I didn't change it
21:48karolherbst: and pm_runtime_barrier seems like a useless thing as well
21:48karolherbst: at least for me here
21:52tobijk: karolherbst: so there is pending work an you check for pm_runtime_suspend(), mhm maybe pm_runtime_irq_safe() will help
21:52tobijk: (not familiar with the code as i said)
21:54karolherbst: but I am in no interrupt
21:54tobijk: look for runtime_pm.txt, thats where i have my shallow insights ;-)
21:56karolherbst: uhm, duh
21:56karolherbst: " The runtime PM status of a device after successful execution of the suspend callback is 'suspended'."
21:56karolherbst: why set it _after_?
21:57tobijk: because it is supended after, before that it is suspending ;-)
21:57karolherbst: won't do
21:57karolherbst: imagine the device is idle for two seconds
21:57karolherbst: and then it gets used again
21:57karolherbst: but I just didn't reclock becuase it wasn't used
21:58karolherbst: -> ugly
21:58karolherbst: ohh wait
21:58karolherbst: that triggeres the device being idle
21:58karolherbst: even worse
22:00karolherbst: yeah, I need to check against RPM_SUSPENDING
22:03tobijk: still looks racy to me :/
22:04karolherbst: but now it's not like 0.5s racy, but more like 0.00001s racy
22:04karolherbst: which isn't perfect, but better
22:05tobijk: that sounds like a unlikely race codition i fought with at the time i started to get involved with nouveau :)
22:05karolherbst: tobijk: but as I said: I need some atomic operation, which checks _and_ wakes the device up in sync
22:05tobijk: ben had a hard time reproducing my issues :/
22:05karolherbst: which one was it?
22:05tobijk: pm fro suspend and resume
22:05tobijk: 3 or 4 years ago :D
22:06karolherbst: as long as it happens once in a life time, I already consider it an improvement
22:06karolherbst: because currently we don't have that check at all
22:06karolherbst: we just reclock no matter what
22:06tobijk: well that one happened for me every second time (not saying your will)
22:07karolherbst: my GPU reclocks if the temperature changes by 1°C
22:07karolherbst: so, quite often
22:07karolherbst:has to reboot
22:12karolherbst: tobijk: https://gist.githubusercontent.com/karolherbst/93434000467e5ef1aa713e4401692359/raw/50be5dbdb1c46d7766cee82fbf0e3e6959fdf3c6/gistfile1.txt
22:13tobijk: nice tox :O
22:13karolherbst: but the shell can't really deal with it
22:13karolherbst: sometimes I get offsets between display and real data
22:13karolherbst: a bit annoying
22:14tobijk: small price to pay?! :D
22:14karolherbst: seems so
22:14karolherbst: I should open a bug report
22:15karolherbst: tobijk: best part: "Pretty hostname: 🐧"
22:15tobijk: so there is a tux shown? :D
22:15koz_: Dat penguin.
22:15karolherbst: it's a unicode symbol
22:16karolherbst: sadly DNS servers don't really like those
22:16tobijk: knoversation is not clever enough 🐧 for me :/
22:16karolherbst: well, it's part of this years unicode extension
22:17koz_: Weechat manages the penguin just fine.
22:17skeggsb: i really don't think you want to be playing any tricks with runpm at that level of the code
22:18karolherbst: skeggsb: nope, I really don't
22:18tobijk: skeggsb: i proposed a per device lock for this, not sure what your educated propoal is :)
22:18karolherbst: skeggsb: but imagine a user doing a reclock just in a moment where runpm suspends?
22:19skeggsb: karolherbst: you do what every other user entrypoint that has to touch the GPU does, you pm_runtime_get_sync() first
22:19skeggsb: (so, from the sysfs code)
22:19karolherbst: then we always wake the gpu up
22:20karolherbst: and put it down to sleep again
22:20skeggsb: yes, and so what
22:20karolherbst: and then reclock again on resume
22:20skeggsb: the user requested a reclock, so do it
22:21tobijk: skeggsb: playing devils advocate: wait until the device is up by another source and then reclock
22:21karolherbst: well, there is no need to reclock though
22:21karolherbst: I don't see the downside of not doing the reclock, except runpm isn't up to the task (yet)
22:21karolherbst: but then I would prefer to improve runpm
22:22skeggsb: and if there's still races with therm (or whatever) triggering reclocks *during* suspend, fix them by either a) making those subdevs get suspended first and/or b) making the clk trigger function do nothing if the clk subdev is suspended
22:22karolherbst: skeggsb: well, I tried to do the latter ;)
22:23karolherbst: but as it seems, you can't trust the suspended state, because there is still _suspending_ before that
22:23skeggsb: i'm not talking about runpm here, i think fucking with that where you are is *never* going to work like you want
22:24karolherbst: skeggsb: second thing, I need to do that put_sync, because prior that, if the card wasn't suspended/suspending on triggering the reclock, it got later
22:24karolherbst: and crashed the driver
22:25skeggsb: tl;dr: do the sane+simple thing and not worry about an "extra" reclock that the user did actually ask for anyway, and concentrate on real issues :P
22:25skeggsb: it's like going "oh no, glxinfo woke up my gpu and it didn't actually do anything except list capabilities"
22:25karolherbst: yeah, I guess this would be the easiest way
22:25tobijk: karolherbst: do what the man said :D
22:26karolherbst: but hey, that's what I do, right? I just check if the gpu is suspending/suspended. No clue why this check should mess everything up
22:26karolherbst: I am sure I would get the same issues anyway
22:26karolherbst: and this bothers me more
22:26skeggsb: because it's racy-as-fuck :P
22:27karolherbst: that "flush_work(&clk->work);" in nvkm_clk_fini kills me
22:28karolherbst: uhh wait, I actually added a little lock for testing and I see I did it poorly
22:30karolherbst: skeggsb: by the way: is there same kind of subdev->status?
22:31skeggsb: no, there's no need for one, they're in a defined order, you know the status of each already
22:31karolherbst: which kind of tells in which state the subdev is in terms of (preinit, ...)
22:31karolherbst: skeggsb: not inside a kernel worker
22:32skeggsb: yes, even inside a kernel worker. subdevs should (i think we miss some still though, but that's a bug) shut down any worker they have when they fini()
22:32karolherbst: or maybe I overlooked something there as well
22:33karolherbst: skeggsb: so I guess you need to do flush_work and then something like disable_worker?
22:33karolherbst: there is also cancel_worker
22:34skeggsb: if it's a workqueue, flush_work() (and somehow make sure nothing else can queue more in the meantime, either by knowing all callers are disabled first, or some kind of atomic flag to prevent queuing) - if it's a timer, cancel the timer etc
22:34karolherbst: okay, I see
22:35karolherbst: okay, I've added something like that for clk
22:35karolherbst: it is no atomic yet though
22:35tobijk: karolherbst: how did you do it for clk then?
22:35karolherbst: tobijk: add a boolean to clk and check for that in nvkm_clk_update
22:35karolherbst: so that schedule_work isn't called anymore
22:36skeggsb: you can probably just move therm after clk in the subdev list, that'd be easier
22:37karolherbst: probably, but such a lock makes sense regardless
22:37karolherbst: but the reclocking process can be quite long anyhow and this concerns me more, it can take up to several ms
22:38tobijk: karolherbst: yet therm may come every other ns and queue a request
22:39tobijk: not accepting requests while reclokcing seems fine to me
22:39karolherbst: yeah, but this is no problem, because clk_fini disables new schedules and then flushes
22:39karolherbst: tobijk: we already do that as well
22:40tobijk: so cancel_work or wait_work (if there is one)
22:41skeggsb: if you move therm, you can just not worry about the problem, but, either way :P
22:42karolherbst: well, the code is already there, so time is already wasted ;)
22:42karolherbst: but yeah, maybe I just move therm
22:42skeggsb: the simpler the better if you ask me :P
22:43karolherbst: I am actually surprised that the GPU suspends while I let steam run on it
22:43karolherbst: after I closed all windows
22:45karolherbst: and I am sure if not therm kills everything, something else will do it later
22:45karolherbst: or today already
22:54karolherbst: skeggsb: okay, I found a problem with that always pm_runtime_get_sync approach: when a GPU changes the themperature every 2 seconds, because for odd reasons, then the gpu won't suspend anymore
22:55karolherbst: at least this problem exists with my current implementation
22:55skeggsb: where are you calling get_sync() ?
22:55karolherbst: in the clk worker before calling nvkm_pstate_prog
22:56skeggsb: nope, don't touch runpm there.. do it in the sysfs code
22:56skeggsb: assume the gpu is on inside nvkm/
22:56karolherbst: I have to do it when calling from therm as well
22:56karolherbst: there is no way around that
22:56skeggsb: no, you don't. if you've shut down therm before clk, it's not necessary
22:56karolherbst: or I don't see one
22:57karolherbst: nope, still an issue, because the reclock could have been triggered before shuting down therm
22:57karolherbst: due to bad timing
22:57karolherbst: and while we shut down everything, the reclock is still going on
22:57skeggsb: this is why you flush stuff in fini() funcs
22:57karolherbst: mhhh ohhh true
22:58karolherbst: so simple
22:58skeggsb: yes ;)
22:59karolherbst: and later on we will only have two subdevs triggering a reclock anyway: PMU and therm (and PCOUNTER on pre gt215 gpus)
22:59karolherbst: and nothing on pascal...... but this is a different story then