10:18mwk: Lyude: do you have envytools push access?
10:19mwk: hm, apparently not
11:12RSpliet: sooda: the download.nvidia.com FTP is down?
11:13skeggsb: RSpliet: yeah, it's down while they upload all the docs and stuff for us ;)
11:13sooda: doesn't seem to work internally either
11:13RSpliet: sooda: tnx. Apparently I can still access the open-gpu-doc structures over http
11:13mupuf: RSpliet: yeah, the http link has been the only one working for some time
11:15RSpliet: oh hadn't noticed before... I guess FTP is outdated tech so can't blame them for taking it off-line. Wouldn't want to lose access to open-gpu-doc though :-D
11:16pmoreau: skeggsb: Was that a pun or are they really taking it down to upload stuff for us? :-D
11:17RSpliet: mwk: know why on Kepler NVIDIA sometimes introduces nops to make sure "far" backwards branches end up at the start of the 64-byte instruction block (in fact, pointing at the sched codes)? Is that an I-cache optimisation or just a limitation in the ISA?
11:17skeggsb: pmoreau: i'd hoped my ";)" gave away that i was dream-typing
11:17pmoreau: skeggsb: :-)
11:18RSpliet: *far backward branch targets
11:24RSpliet: hmm, doesn't seem to be an ISA encoding requirement...
11:26skeggsb: i don't suppose their compiler just pads out each basic block with nops?
11:27mwk: RSpliet: I would guess cache alignment
11:27mwk: it's quite common on many archs
11:29karolherbst: maybe they read the instruction blocks at once, due to the sched opcodes?
11:29RSpliet: karolherbst: yes, and conveniently 64-byte is also the minimal granularity of a DRAM operation
11:30RSpliet: (with a burst length of 8 and 64-bit channels...)
11:30karolherbst: RSpliet: do they fill up to 6 nops? or simply one?
11:30karolherbst: or more?
11:30RSpliet: I've seen it fill up with 5 nops, but that's a special case
11:31karolherbst: ohh right, 6 makes no sense
11:31karolherbst: or does it? let me think
11:31karolherbst: okay yeah, 6 might make sense
11:32RSpliet: well, that special case being the expansion of OpenCL's async_work_group_copy()
11:32RSpliet: they might just be blindly inlining some ASM for that
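[editor's sketch] The nop counts discussed above (1, 5, "6 might make sense") follow from simple slot arithmetic. This is a minimal illustration, not NVIDIA's actual compiler logic. It assumes the Kepler layout in which every 64-byte block holds one 8-byte sched word followed by seven 8-byte instructions; the function name `nops_to_align` is hypothetical.

```python
BLOCK_BYTES = 64                    # size of one instruction block (from the discussion)
WORD_BYTES = 8                      # assumption: sched word and instructions are 8 bytes each
SLOTS = BLOCK_BYTES // WORD_BYTES   # 8 slots: slot 0 = sched word, slots 1..7 = instructions


def nops_to_align(target_offset: int) -> int:
    """Return how many nops must precede the instruction at target_offset
    (in bytes) so that it becomes the first instruction of a block, letting
    a branch point at that block's sched word."""
    if target_offset % WORD_BYTES:
        raise ValueError("offset is not 8-byte aligned")
    slot = (target_offset // WORD_BYTES) % SLOTS
    if slot == 0:
        raise ValueError("offset points at a sched word, not an instruction")
    # Slot 1 is already the first instruction of its block; otherwise the
    # remaining slots (slot..7) of the current block are filled with nops.
    return 0 if slot == 1 else SLOTS - slot


if __name__ == "__main__":
    for offset in (8, 16, 24, 56):
        print(hex(offset), nops_to_align(offset))
```

Under these assumptions the worst case is a loop head in slot 2, which needs 6 nops, and a loop head in the last slot needs only 1.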
12:12RSpliet: Ignoring the async_work_group_copy() it almost looks like an ad-hoc optimisation to avoid having the start of a loop at the end of an instruction block. Found one loop where they insert a single nop, and three nested loops where they didn't bother introducing 5 or 6. I'd have to look at more shaders, but perhaps it's a small threshold they assume is beneficial
12:29karolherbst: RSpliet: I think for nested loops the overhead of executing those NOPs could be too big. If they know they only have to do it once: fine, otherwise: better not to
12:29RSpliet: karolherbst: for the outer loop it shouldn't matter
12:30karolherbst: why not? the outer loop could be executed a few times as well
12:33RSpliet: yes, but the nops leading up to it aren't
12:35karolherbst: I think that's what I said, wasn't it?
12:35skeggsb: the nops seem like a good way to avoid fetching useless (ie. before the jump target) instructions into the cache
12:35karolherbst: ohh, okay, I think I see what I wrote wrong
12:36karolherbst: RSpliet: I meant if they know they execute NOPs once, it's fine for them, this also includes NOPs in front of outer loops
12:36karolherbst: skeggsb: they have to fetch those anyway (sched)
12:36karolherbst: ohh right
12:36karolherbst: that's what you meant
12:37RSpliet: skeggsb: yeah exactly, an I-cache optimisation :-)
12:38RSpliet: just curious where NVIDIA would've decided the threshold lies
17:23Lyude: mwk: nope
17:27RSpliet: karolherbst: well every inner loop has a high likelihood of being executed more often than its enclosing loops. I suspect the maths always work out
21:06z411: hello everyone, quick question: is vsync breaking when using xrandr a known issue?
21:20pmoreau: z411: Hi. What kind of use of xrandr is breaking vblank? Like just running `xrandr` without arguments?
21:24z411: pmoreau, Hi, thanks for the response. It happens when I do anything with it like switching mode, refresh rate or turning screens on/off.
21:25z411: For example, here I tried running glxgears and it works fine (it caps at 60 FPS), but running a mode change, vsync seemingly stops working, even after I turn it back to the original mode again (1920x1080)
21:25z411: The same happens if I change the refresh rate, or if I turn on my second monitor.
21:28pmoreau: Mhh… I *think* it's not expected to work with multiple screens, regardless if using xrandr or not.
21:30z411: Will disconnect my second monitor to see if it's related to that, a sec...
21:31z411: Mhm, seems to still happen even with one monitor connected, it seems just changing the mode triggers the issue
21:32z411: I might be going old though, I'm on kernel 4.9.25, will still try on mainline
21:32pmoreau: 4.9 is not too old yet :-)
21:35z411: Vsync does work with two monitors as long as I don't manipulate anything with xrandr
21:35z411: Pretty weird, should I report this?
21:35pmoreau: You should open a bug report with kernel + X version, which DRI version you are using, whether you are using the Nouveau DDX (and which version) or the modesetting one, include your Xorg.0.log + the test you pasted
21:36karolherbst: I reworked my pmu counter stuff. If anybody wants to look before I send out a new version of the series, please check it out: https://github.com/karolherbst/nouveau/commits/pmu_counters_v2
21:36pmoreau: That way people who know how it works can have a look
21:36pmoreau: karolherbst: The last 7 commits, right?
21:37z411: pmoreau, I see, will do that. Thanks.
21:38pmoreau: z411: Thanks for reporting
21:43pmoreau: karolherbst: So, load is reported as a hex value between 0x00 and 0xff? Wouldn't it be "easier" to have it as a percentage?
21:44mupuf: pmoreau: don't lose precision!
21:44mupuf: and a value is never reported as hex or binary or anything ... unless it is in a string ;D
21:44mupuf: It is just a u8
21:44mupuf: </pedantic> :D
21:46karolherbst: pmoreau: it
21:47karolherbst: 's easier this way
21:47karolherbst: a simple div on the falcon
21:47pmoreau: Would you lose too much precision if it was reported as 37.8%? It's only the value reported to userspace, not the one used for dyn reclocking, so it shouldn't matter too much, should it?
21:47karolherbst: pmoreau: check the PMU code and it makes sense
21:47pmoreau: I mean the value outputted in debugfs
21:47karolherbst: 2nd commit
21:47karolherbst: ahh I see
21:47karolherbst: oh well
21:47karolherbst: I could print it out as %
21:48karolherbst: details ;)
21:48pmoreau: I agree it's way easier to keep it as u8 on the falcon
21:48karolherbst: we could even report the full u8 value to userspace
21:48karolherbst: and let the gallium_hud convert it to %
21:50pmoreau: True, gallium_hud could certainly do the conversion. Or any userspace script.
21:51mupuf: yes, push all the precision to the userspace and let the HUD do its magic
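[editor's sketch] The userspace conversion suggested above could look roughly like this. It is only an illustration: it assumes the raw load is a u8 where 0xff means 100%, which the log does not state explicitly, and `load_u8_to_percent` is a hypothetical helper, not part of nouveau or gallium_hud.

```python
def load_u8_to_percent(raw: int) -> float:
    """Convert a raw u8 load value to a percentage.

    Assumption: 0x00 maps to 0% and 0xff maps to 100%.
    """
    if not 0 <= raw <= 0xFF:
        raise ValueError("raw load must fit in a u8")
    return raw * 100.0 / 0xFF


if __name__ == "__main__":
    for raw in (0x00, 0x60, 0xFF):
        print(f"{raw:#04x} -> {load_u8_to_percent(raw):.1f}%")
```

With this scaling one u8 step is about 0.4 percentage points, so doing the conversion in userspace keeps all the precision the falcon reports.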
21:51karolherbst: going higher than u8 doesn't make much sense though, because we really don't need that much precision
21:53karolherbst: I was even thinking about sending the full u32 values to the kernel, but then I couldn't read it out in one go and would need to change too much in the PMU-host communication, so I kept it as u8
21:54mupuf: karolherbst: but there may be more than 4 channels in the future :p
21:54karolherbst: the code supports 8 actually
21:54mupuf: I will check the code later, still working on this article
21:54mupuf: oh, cool
21:54karolherbst: only gt215 has 4 channels
21:54karolherbst: gf100+ has already 8
21:54karolherbst: and this is plenty
21:55karolherbst: even nvidia doesn't have anything useful to do with those
21:55karolherbst: they use 5 or 6 usually
21:55karolherbst: and 2 of those are kind of not needed actually
21:55karolherbst: like they fill 2 slots for video accel stuff....
21:56karolherbst: if I didn't make any mistakes, this is what nvidia uses: https://gist.githubusercontent.com/karolherbst/1eb3759be936406734bcfa308c2652b2/raw/56dd0ea5c1396f70fdcec445455dea8b27773260/gistfile1.txt
22:03mupuf: yeah, it looks familiar
22:03mupuf: and it was funny that they included pcopy and GR together on older platforms
22:03mupuf: btw, nva5 also was using PCOPY IIRC
22:04karolherbst: maybe just bad naming on our end
22:04karolherbst: I am wondering how dyn reclocking worked on older gens
22:05karolherbst: or maybe they just monitored the FPS in userspace and complained about bad perf?
22:07mupuf: karolherbst: they used pcounter
22:08mupuf: check mmiotraces
22:08RSpliet: wasn't it more crude... like "when launching a new context/game/we, crank the clocks up to max"
22:08mupuf: RSpliet: this is still true
22:08mupuf: no, they were monitoring the perf counters
22:09RSpliet: never looked at that much
22:09mupuf: from pcounter
22:09mupuf: the problem is that they had to disable dyn reclocking when the userspace wanted to use the perf counters
22:09mupuf: whereas, with the pmu, both of them are independent
22:10karolherbst: ohh I see
22:10karolherbst: well dyn reclocking was kind of crappy on Tesla anyway
22:13mupuf: in what sense?
22:13mupuf: only the reclocking process was chaotic and very chip-dependent
22:14mupuf: dyn reclocking was quite alright, aside from the fact that memory usage was not taken into account IIRC
22:14karolherbst: well there weren't many perf levels
22:14mupuf: ah, right!
22:14mupuf: yes, only 4 at best, only on laptops
22:15mupuf: desktop PCs had funnier vbioses
22:25z411: pmoreau: Thanks for mentioning the modesetting driver, I can now confirm it only happens with the DDX one
22:35pmoreau: z411: Which version of the Nouveau DDX did you test?
22:43z411: pmoreau, 1.0.13, which is the latest debian sid offers, although experimental has 1.0.15; might try getting that one
22:51z411: Still happens with 1.0.15, sadly.
23:11karolherbst: yay, getting notified and setting new thresholds is working :)
23:17karolherbst: but something is wrong on the PMU set, the max value isn't set to the value I've expected...
23:30karolherbst: a falcon C compiler would have been nice now :D