04:34imirkin: skeggsb: so i need the head or.depth and mode khz to determine if i need to switch things into "tmds high speed" mode, which is in the 0x612300 register, currently set by sor_clock, which only has the ior as input.
04:35imirkin: skeggsb: i'm thinking of moving that stuff into rgclk...
04:36imirkin: rgdiv currently sets 612200 -- it's all pretty related
04:53skeggsb: imirkin: you've confirmed bios scripts don't automagically do this stuff too?
04:55imirkin: i have not. i do see it in the mmiotrace though.
04:55imirkin: skeggsb: which bios scripts? the "display" ones?
04:56imirkin: hm, can't find them with nvbios
04:56imirkin: i think something's off there
04:56skeggsb: care to point me at a trace + bios image?
04:57imirkin: i think there's an issue in nvbios... karol fixed it
04:57imirkin: or rhyskidd i forget
04:57imirkin: but i don't think it's pushed
04:57imirkin: skeggsb: btw, my patches so far: https://github.com/imirkin/linux/commits/hdmi2
04:59imirkin: skeggsb: and if you update rnndb, i fixed up some of the hdmi stuff to display a little nicer
04:59pabs3: imirkin: btw, I got another one of the freezes I mentioned before, had to force-reboot again. https://paste.debian.net/hidden/eb55f2f1/
04:59imirkin: basically instead of 0x0a it wants 0x14 in that reg, and also one of the dividers needs to be 1
05:00imirkin: in the gv100 dev_display.ref they're listed as
05:00imirkin: #define NV_PDISP_FE_CMGR_CLK_SOR_LINK_SPEED_TMDS 0x0000000A /* RW--V */
05:00imirkin: #define NV_PDISP_FE_CMGR_CLK_SOR_LINK_SPEED_TMDS_HIGH_SPEED 0x00000014 /* RW--V */
05:02imirkin: skeggsb: note that this isn't the best mmiotrace in the universe... ideally i would have made one where i switched back and forth between high/low modes
05:02imirkin: but the gv100 dev_display.ref filled in enough holes
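A minimal sketch of the decision discussed above: choosing the link-speed value for 0x612300 from the mode clock and output depth. The two register values are the ones quoted from the gv100 dev_display.ref. Everything else is an assumption for illustration. The helper names are hypothetical and are not nouveau code, depth is taken as bits per component, and the 340 MHz cutoff is the HDMI 2.0 TMDS character rate above which scrambling and the 1/40 clock ratio are required. The divider change mentioned in the log is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values quoted in the log from the gv100 dev_display.ref. */
#define NV_PDISP_FE_CMGR_CLK_SOR_LINK_SPEED_TMDS            0x0000000A
#define NV_PDISP_FE_CMGR_CLK_SOR_LINK_SPEED_TMDS_HIGH_SPEED 0x00000014

/* Assumption: HDMI 2.0 limit; rates above this need "high speed" mode. */
#define TMDS_HIGH_SPEED_THRESHOLD_KHZ 340000u

/* Hypothetical helper: TMDS character rate for an RGB / 4:4:4 mode.
 * bpc is bits per component; deep colour scales the rate by bpc / 8. */
static inline uint32_t tmds_char_rate_khz(uint32_t mode_khz, uint32_t bpc)
{
    assert(bpc == 8 || bpc == 10 || bpc == 12 || bpc == 16);
    return (uint32_t)(((uint64_t)mode_khz * bpc) / 8);
}

/* Hypothetical helper: does this mode need TMDS high-speed signalling? */
static inline bool tmds_needs_high_speed(uint32_t mode_khz, uint32_t bpc)
{
    return tmds_char_rate_khz(mode_khz, bpc) > TMDS_HIGH_SPEED_THRESHOLD_KHZ;
}

/* Hypothetical helper: link-speed field value to program. */
static inline uint32_t sor_link_speed(uint32_t mode_khz, uint32_t bpc)
{
    return tmds_needs_high_speed(mode_khz, bpc)
        ? NV_PDISP_FE_CMGR_CLK_SOR_LINK_SPEED_TMDS_HIGH_SPEED
        : NV_PDISP_FE_CMGR_CLK_SOR_LINK_SPEED_TMDS;
}
```

Under these assumptions a 594000 kHz mode at 8 bpc selects the high-speed value, as does a 297000 kHz mode at 12 bpc, while 297000 kHz at 8 bpc stays on the regular TMDS value.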
05:29imirkin: skeggsb: i'm about to head off to sleep ... let me know if you have any suggestions so i can resume tomorrow. i'd really like to get basic modeset for hdmi2.0 working.
05:29imirkin: and then i can move on to additional crazy like yuv420 and 12bpp and such
05:29skeggsb: ack, sleep well!
06:23soderstrom: I've been following the troubleshooting guide for nouveau. So it looks like I am using the correct drivers and kernel modules. I even see things like "NOUVEAU(0): GLX sync enabled" and "GLX: Initialized DRI2 GL provider for screen 0". The problem is, anytime I try to run anything using opengl, it fails. For example glxinfo returns BadValue (integer parameter out of range for operation). I really
06:23soderstrom: wanted to solve this on my own, but starting to run out of ideas. Oh, and I have reinstalled xorg-server, mesa, drm, etc.
06:25soderstrom: One more thing. Historically I used to have a working nouveau setup, then I changed to the nvidia driver for a while, and have now changed back. I cannot find any left overs from the nvidia driver (that was in my head the obvious thing to look for)
06:37soderstrom: kernel log https://pastebin.com/ZtuxhRgD
06:37soderstrom: Xorg.0.log https://pastebin.com/Y9UMznwZ
06:38soderstrom: loaded modules and some modinfo https://pastebin.com/JKTAxpHn
09:24soderstrom: nvm, the obvious thing to look for was still around. a simple "sudo pacman -R nvidia-340xx-utils lib32-nvidia-340xx-utils" and everything works.
09:25soderstrom: strace showed that glxinfo was loading these nvidia files, then it was just a matter of checking which packages owned those. Damn I've been stupid today. Well thanks channel for allowing me to let out some "steam" :)
12:14karolherbst: fun: "BADFET: Defeating Modern Secure Boot Using Second-Order Pulsed Electromagnetic Fault Injection" https://www.usenix.org/system/files/conference/woot17/woot17-paper-cui.pdf
12:28karolherbst: ohhh wow, interesting
12:28karolherbst: my current CTS run simply goes through
12:29karolherbst: not even random fails
13:04mupuf: karolherbst: that is a rare thing
13:05mupuf: that was a lucky run probably
13:05mupuf: and you jinxed it now
13:05karolherbst: mupuf: well, it is the full cts-runner thing
13:05karolherbst: currently at iteration 35 or something
13:05karolherbst: still running
13:05mupuf: I see
13:05mupuf: well, congrats then!
13:06mupuf: is it for gl 4.5?
13:06karolherbst: yeah.. normally it started to fail at iteration 5 or 6
13:06karolherbst: no, gl 4.4, because I wanted faster iterations
13:06mupuf: So. when will you submit the conformance results? :)
13:06karolherbst: I mean, I was fixing most of the issues so getting a full iteration without fails isn't surprising
13:06karolherbst: but I wanted to track down random fails
13:07karolherbst: mupuf: well, do we have to pass the "KHR-NoContext.gl43.robust_buffer_access_behavior" tests?
13:07karolherbst: we don't implement reporting gpu reset status to GLX right now
13:07karolherbst: or GL?
13:08mupuf:has no idea what is in the mustpass list
13:08karolherbst: and context creation depends on it, so those KHR-NoContext.gl43.robust_buffer_access_behavior tests just fail
13:08karolherbst: mupuf: it is a different mustpass list
13:19karolherbst: anyway, implementing those bits isn't really that much work. Just implementing an ioctl and use it in mesa (basically)
13:19karolherbst: I think the kernel code is already there, just not wired up right now
13:25karolherbst: mupuf: "37/38 sessions passed, conformance test FAILED" :(
13:25mupuf: how many runs are needed to achieve conformance?
13:26karolherbst: just that one, but the cts-runner does a lot of iterations itself
13:26karolherbst: the config-gl43-khr-master one failed due to the robustness fails
13:28karolherbst: mupuf: also I think we should run it on each chipset or maybe gen would be enough? dunno
13:28mupuf: gen is enough. Look at nvidia's results
13:28karolherbst: but I think everybody splits it around chipsets
13:28karolherbst: ohh really?
13:28karolherbst: they only submit one per gen?
13:28mupuf: yeah, they had one for pascal, one for kepler, etc...
13:28karolherbst: submission or GPU?
13:29karolherbst: because you can bundle multiple GPUs in one submission
13:29mupuf: they had kepler1 / kepler2 though, IIRC
13:29mupuf: oh, good question
13:29mupuf: as I said, check it out on the website l)
13:29mupuf: the results are public
13:29karolherbst: anyway, I guess I will take care of those robustness bits
13:30karolherbst: fun fact, I was running the cts inside gdb, because I hoped I would catch a random fail :/
13:30karolherbst: but the random fail is something silly anyway
13:30karolherbst: like we wait on a bo
13:30karolherbst: or we fail to wait on fences
13:30karolherbst: stupid issues like that
13:31karolherbst: and we need to review the patches :D
13:32karolherbst: some of the fixes are also a bit questionable
13:33karolherbst: especially "nvc0: force depth block size dimension to 1 for 3d images" breaks images on kepler
14:14RSpliet: Heh. I just hacked around a parboil benchmark to use float4 instead of an array of 4 floats, hoping NVIDIA would issue b128 load/store ops if I make this explicit, getting more perf out of my DRAM subsys. Turns out, they don't. Any constraints I might be overlooking?
14:15karolherbst: nvidia doesn't care ;)
14:15karolherbst: those b128 load/stores are a questionable optimization anyway
14:15karolherbst: and they also don't work with c access afaik
14:16RSpliet: Questionable only if your LLC is large enough I'd say.
14:17karolherbst: why would you think it increases perf? memory bandwidth doesn't change and now you have to stall more and hide latencies better
14:17karolherbst: or not?
14:17RSpliet: If your LLC isn't large enough you could evict data before you re-use it, issuing more DRAM requests.
14:18karolherbst: I don't see how the LLC matters as you write to/read from the same addresses in the end
14:20RSpliet: Because cache lines are wider than a float, and individual f32 reads of float4-formatted data end up being a stride pattern with a pitch of 4
14:20RSpliet: from the warp POV that is
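The stride argument above can be checked with a little arithmetic. The sketch below counts how many cache lines one warp-wide load touches for a given per-lane stride and access size. The 32-lane warp matches NVIDIA hardware; the 128-byte line size and the function name are assumptions for illustration, and the model ignores sectoring and any real coalescing rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WARP_SIZE  32u   /* lanes per warp */
#define LINE_BYTES 128u  /* assumed cache-line size */

/* Hypothetical model: each lane reads access_bytes at
 * base + lane * stride_bytes; return the number of distinct
 * cache lines touched by the whole warp. */
static inline unsigned lines_touched(uint64_t base, unsigned stride_bytes,
                                     unsigned access_bytes)
{
    assert(access_bytes > 0 && access_bytes <= LINE_BYTES);

    unsigned count = 0;
    uint64_t highest = 0;   /* highest line index counted so far */
    bool have = false;

    for (unsigned lane = 0; lane < WARP_SIZE; lane++) {
        uint64_t addr  = base + (uint64_t)lane * stride_bytes;
        uint64_t first = addr / LINE_BYTES;
        uint64_t last  = (addr + access_bytes - 1) / LINE_BYTES;

        /* Addresses never decrease with the lane index, so a line is
         * new exactly when it lies above everything seen so far. */
        for (uint64_t line = first; line <= last; line++) {
            if (!have || line > highest) {
                count++;
                highest = line;
                have = true;
            }
        }
    }
    return count;
}
```

In this model, reading one f32 component of float4-formatted data (stride 16, access 4) touches 4 lines per load, and four such loads are needed for x, y, z and w, each revisiting the same 4 lines. A single 16-byte load per lane touches those 4 lines once, and a contiguous f32 read (stride 4) touches only 1 line.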
14:20karolherbst: yeah dunno. all I know is, that nvidia doesn't use those 128 bit reads/writes that often themselves anyway
14:20RSpliet: I noticed :-)
14:21RSpliet: May have to use inline assembly for this one to experiment with
14:21karolherbst: I think they used to more often on kepler?
14:21RSpliet: Kepler is my target GPU
14:21karolherbst: ahh, then maybe not
14:22karolherbst: RSpliet: maybe "fake" align the addresses with a & 0xfffffff0 operation?
14:22karolherbst: as the address has to be aligned to be usable for 128 bit reads/writes anyway
14:22karolherbst: maybe nvidia doesn't know they are?
14:22RSpliet: That could be an issue...
14:22karolherbst: or it isn't predictable
14:23RSpliet: Well, actually no. The vload4() specification doesn't allow the offset to be non-aligned. Presumably the base address always is...
14:23karolherbst: ohh, right
14:24karolherbst: but mhh, nothing really checks that, right?
14:25karolherbst: maybe nvidia doesn't care about that alignment and adding a & ~0xf triggers 128 bit read/writes?
14:25karolherbst: I know that nvidia cares about such hints in the end
14:25RSpliet: I could give that a go, sounds easier than inline asm (given the need to have adjacent target regs too)
14:26karolherbst: nvidia uses those for fetches and exports though
14:26karolherbst: maybe for g access it simply doesn't matter?
14:27karolherbst: they use 64 bit load/writes though
14:28karolherbst: but maybe giving RA more freedom is more important than bundling read/writes, and for 64 bit types you screw RA over already anyway
14:28karolherbst: so no point in splitting those up
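A small sketch of the "fake align" masking suggested above, written as plain C rather than OpenCL so the bit arithmetic can be checked in isolation. The helper names are hypothetical, and whether such a mask actually nudges NVIDIA's compiler into emitting 128-bit loads is exactly the open question in the discussion. One detail worth noting: with 64-bit addresses, `& 0xfffffff0` would also clear the upper 32 bits, whereas `& ~0xf` only clears the low four.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round an address down to a 16-byte (128-bit) boundary.
 * The cast makes the mask as wide as the address type. */
static inline uintptr_t align_down_16(uintptr_t addr)
{
    return addr & ~(uintptr_t)0xf;
}

/* True when the low four address bits are zero. */
static inline bool is_aligned_16(uintptr_t addr)
{
    return (addr & (uintptr_t)0xf) == 0;
}

/* Hypothetical helper: view a float pointer as the start of a
 * 16-byte-aligned group of four floats. If p was already aligned,
 * the returned pointer equals p. */
static inline const float *as_aligned_vec4(const float *p)
{
    const float *q = (const float *)align_down_16((uintptr_t)p);
    assert(is_aligned_16((uintptr_t)q));
    return q;
}
```

For an already aligned pointer the mask is a no-op, so it serves only as a hint; for a misaligned pointer it silently changes which data is read, so it is only safe where alignment is guaranteed, as it is for the vload4() case mentioned above.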
16:32imirkin: skeggsb: so ... i guess no words of advice on what to do about that register?
16:35imirkin: skeggsb: hm, interesting - on the DVI port, link == 2. i wonder if that means it's not dual-link capable anymore? that'd be nuts. (GT 1030...)
16:39imirkin: skeggsb: i might also add that all this indexing by or / head is pretty confusing... is or id == head id?
19:16imirkin: skeggsb: doesn't look like 612300 gets touched by display scripts
19:16imirkin: or rather ... it does
19:16imirkin: but not the TMDS_HIGH_SPEED thing
19:16imirkin: they set it to NV_PDISP_FE_CMGR_CLK_SOR_MODE_BYPASS_FEEDBACK under some conditions, whatever that means
20:23imirkin: skeggsb: i'm thinking of adding a nvkm_hdmi_acquire hook into outp->func->acquire to compute whether it should be high-bandwidth or not, and then use that result in sor_clock
21:05crucialrhyme: i have a very stupid question, that i'm not sure how to answer. i would like to have a bunch of little blinking activity LEDs, one for each of the 3,854 cores on my GPU. is this physically possible? and is information that granular exposed through software?
21:07imirkin: define activity?
21:08imirkin: afaik there are no current api's for anything like this
21:08imirkin: i think it's possible to get per-lane states via mmio, not sure
21:08imirkin: lanes don't really map well onto "cores" though
21:09mwk: crucialrhyme: forget about "cores"
21:09crucialrhyme: i mean, fundamentally, i just want the blinkenlights
21:09mwk: it's marketing bullshit
21:09mwk: the most granular you can get is streaming multiprocessors
21:09crucialrhyme: and i figure by trying to figure out how this might be done, or else why it's impossible, i will learn something
21:11crucialrhyme: okay so is there some way to query the "activity" of each individual SM for a reasonable definition of activity?
21:11mwk: yeah, lots of ways really
21:12mwk: there's a whole complex performance counter subsystem on the gpu
21:12mwk: you can monitor all sorts of events
21:12mwk: I suppose hooking up an "instruction executed" signal on every SM to one of the per-SM counters and monitoring them periodically would do the trick
21:14crucialrhyme: is this documented somewhere/has somebody reverse engineered it?
21:14mwk: what card would that be?
21:14crucialrhyme: 1080 Ti
21:14mwk: and which drivers?
21:18crucialrhyme: not sure about a version number. ideally the closed-source drivers so I can watch the lights blink as i train neural networks
21:19imirkin: then you're stuck with whatever api's they provide.
21:24crucialrhyme: okay, i'll dig through those, at least. within the nouveau drivers is there any particular code i might want to look at as well?
21:41imirkin: alright. time to test some of this junk i've been working on. bbl.
21:41imirkin: assuming my comp doesn't explode.
21:54imirkin: skeggsb: looks like this basically works -- https://github.com/imirkin/linux/commits/hdmi2
22:02pmoreau: karolherbst: 1xb128 can be more efficient than 4xb32 because you only have one memory request/transaction? instead of 4, and you do have the bandwidth for doing that 1xb128 in one go. There was a nice GTC presentation (2012-2013, around Kepler) about it and the memory subsystem (targeted for CUDA developers)
22:05pmoreau: From my own CUDA-programming experience, when the compiler avoided using coalesced loads, it was because it would otherwise go over 32 registers per thread.
22:06pmoreau: RSpliet: -^
22:27karolherbst: duh... the second full cts run also goes through
22:27karolherbst: ohh, random fails though
22:28karolherbst: okay, but that was known
22:45JayFoxRox: imirkin: mwk: I have some questions re NV20 PVIDEO. why can only 1 buffer be active at a time [HW has 2 I believe]? which one is displayed? changing the brightness doesn't seem to work?
22:46JayFoxRox: also what are the UVPLANE regs for?
22:46JayFoxRox: (I also have some other minor issues, but I think they are just bugs with my code)
22:47JayFoxRox: *NV20=NV2A; I'm working on an Xbox
22:49JayFoxRox: oh and how does the color key work? I tried a handful of values but couldn't figure out which format / unit it expects the color value in
22:52mwk: JayFoxRox: AFAIK there are 2 so that you can do double buffering
22:52mwk: UVPLANE are for planar formats, I think
22:53JayFoxRox: that makes sense re doublebuffer (although kind of a shame: I was hoping to use BUFFER 1 for overlays in homebrew, because commercial games only use BUFFER 0 afaik). I might try the UVPLANE stuff - but xbox never uses it so I don't know how the regs work
22:54JayFoxRox: any idea about the colorkey? mwk
22:54mwk: overlays aren't really my thing
22:56JayFoxRox: they are quite cool for what I want to do. because I don't have to mess up the framebuffers or hook the FIFO somehow
22:56JayFoxRox: I can just draw using the CPU and overlay an image :)
22:57JayFoxRox: also does upscaling, so if we are short on memory, we can render low-res and upscale
23:24RSpliet: pmoreau: It's defo going over 32 registers. Think the kernel uses like 43 or sth... whether the loads would be coalesced or not.
23:25RSpliet: And yes, in this case 128b loads would definitely be more efficient than 4x32. Less communication overhead, less pressure on the LLC. :-)
23:26RSpliet: I would somehow love to bounce this off NVIDIA, but... well, it's a 5yo GPU and a 2yo driver. They'll be as interested in my feedback as we are in Ubuntu LTS bugs :-P