00:35imirkin: pmoreau: btw, thoughts on my handling of the third grid dimension?
09:31salawat: Okay... so after some doc diving, just wanted to make sure I actually grok what drives what with info from where. Thumbed through https://nvidia.github.io/open-gpu-doc/ and I think I get the bring up chain of events. POST stuff happens that drives a bunch of legacy VBIOS bring up. That may/may not hand off to a bunch of EFI stuff I'm not all that interested in at the moment. EFI eventually hands off to bootloader, bootloader to kernel, m
09:34salawat: Now built into the card are a series of ROMs containing lookup tables and eventually state of the card. the BIT being one of the main
09:37salawat: sources of info like pointers to clockscripts, power management stuff, Falcon data, etc...
09:39salawat: what I'm not clear on, is the relationship between these Devinit scripts they've got documented, and whether or not those are essentially what everyone seems to refer to as the "firmware".
09:39salawat: or if they are a completely different thing entirely.
09:43salawat: if they are the firmware, then I need to stare at this a lot longer until I catch on to how what would appear to be a one-time read in by the driver leads to any form of dynamic power management whatsoever unless you're using offsets into that blob to dispatch to state changes deeper in the card.
09:44salawat: If they aren't the firmware... then are they safely ignorable in terms of wrapping one's head around clock and power management?
10:05damo22: i would think the firmware relates to the binary blobs in /lib/firmware that get loaded into the cards
10:05damo22: plus the vbios
12:33pmoreau: imirkin: I haven’t looked at it yet, will do so later today while building and running dEQP and VK-GL-CTS.
17:40imirkin: pmoreau: sounds good. i'm going to probably be pushing some updates to the nv50_compute branch, but not the MR (which is a subset)
18:11imirkin: although i think i'm going to merge all the image commits into one in the end. but for now i'm keeping them separate.
18:49pmoreau: I’ve been having issues running dEQP: the version they have of the GL CTS does not include your fixes to not use GL >=4.0 when running GL3.3, and running the master of VK-GL-CTS fails to compile.
18:49imirkin: i'm using aosp's deqp fwiw
18:49imirkin: although not the very latest
18:49imirkin: and i have to force-enable ES 3.1
18:50imirkin: gtg for a bit, back later
18:50pmoreau: I used the one from aosp too.
18:50pmoreau: I sent a few comments on the MR, will do the rest tomorrow.
19:33Lyude: Anyone (skeggsb if you're around maybe) know what the FLUSH_ENABLE argument for dma-copy does? https://github.com/NVIDIA/open-gpu-doc/blob/master/classes/dma-copy/cla0b5.h#L71 (also known as CE). I ask because I taught igt-gpu-tools to zero-fill new bos by default (eventually the kernel will do this, but for now hacking this into igt is fine), and I'm just curious if disabling flushing might be
19:33Lyude: faster and not cause issues since we try to avoid ever modifying bos via mmaps and just copy them using dma-copy to/from sysmem when we need to change them
20:37imirkin: Lyude: do we set it in mesa?
20:39Lyude: imirkin: I'm not sure tbh, forgot to check there
21:28imirkin: mwk: hey, if i have a 2d array surface, but i want to access it as one big surface -- is it logically w*array x h, or w x h*array, or something else?
21:30mwk: it's, of course, complicated
21:30imirkin: but i wanted simple! :)
21:30mwk: w × (tile_aligned(h) * array_len) would be close, I think
21:30imirkin: right, obviously all tile-aligned. i knew that.
21:31imirkin: ok, so that's what i assumed ... doesn't seem to be working
21:31mwk: assuming it's not a 3d surface and there are no mipmaps involved... it'd probably work
21:31imirkin: oh fffff ... mipmaps
21:31imirkin: how can this even work. gr.
21:32imirkin: i'm trying to make images work on nv50
21:32mwk: ... condolences
21:32imirkin: and basically the address to g is a (x, y) coordinate
21:32imirkin: (and you also feed it the tiling parameters separately)
21:32imirkin: so there's basically no way to do 2d array then?
21:33imirkin: if the 2d array has mips
21:33imirkin: (and without copying tons of stuff around)
21:33mwk: well you probably cannot reasonably *access* the mips
21:33imirkin: i don't want to
21:34mwk: but if the pitch between full images happens to be a multiple of surf width, hmm
21:36imirkin: oh hm
21:36mwk: I don't really remember how the layout determination process works though
21:36imirkin: we DO align the stride to w*h
21:36mwk: it's been so long ago
21:36imirkin: (tile-aligned w*h)
21:36imirkin: so yeah, i should be able to just back it out
21:36imirkin: thanks =]
21:37mwk: ah, sounds doable then
21:41imirkin: no, still fail. but could be me doing any number of things wrong
21:41imirkin: info = mt->layer_stride / lvl->pitch;
21:41imirkin: i'm doing that
21:41imirkin: the level pitch is the tile-aligned pitch
21:41imirkin: and layer stride is the bytes between layers
21:42imirkin: and so the total y = y + (array index * info) basically
21:43mwk: what is a "tile-aligned pitch", anyway?
21:44imirkin: lvl->pitch = align(nbx * blocksize, tsx);
21:45imirkin: nbx = width, tsx = tile size in the x direction (aka 64)
21:45imirkin: blocksize = width of the format, i.e. 4 for rgba8 etc
21:47mwk: hm, that should work
21:48mwk: and you're not running out of uint16_t for the effective y result?
21:48imirkin: let's hope not
21:48imirkin: 64x64x8 32-bit-per element
21:48imirkin: so should be fine.
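The uint16_t concern above can be checked with a few lines of C. This is only a sketch: the function name `effective_y_fits_u16` is hypothetical, and it assumes the array is addressed as one tall 2D surface in which every layer occupies `rows_per_layer` rows (64 for the 64x64x8 test, if the tile-aligned height equals the image height).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the largest effective y coordinate, i.e.
 * (rows_per_layer * layers) - 1, still fits in a uint16_t.
 * Hypothetical helper for illustration, not code from Mesa.
 */
static bool effective_y_fits_u16(uint32_t rows_per_layer, uint32_t layers)
{
    /* Widen before multiplying so the product itself cannot overflow. */
    uint64_t total_rows = (uint64_t)rows_per_layer * layers;

    return total_rows != 0 && total_rows - 1 <= UINT16_MAX;
}
```

For the test case, `effective_y_fits_u16(64, 8)` holds with plenty of room (the largest y is 511); with 64 rows per layer the limit would be reached at 1024 layers.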
21:48mwk: that's real small
21:49imirkin: it's a test.
21:49imirkin: that i'm failing btw :)
21:49imirkin: we'll worry about big sizes later
21:49mwk: does it involve mipmaps?
21:50imirkin: nope. single-level, fixed
21:50mwk: so info is basically just 64, right?
21:50imirkin: let me dump it
21:51mwk: pitch is 256, layer_stride is 16384?
21:51imirkin: yes, info = 64
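The arithmetic discussed above, as a minimal C sketch. Only the two formulas `lvl->pitch = align(nbx * blocksize, tsx)` and `info = mt->layer_stride / lvl->pitch` come from the conversation; the helper names (`align_pot`, `level_pitch`, `rows_per_layer`, `effective_y`) are hypothetical and not taken from Mesa. It assumes a single-level 2D array (no mipmaps) and a power-of-two tile width.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to a multiple of a; a must be a non-zero power of two. */
static uint32_t align_pot(uint32_t v, uint32_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (v + a - 1) & ~(a - 1);
}

/* Bytes per row: width in blocks times bytes per block, tile-aligned. */
static uint32_t level_pitch(uint32_t nbx, uint32_t blocksize, uint32_t tsx)
{
    return align_pot(nbx * blocksize, tsx);
}

/* Number of rows between the start of one layer and the next ("info"). */
static uint32_t rows_per_layer(uint32_t layer_stride, uint32_t pitch)
{
    return layer_stride / pitch;
}

/* Row of (y, layer) when the whole array is viewed as one tall 2D surface. */
static uint32_t effective_y(uint32_t y, uint32_t layer, uint32_t info)
{
    return y + layer * info;
}
```

With the numbers from the test (64 texels wide, 4 bytes per texel, tile width 64, layer stride 16384), `level_pitch` gives 256 and `rows_per_layer` gives 64, matching the dumped value of `info`.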
21:52imirkin: and here's the shader: https://paste.debian.net/plain/1187222
21:52imirkin: hopefully i didn't screw it up. sigh
21:52imirkin: c15[..4] = info
21:53imirkin: it reads from one image, stores into another
21:53imirkin: and i know the code isn't *entirely* broken since it does work for 2d
21:54imirkin: (and i have a TODO to do all this with 16-bit values, but that's later)
21:54mwk: ... ugh why is it doing shift&or instead of just using the 16-bit ops
21:54mwk: ... right
21:54imirkin: <-- lazy
21:54imirkin: i've already fixed a ton of emission issues in 16-bit ops
21:54imirkin: i use them for format conversion stuff, but this is R32UI, so not a ton to be converted
21:56mwk: it seems correct; you sure that the c15 constant is wired correctly?
21:57imirkin: works for image size
21:57imirkin: (that's in info[0..2])
21:58imirkin: oh f me
21:58imirkin: it's hooked up wrong
21:58imirkin: but two wrongs make a right
21:59imirkin: still fails =]
22:00imirkin: [1055097.631518] nouveau 0000:04:00.0: gr: TRAP_MP - TP1: 00000400 
22:00imirkin: [1055097.631525] nouveau 0000:04:00.0: gr: 00200000  ch 2 [000faa0000 deqp-gles31] subc 6 class 50c0 mthd 031c data 00000000
22:00imirkin: i'm getting those. dunno what they mean
22:00mwk: *sigh* haven't I made better reporting for those, like, ages ago
22:01imirkin: i think those are just unknown enums
22:01imirkin: there's supposed to be stuff between the 
22:01imirkin: (or at least unknown to the kernel)
22:02mwk: it is, indeed, an unknown MPC trap
22:02imirkin: what can i say - i'm that good
22:04mwk: ... but it seems to be just a variant of g read trap
22:07imirkin: yeah, looks like out of bounds
22:07imirkin: i should probably adjust the limit, huh
22:07mwk: oh gods I forgot these things have per-space limits
22:16imirkin: Passed: 1/1 (100.0%)
22:16imirkin: mwk: thanks =]
22:16mwk: you're welcome
22:17imirkin: now onto 3d...
22:17imirkin: that one's a bit of a mindfuck
22:17imirkin: i'll just bang on the keyboard at random until tests start to pass
22:18imirkin: mwk: btw, unrelatedly ... does nv50 compute have any concept of global memory cache? i.e. volatile, vs cache global vs etc
22:20mwk: the only caches on g80 are vertex, const, texture caches
22:20mwk: okay, and shader code cache
22:21imirkin: ok cool. so i can just ignore all those silly memory qualifiers.
22:22mwk: there are store buffers though
22:22imirkin: but without any sort of shader control
22:22mwk: and global memory definitely has those
22:22mwk: welllll... kind of
22:22imirkin: on nvc0+ LD/ST have those various flags
22:22mwk: cuda has a funky pseudo-instruction that is a "global memory barrier"
22:23imirkin: ah hm. membar or whatever?
22:23mwk: which is necessary if you want other MPs to observe stores in-order
22:23imirkin: i'm going to leave that one off until later :)
22:23imirkin: compute does have a memoryBarrier()
22:23imirkin: so that's probably good enough
22:23imirkin: (there are a few variants)
22:23mwk: except... it's not an actual instruction, and instead it assembles to a sequence that pokes a scratch global memory area in a funny sequence designed to flood the whole store buffer
22:24imirkin: oh ffs
22:29imirkin: mwk: i'm also having weird issues with cas
22:29imirkin: i have one test that fails which doesn't look at the returned results of the cas
22:29imirkin: whereas another one that passes which does
22:30imirkin: 000001c8: d38503fd e0c00789 exit cas b32 $r127 g7[$r1] $r5 $r0
22:30imirkin: is how it's being emitted in that case
22:31imirkin: i guess it could be some sort of lack of sync sort of thing?
22:31mwk: I vaguely remember that using r63/r127/wtvr with g-touching instructions was a very bad idea
22:31mwk: as in, out-of-bounds $r is *not* treated as 0/ignore in these particular insns
22:33imirkin: right ... but in this case, what could go wrong
22:33imirkin: it's the final instruction...
22:33mwk: dunno, a trap?
22:33imirkin: i'd see that in dmesg
22:33mwk: that or the hardware does a complete dumb and overwrites a $r in a new shader invocation later
22:33imirkin: i'll try to fudge it
22:33imirkin: and use $r0 in that case
22:33imirkin: and see if that fixes it...
22:33mwk: try changing it to write to $r0 and see if it still explodes?
22:36imirkin: you were right
22:36imirkin: emitting $r0 makes it pass
22:36imirkin: unfortunately that's not a workable general strategy...
22:36imirkin: will have to figure out how to convince RA to allocate a reg for that
22:37imirkin: thanks =]
22:37imirkin: mwk: if you could track down that memory barrier sequence, that'd be awesome
22:38mwk: I don't even have the toolchains installed anymore
22:38mwk: it might be in the IRC logs somewhere though
22:40mwk: hrm, seems not
22:41mwk: if you have a cuda toolchain capable of converting ptx to g80 assembly, you can try writing "membar.gl" and seeing what happens
22:42mwk: it should touch g15 a few times, I think 8 times, once per each memory partition in the max configuration
22:47RSpliet: If it's of any use, in OpenCL you can write inline assembly
22:47RSpliet: which accepts PTX
22:48RSpliet: Well, on the blob that is
22:50mwk: tbh using ptxas is much simpler
22:50mwk: ... but requires having old enough cuda setup
22:51mwk: (which actually supports g80)
22:51RSpliet: Yeah... either way you'll need that old cuda set-up
23:00emersion: imirkin: can you remind me which kernel i should try?
23:00emersion: just latest nouveau tree
23:08imirkin: emersion: old and new would be good
23:08imirkin: if possible
23:08imirkin: i tested on a GP108 with kernel 5.6 and it was fine
23:08imirkin: allegedly someone tested on a recent kernel with GK208B and it was not
23:09imirkin: (and GK104)
23:09imirkin: so chances are you'll run into the same issues with your GK208 with a new kernel
23:09imirkin: would be good to confirm that and also test on an older kernel (pre-5.9 probably?)
23:09imirkin: let me double-check when the disp macro thing landed
23:11imirkin: mwk: ok thanks
23:12imirkin: mwk: i'll check if mine supports g80 -- cuda 9.2 i think is what i have
23:14imirkin: emersion: yeah, looks like those landed in 5.9, so both pre-5.9 and 5.11 would be good to test with modetest + 256x256 cursors
23:14imirkin: (they were broken in the middle, so not worth testing)
23:28imirkin: pmoreau: out of the image_load_store.2d* tests, these are my current fails: https://paste.debian.net/plain/1187245
23:28imirkin: the qualifiers thing is expected due to lack of membar
23:28imirkin: nfc why those specific format reinterpret things fail. i suspect nothing to do with those tests, but rather more generic issues
23:30imirkin: but perhaps there's a very weird edge case in my format load/store code which is being hit
23:51imirkin: w00t! random banging on keyboard makes 3d images work
23:54RSpliet: been writing Shakespeare again?
23:55imirkin: got that like 10x before this worked