12:27fdobridge: <karolherbst> you mean a normal cbuf?
12:27fdobridge: <karolherbst> but yes, you are correct
12:28fdobridge: <karolherbst> they either take immediates or uniform register/predicates as inputs
12:29fdobridge: <karolherbst> well bindless cbuf also doesn't work :ferrisUpsideDown:
13:23fdobridge: <pac85> Does accessing cbuf have the same limitations as using uniform registers wrt concurrent access from diverging threads?
13:35fdobridge: <karolherbst> no
15:00fdobridge: <karolherbst> btw, the new bridge I set up runs in #Rusticl now, and I'm kinda planning on moving all channels to that once I clean up the setup and everything
15:27fdobridge: <redsheep> I get the sense from the growing number of new branches over at gfxstrand mesa that this performance rabbit hole is continuing to get deeper
15:58fdobridge: <gfxstrand> lol
15:58fdobridge: <gfxstrand> Nah, it's the same 3 pieces it always was. It's just sometimes you have to re-organize a bit.
16:34fdobridge: <gfxstrand> Where is that list of all opcodes someone got out of fuzzing the asm? @mhenning, maybe?
16:36fdobridge: <mhenning> @gfxstrand here https://gitlab.freedesktop.org/mhenning/re/-/tree/main/opclass?ref_type=heads
16:37fdobridge: <mhenning> Although I think it's missing a bunch of ops that operate on uniform regs because you need to set additional bits outside of the opcode for some of them
16:55fdobridge: <gfxstrand> Yup. Bit 91. I just fixed the fuzzer. I'll send you an MR in a bit.
16:57fdobridge: <gfxstrand> Just waiting for a full run to complete
17:00fdobridge: <gfxstrand> Ooh! What is this `stg_uniform` instruction? 😁
17:01fdobridge: <karolherbst> you wanna guess or you wanna know? 😄
17:01fdobridge: <karolherbst> it's something super funky
17:01fdobridge: <karolherbst> I _think_ it's STG with a descriptor
17:02fdobridge: <karolherbst> the memory descriptor is used for cache control
17:03fdobridge: <gfxstrand> There's also a nanosleep?
17:03fdobridge: <karolherbst> yeah
17:03fdobridge: <gfxstrand> Okay, I'm here for UPRMT. I shouldn't get distracted
17:26fdobridge: <gfxstrand> Stupid fadd...
18:39fdobridge: <mhenning> Yeah, I think there are actually two different bits - it might not just be 91
18:41fdobridge: <mhenning> There's an LDG/STG form where the address is reg + ureg + immediate, but that form is only on ampere, not turing
18:43fdobridge: <mhenning> CUDA likes using it even when there's no ureg used on ampere (there's a bit that enables/disables the ureg src) even though the old non-ureg turing opcodes also work fine
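The addressing form described above (register plus uniform register plus immediate, with a bit that drops the uniform source) can be sketched as plain arithmetic. This is only an illustration: `effective_addr` is a hypothetical helper, and the 64-bit wrapping arithmetic and the signed immediate are assumptions, not something confirmed in the discussion.

```rust
/// Hypothetical model of the reg + ureg + immediate address form.
/// `ureg` is `None` when the "disable ureg" bit is set, so the
/// uniform register contributes nothing to the address.
/// Assumption: 64-bit wrapping arithmetic and a sign-extended immediate.
fn effective_addr(reg: u64, ureg: Option<u64>, imm: i32) -> u64 {
    reg.wrapping_add(ureg.unwrap_or(0))
        .wrapping_add(imm as i64 as u64) // sign-extend, then wrap
}
```

With the uniform source disabled, this reduces to the plain reg + immediate address, which would be consistent with the remark that the older non-ureg opcodes also work fine.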
18:50fdobridge: <mhenning> Looking at old code, the ureg form of LDG on ampere is:
18:50fdobridge: <mhenning> ```
18:50fdobridge: <mhenning> + self.set_opcode(0x981);
18:50fdobridge: <mhenning> + self.set_field(90..92, 3_u8); // Part of opcode?
18:50fdobridge: <mhenning> + self.set_bit(76, true); // Disable ureg in addr
18:50fdobridge: <mhenning> ```
18:50fdobridge: <mhenning> So we should probably be trying all combinations of bit 90 and 91 in the fuzzer
18:54fdobridge: <mhenning> @gfxstrand ^
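A minimal sketch of what "trying all combinations of bit 90 and 91" could look like, assuming a 128-bit instruction word held in a `u128`. The names `set_field` and `bit_90_91_variants` are hypothetical and only mirror the snippet above; this is not the actual fuzzer or the NAK encoder.

```rust
use std::ops::Range;

/// Hypothetical helper: write `value` into the half-open bit range
/// `range` of a 128-bit instruction word, leaving other bits alone.
fn set_field(inst: &mut u128, range: Range<u32>, value: u128) {
    assert!(range.start < range.end && range.end <= 128);
    let width = range.end - range.start;
    let low_mask = u128::MAX >> (128 - width);
    assert!(value & !low_mask == 0, "value does not fit in field");
    *inst = (*inst & !(low_mask << range.start)) | (value << range.start);
}

/// Produce the four encodings that differ only in bits 90 and 91.
fn bit_90_91_variants(base: u128) -> Vec<u128> {
    (0..4u128)
        .map(|v| {
            let mut inst = base;
            set_field(&mut inst, 90..92, v); // bits 90 and 91 only
            inst
        })
        .collect()
}
```

Calling `bit_90_91_variants(base)` yields four candidate words, one per value of the two-bit field. Whether each one is a valid instruction is what the fuzzer would then have to check.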
19:12fdobridge: <gfxstrand> Oh...
19:12fdobridge: <gfxstrand> I didn't know 92 had anything to do with it
19:12fdobridge: <gfxstrand> Let's throw 92 through the fuzzer as well
19:18fdobridge: <mhenning> I'm not aware of 92 doing anything - I think it's just 90 and 91
19:36fdobridge: <selaaaa> quick thing to add, that range isn't inclusive (`..=` is)
19:36fdobridge: <selaaaa> ~~maybe you got the bit 92 information from some other part; I'm not well versed in the nouveau code~~
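The point about range syntax can be checked directly in Rust: `90..92` is half-open and covers only bits 90 and 91, while `90..=92` would also include bit 92. The function names below are made up for the illustration.

```rust
/// Bits covered by the half-open range used in the snippet.
fn half_open_bits() -> Vec<u32> {
    (90..92).collect() // end is excluded
}

/// Bits an inclusive range with the same endpoints would cover.
fn inclusive_bits() -> Vec<u32> {
    (90..=92).collect() // end is included
}
```

So `set_field(90..92, 3_u8)` in the snippet touches a two-bit field, which matches the value `3` filling both bits.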
19:37fdobridge: <gfxstrand> Of course UMOV wants to be special...
19:38fdobridge: <gfxstrand> Why is UMOV encoded as a non-uniform ALU?
19:38fdobridge: <gfxstrand> 😩
19:39fdobridge: <dadschoorse> can it read vector register?
19:50fdobridge: <gfxstrand> No
19:50fdobridge: <gfxstrand> It's just encoded funky
19:51fdobridge: <gfxstrand> One of these days, someone should write some encoding unit tests that check NAK against the CUDA disassembler
19:55fdobridge: <magic_rb.> I'm a bit late with this realization, but the machine code that nvk ends up generating is the same as cuda's, so both nvk and cuda are compilers for nvidia gpus
19:55fdobridge: <magic_rb.> thanks for coming to my ted talk
19:55fdobridge: <gfxstrand> 😁
20:23fdobridge: <magic_rb.> It's actually super cool, I should really find time, get an old-ish gpu and just like, write my own compiler/driver for the thing
20:23fdobridge: <magic_rb.> Just so i can like, flip an image up side down or smth dumb like that
20:23fdobridge: <magic_rb.> More than enough
21:17fdobridge: <gfxstrand> compilers are fun
21:17fdobridge: <gfxstrand> If you want to hack on NAK, Maxwell and Kepler both still need some love
22:39fdobridge: <gfxstrand> Ugh... I've got a bug with dependency tracking. something to do with uniform and warp ops not waiting on each other or similar. That's not scary at all....
22:45fdobridge: <redsheep> As NVK gets faster it is becoming more frequent that the gaming experience is being held back by my displays not running at full refresh rate. If I ever get enough time I really want to sit down and try to figure out being a kernel developer just long enough to implement FRL and improve modesetting
22:49fdobridge: <redsheep> Testing against the nvidia driver right now I am pretty sure that at least half of the perceptual difference on my system comes down to things that need improvement in the kernel, not in just more raw frames from nvk
22:51fdobridge: <mohamexiety> yeah I was planning to take a closer look when I get more time a bit later (and a display that could replicate it). theoretically it shouldn't be _too_ bad as the GSP does take care of a lot of annoyances and you can compare with openrm for some functionality, but no way to be sure. first though exams then NVK
22:52fdobridge: <mohamexiety> then afterwards can try digging in the kernel stuff
22:54fdobridge: <redsheep> Yeah so I have skimmed all the openrm code that at least touches FRL and I don't think there's more than about a thousand relevant lines, probably much less
22:55fdobridge: <redsheep> In reality I think there would probably only need to be a couple hundred new/changed lines in nouveau but I don't understand our KMD enough yet to do it
22:56fdobridge: <mohamexiety> the neat part too is that this should be GSP exclusive so you don't need to touch things related to older cards (afaik Pascal caps out at HDMI 1.4 and DP1.2?)
22:56fdobridge: <mohamexiety> depends how our kernel handles it though yeah
22:57fdobridge: <redsheep> I think pascal probably has 2.0ish, but yeah not FRL for sure
23:31fdobridge: <gfxstrand> Something with memory ops not waiting properly on `MOV RX URX`
23:43fdobridge: <gfxstrand> I wonder if uniform ops secretly execute in lane0 or something like that
23:44fdobridge: <gfxstrand> And the deps are also only on lane0