05:13 karolherbst[d]: sonicadvance1[d]: not really tho.. it's just hard outside of MMA. Like.. I understand it enough that I'm sure I can make it work in all cases, just need to set the right offsets
05:14 airlied[d]: I doubt there are many cases where apps would want that
05:14 airlied[d]: outside of MMA stuff
05:14 karolherbst[d]: it looks weird at first, but is pretty flexible on a ~~second~~ 10th look
05:15 sonicadvance1[d]: Oh, I meant more like sustaining maximum load through it since balancing RAM versus shared mem might be hard. But I guess that might just as well be up to the person writing the compute shaders 😄
05:15 karolherbst[d]: it's cheaper than using LDS at least
05:15 karolherbst[d]: or rather it gets rid of a loooot of address calculation
05:16 airlied[d]: the shaders in llama.cpp have been written by an nvidia engineer at least 😛
05:16 karolherbst[d]: can use the same base address and load entire matrices by changing the constant offset
05:16 airlied[d]: gfxstrand[d]: mhenning[d] anyone considered constant prop in nak?
05:17 airlied[d]: though I'm not sure it's exactly what I need, but we produce slightly stupid code for simple shader outputs
05:17 gfxstrand[d]: Yeah. We need legalize to know about constants.
05:17 airlied[d]: where we load a constant into a UR then copy the UR into a R
05:17 gfxstrand[d]: That's where most of the stupid comes from
05:18 airlied[d]: I think in this case it's due to reg_out being all GPRs, then after lower_par_copies there's no way to recover
05:18 airlied[d]: ur3 = copy 0x3f800000
05:18 airlied[d]: r3 = copy ur3
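A minimal sketch of the missing propagation, with made-up IR types (NAK's real SSA IR looks different): fold `ur3 = copy 0x3f800000; r3 = copy ur3` into `r3 = copy 0x3f800000`, so legalize can then decide whether the immediate fits into the consuming instruction.
```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Val {
    UReg(u8), // uniform register
    Reg(u8),  // GPR
    Imm(u32), // immediate
}

struct CopyInstr {
    dst: Val,
    src: Val,
}

/// Forward immediates through copy chains (assumes SSA: each dst is
/// written exactly once).
fn const_prop(prog: &mut [CopyInstr]) {
    let mut known: HashMap<Val, u32> = HashMap::new();
    for insn in prog.iter_mut() {
        match insn.src {
            Val::Imm(v) => {
                known.insert(insn.dst, v);
            }
            src => {
                if let Some(&v) = known.get(&src) {
                    // r3 = copy ur3  ->  r3 = copy 0x3f800000
                    insn.src = Val::Imm(v);
                    known.insert(insn.dst, v);
                }
            }
        }
    }
}
```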
09:28 karolherbst[d]: okay.. we need value range analysis to properly optimize ldsm 😄 it's going to be fun
09:28 karolherbst[d]: mostly just eliminating `iand`s with value ranges of sysvals
09:29 karolherbst[d]: but I can do that after cleaning up ldsm...
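A sketch of the `iand` elimination this would enable (hypothetical helper, not NAK code): once range analysis proves an upper bound on a source, say a thread id bounded by the workgroup size, an `iand` with a low mask that already covers that bound is a no-op.
```rust
/// True if `iand x, mask` cannot change `x`, given that `x <= value_max`.
/// Only handles contiguous low masks of the form 2^n - 1.
fn iand_is_noop(value_max: u32, mask: u32) -> bool {
    let is_low_mask = mask & mask.wrapping_add(1) == 0;
    is_low_mask && value_max <= mask
}

// e.g. a 32-wide workgroup: tid.x <= 31, so `iand tid, 0x1f` can be dropped
// iand_is_noop(31, 0x1f) == true
// iand_is_noop(63, 0x1f) == false (the mask actually truncates)
```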
14:06 snowycoder[d]: gfxstrand[d]: If we only propagate constant in legalize then we might have a DCE problem.
14:06 snowycoder[d]: A solution might be to allow immediates in GPR types and make legalize copy them to the appropriate registers
14:19 gfxstrand[d]: snowycoder[d]: We can DCE after legalize. That's fine.
15:13 snowycoder[d]: gfxstrand[d]: I'm reverse engineering whatever nvcc does with isetp and membars:
15:13 snowycoder[d]: There are interesting outputs:
15:13 snowycoder[d]: IMUL32I R4, R0, 0x3; sched=0x20
15:13 snowycoder[d]: MEMBAR.SYS; sched=0x2f
15:13 snowycoder[d]: Also:
15:13 snowycoder[d]: ISETP.NE.AND P0, PT, R4, 0x3, PT; sched=0x2c
15:13 snowycoder[d]: @!P0 BRA 0xa8; sched=0x2f
15:13 snowycoder[d]: So maybe `isetp` is 13 cycles (0xc+1) and membar just collides with predicate writes for unknown reasons?
15:15 gfxstrand[d]: I could believe that membar collides with predicates for mysterious reasons.
16:20 gfxstrand[d]: I think we're gonna land texdepbar for real...
16:20 gfxstrand[d]: 😄
16:51 snowycoder[d]: gfxstrand[d]: you're going to hate this.
16:51 snowycoder[d]: If we just need to check for queue overflows, we can keep a single `max: u8` for the whole queue: each push increments it by 1 and each flush culls it, no need to keep track of each register.
16:51 snowycoder[d]: Please tell me I'm wrong -.-
16:52 gfxstrand[d]: Uh... I don't think you're wrong.
16:52 snowycoder[d]: Writing documentation, the ultimate duck debugging
16:52 gfxstrand[d]: Yeah, I can't come up with a reason that's wrong.
16:53 gfxstrand[d]: snowycoder[d]: Yup!
16:53 gfxstrand[d]: But iteration is good. Every iteration makes it better.
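A minimal sketch of the single-counter idea (the type name and queue depth are placeholders, not the real hardware values): instead of a slot per register, keep one conservative bound on how many entries could be in flight.
```rust
/// Placeholder depth for illustration; not the real HW queue size.
const HW_QUEUE_DEPTH: u8 = 64;

struct DepQueue {
    /// Conservative upper bound on in-flight entries.
    max: u8,
}

impl DepQueue {
    /// Each texture op pushes one entry; the caller must insert a flush
    /// first if this would overflow the hardware queue.
    fn push(&mut self) -> Result<(), ()> {
        if self.max == HW_QUEUE_DEPTH {
            return Err(()); // would overflow; flush first
        }
        self.max += 1;
        Ok(())
    }

    /// A texdepbar that waits until at most `remaining` entries are
    /// outstanding culls the bound down to that value.
    fn flush(&mut self, remaining: u8) {
        self.max = self.max.min(remaining);
    }
}
```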
16:55 snowycoder[d]: Ok, since I need to rewrite a lot of code anyway: should we just use a rolling timestamp as I described in the TODO docs?
16:55 snowycoder[d]: It's a bit more complex to code but it makes `push/flush` O(1) instead of O(N).
17:23 snowycoder[d]: After some thought: No I don't want to handle an invalid state with rolling timestamps.
17:27 gfxstrand[d]: And merging with timestamps gets funky
17:58 karolherbst[d]: I see that Faith has fun looking at the coop matrix stuff 🙃
18:07 gfxstrand[d]: Trying to, yeah.
18:10 gfxstrand[d]: But lunch got in the way so I sent the RBs I had. I'm back now and reading again.
18:14 karolherbst[d]: ahh
18:14 karolherbst[d]: I see
18:14 karolherbst[d]: I just had 2.5 hours of WG meetings 🙃
18:15 magic_rb[d]: Productive?
18:15 karolherbst[d]: yes
18:15 magic_rb[d]: At least there is that
19:13 gfxstrand[d]: karolherbst[d]: I've got a bunch of style stuff I'm gonna squash into the cmat stuff before merging so don't apply my comments
19:13 karolherbst[d]: okay
19:23 karolherbst[d]: gfxstrand[d]: also.. might make sense to indeed disable coop matrix support for anything non Turing/Ampere/Ada, because there are for sure subtle changes, and now that Blackwell support is kinda in, I'd rather test on hw before enabling it.
19:23 gfxstrand[d]: Have we not tested on Blackwell?
19:23 karolherbst[d]: I don't have access to any
19:23 karolherbst[d]: maybe airlied[d] could
19:23 gfxstrand[d]: Okay, I'll do that
19:24 karolherbst[d]: thanks!
19:24 gfxstrand[d]: What's the test name glob?
19:24 karolherbst[d]: `'*cooperative_matrix*'`
19:25 karolherbst[d]: yeah soo.. IMMA is toast on blackwell
19:25 karolherbst[d]: turing has 8x8x16, ampere has that and a bunch of others
19:26 karolherbst[d]: Blackwell only 16x8x16 and 16x8x32
19:26 karolherbst[d]: so besides dealing with those, shouldn't be a huge issue, just... testing 🙂
19:27 karolherbst[d]: but it could also land with blackwell disabled first, and we CC: stable the enablement later...
19:28 gfxstrand[d]: Or I can fix Blackwell while I'm waiting for my Ampere run.
19:28 karolherbst[d]: I could also try to write the patch blind
19:28 karolherbst[d]: or that
19:28 karolherbst[d]: the `nvk: add support for 16x8x16 IMMA on Ampere+` patch highlights all the important areas anyway
19:29 karolherbst[d]: probably best to just not advertise those unsupported sizes if there is no lowering for it through `get_hw_nak_cmat_type`
19:30 karolherbst[d]: and just disable it in `nvk_GetPhysicalDeviceCooperativeMatrixPropertiesKHR`
19:30 gfxstrand[d]: karolherbst[d]: It looks like even 16816 is blowing up. 😢
19:30 karolherbst[d]: mhh
19:31 gfxstrand[d]: But it decodes fine
19:31 karolherbst[d]: maybe latencies 🙃
19:31 gfxstrand[d]: `imma.16816.u8.u8 r8, r6.row, r2.col, r8`
19:31 gfxstrand[d]: Nope. `ILLEGAL_INSTRUCTION_ENCODING`
19:31 karolherbst[d]: mhh
19:31 karolherbst[d]: let's see...
19:31 karolherbst[d]: uhhhh...
19:32 karolherbst[d]: it takes a uniform predicate input
19:32 karolherbst[d]: should be True by default
19:32 karolherbst[d]: it's to disable the third source
19:32 karolherbst[d]: hey....
19:32 gfxstrand[d]: What bits?
19:32 karolherbst[d]: so we can do a native matrix mul with that
19:32 karolherbst[d]: no idea 🙂
19:33 karolherbst[d]: HMMA also has that one
19:33 karolherbst[d]: but not sure why that would cause illegal instruction encoding...
19:35 karolherbst[d]: maybe it's something else?
19:38 gfxstrand[d]: Okay, this is really funky
19:38 gfxstrand[d]: It's a upred source alright but it's backwards
19:39 gfxstrand[d]: 0x0: imma.16816.u8.u8 r8, r6.row, r2.col, r8 ?wait1_end_group
19:39 gfxstrand[d]: 0x1: imma.16816.u8.u8 r8, r6.row, r2.col, r8, up6 ?wait1_end_group
19:39 gfxstrand[d]: 0x2: imma.16816.u8.u8 r8, r6.row, r2.col, r8, up5 ?wait1_end_group
19:39 gfxstrand[d]: 0x3: imma.16816.u8.u8 r8, r6.row, r2.col, r8, up4 ?wait1_end_group
19:39 gfxstrand[d]: 0x4: imma.16816.u8.u8 r8, r6.row, r2.col, r8, up3 ?wait1_end_group
19:39 gfxstrand[d]: 0x5: imma.16816.u8.u8 r8, r6.row, r2.col, r8, up2 ?wait1_end_group
19:39 gfxstrand[d]: 0x6: imma.16816.u8.u8 r8, r6.row, r2.col, r8, up1 ?wait1_end_group
19:39 gfxstrand[d]: 0x7: imma.16816.u8.u8 r8, r6.row, r2.col, r8, up0 ?wait1_end_group
19:39 gfxstrand[d]: 0x8: imma.16816.u8.u8 r8, r6.row, r2.col, r8, !upt ?wait1_end_group
19:39 gfxstrand[d]: 0x9: imma.16816.u8.u8 r8, r6.row, r2.col, r8, !up6 ?wait1_end_group
19:39 gfxstrand[d]: 0xa: imma.16816.u8.u8 r8, r6.row, r2.col, r8, !up5 ?wait1_end_group
19:39 gfxstrand[d]: 0xb: imma.16816.u8.u8 r8, r6.row, r2.col, r8, !up4 ?wait1_end_group
19:39 gfxstrand[d]: 0xc: imma.16816.u8.u8 r8, r6.row, r2.col, r8, !up3 ?wait1_end_group
19:39 gfxstrand[d]: 0xd: imma.16816.u8.u8 r8, r6.row, r2.col, r8, !up2 ?wait1_end_group
19:39 gfxstrand[d]: 0xe: imma.16816.u8.u8 r8, r6.row, r2.col, r8, !up1 ?wait1_end_group
19:39 gfxstrand[d]: 0xf: imma.16816.u8.u8 r8, r6.row, r2.col, r8, !up0 ?wait1_end_group
19:40 karolherbst[d]: huh
19:40 karolherbst[d]: funky
19:41 karolherbst[d]: ohh maybe to be compatible with old behavior? 🙃
19:41 gfxstrand[d]: yup
19:42 karolherbst[d]: anyway, seems Hopper also has that one
19:42 karolherbst[d]: and hopper also supports a different set of matrices, it's really fun
19:43 snowycoder[d]: karolherbst[d]: Old cards had a reversed predicate encoding? 0_o
19:43 karolherbst[d]: but in theory compatible with ampere enough
19:43 karolherbst[d]: snowycoder[d]: nope, they didn't have that predicate
19:43 gfxstrand[d]: snowycoder[d]: No. Old cards just set those bits to zero
19:43 karolherbst[d]: so the default behavior (predicate true) needs to encode as 0 to match it
19:43 gfxstrand[d]: So they got clever and made 0 mean pT
19:43 snowycoder[d]: ohhh, ahahahah
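Written out, the encoding the dump implies (inferred from that decode experiment, not from documentation): field value 0 means uPT, so the all-zero field old cards emitted still decodes as "predicate true"; 1 through 7 map to up6 down to up0, and bit 3 negates.
```rust
/// Encode the upred field as implied by the decode dump: None = uPT.
fn encode_upred(reg: Option<u8>, negate: bool) -> u8 {
    let idx = match reg {
        None => 0,              // uPT: matches the pre-Blackwell zero bits
        Some(r) => 7 - (r & 7), // up6 -> 1, ..., up0 -> 7
    };
    ((negate as u8) << 3) | idx
}

// encode_upred(None, false)    == 0x0  (uPT, backwards compatible)
// encode_upred(Some(6), false) == 0x1  (up6)
// encode_upred(Some(0), true)  == 0xf  (!up0)
```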
19:44 karolherbst[d]: anyway.. maybe I should just bother harder to get a blackwell gpu 🙃
19:44 karolherbst[d]: ohh also need a new PSU
19:44 karolherbst[d]: *sigh*
20:03 gfxstrand[d]: karolherbst[d]: Any idea why `imma.16816.u8.u8 r8, r6.row, r2.col, r8` might be wrong? I don't see what it would be violating
20:03 karolherbst[d]: no idea either
20:03 karolherbst[d]: alignment also seems correct
20:04 karolherbst[d]: sure it's the imma?
20:04 gfxstrand[d]: Yes. I hacked the encoder to encode a NOP instead and no illegal instructions
20:04 karolherbst[d]: mhhh
20:05 karolherbst[d]: then I have no idea 🙂
20:05 karolherbst[d]: they got rid of int4 support...
20:06 karolherbst[d]: .row and .col are fixed...
20:07 karolherbst[d]: I'm sure it's something incredibly silly, like maybe some special flag on the QMD that is new...
20:07 karolherbst[d]: they also added a bunch of those tensor MMA operations...
20:13 gfxstrand[d]: I think we have the number of registers wrong
20:13 gfxstrand[d]: That seems impossible, though.
20:18 karolherbst[d]: mhhh
20:18 karolherbst[d]: wrong in which sense?
20:20 karolherbst[d]: like in a 16x8x16 imma, Ra is 64 bits, Rb is 32 bits and Rc/Rd are 128 bits, so A is an int8 vec8, B is an int8 vec4 and Rc/Rd are int32 vec4s if I'm not mistaken
20:21 karolherbst[d]: same on ampere
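A quick sanity check of those sizes (my arithmetic, dividing the warp-wide 16x8x16 int8 fragments over 32 threads):
```rust
fn main() {
    const WARP: u32 = 32;
    let a_bits = 16 * 16 * 8 / WARP;  // A: 16x16 i8   ->  64 bits/thread (i8 vec8)
    let b_bits = 16 * 8 * 8 / WARP;   // B: 16x8  i8   ->  32 bits/thread (i8 vec4)
    let cd_bits = 16 * 8 * 32 / WARP; // C/D: 16x8 i32 -> 128 bits/thread (i32 vec4)
    assert_eq!((a_bits, b_bits, cd_bits), (64, 32, 128));
}
```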
20:22 karolherbst[d]: gfxstrand[d]: maybe consumer blackwells don't have it 🙃
20:22 karolherbst[d]: like.. I know this should be valid on SM10.0, but you got SM12.0
20:22 karolherbst[d]: soooo... maybe...
20:26 karolherbst[d]: dunno...
20:28 gfxstrand[d]: karolherbst[d]: I doubt it given how hard they're pushing neural rendering
20:28 karolherbst[d]: well there are still the tensor MMA ops
20:28 karolherbst[d]: which... are different
20:28 gfxstrand[d]: I'm also seeing issues with hmma
20:28 gfxstrand[d]: Which definitely exists
20:29 karolherbst[d]: what sort of issues?
20:29 karolherbst[d]: but yeah.. no idea... turing and ampere were close enough that nothing funky came up, but who knows what's different on blackwell...
20:29 gfxstrand[d]: faults
20:29 karolherbst[d]: mhhhh...
20:30 karolherbst[d]: mhhhhhhhh
20:30 gfxstrand[d]: It's like hmma is stomping more registers than we think
20:30 gfxstrand[d]: Or dependencies are very wrong
20:30 gfxstrand[d]: or something like that
20:30 karolherbst[d]: there is a new memory thing
20:30 gfxstrand[d]: But some hmma works
20:31 karolherbst[d]: your run on ampere is fine, right?
20:31 gfxstrand[d]: yup
20:31 karolherbst[d]: mhhh
20:32 karolherbst[d]: as I said there is a new tensor memory thing, maybe there needs to be something set up in the QMD or so
20:32 karolherbst[d]: strange tho...
20:33 karolherbst[d]: probably the moment where one needs to check what nvidia is doing
20:33 karolherbst[d]: I'm sure that the code is correct, because otherwise the benchmarks I've been working on would have caused really, really weird issues
20:34 gfxstrand[d]: 16x8x8 seems fine. 16x8x16 is blowing up. (hmma)
20:34 karolherbst[d]: like if the register size wouldn't be right, code like that would just blow up: https://gist.github.com/karolherbst/b536b173a762931b66604273877162fe#file-gistfile1-txt-L152-L198
20:35 karolherbst[d]: gfxstrand[d]: mhh
20:35 gfxstrand[d]: Yeah.
20:35 gfxstrand[d]: I believe it works on ampere
20:36 karolherbst[d]: anyway 🙂 hence me suggesting to disable on untested GPUs 🙃
20:36 airlied[d]: what test is causing it? I can dump nvidia in a while
20:36 gfxstrand[d]: Well, and we haven't flipped on Blackwell yet so we've still got a couple weeks to fix it
20:36 airlied[d]: official blackwell 3d headers have dropped
20:36 karolherbst[d]: ohh
20:36 gfxstrand[d]: What is .sp?
20:36 karolherbst[d]: something weird
20:37 karolherbst[d]: sparse matrix or something?
20:37 karolherbst[d]: like 50% is considered zero or so
20:37 karolherbst[d]: you can put the matrix in half the storage
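A sketch of how the packing halves storage, assuming this is the 2:4 structured sparsity from public NVIDIA material (an assumption, not taken from the instruction docs): every group of 4 elements keeps at most 2 nonzeros plus 2-bit metadata recording where they sat.
```rust
/// Pack one group of 4 elements into 2 values + 2 two-bit indices.
/// Assumes the group really is 2:4 sparse (at most 2 nonzeros).
fn pack_2_4(group: [i8; 4]) -> ([i8; 2], [u8; 2]) {
    let mut vals = [0i8; 2];
    let mut idxs = [0u8; 2];
    let mut n = 0;
    for (i, &v) in group.iter().enumerate() {
        if v != 0 && n < 2 {
            vals[n] = v;
            idxs[n] = i as u8; // fits in 2 bits
            n += 1;
        }
    }
    (vals, idxs)
}
```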
20:38 gfxstrand[d]: weird
20:38 karolherbst[d]: ldsm is even weirder
20:38 karolherbst[d]: it has a mode where you store 4-bit values as 2 bits
20:39 karolherbst[d]: I'm sure it matters if you know the content pretty well and saving memory bw is important
20:39 gfxstrand[d]: airlied[d]: And copy class?
20:39 airlied[d]: they are checking on it next
20:39 gfxstrand[d]: 👍🏻
20:47 mohamexiety[d]: karolherbst[d]: yeah that's sparse
20:53 gfxstrand[d]: airlied[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36143
20:53 gfxstrand[d]: Compiled on the first try
20:54 gfxstrand[d]: Doing a run now to make sure it didn't break anything but it looks like our stencil headers were right so I'm not worried
20:59 gfxstrand[d]: Okay, that's kepler and cmat assigned to Marge. I'll assign the blackwell headers once I get home, assuming CTS is happy.
20:59 gfxstrand[d]: My current blackwell cmat patches are in nvk/blackwell-cmat in my tree. I'll look into that more tomorrow unless Dave wants to take over.
21:00 airlied[d]: I'll see if I can look at it today
21:01 gfxstrand[d]: I'd really like our Mesa 25.2 story to not be "We support Blackwell XOR cmat!"
21:01 karolherbst[d]: 😄
21:01 gfxstrand[d]: But I'm also happy to treat anything we do as a bugfix and backport.
21:01 karolherbst[d]: not gonna be fast anyway
21:01 karolherbst[d]: but yeah...
21:01 gfxstrand[d]: You just need a big enough blackwell. 😛
21:01 karolherbst[d]: heh
21:01 karolherbst[d]: at least I got the RA stuff in, which was a significant boost
21:01 karolherbst[d]: but atm, just skipping membars I get 2.5x speed up 🙃
21:02 karolherbst[d]: without causing correctness issues
21:02 gfxstrand[d]: woof
21:02 karolherbst[d]: yeah...
21:02 karolherbst[d]: so the benchmark does a membar inside a loop
21:02 karolherbst[d]: and....
21:02 karolherbst[d]: well...
21:02 gfxstrand[d]: Yeah... membar in a loop is no fun for anyone
21:02 karolherbst[d]: we can't really optimize it after leaving derefs 🙂
21:03 karolherbst[d]: but also ldsm will be important for perf on shared mem
21:03 karolherbst[d]: I almost understand ldsm
21:03 gfxstrand[d]: heh
21:03 karolherbst[d]: soo
21:04 karolherbst[d]: a quad n loads from the address read from thread 8 * nr_matrix + n (ldsm can load 4 matrices), so e.g. the second quad (threads 4-7, n=1) loads matrix 3 from the address read from thread 25 🙂
21:04 karolherbst[d]: it's awesome
21:05 karolherbst[d]: loads 32 * size bits per thread, so 32, 64 or 128
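The addressing rule above, written out as a sketch (my reading of the description, not verified against docs):
```rust
/// Which thread's register holds the address for `matrix`, as seen from
/// `lane`: quad q (threads 4*q..=4*q+3) reads matrix m's address from
/// thread 8*m + q.
fn ldsm_addr_lane(matrix: u32, lane: u32) -> u32 {
    let quad = lane / 4;
    8 * matrix + quad
}

// e.g. threads 4-7 (quad 1) loading matrix 3 use the address from thread 25:
// ldsm_addr_lane(3, 5) == 25
```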
21:05 karolherbst[d]: anyway
21:05 karolherbst[d]: it's funky
21:05 karolherbst[d]: very funky
21:06 karolherbst[d]: but it perfectly aligns with the 8 and 16 bit matrices
21:07 karolherbst[d]: I _think_ it can also be used for 32 bit ones.. just need to be a bit more crazy with the offsets? not sure
21:07 karolherbst[d]: anyway 🙂
21:34 marysaka[d]: karolherbst[d]: I think I saw it used for 32bit ones too yeah
21:34 karolherbst[d]: yeah.. I think we just have to interleave the matrix
21:34 karolherbst[d]: like x2 to load a single 32 bit matrix
21:34 karolherbst[d]: and just mess up the offsets enough
21:34 marysaka[d]: exactly
21:35 karolherbst[d]: at least it _feels_ like we can do that
21:35 karolherbst[d]: just need to clean up the code so it's not a huge pita to do it...
23:36 x512[m]: Is it true that more documentation related to kernel drivers is actually available for Nvidia than for Radeon GPUs? And that Nvidia devs are more willing to answer technical questions than AMD?
23:47 mhenning[d]: I haven't worked with amd directly but I'm skeptical of those claims
23:51 x512[m]: For example, "RLC" is not documented anywhere: just the Linux source code and nothing else. No answers in the "#radeon" chat.
23:53 airlied: no I don't think either of those are true, but I'm sure there are certain places where one or other company does better
23:53 airlied: AMD started out with docs under NDA for a bunch of stuff so we could write drivers, but they don't really commit to documenting lower level stuff now that they mostly write the driver themselves
23:54 x512[m]: Nvidia's low-level official and unofficial documentation is so much better.
23:54 airlied: it's not really, try talking to their pmu fw
23:54 x512[m]: It is GSP so nobody cares?
23:55 airlied: like there's been more of a need to document nvidia stuff because they haven't released any code, but now with GSP I think there will be less documentation on the internals
23:55 airlied: like a bunch of stuff in open-gpu-doc that is there for older cards might not need to be explained as much for newer
23:56 airlied: but we have different relationships and trust levels with both companies, and things constantly change
23:59 x512[m]: This is the latest documentation I managed to find: https://www.x.org/docs/AMD/old/R5xx_Acceleration_v1.5.pdf
23:59 x512[m]: It does not mention RLC.