00:04gfxstrand[d]: <a:shrug_anim:1096500513106841673>
00:05gfxstrand[d]: The command streamer pre-fetches like mad so it might be better than DMAing from GART
00:24gfxstrand[d]: I doubt it's worth much for shader upload but that might make some difference for QMDs and the like.
00:28karolherbst[d]: yeah, for the QMD it makes perfect sense to have it inside the push buffer
00:29gfxstrand[d]: Yeah, it might be worth it for QMDs and CS root constants
00:30gfxstrand[d]: And it would give us something we could potentially tag somehow and decode in the push dumper
00:33gfxstrand[d]: But honestly, just putting uploaded stuff (and maybe not the push itself) in VRAM would probably do just as much
00:37karolherbst[d]: okay.. I think I'm done with LDSM.. I abused it way too much to make it work for int8 matrices, including A col major and B row major ones 🙃
00:38karolherbst[d]: though kinda funny to replace 16 load shared calls with a single ldsm... sometimes the vectorizer fails really badly to vectorize loads, though that's also because the constant offsets are hidden really really well in complex address calculations
00:38karolherbst[d]: ohh that reminds me.. https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36113
00:38karolherbst[d]: I still have a few more patches in the pipeline which move constants to the end, so it can be folded into load/store instructions
14:56karolherbst[d]: ohhhh wiring up `nir_opt_offsets` is fun
14:56karolherbst[d]: it helps a lot actually
15:03djdeath3483[d]: Careful with negative offsets
15:08karolherbst[d]: I know
15:09karolherbst[d]: I already added support for passing in a custom `nir_unsigned_upper_bound_config`, because I needed it
15:09karolherbst[d]: and I just set a smaller mask so negative offsets aren't an issue
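(For reference, a minimal sketch of what wiring up `nir_opt_offsets` conservatively can look like. The field names follow upstream `nir.h`, but the limit values are made up, and the custom `nir_unsigned_upper_bound_config` plumbing mentioned above is not shown:)

```c
#include "nir.h"

/* Sketch only: fold constant offsets into load/store bases, with
 * limits small enough that a folded offset can never go negative. */
static bool
run_opt_offsets(nir_shader *nir)
{
   const nir_opt_offsets_options opts = {
      .shared_max = 0x1000,       /* hypothetical shared-mem window */
      .buffer_max = 0x100000,     /* hypothetical global window */
      .allow_offset_wrap = false, /* refuse folds that could wrap */
   };
   return nir_opt_offsets(nir, &opts);
}
```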
15:09karolherbst[d]: but
15:09karolherbst[d]: currently I'm looking at this: https://gist.githubusercontent.com/karolherbst/b48e0e384329be870d148e5156e6f712/raw/fc9f389b9605692f7ab52cca1a63041a47a9dca8/gistfile1.txt
15:10karolherbst[d]: and the `%264 = iadd %259, %78 (0x280)` could also be folded in.. I just have no good idea how to code it up so the pass does it
15:11karolherbst[d]: so load_subgroup_id can return at most 7 in my case, which makes the `iand` either 0x4 or 0, meaning nothing in that chain overflows or wraps
15:12karolherbst[d]: the .nuw on the last iadd doesn't really give me much? Or can I assume the entire chain won't wrap?
15:12karolherbst[d]: only for the tagged instruction, right?
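(Right: `.nuw` is a per-instruction flag, so each `iadd` in the chain needs its own no-wrap proof. A sketch of getting that proof from NIR's range analysis; the helper is hypothetical, and the bounds match this shader: 256 invocations at subgroup size 32, so `load_subgroup_id` is at most 7.)

```c
#include "nir.h"
#include "util/hash_table.h"

/* Sketch only: check that folding a constant into this iadd's base
 * can't wrap, given workgroup bounds that cap load_subgroup_id at 7. */
static bool
iadd_cannot_wrap(nir_shader *nir, nir_alu_instr *iadd, uint64_t const_off)
{
   const nir_unsigned_upper_bound_config cfg = {
      .min_subgroup_size = 32,
      .max_subgroup_size = 32,
      .max_workgroup_invocations = 256, /* => subgroup_id in [0, 7] */
   };
   struct hash_table *range_ht = _mesa_pointer_hash_table_create(NULL);
   uint32_t max = nir_unsigned_upper_bound(nir, range_ht,
                                           nir_get_scalar(iadd->src[0].src.ssa, 0),
                                           &cfg);
   _mesa_hash_table_destroy(range_ht, NULL);
   /* .nuw on a later iadd proves nothing about this one. */
   return (uint64_t)max + const_off <= UINT32_MAX;
}
```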
15:29karolherbst[d]: okay.. perfect
15:55karolherbst[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36113/diffs?commit_id=ac3457bcb6283b1ca95edf62440ef5a66bdfade9 mhhhhhh
15:56karolherbst[d]: the pass doesn't do it for scratch mem, which might also be interesting...
15:57karolherbst[d]: mhh scratch doesn't have a base...
15:57karolherbst[d]: annoying
17:09kar1m0[d]: am I the only one noticing a performance regression after the nvidia driver update?
17:09karolherbst[d]: win some lose some
17:16chikuwad[d]: is it a real perf regression or is it your system compiling shaders on first run on the new driver
17:19kar1m0[d]: chikuwad[d]: no I mean fps drops, screen freezes and lags
17:19chikuwad[d]: that sounds like performance choking due to shader compiles
17:19kar1m0[d]: probably
17:35karolherbst[d]: `TILE_M=256 TILE_N=256, TILE_K=32 BColMajor=0 workgroupSize=256 61.024311 TFlops` much speed
18:08karolherbst[d]: I wonder if we can give the load/stores on matrices proper alignment information...
18:08karolherbst[d]: might even allow merging accesses across matrices
18:12karolherbst[d]: ` 64 %462 = deref_cast (InputC *)%31 (global InputC) (ptr_stride=0, align_mul=0, align_offset=0)` mhh yeah.. that's not great 🙂
18:13karolherbst[d]: right.. we only set the alignment to the base type thing...
18:13karolherbst[d]: okay.. this is gonna be even greater
18:35karolherbst[d]: uhhhhh
18:35karolherbst[d]: I see what's going on... mhh we can make that work
18:37karolherbst[d]: I think we need to deref_cast each matrix element load and give it explicit alignment information, since explicit_io can't infer the alignment when the index isn't constant
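(A sketch of that idea. `nir_build_deref_cast` and the `cast.align_mul`/`cast.align_offset` fields are the real upstream bits; the helper and the example alignment are made up, and the caller would build its loads through the returned cast:)

```c
#include "nir_builder.h"

/* Sketch only: re-cast a matrix-element deref and stamp explicit
 * alignment on the cast, so nir_lower_explicit_io can emit aligned
 * accesses even though the element index is dynamic. */
static nir_deref_instr *
align_matrix_elem_deref(nir_builder *b, nir_deref_instr *elem,
                        unsigned align_mul, unsigned align_offset)
{
   nir_deref_instr *cast =
      nir_build_deref_cast(b, &elem->def, elem->modes, elem->type,
                           0 /* ptr_stride */);
   cast->cast.align_mul = align_mul;       /* e.g. 16 */
   cast->cast.align_offset = align_offset; /* byte offset within align_mul */
   return cast;
}
```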
18:37karolherbst[d]: mhh
18:37karolherbst[d]: not sure we can do much instead of the value groups 😄
19:38teronimozuck: You do not even understand that I was correct in all matters, cause all it does is *1 *2 *3 and some constant algebra aka it's embarrassing as to how you behave 441+114+57−57−75−270=210 aka 96+114 so *2 is 192+228-135-114-114=57 , hence the earlier methods are obsolete. So all in all I synthesized everything out already. My work has been highest profile and highest production in terms of
19:38teronimozuck: contributions not the other way around, you see my grandfather always said, the line between a genius and an idiot is incredibly thin, where he always thought I was a genius and a hero and the biggest talent, keeping all my wins from youth from newspapers on his desk, until he died sadly. His words are for your defense, I say you are just a bunch of abusive fecalist fuckers.
20:43teronimozuck: So technically even though some earlier methods worked also, I never gave anything out in full as I never wanted to support such wars as Russian Federation vs Ukraine, I considered both of those lines our own ethnics, and fairly intelligent people in general hailing from there. But it is what it is, we still landed in war3, my lines are also coming to crack up scammers who destroyed me
20:43teronimozuck: however, totally nonsense wankspammers and cheaters. A payback is remorseless kind, it's genuinely justified and ruthless take on those people around the world to be downed. I expect huge amount of people gathering to commit this all, my friends around the world, final phase might be the defending Estonian independence indeed. We do not start any additional conflict our own as always, and I
20:43teronimozuck: believe the best way is always not to have any wars.
20:45soreau: not even AI could make this stuff up :P
20:48airlied: he just needs a rlhf
20:58karolherbst[d]: airlied[d]: wanna review and test some nice perf opts for MMA? 😄
21:01karolherbst[d]: https://gist.githubusercontent.com/karolherbst/ee24456b8010ba20696c6876eadf4b74/raw/8ee6689a3c6a486e9ad1056868ea84cac5aa14aa/gistfile1.txt
21:01karolherbst[d]: need to nuke some of those shifts tho...
21:02airlied[d]: oh nice, I've had some success in the past removing the LEA optimisations
21:02karolherbst[d]: that's nicer to read: https://gist.githubusercontent.com/karolherbst/a44c1207432da63590b908889ff9534a/raw/e5876e08d72aa11e4b08cc69865536c776113817/gistfile1.txt
21:03karolherbst[d]: I wonder if I want to merge those `ldsm.16.m8n8.trans.x2` 🙃
21:03karolherbst[d]: but that's gonna make RA unhappy
21:03karolherbst[d]: airlied[d]: I looked at your code, threw it away and came up with this instead: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36113 😄
21:03airlied[d]: I think ideally you'd interleave the ldsm and the hmma a bit more
21:04airlied[d]: everyone piling into r80..82 seems oversubscribed
21:04karolherbst[d]: it's a huge shader
21:05karolherbst[d]: but yeah...
21:05airlied[d]: I'll give the branch a spin today, I did try that other barrier opts branch, but it didn't seem to make much difference on its own
21:05karolherbst[d]: https://gist.githubusercontent.com/karolherbst/898b234cfba16fbefaa51298e2efac97/raw/d88c6f65747e4fbcb68ce06e92a2acba2e3c0d0c/gistfile1.txt is the full one
21:06karolherbst[d]: uhm..
21:06airlied[d]: I started playing with coopmat2 reductions on radv, at least figured out how to initially do indirect functions
21:06karolherbst[d]: `nak/ldsm_opts` has integrated a few MRs and some safer WIP opts
21:06karolherbst[d]: *branch on my fork
21:06karolherbst[d]: anyway
21:07karolherbst[d]: got LDSM working on pretty much everything now
21:07karolherbst[d]: except col major A/row major B x4 loads
21:07karolherbst[d]: but also did the work to make it work on int8
21:08airlied[d]: can col major x4 loads work? or would you need a transpose?
21:08karolherbst[d]: I think it just needs very smart address calc
21:09karolherbst[d]: got it working for x2 at least: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36363/diffs#2811255f0166cfecac738d47dd92cdc0a7a25c96_571_636
21:09airlied[d]: I'm not caffeinated enough for matrix layouts
21:09karolherbst[d]: 😄
21:09karolherbst[d]: it's just tiling
21:09marysaka[d]: that's a mood
21:10karolherbst[d]: like it works for B col major x4
21:10karolherbst[d]: just not A col major x4
21:10airlied[d]: I'd maybe have to look at the nvidia dumps as well to see what they do
21:10karolherbst[d]: I'm sure it's some trivial shift somewhere
21:10airlied[d]: well B col major is just A row major
21:10karolherbst[d]: wellll...
21:10karolherbst[d]: not quite
21:10airlied[d]: I've got a vague memory of seeing some MOVM pop up in the nvidia shaders
21:10karolherbst[d]: yeah.. I haven't spent the time to figure out MOVM yet
21:11karolherbst[d]: like conceptually it's pretty basic, but it feels like it's useful for magic stuff
21:11karolherbst[d]: like.. LDSM with transpose should be equal to ldsm without it plus a movm
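(In symbols, with $S$ a shared-memory tile, the claim is $\mathrm{LDSM}_{\mathrm{trans}}(S) = \mathrm{LDSM}(S)^{\mathsf{T}} = \mathrm{MOVM}(\mathrm{LDSM}(S))$. Assuming MOVM really is a pure register-tile transpose, it only buys you something where a transposed LDSM variant doesn't exist.)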
21:11karolherbst[d]: so not really sure it's gonna help there
21:17airlied[d]: oh the docs say MT1616 is only valid for x1/x2
21:19karolherbst[d]: MT1616?
21:19karolherbst[d]: ohh blackwell
21:19karolherbst[d]: yeah..
21:19karolherbst[d]: uhm..
21:19karolherbst[d]: we just ignore all the non M(T)88 versions on ldsm for now
21:20karolherbst[d]: they do size expansion and it's weird
21:20karolherbst[d]: not useful for the normal mma stuff at all
21:21karolherbst[d]: mhhh.. it seems blackwell LDSM can do MT1616 without expansion?
21:21karolherbst[d]: ohhh...
21:21karolherbst[d]: nevermind 😄
21:21karolherbst[d]: you can either do M88.16 or MT1616.8
21:22karolherbst[d]: that feels almost pointless
21:22karolherbst[d]: mhh..
21:22karolherbst[d]: makes the address calc less awkward I guess
21:23airlied[d]: is STSM also blackwell only
21:23airlied[d]: ?
21:23karolherbst[d]: yeah
21:24karolherbst[d]: only got a turing and an ampere here, so not gonna touch that stuff
21:24karolherbst[d]: well technically STSM is hopper+
21:25karolherbst[d]: the MT1616 LDSM is a bit odd.. it only supports x1 and x2 (guess because 32x32 int8 isn't a thing on hardware)
21:26karolherbst[d]: it _feels_ like you can nuke a shift and/or an `or` with it, but not much else
21:26karolherbst[d]: like this part: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36363/diffs#2811255f0166cfecac738d47dd92cdc0a7a25c96_571_640
21:26karolherbst[d]: the `if (ldsm_count == 4) {` branch inside the 2 && col major thing
21:27karolherbst[d]: so you won't have to switch bit 0x10 down to 0x8 anymore
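(Illustratively, "switching bit 0x10 down to 0x8" amounts to a fix-up of this shape; the helper is hypothetical and the masks come from the discussion above:)

```c
#include <stdint.h>

/* Sketch only: fold bit 0x10 of a shared-memory offset down onto
 * bit 0x8, as the x4 col-major address calc otherwise has to do. */
static inline uint32_t
fold_bit_0x10_to_0x8(uint32_t addr)
{
   return (addr & ~0x10u) | ((addr & 0x10u) >> 1);
}
```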
21:27airlied[d]: maybe once the base stuff lands, I can do the blackwell bits
21:28karolherbst[d]: STSM is gonna be more helpful than LDSM.MT1616
21:32karolherbst[d]: anyway.. not gonna spend hours trying to make super corner cases work for LDSM, if nothing needs the perf there...
21:35karolherbst[d]: anyway.. I'll benchmark against nvidia a bit and see where we are at in vk_cooperative_matrix_perf
21:35karolherbst[d]: mhh though might be unfair if it uses VK_NV_cooperative_matrix2 under the hood 😄
21:36karolherbst[d]: ohhh.. getting close
21:40karolherbst[d]: `TILE_M=256 TILE_N=128, TILE_K=64 BColMajor=1 workgroupSize=256 89.182372 TFlops`
21:48karolherbst[d]: the closest sub-benchmark is at like 70% of blob perf
21:50karolherbst[d]: as long as it's not fp32 our perf doesn't suck 😄
21:55karolherbst[d]: nuking the last membar gives me 10%, that's not gonna cut it
21:56karolherbst[d]: looking at it logically, everything outside the loop doesn't matter, which leaves us this: https://gist.githubusercontent.com/karolherbst/989d973aa294dd9bd04970cd8025aab3/raw/179c7034af5bb603ca4e7ebc29080a63006723df/gistfile1.txt
21:56karolherbst[d]: block b7 is already great
21:58karolherbst[d]: nak: https://gist.github.com/karolherbst/98cf840cdb1ff4218ba5a8ea78be3b5d
21:58karolherbst[d]: the ldsm block is looking great, so not sure there is a lot of potential there
21:58karolherbst[d]: might want to reorder things for RA reasons, but....
21:59karolherbst[d]: the bar and membar is gonna hurt
22:00karolherbst[d]: uhhh
22:00karolherbst[d]: nuking the bar 45 -> 60 tflops
22:01karolherbst[d]: but correctness screams at me 😄
22:01karolherbst[d]: ohh it's just about the membar
22:03karolherbst[d]: `TILE_M=256 TILE_N=128, TILE_K=64 BColMajor=1 workgroupSize=256 97.164336 TFlops` for the int matrix
22:04karolherbst[d]: I hate those shifts...
22:04karolherbst[d]: I'm sure they are like 10% perf alone
22:10magic_rb[d]: So 90%? 70+10+10
22:15mangodev[d]: i wonder
22:15mangodev[d]: how's the progress on WSI stuff? last i heard, there was some stuff to make Zink with X11 a little less janky
22:15mangodev[d]: did that actually land, or was it just progress towards getting the WSI path working on X?
22:16karolherbst[d]: yeah.... membars and barriers are where we lose a lot of perf...
22:16mangodev[d]: fair
22:17gfxstrand[d]: mangodev[d]: It landed. It's all in 25.2.
22:17mangodev[d]: weirdly enough, i actually don't get too bad of perf anymore, especially in ogl
22:17mangodev[d]: my compositor just can't keep up with handling a gpu-heavy application and compositing at the same time :P
22:18mangodev[d]: what about wayland stuff? i remember hearing that zink (pre-MR) was closer to working on X than on wayland afaik
22:18mangodev[d]: i'm on a wayland compositor rn, so maybe that's why i haven't seen a ton of improvement from that commit
22:18airlied[d]: karolherbst[d]: I was hoping that spirv/nir change would help, and maybe it does a bit, but it doesn't seem to get rid of the worst bits
22:19karolherbst[d]: airlied[d]: which change specifically?
22:20mangodev[d]: i would love to test more 3D games once the membar, barrier, and zeta compression changes roll around, but i'm honestly primarily concerned about wayland wsi, since it causes very noticeable stutter and lag when the CPU isn't the bottleneck (which is a lot of the time)
22:21karolherbst[d]: let's see if 33683 helps 🙃
22:21karolherbst[d]: I mean 33306
22:21airlied[d]: the radv led one I cc'ed you and gfxstrand[d] on
22:21airlied[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36080
22:23karolherbst[d]: the inst scheduling pass makes it worse 😢
22:25karolherbst[d]: I think it doesn't help at all here
22:25karolherbst[d]: yep.. nothing
22:25karolherbst[d]: the issue isn't really that anyway
22:26karolherbst[d]: the issue is that the barrier opt passes only work with derefs
22:32karolherbst[d]: mhh that one also does global loads within the loop mhhh
23:54karolherbst[d]: another curious thing is that with the nvidia driver my GPU runs a lot hotter 🙃
23:58soreau: smells like a recipe for another Michael Larabel article