00:39 gfxstrand[d]: Yeah, it's gone on Blackwell
03:30 sonicadvance1[d]: Huh, I'm surprised it was removed.
04:47 anarsoul: are there any known issues with zink/nvk (probably hybrid graphics-related) and firefox?
04:48 anarsoul: I lose gpu accel in firefox-142 if I have nvk (or rather nouveau in general) installed
04:49 chikuwad[d]: NVK/zink doesn't support VA-API yet
04:49 anarsoul: I'm not talking about video decoding
04:50 anarsoul: it switches to software rendering, as in about:support says Compositing: WebRender (Software)
04:50 anarsoul: and webgl stops working
04:51 anarsoul: anyway, blacklisting nouveau kernel module "fixes" it
07:05 karolherbst[d]: sonicadvance1[d]: probably just doing it in hardware, because I don't see why it would be expensive to do so honestly 🙃
07:31 karolherbst[d]: snowycoder[d]: yeah, but the rules are like quite simple
07:32 karolherbst[d]: just need to know what's what
09:46 huntercz122[d]: anarsoul: I think Firefox has some whitelist for drivers? It also wasn't enabled on prop Nvidia some years ago.
16:39 anarsoul: huntercz122[d]: it shouldn't break iGPU though
16:44 mhenning[d]: I have firefox running on nvk+zink and it has accelerated compositing and webgl working
16:44 mhenning[d]: not sure what you're running into
17:04 leftmostcat[d]: Should I rebase https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34388 for discussion or should it just be closed?
17:18 mhenning[d]: leftmostcat[d]: I think it's worth rebasing. I've been meaning to look at it more closely, sorry for the delay
17:19 leftmostcat[d]: No problem at all. It's not exactly a high priority change either way. 🙂
17:31 leftmostcat[d]: Rebased.
18:44 karolherbst[d]: `{%r74 %r75 %r76 %r77} = ld.global.a64.strong.gpu.b128 [{%r72 %r73}+{%ur34 %ur35}]` getting there
18:47 karolherbst[d]: noooo legalize lowers it obviously 🙃 annoying
18:48 karolherbst[d]: ohh I think I'll need to torch `legalize_ext_instr` 😄
18:48 karolherbst[d]: at least on some ops
19:13 karolherbst[d]: mhhh need to deal with this: https://gist.github.com/karolherbst/a1e679f2fa9492095130366473d700eb
19:14 karolherbst[d]: that's going to be a fun opt pass to write
19:15 karolherbst[d]: but anyway, it already works for load_global: https://gist.githubusercontent.com/karolherbst/d3d2f28f75e130327ef3b77e60510ed1/raw/4a00c173bcbf951e1bbccd0ba5fc0f435b727571/gistfile1.txt 🙂
19:16 karolherbst[d]: `LDG.E.128.STRONG.GPU R12, [R8.64+UR4]` e.g.
19:17 karolherbst[d]: some of the gpr values I _think_ can be 32 bits, but that means I need to be able to prove it's safe
19:17 karolherbst[d]: will probably require some uub smartness? dunno
19:29 gfxstrand[d]: leftmostcat[d]: Please rebase and thanks for the reminder! I've been meaning to ack and merge, I just forgot
19:31 gfxstrand[d]: I was going to give it a quick skim, too, mostly for sanity. But if Mel wants to do that and merge, go ahead.
19:48 leftmostcat[d]: Merge conflicts are for reminding us that we have outstanding MRs. 😅
20:01 gfxstrand[d]: Hehe
20:02 gfxstrand[d]: I'm still kinda mind-boggled that it's faster but I'll take it.
20:07 karolherbst[d]: probably caches or something
20:07 karolherbst[d]: `Box` is another indirection after all
20:08 karolherbst[d]: We should wire up `ralloc` as a custom rust at some point
20:08 karolherbst[d]: once it's stable and all
20:08 karolherbst[d]: *custom allocator
20:11 mhenning[d]: why? isn't ralloc just for tracking memory lifetimes? I thought it didn't have any benefit for cache-friendliness
20:12 karolherbst[d]: mhhh somehow I thought it was more cache friendly... but apparently not
20:15 leftmostcat[d]: I'm not sure if the speed difference is statistically significant since my testing wasn't thorough. I'm just glad it's not _slower_.
20:16 karolherbst[d]: well it is one memory allocation less
20:18 karolherbst[d]: and iterating over `Vec<T>` is just faster than `Vec<Box<T>>` anyway
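[editor's note: the `Vec<T>` vs `Vec<Box<T>>` point is easy to picture with a minimal standalone sketch — illustrative only, not NAK code:]

```rust
// Illustrative sketch: Vec<Box<T>> stores pointers, so every element
// access chases a pointer into a separate heap allocation; Vec<T>
// keeps the payloads contiguous, so iteration reads sequential memory.
fn sum_boxed(v: &[Box<u64>]) -> u64 {
    v.iter().map(|b| **b).sum() // one extra deref per element
}

fn sum_inline(v: &[u64]) -> u64 {
    v.iter().copied().sum() // sequential reads of contiguous memory
}

fn main() {
    let boxed: Vec<Box<u64>> = (0..1000u64).map(Box::new).collect();
    let inline: Vec<u64> = (0..1000).collect();
    // Same result either way; the boxed version just pays an extra
    // allocation per element and loses cache locality.
    assert_eq!(sum_boxed(&boxed), sum_inline(&inline));
}
```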
20:32 karolherbst[d]: `Instruction count: 692 -> 570` uwu
20:32 karolherbst[d]: getting somewhere
20:33 karolherbst[d]: `Static cycle count: 3843 -> 3102`
20:33 karolherbst[d]: though a lot is also just lea -> iadd sadly
20:33 karolherbst[d]: ehh
20:33 karolherbst[d]: lea -> shf
20:35 airlied[d]: karolherbst[d]: did you look at my hacks branch at all, I thought I did some of the ur+r opcode in there
20:36 karolherbst[d]: airlied[d]: what branch was it again? But I think it needed some heavy clean up anyway
20:37 mhenning[d]: leftmostcat[d]: yeah, I've been planning to test it a little more on the shaders I have locally
20:37 karolherbst[d]: this looks almost acceptable now: https://gist.githubusercontent.com/karolherbst/00855c5aff73d4d5d437c1ce592d15f1/raw/7f7557e9479afd33be787b8a5d46120f153a20ab/gistfile1.txt
20:38 airlied[d]: https://gitlab.freedesktop.org/airlied/mesa/-/commits/nak-coop-matrix-hacks2?ref_type=heads has most of them in it
20:39 karolherbst[d]: nah, doing it in nak is useless
20:39 karolherbst[d]: we need to do it way earlier so we can make use of nir_opt_offsets
20:39 karolherbst[d]: and nir algebraic opts and other funky things
20:39 karolherbst[d]: ehh wait
20:39 karolherbst[d]: you did that in earlier commits
20:40 karolherbst[d]: oh well
20:40 airlied[d]: I remember doing some stuff post divergence analysis
20:40 karolherbst[d]: guess I wrote the same patch as 3607417af78b903c3ddb6f239d82f7743db6916f basically
20:40 karolherbst[d]: but I have a lot of offset stuff as well
20:41 karolherbst[d]: anyway.. I also added an opt to rebalance iadds
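[editor's note: a hedged sketch of what rebalancing a chain of iadds buys — reassociating so all uniform terms collapse into one operand (a UGPR candidate) while the divergent part stays separate, matching the `LDG [Rx.64+URy]` addressing mode mentioned above. Names and representation here are made up for illustration:]

```rust
// Toy reassociation: ((d + u1) + u2) -> d + (u1 + u2), so the uniform
// subexpression becomes a single operand instead of being interleaved
// with the divergent value.
#[derive(Clone, Copy)]
enum Term {
    Uniform(u64),   // same value in every invocation (UGPR-able)
    Divergent(u64), // per-invocation value (needs a GPR)
}

// Returns (divergent_sum, uniform_sum) for a flattened add chain.
fn rebalance(terms: &[Term]) -> (u64, u64) {
    let mut div_sum = 0u64;
    let mut uni_sum = 0u64;
    for t in terms {
        match t {
            Term::Divergent(v) => div_sum = div_sum.wrapping_add(*v),
            Term::Uniform(v) => uni_sum = uni_sum.wrapping_add(*v),
        }
    }
    (div_sum, uni_sum)
}

fn main() {
    // (d + u1) + u2: the two uniform terms fold into one operand.
    let terms = [Term::Divergent(100), Term::Uniform(16), Term::Uniform(32)];
    assert_eq!(rebalance(&terms), (100, 48));
}
```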
20:41 karolherbst[d]: the more annoying part will be to deal with mixed 32 + 64 bit additions
20:42 karolherbst[d]: there are some 64 bit alu ops that could be done in 32
20:42 karolherbst[d]: and the LDG can do a 32 + 64 bit add
20:43 karolherbst[d]: but yeah.. I do the divergent stuff in nir as well
20:43 karolherbst[d]: ez: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/6d1db9d072ef9411b25d8ccba425f2cfe3309634#3b78b8724786a6e61ca9cbb68460fd0e195d831c_35_35
20:45 karolherbst[d]: but yeah.. ULDC loading from a raw pointer is also what I wanted to properly upstream
20:46 karolherbst[d]: but it's like one single load in the shader and not even in the loop
20:47 karolherbst[d]: the issue is, that none of this really matters all that much for performance given the loop is already optimal 🙃
20:47 karolherbst[d]: well
20:47 karolherbst[d]: except for the predication thing nvidia is doing
20:49 karolherbst[d]: snowycoder[d]: ohh where can I check out your cross block stuff?
20:51 karolherbst[d]: airlied[d]: if you feel like doing some benchmarks https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37016 on top of main should give you great numbers since the last time
20:51 karolherbst[d]: like actually competitive
21:00 leftmostcat[d]: mhenning[d]: Anything else I can do to be helpful?
21:05 airlied[d]: I've sworn this week to cgroups
21:09 karolherbst[d]: oof
21:09 karolherbst[d]: gpu related stuff?
21:10 karolherbst[d]: div 32 %230 = ushr %228, %229 (0x3)
21:10 karolherbst[d]: div 64 %231 = u2u64 %230
21:10 karolherbst[d]: div 64 %233 = ishl %231, %232 (0x4)
21:10 karolherbst[d]: div 32x4 %234 = @load_global_nv (%233, %29) (base=0, access=none, align_mul=16, align_offset=0)
21:10 karolherbst[d]: this pattern is sooo common here...
21:11 karolherbst[d]: I need to take another look with uub and see what can be done about it...
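[editor's note: a scalar sketch of why that `ushr` → `u2u64` → `ishl` chain is foldable, assuming the usual zero-extend semantics of `u2u64` — the extend can be hoisted past the shifts so the pair collapses into one mask-and-shift that an offset-folding pass can then try to merge into the load's address calculation:]

```rust
// The pattern from the NIR dump, as scalar Rust:
//   %230 = ushr %228, 3        (32-bit)
//   %231 = u2u64 %230
//   %233 = ishl %231, 4        (64-bit)
fn addr_offset_naive(x: u32) -> u64 {
    ((x >> 3) as u64) << 4
}

// Hoisting the zero-extend leaves a single 64-bit mask + shift:
// ((x >> 3) zext) << 4  ==  (zext(x) & !7) << 1
fn addr_offset_folded(x: u32) -> u64 {
    ((x as u64) & !7) << 1
}

fn main() {
    // Equivalent for all u32 inputs, including ones where the 64-bit
    // result no longer fits in 32 bits.
    for x in [0u32, 7, 8, 9, 1 << 31, u32::MAX] {
        assert_eq!(addr_offset_naive(x), addr_offset_folded(x));
    }
}
```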
21:14 airlied[d]: yes initial cgroups support for system memory for amdgpu/xe has to get done, I keep letting coopmat preempt me 😛
21:14 snowycoder[d]: karolherbst[d]: I can push it in a MR if you want to test it, but I need to rewrite it a bit probably
21:14 karolherbst[d]: that's fine
21:14 karolherbst[d]: if it breaks I'll notice
21:14 karolherbst[d]: I just want numbers to go high 😄
21:16 mhenning[d]: leftmostcat[d]: nah, I just need to get around to it
21:18 karolherbst[d]: I need to figure out what to do with this:
21:18 karolherbst[d]: https://gist.githubusercontent.com/karolherbst/afc6b6a3ac61d97e4b0597d57fd50947/raw/65c6bcba8d75c60d58763581db23b240a94e5090/gistfile1.txt
21:18 leftmostcat[d]: Okay, well if anyone has tasks that don't need deep understanding, I'll be in a corner pushing buttons.
21:21 karolherbst[d]: ohh I should wire up stores...
21:35 karolherbst[d]: nice...
21:36 karolherbst[d]: nvidia: https://gist.github.com/karolherbst/c803c9e20cf645fb5673d92615dbbb8e
21:36 karolherbst[d]: nvk: https://gist.github.com/karolherbst/00855c5aff73d4d5d437c1ce592d15f1
21:36 karolherbst[d]: getting there
21:37 karolherbst[d]: so now I got 200 instructions removed 🙃
21:42 snowycoder[d]: Cross-delay scheduling is here:
21:42 snowycoder[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37108
21:42 snowycoder[d]: Tell me if it works and how it performs!
21:42 snowycoder[d]: karolherbst[d]
21:44 karolherbst[d]: I'm sure it's better than my hack
21:44 karolherbst[d]: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/1f23fc2d5fff1318a6f1e049e58a02e0238a156d 🥲
21:46 karolherbst[d]: let's see...
21:46 karolherbst[d]: snowycoder[d]: it's slower than my hack 🥲
21:47 karolherbst[d]: let's see...
21:47 karolherbst[d]: mhhh
21:49 karolherbst[d]: strange...
21:50 snowycoder[d]: It's that much slower?
21:50 karolherbst[d]: no
21:50 karolherbst[d]: it's just as fast as without my hack
21:50 karolherbst[d]: which is...
21:50 karolherbst[d]: weird
21:51 karolherbst[d]: like the output is better
21:52 karolherbst[d]: let me try again to see if I messed up testing somehow
21:52 snowycoder[d]: maybe it really doesn't count that much, in practice it seems to change mostly the last instruction
21:53 snowycoder[d]: or maybe we need to cross-block barriers too
21:54 karolherbst[d]: nah.. the change is significant
21:54 karolherbst[d]: removes like 15 cycles in the hot loop
21:55 karolherbst[d]: let me compare with my hack just in case
21:56 karolherbst[d]: ahh yeah...
21:57 karolherbst[d]: my hack is removing 10 cycles more
21:57 snowycoder[d]: mmmh, so there's a bug in my patch?
21:57 karolherbst[d]: https://gist.github.com/karolherbst/b289d00b165407616b65ed9cf52ef525
21:58 karolherbst[d]: ` r92..94 = hmma.m16n8k16.f16 r92..96 r30..32 r76..78 // delay=8 wr:5`
21:58 karolherbst[d]: I think that can be lower
21:58 karolherbst[d]: maybe
21:58 karolherbst[d]: maybe not.. mhh
21:59 karolherbst[d]: maybe my shader is broken who knows
21:59 karolherbst[d]: it does pass validation, but it also uses a different shader for validation
21:59 karolherbst[d]: or well same shader just other spec constants
21:59 karolherbst[d]: but yeah.. I think it can be lower
21:59 karolherbst[d]: hmma is like.. 22?
22:00 snowycoder[d]: How does your hack work?
22:00 karolherbst[d]: uhm...
22:00 karolherbst[d]: through thoughts and prayers
22:01 karolherbst[d]: karolherbst[d]: ^^
22:01 karolherbst[d]: I just messed with the value until it works
22:01 karolherbst[d]: but
22:01 snowycoder[d]: ahahah, nice.
22:01 snowycoder[d]: If you give me a source I can debug, my current patch is holding on with tape too
22:01 karolherbst[d]: I'm still convinced that ` r92..94 = hmma.m16n8k16.f16 r92..96 r30..32 r76..78 // delay=8 wr:5` should be lower
22:01 karolherbst[d]: let me check what latency hmma has
22:01 karolherbst[d]: it's used in ` r31 = mov r93 // delay=1`
22:02 karolherbst[d]: the first
22:02 karolherbst[d]: yeah...
22:03 karolherbst[d]: okay.. it's 24
22:04 karolherbst[d]: mhh yeah...
22:04 karolherbst[d]: dunno
22:04 karolherbst[d]: maybe it's correct and my hack isn't work
22:04 karolherbst[d]: *working
22:04 karolherbst[d]: with my hack I get `1` which is clearly too low
22:05 snowycoder[d]: I think it's correct, the `mov r93` line has `8` cycles above and the `hmma` line has `10` cycles below, `24-18=8` so it should check out
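[editor's note: the check being done here can be sketched as a toy model — an assumption about how the delay field works, not the real NAK scheduler: the producer's delay only needs to cover whatever part of its latency isn't already hidden by instructions issued between producer and consumer:]

```rust
// Toy model: if `latency` cycles must pass before the result is
// readable and `cycles_between` cycles of other work already sit
// between producer and consumer, the delay field only needs to stall
// for the remainder (and the encoding has a minimum of 1).
fn required_delay(latency: u32, cycles_between: u32) -> u32 {
    latency.saturating_sub(cycles_between).max(1)
}

fn main() {
    // e.g. an hmma with ~24 cycles of latency, 16 of which are already
    // covered by surrounding work, would need delay=8.
    assert_eq!(required_delay(24, 16), 8);
    // Fully covered latency still needs the minimum delay of 1.
    assert_eq!(required_delay(24, 30), 1);
}
```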
22:06 karolherbst[d]: yeah...
22:06 karolherbst[d]: I mean mine is a hack, so it's possibly very broken
22:06 snowycoder[d]: Buut, if that is a hot loop and the hardware has just a slow-path on wrong scheduling (no UB), it should still be correct and faster to use a lower value
22:06 karolherbst[d]: the test has a `--correctness` flag and that passed, but maybe there are $hw_reasons it all worked out
22:07 karolherbst[d]: but the `24` is also just guessing
22:07 karolherbst[d]: we didn't get the fixed tables yet
22:08 snowycoder[d]: Ohh, still no table values?
22:08 karolherbst[d]: yeah...
22:09 snowycoder[d]: maybe you can try to lower the hmma RaW to match your hack
22:09 karolherbst[d]: anyway.. the benchmark has a high variance and randomly the results are just bad for whatever reason
22:10 karolherbst[d]: sometimes it's almost the same result.. so who knows
22:10 karolherbst[d]: maybe just randomness and I got unlucky
22:10 snowycoder[d]: This happens also with no cross-block patch, right?
22:10 karolherbst[d]: anyway, it seems to work and the latencies look way better
22:10 karolherbst[d]: snowycoder[d]: variance in results?
22:11 snowycoder[d]: Yes, both variance and wrong results
22:11 karolherbst[d]: well the results are correct
22:11 karolherbst[d]: just sometimes slow
22:11 karolherbst[d]: and yeah.. it's normal
22:11 karolherbst[d]: also happens with the nvidia driver
22:11 karolherbst[d]: 🙃
22:11 snowycoder[d]: ahh, weird 0_o
22:11 snowycoder[d]: karolherbst[d]: Thanks for testing!
22:11 karolherbst[d]: I guess it's PCIe or something
22:11 karolherbst[d]: this is on an eGPU case after all
22:29 karolherbst[d]: mhhh
22:30 karolherbst[d]: I just noticed that those two ifs inside the loop have the same condition...
22:37 karolherbst[d]: anyway... I have no idea what's still slow in this shader 🙃
22:37 karolherbst[d]: except the control flow predication thing...
22:37 karolherbst[d]: but that can't be _that_ much of a difference can it? 🙃
22:40 karolherbst[d]: mhhh...
22:40 karolherbst[d]: nvidia uses `LDG.E.STRONG.SM`
22:41 karolherbst[d]: we use `.STRONG.GPU`..
22:41 karolherbst[d]: this surely can't make that much of a difference can it
22:41 snowycoder[d]: Just to feed the serotonin goblin that lives rent-free in my head, what's the perf difference of my patch without the hack?
22:44 snowycoder[d]: karolherbst[d]: That's easy to test by patching the emitter
22:46 karolherbst[d]: yeah.. changes nothing
22:46 karolherbst[d]: snowycoder[d]: like... 2% or so?
22:47 snowycoder[d]: karolherbst[d]: Yayy! In line with the other tests
22:47 karolherbst[d]: it's hard to tell with this benchmark, I'd need to get more numbers
22:47 karolherbst[d]: but nvidia is like... still a lot faster and I think it might just be super duper scheduling
22:48 karolherbst[d]: also them playing tricks with IMAD
22:48 snowycoder[d]: karolherbst[d]: Don't worry it's in my todo list
22:49 karolherbst[d]: yeah anyway... the ugpr stuff kinda helps but not in the numbers 🙃
22:49 karolherbst[d]: but also not unexpected because it only changes things outside the loop
22:49 karolherbst[d]: well.. mostly
22:49 karolherbst[d]: I'm sure it's going to be more useful for random other things 😄
22:50 karolherbst[d]: the compute MME stuff helped a lot.. so I wonder if most of the perf gap is really somewhere else now
22:50 karolherbst[d]: but also... it's still a mystery to me why their QMD is so weird...
22:51 karolherbst[d]: different cta size...
22:51 karolherbst[d]: way fewer registers...
22:52 karolherbst[d]: but then again.. the membar.cta change was a massive win in the shader
22:52 karolherbst[d]: I kinda hate that nothing can tell me what the current bottleneck is 😄
22:53 snowycoder[d]: karolherbst[d]: Time to wire up perf counters?
22:53 karolherbst[d]: I was considering it...
22:53 karolherbst[d]: but first I'm gonna clean up and upstream the address calc stuff, because nuking like 20% of the shader ain't nothing
22:53 mhenning[d]: yeah, need good profiling support. you know, for ai
22:54 karolherbst[d]: of course
22:54 karolherbst[d]: I already brought up the topic
22:54 karolherbst[d]: it's just a pain without docs of the counters
22:55 karolherbst[d]: anyway.. gonna clean up the patches and throw it into my offset MR 🙃
22:55 karolherbst[d]: ohhh
22:55 karolherbst[d]: I should try out that pass again..
22:57 airlied[d]: karolherbst[d]: you could try nvk on openrm 🙂
22:57 karolherbst[d]: heh
22:59 karolherbst[d]: you think it's kernel overhead or something?
22:59 karolherbst[d]: I mean.. there are still very obvious things to do in the shader, but they are all... well.. a lot of work
22:59 karolherbst[d]: though there are a few subtle things
23:00 karolherbst[d]: like `IMAD.MOV` or `IMAD.SHL`
23:00 karolherbst[d]: or `IMAD.IADD`
23:00 karolherbst[d]: `.reuse` might also make enough of a difference I dunno
23:01 karolherbst[d]: or... maybe I start to hit CPU bottlenecks?
23:02 karolherbst[d]: my CPU is hitting the 100% actually.. maybe I should check out CPU side stuff
23:03 airlied[d]: I'm not sure how many submissions it does, but it could be a bit of user space vs kernel submission overhead in it
23:04 airlied[d]: or try and munge the nvidia shader
23:04 karolherbst[d]: mhhhh
23:04 airlied[d]: so we submit nearly exactly the same thing and see where we end up
23:04 karolherbst[d]: it's like 60 QMDs back to back in the push buffer, but not sure how often the same thing gets submitted
23:05 karolherbst[d]: yeah sooo
23:05 karolherbst[d]: here is the weird thing...
23:05 karolherbst[d]: we do 60 QMDs, nvidia... 4 or 5?
23:05 karolherbst[d]: it kinda feels like they merge compute jobs together, but not sure
23:08 airlied[d]: I would think predication would be something
23:08 karolherbst[d]: build mesa and the benchmark in release mode.. no significant difference
23:08 karolherbst[d]: airlied[d]: mhhh dunno.. the branching is all uniform
23:08 snowycoder[d]: karolherbst[d]: In one recent talk about risc-v and AI they talked about merging compute jobs
23:08 karolherbst[d]: however
23:08 karolherbst[d]: nvidia is doing something else with the predicates
23:08 karolherbst[d]: https://gist.github.com/karolherbst/c803c9e20cf645fb5673d92615dbbb8e
23:09 karolherbst[d]: they just schedule across "blocks"
23:09 karolherbst[d]: and I suspect they just schedule way way way better than we do
23:10 karolherbst[d]: at least the GPU is getting properly hot 😄
23:10 karolherbst[d]: so I guess it's doing something
23:10 RSpliet: snowycoder[d]: there's a strand of research called "graph compiler" which does this on I think a higher level of abstraction
23:10 karolherbst[d]: yeah....
23:11 airlied[d]: I know for coopmat2 the workgroup scope means they get to schedule the loads to/from shared memory and it helped
23:11 leftmostcat[d]: I love the word "graph" and I love the word "compiler".
23:11 airlied[d]: though they also seem to have some sort of barrier patching in 580
23:11 karolherbst[d]: airlied[d]: I've disabled the extension in the benchmark
23:11 karolherbst[d]: but I was dumping the QMDs and they were... weird
23:12 karolherbst[d]: there is also that occupancy register/warp thing
23:12 karolherbst[d]: yeah.. so that's the weird thing..
23:13 karolherbst[d]: we set up the CTA to be 32 by 32, nvidia does... 0x200000 by 0x1
23:13 karolherbst[d]: but the QMD also only had 16 registers
23:13 karolherbst[d]: maybe something is odd with the dumping
23:13 karolherbst[d]: maybe they do sneaky things
23:13 karolherbst[d]: maybe they actually submit a different shader
23:13 karolherbst[d]: I just found that one shader
23:13 karolherbst[d]: but they might use a different one? who knows
23:19 karolherbst[d]: but then again.. the compute MME patch improves performance by like 15% 🙃
23:22 karolherbst[d]: maybe I really should just focus on upstreaming what I have and then wire up all the performance metric stuff
23:23 karolherbst[d]: I should try out some unsafe math opts which are "fine (tm)" and see if that helps with anything...
23:28 airlied[d]: I'd definitely focus on upstreaming first
23:28 airlied[d]: having incremental improvements over time is good for certain mgmt 😛