00:00 karolherbst[d]: totally forgot about that one, but it's not apparent to me what's the actual way this can be generated in vulkan..
00:01 karolherbst[d]: oh well.. I can ignore that until I see shaders that could benefit from it
00:29 mhenning[d]: karolherbst[d]: vtn can't generate it directly. we have some lowering passes that generate it
00:29 karolherbst[d]: ahh I see..
00:29 karolherbst[d]: wiring up `I2IP` might be interesting then, though it only supports saturated conversions
00:30 mhenning[d]: maybe? not sure if anything actually hits that case
00:30 karolherbst[d]: CL can 🙃
00:31 karolherbst[d]: though not sure if zink can even convert it to something directly
00:31 karolherbst[d]: ohh I think it's even worse
00:31 karolherbst[d]: it's S32 as a source format only
00:31 mhenning[d]: yeah, not sure what that looks like after zink transforms it
00:32 mhenning[d]: karolherbst[d]: that sounds pretty niche
00:32 karolherbst[d]: sounds like it was added for coop matrix stuff tbh
00:32 karolherbst[d]: they also support 4 and 2 bit integers as dest formats
00:34 karolherbst[d]: mhh okay.. so with the shader option and the vectorization part, the nak_nir_algebraic thing I've added on its own seems to be not a great deal:
00:34 karolherbst[d]: Totals from 131 (0.01% of 1163204) affected shaders:
00:34 karolherbst[d]: CodeSize: 2760592 -> 2738096 (-0.81%); split: -0.85%, +0.04%
00:34 karolherbst[d]: Number of GPRs: 6592 -> 6720 (+1.94%)
00:34 karolherbst[d]: Static cycle count: 1734015 -> 1724323 (-0.56%); split: -0.60%, +0.04%
00:34 karolherbst[d]: Max warps/SM: 5136 -> 5024 (-2.18%)
00:36 phomes_[d]: would you expect this one to have an impact? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36359 (nvk: Support nir_opt_varyings)
00:37 karolherbst[d]: phomes_[d]: generally yes, but hard to tell
00:39 phomes_[d]: I was just thinking about why the prop driver sometimes reaches 0 duration for things where we might be fast but not 0
00:41 phomes_[d]: instead of looking at the slowest games, I tried to look at the perf gap of a game that was doing quite well at 87% of prop: Urban Trial Playground
00:41 phomes_[d]: this time there is not a single big problem, but 1000+ draw calls all being a bit slower than prop
00:42 phomes_[d]: and once again we have a depth pass that sums up to be quite slow
00:43 karolherbst[d]: ohh.. my opt has an issue mhhh
00:43 karolherbst[d]: mhhhh
00:44 esdrastarsis[d]: phomes_[d]: is this related to zcull?
00:45 phomes_[d]: I am running with the zcull patches in my kernel though
00:47 esdrastarsis[d]: me too
00:47 karolherbst[d]: div 32x2 %43 = vec2 %40, %42
00:47 karolherbst[d]: div 16x2 %44 = f2f16 %43
00:47 karolherbst[d]: div 16x2 %112 = ffma %44.xx, %110 (-1.000000, 1.000000), %111 (1.000000, 0.000000)
00:47 karolherbst[d]: ->
00:47 karolherbst[d]: div 16x2 %107 = f2f16 %40.xx
00:47 karolherbst[d]: div 16x2 %113 = ffma %107, %111 (-1.000000, 1.000000), %112 (1.000000, 0.000000)
00:47 karolherbst[d]: I think I'll have to insert a `is_used_only_once` and see if that makes anything better or something...
00:47 karolherbst[d]: `is_used_once`
00:48 phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1481816257320980613/image.png?ex=69b4b06e&is=69b35eee&hm=b65be8c5069d30783314de23988dfdcfc1cbee54cf5dfe7a934ec55a7af3686b&
00:48 phomes_[d]: plotting the draw calls:
00:48 karolherbst[d]: ohh yeah.. much better
00:49 karolherbst[d]: phomes_[d]: yeah.. those look... odd
00:50 phomes_[d]: I looked into the shaders in the slow depth path
00:51 phomes_[d]: we compile code that looks to be close in size
00:51 karolherbst[d]: waaaaiaiiiit....
00:52 karolherbst[d]: MUFU supports F16 🙃
00:52 karolherbst[d]: we don't use it at all
00:53 karolherbst[d]: it supports even a vec2
00:53 karolherbst[d]: ooooooofff
00:53 karolherbst[d]: r0 = f2f.f32.f16.re r0.xx // delay=2 wr:0
00:53 karolherbst[d]: r0 = mufu.rcp r0 // delay=2 wt=000001 wr:0
00:53 karolherbst[d]: r0 = f2f.f16.f32.re r0 // delay=1 wt=000001 wr:0
00:53 karolherbst[d]: could be a single instruction
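The three-instruction sequence above (widen to F32, `mufu.rcp`, narrow back to F16) is the kind of pattern a peephole could collapse. A minimal sketch on a toy SSA-style instruction list, not NAK's actual IR: the `Instr` tuple, the `fuse_f16_mufu` name and the `mufu.rcp.f16` spelling are all made up for illustration, and rounding modifiers and swizzles are ignored.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dst: str  # SSA value defined by this instruction
    op: str   # toy opcode string
    src: str  # single source value


def fuse_f16_mufu(instrs):
    """Fuse widen -> mufu.<fn> -> narrow into one hypothetical mufu.<fn>.f16."""
    # Count uses so we only fuse when both temporaries are otherwise dead.
    uses = {}
    for ins in instrs:
        uses[ins.src] = uses.get(ins.src, 0) + 1

    out = []
    i = 0
    while i < len(instrs):
        # Look at a window of three adjacent instructions, padded with None.
        w, m, n = (list(instrs[i:i + 3]) + [None] * 3)[:3]
        if (m is not None and n is not None
                and w.op == "f2f.f32.f16"
                and m.op.startswith("mufu.") and m.src == w.dst
                and n.op == "f2f.f16.f32" and n.src == m.dst
                and uses[w.dst] == 1 and uses[m.dst] == 1):
            out.append(Instr(n.dst, m.op + ".f16", w.src))
            i += 3
        else:
            out.append(instrs[i])
            i += 1
    return out


prog = [
    Instr("%1", "f2f.f32.f16", "%0"),
    Instr("%2", "mufu.rcp", "%1"),
    Instr("%3", "f2f.f16.f32", "%2"),
]
fused = fuse_f16_mufu(prog)
```

Matching only adjacent instructions keeps the sketch short; a real pass would follow def-use chains instead of relying on instruction order.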
00:54 phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1481817781535113346/image.png?ex=69b4b1d9&is=69b36059&hm=fd378567964f6f81cfc7820a675f1c14c90a114129080ef1c38ca864643de968&
00:54 phomes_[d]: the depth pass has a lot of calls. Again none is really slow but all of them are just a bit slower than prop. I wrote a small thing to browse comparisons for nak/prop:
00:55 karolherbst[d]: ours is obviously better 🙃
00:55 karolherbst[d]: well..
00:55 karolherbst[d]: `reuse` could matter...
00:55 phomes_[d]: every call just takes longer to run
00:55 phomes_[d]: I was starting to think that too but now we are way past where I know anything 🙂
00:56 karolherbst[d]: reuse is a bit of a weird opt
00:57 karolherbst[d]: ohh nvidia has way lower delays than us
01:02 karolherbst[d]: anyway.. maybe I look into MUFU.F16 tomorrow
01:02 karolherbst[d]: that's gonna be a huge change
01:02 karolherbst[d]: it's even Turing+
01:02 karolherbst[d]: ehh maybe it's scalar only
01:03 karolherbst[d]: still
01:03 karolherbst[d]: can select lo/hi bits on the source
01:12 karolherbst[d]: mhenning[d]: do we have any handling of `.lo|.hi` selectors inside NAK? I couldn't find anything so far
01:12 karolherbst[d]: well nvidia calls it `.H0|.H1`
01:13 karolherbst[d]: not sure how to deal with this on the from_nir side, _but_ I think it could fold in some `prmt`, but no idea what's a good approach with that
01:13 karolherbst[d]: sadly `MUFU.F16` is scalar, but it could select a vec2 component mhhh
01:16 mhenning[d]: karolherbst[d]: We have .xx and .yy swizzles
01:16 karolherbst[d]: ohhh... I dum dum
01:17 karolherbst[d]: okay... so I guess if the source is a swizzled one, I could make use of it..
01:17 karolherbst[d]: great
01:17 karolherbst[d]: and I guess most F16 ops are already vec2 anyway
01:18 karolherbst[d]: the only annoying part is that MUFU.F16 only writes into the lower bits, but whatever
01:18 karolherbst[d]: it's better than the f2f dance
03:38 gfxstrand[d]: phomes_[d]: I doubt that has anything to do with the shader.
03:38 gfxstrand[d]: Those shaders are close enough to identical that it's probably something else.
03:41 airlied[d]: years ago I found one of those weird slowdowns in radeon MRT performance came from the cbuf addresses needing to be swizzled different to avoid bank conflicts
03:44 gfxstrand[d]: Yeah, I did that on Intel, too
03:44 gfxstrand[d]: phomes_[d]: Do they change shaders a lot?
09:25 karolherbst[d]: with MUFU.F16 the shader I'm looking at: 1317 -> 1277 instructions
09:25 karolherbst[d]: but there seem to be a few pointless prmts left that could be handled via swizzles
09:25 karolherbst[d]: around f16 ops in general I mean
09:31 marysaka[d]: yeah I think we still have to improve the swizzling optimisation side of things
09:34 karolherbst[d]: `the type f16 is unstable` :blobcatnotlikethis:
09:34 karolherbst[d]: yeah.. maybe I skip fp16 fsin for now 🙃
09:35 karolherbst[d]: ohh wait.. I could just write a constant
09:35 marysaka[d]: yeah… tho typing some helper or using the half crate could be solutions
09:36 marysaka[d]: Or just yeah use constants 😄
09:38 karolherbst[d]: yeah.. it's just 1.0 / (2.0 * pi)
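Since the factor is a fixed number, its F16 bit pattern can be computed once offline and hard-coded, which sidesteps the unstable `f16` type. A small sketch using Python's `struct` half-precision format (`'e'`); the helper name `f16_bits` is only for illustration.

```python
import math
import struct


def f16_bits(x: float) -> int:
    """Return the IEEE 754 binary16 bit pattern of x (round-to-nearest-even)."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


INV_2PI_F16 = f16_bits(1.0 / (2.0 * math.pi))
print(hex(INV_2PI_F16))  # 0x3118
```

The resulting integer can then be written into the compiler source as a plain 16-bit literal.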
10:08 phomes_[d]: gfxstrand[d]: within the depth pass? yes quite a few times. I will get some stats
10:09 phomes_[d]: marysaka[d]: btw I finally found a game to test the mesh shaders with. Remnant 2. I am testing it right now
10:18 marysaka[d]: oh nice!
10:28 phomes_[d]: seems to work just fine. Lots of calls to CmdDrawMeshTasksIndirectEXT
10:31 marysaka[d]: amazing, I guess to get more testing going there we will need DX12.2 support so like interlock and raytracing too
10:37 phomes_[d]: gfxstrand[d]: in depth pass 4 there are 58 changes of shaders/pipelines. There is a total of 346 vkCmdDrawIndexed calls across those
10:41 phomes_[d]: there seems to be a constant overhead, as an increase in indexCount only increases duration by a tiny bit:
10:41 phomes_[d]: `353 | - vkCmdDrawIndexed(6, 1) | 91 | 0.16077
10:41 phomes_[d]: 356 | - vkCmdDrawIndexed(294, 1) | 92 | 0.16179
10:41 phomes_[d]: 359 | - vkCmdDrawIndexed(5766, 1) | 93 | 0.16282
10:41 phomes_[d]: 362 | - vkCmdDrawIndexed(23814, 1) | 94 | 0.16384`
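One way to put a number on the "constant overhead" reading is a straight-line fit over the four rows above. This is a rough sketch that assumes the last column is the per-draw value being compared (its unit is not stated in the log) and that the first argument of `vkCmdDrawIndexed` is the index count; four points are far too few for anything beyond a sanity check.

```python
# (indexCount, last-column value) for the four draws quoted above
samples = [(6, 0.16077), (294, 0.16179), (5766, 0.16282), (23814, 0.16384)]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(samples)
largest = max(x for x, _ in samples)
# Share of the fitted value at the largest draw that does not depend on indexCount.
fixed_share = intercept / (intercept + slope * largest)
print(f"intercept={intercept:.5f} slope={slope:.3e} fixed share={fixed_share:.1%}")
```

A fixed share close to 100% even for the largest draw would back up the observation that the index count barely matters here.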
10:42 karolherbst[d]: MUFU.F16:
10:42 karolherbst[d]: Totals from 1427 (0.12% of 1163204) affected shaders:
10:42 karolherbst[d]: CodeSize: 18612784 -> 18507888 (-0.56%); split: -0.57%, +0.00%
10:42 karolherbst[d]: Number of GPRs: 91635 -> 91627 (-0.01%)
10:42 karolherbst[d]: SLM Size: 14144 -> 14140 (-0.03%)
10:42 karolherbst[d]: Static cycle count: 96332999 -> 96244567 (-0.09%); split: -0.13%, +0.04%
10:42 karolherbst[d]: Spills to memory: 2684 -> 2688 (+0.15%)
10:42 karolherbst[d]: Fills from memory: 2684 -> 2688 (+0.15%)
10:42 karolherbst[d]: Max warps/SM: 48812 -> 48816 (+0.01%)
10:43 karolherbst[d]: but the code is a disaster ..
13:11 karolherbst[d]: I wonder if it's better to lower fsin/fcos/etc.. in nir...
13:11 karolherbst[d]: because the hmul2 part could be vectorized
13:12 karolherbst[d]: and the constant could also be folded into previous instructions
13:12 karolherbst[d]: but also.. whatever 🙃
13:29 gfxstrand[d]: phomes_[d]: How often does it change descriptor sets
13:40 zmike[d]: marysaka[d]: zink mesh should be fixed in main, so it should be even easier to test glcts now
14:11 mohamexiety[d]: phomes_[d]: Do you have Alan wake 2 btw?
14:11 mohamexiety[d]: That was one of the big games using it if you have it
14:12 mohamexiety[d]: Shouldn’t require ray tracing but no clue about FSR
14:12 mohamexiety[d]: Also that game has both a non mesh shader and a mesh shader path so perf testing could be interesting
14:51 phomes_[d]: gfxstrand[d]: before every vkCmdDrawIndexed it does 1 or 2 vkCmdSetDescriptorBufferOffsetsEXT
14:55 phomes_[d]: mohamexiety[d]: Alan wake 2 is not on steam, so I went for a few other games that were mentioned as using mesh shaders (Avatar: Frontiers of pandora, Wolfenstein: youngblood). Avatar does not launch and wolfenstein does not activate the extension even on prop
15:02 esdrastarsis[d]: phomes_[d]: Maybe Wolfenstein uses the nvidia vendor extension? Since it's an old game
15:05 phomes_[d]: that was also what I read, but it also does not enable that ext
15:16 gfxstrand[d]: phomes_[d]: Okay, so that should be kinda fine.
15:17 gfxstrand[d]: But I also wonder if cbufs are biting us. Can you try with `NVK_DEBUG=no_cbuf` and see how that affects the depth pass?
15:41 phomes_[d]: in-game the fps went 54->51
15:41 phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1482041007330689109/image.png?ex=69b581bf&is=69b4303f&hm=00e3fcf5bf5516e6be6cef1ca78d655cef257e81f9350dd620e225eebf0fd94a&
15:42 gfxstrand[d]: Yes but what about just the depth pass?
15:43 phomes_[d]: blue is no_cbuf, orange is plain mesa
15:44 gfxstrand[d]: So depth pass 4 is slightly faster and depth pass 5 is slightly slower
15:44 gfxstrand[d]: Holy shit! That is a lot of passes, though.
15:46 phomes_[d]: gfxstrand[d]: yes. Are there specifics I should try to get to?
15:49 zmike[d]: I'd assume a lot of those are just clears?
16:05 phomes_[d]: probably. I did not check. Most of those passes are faster than even a single draw call in depth pass 4 though
16:09 hatfielde: So is there a certain setting like "exact mode" or something for shaders that I can detect in the compiler where we don't do certain floating point optimizations bc the user indicates that they need results that strictly follow IEEE754? Otherwise we can go ham with `fmul_pdiv_nv` substitution. Sorry for not getting to this for a couple days. Gotta make money.
16:09 karolherbst: hatfielde: yes, there are per instruction flags you can check, there are some examples in nir_opt_algebraic.py for it
16:09 hatfielde: As someone who writes shaders sometimes idc if my operations are faster + strictly more accurate, even if they don't follow IEEE754. But like, if the fmul instructions are marked {sat}, then we have to respect that
16:11 hatfielde: Ok I was thinking of like an overall setting that would differentiate like scientific simulation applications or something.
16:18 gfxstrand[d]: No over-all flag. Just per-op allowed optimization bits.
16:19 gfxstrand[d]: If you compile with CL, we set exact all over
16:20 hatfielde: Oh okay, and these are different bits from ftz/dnz/sat/rnd_mode?
16:24 gfxstrand[d]: yes
16:24 gfxstrand[d]: And they explicitly indicate when you need to care about inf/nan being correct, when it's okay to lose a tiny bit of precision, etc.
16:25 gfxstrand[d]: What are the semantics of `fmul_pdiv_nv` that are problematic?
16:33 karolherbst[d]: gfxstrand[d]: it behaves like `fma` as in the constant pre factor is with infinite precision
16:33 hatfielde: Noice! Sounds like those will provide some canonical answers to these ambiguities. To your question: Oh just if the "intermediate" subexpression dips into denormal or INF, `fmul_pdiv_nv` will prevent rounding/INF propagation from happening respectively. It won't respect an intermediate fmul{sat} since it could "rescue" the value from 0.
16:34 hatfielde: more precisely described here for "bitwise perfect" substitution: https://gist.github.com/htfld/1a59f1519cc64c28f8572ff13ae41ec9
16:34 gfxstrand[d]: karolherbst[d]: Right, so no over/under-flow until the whole calculation is complete. Yeah, we can get away with that unless `.exact` is set.
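The difference between rounding after every multiply and rounding once at the end can be shown numerically. The sketch below is only a model of the behaviour described in this exchange, not a verified description of the hardware op: `fused_model` is a hypothetical stand-in that rounds a single time, and flush-to-zero and denormal handling are ignored.

```python
import math
import struct


def round_f32(x: float) -> float:
    """Round a Python float (double) to the nearest f32; overflow becomes +/-inf."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def stepwise(a, k, b):
    # fmul(fmul(a, k), b): the intermediate product is rounded to f32
    return round_f32(round_f32(a * k) * b)


def fused_model(a, k, b):
    # Hypothetical model: rounding happens once, after the whole product
    return round_f32(a * k * b)


a = round_f32(3.0e38)  # close to the largest finite f32
print(stepwise(a, 2.0, 0.25), fused_model(a, 2.0, 0.25))
```

With these inputs the stepwise version overflows to infinity at the intermediate `a * 2.0`, while the single-rounding model stays finite, which is why the substitution is not bit-exact and has to be skipped when `.exact` is set.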
16:35 hatfielde: It will be very useful to look at those flags. To check my work/prove these semantics, is that what hw_tests.rs is for?
16:36 karolherbst[d]: if we feel super smart we could ignore exact with some range analysis information, but....
16:36 karolherbst[d]: I doubt it matters much
16:36 karolherbst[d]: like if a is above 1 and the factor below one, we could fold it even with exact
16:36 karolherbst[d]: but.....
16:37 karolherbst[d]: maybe
16:37 karolherbst[d]: I'm not actually sure
16:37 gfxstrand[d]: hatfielde: Yes. If you implement `Foldable` for your new ops or add it to `OpFMul`, it should be easy enough to add something to hw_tests.rs.
16:38 hatfielde: gfxstrand[d]: but to clarify, those tests aren't about the substitution validity itself, but more about proving "this is how the hardware works".
16:40 gfxstrand[d]: Yes. That's exactly what hw_tests.rs is for
16:40 karolherbst[d]: I think one thing to figure out is also if folding a constant factor into the fmul or creating an ffma is better on average
16:41 gfxstrand[d]: `Foldable` is just a CPU implementation of what we're pretty sure the HW op does. And then we beat the shit out of it with a bunch of different test cases to make sure we're right about the details.
16:41 hatfielde: karolherbst[d]: of course it is fun to do things 100% correct, but I think the range analysis would block any ranges that include 0 which could be a lot..
16:41 karolherbst[d]: like.. if you got this: `fadd(fmul(fmul(a, 2.0), b), c)` -> `fadd(fmul_pdiv_nv(a, 2.0, b), c)` or `ffma(fmul(a, 2.0), b, c)`
16:42 karolherbst[d]: and from a gut feeling, creating `ffma` feels more important than creating `fmul_pdiv_nv`, but...
16:42 karolherbst[d]: dunno
16:43 hatfielde: Does ffma not exist?
16:43 karolherbst[d]: maybe needs some heuristics to make it more optimal
16:43 karolherbst[d]: hatfielde: what do you mean?
16:44 hatfielde: karolherbst[d]: like we can only perform the substitution if ((a * c < INF && a * c > 0x007FFFFF (max denormalized #)) OR (a * c > NINF && a * c < 0x807FFFFF (min denormalized #))), which are ranges that do not include the number 0. Which is a pretty important number.
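The range condition above can be written as a small predicate over concrete values. This is a sketch of the check as stated in the message, evaluated on the f32-rounded intermediate product; `intermediate_is_normal` is a made-up name and a real range analysis would work on intervals rather than single values.

```python
import math
import struct

# Largest subnormal f32, bit pattern 0x007FFFFF as quoted above.
MAX_DENORM_F32 = struct.unpack("<f", struct.pack("<I", 0x007FFFFF))[0]


def round_f32(x: float) -> float:
    """Round a Python float (double) to the nearest f32; overflow becomes +/-inf."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def intermediate_is_normal(a: float, c: float) -> bool:
    """True if a * c, rounded to f32, is neither zero, denormal, inf nor NaN."""
    p = abs(round_f32(a * c))
    return MAX_DENORM_F32 < p < math.inf


print(intermediate_is_normal(1.0, 2.0), intermediate_is_normal(0.0, 2.0))
```

As the message points out, any input range that contains 0 fails this check, which limits how often such an analysis could unlock the substitution.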
16:46 hatfielde: also we haven't really looked into a * (c * b) case, although it's probably similar?
16:48 karolherbst[d]: hatfielde: well the constant applies to the first source, which could matter depending on things
16:50 karolherbst[d]: hatfielde: soo we don't have to be 100% precise in all cases, the shaders generally tell when it's important. I was more thinking about cases where we do have restrictions in place and we could say "but like we can do it safely regardless in those specific cases"
16:56 hatfielde: karolherbst[d]: nah i don't think it does apply to the first source. more like it is just added onto the exponent sum within the fmul. i shall use Foldable to test such assertions! it's interesting about the ffma, i see that is only used for sm <= 50?
16:58 karolherbst[d]: hatfielde: ffma should be used on every gen that NAK supports
17:00 hatfielde: karolherbst[d]: ah i only saw it encoded <=50, maybe it is called something else for >50
17:00 karolherbst[d]: hatfielde: it's also called FFMA later
17:01 hatfielde: True I see
17:01 karolherbst[d]: my MUFU.F16 stuff is ready I think.. https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40392
20:15 _lyude[d]: btw airlied[d] - finally got back to looking at that suspend/resume issue that I've been seeing:
20:15 _lyude[d]: Mar 04 15:46:15 GoldenWind kernel: nouveau 0000:c1:00.0: gsp: rc engn:00000001 chid:6 gfid:0 level:2 type:31 scope:1 part:233 fault_addr:0000003fd1f8a000 fault_type:0000000a
20:15 _lyude[d]: If I did my homework right, according to src/nvidia/generated/g_kern_gmmu_nvoc.h and `FAULT_TYPE` it's `fault_unsupportedAperture`
20:20 _lyude[d]: should I try to get more info from a gsp log? (a bit hesitant and curious if there's anything else I can do, since it never seems like we get anything useful from the log :s)
20:36 airlied[d]: when does it happen in the suspend/resume sequence? sounds like page table corruption maybe, though the fault addr looks reasonable
20:38 _lyude[d]: airlied[d]: looks like it's shortly after we finish resuming
20:40 airlied[d]: so is that a userspace process then? maybe we are missing a tlb flush or something
20:40 _lyude[d]: yes - specifically:
20:40 _lyude[d]: Mar 04 15:46:15 GoldenWind kernel: nouveau 0000:c1:00.0: gsp: rc engn:00000001 chid:6 gfid:0 level:2 type:31 scope:1 part:233 fault_addr:0000003fd1f8a000 fault_type:0000000a
20:40 _lyude[d]: Mar 04 15:46:15 GoldenWind kernel: nouveau 0000:c1:00.0: fifo:c00000:0006:0006:[gnome-shell[7440]] errored - disabling channel
20:40 _lyude[d]: Mar 04 15:46:15 GoldenWind kernel: nouveau 0000:c1:00.0: gnome-shell[7440]: channel 6 killed!
20:47 airlied[d]: doesn't sound like GSP's fault then, unknown aperture sounds like a corrupt page table
20:53 _lyude[d]: makes sense, would this be with the page table we save in the radix3 table on suspend?
20:57 airlied[d]: no it'll be one of the things we probably backup from instmem on fbsr, is this by any chance 570 only?
20:58 airlied[d]: since how we backup instmem is different between 535 and 570
21:01 _lyude[d]: oh good question, I'll check in a bit
21:35 _lyude[d]: [ 1667.337472] nouveau 0000:c1:00.0: [drm] *ERROR* failed, fb_id=0 handle=3 size=1920x1080 modifier=300000000606015 offset=0 format=30335258
21:35 _lyude[d]: By the way, also got some more info from the flickering screen issues
22:38 airlied[d]: So not a cursor at least
22:38 _lyude[d]: yep
23:02 karolherbst[d]: r2 = hmul2 r5.xx r2.xx // delay=1 wt=000001
23:02 karolherbst[d]: r1 = hmul2 r5.xx r1.xx // delay=4
23:02 karolherbst[d]: mhhh...
23:02 karolherbst[d]: oh well..
23:07 karolherbst[d]: airlied[d]: actually.. have we chatted about aliasing shared memory before? Didn't you want to look into it at some point?
23:08 karolherbst[d]: there is this MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33914
23:08 airlied[d]: karolherbst[d]: Yup there is an MR I pointed you at for NIR. But I think coopmat2 would really like it eventually
23:08 karolherbst[d]: I have one where it could allow me to run 5 instead of 4 workgroups 🙂
23:08 airlied[d]: But after I finish getting workgroup scope working at a basic level
23:10 karolherbst[d]: right.. I'm just wondering if you wanted to work through it or not, because then I'd pick something else
23:10 karolherbst[d]: no idea what Rhys' plans here are anyway
23:58 airlied[d]: not directly something I want to work on yet, but I need to see how much it might help workgroup, but I've realised it probably isn't essential up front but more a nice to have