15:05 karolherbst[d]: what MRs did we have that improve 3D <-> compute stuff? I think they would benefit borderlands 3 a lot and I can give it a try
15:19 karolherbst[d]: only seeing the cond rendering one and not sure that even matters there mhh
15:19 karolherbst[d]: I'm sure SCG would help
15:55 hatfielde1: wrote up some unsurprising test cases where the fmul_pdiv_nv substitution has a different result than fmul, such as: fmul ((fmul f32::MAX, 2), 0.5) != fmul_pdiv_nv (f32::MAX, 0.5, [scale_exponent=1]). LHS is INF, RHS is f32::MAX. Other mismatches include FTZ behavior. At the very least we need to account for this in the constant folding expression, but I'm not sure how these edge cases are usually handled for an optimization like this; maybe that AMD thing that was mentioned here yesterday would be relevant for range analysis
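That overflow mismatch is easy to reproduce outside the driver by emulating the two evaluation orders with explicit float32 rounding (a sketch: `f32` is just a round-to-single helper, and modeling the scaled multiply as one `ldexp` of the exact product with a single final rounding is an assumption about the hw op, not verified semantics):

```python
import math
import struct

def f32(x):
    """Round a Python float to the nearest float32 value."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

F32_MAX = f32(3.4028234663852886e38)

# fmul((fmul f32::MAX, 2), 0.5): the inner multiply overflows first
lhs = f32(f32(F32_MAX * 2.0) * 0.5)

# fmul_pdiv_nv(f32::MAX, 0.5, scale_exponent=1), modeled as one
# exact product scaled by 2**1 with a single final rounding
rhs = f32(math.ldexp(F32_MAX * 0.5, 1))

print(lhs, rhs)  # inf vs f32::MAX
```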
16:07 karolherbst[d]: hatfielde1: we can ignore constant folding for `fmul_pdiv_nv`. `fmul_pdiv_nv` will probably be created through `nak_nir_lower_algebraic_late`, which runs so late that it's after any constant folding
16:08 karolherbst[d]: and we could just ignore any optimizations on `fmul_pdiv_nv`
16:09 karolherbst[d]: but yeah.. I'm aware that the factor is not IEEE compliant, just need to figure out how to handle that
16:09 karolherbst[d]: maybe it's not a factor on the first source and my information is wrong 🙃
16:14 hatfielde1: ok good to know about the constant folding.
16:16 karolherbst[d]: but the folding should also be equally inaccurate as the op is
17:26 mhenning[d]: karolherbst[d]: I'm not sure that's true? I think all nir algebraic opts are required to have constant folding
17:27 mhenning[d]: karolherbst[d]: I'm not sure we have open MRs for that stuff
17:28 karolherbst[d]: mhenning[d]: mhh yeah, guess we'll be required to supply one anyway, but the point was rather to rely on const folding prior to forming `fmul_pdiv_nv`
17:28 karolherbst[d]: okay.. I found an issue with `nir_opt_load_store_vectorize`: it was creating `vec5`s which... caused suboptimal load/store accesses
17:28 karolherbst[d]: not sure it matters much, but heh
17:30 karolherbst[d]: mhenning[d]: mhh well.. I am seeing very weird and very non-uniform stuttering with BL3 and I was wondering if it's some compute stuff they do.. in previous versions they've used PhysX for particle simulations
17:30 mhenning[d]: karolherbst[d]: how so? that should get lowered to a vec4 + vec1 shortly afterwards right?
17:31 karolherbst[d]: mhenning[d]: yeah, but it was longer than that, and it ended up as: vec4 + vec1 + vec1 + vec2 + vec4
17:31 karolherbst[d]: or something weird
17:31 karolherbst[d]: like completely split apart
17:31 karolherbst[d]: now I have a 32x4 store 😄
17:32 karolherbst[d]: mhenning[d]: https://gist.githubusercontent.com/karolherbst/2d67f15e360619f74156e87890524c99/raw/40b599045bc4e9a724da829d880a9ce1395acb3b/gistfile1.txt
17:33 karolherbst[d]: now it's:
17:33 karolherbst[d]: con 32x4 %24 = load_const (0xbfad9881, 0xf6eaddcf, 0x26190d01, 0x71594535) = (-1.356217, -2.381829e+33, 0.000000, 1.075871e+30) = (-1079142271, -152379953, +639175937, +1901675829) = (3215825025, 4142587343, 639175937, 1901675829)
17:33 karolherbst[d]: @store_scratch_nv (%24 (0xbfad9881, 0xf6eaddcf, 0x26190d01, 0x71594535), %6 (0x0)) (base=0, align_mul=1073741824, align_offset=0)
17:35 karolherbst[d]: anyway.. I just prevent it from merging into non-pot vector sizes above 4
17:40 karolherbst[d]: I'm sure 0 shaders in the fossils I have are running into this problem
17:42 karolherbst[d]: ohh there are some
17:42 karolherbst[d]: 39
17:45 karolherbst[d]: of course those are all compute shaders
17:50 karolherbst[d]: mhh seems like there is more to solve
17:54 mhenning[d]: does https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39334 help at all?
17:55 karolherbst[d]: mhhhhhhh....
17:55 mhenning[d]: the current callback was meant to incrementally expand the vector. If we go the "disable vec5" route we might need to call vectorization multiple times
17:55 karolherbst[d]: only if we add them up to 16 I think
17:55 karolherbst[d]: like it's 16 single byte stores I'm seeing here
17:56 karolherbst[d]: though that should end up as two vec8
17:56 karolherbst[d]: and then vec16
17:56 mhenning[d]: there's no "and then vec16" right now, iirc
17:56 karolherbst[d]: but that's no different than doing it at vec4
17:56 karolherbst[d]: yeah...
17:56 mhenning[d]: the pass gets called once, so it makes two vec8 and then stops
17:56 karolherbst[d]: it seems like the pass decided to convert from 8 to 32 bit instead anyway
17:57 mhenning[d]: yeah, that's the other part, I think the callback is supposed to widen the bit width where possible
17:57 karolherbst[d]: I have a shader that with my fix has less optimal vectorization and not quite sure what's going on there yet
17:58 mhenning[d]: yeah, that's what I would expect
17:58 mhenning[d]: try calling it in a loop or using the vec6/7 patch
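The single-call limitation can be illustrated with a toy greedy merger (a hypothetical model for illustration, not the actual `nir_opt_load_store_vectorize` logic): one pass over 16 contiguous single-byte stores only doubles the chunk size, so reaching one 16-byte access takes repeated calls.

```python
def merge_pass(chunks, max_size=16):
    """One greedy pass over (offset, size) byte chunks: merge adjacent
    chunks when the result is contiguous, power-of-two sized, naturally
    aligned, and within max_size."""
    out = []
    for off, size in chunks:
        if out:
            poff, psize = out[-1]
            merged = psize + size
            if (poff + psize == off and merged <= max_size
                    and merged & (merged - 1) == 0
                    and poff % merged == 0):
                out[-1] = (poff, merged)
                continue
        out.append((off, size))
    return out

# 16 single-byte stores: one pass only pairs them up into 2-byte chunks;
# it takes four passes to end with a single 16-byte store
stores = [(i, 1) for i in range(16)]
print(merge_pass(stores))  # eight 2-byte chunks after one pass
```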
18:02 karolherbst[d]: it's like a 5 + 2 vs 4 + 3 situation here
18:04 karolherbst[d]: maybe it's RA being RA again
18:05 karolherbst[d]: mhh
18:05 karolherbst[d]: is there something special going on for shared memory?
18:06 karolherbst[d]: somehow those vectors are split again, even though they really shouldn't?
18:07 karolherbst[d]: yeah...
18:07 karolherbst[d]: something else is weird
18:08 karolherbst[d]: `nir_lower_mem_access_bit_sizes` seems to split them up again?
18:08 karolherbst[d]: yeah.. so it seems like I'm triggering something else additionally
18:10 karolherbst[d]: ohhhhh...
18:10 karolherbst[d]: uuuhhh
18:10 karolherbst[d]: vec 1 + vec4 + vec3 🥲
18:11 karolherbst[d]: why
18:12 karolherbst[d]: and without my patch it's vec1 + vec5 + vec2
18:12 karolherbst[d]: soo
18:12 karolherbst[d]: after splitting it up again
18:14 karolherbst[d]: before: vec1 + vec1 + vec2 + vec2 + vec2
18:14 karolherbst[d]: after: vec1 + vec1 + vec2 + vec1 + vec1 + vec2
18:14 karolherbst[d]: yeah so mhhh
18:15 karolherbst[d]: the issue is that the first vector is at 2-byte alignment and then it gets split up
18:15 karolherbst[d]: no idea why it doesn't merge with +0
18:15 karolherbst[d]: ohhhhh
18:15 karolherbst[d]: how evil
18:16 karolherbst[d]: 32 %5046 = iand %5043, %5045 (0xfffffffe)
18:16 karolherbst[d]: @store_shared (%5044, %5046) (base=0, access=none, wrmask=x, align_mul=16, align_offset=0)
18:16 karolherbst[d]: 32 %5048 = iadd %4466 (0x2), %5043
18:16 karolherbst[d]: @store_shared (%5081, %5048) (base=0, access=none, wrmask=abcde, align_mul=16, align_offset=2)
18:17 karolherbst[d]: like that `iand` can just go
18:17 karolherbst[d]: and then it should be able to properly vectorize it...
18:21 karolherbst[d]: mhenning[d]: any specific reason why we don't take the alignment into account properly?
18:21 karolherbst[d]: heh wait.. we do, but maybe it's wrong somehow?
18:23 karolherbst[d]: yeah.. so the check is `12 <= 16` in this instance
18:24 karolherbst[d]: so it returns true
18:24 karolherbst[d]: `(2 + 5 * 2) <= 16`
18:24 karolherbst[d]: maybe I'll make this very strict and see if that's better overall
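The check being discussed can be sketched like this (hypothetical helper names; the permissive version mirrors the `12 <= 16` test above, the strict one additionally demands natural alignment of the merged access):

```python
def may_merge(align_mul, align_offset, comp_bytes, num_comps):
    """Permissive check: the merged access only has to fit
    inside one align_mul-sized chunk."""
    return align_offset + comp_bytes * num_comps <= align_mul

def may_merge_strict(align_mul, align_offset, comp_bytes, num_comps):
    """Stricter variant: additionally require the merged access to be
    naturally aligned, so later splitting can't undo the merge."""
    size = comp_bytes * num_comps
    if align_offset + size > align_mul:
        return False
    # natural alignment: the start offset must be a multiple of the size
    return align_offset % size == 0

# the vec5 store from the chat: 16-bit comps at align_mul=16, align_offset=2
print(may_merge(16, 2, 2, 5))         # fits (2 + 10 <= 16) -> True
print(may_merge_strict(16, 2, 2, 5))  # misaligned vec5 -> False
```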
18:26 mhenning[d]: The idea is to collect as much as we can into 16 byte chunks, regardless of initial order
18:27 karolherbst[d]: mhh yeah, but that leads to suboptimal splitting later
18:27 mhenning[d]: the comment there explains the idea - combine aggressively and then split later
18:27 mhenning[d]: karolherbst[d]: then the splitting is broken
18:27 karolherbst[d]: it can't be better
18:27 karolherbst[d]: or maybe
18:28 karolherbst[d]: yeah.. it can't be better
18:28 karolherbst[d]: at least not how it's done currently?
18:29 karolherbst[d]: like even assuming the +0 component doesn't get merged, it should be `vec1 + vec1 + vec2 + vec4`
18:29 karolherbst[d]: but we have `vec1 + vec5 + vec2`
18:29 karolherbst[d]: so there is no way to get that
18:29 karolherbst[d]: but if the merging were alignment aware, it wouldn't create the vec5 in the first place
18:29 karolherbst[d]: it would produce `vec1 + vec1 + vec2 + vec4` and everything would be fine
18:30 karolherbst[d]: but I could also try to optimize the offset
18:30 karolherbst[d]: and then I'd get a proper vec8...
18:31 karolherbst[d]: of course with vec6 and vec7 that would mitigate it
18:31 karolherbst[d]: but then we'd have the same issue with 8 bit accesses
18:41 karolherbst[d]: okay.. people tell me I can't get rid of this `iand`, but I disagree 🙃
18:45 karolherbst[d]: in any case.. until we figure out if that `iand` can be safely optimized away, I think `nak_mem_vectorize_cb` has to be less optimistic here..
19:17 hatfielde1: I think the fmul_pdiv_nv scale modifier is implemented as a modifier to the exponent sum during multiplication, performed before rounding and overflow/underflow handling. So it's not exactly applied to the first/second args in a way comparable to fmul(fmul(a, 2**c), b). I left some data here: https://gist.github.com/htfld/4b4da38e3bc5cde51611001c11e5e818
19:20 hatfielde1: That is what I think, but it will need more verification: some x * y that puts bits into the rightmost register, then a c that shifts them left and applies rounding based on the sticky/round/guard bits on the suffix, which can be tested against a known answer. Also not totally sure where this leaves us wrt correctness, but we should still be able to use it, right?
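If that theory holds, the hw op would behave like a single rounding of the exponent-scaled exact product. A minimal model of that reading (an assumption about fmul_pdiv_nv, not verified hw semantics; double arithmetic is exact for products of two float32 values, so `a * b` below is the infinitely precise product):

```python
import math
import struct

def f32(x):
    """Round a Python float to the nearest float32 value."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def fmul_pdiv_model(a, b, c):
    """Hypothetical fmul_pdiv_nv: add c to the exponent sum of the
    exact product, then round once to float32."""
    return f32(math.ldexp(a * b, c))

def fmul_chain(a, b, c):
    """The pattern it would replace: round the pre-scaled source,
    then round the product."""
    return f32(f32(math.ldexp(a, c)) * b)

# for finite, normal results the two agree: scaling by 2**c is exact
a, b = f32(1.5), f32(1.0 / 3.0)
print(fmul_pdiv_model(a, b, -1) == fmul_chain(a, b, -1))  # True
```

Under this model the two only diverge at overflow/underflow boundaries, which matches the f32::MAX case from earlier.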
19:56 karolherbst[d]: hatfielde1: oh.. yeah that _could_ be it. If it's simply adding to the exponent it's indeed fast, but also very much violating IEEE 😄
19:57 karolherbst[d]: in terms of correctness.. I think we can only use it on `fmul` that are `ninf`
19:57 karolherbst[d]: ... _maybe_
19:57 karolherbst[d]: I think it kinda depends on what the internal accuracy is here
19:59 karolherbst[d]: though what about `(fmul, (fmul(ninf), a, #c), b)` -> `(fmul_pdiv_nv, a, b, c)`?
19:59 karolherbst[d]: uhm.. that expression might be solid
19:59 karolherbst[d]: for certain `c` of course
20:00 karolherbst[d]: hatfielde1: mind checking whether the hw instruction behaves correctly as long as `a * pdiv` stays a finite number?
20:00 karolherbst[d]: that's something we could easily model
20:06 glehmann: this sounds similar to fma fusing, i.e. the result is only closer to the infinitely precise result
20:07 glehmann: so it should be fine without NoContract (atm still exact in nir)
20:10 karolherbst[d]: ohhh
20:10 karolherbst[d]: yeah
20:10 karolherbst[d]: I think you are right on that one
20:13 karolherbst[d]: I think with the `ninf` check it can be done for exact `fmul`, but without exact it can be done on any?
20:14 karolherbst[d]: though not sure what happens with values close to 0 and a 0.125 multiplier
20:15 karolherbst[d]: because in theory this might return non-0 where 0 is expected?
20:16 glehmann: more interesting is the case where the multiplication underflows, the multiplier is not fractional, and the infinitely precise result does not underflow
20:18 karolherbst[d]: yeah...
20:19 karolherbst[d]: but I think we can assume it's also done with infinite precision there
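The underflow case glehmann describes is easy to construct under that single-rounding assumption (a sketch; whether real hw matches this model is exactly what would need testing):

```python
import math
import struct

def f32(x):
    """Round a Python float to the nearest float32 value
    (underflow rounds silently toward zero/subnormals)."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

a, b, c = f32(2e-38), f32(1e-10), 20  # multiplier 2**20 is not fractional

# round first, then scale: the f32 product underflows to 0
round_then_scale = f32(math.ldexp(f32(a * b), c))

# scale the exact product, round once: a subnormal survives
scale_then_round = f32(math.ldexp(a * b, c))

print(round_then_scale, scale_then_round)  # 0.0 vs a tiny subnormal
```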
20:20 karolherbst[d]: mhenning[d]: I restricted the vectorize cb only to aligned ones (stats on top of https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40292):
20:20 karolherbst[d]: Totals from 689 (0.06% of 1163204) affected shaders:
20:20 karolherbst[d]: CodeSize: 26315792 -> 26315184 (-0.00%); split: -0.02%, +0.02%
20:20 karolherbst[d]: Number of GPRs: 60914 -> 60882 (-0.05%)
20:20 karolherbst[d]: Static cycle count: 156504391 -> 156468562 (-0.02%); split: -0.05%, +0.02%
20:20 karolherbst[d]: Spills to memory: 15453 -> 15441 (-0.08%)
20:20 karolherbst[d]: Fills from memory: 15453 -> 15441 (-0.08%)
20:20 karolherbst[d]: Max warps/SM: 18640 -> 18656 (+0.09%)
20:20 karolherbst[d]: will have to check what's going on there, but it doesn't seem to practically matter much
20:20 karolherbst[d]: but it helps with the one shader I was looking at
20:21 karolherbst[d]: I suspect the regressions are RA things
23:47 karolherbst[d]: con 64 %1481 = pack_64_2x32 %226
23:47 karolherbst[d]: div 64 %1482 = u2u64 %1480
23:47 karolherbst[d]: div 64 %1483 = iadd %1481, %1482
23:47 karolherbst[d]: div 32 %1484 = iadd %1480, %333 (0x3)
23:47 karolherbst[d]: div 1 %1485 = ult %1484, %227.x
23:47 karolherbst[d]: div 32 %1486 = @load_global_nv (%1483, %1485) (base=0, access=readonly|reorderable, align_mul=16, align_offset=0)
23:47 karolherbst[d]: div 32 %1487 = iadd %307 (0x4), %1480
23:47 karolherbst[d]: div 64 %1488 = u2u64 %1487
23:47 karolherbst[d]: div 64 %1489 = iadd %1481, %1488
23:47 karolherbst[d]: div 32 %1490 = iadd %303 (0x7), %1480
23:47 karolherbst[d]: div 1 %1491 = ult %1490, %227.x
23:47 karolherbst[d]: div 32 %1492 = @load_global_nv (%1489, %1491) (base=0, access=readonly|reorderable, align_mul=16, align_offset=4)
23:47 karolherbst[d]: div 32 %1493 = iadd %342 (0x8), %1480
23:47 karolherbst[d]: div 64 %1494 = u2u64 %1493
23:47 karolherbst[d]: div 64 %1495 = iadd %1481, %1494
23:47 karolherbst[d]: con 32 %1496 = load_const (0x0000000b = 11)
23:47 karolherbst[d]: div 32 %1497 = iadd %1496 (0xb), %1480
23:47 karolherbst[d]: div 1 %1498 = ult %1497, %227.x
23:47 karolherbst[d]: div 32 %1499 = @load_global_nv (%1495, %1498) (base=0, access=readonly|reorderable, align_mul=16, align_offset=8)
23:47 karolherbst[d]: con 32 %1500 = load_const (0x0000000c = 12)
23:47 karolherbst[d]: div 32 %1501 = iadd %1500 (0xc), %1480
23:47 karolherbst[d]: div 64 %1502 = u2u64 %1501
23:47 karolherbst[d]: div 64 %1503 = iadd %1481, %1502
23:47 karolherbst[d]: div 32 %1504 = iadd %328 (0xf), %1480
23:47 karolherbst[d]: div 1 %1505 = ult %1504, %227.x
23:47 karolherbst[d]: div 32 %1506 = @load_global_nv (%1503, %1505) (base=0, access=readonly|reorderable, align_mul=16, align_offset=12)
23:47 karolherbst[d]: I think I can do something about this 😄
23:47 karolherbst[d]: mhh maybe not.. I guess we need to do the bound checks per component.. mhh
23:47 karolherbst[d]: annoying