00:25 karolherbst[d]: phomes_[d]: might want to check if this patch does anything to performance? https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/f7733bfb0a28b886da5331ae0721b052a041ca37
00:25 karolherbst[d]: I have a quadro card myself and kinda wondering if it has a different default value
00:27 karolherbst[d]: it's kinda interesting, setting it to `TRUE` drops performance in pixmark_piano from 85 to 80 fps
00:53 karolherbst[d]: p4 null = plop3 p5 p4 p6 LUT[0xfe] LUT[0x0] // delay=1
00:53 karolherbst[d]: p5 null = plop3 pT p4 pT LUT[0x33] LUT[0x0] // delay=1
00:53 karolherbst[d]: -> `p4 p5 = plop3 p5 p4 p6 LUT[0xfe] LUT[0x0] // delay=1` and I wonder if I should make copy_prop deal with this...
00:54 karolherbst[d]: uhh wait I think it's more complicated than that
00:55 karolherbst[d]: so the 2nd output uses the second LUT
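A minimal sketch of the fold discussed above, not NAK code. It assumes the usual LOP3 truth-table convention (first source selects bit 2 of the LUT index, second source bit 1, third source bit 0, so `0xfe` is a three-way OR and `0x33` is NOT of the second source). It also assumes the second destination is simply the second LUT evaluated over the same three sources, as stated in the chat. The helper names `eval_lut`, `plop3` and `folded_not_lut` are hypothetical.

```python
def eval_lut(lut: int, a: bool, b: bool, c: bool) -> bool:
    """Evaluate an 8-bit truth table for three boolean inputs.

    Assumed convention: a -> index bit 2, b -> index bit 1, c -> index bit 0.
    """
    index = (int(a) << 2) | (int(b) << 1) | int(c)
    return bool((lut >> index) & 1)


def plop3(a: bool, b: bool, c: bool, lut0: int, lut1: int) -> tuple[bool, bool]:
    """Model of a two-output predicate LOP3: one result per LUT (assumption)."""
    return eval_lut(lut0, a, b, c), eval_lut(lut1, a, b, c)


def folded_not_lut(lut: int) -> int:
    """LUT that yields the logical NOT of what `lut` yields on the same sources."""
    return ~lut & 0xFF


LUT_OR3 = 0xFE    # a | b | c
LUT_NOT_B = 0x33  # !b


def original_sequence(p5: bool, p4: bool, p6: bool) -> tuple[bool, bool]:
    """The two instructions from the chat: an OR, then a NOT of its result."""
    new_p4, _ = plop3(p5, p4, p6, LUT_OR3, 0x00)
    new_p5, _ = plop3(True, new_p4, True, LUT_NOT_B, 0x00)
    return new_p4, new_p5


def folded_sequence(p5: bool, p4: bool, p6: bool) -> tuple[bool, bool]:
    """One plop3 producing both results via the second LUT."""
    return plop3(p5, p4, p6, LUT_OR3, folded_not_lut(LUT_OR3))
```

Under these assumptions the folded form has to carry the inverted table in the second LUT slot rather than `LUT[0x0]`, which matches the "more complicated than that" remark.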
00:58 karolherbst[d]: uhh...
00:58 karolherbst[d]: yeah...
01:00 karolherbst[d]: oh uhh.. we already make use of it for some..
01:05 karolherbst[d]: okay so for swap that works out
01:10 karolherbst[d]: or maybe opt_lop is the better place for it
04:17 gfxstrand[d]: Yeah, opt_lop is supposed to figure that stuff out.
04:18 gfxstrand[d]: The fact that it has two LUTs and two outputs is both a blessing and a curse for optimization.
04:29 mhenning[d]: karolherbst[d]: Yeah, I've seen that. My guess is that it only works on turing+ although I can't say I've tried it
05:56 phomes_[d]: karolherbst[d]: I don't see any perf change in the tests
11:50 karolherbst[d]: okay, good
12:05 karolherbst[d]: gfxstrand[d]: yeah...... folding in another result is sure a messy optimization
12:06 karolherbst[d]: at least for `inot` it should be trivial enough
12:06 karolherbst[d]: just move the dest into the previous `plop`
12:06 karolherbst[d]: or something
12:29 gfxstrand[d]: Yeah
12:30 gfxstrand[d]: And even if not all uses are inot, we can make one of the destinations the not, or something like that
13:05 karolherbst[d]: well we could also merge two boolean logic ops with a shared src, so I'm a bit hesitant to "always" make it the not or something. Especially as you can also chain `plop3`s
13:06 karolherbst[d]: the semantics of the second output change with the `.AND`, `.OR`, etc. modifiers
13:06 karolherbst[d]: though not sure.. I think that's just PSETP that's not there in hw anymore
13:07 karolherbst[d]: what's also interesting is that `PLOP3` can also read from a register and that could undo some of the predicate spilling
13:08 karolherbst[d]: (they read from the sign bit)
13:58 karolherbst[d]: so for RA that means, if we spill predicates, we should rather convert PLOPs
13:58 karolherbst[d]: and use FSET instead of FSETP
13:59 karolherbst[d]: but not sure if that fits well into the current RA model...
14:24 karolherbst[d]: somehow this feels easier to do with nir 😄
14:25 karolherbst[d]: mhh alus don't have space for the luts 😢
14:25 karolherbst[d]: (could be sources :'))
14:25 karolherbst[d]: (I'm joking)
18:53 karolherbst[d]: okay.. this kind of optimization sucks to do in NAK 🙃 This is a classic "I have to adjust a source" thing, and that's not really possible. I'm wondering if something more fundamental needs to be done to tackle those kinds of opts
18:55 karolherbst[d]: the hacky way would be to just place a new instruction directly in the instruction list of a block, but not sure that's a great idea either...
20:28 _lyude[d]: I was definitely 100% on the wrong track going through the vmm code trying to fix this screen flashing issue on my AD102 GPU, but now I've found what seems to be the actual issue and it's a bit strange.
20:28 _lyude[d]: It seems like most of the time creating the primary display framebuffer for each monitor results in this:
20:28 _lyude[d]: Apr 17 16:10:02 GoldenWind kernel: [drm:nouveau_framebuffer_new [nouveau]] offset=0 stride=7680 h=1080 gobs_in_block=32 bw=120 bh=5 gob_size=512 bl_size=9830400 size=9830400
20:28 _lyude[d]: But when things start failing, we get this:
20:28 _lyude[d]: Apr 17 16:10:02 GoldenWind kernel: [drm:nouveau_framebuffer_new [nouveau]] offset=0 stride=7680 h=1080 gobs_in_block=32 bw=120 bh=5 gob_size=512 bl_size=9830400 size=262144
20:30 _lyude[d]: so it would appear that, seemingly at random, either userspace or the kernel is miscalculating the required size for the framebuffer
20:30 _lyude[d]: I noticed as well the other half of this seems to be when we fail to create cursor framebuffers, where we will randomly get a pitch of `0` passed in with one of the framebuffers
20:39 jannau: what size has the cursor fb? 262144 is 4 * 256 * 256
20:39 _lyude[d]: omg
20:39 _lyude[d]: that explains it perfectly
20:39 _lyude[d]: the pitch is 0 because that's correct for a block linear buffer, but the size is wrong on the other one because it's the size of the cursor
20:39 _lyude[d]: it's literally mixing up the size info between framebuffers
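The numbers in the two log lines can be checked by hand. The sketch below is not the kernel's code: it assumes a GOB is 64 bytes wide and 8 rows tall (which is consistent with the logged `gob_size=512`, `bw=120` and `bh=5`), and the names `block_linear_size` and `cursor_size` are hypothetical. It reproduces the expected `bl_size` and shows that the bad `size` equals a 256x256 cursor at 4 bytes per pixel.

```python
GOB_WIDTH_BYTES = 64  # assumption: bytes per GOB row
GOB_HEIGHT_ROWS = 8   # assumption: rows per GOB (64 * 8 = 512 = gob_size)


def div_round_up(n: int, d: int) -> int:
    """Integer division rounding up."""
    return (n + d - 1) // d


def block_linear_size(stride: int, height: int,
                      gobs_in_block: int, gob_size: int) -> tuple[int, int, int]:
    """Return (bw, bh, bl_size) for a block-linear surface.

    bw: width in blocks (one GOB wide), bh: height in blocks,
    bl_size: total bytes needed to back the surface.
    """
    bw = div_round_up(stride, GOB_WIDTH_BYTES)
    bh = div_round_up(height, GOB_HEIGHT_ROWS * gobs_in_block)
    return bw, bh, bw * bh * gobs_in_block * gob_size


def cursor_size(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Size in bytes of a linear cursor image."""
    return width * height * bytes_per_pixel


# Values taken from the log lines above.
bw, bh, bl_size = block_linear_size(stride=7680, height=1080,
                                    gobs_in_block=32, gob_size=512)
print(bw, bh, bl_size)       # 120 5 9830400
print(cursor_size(256, 256)) # 262144
```

So the failing framebuffer is being validated against a buffer 37.5 times too small for it, and that buffer has exactly the cursor's size.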
20:40 _lyude[d]: now i need to figure out how on earth that could happen
20:42 karolherbst[d]: ~~would be funny if it's some uninitialized stack value and most of the time it works out alright~~
20:43 _lyude[d]: I am thinking that might be exactly it
20:48 _lyude[d]: How common is it typically for a tiled fb format to not have a pitch? I know that it's correct in blocklinear's case but I'm just curious if that's normal for the tiling formats other GPUs use
20:49 _lyude[d]: or, hm, no, that's the wrong question. we do technically have a pitch for a blocklinear framebuffer, i think
22:10 karolherbst[d]: wait.. blocklinear was tiled, right?
22:12 karolherbst[d]: then you shouldn't have a pitch, if I'm not mistaken. dunno.. what would a pitch mean in a tiled image?