02:10 gfxstrand[d]: This will get us the EDB path with VKD3D: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33613
02:26 redsheep[d]: gfxstrand[d]: Is this also only in conjunction with the MR over on vkd3d?
02:34 gfxstrand[d]: Nope. It supersedes it
03:39 gfxstrand[d]: gfxstrand[d]: If folks wanted to throw some D3D12 titles at that MR, I wouldn't mind. I think it should be okay but it switches VKD3D to the EDB path which may have unexpected results. Everything seems good with the VKD3D tests and DA:TV, though.
04:56 tiredchiku[d]: :salute:
05:34 tiredchiku[d]: I don't have very many D3D12 titles installed, but
05:36 tiredchiku[d]: in Amid Evil, the frametime graphs are much smoother with the MR, same framerate as without
05:36 tiredchiku[d]: and the same is observed in Alan Wake 2
08:50 phomes_[d]: gfxstrand[d]: I will run it through my list of games today. We are looking for both performance changes and correctness right?
12:08 linkmauve: Hi, what’s the current status on the Tegra X1?
12:09 linkmauve: I’m going to try getting a kernel as close to mainline as possible running on it first, but once I’m there can I expect NVK to run?
12:23 Mary: linkmauve: tldr; syncpoints aren't implemented on nouveau kernel driver, meaning that all sync is busted on Gallium and NVK
12:23 Mary: so that needs to be fixed first
12:24 Mary: NVK can run, from the little testing I did, but with sync broken all over the place, it's not usable
12:24 linkmauve: Is nova any better at that? Although from what I’ve read it only deals with the GSP firmware, could it be extended to work on Tegra too?
12:27 Mary: It's GSP only so the only Tegra devices that might be used with it (with proper integration as GSP is different there) are Xavier and Orin
12:28 Mary: linkmauve: syncpoints are provided by the host1x kernel driver basically, I think it shouldn't be too hard to integrate in nouveau (you can look at how nvgpu handles those) but I think host1x will need changes too
12:28 Jasper[m]: <linkmauve> "I’m going to try getting a..." <- I've had major issues with even getting nouveau to run on my Pixel C, so if you get some sort of breakthrough, please let me know.
13:32 gfxstrand[d]: phomes_[d]: Any changes are good to know about. Perf, correctness, crashes. It's a pretty major change in how VKD3D and NVK interact.
15:01 phomes_[d]: gfxstrand[d]: The way I am testing is that I do a checkout at this branch, at main today, main one week ago, and main two weeks ago. Then I test games by loading into a place in the game where fps is somewhat stable. Or use a benchmark mode. And then I log the development in perf that way
15:02 phomes_[d]: Age of Empires IV is not affected by any of the changes
15:03 phomes_[d]: Hitman improved from 31 last week to 63 with current main. 64.5 with your MR
15:03 phomes_[d]: no crashes or corruption spotted
15:17 gfxstrand[d]: Ugh... I hate robustness2...
15:19 karolherbst[d]: what are you hating about it?
15:20 gfxstrand[d]: Everything?
15:20 gfxstrand[d]: Really, I just hate bounds checking.
15:20 karolherbst[d]: I see
15:21 karolherbst[d]: besides global memory there are a couple of ways to get them for free
15:21 gfxstrand[d]: But I think HK has convinced me that we at least don't need to split things.
15:21 gfxstrand[d]: Well, yeah, UBOs are free.
15:21 gfxstrand[d]: UBOs are nice like that.
15:21 karolherbst[d]: vbos as well
15:22 gfxstrand[d]: Yeah, those are hardware.
15:22 gfxstrand[d]: It's SSBOs that suck.
15:22 karolherbst[d]: yeah.. that's the big one
15:24 karolherbst[d]: but that bit of alu shouldn't be that expensive, well.. fetching the size of the ssbo is probably the painful part
15:50 gfxstrand[d]: I think I have a branch in which NVK doesn't scalarize absolutely everything. I'm hoping to get a perf bump from it.
15:55 gfxstrand[d]: No observable difference. 😭
15:56 gfxstrand[d]: Not sure if it's not helping or if I'm just doing something wrong.
16:01 karolherbst[d]: I think without a proper instruction scheduler those things might not help as much
16:01 karolherbst[d]: most of the optimizations on that level are about reducing latencies and waits
16:01 karolherbst[d]: also
16:01 karolherbst[d]: higher gpr usage can counter those benefits
16:06 gfxstrand[d]: It should help a bit because it's also not bounds checking per-component anymore (at least in theory)
16:06 karolherbst[d]: those things don't really matter
16:06 snowycoder[d]: Fixed a tiny bug with MSAA sparse residency lowering, should remove the last crashes in deqp tests: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33625
16:07 karolherbst[d]: like.. if a warp hits a state where it waits on memory, the scheduler just moves to a different warp
16:07 gfxstrand[d]: karolherbst[d]: It does when all your loads are in if statements that stall and wait on the load at the end of the if.
16:07 karolherbst[d]: and normally you just have warps being more bored when removing some alu
16:07 gfxstrand[d]: So splitting a vec4 4xes the number of stalls
16:08 karolherbst[d]: well you also have caches
16:08 karolherbst[d]: so might just stall once and the other requests are served directly from L1 cache
16:09 gfxstrand[d]: Yeah, potentially
16:10 karolherbst[d]: the more gprs you use the harder it gets for the hardware to mask the latencies there, so this is certainly one potential drawback, but normally if the compiler optimizes and RAs well enough it shouldn't be much of an issue
16:11 karolherbst[d]: but proper instruction scheduling would also help in some form.. it's kinda a complex mess in the end
16:11 gfxstrand[d]: So would some sort of predication.
16:11 gfxstrand[d]: TBH, that might be the next thing I try.
16:11 karolherbst[d]: predication instead of branching?
16:12 karolherbst[d]: nvidia uses guard predicates pretty aggressively
16:12 gfxstrand[d]: Yeah. Well, a very lightweight predication optimization, anyway. Enough so most of our loads turn into `@p ldg rX rY; @!p mov rX rZ`
16:12 karolherbst[d]: you want to be more aggressive there
16:13 karolherbst[d]: like up to 6 instructions predicated is a good idea
16:13 karolherbst[d]: or something
16:13 karolherbst[d]: branching hurts you more
16:13 gfxstrand[d]: gfxstrand[d]: But also, I need to teach the dependency tracker to ignore `WaW` dependencies with opposite predicates.
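A minimal sketch of the idea in the message above, assuming a heavily simplified instruction model: two writes to the same register do not need a write-after-write dependency when they are guarded by the same predicate register with opposite polarity, since at most one of them executes in any given thread. The `Pred` and `Instr` types and both functions are hypothetical and are not NAK's actual dependency tracker; the sketch also assumes the predicate is not redefined between the two instructions.

```rust
/// Hypothetical guard predicate: a predicate register plus an inversion flag.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Pred {
    reg: u8,
    inverted: bool,
}

/// Hypothetical, heavily simplified instruction: an optional guard and one
/// destination GPR. Real instructions carry far more state than this.
#[derive(Clone, Copy, Debug)]
struct Instr {
    guard: Option<Pred>,
    dst: u8,
}

/// True when the two guards can never both hold in the same thread, i.e.
/// they test the same predicate register with opposite polarity.
/// Assumes the predicate register is not rewritten between the two uses.
fn guards_are_exclusive(a: Option<Pred>, b: Option<Pred>) -> bool {
    match (a, b) {
        (Some(x), Some(y)) => x.reg == y.reg && x.inverted != y.inverted,
        // An unguarded instruction always executes, so nothing is exclusive.
        _ => false,
    }
}

/// A write-after-write hazard only needs tracking when both writes target
/// the same register and both can execute in one thread.
fn needs_waw_dep(first: &Instr, second: &Instr) -> bool {
    first.dst == second.dst && !guards_are_exclusive(first.guard, second.guard)
}
```

With this model, the pair `@p ldg r4 ...; @!p mov r4 rZ` reports no WaW dependency, while the same load followed by an unguarded `mov r4 rZ` still does.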
16:13 karolherbst[d]: well not branching itself, but diverging threads
16:14 karolherbst[d]: mhh yeah...
16:14 gfxstrand[d]: Getting this to actually do what we want is going to be tricky.
16:15 gfxstrand[d]: It's going to need to be multiple things stacked
16:16 gfxstrand[d]: Because part of the problem right now is that we're doing `if (...) { ld } else { 0 }` for every component, and that's stalling like mad. It's also probably inserting re-convergence barriers around each of those tiny ifs and those are a mess as well.
16:17 karolherbst[d]: yeah.....
16:17 karolherbst[d]: you want to use predication for sure
16:17 gfxstrand[d]: If I can get it to use predicates for bounds checking, we'll be in much better shape.
16:17 gfxstrand[d]: Unfortunately, it needs to be *real* predication, not just do both things and `sel` like NIR does because we need to not fault.
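To illustrate the distinction drawn above between real predication and "do both things and `sel`", here is a small host-side model. It is only a sketch: memory is a slice, a fault is modeled as an `Err`, and both function names are made up for this example. The point is that the select form executes the load unconditionally, so an out-of-bounds access still faults even though its result would have been discarded.

```rust
/// Models "do both things and sel": the load always executes, so an
/// out-of-bounds index faults (modeled as Err) before the select can
/// throw the loaded value away.
fn load_with_select(mem: &[u32], idx: usize, in_bounds: bool) -> Result<u32, &'static str> {
    let loaded = *mem.get(idx).ok_or("fault: out-of-bounds access")?;
    Ok(if in_bounds { loaded } else { 0 })
}

/// Models `@p ldg rX rY; @!p mov rX rZ`: when the predicate is false the
/// load never executes, so no fault can occur and the result is zero.
fn load_predicated(mem: &[u32], idx: usize, in_bounds: bool) -> Result<u32, &'static str> {
    if in_bounds {
        mem.get(idx).copied().ok_or("fault: out-of-bounds access")
    } else {
        Ok(0)
    }
}
```

For an in-bounds index both forms return the same value; for an out-of-bounds index with a false predicate, only the predicated form returns zero without faulting.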
16:18 karolherbst[d]: codegen had a concept for that and it was a mess
16:18 gfxstrand[d]: Yeah, I've seen it.
16:18 gfxstrand[d]: I think the first try will be something simple that runs after RA
16:18 karolherbst[d]: most of the opts were done after leaving SSA form I think
16:18 karolherbst[d]: yeah
16:18 gfxstrand[d]: Full SSA predication support is harder. I have half a plan but it's not easy.
16:19 karolherbst[d]: I don't think it matters much if it's done before or after, or I can't really imagine why it would
16:20 karolherbst[d]: but yeah.. this would cut out a lot of overhead in the shaders
16:20 karolherbst[d]: I think it also helps with latency masking if the thread blocks are bigger
16:20 karolherbst[d]: at least the scheduler is less busy
17:09 gfxstrand[d]: I should try to fix Zink before I get too lost in NAK predication.
17:14 mhenning[d]: gfxstrand[d]: speaking of zink fixes, have you seen this discussion that happened while you were out? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31585#note_2628332 I never got around to looking at the regression more closely
17:25 gfxstrand[d]: Yeah, I missed that. No idea what's going on there.
19:16 snowycoder[d]: Quick question, I've only written a toy compiler before.
19:16 snowycoder[d]: I'm trying to implement the `i2b(b2i(x))` folding optimization pass.
19:16 snowycoder[d]: Should I replace all SSA references for the folded value or should I just replace the second conversion with a copy and let copy_prop handle SSA propagation?
19:16 snowycoder[d]: I guess the second one is cleaner?
19:17 karolherbst[d]: use `nir_opt_algebraic.py`
19:17 snowycoder[d]: I can't, we're already in the NAK IR https://gitlab.freedesktop.org/mesa/mesa/-/issues/10204
19:18 karolherbst[d]: ahh, I see
19:18 snowycoder[d]: I could implement something similar for NAK, but it seems a bit of an overkill
20:03 Lyude: skeggsb9778[d]: any chance you might be able to look at https://lore.kernel.org/all/im7gtswtfo6c24waourrtaoeazxuk5paeqblzig73knks735b2@dsj2svieqmur/ ? just got poked about it, I think I was just waiting for the OK from you on that
20:04 Lyude: ------------r5t444444444444444444444444444
20:07 karolherbst: hi cat (what's their name btw? and was that even a cat?)
20:24 gfxstrand[d]: snowycoder[d]: I thought we already had that. Or maybe we just have the other one.
20:26 snowycoder[d]: gfxstrand[d]: Doesn't seem we have it yet.
20:26 snowycoder[d]: I reproduced the issue and got
20:26 snowycoder[d]: {%r40 %r41 %r42 %r43} %p44 = suld.p.2d.strong.sys [{%r38 %r35}] %ur20
20:26 snowycoder[d]: %r45 = sel %p44 rZ 0x1
20:26 snowycoder[d]: %p46 = isetp.ne.i32 %r45 rZ
20:26 snowycoder[d]: %r47 = sel %p44 %r40 %ur9
20:26 snowycoder[d]: %r48 = f2i.u32.f32.rz.ftz %r47
20:27 snowycoder[d]: Using the test: `dEQP-VK.sparse_resources.multisampled_image_sparse_residency.r32f.samples_2`
20:45 gfxstrand[d]: Or maybe I just planned to write one and never did? That seems plausible.
20:46 gfxstrand[d]: snowycoder[d]: Probably replace it with a copy. Or if you can make it part of copy-prop somehow, that might work even better.
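The "replace it with a copy" approach can be sketched on a toy SSA IR. Everything here is hypothetical (the `Op` enum, the convention that `ops[i]` defines value `i`, and the function name) and it does not reflect NAK's real IR or pass structure. The fold only rewrites the defining instruction of the outer conversion and leaves every use alone, on the assumption that a later copy-propagation pass will forward the source.

```rust
/// Hypothetical toy SSA ops; `ops[i]` defines SSA value `i`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    /// An opaque definition of a boolean value.
    Def,
    /// dst = b2i(src)
    B2I(usize),
    /// dst = i2b(src)
    I2B(usize),
    /// dst = copy src
    Copy(usize),
}

/// Rewrites every `i2b(b2i(x))` into `copy x`. Uses of the folded value are
/// not touched; a separate copy-propagation pass is expected to clean up.
fn fold_i2b_b2i(ops: &mut [Op]) {
    for i in 0..ops.len() {
        if let Op::I2B(src) = ops[i] {
            // Look through the source; ignore malformed indices.
            if let Some(Op::B2I(x)) = ops.get(src).copied() {
                ops[i] = Op::Copy(x);
            }
        }
    }
}
```

Keeping the fold this small means it never has to walk or patch users itself, which is the appeal of the copy-based option raised in the question.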
20:56 snowycoder[d]: gfxstrand[d]: Mmh, never thought of merging the two, I'll try
20:57 snowycoder[d]: In copy_prop, why can't we propagate the modifiers (e.g. bnot)? ["If there are modifiers, the source types have to match"](https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/compiler/nak/opt_copy_prop.rs#L247)
21:08 gfxstrand[d]: Not sure what you're asking.
21:16 snowycoder[d]: If I replace the operations with a copy, I get negated copies sometimes:
21:16 snowycoder[d]: %p2 = copy !%p1
21:16 snowycoder[d]: %r3 = sel %p2 %r40 rZ
21:16 snowycoder[d]: This cannot get propagated because there are modifiers and source types differ (predicate vs gpr).
21:16 snowycoder[d]: I tried removing that check and everything still works?
21:34 mhenning[d]: The source type shouldn't differ in your example. sel's first argument is a predicate and copy is writing a predicate
21:36 mhenning[d]: In general we can't always fold a modifier into an instruction. Eg. fadd can't use an integer negate on one of its operands - there's no way to encode that in the isa - even though there's nothing technically making that impossible for a shader to do
21:38 snowycoder[d]: It might help optimizations if we add a `SrcType::is_modifier_compatible(SrcMod)` to be less conservative
21:40 gfxstrand[d]: There isn't really much information lost there.
21:40 mhenning[d]: I'm not sure I understand how the current check is conservative
21:40 gfxstrand[d]: Also, you can't always do that. A fneg on an f32 doesn't propagate to an f16v2
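A rough sketch of what the proposed `SrcType::is_modifier_compatible(SrcMod)` check could look like, written as a free function over toy enums. The enum variants and the compatibility table are assumptions made purely for illustration; NAK's real `SrcType` and `SrcMod` have more cases and different rules. The sketch only encodes the two constraints mentioned in the discussion: an integer negate cannot be folded into a float source, and an f32 `fneg` does not carry over to an `f16v2` source.

```rust
/// Toy source types (assumption: not NAK's real `SrcType`).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SrcType {
    F32,
    F16v2,
    I32,
    Pred,
}

/// Toy source modifiers (assumption: not NAK's real `SrcMod`).
/// `FNeg` here means a 32-bit float negate specifically.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SrcMod {
    None,
    FNeg,
    INeg,
    BNot,
}

/// Illustrative compatibility table: a modifier may be folded into a source
/// only when the consuming instruction interprets the source as a type the
/// modifier is defined for.
fn is_modifier_compatible(ty: SrcType, m: SrcMod) -> bool {
    match (ty, m) {
        (_, SrcMod::None) => true,
        (SrcType::F32, SrcMod::FNeg) => true,
        (SrcType::I32, SrcMod::INeg) => true,
        (SrcType::Pred, SrcMod::BNot) => true,
        // Everything else, e.g. INeg on F32 or an f32 FNeg on F16v2, is rejected.
        _ => false,
    }
}
```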
21:41 gfxstrand[d]: But also, it's possible that it's just because copy-prop isn't being run again after your pass. NAK doesn't currently run optimizations in a loop.
21:41 gfxstrand[d]: Oh, and copy doesn't support modifiers at all
21:42 gfxstrand[d]: So that's an issue
21:42 snowycoder[d]: gfxstrand[d]: You mean the copy opcode or the copy-prop pass?
21:42 gfxstrand[d]: the copy opcode
21:43 snowycoder[d]: welp, now I need to be sure that copy-prop optimizes it if I want to do it in a separate pass xD
21:43 snowycoder[d]: I'm sure copy-prop is run after mine (I put mine at first and I'm checking everything with NAK_DEBUG=print)
21:44 snowycoder[d]: I guess OpCopy always has src and dst type as GPR?
21:44 gfxstrand[d]: copy-prop will propagate a no-op OpPLop2
21:44 gfxstrand[d]: snowycoder[d]: yes
21:44 mhenning[d]: It's worth noting that copy-prop also has a known bug: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12480
21:49 gfxstrand[d]: I think that one's an easy fix. See the MR. Patch untested but it builds.
21:51 gfxstrand[d]: phomes_[d]: ^^
21:52 mhenning[d]: gfxstrand[d]: Take a look at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33107 which has a fix that then exposes another bug
21:53 snowycoder[d]: gfxstrand[d]: Ooooh, I can output plop3 for b2i2b and addi for i2b2i so I don't have to use modifiers
21:54 phomes_[d]: mhenning[d]: I planned to look at updating that MR tonight. I got busy and then sick. But I am back today and working on things
21:54 snowycoder[d]: snowycoder[d]: Makes sense, I'll try that as a fallback if I can't merge that pass with copy-prop
21:57 phomes_[d]: I am still doing testing with vkd3d. A lot of the games I installed to test are crashing early or not launching at all though. Atomic Heart has a free demo version but surprisingly it did not seem to improve
21:59 gfxstrand[d]: phomes_[d]: I left a couple comments. I'm back full time now, BTW, so feel free to ping me about stuff. (I do sleep, though, so don't expect me to always be on. 😉 )