05:01tiredchiku[d]: mhenning[d]: will test this today
05:01tiredchiku[d]: have a game that runs below 30fps, so even small percentage gains should be easy to notice
09:50phomes_[d]: I tested it on three games so far. X4 Foundations (vulkan), Serious sam (vulkan), and Rust (dxvk).
09:51phomes_[d]: All games are working as expected with the MR
09:52phomes_[d]: X4 had the same performance as with main. The two other games had improved performance when looking at the FPS and frametime reported in mangohud
09:55phomes_[d]: I measure by loading to a spot in the game where things are calm. No characters moving or other gameplay changes
10:00phomes_[d]: I run this on a 2080 card so I have graphics settings set to low. So this is just a data point for low graphics settings. It would be nice to test on high settings as well
10:01phomes_[d]: Is there a better way to test these performance improvements in games?
10:24karolherbst[d]: it was kinda important 10 years ago
10:25karolherbst[d]: well.. maybe 15 years
10:30tiredchiku[d]: the latter
11:40tiredchiku[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1309846063644934164/image.png?ex=674310b5&is=6741bf35&hm=83f4e4aa77ac11b9c18f0e0b268ad9704d9e54613d4f748452950e4f020fb0a0&
11:40tiredchiku[d]: mhenning[d]: after:
11:41tiredchiku[d]: before:
11:41tiredchiku[d]: https://media.discordapp.net/attachments/855613452118130729/1307367932086714399/image.png?ex=67429e84&is=67414d04&hm=cffd62a7961b19d908d1f3e679346f0ffcf1aad26585942606aa174ea1937af8&=&format=webp&quality=lossless&width=1187&height=668
11:41mohamexiety[d]: oh wow that's big
11:41tiredchiku[d]: wait
11:41tiredchiku[d]: hang on
11:41karolherbst[d]: mhh, the scene is different 😄
11:41tiredchiku[d]: yeah, realized that :p
11:42karolherbst[d]: mhenning[d]: do you want to help out with instruction scheduling latencies? Kinda always want to do it, but don't really get to it, but if somebody else does the base work and I just fill in the details, that would help out a lot
11:43tiredchiku[d]: mmh roughly the same
11:43karolherbst[d]: disappointment of the century
11:43tiredchiku[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1309846811028094986/image.png?ex=67431167&is=6741bfe7&hm=755287176736153b4d84407dc23a1f89a3bc66d10d54dd61d71a39e20f675dd9&
11:43tiredchiku[d]: after:
11:44tiredchiku[d]: let's try SM remaster
11:45karolherbst[d]: there are a few better benchmarks for those kinds of things. I'd try pixmark_piano; it's GL, but it's _very_ stable in terms of FPS
11:45karolherbst[d]: like you can with high confidence tell if something made a 0.1% difference
11:45tiredchiku[d]: will do
11:46karolherbst[d]: mhhh, the issue is just that with GSP you have reclocking going on, so thermals might mess up the data :blobcatnotlikethis:
11:48tiredchiku[d]: so, SM remaster was no good on my machine 😅
11:48tiredchiku[d]: granted, 1440p
11:48tiredchiku[d]: 4fps .-.
12:01tiredchiku[d]: karolherbst[d]: with the MR gets 1425 points
12:02tiredchiku[d]: ziNVK
12:09tiredchiku[d]: without the MR gets 1130
12:10tiredchiku[d]: 23 vs 18 fps respectively
12:11karolherbst[d]: I wonder how much of that is clocks getting lower, but yeah, that's significant
12:12tiredchiku[d]: could try without GSP too
12:12karolherbst[d]: pixmark_piano also only cares about shader speed, it relies on 0 memory bandwidth, it's impressive
12:12tiredchiku[d]: nice
12:12karolherbst[d]: or literally anything else
12:12karolherbst[d]: it's just a huge fragment shader doing a lot of maths
12:12tiredchiku[d]: oh I should do a dirt rally 1 benchmark
12:12tiredchiku[d]: on lowest settings :D
12:13karolherbst[d]: mhh
12:13tiredchiku[d]: or maybe highest settings
12:14tiredchiku[d]: to make it more GPU limited
12:14tiredchiku[d]: yeah
12:14tiredchiku[d]: lowest is better suited for CPU testing
12:14karolherbst[d]: fps might be too low to find any differences tho
12:15tiredchiku[d]: only one way to find out
12:17karolherbst[d]: mhenning[d]: anyway, I can explain to you how the proper instruction latency is chosen, so you could prototype it and then I can fill in the actual data
12:19tiredchiku[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1309855798977106000/image.png?ex=674319c6&is=6741c846&hm=59f20081623d37ecd4475851d2b43b906a03c47639e9a4e4e270ec5ee6359e08&
12:19tiredchiku[d]: before:
12:19tiredchiku[d]: 1440p ultra preset
12:28tiredchiku[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1309858012764311644/image.png?ex=67431bd6&is=6741ca56&hm=0bd345190bff0caeebad29194742ca54b603c4216841426a1e9651bcf0f17579&
12:28tiredchiku[d]: after:
12:30karolherbst[d]: looks good
12:52tiredchiku[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1309864052238979082/image.png?ex=67432176&is=6741cff6&hm=a56108b798f88f3c6ccf0fd2be4e5ecbccb22b35298d32359110105e3091349f&
12:52tiredchiku[d]: this is nvprop for reference
13:06karolherbst[d]: obviously they are cheating
13:08tiredchiku[d]: 22% diff on avg frame time
15:32gfxstrand[d]: karolherbst[d]: I started working on that but we need to rewrite the barrier code again. I should have the numbers when I get back so I was planning to do that first thing.
15:32karolherbst[d]: ah, cool
15:33gfxstrand[d]: It's annoying because, the way you described the docs, barriers are per-dependency but the current code assumes they're per-instruction.
15:33karolherbst[d]: they are counters
15:33karolherbst[d]: yeah
15:33karolherbst[d]: and release on reaching 0
15:34karolherbst[d]: there is an instruction which might come in handy for those
15:35karolherbst[d]: `DEPBAR`
15:35gfxstrand[d]: I think the current simplification technically works but it ends up inserting extra barriers and makes it hard to do the right thing in all the cases.
15:35karolherbst[d]: with `DEPBAR` you can wait on non 0
15:35gfxstrand[d]: Unfortunately, doing the right thing in all the cases requires building a full graph data structure which is annoying.
15:37karolherbst[d]: I don't know if there are certain ordering constraints, but in theory you could assign multiple instructions the same barrier and then say `DEPBAR 2 0` (wait for barrier 0 to reach 2 or below)
15:38karolherbst[d]: ehh.. it's `DEPBAR.LE` technically
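The counter semantics being described can be sketched as a toy model (the struct and method names here are invented for illustration; the real state lives in per-warp hardware scoreboards, and the exact `DEPBAR` encoding is as discussed above):

```rust
/// Toy model of the per-warp scoreboard slots (6 exist on the
/// generations discussed). All names are illustrative, not from NAK.
#[derive(Default)]
struct Scoreboard {
    counters: [u8; 6],
}

impl Scoreboard {
    /// An instruction that signals barrier `idx` bumps the counter at
    /// issue time...
    fn issue(&mut self, idx: usize) {
        self.counters[idx] += 1;
    }

    /// ...and the hardware decrements it when the result lands.
    fn retire(&mut self, idx: usize) {
        self.counters[idx] -= 1;
    }

    /// A plain wait mask releases only when the counter reaches 0.
    fn wait_ready(&self, idx: usize) -> bool {
        self.counters[idx] == 0
    }

    /// DEPBAR.LE-style wait: release once the counter is <= `n`.
    fn depbar_le_ready(&self, idx: usize, n: u8) -> bool {
        self.counters[idx] <= n
    }
}
```

With the `.LE` form, several producers can share one barrier slot and a consumer can wait for all but `n` of them to finish, which is the trick `DEPBAR 2 0` above relies on.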
15:38gfxstrand[d]: karolherbst[d]: Yeah, that's next level stuff. I'm just talking about getting the barriers right in the first place without inserting extras.
15:38karolherbst[d]: mhhh wait...
15:38karolherbst[d]: it's different
15:39karolherbst[d]: there are two lists
15:39karolherbst[d]: you can wait on one barrier with an explicit number, but the ones in the list all wait for 0 🙃
15:39karolherbst[d]: but yeah...
15:39karolherbst[d]: it's all funky
15:41karolherbst[d]: but
15:41karolherbst[d]: it also means you can assign the same barrier to multiple instructions if there is a single user (either implicit or explicit)
15:43karolherbst[d]: without DEPBAR I mean
15:43gfxstrand[d]: Yes. And that might be useful
15:43karolherbst[d]: yeah
15:44karolherbst[d]: for `ffma` especially 😄
15:44gfxstrand[d]: Most of these tricks won't matter much in the common case but I suspect we'll need them if we want fp64 to not totally suck.
15:44karolherbst[d]: ohh, there as well
15:45karolherbst[d]: but like if an ffma waits on two memory loads, it could just use a single barrier instead of two
15:46gfxstrand[d]: Like if you ffma two texture results together, no one is going to die if the two texture ops have different barriers. But when you're doing fp64, you have so many barriers floating around that 6 is really limiting.
15:46karolherbst[d]: yeah, fair
15:47gfxstrand[d]: It's more about barrier pressure than having one less bit in the bitfield.
15:49gfxstrand[d]: But I think we can model that once we have a graph. If you go bottom-up, it's easy to assign everything waited on by the same instruction to the same barrier. You can do it top-down, too, with a little bookkeeping.
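The bottom-up idea reads roughly like this as a sketch (the `Dep` type and the round-robin allocator are assumptions for illustration, not NAK's actual pass):

```rust
use std::collections::HashMap;

/// Illustrative dependency edge: `consumer` must wait on `producer`.
struct Dep {
    producer: usize,
    consumer: usize,
}

/// Bottom-up sharing: every producer waited on by the same consumer
/// gets the same barrier index, allocated round-robin from the 6
/// scoreboard slots. Returns producer -> barrier index.
fn assign_barriers(deps: &[Dep]) -> HashMap<usize, u8> {
    let mut bar_of_consumer: HashMap<usize, u8> = HashMap::new();
    let mut bar_of_producer: HashMap<usize, u8> = HashMap::new();
    let mut next = 0u8;
    // Walk edges bottom-up (reverse order).
    for d in deps.iter().rev() {
        // Reuse the consumer's barrier if we already picked one for it.
        let bar = *bar_of_consumer.entry(d.consumer).or_insert_with(|| {
            let b = next;
            next = (next + 1) % 6; // 6 slots per warp
            b
        });
        bar_of_producer.insert(d.producer, bar);
    }
    bar_of_producer
}
```

A real pass would also have to track barrier lifetimes so slots can be reused once waited on; this only shows the sharing step.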
15:50karolherbst[d]: yeah
15:54gfxstrand[d]: But the first step is actually getting the dependency graph correct. And that needs a rewrite. 😫
15:55gfxstrand[d]: gfxstrand[d]: We already do something kinda like that to avoid a lot of read barriers.
15:58marysaka[d]: btw HMMA scheduling is a mess, it kind of works with serial but fails if you chain multiple of them currently :painpeko:
17:05karolherbst[d]: I think I need to do another round of dealing with clippy stuff, because building NVK sometimes DoSes my vscode 😄
17:06karolherbst[d]: as in.. disable clippy for a bunch of stuff
18:13mhenning[d]: gfxstrand[d]: The scheduling MR actually already materializes the full dependency graph before it re-orders instructions. https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32311/diffs?commit_id=f8c32020107f9c8fe981b5a9d50c9c8a0bcda3e8#31179351dfb5ef5f2d32f177674e55fad74ee417_0_318
18:14mhenning[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1309945197924323461/instr_dep_graph.png?ex=67436d08&is=67421b88&hm=97286ba286b6c495530e387978e36ce395cab88f9b6a185b15c931a8c57d1989&
18:14mhenning[d]: here's part of the vkcube fragment shader, for the fun of it
18:16mhenning[d]: karolherbst[d]: I'd be happy to do some prep work for putting in the real instruction latencies if you want to do that soon
18:17mhenning[d]: To be honest, I'm not following some of the discussion above about barrier semantics, so I don't fully understand what would change about the barrier pass
18:18karolherbst[d]: tldr: the scoreboard read/write barriers aren't booleans, but counters
18:18karolherbst[d]: and they signal waiting instructions on reaching 0
18:19mhenning[d]: Okay. But I guess we're currently correct by treating them as booleans and then their count is either 0 or 1?
18:19karolherbst[d]: maybe?
18:19karolherbst[d]: the issue is more that you only have 6 and it's a pain for fp64 emulation
18:20karolherbst[d]: apparently
18:21karolherbst[d]: but yeah, for the latency it matters what relation two instructions have (WaR, RaW, RaR, WaW) and what slot the register is on each side (and the GPU generation)
18:21karolherbst[d]: with slot I mean the literal source slot
18:21karolherbst[d]: or dest
18:21karolherbst[d]: there are instructions where a different source slot might be read later, or a dest written to later
18:22karolherbst[d]: and each source/dest slot kinda has a category
18:22karolherbst[d]: though a simpler model might be possible
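A hedged sketch of what such a lookup could look like; the pipe categories and every cycle count below are placeholders for illustration, not values from the hardware tables:

```rust
/// Placeholder pipe categories; the real tables are finer-grained and
/// also keyed on the GPU generation.
#[derive(Clone, Copy)]
enum Pipe {
    Alu,
    Float,
    Mem,
}

/// The four dependency relations mentioned above.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Hazard {
    Raw,
    War,
    Waw,
    Rar,
}

/// Hypothetical lookup: cycles between a producer and a consumer,
/// given both pipes, the hazard kind, and the consumer-side register
/// slot. All numbers are made up for illustration.
fn latency(producer: Pipe, consumer: Pipe, hazard: Hazard, slot: usize) -> u32 {
    match (producer, consumer, hazard) {
        // Same-pipe RaW chains are the cheap, fixed-latency case.
        (Pipe::Alu, Pipe::Alu, Hazard::Raw) => 4,
        (Pipe::Float, Pipe::Float, Hazard::Raw) => 4,
        // Crossing pipes costs more, and some instructions read a
        // given source slot later than the others (e.g. imad.wide).
        (Pipe::Float, Pipe::Alu, Hazard::Raw) => 6 + (slot as u32) % 2,
        // Memory results have no fixed latency; a scoreboard barrier
        // is needed instead (modeled here as a sentinel).
        (Pipe::Mem, _, Hazard::Raw) => u32::MAX,
        _ => 2,
    }
}
```

The point of the shape is that the key involves both instructions, the relation between them, and the slot, exactly as described above.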
18:23pixelcluster[d]: mhenning[d]: oh boy imagine this graph but with like an Unreal Engine-tier ubershader
18:24karolherbst[d]: imagine that with a ray tracer
18:24pixelcluster[d]: well, if you're using RT pipelines it could be tolerable because it's split into a lot of small(er) chunks
18:25pixelcluster[d]: but e.g. Portal RTX does have a mode to inline the entire pipeline into one shader for ray queries and it is *not* funny
18:25karolherbst[d]: don't worry, I've seen a CL raytracer generate 2 million SSA values after being inlined
18:26mhenning[d]: karolherbst[d]: I think that Faith's code is already set up in a way to support all of that well. Take a look at https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/compiler/nak/calc_instr_deps.rs#L522-562
18:26pixelcluster[d]: karolherbst[d]: yeah... rip
18:26karolherbst[d]: mhenning[d]: ahh yeah, might be good enough already
18:26karolherbst[d]: though
18:26karolherbst[d]: you need to know both sides
18:26karolherbst[d]: it differs depending on both instructions
18:27karolherbst[d]: like alu - alu has the same latency as float - float, but float - alu has a different one
18:27karolherbst[d]: mhh
18:28karolherbst[d]: but the interface should be fine
18:28karolherbst[d]: just not `instr_latency` 😄
18:28mhenning[d]: yeah, those functions get the two Ops for the first and second instruction, as well as indices into their sources or dests
18:28asdqueerfromeu[d]: mhenning[d]: Why is the scheduler greedy though?
18:28karolherbst[d]: mhhhhh
18:28karolherbst[d]: I don't think the Op is enough
18:28karolherbst[d]: instruction flags also matter
18:28karolherbst[d]: like hi/lo
18:29karolherbst[d]: maybe it was just .wide that does
18:29mhenning[d]: Yea, instr_latency is only used in one place in the actual scheduling code right now - we can probably get rid of it
18:29karolherbst[d]: ohh yeah hi/lo matters on the source
18:29karolherbst[d]: imad.wide is the one instruction which goes crazy here
18:30karolherbst[d]: but for MMA the config matters as well
18:30mhenning[d]: karolherbst[d]: When I say Op, it's not just the opcode, it's the `Op` enum, which has all of the flags
18:30karolherbst[d]: ahh
18:30karolherbst[d]: so you have all the information
18:30mhenning[d]: yeah, I think it's all there
18:30karolherbst[d]: okay
18:31mhenning[d]: asdqueerfromeu[d]: It's greedy in this sense: https://en.wikipedia.org/wiki/Greedy_algorithm
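Greedy here in the classic list-scheduling sense: at each step the scheduler emits the best-looking ready instruction and never reconsiders the choice. A toy version, with an assumed priority heuristic (not NAK's actual scheduler):

```rust
/// Toy greedy list scheduler. `deps[i]` holds the instructions that
/// must issue before instruction `i`; `prio[i]` is a heuristic such as
/// critical-path length. Greedy: always emit the highest-priority
/// ready instruction, never backtrack.
fn list_schedule(deps: &[Vec<usize>], prio: &[u32]) -> Vec<usize> {
    let n = deps.len();
    let mut done = vec![false; n];
    let mut order = Vec::with_capacity(n);
    while order.len() < n {
        // Ready = not yet scheduled and all predecessors scheduled.
        let pick = (0..n)
            .filter(|&i| !done[i] && deps[i].iter().all(|&d| done[d]))
            .max_by_key(|&i| prio[i])
            .expect("dependency cycle");
        done[pick] = true;
        order.push(pick);
    }
    order
}
```

The trade-off is the usual one: linear passes and no search, at the cost of occasionally missing a globally better order.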
18:31karolherbst[d]: maybe I give it a go as a Christmas present and give everybody 25% more perf
18:31karolherbst[d]: lol
21:47gfxstrand[d]: karolherbst[d]: Well, mostly. It gives both ops for latency but still expects fixed vs. variable to be per-op. I have a branch where I started fixing that but then I needed to rework the barrier code.
21:49gfxstrand[d]: I also was reworking things to go through the SM abstraction for it all.
21:49karolherbst[d]: gfxstrand[d]: sooooo.. there is a thing, sometimes it's also useful to know what register index is used 🙂
21:49gfxstrand[d]: 😫
21:49karolherbst[d]: because some instructions have internal shortcuts
21:50gfxstrand[d]: Yeah, this is why I put it down. I need to be able to look at the tables myself before I can model it.
21:50karolherbst[d]: I'm not 100% on that, but there are funky situations where the latency can be much lower under certain circumstances
21:50gfxstrand[d]: But I should have it within a week or two of coming back in February
21:50karolherbst[d]: might also be something like "dest goes into src2" or something like that
21:51gfxstrand[d]: Yeah
21:51karolherbst[d]: the explanation isn't part of the tables sadly 🙂
21:51gfxstrand[d]: Well no but I should have it
21:52karolherbst[d]: mhhh I'd need to check if the explanation is in the docs actually...
21:52karolherbst[d]: might just be you'd need to know the exact term
21:53karolherbst[d]: it only really matters for MMA anyway
22:49mrmx450: hi
22:51mrmx450: is the command 'piglit $ ./piglit-run.py -x glx -x streaming-texture-leak -x max-texture-size \ -1 --dmesg tests/gpu.py \ ~/piglit-results/nvXX-`date +%Y-%m-%d`-$USER' still good for testing nouveau or is there anything better?
23:50awilfox[d]: Just as an interesting footnote…
23:51awilfox[d]: Remember on the Fermi on my big endian PPC64, we had a "failed to create ce channel -22" and "using M2MF for buffer copies", and I had to patch it to make COPY work?
23:52awilfox[d]: I am doing unrelated distro tests on a ThinkPad W520 with a Quadro 1000, which is a GF108 (same chip as the GT 520 in the ppc64), and it has the same messages in dmesg! But it does work in X11 and Wayland.
23:52awilfox[d]: So I guess those messages are just normal on Fermi. Or maybe I should try and get my patch better so it doesn't have to be normal…