00:01 redsheep[d]: Is it possible to do something like wireshark with the pcie bus, just to see what traffic happens and when, to try to see why?
00:01 gfxstrand[d]: NVIDIA hardware counters can probably tell me that.
00:02 gfxstrand[d]: But I don't think that you'd want to intercept all PCIe traffic. That'd slow things down incredibly.
00:02 gfxstrand[d]: And it'd need to be done by something that's aware of GPU internals or else it's just "OMG the GPU is using lots of PCI"
00:02 redsheep[d]: Yeah it would just be for debug of course
00:02 mhenning[d]: some cpus have hardware counters for pcie traffic
00:03 mhenning[d]: unfortunately I never got it working on my zen4
00:03 gfxstrand[d]: Yeah but unless I know where in the GPU that PCIe traffic is going and how it relates to the game trace, it's not that helpful.
00:03 gfxstrand[d]: I guess it'd be nice to know if I was PCIe-bound
00:03 mhenning[d]: Yeah, it could at least hint at whether you're saturating the link
00:17 gfxstrand[d]: Okay, I think I've done enough damage for one day. Maybe tomorrow I can decide what to do about VKD3D and SSBOs.
00:27 redsheep[d]: gfxstrand[d]: I was envisioning something like a network packet capture, but instead of source and destination ip you'd have like, maybe the function that ran on the cpu that generated that traffic? Not sure if that is even conceptually possible or makes any sense
00:27 redsheep[d]: And obviously some of the traffic doesn't originate from the cpu explicitly doing something
00:33 airlied[d]: you should make sure we are ramping up the pcie bus to top speed as well
02:11 gfxstrand[d]: airlied[d]: How do you measure that?
02:11 gfxstrand[d]: I should also see if I can verify that I'm getting all 16 lanes. I think I plugged stuff into my motherboard the right way but it's a mess.
02:29 gfxstrand[d]: faith@zoot% sudo cat /sys/bus/pci/drivers/nouveau/0000:01:00.0/current_link_width
02:29 gfxstrand[d]: 8
02:29 gfxstrand[d]: faith@zoot% sudo cat /sys/bus/pci/drivers/nouveau/0000:01:00.0/current_link_speed
02:29 gfxstrand[d]: 16.0 GT/s PCIe
02:29 gfxstrand[d]: Does that mean I have something plugged in wrong and I'm only getting 8x?
02:29 gfxstrand[d]: That would be a bummer
02:31 gfxstrand[d]: `max_link_width` is also 8
02:32 gfxstrand[d]: `lspci` says `LnkSta: Speed 16GT/s, Width x8`
02:32 airlied[d]: Yeah sounds like 8, lspci -vvnn is usually where I look and sysfs
02:32 gfxstrand[d]: bummer
02:32 airlied[d]: Some GPUs are also only 8x capable
02:32 gfxstrand[d]: Time to pull the motherboard manual out. 😖
02:32 airlied[d]: Though not sure what NVIDIA is like there
02:33 redsheep[d]: Is it in the primary slot? Is there more than one card in your board?
02:34 redsheep[d]: gfxstrand[d]: Hang on, is this your 4060? AD107 is only 8x capable, it doesn't physically have 16x
02:35 gfxstrand[d]: Okay, that would explain it.
02:36 gfxstrand[d]: So PCI seems to be doing as much PCI as PCI is going to do
02:36 gfxstrand[d]: I don't need to freak out about my NVME drive config. 😅
09:05 magic_rb[d]: mhenning[d]: Never mind, I can't try it unless I backport it to 6.12 :( I'm running ZFS
09:37 snowycoder[d]: What IDE are you using for development? vscode is nice for c++ but only gives basic syntax highlighting for rust and I cannot get it to fully work.
10:03 magic_rb[d]: Honestly, install CLion. As in, I use Emacs and it's really, really good. But Emacs is hard to set up and you essentially have to become an Emacs Lisp wizard eventually. And I've never heard of anyone *not* having problems with vscode, especially in the more low-level spheres. Vscode works great for javascript and python, but that's about the limit of "painless languages"
10:04 magic_rb[d]: For c++ and rust you need a lang server, ccls/clangd and rust-analyzer respectively. My recommendation is ccls; in my experience it works better. For syntax highlighting you either need a better plugin or somehow convince vscode to use tree-sitter
10:04 snowycoder[d]: It works great for "standard" things, but this project is far from standard and rust-analyzer kinda gives up
10:05 magic_rb[d]: Yeah, if that's the case you'll have to get used to no LSP and rely on ctags or similar
10:05 magic_rb[d]: For autocompletion you can get a generic "completes words that it sees in the project without any context" solution
10:06 magic_rb[d]: Like if I were to do linux kernel dev, I wouldn't even try to get LSP working. ctags and grep will have to do
10:07 snowycoder[d]: Oh no, with Linux LSP works pretty well once you generate the compile commands list, though I've only figured out how to set up nvim to work with it
10:07 airlied[d]: Been using Emacs for 20 years without learning any lisp, I don't customise much
10:07 magic_rb[d]: Hm, maybe someone else can chip in then. I haven't actually tried to set up rust-analyzer for mesa
10:08 magic_rb[d]: airlied[d]: You're the exception, not the rule, I feel like :(
10:08 magic_rb[d]: \*:)
10:09 magic_rb[d]: My experience is that people start using emacs, it doesn't quite work how they want. They try to adapt it, but quickly hit the hill that is elisp's learning curve and then give up and go back to vscode
10:09 snowycoder[d]: That's me with nvim
10:10 snowycoder[d]: "Oh wow these commands are fun, how do I add syntax highlighting?"
10:10 snowycoder[d]: *300 pages to setup lua plugins in a slightly different way with 4 different package managers and sync/async methods*
10:10 snowycoder[d]: "Welp"
10:11 magic_rb[d]: Meanwhile I'm rewriting my emacs config from scratch (see: https://git.redalder.org/magic_rb/rock-emacs). I'm writing it in a way where it's meant to be used by multiple people, so feel free to try it
10:11 snowycoder[d]: I've only really learned how to use it properly when I was forced to work on Linux as all other IDEs didn't work
10:11 magic_rb[d]: It requires emacs 29, possibly 30
10:49 bigsmarty[d]: snowycoder[d]: Try helix
12:23 mohamexiety[d]: snowycoder[d]: https://nvchad.com I heard of this a while back but havenโ€™t tried it
12:24 mohamexiety[d]: The advertisement is basically neovim with fancy defaults tho
13:09 gfxstrand[d]: snowycoder[d]: I use vim with like 3 extensions and a pretty simple config.
13:10 snowycoder[d]: gfxstrand[d]: Huh, the official rust LSP server can handle even this strange config?
13:11 gfxstrand[d]: I don't use any fancy shit
13:11 gfxstrand[d]: The fanciest I get is `rustfmt`
13:15 tiredchiku[d]: I just use Kate .-.
13:16 gfxstrand[d]: I used Kate for years back in the day. Having a good text editor with a built-in terminal was pretty killer.
13:16 gfxstrand[d]: That was my standard before I switched to vim.
13:16 tiredchiku[d]: yeah, Kate is quite nice
13:17 tiredchiku[d]: I'd use vim but I cba to learn the keybinds
13:17 tiredchiku[d]: I do know the basics of vim, but nothing past that
13:18 gfxstrand[d]: gfxstrand[d]: No shade on the people who do. Use whatever helps to make you effective. Just that with my brain, auto complete often makes me *less* productive.
13:19 snowycoder[d]: tiredchiku[d]: I use vim with basic keybinds and some custom shortcuts like "go to definition" or "go to implementation" that are always helpful when navigating a big codebase
13:22 mohamexiety[d]: I use vscode personally. Couldnโ€™t get rust analyzer to work properly with mesa though beyond the bare basics so I am on my own there but itโ€™s not really a big deal
13:23 mohamexiety[d]: The only annoyance I have with vscode is it sometimes likes to insert spaces/screw up indentation
13:24 snowycoder[d]: mohamexiety[d]: Doing the exact same thing now.
13:24 snowycoder[d]: I'm kinda craving code completion or errors because I completely forgot how to use proc-macros
13:24 snowycoder[d]: (Trying to hack together an assembler, let's see if it works)
13:25 mohamexiety[d]: Yeah.. I come from HW design land so I have pretty high tolerance to jankiness in tools :blobcatnotlikethis:
13:25 snowycoder[d]: Oh god, I'm having flashbacks from my FPGA exam :blobcatnotlikethis:
13:26 mohamexiety[d]: Yep. Rust analyzer not working is a far cry from an IDE that's actively hostile towards you :KEKW:
13:28 tiredchiku[d]: mohamexiety[d]: oh mohammed can I DM you about this (hw design)
13:40 marysaka[d]: I have a setup with vscodium + clangd + rust-analyzer and some extensions for XML and that's about it
13:41 marysaka[d]: otherwise I use a plain unconfigured vim sometimes
13:41 snowycoder[d]: rust-analyzer works in your setup?
13:42 marysaka[d]: You need to pass the correct rust project, yes
13:42 snowycoder[d]: I did pass rust-project.json, it kinda works with regular nak code but it completely doesn't for proc-macro code
13:43 marysaka[d]: like, does it not show up or is it just erroring?
13:43 snowycoder[d]: It shows up, but doesn't have any type hint from the `syn` crate
13:44 snowycoder[d]: And that's like 90% of the strange types xD
13:44 marysaka[d]: hmm yeah I think that's expected on my setup... I'm just too used to proc macros so :linatehe:
13:44 marysaka[d]: you can open docs.rs on the side to look up stuffs in that case but yeah....
13:45 snowycoder[d]: Yeah, I'm writing like that, it's not too hard, just inconvenient
13:45 snowycoder[d]: tysm!
13:45 marysaka[d]: no worries
13:46 marysaka[d]: I just use clangd and rust-analyzer to get type info on my side, the rest doesn't really matter
13:46 marysaka[d]: I even have clangd setup to suppress all errors and warnings because I don't want to see them in my IDE
15:15 esdrastarsis[d]: snowycoder[d]: zed :ferris_happy:
15:22 mohamexiety[d]: tiredchiku[d]: Sure
15:34 gfxstrand[d]: Ugh... I don't think the cheap predication plan is going to work.
15:34 gfxstrand[d]: Time to start typing, I guess.
15:34 tiredchiku[d]: https://tenor.com/view/cat-computer-typing-working-funny-cats-gif-12030261
16:12 snowycoder[d]: Yes! my hacks are working!
16:13 snowycoder[d]: I've hacked together a derive-macro that auto-implements DisplayOp from an annotated struct:
16:13 snowycoder[d]: #[repr(C)]
16:13 snowycoder[d]: #[derive(SrcsAsSlice, DstsAsSlice, DisplayOp)]
16:13 snowycoder[d]: #[display_op(format="opfadd")]
16:13 snowycoder[d]: pub struct OpFAdd {
16:13 snowycoder[d]: #[dst_type(F32)]
16:13 snowycoder[d]: pub dst: Dst,
16:13 snowycoder[d]: #[src_type(F32)]
16:13 snowycoder[d]: pub srcs: [Src; 2],
16:13 snowycoder[d]: #[modifier(".sat")]
16:13 snowycoder[d]: pub saturate: bool,
16:13 snowycoder[d]: #[modifier(def = FRndMode::NearestEven)]
16:13 snowycoder[d]: pub rnd_mode: FRndMode,
16:13 snowycoder[d]: #[modifier]
16:13 snowycoder[d]: pub ftz: bool,
16:13 snowycoder[d]: }
16:13 snowycoder[d]: The results are identical to the manual impl.
16:13 snowycoder[d]: The next step would be to, uhm, auto-implement a deserializer 0_0
16:14 snowycoder[d]: Tell me if it might be interesting
17:22 gfxstrand[d]: Oh, neat!
17:22 gfxstrand[d]: Yeah, that looks pretty nice.
17:22 gfxstrand[d]: I was wondering how we should do booleans and I really like the `#[modifier]` thing you did
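A minimal sketch of how a derive like that could be wired up with `syn` and `quote`, assuming syn 2.x; the real NAK `DisplayOp` trait and the full attribute grammar aren't shown here, so this stands in with `std::fmt::Display` and only handles plain bool `#[modifier]` fields:
```rust
// Hypothetical proc-macro crate sketch, not the actual NAK implementation.
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, Data, DeriveInput, Fields};

#[proc_macro_derive(DisplayOp, attributes(display_op, modifier, src_type, dst_type))]
pub fn derive_display_op(input: TokenStream) -> TokenStream {
    let input = parse_macro_input!(input as DeriveInput);
    let name = &input.ident;

    // Collect bool fields tagged #[modifier] so ".field_name" is printed when set.
    let mut modifiers = Vec::new();
    if let Data::Struct(data) = &input.data {
        if let Fields::Named(fields) = &data.fields {
            for field in &fields.named {
                if field.attrs.iter().any(|a| a.path().is_ident("modifier")) {
                    let ident = field.ident.as_ref().unwrap();
                    let text = format!(".{}", ident);
                    modifiers.push(quote! {
                        if self.#ident {
                            write!(f, "{}", #text)?;
                        }
                    });
                }
            }
        }
    }

    // Fall back to the lowercased type name as the mnemonic; the real macro
    // would read it from #[display_op(format = "...")] instead.
    let op_name = name.to_string().to_lowercase();
    quote! {
        impl ::std::fmt::Display for #name {
            fn fmt(&self, f: &mut ::std::fmt::Formatter<'_>) -> ::std::fmt::Result {
                write!(f, "{}", #op_name)?;
                #(#modifiers)*
                Ok(())
            }
        }
    }
    .into()
}
```
The non-bool modifiers (like `#[modifier(def = FRndMode::NearestEven)]`) and the source/destination printing are where real attribute parsing (e.g. `syn`'s nested-meta APIs) would come in.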
17:23 gfxstrand[d]: gfxstrand[d]: I've got the initial typing for predicate destinations. They weren't nearly as bad as I'd feared. Now to see if I can figure out the liveness and R/A rules.
17:25 magic_rb[d]: marysaka[d]: Do you also disable the squiggles? I have it so it doesnt automatically show and i have to trigger the pop up with `C-c ,` (emacs keybind)
17:54 tiredchiku[d]: question from another server
17:55 tiredchiku[d]: faith you can answer it there directly too 😅 https://discord.com/channels/853130811581530142/855613452118130729/1342554363213774918
18:24 gfxstrand[d]: Okay, I think once I figure out RA, NAK will have predication support.
18:24 gfxstrand[d]: Then I have to use it for something and show that it actually works. 😬
18:34 nunwan: Hi. Quite the noob question, but I was trying to find my way around nouveau and nvk and I was checking how the nvidia proprietary driver works. But I had to reinstall the nvidia drivers. How do you manage having both, so you can check what the nvidia driver is doing and then test with the built nouveau driver? Do you even have both installed? My bad if it is part of a FAQ I missed.
18:34 tiredchiku[d]: personally, I have 2 kernels installed, with only one of them having the nvidia driver for it
18:38 nunwan: Ok ! Didn't even think of doing that ... Thanks :)
18:41 tiredchiku[d]: I use the distro's default kernel for nvidia, and the release-channel kernel for nouveau/nvk
18:42 nunwan: Will definitely try thanks
18:51 esdrastarsis[d]: Why is predication important? (is it for perf?)
18:53 tiredchiku[d]: if I understand it correctly, yes
18:53 tiredchiku[d]: it allows ops to be executed conditionally
18:54 tiredchiku[d]: afaik it's used for stuff like occlusion culling
18:56 mhenning[d]: Faith is working on a different kind of predication than that
18:56 tiredchiku[d]: https://developer.nvidia.com/gpugems/gpugems2/part-iv-general-purpose-computation-gpus-primer/chapter-34-gpu-flow-control-idioms
18:56 tiredchiku[d]: mhenning[d]: oh
18:58 mhenning[d]: but yes, it's a performance thing. At the shader level we can turn small if statements into predicates. If the `if` statement is small enough, then we save a few cycles because we don't have the overhead of running a full branch instruction
18:59 tiredchiku[d]: huh
19:00 tiredchiku[d]: is this similar to likely/unlikely in C?
19:00 mhenning[d]: Oh, maybe you are talking about the same thing
19:00 tiredchiku[d]: for a very surface level understanding
19:00 tiredchiku[d]: mhenning[d]: maybe 😅
19:00 tiredchiku[d]: I don't understand it all very well
19:01 kayliemoony[d]: tiredchiku[d]: predication at an instruction level means the instruction can check a condition as part of its operation and simply no-op if that condition is not true
19:01 mhenning[d]: It's not exactly the same thing as likely/unlikely, but it is a way for the shader compiler to optimize branches
19:01 kayliemoony[d]: The condition in question is generally flag checks
19:01 tiredchiku[d]: I see
19:02 HdkR: For a CPU side comparison, it's more similar to AVX512 mask register usage :)
19:02 kayliemoony[d]: So instead of any explicit branch instructions (which have overhead when being taken) a constant cost can be paid for the instruction
19:02 tiredchiku[d]: kayliemoony[d]: so my understanding wasn't too far off, it's executing code (instructions) based on what's been executed previously
19:03 tiredchiku[d]: HdkR: I know CPUs even less e-e
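To make the instruction-level picture above concrete, here is a toy Rust analogue of per-lane predication; the 8-wide "warp", the function name, and the data are invented for illustration, and real hardware does this with a predicate/mask register attached to the instruction, as in the AVX-512 comparison:
```rust
// Toy model: a "warp" of 8 lanes executing one predicated add.
const LANES: usize = 8;

fn predicated_add(dst: &mut [i32; LANES], src: &[i32; LANES], pred: [bool; LANES]) {
    // Every lane sees the instruction; lanes whose predicate is false simply
    // keep their old value (a no-op), so no branch is ever taken.
    for i in 0..LANES {
        if pred[i] {
            dst[i] += src[i];
        }
    }
}

fn main() {
    let mut acc = [10; LANES];
    let vals = [1, 2, 3, 4, 5, 6, 7, 8];
    let pred = [true, false, true, false, true, false, true, false];
    predicated_add(&mut acc, &vals, pred);
    println!("{:?}", acc); // [11, 10, 13, 10, 15, 10, 17, 10]
}
```
With a real branch, the diverged lanes would instead have to be serialized (the cost HdkR describes below); with predication the whole warp pays one fixed cost per instruction.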
19:05 kayliemoony[d]: kayliemoony[d]: I would assume the gpu shader cores are pipelined, so small branches skipping only 1-3 instructions are likely the target here, opts-wise
19:06 HdkR: Avoiding divergent branches is the real improvement
19:06 kayliemoony[d]: Yea I simply don't know your divergent branch cost to comment on that.
19:07 HdkR: For NVIDIA, up to 1/32nd the perf if fully divergent. A simple selection for a ternary operation could be 1/16th in the worst case
19:08 HdkR: Huge performance costs because of it :)
19:08 HdkR: er wait, 1/2 the performance in the worst case for ternary. derp derp
19:11 kayliemoony[d]: I mean with predication it's still divergent, you're just paying a constant cost for the divergence
19:11 HdkR: Yea, which is a bigger win compared to serializing each divergent thread
19:14 kayliemoony[d]: okay tbf I don't know Nvidia's exec model otoh
19:14 kayliemoony[d]: My assumption was the individual threads within a wavefront could either be executing or not executing, and could not all be individually serialized
19:17 HdkR: Yea, if you only have some partial divergence then you'll get multiple threads executing still, but it still means you're serializing each set of diverged PCs
19:25 HdkR: Volta and newer also slightly changed the programming model, where at a reconvergence point it doesn't need to guarantee the other threads get executed or something. Kind of wacky
19:30 HdkR: syncwarp is fancy stuff :D
19:51 gfxstrand[d]: In theory, NAK re-converges hardcore so I'm not too worried about the divergence.
19:51 gfxstrand[d]: However, I'm sure those barrier instructions aren't cheap
19:52 gfxstrand[d]: And predication will also allow us to more effectively pipeline loads and stores because we won't have control flow around every single one.
19:55 rinlovesyou[d]: tiredchiku[d]: you can also just run everything through one kernel, sadly just `module_blacklist=nvidia .` and `nouveau.modeset=1` in the kernel parameters still doesn't get nouveau to load at start, probably due to nvidia existing
19:55 mhenning[d]: gfxstrand[d]: I mean, ideally control flow wouldn't wait on all pending variable latency instrs - that's more of a compiler limitation right now
19:56 mhenning[d]: rinlovesyou[d]: Nvidia adds a blacklist for the nouveau kernel driver by default. You need to disable that for nouveau to load
19:56 tiredchiku[d]: /usr/lib/modules-load.d has a blacklist, yeah
19:56 tiredchiku[d]: on arch, that's where it is
19:56 tiredchiku[d]: part of the nvidia-utils package
19:57 rinlovesyou[d]: i don't see any blacklist there
19:57 rinlovesyou[d]: is it just this `nvidia-uvm` that prevents nouveau from starting?
19:58 tiredchiku[d]: oh, sorry
19:58 tiredchiku[d]: /usr/lib/modprobe.d
19:58 rinlovesyou[d]: i see
19:58 rinlovesyou[d]: that's a bit annoying
19:58 tiredchiku[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1342586309092311122/Eo4ElMv.png?ex=67ba2c70&is=67b8daf0&hm=cf3b169104199a08ea406eca7c5e9369fc875506bad534c9de654974ae11a0d8&
19:58 gfxstrand[d]: mhenning[d]: Yes. There are a lot of parallel problems going on right now.
19:59 rinlovesyou[d]: it'd be a lot more convenient if i didn't have to touch anything and could just boot into a different kernel configuration, but i guess i'll have to take care of that when i want to switch (otherwise i have to manually run `sudo modprobe nouveau`)
20:00 tiredchiku[d]: I wrote a pacman hook to disable that blacklist
20:00 rinlovesyou[d]: yeah i guess just removing it would be fine, if the nvidia module isn't blacklisted i'm sure it would overtake nouveau anyways
20:00 tiredchiku[d]: and just do early module loading for nvidia
20:00 tiredchiku[d]: to be extra sure
20:01 rinlovesyou[d]: surely `nouveau.modeset=0` would also do the trick in the config for nvidia
20:01 tiredchiku[d]: ¯\_(ツ)_/¯
20:01 tiredchiku[d]: dunno
20:03 rinlovesyou[d]: i'll throw `module_blacklist=nouveau .` in to be safe
20:04 mhenning[d]: tiredchiku[d]: No need for a pacman hook. You can just `touch /etc/modprobe.d/nvidia-utils.conf` and that file will override the one from the package (and disable the blacklist)
20:04 tiredchiku[d]: huh
20:04 tiredchiku[d]: TIL
20:09 rinlovesyou[d]: mhenning[d]: looking at what the pkgbuild does, it would overwrite it
20:10 mhenning[d]: That's not what happens? I have that setup on my machine
20:11 mhenning[d]: Note that /etc/modprobe.d/nvidia-utils.conf is in a different directory than the one in the package
20:11 rinlovesyou[d]: well the line is `echo "blacklist nouveau" | install -Dm644 /dev/stdin "${pkgdir}/usr/lib/modprobe.d/${pkgname}.conf"`
20:12 rinlovesyou[d]: and if i do this on an arbitrary text file that file gets overwritten
20:12 mhenning[d]: Right, and that's a different file
20:12 rinlovesyou[d]: ohhh i see
20:13 rinlovesyou[d]: `/etc/modprobe.d` takes priority over `/usr/lib/modprobe.d`
20:13 rinlovesyou[d]: i misread your first message as saying that the pkgbuild would fail to overwrite the file
20:13 rinlovesyou[d]: tiredchiku[d]: i too am tired
20:25 gfxstrand[d]: In theory, predication now works. \o/
20:26 gfxstrand[d]: Of course I haven't tried it. 😅
20:26 gfxstrand[d]: And RA is terrifying
20:26 gfxstrand[d]: (But RA is always terrifying. Predication only makes it a tiny bit worse.)
20:26 tiredchiku[d]: ~~what's RA again?~~
20:27 mhenning[d]: register allocation
20:27 mhenning[d]: gfxstrand[d]: pfft, who needs testing? just ship it
20:27 tiredchiku[d]: I see, thanks
20:28 gfxstrand[d]: That's not what you said when I was trying to "just ship" `MemScope::GPU`. 😛
20:28 mhenning[d]: 😛
21:12 gfxstrand[d]: Of course RA is broken. 🤡
21:13 gfxstrand[d]: Kinda beautiful seeing this madly branchy shader be a single block, though. 😄
21:20 gfxstrand[d]: r15 = mov ur0 // delay=6
21:20 gfxstrand[d]: @p0 r15 = ld.global.a64.constant.b32 [r12..14] // delay=4 rd:0 wr:3
21:20 gfxstrand[d]: Beautiful!
21:33 karolherbst[d]: you know what's another good use for predicates? loop merging 🙃
21:33 karolherbst[d]: turning the entire thing into a state machine, no branching except for the outer loop
21:34 karolherbst[d]: I saw nvidia doing that kind of optimization
21:34 karolherbst[d]: perfectly converged threads
22:14 gfxstrand[d]: Holy predicated instructions, batman! CTS passed on the first try and *Dragon Age: The Veilguard* runs and appears to be 5-10% faster.
22:15 gfxstrand[d]: I'm now at a solid 27 FPS
22:15 gfxstrand[d]: I wonder what happens if I combined this with mhenning[d]'s scheduler.
22:17 gfxstrand[d]: https://tenor.com/view/batman-and-robin-holy-gif-21474355
22:19 mhenning[d]: gfxstrand[d]: The postpass scheduler should just need adjustments to generating the dep graph. The prepass scheduler might be more of a headache (at least, if you want the pressure heuristic to be accurate)
22:29 gfxstrand[d]: One of my tasks for next week is to do better dependency graph generation. I want to make a thing that we generate as part of `calc_instr_deps` that's a generic graph of dependencies, including latency information. Right now it tries to do everything in one pass and it's not broken but it's suboptimal in a bunch of cases.
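A rough shape for what such a graph could look like, with invented names (not the existing `calc_instr_deps` structures): nodes are instruction indices and edges carry the dependency kind plus a minimum latency, so scheduling and delay/scoreboard assignment could both be driven from the same data.
```rust
// Hypothetical dependency-graph sketch; type and field names are made up.
#[derive(Clone, Copy, Debug)]
enum DepKind {
    ReadAfterWrite,
    WriteAfterRead,
    WriteAfterWrite,
}

#[derive(Clone, Copy, Debug)]
struct DepEdge {
    src: usize,       // producing instruction index
    dst: usize,       // consuming instruction index
    kind: DepKind,
    min_latency: u32, // cycles that must elapse between the two instructions
}

#[derive(Debug, Default)]
struct DepGraph {
    // Outgoing edges, indexed by producing instruction.
    succs: Vec<Vec<DepEdge>>,
}

impl DepGraph {
    fn new(num_instrs: usize) -> Self {
        DepGraph { succs: vec![Vec::new(); num_instrs] }
    }

    fn add_dep(&mut self, edge: DepEdge) {
        self.succs[edge.src].push(edge);
    }
}
```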
22:30 mhenning[d]: That sounds a bit like what I already wrote for the scheduler, please take a look at that before you write a new one
22:30 gfxstrand[d]: Will do
22:30 gfxstrand[d]: It could be that what you wrote is already what we need. I glanced at it but haven't read through it in depth
22:31 gfxstrand[d]: I guess that means I should review the scheduler next week, too.
22:31 snowycoder[d]: If we keep gaining 10% every few days we'll catch up with nvidia fast 👀
22:32 gfxstrand[d]: heh
22:32 gfxstrand[d]: Maybe
22:32 gfxstrand[d]: There's a lot of little 10%s
22:32 mhenning[d]: snowycoder[d]: https://xkcd.com/605/
22:32 snowycoder[d]: Lol
22:32 gfxstrand[d]: Also, my predication MR is nowhere near something I'd want to land.
22:32 gfxstrand[d]: My NIR pass at the end is impressively hacky
22:33 gfxstrand[d]: (And weirdly effective.)
22:33 snowycoder[d]: What's the long-term plan for nouveau/nvk?
22:33 gfxstrand[d]: What do you mean?
22:34 snowycoder[d]: Do you plan to ever reach performance/feature parity with nvidia closed?
22:34 mhenning[d]: We're certainly trying to move in that direction
22:35 karolherbst[d]: gfxstrand[d]: yeah.. not diverging too hard helps a lot
22:35 gfxstrand[d]: snowycoder[d]: We're going to try.
22:35 karolherbst[d]: how many instructions are you "flattening"?
22:36 gfxstrand[d]: karolherbst[d]: We shouldn't be diverging. NAK hard-core re-converges at every opportunity.
22:36 karolherbst[d]: well
22:36 gfxstrand[d]: karolherbst[d]: One and only if it's `load_global`. 😛
22:36 karolherbst[d]: you still diverge within the if
22:36 karolherbst[d]: well
22:36 gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33676/diffs?commit_id=cda1f443c553f2a1470560b84a981f8a49535222
22:36 karolherbst[d]: you want that to be a more generic optimization
22:36 gfxstrand[d]: Well, yes. I know that
22:36 snowycoder[d]: gfxstrand[d]: Hope I can help a little bit then, I love working on nvk
22:37 gfxstrand[d]: But the thing that has me really worried at the moment isn't the general case of "oops, they typed an if". NIR will handle a bunch of flattening for us if we turn on the peephole. (I should do that...)
22:38 karolherbst[d]: nir has a pass for that?
22:38 karolherbst[d]: or is it the bcsel stuff?
22:38 gfxstrand[d]: The thing we're doing right now that's catastrophic is
22:38 gfxstrand[d]: if %in_bounds {
22:38 gfxstrand[d]: %0 = load_global(...)
22:38 gfxstrand[d]: } else {
22:38 gfxstrand[d]: }
22:38 gfxstrand[d]: %1 phi %0 0
22:38 gfxstrand[d]: for every single buffer load.
22:39 karolherbst[d]: mhhhhhhhh
22:39 karolherbst[d]: I just had a cool idea
22:39 karolherbst[d]: sooooooo
22:39 karolherbst[d]: if you turn ifs into bcsels.. you could in theory convert the bcsel into guard predicates on its sources and follow the chain a bit
22:40 karolherbst[d]: and nuke the bcsel
22:44 karolherbst[d]: not sure how simple that would be though, because that's certainly easier done in SSA, but adding guard predicates while in SSA can cause all other funky issues
22:44 gfxstrand[d]: It actually wouldn't be too hard with the predication stuff I just typed today.
22:45 gfxstrand[d]: You'd have to do a prepass to count the number of uses of each SSA value and make sure you only do it when the `sel` is the only user of the two things and when the predicate for the `sel` reaches both of them.
22:45 gfxstrand[d]: But it's not intractable.
22:45 karolherbst[d]: yeah
22:45 karolherbst[d]: that would be the idea
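A sketch of what that single-use check could look like over a toy IR, just to make the idea concrete; the `Ssa` and `Op` types are invented for illustration, and this ignores dominance, instruction order, and the "both arms write one register breaks SSA" problem discussed further down:
```rust
use std::collections::HashMap;

// Toy IR, not NAK's: ALU ops optionally guarded by (predicate, polarity).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Ssa(u32);

#[derive(Debug)]
enum Op {
    Alu { dst: Ssa, srcs: Vec<Ssa>, pred: Option<(Ssa, bool)> },
    Sel { dst: Ssa, pred: Ssa, on_true: Ssa, on_false: Ssa },
}

// Prepass: count how many times each SSA value is used as a source.
fn count_uses(ops: &[Op]) -> HashMap<Ssa, u32> {
    let mut uses: HashMap<Ssa, u32> = HashMap::new();
    for op in ops {
        let srcs: Vec<Ssa> = match op {
            Op::Alu { srcs, pred, .. } => {
                let mut v = srcs.clone();
                if let Some((p, _)) = pred {
                    v.push(*p);
                }
                v
            }
            Op::Sel { pred, on_true, on_false, .. } => vec![*pred, *on_true, *on_false],
        };
        for s in srcs {
            *uses.entry(s).or_insert(0) += 1;
        }
    }
    uses
}

// Fold `dst = sel(p, a, b)` into "@p produce dst" / "@!p produce dst" when the
// sel is the only user of both arms and both arms come from plain ALU ops.
fn fold_sels(ops: &mut Vec<Op>) {
    let uses = count_uses(ops);
    let producer: HashMap<Ssa, usize> = ops.iter().enumerate()
        .filter_map(|(i, op)| match op { Op::Alu { dst, .. } => Some((*dst, i)), _ => None })
        .collect();

    let mut rewrites = Vec::new(); // (alu index, pred, polarity, new dst)
    let mut dead_sels = Vec::new();
    for (i, op) in ops.iter().enumerate() {
        let Op::Sel { dst, pred, on_true, on_false } = op else { continue };
        if uses.get(on_true) != Some(&1) || uses.get(on_false) != Some(&1) { continue; }
        let (Some(&ti), Some(&fi)) = (producer.get(on_true), producer.get(on_false)) else { continue };
        // Real code must also check that `pred` dominates both producers and
        // that neither producer is already predicated.
        rewrites.push((ti, *pred, true, *dst));
        rewrites.push((fi, *pred, false, *dst));
        dead_sels.push(i);
    }
    for (idx, p, polarity, new_dst) in rewrites {
        if let Op::Alu { dst, pred, .. } = &mut ops[idx] {
            *dst = new_dst;              // both arms now define the sel's value
            *pred = Some((p, polarity)); // @p for one arm, @!p for the other
        }
    }
    for i in dead_sels.into_iter().rev() {
        ops.remove(i);
    }
}
```
Note that this produces the "two arms write the same destination" form, which is exactly the SSA headache gfxstrand walks through at the end of the log; the option-3 encoding described there is one way around it.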
22:48 karolherbst[d]: I think a scheduler could even interleave the p0 and !p0 arms for extra benefits
22:49 karolherbst[d]: reordering might even be able to kill most of the waits
22:49 karolherbst[d]: *with
22:57 gfxstrand[d]: I mean, that's the point of the bcsel thing. You can interleave the ALU better if it's flat than if it's in two sides of an if.
22:57 gfxstrand[d]: It costs a bit more register pressure but you get better latency
22:58 gfxstrand[d]: The really tricky part is trying to get both by having your register interference model predicate-aware.
22:58 gfxstrand[d]: But then you start getting a funny interference graph and all the niceness of SSA-based register allocation goes poof
23:10 gfxstrand[d]: Okay, time to add a clearly broken NIR optimization which is probably not too broken and which torches half my address calculation math.
23:12 gfxstrand[d]: Thanks to nvidia giving us a "free" +constant on loads and stores, I'm not sure 32-bit offsets is actually better than full 64-bit math.
23:15 gfxstrand[d]: Shaders also compile way faster now that we have fewer basic blocks going into NAK.
23:16 gfxstrand[d]: Okay, getting rid of half my address math doesn't seem to help much
23:19 gfxstrand[d]: Eh, it hits 28 FPS sometimes now. <a:shrug_anim:1096500513106841673>
23:27 karolherbst[d]: gfxstrand[d]: I think you can safely use the same registers on each side
23:28 gfxstrand[d]: gfxstrand[d]: Yes, and then RA becomes a hard problem again.
23:28 karolherbst[d]: I see...
23:30 karolherbst[d]: I wonder if a focused opt pass that tries to find values only used there and just uses the same register, but ...
23:30 karolherbst[d]: sounds like a "good enough" thing running later should do it
23:30 gfxstrand[d]: Yes, and then you would lower to the exact same predication thing I just built which SSA allocates quite nicely.
23:31 karolherbst[d]: fair enough
23:31 gfxstrand[d]: I have thought about this problem. A lot.
23:45 gfxstrand[d]: Like, you roughly have three options:
23:45 gfxstrand[d]: 1. Do all predication post-RA. This means you can do anything you want but you're restricted by whatever RA handed you. If you're manipulating control-flow, you're restricted by RA's usage of phis. If anything you do results in dead code or wants further optimization, you're now writing non-SSA compiler passes that can't even allocate registers. It works for very limited things but it starts to
23:45 gfxstrand[d]: suck pretty quickly.
23:45 gfxstrand[d]: 2. Play some sort of phi-like game. For the sake of argument, we'll call them "pi" nodes. You emit things like
23:45 gfxstrand[d]: ```
23:45 gfxstrand[d]: @p op1 %1 %2 %3; @!p op2 %4 %5 %6; pi %7 p %1 %4
23:45 gfxstrand[d]: ```
23:45 gfxstrand[d]: and you hope that someone is able to come along and coalesce `%1`, `%4`, and `%7` into a single register. How does that work? Good question! There is very little prior art on how to do this. You can do it by treating everything as just regular SSA things and `pi` as `sel` but then you aren't getting any register pressure benefits and you might as well be using `sel`. The other option is to somehow
23:46 gfxstrand[d]: teach RA, liveness, and friends how to reason about predicated interference. Is this theoretically possible? Maybe. Do I know how to do it? No. Will I accept "Oh, I'm sure it's not that hard!" as a reason to not worry about this complication? No, not without proof.
23:46 gfxstrand[d]: 3. You make your predication SSA friendly. That's exactly what I've done. Each predicated instruction has a source embedded in each destination and the source says what value to use if the predicate is false. In this way, all your live ranges are naturally split, everything is SSA friendly, copy propagation works naturally, and your liveness and RA work just as well as they do without predication.
23:46 gfxstrand[d]: RA is a bit more complicated because there are now more source types to reason about but it's fundamentally the same.
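As a tiny illustration of option 3 (types invented here, not NAK's actual ones): the destination of a predicated instruction carries the value it forwards when the predicate is false, so the instruction still defines a fresh SSA value either way.
```rust
// Hypothetical encoding: `dst.def = if pred { op(srcs) } else { dst.if_false }`.
#[derive(Clone, Copy, Debug)]
struct SsaValue(u32);

#[derive(Clone, Copy, Debug)]
enum Src {
    Ssa(SsaValue),
    Zero,
}

#[derive(Debug)]
struct PredDst {
    def: SsaValue, // the new SSA def produced by this instruction
    if_false: Src, // what `def` equals when the guard predicate is false
}

#[derive(Debug)]
struct PredicatedOp {
    pred: SsaValue,      // guard predicate
    pred_inverted: bool, // true for @!p
    dst: PredDst,
    srcs: Vec<Src>,
}
```
Because every instruction still defines exactly one new value, live ranges stay naturally split, copy propagation keeps working, and SSA-based RA doesn't need to reason about predicated interference.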
23:49 gfxstrand[d]: Now, one potential 2b option, and what I just suggested above, would be to start off with the form in 2 and then have a pass which looks at it and turns it into
23:49 gfxstrand[d]: @%p op1 %1 %2 %3
23:49 gfxstrand[d]: @!%p op2 %4?%2 %5 %6
23:49 gfxstrand[d]: and, voila! You have 2 turned into 3 and SSA-based RA works again.
23:51 gfxstrand[d]: The downside to 3 is that you can't move things past each other. It kind of ties scheduling's hands. That's necessary if one predicated thing is actually overwriting the values coming from another (maybe not predicated) thing but it's not necessary if the two things have opposite predicates.
23:52 gfxstrand[d]: This isn't too bad for the dependency tracker as those are typically going to be WaW hazards which are cheaper and often free. If you're trying to actually freely schedule instructions, it can be more limiting.