14:08 kuter7639[d]: Wanted to ask why Mesa backends generally use register allocation in SSA form (Braun-Hack et al. algorithm) instead of more Chaitin-esque allocators. Is it because SSA-based allocation is faster? I remember seeing somewhere that vector register usage (adjacent registers being written to by an instruction) had something to do with it, but I don't remember where.
14:10 kuter7639[d]: Any pointers to papers or discussions would be appreciated.
14:17 kuter7639[d]: if anyone else is interested, there is this talk that talks about this a bit. https://www.youtube.com/watch?v=lHhV6KyNCG0
14:19 karolherbst[d]: yeah.. I think the tldr is that SSA reg alloc is practically O(n log n)
14:46 gfxstrand[d]: O(n)
14:46 gfxstrand[d]: Well, data-flow gets to be a bit more like n^2 worst case
14:46 gfxstrand[d]: But traditional graph coloring is like O(n^4) or something stupid like that.
15:00 karolherbst[d]: ahh
15:02 karolherbst[d]: I think the worst part of graph coloring is that you retry after spilling
15:02 dadschoorse[d]: I think another advantage of SSA reg alloc is that you can compute a fixed maximum register pressure beforehand. That allows you to do spilling as a separate step and since you have that register pressure info, it can also be used to schedule memory loads without causing additional spilling
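A minimal sketch of the register-pressure point above, restricted to a single basic block. It assumes every value fits in one scalar register and ignores vector and alignment constraints; the `(dest, sources)` instruction format and the function name are hypothetical and not taken from any Mesa backend. For a program in SSA form, the peak number of simultaneously live values is what tells a spiller, ahead of allocation, whether the register budget is exceeded.

```python
def max_register_pressure(block, live_out=()):
    """Backward liveness scan over one basic block in SSA form.

    block:    list of (dest, sources) tuples; dest is None for
              instructions that produce no value (e.g. a store).
    live_out: values still needed after the block ends.
    Returns the largest number of values live at any point.
    """
    live = set(live_out)
    pressure = len(live)
    for dest, sources in reversed(block):
        # Right after the instruction, its result needs a register,
        # even if nothing ever reads it.
        after = live | {dest} if dest is not None else live
        pressure = max(pressure, len(after))
        # Step above the instruction: the result is not live yet,
        # but all of its sources are.
        live.discard(dest)
        live.update(sources)
        pressure = max(pressure, len(live))
    return pressure


# Hypothetical five-instruction block: a and b are loaded, combined
# into c, then c and a are combined into d, which is stored.
example = [
    ("a", []),
    ("b", []),
    ("c", ["a", "b"]),
    ("d", ["c", "a"]),
    (None, ["d"]),
]
print(max_register_pressure(example))  # 2
```

If the result exceeds the register budget, spilling can run as its own pass until the number fits, and only then does assignment happen, so there is no color, fail, spill, retry loop.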
15:03 karolherbst[d]: yeah.. that's somewhat useful on nvidia, where the amount of registers used actually impacts how many threads you can run in parallel
15:03 karolherbst[d]: though the question is if an additional spill offsets the perf gain
15:04 gfxstrand[d]: karolherbst[d]: Yeah, graph coloring itself is only about O(n^2), maybe a little more. But then you retry every spill and things go downhill fast.
15:04 karolherbst[d]: yeah...
15:04 karolherbst[d]: though codegen was smart enough to spill everything at once and then retry
15:05 karolherbst[d]: but there is a bug
15:05 karolherbst[d]: and uhm..
15:05 karolherbst[d]: it's an annoying one
15:05 gfxstrand[d]: Also, there's no way that you can spill up-front with graph coloring unless you literally spill everything.
15:06 gfxstrand[d]: SSA-based allocators re-shuffle the register file to defragment as they go. Graph coloring can't. This can lead to shaders which theoretically consume less than half of the register file failing to allocate. I did a bunch of experiments with this when I was working on IBC.
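A toy illustration of the fragmentation problem described above, assuming a flat register file where a vector needs contiguous registers. The helper names and the "slide everything toward register 0" policy are made up for this sketch; real allocators choose their moves far more carefully. The point is only that an allocator which is allowed to insert moves can satisfy a request that a fixed assignment cannot.

```python
def find_free_run(regs, size):
    """Return the first index of `size` contiguous free registers, or None."""
    run = 0
    for i, owner in enumerate(regs):
        run = run + 1 if owner is None else 0
        if run == size:
            return i - size + 1
    return None


def alloc_vector(regs, name, size):
    """Allocate `size` contiguous registers, compacting live values if needed.

    Returns (base, moves), where moves is a list of (value, old_reg, new_reg)
    copies to emit before the instruction that defines `name`.
    """
    moves = []
    base = find_free_run(regs, size)
    if base is None and regs.count(None) >= size:
        # Enough free registers exist, but they are scattered.
        # Assumption: every live value is a scalar that may be moved.
        live = [(r, v) for r, v in enumerate(regs) if v is not None]
        regs[:] = [None] * len(regs)
        for new, (old, v) in enumerate(live):
            regs[new] = v
            if new != old:
                moves.append((v, old, new))
        base = find_free_run(regs, size)
    if base is None:
        raise RuntimeError("out of registers; a real allocator would spill")
    for r in range(base, base + size):
        regs[r] = name
    return base, moves


# Half of an 8-entry register file is in use, yet no 4-wide hole exists.
regs = [None, "a", None, "b", None, "c", None, "d"]
base, moves = alloc_vector(regs, "v", 4)
print(base)   # 4
print(moves)  # [('a', 1, 0), ('b', 3, 1), ('c', 5, 2), ('d', 7, 3)]
```

Without the compaction step the same request fails even though only half of the file is occupied, which is the situation a fixed coloring can end up in.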
15:07 karolherbst[d]: I have an example where I made it fail with 1/4
15:07 karolherbst[d]: codegen that is
15:07 dadschoorse[d]: did you ever write a ssa reg alloc for IBC?
15:07 gfxstrand[d]: There are other RA strategies which aren't technically SSA such as linear scan with 2nd chance bin-packing but they tend to be equivalent to SSA-based in the end.
15:07 gfxstrand[d]: dadschoorse[d]: No I didn't
15:07 gfxstrand[d]: I wanted to but I spent most of my time just getting the thing to work.
15:07 gfxstrand[d]: And I didn't end up going all-in on SSA, either, which was a mistake.
15:10 dadschoorse[d]: seems like intel is still on that path, even with the recent brw MRs that moved more things to ssa
15:12 dadschoorse[d]: kuter7639[d]: as far as I understand, vector registers are a problem that you have to solve when you want to do ssa regalloc, not something where ssa regalloc has some major advantage
15:12 gfxstrand[d]: IDK what Intel is doing
15:12 gfxstrand[d]: I've washed my hands of it at this point.
15:13 gfxstrand[d]: dadschoorse[d]: Vector registers are also annoying for graph coloring but a lot of the graph coloring research was done on really strange architectures so the standard papers handle it okay if you know how to set up your register classes.
15:13 dadschoorse[d]: dadschoorse[d]: aco's vector handling isn't the greatest for code gen for example
15:15 dadschoorse[d]: I think daniel had some ideas for how to improve vector handling in aco, but he always finds something else to work on instead
15:20 gfxstrand[d]: Yeah, his new plan is basically what I did for NAK. IDK if it's actually better or not, though.
15:21 gfxstrand[d]: It has its own set of problems
20:02 gfxstrand[d]: Are NVIDIA VRAM pages 16KiB or 64KiB?
20:03 karolherbst[d]: 64 k
20:04 karolherbst[d]: though there is also support for 4k pages
20:04 karolherbst[d]: some archs also have 128k
20:04 karolherbst[d]: anyway... 4k, 64k, 2M are supported everywhere, 128k is pre-ada and 512M is GA100
20:05 gfxstrand[d]: So 4k is supported even for VRAM?
20:06 gfxstrand[d]: Then why are we bumping things to 64K various places? Or is that just to satisfy max image alignment requirements?
20:06 karolherbst[d]: sparse seems to be 4k and 64k only
20:06 karolherbst[d]: gfxstrand[d]: I don't really know the details here, there are also some notes about things being decided at boot time and such
20:07 gfxstrand[d]: 😢
20:07 karolherbst[d]: but I think that's related to what page sizes are supported
20:07 karolherbst[d]: not one being chosen
20:07 gfxstrand[d]: I know that Intel only supported 16k (or was it 64k?) and bigger for VRAM for $reasons
20:08 gfxstrand[d]: But if NVIDIA is 4k pages always, that makes things simpler.
20:08 karolherbst[d]: but yeah.. I suspect it has to do with alignment, because the GPU does have random alignment requirements for various things
20:08 gfxstrand[d]: As long as those are virtual alignments and not physical, we're fine
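For reference, "bumping to 64K" amounts to rounding a size or virtual address up to a power-of-two boundary. A small sketch of that arithmetic, using the 4k and 64k sizes mentioned above; `align_up` is a generic helper written for this example, not a quote of any NVK or nouveau function.

```python
def align_up(value, alignment):
    """Round `value` up to the next multiple of a power-of-two `alignment`."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (value + alignment - 1) & ~(alignment - 1)


PAGE_4K = 4 * 1024
PAGE_64K = 64 * 1024

size = 5000  # arbitrary example allocation size in bytes
print(align_up(size, PAGE_4K))   # 8192
print(align_up(size, PAGE_64K))  # 65536
```

The gap between the two results is the padding a small buffer pays when everything is rounded to the larger size, which is why it matters whether 64K is a real requirement or only a convenient upper bound on alignment.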
20:23 gfxstrand[d]: Looks like nouveau just uses the OS page size
20:28 notthatclippy[d]: Does Mesa take a big chunk from the kmod and then suballocate, or does it let the kernel handle all the individual allocations?
20:30 notthatclippy[d]: The proprietary driver does a lot of suballocating, on the graphics but especially the compute side, so larger page sizes make a lot more sense there. Particularly when you're dealing with 64+GB of VRAM that needs to be arbitrarily shared between applications
21:16 gfxstrand[d]: Vulkan apps should.
21:16 gfxstrand[d]: NVK doesn't currently sub-allocate for things like pushbufs but we could.
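A rough sketch of what sub-allocating out of one large kernel allocation could look like, using the simplest possible scheme (a bump allocator with no freeing). The class and its parameters are hypothetical; this is not how NVK or the proprietary driver implements it.

```python
class BumpSuballocator:
    """Hands out aligned offsets inside one big chunk obtained from the kernel."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.offset = 0  # first unused byte in the chunk

    def alloc(self, size, align):
        """Return an offset into the chunk, or None if the chunk is full."""
        start = (self.offset + align - 1) // align * align
        if start + size > self.chunk_size:
            return None  # caller would request another chunk from the kernel
        self.offset = start + size
        return start


chunk = BumpSuballocator(64 * 1024)  # one 64 KiB chunk, as an example
print(chunk.alloc(100, 256))         # 0
print(chunk.alloc(100, 256))         # 256
print(chunk.alloc(64 * 1024, 4096))  # None
```

With this pattern the kernel only ever sees chunk-sized allocations, so a larger page size costs little, while small objects such as pushbufs share a chunk instead of each being rounded up to a full page.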