01:33fdobridge_: <redsheep> I've been attempting to set up tile based raster again, and I think I will have to leave this to one of the real experts. Deko3d definitely has some of what is needed, but there's enough undocumented that I don't think it will be trivial to make an implementation that works across multiple different GPUs.
01:34fdobridge_: <redsheep> If it helps anybody else though here's a table of tile sizes I gathered and calculations I did to try to sort this out.
01:34fdobridge_: <redsheep> https://cdn.discordapp.com/attachments/1034184951790305330/1195903753790959656/Trianglebin_Results.ods?ex=65b5af29&is=65a33a29&hm=9569e09649d1468728ae1a46276fa2116029d0bde716322540bbb04d25e567ca&
01:35fdobridge_: <redsheep> There's a really weird inconsistency where AD102 has twice the tile cache size at 4x and 8x msaa, but GP102 only doubles its cache size at 8x.
01:37fdobridge_: <redsheep> The effect is that tile sizes don't decrease when going from 2 to 4 MSAA on AD102, and don't decrease when going from 4 to 8 on GP102
01:41fdobridge_: <redsheep> All I can figure is that Nvidia set these based on profiling the different chips, and likely what tile size gets used is not actually strict.
17:41fdobridge_: <gfxstrand> Yeah, probably. If it's anything like image tiling, the hardware probably automatically clamps if the tile size doesn't fit or something.
17:42fdobridge_: <gfxstrand> At least I hope so. That'd make it way easier to at least not crash the GPU.
18:06fdobridge_: <redsheep> Nvidia's 2017 GDC talk suggests that the ROPs are using the L2 cache, even calling out having a tile buffer as a "traditional architecture"
18:07fdobridge_: <redsheep> So if that hasn't changed then I expect tiles that are too large would just result in poor data locality
18:08fdobridge_: <redsheep> The part that's really puzzling about that is that Pascal and Ampere seem to use tile sizes that could not possibly fit in their L2 cache, though by my math the original Maxwell implementation would fit, and Ada would too
18:10fdobridge_: <redsheep> Maybe their driver knows the test program has no shaders that would pick out single fragments, so it optimizes away storing them all? It's weird.
18:16fdobridge_: <redsheep> On second thought that doesn't explain it at all, it still seems impossibly large at 1 sample. Maybe Nvidia added a tile buffer after all?
18:40fdobridge_: <karolherbst🐧🦀> Pascal has an L2 cache of up to 4MB, right?
18:41fdobridge_: <pac85> That's what happens on amd fwiw
18:41fdobridge_: <pac85> ROPs are L2 clients
18:43HdkR: ROPs and L2 interaction has changed since 2017, don't worry about that :)
19:59EisNerd: is there a reason that mesa with opencl and nouveau stopped building with 23.2?
20:00EisNerd: see here https://bugs.gentoo.org/921658
20:44EisNerd: hi
21:42fdobridge_: <redsheep> GP102 has 256KB of L2 per 32-bit memory controller, so 2.75 MB on the 1080 Ti I was testing
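[editor's note: redsheep's 2.75 MB figure checks out as a quick sanity calculation; the 352-bit bus width of the GTX 1080 Ti is my own assumption here, not stated in the chat]

```python
# L2 size check for GP102 as used in the GTX 1080 Ti.
# 256 KB of L2 per 32-bit memory controller is from the chat above;
# the 352-bit bus width is an assumption from the 1080 Ti's public specs.
KB = 1024
l2_per_controller = 256 * KB
bus_width_bits = 352
controllers = bus_width_bits // 32          # 11 independent 32-bit controllers
l2_total = controllers * l2_per_controller
print(controllers, l2_total / (1024 * 1024))  # 11 controllers, 2.75 MB
```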
21:56EisNerd: does someone have an idea why mesa (nouveau) does not build when opencl is enabled during configure?
22:27RSpliet: EisNerd: no idea, but the missing references are all vl
22:36RSpliet: and either way this sounds like a distro support question rather than an upstream issue at this point
22:42fdobridge_: <redsheep> Oh, I guess I should also mention: part of my logic in ending my journey down the tile based raster rabbit hole, at least for now, is that if my math is right then it is not a big factor, at least on Ada right now. With 32 BPP and no MSAA an entire 4096x4096 "tile" could fit in the L2 (assuming that is what it is even doing), meaning that rendering immediate mode at 4K should not be an issue at all.
22:42fdobridge_: <redsheep> The worse data locality could still explain higher MSAA being a bigger than expected performance hit though.
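[editor's note: the 4096x4096 arithmetic above works out as follows; the 72 MB L2 figure for the RTX 4090's cut-down AD102 is my own assumption, not from the chat]

```python
# Does a full 4096x4096 render target at 32 bpp, 1 sample fit in AD102's L2?
# The tile dimensions and 32 BPP figure are from the chat above; the 72 MB
# L2 size (RTX 4090, cut-down AD102) is an assumed figure for illustration.
bytes_per_pixel = 4                     # 32 bpp, no MSAA
tile_bytes = 4096 * 4096 * bytes_per_pixel
l2_bytes = 72 * 1024 * 1024
print(tile_bytes / (1024 * 1024), tile_bytes <= l2_bytes)  # 64.0 MB, True
```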
22:43fdobridge_: <airlied> You should look at zcull
22:43fdobridge_: <airlied> Seems more tractable and possibly more useful
22:43fdobridge_: <redsheep> Yeah that stuff looks way more documented, at first glance.
22:44fdobridge_: <redsheep> Both in the class headers and over at deko3d
22:48fdobridge_: <redsheep> My biggest hurdle with zcull is that I don't understand it as well, but I can probably figure it out.
22:49fdobridge_: <redsheep> Also I have no idea how I would validate that it is working, except with performance.
22:57fdobridge_: <bylaws> Could probably just read the buffers
22:57fdobridge_: <bylaws> If the GPU is writing data you can be at least halfway sure it's working
23:12fdobridge_: <mhenning> The biggest hurdle with zcull is that it needs kernel support for context switching
23:13fdobridge_: <mhenning> which is probably tractable with gsp but nobody ever figured out on the old firmware
23:13fdobridge_: <mhenning> There used to be commented-out code to turn it on in the gl driver
23:16fdobridge_: <mhenning> This is the commit that deleted the old zcull support, if it's useful: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19271/diffs?commit_id=f804f8065dd13f5a8fb07f0597f7de210e7385db
23:20fdobridge_: <mhenning> Actually, my hunch is that a full implementation would also need to switch that off depending on the shader in use and depth testing functions
23:32fdobridge_: <redsheep> Sounds like looking through the context switching code in OpenRM would be of some value then, to see how that is being done on GSP
23:48fdobridge_: <airlied> @mhenning on newer GPUs is that true though?
23:48fdobridge_: <airlied> I might be off, but I thought newer ZCULL was just an extra VRAM buffer
23:51fdobridge_: <airlied> NV2080_CTRL_GR_SET_CTXSW_ZCULL_MODE_SEPARATE_BUFFER I think is what we want to use
23:51fdobridge_: <airlied> and yes the kernel probably has to set that for userspace so needs an ioctl
23:53fdobridge_: <marysaka> I think there is also an equivalent of that on nvgpu, for Maxwell at least, yeah