00:07 fdobridge: <g​fxstrand> One pixel ends up getting lit in slice 6
00:08 airlied: viewports?
00:08 fdobridge: <g​fxstrand> Shouldn't have anything to do with layered rendering but maybe
00:10 fdobridge: <g​fxstrand> Specifically, pixel (6, 0, 6) gets lit
00:17 fdobridge: <g​fxstrand> Also `BLOCK_DEPTH = ONE_GOB` so there's no weird Z-tiling going on.
03:54 airlied: more like it Pass: 186021, Fail: 41391, Crash: 318, Warn: 4, Skip: 1656704, Flake: 28684, Duration: 4:47:27, Remaining: 0
06:06 fdobridge: <g​fxstrand> @airlied where do I find the new kernel stuff in a branch?
06:07 airlied: 05:13 < fdobridge> <a​irlied> https://gitlab.freedesktop.org/nouvelles/kernel/-/tree/new-uapi-drm-next
06:07 airlied: I did reply earlier :-)
06:07 airlied: just forgot to name you
06:09 fdobridge: <g​fxstrand> 👍
12:12 fdobridge: <T​riΔng3l> How are writable storage images and storage buffers bound on Fermi/Kepler, and what are the hardware limits on their count? (edited)
12:22 fdobridge: <T​riΔng3l> The reason I'm asking is that I eventually want to propose an extension that would make it possible to raise (via additional features/limits overlapping them):
12:22 fdobridge: <T​riΔng3l> • vertexPipelineStoresAndAtomics
12:22 fdobridge: <T​riΔng3l> • maxPerStageDescriptorStorageBuffers
12:22 fdobridge: <T​riΔng3l> • maxPerStageDescriptorStorageImages
12:22 fdobridge: <T​riΔng3l> if the shaders stay within three new limits:
12:22 fdobridge: <T​riΔng3l> • Total storage buffers + images per stage
12:22 fdobridge: <T​riΔng3l> • Total storage buffers + images per pipeline (like in D3D12, or previous * 6 if not limited)
12:22 fdobridge: <T​riΔng3l> • Total storage buffers + images + attachment outputs per pipeline (like in D3D11.0 and OpenGL, or previous + max color outputs if not limited)
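A hypothetical sketch, in C, of what the limit block for such an extension could look like; the type and field names below do not exist in Vulkan and are invented purely to mirror the three proposed limits:
```c
#include <vulkan/vulkan.h>

/* Hypothetical sketch only -- no such extension or names exist in Vulkan.
 * The three fields mirror the three proposed combined limits above. */
typedef struct VkPhysicalDeviceCombinedStorageLimitsPropertiesEXT {
    VkStructureType sType;
    void           *pNext;
    /* total storage buffers + storage images usable per shader stage */
    uint32_t        maxPerStageCombinedStorageResources;
    /* total storage buffers + storage images per pipeline (D3D12-style) */
    uint32_t        maxPipelineCombinedStorageResources;
    /* previous limit plus color attachment outputs (D3D11.0/OpenGL-style) */
    uint32_t        maxPipelineCombinedStorageOutputResources;
} VkPhysicalDeviceCombinedStorageLimitsPropertiesEXT;
```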
12:23 fdobridge: <k​arolherbst🐧🦀> storage buffers are just plain memory writes
12:24 fdobridge: <k​arolherbst🐧🦀> hw doesn't support them (edited)
12:24 fdobridge: <k​arolherbst🐧🦀> so there is nothing to bind
12:25 fdobridge: <T​riΔng3l> Ah, so they don't contribute to any limits? Well, I guess NVK doesn't need it then :frog_shrug: Is robust buffer access enforced by shader code performing bounds checking, by the way?
12:25 fdobridge: <k​arolherbst🐧🦀> any hardware which has storage buffer limits these days is stupid anyway
12:25 fdobridge: <k​arolherbst🐧🦀> I'd totally ignore those exist
12:25 fdobridge: <k​arolherbst🐧🦀> just assume it's unlimited on every hardware
12:26 fdobridge: <k​arolherbst🐧🦀> because hardware having limited buffers can't do compute anyway
12:26 fdobridge: <k​arolherbst🐧🦀> (or not in any sane way)
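As a rough illustration of the "storage buffers are just plain memory writes" point: a minimal C sketch of what a lowered SSBO store conceptually becomes when the driver only has a base address and a size. The descriptor layout is made up, and the bounds check only stands in for the kind of check a robustBufferAccess implementation would need; the log doesn't say how NVK actually handles robustness.
```c
#include <stdint.h>

/* Made-up descriptor layout: just a GPU virtual address and a size, since
 * there is no dedicated SSBO binding in the hardware. */
struct ssbo_descriptor {
    uint64_t base_addr;
    uint32_t size;
};

/* Conceptual equivalent of "buf.data[offset/4] = value;" after lowering:
 * compute a raw address and do a plain memory write. The bounds check stands
 * in for what robustBufferAccess-style lowering would emit in the shader. */
static void ssbo_store_u32(const struct ssbo_descriptor *desc,
                           uint32_t offset, uint32_t value)
{
    if (offset + sizeof(uint32_t) > desc->size)
        return; /* out-of-bounds writes are discarded */
    *(uint32_t *)(uintptr_t)(desc->base_addr + offset) = value;
}
```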
12:26 fdobridge: <T​riΔng3l> AMD TeraScale… 12 combined render targets (though at most 8 slots usable for RTs) and UAVs of all types
12:26 fdobridge: <k​arolherbst🐧🦀> yes, and TeraScale is 15 years old
12:27 fdobridge: <k​arolherbst🐧🦀> does it even support vulkan?
12:27 fdobridge: <k​arolherbst🐧🦀> images are a different topic though
12:28 fdobridge: <T​riΔng3l> No, at least not yet — but it supports full D3D11.0 with all the compute functionality. Though UAV access is done in a somewhat unusual way there, using the color output (or depth output for some faster uncached two-dword loads/stores) hardware
12:28 fdobridge: <k​arolherbst🐧🦀> we can do bindless images on kepler+ and that's fine
12:28 fdobridge: <k​arolherbst🐧🦀> but we still need to add them to some lookup table
12:28 fdobridge: <k​arolherbst🐧🦀> so "bindless" on nvidia doesn't mean you put the raw memory address into them, but it's more of an index
12:29 fdobridge: <k​arolherbst🐧🦀> just... different
12:31 fdobridge: <k​arolherbst🐧🦀> anyway.. on nvidia all (or almost? all) limits are independent from each other
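To illustrate the "bindless is really an index" point: a rough C sketch, with invented field widths, of how a bindless image handle can be an index into a driver-managed texture/sampler header table rather than a raw memory address:
```c
#include <stdint.h>

/* Made-up encoding: the point is only that the handle indexes tables the
 * driver maintains, it is not a raw GPU address. The field widths here are
 * for illustration and are not claimed to match any real NVIDIA layout. */
struct bindless_image_handle {
    uint32_t texture_header_index : 20; /* slot in the texture header pool */
    uint32_t sampler_index        : 12; /* slot in the sampler pool */
};

/* The driver still has to keep the referenced slots populated, i.e.
 * "bindless" resources still live in some lookup table. */
```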
12:34 fdobridge: <TriΔng3l> I was thinking about the same idea of using a single binding exposing 34 bits of addresses for storage buffers; that way I could expose 11 storage images in compute, or 11 combined color/storage images in fragment shaders (Vulkan already has the limit I want, but only within fragment shaders). But virtual memory is available only on the R900, while the R800 still requires command buffer patching done by the kernel driver, which parses the command stream
12:34 fdobridge: <k​arolherbst🐧🦀> I'd say ignore anything before r900
12:35 fdobridge: <TriΔng3l> The differences between R800 and R900 are pretty minor though as far as I know, so I don't see why not include it as well. The OpenGL driver doesn't use virtual memory on R900 either, due to some hang, by the way, so I'm not planning to rely on it heavily (edited)
12:36 fdobridge: <karolherbst🐧🦀> all the GPU archs prior to having a sane compute model are just a huge pita to deal with
12:36 fdobridge: <T​riΔng3l> Yes… not always really convenient for emulation, those 12-bit sampler indices :pandaCry:
12:36 fdobridge: <k​arolherbst🐧🦀> also.. not sure if GPUs without a VM are much fun with vulkan anyway :/
12:37 fdobridge: <k​arolherbst🐧🦀> don't even support bda
12:37 fdobridge: <T​riΔng3l> In our work-in-progress games we're using that only for some debug output, so I wouldn't expect it to be a serious issue
12:38 fdobridge: <k​arolherbst🐧🦀> we do have a proper VM on tesla GPUs even, but I'm considering ignoring anything before Kepler for nvk for various reasons
12:38 fdobridge: <T​riΔng3l> and DXVK likely doesn't need buffer device address as well
12:38 fdobridge: <k​arolherbst🐧🦀> yeah.. it's more of a compute thing or more modern games
12:39 fdobridge: <🌺​ ¿butterflies? 🌸> Duh. CUDA 0.x mandated implementing _that_
12:39 fdobridge: <karolherbst🐧🦀> but it's an indication of how painful the hardware is to deal with 😄
12:39 fdobridge: <k​arolherbst🐧🦀> if you don't have a VM, vulkan is probably not for you anyway
12:40 fdobridge: <T​riΔng3l> Aside from descriptor indexing, what are the other prominent inconvenient things there?
12:40 fdobridge: <T​riΔng3l> TeraScale has uniform resource indexing, by the way, and seems like subgroup operations are possible via the LDS
12:40 fdobridge: <k​arolherbst🐧🦀> fermi doesn't have the compute engine we use on kepler+, so it would require a fermi only path for a lot of basic functionalities like memory copies
12:40 fdobridge: <k​arolherbst🐧🦀> tesla has a different command buffer format
12:41 fdobridge: <k​arolherbst🐧🦀> tesla is different in a lot of areas anyway, so you'd probably want to have a tesla only vulkan driver 🙃
12:41 fdobridge: <k​arolherbst🐧🦀> (and it probably doesn't even support all vk 1.0 features) (edited)
12:41 fdobridge: <🌺​ ¿butterflies? 🌸> True. Didn't prevent some mobile GPU vendors from shipping without it
12:42 fdobridge: <k​arolherbst🐧🦀> sounds like pain though
12:42 fdobridge: <TriΔng3l> The common non-uniform index lowering from Mesa should likely work
12:42 fdobridge: <k​arolherbst🐧🦀> yeah.. I guess
12:43 fdobridge: <T​riΔng3l> Do early Tesla GPUs even have any way of implementing atomics with group-shared memory?
12:43 fdobridge: <🌺 ¿butterflies? 🌸> For example, PowerVR drivers on Android 12 ship without VK_KHR_buffer_device_address
12:43 fdobridge: <k​arolherbst🐧🦀> some do
12:43 fdobridge: <T​riΔng3l> aside from using global memory instead
12:44 fdobridge: <k​arolherbst🐧🦀> I think 2nd gen tesla added that
12:44 fdobridge: <🌺​ ¿butterflies? 🌸> (Yes, really)
12:44 fdobridge: <🌺​ ¿butterflies? 🌸> At least on GE8320...
12:44 fdobridge: <🌺​ ¿butterflies? 🌸> Vulkan 1.1 without bda
12:44 fdobridge: <k​arolherbst🐧🦀> yeah..
12:44 fdobridge: <T​riΔng3l> "Can it run Doom?"
12:44 fdobridge: <k​arolherbst🐧🦀> GT2xx have shared atomics
12:45 fdobridge: <k​arolherbst🐧🦀> CUDA compute model 1.2+
12:45 fdobridge: <k​arolherbst🐧🦀> just in 32 bits though
12:45 fdobridge: <k​arolherbst🐧🦀> but...
12:45 fdobridge: <TriΔng3l> But it's totally :RIP: on the 8800, more or less, or do you have some ideas?
12:45 fdobridge: <🌺​ ¿butterflies? 🌸> For the 8800 I'm afraid it's a RIP
12:45 fdobridge: <k​arolherbst🐧🦀> yeah...
12:46 fdobridge: <k​arolherbst🐧🦀> it's not really worth it
12:46 fdobridge: <k​arolherbst🐧🦀> I don't think it's worth it for fermi even
12:46 fdobridge: <k​arolherbst🐧🦀> kepler+ is a nice target, because a lot of nouveau users got kepler cards because they can reclock them
12:46 fdobridge: <k​arolherbst🐧🦀> so gaming is somewhat possible
12:46 fdobridge: <k​arolherbst🐧🦀> and being able to use dxvk would be huge
12:46 fdobridge: <🌺​ ¿butterflies? 🌸> ...OK
12:46 fdobridge: <k​arolherbst🐧🦀> but AMD doesn't have that issue
12:47 fdobridge: <karolherbst🐧🦀> so I wouldn't be surprised if people don't care whether TeraScale gets vulkan or not
12:47 fdobridge: <🌺​ ¿butterflies? 🌸> Only PowerVR Series9XM onwards have VK_KHR_buffer_device_address exposed
12:47 fdobridge: <🌺​ ¿butterflies? 🌸> As far as vulkan.gpuinfo.org tells
12:47 fdobridge: <karolherbst🐧🦀> yeah.. and I suspect the gens before that are a huge pita
12:47 fdobridge: <k​arolherbst🐧🦀> bda essentially means you can do raw global load/stores
12:47 fdobridge: <k​arolherbst🐧🦀> if your hardware can't do that....
12:48 fdobridge: <k​arolherbst🐧🦀> even nvidia tesla could implement bda 🙃
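For reference, this is the real VK_KHR_buffer_device_address / Vulkan 1.2 entry point being discussed: the app gets a raw 64-bit GPU address for a buffer and the shader dereferences it directly, which is why hardware that can't do raw global loads/stores can't reasonably expose it.
```c
#include <vulkan/vulkan.h>

/* Query the raw GPU virtual address of a buffer (the buffer must have been
 * created with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT and its memory
 * allocated with VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT). */
static VkDeviceAddress get_buffer_address(VkDevice device, VkBuffer buffer)
{
    VkBufferDeviceAddressInfo info = {
        .sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO,
        .buffer = buffer,
    };
    return vkGetBufferDeviceAddress(device, &info);
}

/* The returned address is typically handed to the shader (e.g. via push
 * constants) and dereferenced there as a raw pointer, i.e. a plain global
 * load/store with no descriptor binding involved. */
```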
12:49 fdobridge: <🌺​ ¿butterflies? 🌸> Tesla came with CUDA 1.0. Was ahead in quite some compute aspects because of the requirements imposed by that...
12:50 fdobridge: <k​arolherbst🐧🦀> I mean...
12:50 fdobridge: <k​arolherbst🐧🦀> sure, but I think they already knew what they were doing soo...
12:51 fdobridge: <🌺​ ¿butterflies? 🌸> Wow wait
12:51 fdobridge: <🌺​ ¿butterflies? 🌸> Arm Midgard
12:52 fdobridge: <🌺​ ¿butterflies? 🌸> Doesn't have bda either?
12:52 fdobridge: <k​arolherbst🐧🦀> I'm sure the hw can do it
12:52 fdobridge: <🌺​ ¿butterflies? 🌸> Yeah but vulkan.gpuinfo doesn't convey it
12:52 fdobridge: <k​arolherbst🐧🦀> right...
12:52 fdobridge: <k​arolherbst🐧🦀> but it's arm
12:52 fdobridge: <k​arolherbst🐧🦀> they want to sell new gpus
12:53 fdobridge: <k​arolherbst🐧🦀> midgard does officially support OpenCL, so I'm curious if it's a super pita, or they just don't advertise it
12:54 fdobridge: <k​arolherbst🐧🦀> maybe bda has some weird req?
12:54 fdobridge: <🌺​ ¿butterflies? 🌸> PowerVR supports opencl too
12:54 fdobridge: <🌺​ ¿butterflies? 🌸> Since way back to SGX
12:54 fdobridge: <k​arolherbst🐧🦀> I'm sure it's broken
12:55 fdobridge: <🌺​ ¿butterflies? 🌸> https://blog.imaginationtech.com/fun-with-powervr-and-the-beaglebone-black-low-cost-development-made-easy/
12:55 fdobridge: <🌺​ ¿butterflies? 🌸> They ended up releasing on SGX530 in... 2021 even
18:15 fdobridge: <n​anokatze> I'd assume it's just a hw gen they didn't want to bother much with updating driver for, by the time buffer device address was out (edited)
18:16 fdobridge: <k​arolherbst🐧🦀> yeah.. that would be my assumption as well
18:17 fdobridge: <a​irlied> @TriΔng3l adding vulkan terascale support would also run into kernel and firmware issues
18:17 fdobridge: <E​sdras Tarsis> Is Ben Skeggs on the server? I wanted to know where to put the GSP firmware and how to do the reclocking 🐸
18:18 fdobridge: <a​irlied> /lib/firmware/nvidia/<gpu>/gsp/
18:18 fdobridge: <a​irlied> as gsp.bin I think
18:19 fdobridge: <a​irlied> and it has to be from a 515 driver
18:24 fdobridge: <Esdras Tarsis> I think now it's the r525 version which has the separate files for turing and ampere
18:24 fdobridge: <E​sdras Tarsis> But thanks
18:25 fdobridge: <E​sdras Tarsis> What about reclocking? Will it work automatically?
18:25 fdobridge: <T​riΔng3l> What kind of, like BO management?
18:26 fdobridge: <T​riΔng3l> Can you bind multiple images/buffers to the same BO in `radeon`?
18:27 fdobridge: <TriΔng3l> Command submission seems fine overall to me if you copy command buffers to an intermediate buffer so they can be replayed multiple times even with relocations
18:28 fdobridge: <a​irlied> @Esdras Tarsis reclocking just happens
18:29 fdobridge: <a​irlied> @TriΔng3l radeon driver has no split bo/va so can't do multiple bindings, and relocations would be a nightmare to navigate properly
18:30 fdobridge: <a​irlied> you might be able to use the VA hw on cayman, but that is the last card in the terascale family
18:30 fdobridge: <T​riΔng3l> oh waaaaait
18:30 fdobridge: <T​riΔng3l> why do split BOs matter at all?
18:31 fdobridge: <TriΔng3l> Shouldn't they only have an effect on eviction priorities?
18:32 fdobridge: <T​riΔng3l> making the kernel be aware of actual textures/buffers
18:32 fdobridge: <airlied> that's the multiple images/buffers in the same BO
18:33 fdobridge: <a​irlied> radeon can't do that
18:33 fdobridge: <a​irlied> and it's a pretty basic feature of vulkan
18:33 fdobridge: <T​riΔng3l> Yes, but can you use a buffer BO for relocation of a texture constant?
18:34 fdobridge: <T​riΔng3l> so all binding is purely offsets in descriptors
18:35 fdobridge: <a​irlied> but vulkan memory allocation needs to be able to allocate VA space separate from BO space, maybe you can hack around it
18:36 fdobridge: <T​riΔng3l> basically like binding the same vertex buffer with different offsets or something like that, but with heterogeneous resource types
18:36 fdobridge: <T​riΔng3l> Vulkan doesn't need a concept of VA space if I understand correctly :frog_donut:
18:36 fdobridge: <T​riΔng3l> just VkDeviceMemory handles
18:37 fdobridge: <T​riΔng3l> though even D3D12 virtual addresses can be used as just some range tree search keys I think
18:38 fdobridge: <T​riΔng3l> unless you need to be able to cross the boundaries of allocations in a defined way in a single resource
18:38 fdobridge: <a​irlied> It really does want to be able to do multiple images/buffers per vkdevicememory
18:38 fdobridge: <a​irlied> and if a vkdevicememory is equiv to a kernel bo
18:38 fdobridge: <airlied> then the kernel driver just isn't sufficient, esp when it comes to tiling flags for images
18:38 fdobridge: <a​irlied> if you have multiple images in the same bo with different tiling requirements it won't work
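For context, the core Vulkan 1.0 pattern being described: several images bound at different offsets into one VkDeviceMemory, which is exactly what a one-tiling-flag-per-BO kernel interface can't express. A minimal sketch, with error handling and memory-type selection omitted:
```c
#include <vulkan/vulkan.h>

/* Bind two already-created images into a single VkDeviceMemory allocation at
 * different offsets -- core Vulkan 1.0 behaviour that requires the kernel
 * side to cope with per-resource (not per-BO) tiling. */
static void bind_two_images(VkDevice dev, VkImage img_a, VkImage img_b,
                            uint32_t memory_type_index, VkDeviceMemory *out_mem)
{
    VkMemoryRequirements req_a, req_b;
    vkGetImageMemoryRequirements(dev, img_a, &req_a);
    vkGetImageMemoryRequirements(dev, img_b, &req_b);

    /* Place image B after image A, respecting B's alignment. */
    VkDeviceSize offset_b =
        (req_a.size + req_b.alignment - 1) & ~(req_b.alignment - 1);

    VkMemoryAllocateInfo alloc = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .allocationSize = offset_b + req_b.size,
        .memoryTypeIndex = memory_type_index,
    };
    vkAllocateMemory(dev, &alloc, NULL, out_mem);

    vkBindImageMemory(dev, img_a, *out_mem, 0);
    vkBindImageMemory(dev, img_b, *out_mem, offset_b);
}
```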
18:39 fdobridge: <T​riΔng3l> Yes, I want to do that by referencing the whole VkDeviceMemory that backs the resource when I'm using one, and an offset inside the resource constant
18:39 fdobridge: <a​irlied> generally in cards with VA space the tiling is attached to the VA not the physical memory allocation
18:39 fdobridge: <T​riΔng3l> oh, is tiling needed for validation?
18:39 fdobridge: <a​irlied> tiling is needed to work at all
18:39 fdobridge: <T​riΔng3l> in the BO itself, not the texture descriptor
18:39 fdobridge: <a​irlied> yes but it has to be able to have multiple different tilings in the same BO
18:40 fdobridge: <T​riΔng3l> I always found it weird that the BO level is aware of tiling at all
18:40 fdobridge: <a​irlied> which just isn't possible with the current kernel driver
18:40 fdobridge: <a​irlied> hence why amdgpu was written and we are rewriting nouveau
18:42 fdobridge: <T​riΔng3l> tiling is a property of relocations in `radeon` if I understand correctly
18:42 fdobridge: <T​riΔng3l> though not sure if you can have multiple relocations for one BO?
18:43 fdobridge: <T​riΔng3l> in one command buffer
18:45 fdobridge: <a​irlied> oh maybe you could have different relocs for each thing, might be hackable
18:46 fdobridge: <a​irlied> though uggh for using relocs at all in a vk driver
18:49 fdobridge: <T​riΔng3l> By the way, I think it shouldn't matter at all, Gallium Radeon winsys sets `RADEON_CS_KEEP_TILING_FLAGS` for all graphics and compute work (I hope I'm not reading some GCN-specific code though…)
18:50 fdobridge: <g​fxstrand> Seems to actually be a problem with instanced rendering. Forcing `gl_Layer` seems to work fine.
18:51 fdobridge: <g​fxstrand> What's that I hear? Is that Haswell quietly sobbing in the background?
18:51 fdobridge: <T​riΔng3l> do we need something like #r600/#terascale or (preliminary, but probably final name) #terakan 🪳 already :ayymd: as this is clearly not a #ID:1034184951790305330 topic… (edited)
18:52 fdobridge: <a​irlied> @TriΔng3l yeah probably a bit off topic for in here, but that flag and the tracking the kernel does I think will make it pretty impossible to get working
18:54 fdobridge: <airlied> my only good suggestion if someone was considering Vulkan on TeraScale (apart from don't) is to get as fast as possible to submitting a command buffer and multiple images with hard-coded shaders, and work out how limiting the kernel is going to be
18:55 fdobridge: <T​riΔng3l> Exactly, I already dropped one idea (one global memory binding for SSBO) because of the kernel :jPAIN:
19:58 fdobridge: <g​fxstrand> Found it! Turns out it was an MME builder bug.
20:50 fdobridge: <m​arysaka> oh no
21:53 fdobridge: <g​fxstrand> `Pass: 232028, Fail: 556, Crash: 371, Warn: 4, Skip: 1325924, Timeout: 4, Flake: 1676, Duration: 47:15`
21:54 fdobridge: <g​fxstrand> It's ok. It was a really small bug. But it made me realize we have a hole in the test plan: Tests that don't care which GPU you're on.
21:54 fdobridge: <g​fxstrand> Like, it's great to have Fermi tests and Turing tests. What we also need are generic tests which don't use the simulator to ensure the two are consistent. I'm typing some of those now.