06:58 fdobridge: <J​oshie with Max-Q Design> Probably didn't exist back then?
08:08 Ojus1_: Hi
08:09 Sid127: hi
08:14 Ojus1_: Hi, thanks Sid127 for the reply. I was checking whether my messages were reaching the channel!
08:14 Sid127: figured, yeah ^^'
08:15 Ojus1_: :-)
08:30 Ojus1_: Hello everyone, I trust you're all doing well. I recently discovered EVoC and, as an open-source enthusiast, I'm truly excited about this program. I believe it could serve as a catalyst for my journey into open-source driver development. Currently, I'm in search of a mentor for the EVoC program who can provide guidance throughout this endeavor. I am interested in the Nouveau project because I have suitable NVIDIA GPU hardware, so I believe I can do proper development.
08:31 Ojus1_: I would appreciate if anyone is willing to mentor and assist me in this endeavor.
08:32 Ojus1_: I am open to learning new skills and facing new challenges.
08:43 fdobridge: <m​agic_rb.> As someone who recently joined, I'd recommend following the discussion here for a month or two, looking out for open issues and any opportunities people might mention here of the style "it's annoying but not hard to fix"
08:49 Ojus1_: Hi magic_rb, thanks for the suggestion. Sure, I will do that. Could you please confirm whether the mailing list given on the Xorg wiki (https://www.x.org/wiki/SummerOfCodeIdeas/) is up to date? I've tried to reach out on the mailing list several times before but didn't receive any response. Maybe I'm using an old mailing list?
12:42 fdobridge: <t​riang3l> Does it guarantee that without freeing, allocations will succeed for the entire pool?
12:42 fdobridge: <t​riang3l> I'm considering using it for shaders, but for descriptors idk
12:43 fdobridge: <t​riang3l> (RADV also has a custom allocator for shaders if I recall correctly :blobcatnotlikethis:) (edited)
14:03 fdobridge: <g​fxstrand> I don't know what you mean by that
14:04 fdobridge: <g​fxstrand> Could be. It was certainly very new.
14:58 fdobridge: <t​riang3l> That if the application only does `vkAllocateDescriptorSets` within a pool and never `vkFreeDescriptorSets`, it will surely be able to allocate all `VkDescriptorPoolSize::descriptorCount` descriptors across `maxSets`, regardless of the layout of each allocation
14:59 fdobridge: <g​fxstrand> Yes. You have to take alignment into account (which we do) but yes.
15:01 fdobridge: <t​riang3l> That if the application only does `vkAllocateDescriptorSets` within a pool and never `vkFreeDescriptorSets`, it will surely be able (at least without getting `VK_ERROR_OUT_OF_POOL_MEMORY`) to allocate all `VkDescriptorPoolSize::descriptorCount` descriptors across `maxSets`, regardless of the layout of each allocation (edited)
15:02 fdobridge: <g​fxstrand> And it provides the usual allocator guarantees:
15:02 fdobridge: <g​fxstrand> - The first time you allocate, you can allocate all the memory, regardless of how you partition it
15:02 fdobridge: <g​fxstrand> - If you free everything one chunk at a time, it re-combines so it's like you never allocated
15:02 fdobridge: <g​fxstrand> Everything else is best-effort
15:05 fdobridge: <g​fxstrand> There was a little performance issue on ANV back in the day where it aligned offsets but didn't align up sizes and large descriptor pools could end up with a giant pile of holes and slow the allocator down. That's easily fixed by just making sure everything is always aligned the same and the sizes are aligned. Now ANV is fine. The NVK MR doesn't have that bug.
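The guarantees and the alignment fix described above can be illustrated with a small offset allocator. This is only a sketch under stated assumptions, not the ANV or NVK code: `struct heap`, `heap_alloc`, `heap_free`, `HEAP_ALIGN` and `HEAP_MAX_HOLES` are hypothetical names and values chosen for the example. Because every size is rounded up to the same alignment, offsets stay aligned automatically, a fresh heap can be partitioned in any way until it is full, and freeing every chunk merges the holes back into one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HEAP_MAX_HOLES 64   /* hypothetical cap on tracked free ranges */
#define HEAP_ALIGN     32u  /* hypothetical alignment, must be a power of two */

struct hole { uint32_t offset, size; };

struct heap {
    struct hole holes[HEAP_MAX_HOLES]; /* free ranges, sorted by offset */
    unsigned num_holes;
};

static uint32_t align_up(uint32_t v, uint32_t a) { return (v + a - 1) & ~(a - 1); }

static void heap_init(struct heap *h, uint32_t size)
{
    assert(size % HEAP_ALIGN == 0);
    h->holes[0] = (struct hole){ .offset = 0, .size = size };
    h->num_holes = 1;
}

/* First-fit allocation. Sizes are aligned up, so every hole boundary
 * stays a multiple of HEAP_ALIGN and no unusable slivers appear. */
static bool heap_alloc(struct heap *h, uint32_t size, uint32_t *offset_out)
{
    assert(size > 0);
    size = align_up(size, HEAP_ALIGN);
    for (unsigned i = 0; i < h->num_holes; i++) {
        if (h->holes[i].size < size)
            continue;
        *offset_out = h->holes[i].offset;
        h->holes[i].offset += size;
        h->holes[i].size -= size;
        if (h->holes[i].size == 0) {
            /* The hole is used up: close the gap in the array. */
            for (unsigned j = i + 1; j < h->num_holes; j++)
                h->holes[j - 1] = h->holes[j];
            h->num_holes--;
        }
        return true;
    }
    return false;
}

/* Return a range and merge it with the neighbouring holes, if any. */
static void heap_free(struct heap *h, uint32_t offset, uint32_t size)
{
    size = align_up(size, HEAP_ALIGN);
    unsigned i = 0;
    while (i < h->num_holes && h->holes[i].offset < offset)
        i++;
    bool merge_prev = i > 0 &&
        h->holes[i - 1].offset + h->holes[i - 1].size == offset;
    bool merge_next = i < h->num_holes && offset + size == h->holes[i].offset;

    if (merge_prev && merge_next) {
        h->holes[i - 1].size += size + h->holes[i].size;
        for (unsigned j = i + 1; j < h->num_holes; j++)
            h->holes[j - 1] = h->holes[j];
        h->num_holes--;
    } else if (merge_prev) {
        h->holes[i - 1].size += size;
    } else if (merge_next) {
        h->holes[i].offset = offset;
        h->holes[i].size += size;
    } else {
        assert(h->num_holes < HEAP_MAX_HOLES);
        for (unsigned j = h->num_holes; j > i; j--)
            h->holes[j] = h->holes[j - 1];
        h->holes[i] = (struct hole){ .offset = offset, .size = size };
        h->num_holes++;
    }
}
```

A caller would `heap_init()` once per pool, then pair each `heap_alloc()` with a `heap_free()` of the same offset and size. Anything beyond the two guarantees (fresh heap, fully freed heap) is best-effort in this sketch too, since first-fit can still fragment.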
16:22 fdobridge: <!​DodoNVK (she) 🇱🇹> What else is RADV hand-rolling?
16:32 fdobridge: <g​fxstrand> Almost everything?
16:33 fdobridge: <g​fxstrand> That happens with old drivers, though. They made some decision like 6 years ago when the only other prior art was ANV and then someone eventually made a common thing and maybe ANV moved over or maybe it didn't and it's never been worth the effort or regression risk to switch to it.
16:33 fdobridge: <g​fxstrand> That's kinda the story of big software projects.
16:35 fdobridge: <g​fxstrand> NVK is using All The Things because a) It's a brand new driver, b) I wrote most of All The Things so of course I'm going to use them, and c) Part of the internal justification for NVK is as a user of All The Things so we can add more Things.
16:35 fdobridge: <S​id> nvk is a victory lap of sorts for faith
16:36 fdobridge: <S​id> finally getting to use all the things she put in so much effort into
16:39 fdobridge: <m​agic_rb.> other mesa devs: why do we need this thing which is supposed to work for all but no one uses it? faith: \*proceeds to write nvk\* uh, nvk uses it :)
16:40 fdobridge: <g​fxstrand> You joke but it's kinda true. NVK is something of a showcase of the Mesa runtime
16:40 fdobridge: <g​fxstrand> It's also taught me that the Mesa Vulkan runtime needs to be improved and is hitting structural limits.
16:41 fdobridge: <k​arolherbst🐧🦀> ~~valium when~~
16:41 fdobridge: <m​agic_rb.> from a user perspective, it's doing great, already surpassed the proprietary driver in my eyes (it doesn't make zfs segfault >:( )
16:42 fdobridge: <g​fxstrand> I'm currently noodling what runtime2 should look like to get rid of those limitations so we can make it even easier to write drivers. (edited)
16:47 fdobridge: <g​fxstrand> IDK if it's a victory lap or a chance at redemption. 😅 There's a whole lot of "I was an idiot when I built ANV. If I ever build a new Vulkan driver, I'm going to do things differently."
16:47 fdobridge: <S​id> it's both :D
16:48 fdobridge: <g​fxstrand> For instance, NVK doesn't have any lock-free data structures. It turns out that uncontested locks are cheap and getting your lock-free data structure wrong is not.
16:51 fdobridge: <n​anokatze> are you referring to anv_allocator.c
16:51 fdobridge: <g​fxstrand> Well, `simple_mtx` is cheap, anyway. And some of the lock-free data structures in ANV actually use more atomic operations to do a thing than are required by `simple_mtx_lock()` and `simple_mtx_unlock()`. (edited)
16:51 fdobridge: <g​fxstrand> Yes
16:51 fdobridge: <g​fxstrand> It's a work of art and also should never have existed.
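The "uncontested locks are cheap" point can be pictured with a free list that simply takes a mutex. This is a minimal sketch, not Mesa code: it uses `pthread_mutex_t` as a stand-in for `simple_mtx`, and `struct slot_pool`, `slot_pool_alloc`, `slot_pool_free`, `POOL_SLOTS` and `SLOT_NONE` are hypothetical names invented for the example. The critical sections are a couple of loads and stores, so when the lock is not contended the cost is dominated by the lock and unlock themselves.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SLOTS 8           /* hypothetical pool size */
#define SLOT_NONE  UINT32_MAX  /* returned when the pool is empty */

struct slot_pool {
    pthread_mutex_t lock;      /* stand-in for a simple_mtx */
    uint32_t next[POOL_SLOTS]; /* next[i] is the free slot after i */
    uint32_t head;             /* first free slot, or SLOT_NONE */
};

static void slot_pool_init(struct slot_pool *p)
{
    pthread_mutex_init(&p->lock, NULL);
    for (uint32_t i = 0; i < POOL_SLOTS; i++)
        p->next[i] = (i + 1 < POOL_SLOTS) ? i + 1 : SLOT_NONE;
    p->head = 0;
}

/* Pop the first free slot. The lock makes the read-modify-write of
 * `head` trivially correct, with no ABA problem to reason about. */
static uint32_t slot_pool_alloc(struct slot_pool *p)
{
    pthread_mutex_lock(&p->lock);
    uint32_t slot = p->head;
    if (slot != SLOT_NONE)
        p->head = p->next[slot];
    pthread_mutex_unlock(&p->lock);
    return slot;
}

/* Push a slot back onto the free list. */
static void slot_pool_free(struct slot_pool *p, uint32_t slot)
{
    assert(slot < POOL_SLOTS);
    pthread_mutex_lock(&p->lock);
    p->next[slot] = p->head;
    p->head = slot;
    pthread_mutex_unlock(&p->lock);
}
```

A lock-free version of the same stack would need a compare-and-swap loop plus some defence against ABA (for example a generation counter packed next to `head`), which is where the extra atomic operations and the room for subtle bugs come from.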
16:52 fdobridge: <n​anokatze> I remember opening it up for the first time, not understanding anything and wondering "why would anyone ever care to make vkAllocateMemory fast" (edited)
16:52 fdobridge: <g​fxstrand> Oh, it's not about `vkAllocateMemory()`. It's about all the silly little state objects that Intel requires that all have to come from the same base address. :blobcatnotlikethis:
16:52 fdobridge: <!​DodoNVK (she) 🇱🇹> ~~netborg won't be happy with this~~
16:52 fdobridge: <g​fxstrand> And some of it is absolutely necessary thanks to Intel's love of base addresses.
16:53 fdobridge: <g​fxstrand> Why plumb 64-bit addresses through your hardware when you can just have 32-bit addresses and global base addresses for everything? (edited)
16:53 fdobridge: <g​fxstrand> I mean, that's kinda what NVIDIA did with texture/sampler heaps (and the shader heap, pre-Volta) but they're way more intentional and selective about it.
16:54 fdobridge: <g​fxstrand> Intel has lots of indirect state and lots of base addresses and juggling it all is a giant headache.
16:54 fdobridge: <g​fxstrand> Oh, you want to change blend state? Yeah, go allocate some from `DYNAMIC_STATE_BASE_ADDRESS`.
16:55 fdobridge: <g​fxstrand> Having all those base addresses also means we can't effectively cache those allocations in the `VkCommandPool`, they have to be allocated with a global allocator. :blobcatnotlikethis:
16:55 fdobridge: <p​ac85> Can't all driver objects be kept in the lower 32bits? Should be enough right?
16:55 fdobridge: <n​anokatze> IIUC there's stuff that is stricter than just "low 4G"
16:56 fdobridge: <g​fxstrand> Yeah...
16:56 fdobridge: <g​fxstrand> You'd like to think it was as simple as "low 32 bits". 🙃
16:56 fdobridge: <p​ac85> Uhm
16:56 fdobridge: <g​fxstrand> Binding tables are especially dumb
16:56 fdobridge: <p​ac85> Afaik amd has 64bits for everything but the driver keeps stuff in 4GBs as an optimization
16:56 fdobridge: <g​fxstrand> Binding tables and surface states come from the same base address but binding table offsets are only 16 bits so you have to somehow put all the binding tables at the bottom of the base address and surface states above.
16:57 fdobridge: <p​ac85> Ah I see
16:57 fdobridge: <g​fxstrand> Iris handles this by only allowing 64K for binding tables and throwing away the bottom part whenever it runs out.
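The 64K window approach described above can be sketched as a bump allocator that starts over when it runs out. This is an illustration only, not the Iris implementation: `struct bt_region`, `bt_alloc`, `BT_REGION_SIZE` and `BT_ALIGN` are hypothetical, and the alignment value is an assumption. The point it shows is that every returned offset fits in 16 bits, and that a wrap invalidates everything allocated before it, which is acceptable for an immediate-mode GL driver but not for command buffers that must stay valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BT_REGION_SIZE (64u * 1024u) /* offsets must fit in 16 bits */
#define BT_ALIGN       64u           /* hypothetical table alignment */

struct bt_region {
    uint32_t next;       /* next free byte inside the 64K window */
    uint32_t generation; /* bumped each time the window is recycled */
};

/* Bump-allocate a binding table inside the window. When the window is
 * full, start again from offset 0 and report it through `wrapped`, so
 * the caller knows all older tables are gone and must be re-emitted. */
static uint16_t bt_alloc(struct bt_region *r, uint32_t size, bool *wrapped)
{
    size = (size + BT_ALIGN - 1) & ~(BT_ALIGN - 1);
    assert(size > 0 && size <= BT_REGION_SIZE);

    *wrapped = false;
    if (r->next + size > BT_REGION_SIZE) {
        r->next = 0;
        r->generation++;
        *wrapped = true;
    }

    uint32_t offset = r->next;
    r->next += size;
    return (uint16_t)offset; /* always below 64K by construction */
}
```

A zero-initialised `struct bt_region` is a valid empty window. The surface states that the tables point at would live above the window under the same base address; that part is left out of the sketch.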
16:57 fdobridge: <n​anokatze> weird choice to go for, putting state into memory
16:58 fdobridge: <g​fxstrand> ANV can't do that because it's Vulkan and you can't do immediate mode things. We have to be able to bake binding tables in a command buffer and have it live forever.
16:58 fdobridge: <g​fxstrand> ANV does a very careful and frankly insane juggle to make it all work.
16:58 fdobridge: <p​ac85> Yeah sounds like what tilers do
16:58 fdobridge: <n​anokatze> yeah that recent discussion we had is kinda like a new perspective on these things
16:59 fdobridge: <g​fxstrand> Intel used to be 100% stateless back in the Iron Lake days. They've been slowly moving more and more state to be hardware-managed but there are still a few stragglers.
16:59 fdobridge: <g​fxstrand> For tilers it makes a lot more sense. They need to iterate over the state multiple times. Once for VS and binning and then once per tile.
17:00 fdobridge: <g​fxstrand> For immediate mode, it makes a lot less sense because you can keep moving forward and never look back. As long as you've sized all your internal rings right, you can just do a little stall to flush until you have a new slot free whenever something changes.
17:00 fdobridge: <n​anokatze> I guess it would've been less of a problem if you could just use any random address instead of having stuff live in a random few M region
17:00 fdobridge: <p​ac85> Though I don't know whether nvidia has something equivalent to the state roll thing amd has
17:01 fdobridge: <n​anokatze> I don't see how you wouldn't
17:01 fdobridge: <n​anokatze> I guess they just might bank some stuff differently
17:03 fdobridge: <n​anokatze> btw, per pac85, adreno does stuff immediate mode style
17:03 fdobridge: <p​ac85> Yeah it's necessary to some extent to have some banking
17:04 fdobridge: <p​ac85> But like, I wonder how they do it
17:04 fdobridge: <n​anokatze> so I guess you can do that even if you're tiler, as long as you have sufficiently big tiles
17:06 fdobridge: <g​fxstrand> Yes but the binner references the command buffer so it can replay per-tile. (edited)
17:06 fdobridge: <g​fxstrand> So it's kinda different but kinda the same.
17:06 fdobridge: <n​anokatze> right
17:06 fdobridge: <n​anokatze> do we have the same thing on csf mali too
17:06 fdobridge: <p​ac85> They are just indirect buffers being called right?
17:07 fdobridge: <g​fxstrand> It goes very well with Qualcomm's whole MO of "We make a D3D12 GPU but it's a tiler." (edited)
17:07 fdobridge: <g​fxstrand> IDK what they're called
17:07 fdobridge: <p​ac85> So it still pays the cost of setting state it's all sequential afaik (edited)
17:08 fdobridge: <g​fxstrand> Yeah. You can look at it as indirect state with a form of compression where they only record state that changes between draws. As opposed to Arm where all the state is there all the time.
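The "only record state that changes between draws" idea can be shown with a tiny delta recorder. This is a sketch under assumptions, not the Adreno command stream format: `NUM_REGS`, `struct reg_write` and `record_delta` are hypothetical names made up for the example. Each draw stores only the registers whose value differs from the previous draw, so replaying the writes in order reconstructs the full state.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 4 /* hypothetical number of state registers */

struct reg_write {
    uint32_t reg;   /* register index */
    uint32_t value; /* new value to program */
};

/* Compare the state of the previous draw with the state of the next
 * one and emit a write only for the registers that changed. Returns
 * the number of writes stored in `out`. */
static unsigned record_delta(const uint32_t prev[NUM_REGS],
                             const uint32_t next[NUM_REGS],
                             struct reg_write out[NUM_REGS])
{
    unsigned count = 0;
    for (uint32_t reg = 0; reg < NUM_REGS; reg++) {
        if (prev[reg] != next[reg])
            out[count++] = (struct reg_write){ .reg = reg, .value = next[reg] };
    }
    return count;
}
```

The trade-off mentioned in the conversation follows from this shape: the recorded stream is compact, but it has to be replayed sequentially for every tile, whereas a design that keeps the full state per draw can jump straight to any draw at the cost of more memory.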
17:09 fdobridge: <p​ac85> My thinking is that if they had tiles as small as Mali they'd go really slow with this design
17:09 fdobridge: <p​ac85> So they take the kind of compromises you'd take on an immediate mode gpu then make it work with bigger tiles
17:09 fdobridge: <g​fxstrand> Intel's plan back in the ILK days was that the state was actually a tree and you could swap out some of the leaves but leave others identical between draws. But that's a lot of pointer chasing...
17:09 fdobridge: <p​ac85> Mmm makes sense
17:10 fdobridge: <g​fxstrand> But also the Adreno design literally is "What if we took r600 and made it a tiler?"
17:12 fdobridge: <p​ac85> Yeah it feels very familiar knowing amd hw, however the details tend to differ
17:12 fdobridge: <p​ac85> But lots of ideas were brought over
17:12 fdobridge: <!​DodoNVK (she) 🇱🇹> Is it possible to make a tiler-unfriendly application/game? 🍩
17:12 fdobridge: <p​ac85> Which makes it really interesting
17:13 fdobridge: <!​DodoNVK (she) 🇱🇹> That's in the scope of @triang3l
17:13 fdobridge: <n​anokatze> very easily
17:16 fdobridge: <p​ac85> Sounds like it would be the default
17:16 fdobridge: <p​ac85> At a glance
17:17 fdobridge: <p​ac85> Like, the "obvious way" is the non optimal way for tilers
17:17 fdobridge: <n​anokatze> why
17:17 fdobridge: <p​ac85> You need to use sub passes right?
17:17 fdobridge: <n​anokatze> no
17:18 fdobridge: <p​ac85> Mmm, are vk drivers smart enough to keep stuff on tile even when not doing subpasses?
17:18 fdobridge: <n​anokatze> no
17:18 fdobridge: <n​anokatze> I don't believe that's the biggest issue
17:19 fdobridge: <n​anokatze> in my understanding the big issue with tiler is actually pretty close to "memory you manage vs cache"
17:19 fdobridge: <n​anokatze> on tiler (unless you use sysmem rendering on adreno) you (the driver) are in charge of scheduling where copy to/from tile happens
17:19 fdobridge: <n​anokatze> and you might end up putting unnecessary copies in there
17:20 fdobridge: <n​anokatze> because e.g. rendering commands, even though they use same render targets, span two cmdbufs
17:20 fdobridge: <n​anokatze> and you have to copy to global memory at the end of one cmdbuf and copy back in the next
17:20 fdobridge: <n​anokatze> basically I think big issue is unnecessary render pass breaks
17:22 fdobridge: <n​anokatze> and you have to break render pass at cmdbuf boundaries or if the user changes render target set or whatever
17:23 fdobridge: <n​anokatze> while if you have a cache and the user was rendering to e.g. render targets A, B, C and now they're rendering to just A, stuff that was stored to B, C gets evicted over time automagically probably
17:23 fdobridge: <n​anokatze> while A stays in caches
17:23 fdobridge: <p​ac85> Mmm I see
17:23 fdobridge: <n​anokatze> that's just a hypothesis
17:24 fdobridge: <n​anokatze> I haven't verified it
17:24 fdobridge: <p​ac85> I'm thinking
17:26 fdobridge: <n​anokatze> opts like on-tile resolve and non-coherent fbfetch (the one where you do need a barrier before reading) and avoiding writes back to global memory are just something that kinda allows this hw to punch above its weight
17:26 fdobridge: <t​riang3l> I'll have to make it a tiler when depth and stencil tile row pitch alignments don't match :blobcatnotlikethis:
17:26 fdobridge: <n​anokatze> though I don't see why you couldn't put at least some of that stuff into hw with cache
17:26 fdobridge: <t​riang3l> well, a tiler with one tile
17:27 fdobridge: <n​anokatze> I suspect fast clears kinda work by telling the hw to not load the actual target mem, so it's kinda like not reloading tile contents, except you instruct the hw using stuff next to an image, instead of something in the cmdbuf or w/e your equivalent of that is
17:28 fdobridge: <p​ac85> Like?
17:28 fdobridge: <n​anokatze> not writing stuff back at least (edited)
17:28 fdobridge: <p​ac85> Keeping the entire frames in cache?
17:28 fdobridge: <p​ac85> Because like
17:28 fdobridge: <n​anokatze> well more like "I don't care if this gets written back"
17:28 fdobridge: <p​ac85> OK I see
17:28 fdobridge: <p​ac85> So not flushing cache
17:28 fdobridge: <n​anokatze> discarding
17:28 fdobridge: <p​ac85> Just letting it evict if it needs
17:29 fdobridge: <n​anokatze> ye
17:29 fdobridge: <p​ac85> Yeah I had this thought
17:29 fdobridge: <g​fxstrand> That's pretty accurate
17:29 fdobridge: <p​ac85> You would save on some write back when something else would cause the unneeded target to be evicted
17:30 fdobridge: <g​fxstrand> You can guarantee that in vkCreateImage because Vulkan only has combined depth/stencil.
17:31 fdobridge: <g​fxstrand> On NVK, though, we have to do that for linear rendering. That hardware hates linear images.
17:32 fdobridge: <g​fxstrand> Die space.
17:33 fdobridge: <!​DodoNVK (she) 🇱🇹> Is the blitter engine still present on Intel GPUs? A person is getting a warning about it 🐸
17:34 fdobridge: <g​fxstrand> Tilers are all about getting the maximum performance possible with a small die and shit memory bandwidth. They have a very small tile buffer that's incredibly fast (within an order of magnitude of registers, usually) and you try to do absolutely as much work as you can within that tiny little bit of memory. Every time you have to spill to RAM has a huge cost.
17:35 fdobridge: <t​riang3l> The issue here is that for mips, texture sampling computes pitches implicitly, but the alignment of the pitch depends on the number of bits per pixel, but for depth and stencil attachments the hardware has only one pitch register :blobcatnotlikethis: :blobcatnotlikethis: :blobcatnotlikethis: (edited)
17:35 fdobridge: <g​fxstrand> Kinda? It's been morphed into the copy engine which tries to be NVIDIA's DMA engine except it sucks because Intel image layouts are nuts and people didn't want to spend the silicon to make it competent.
17:35 fdobridge: <p​ac85> Anyway going back to this, I think another thing is the general structure of a renderer. Whereas on non tilers things like deferred and tons of passes make sense trivially I feel tilers would favor forward techniques and doing less passes even when that implies doing more work/doing it less optimally because memory bandwidth is the bottleneck?
17:35 fdobridge: <g​fxstrand> Oh, well that's pain...
17:35 fdobridge: <t​riang3l> https://gitlab.freedesktop.org/Triang3l/mesa/-/issues/3
17:37 fdobridge: <g​fxstrand> Intel has some stupid like that, particularly around compressed images. Depth/stencil are okay but the reason Vulkan didn't allow arrayed uncompressed views of compressed images until Maintenance6 is thanks to Broadwell. :blobcatnotlikethis:
17:38 fdobridge: <r​inlovesyou> i'm certainly looking forward to ditching nvidia proprietary for good. kernel 6.8.2 and 550.67 seem to disagree a good bit. Had to roll back so my openxr runtime could actually run haha
17:40 fdobridge: <g​fxstrand> Yes. And it's not that they mind multipass inherently. You can make multipass renderers that are *very* efficient on tilers. You just have to be careful how you structure them. If you just throw Skyrim's 8-pass renderer at one, though, you're going to feel the pain.
17:49 fdobridge: <n​anokatze> I'm talking about hw with cache gaining abilities like not doing a write back
17:50 karolherbst: Lyude: https://rust-for-linux.com/coccinelle-for-rust
17:50 fdobridge: <p​ac85> I wonder whether that's already possible
17:50 fdobridge: <n​anokatze> yeah I think the issue there is just fundamental lack in perf, not that hw hates it
17:52 fdobridge: <p​ac85> Yeah right if you manage to do multiple passes while keeping things in tile memory it would be very efficient. In my understanding subpasses are meant to do this explicitly
17:52 fdobridge: <n​anokatze> like Faith said to overcome that you need to do gymnastics like I suppose avoiding round tripping through global memory when you can just not do that (edited)
17:53 fdobridge: <p​ac85> I mean yeah memory bandwidth is really important no matter what
17:54 fdobridge: <n​anokatze> on big hw it's also not like your bw is stellar but we now have chungus caches high up in the hierarchy to soak up a lot of traffic
17:55 fdobridge: <t​riang3l> TeraScale is also fun in this aspect, the width/height/depth of the base level aren't padded to powers of two, but for mips they are. And while you can override the array/3D layer pitch for color and depth/stencil attachments, for sampled images it's always computed from the height. To create an uncompressed view of a compressed mip, I make that mip level 1, specify that 1 is the most detailed mip in the descriptor, and multiply its dimensio
17:56 fdobridge: <t​riang3l> for kernel validation, the BO also needs to be padded to that fake base level 🤷‍♂️
17:56 fdobridge: <t​riang3l> don't even know how that's going to work with magnification vs. minification, it's a very scary subject overall
17:57 fdobridge: <t​riang3l> and Vulkan didn't require uncompressed views of multiple compressed array layers until maintenance6, but it did require them for 3D textures
17:58 fdobridge: <n​anokatze> mmuless hw was a mistake
17:59 fdobridge: <t​riang3l> so far `dEQP-VK.api.copy_and_blit.core.image_to_buffer.2d_images.mip_copies_bc*_universal` tests pass at least 🐸
18:02 fdobridge: <t​riang3l> and Vulkan didn't (conditionally) require uncompressed views of multiple compressed array layers until maintenance6, but it did require them for 3D textures (edited)
18:10 fdobridge: <g​fxstrand> Tilers also tend to have small caches which they reserve primarily for texturing because that's where you just can't avoid crazy scattered access.
18:11 fdobridge: <n​anokatze> yeah but I mean caches for targets
18:11 fdobridge: <g​fxstrand> I don't remember what we did there. Broadwell may have just not supported that somehow
18:11 fdobridge: <n​anokatze> actually nvm misread what you were replying to
18:14 fdobridge: <!​DodoNVK (she) 🇱🇹> ~~Every time you send the 🥵 emote I assume you like hot gay men~~
18:15 HdkR: I really like AMD's RDNA2/3 concept. Everything is raw memory, we're going to give the GPU a wacking huge cache and let that deal with it :D
18:16 HdkR: Which isn't too far off from NVIDIA's RTX 4090 having 72MB of L2 cache
18:16 fdobridge: <g​fxstrand> No worries
18:29 fdobridge: <g​fxstrand> Let's not be making comments about people's sexualities here, please. It can be fun to joke around and I like being gay online as much as the next girl but you never know how it'll be taken.
18:31 fdobridge: <t​riang3l> Anyway my kinda NIL (code stolen from R800 AddrLib and slightly adjusted 😝) seems kinda working 🥳
18:31 fdobridge: <t​riang3l> https://cdn.discordapp.com/attachments/1034184951790305330/1223701238630187028/image.png?ex=661acf96&is=66085a96&hm=560f79aeb8cd95060940a63027e121aeccf226952b313dfcfd51f3b727ef624b&
18:33 fdobridge: <p​ac85> NIL?
18:33 fdobridge: <g​fxstrand> I'm sorry you have to deal with AMD image layouts. Only having a few pieces of hardware to worry about probably helps but I think AMD gets an award for the most insane image layouts of any HW vendor.
18:33 fdobridge: <g​fxstrand> And I've worked on Intel. 😅
18:34 fdobridge: <t​riang3l> Northern Islands Image Layout
18:34 fdobridge: <!​DodoNVK (she) 🇱🇹> Unfortunately I had to get this thought out because I had it for way too long (maybe I should make a private chat somewhere where I can say my unfiltered thoughts)
18:34 fdobridge: <p​ac85> Right
18:34 fdobridge: <p​ac85> I see
18:37 fdobridge: <t​riang3l> (excuse that "non-display 0", I didn't set VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT)
18:40 fdobridge: <t​riang3l> although it's probably preferable to assume it regardless of the usage specified in the image create info because of bäd games (edited)
19:07 fdobridge: <g​fxstrand> Wouldn't that be NIIL? Ya'know, to be different from NIL.
19:12 fdobridge: <g​fxstrand> Not that anything would ever include both. 🙃
19:13 fdobridge: <t​riang3l> It would be `struct terakan_image_surface` 🤷‍♂️
19:43 fdobridge: <t​riang3l> By the way, I'm glad that I've checked the spec now, because suddenly in maintenance4:
19:43 fdobridge: <t​riang3l> > The size memory requirement of a buffer or image is never greater than that of another buffer or image created with a greater or equal size. (edited)
19:52 fdobridge: <g​fxstrand> Woof
19:56 fdobridge: <n​anokatze> ye, that's a very useful feature
19:56 fdobridge: <n​anokatze> it also kills amdvlk
19:56 fdobridge: <n​anokatze> or used to
21:17 fdobridge: <r​edsheep> Just catching up on the tiler discussion earlier and I gotta ask, what is the actual difference between a true tiler gpu and a desktop gpu that can do tile based rasterization?
21:17 fdobridge: <r​edsheep> Is it just a matter of whether or not you have a dedicated tile cache?
21:17 fdobridge: <J​oshie with Max-Q Design> NIH when?
21:19 fdobridge: <r​edsheep> Both AMD and NVIDIA now do what I have heard some call "Tile based immediate mode rasterization" and without knowing exactly what either term means in this context that just sounds like a contradiction
21:21 fdobridge: <r​edsheep> So, basically I am confused, and there really aren't resources where you can read up on the details because this is so deep in the weeds
21:23 fdobridge: <r​edsheep> Also it's relevant because at some point I want to take another crack at seeing if I can do the work for tile based rasterization for nvk, but I don't really know what that will take if it might be more than just programming a few registers to turn it on.
21:24 HdkR: Primary improvement is the rasterizer executes in groups of tiles to improve cache residency
21:25 HdkR: Ideally executing in some sort of Hilbert or Morton curve to improve cache hits around the tile edges
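For reference, a Morton (Z-order) index is formed by interleaving the bits of the tile's x and y coordinates, so tiles that are close in 2D tend to be close in the 1D visiting order. The sketch below shows the standard bit-interleaving trick for 16-bit coordinates; `part1by1` and `morton2d` are names chosen for this example, and nothing here is claimed about how any particular GPU actually orders its tiles.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of v so that bit i moves to bit 2*i,
 * leaving a zero between every pair of original bits. */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000ffffu;
    v = (v | (v << 8)) & 0x00ff00ffu;
    v = (v | (v << 4)) & 0x0f0f0f0fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Morton index of tile (x, y): x occupies the even bits, y the odd
 * bits. Walking indices 0, 1, 2, ... visits tiles in a Z pattern. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```

With this encoding the first four indices cover the 2x2 block at the origin: (0,0), (1,0), (0,1), (1,1), and the next four cover the 2x2 block to its right, starting at (2,0).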
21:27 HdkR: There's a video on Youtube from David Kanter that hits the concept at a really high level
21:28 fdobridge: <r​edsheep> Yeah, I have watched that video and spent a good amount of time playing with trianglebin on various pieces of hardware
21:29 fdobridge: <r​edsheep> That's the only reason I even knew maxwell started doing anything clever here
21:29 HdkR: The best thing with tiled rasterization is that you don't need to care about the binning step that "proper" tilers have to do
21:32 fdobridge: <r​edsheep> When you talked about the hilbert or morton curves did you mean the ordering that the tiles get rasterized in? Trianglebin doesn't seem to show any behavior like that that I am aware of
21:33 HdkR: Yea, some tilers do a curve, some do not
21:34 HdkR: NV may just waterfall from top-left to bottom-right
21:34 fdobridge: <r​edsheep> I just retested, and yeah it's exactly that on ada
21:36 fdobridge: <r​edsheep> But also ada's tiles are absolutely massive, so the ordering is unlikely to be very important
21:37 HdkR: Big caches ease that burden a bit
21:37 fdobridge: <m​ohamexiety> Ada's tiles are very very massive due to the bigger cache yeah. iirc someone tried playing around with it and found that ~ half of the L2 gets dedicated to the framebuffer, but wasn't sure
21:37 HdkR: If you can get away with big wacking caches, it's likely cheaper to keep the hardware scheduler basic
21:37 fdobridge: <m​ohamexiety> https://cdn.discordapp.com/attachments/418571802193690626/1081611748714496070/trianglebin_2023-03-04_17-18-21.mp4?ex=661ab070&is=66083b70&hm=503048a9aa43488bb1ea9589ace195f6d03f2af035093bf878fd25aa59007c5b& this is how it looks like on AMD fwiw, RX 6600
21:38 fdobridge: <m​ohamexiety> and this is Ampere, smol tiles compared to Ada
21:38 fdobridge: <m​ohamexiety> https://cdn.discordapp.com/attachments/418571802193690626/1220306074121404457/-865252084843810321620230304-1622-26.4841654.mov?ex=6617b018&is=66053b18&hm=2859b6b87f327f042589ac262b7edd1d2f7035e49c050205e0202dc250510dbd&
21:38 fdobridge: <r​edsheep> I might not be the only one but yes, I have done that testing and it's approximately that
21:40 HdkR: Will be interesting to see if NV keeps the tile size the same with the 5090 and its 1.6TB/s memory BW when it launches
21:41 fdobridge: <r​edsheep> Well, assuming it's about as much faster as the bandwidth increase, I don't see any motivation to change it unless the cache is bigger again
21:42 fdobridge: <r​edsheep> Which, considering the L2 is already so much of the die it would be weird to increase it
21:43 fdobridge: <m​ohamexiety> the 4090 gets smaller slices so they can increase it (assuming same bus width) to 96MB
21:43 HdkR: You're not wrong
21:45 fdobridge: <r​edsheep> Yeah with 5090 if they decide to unlock the entire cache of the GB202 die it could easily use 64 MB for tiles instead of 32 that AD102 usually does
21:46 fdobridge: <r​edsheep> Just depends what the performance improvement is for having the rest of the cache doing other stuff vs switching tiles more times per frame
21:47 fdobridge: <r​edsheep> For the curious here are the results from the last time I did full testing on the two gpus I have access to
21:47 fdobridge: <r​edsheep> https://cdn.discordapp.com/attachments/1034184951790305330/1223750589154857191/Trianglebin_Results.ods?ex=661afd8c&is=6608888c&hm=f8c656a283eb4ef021255cdd97a184d813530b55eaf8ebcd02ddf643b0699838&
21:51 fdobridge: <r​edsheep> So the 4090 leaves 40 MB of cache for other stuff normally, and then only 8MB for other stuff if you are running 8x msaa, and the 1080ti... Doesn't make any sense. Somehow my numbers must be wrong or maxwell/pascal has some other cache besides the L2
21:53 fdobridge: <r​edsheep> Or the publicly marketed size of the L2 might have been wrong. Dunno. Pascal isn't all that important anymore so I haven't dug on that point.
21:56 fdobridge: <m​ohamexiety> I remember David Kanter speculating that there was some on board memory dedicated to the tiling, maybe that's what you encountered with the 1080Ti
21:57 fdobridge: <r​edsheep> That makes sense. As I remember more of that research I think I determined maxwell tiles were small enough it was probably L2, but pascal probably has something else going on.
21:58 fdobridge: <r​edsheep> For being two architectures that are usually talked about being practically identical they seem to be really pretty different in this area.
21:58 fdobridge: <m​ohamexiety> yep
22:16 fdobridge: <!​DodoNVK (she) 🇱🇹> So is it possible to identify a GPU based on tiling patterns?
22:36 HdkR: Probably would be easier just to query the GPU strings
22:39 xps420: Hello! I'm having trouble getting h264 encoding to work on my G92 nvidia card. I followed the instructions on VideoAcceleration.html, I also did a fresh install and did the troubleshooting steps.
22:39 xps420: vdpauinfo still shows that encoding is not supported on h264
23:16 Lyude: karolherbst: I haven't but I wonder if this person is using MST and we end up probing some of the ports in a different order on each resume or something silly like that
23:16 Lyude: could even be userspace honestly
23:17 Lyude: because if userspace isn't paying attention to the path attribute of MST connectors it won't be able to reliably tell what is what after resume
23:17 karolherbst: Lyude: it's worse than that
23:17 karolherbst: like the cursor interacts with the wrong display
23:17 Lyude: Oh wtf haha
23:17 Lyude: that's wild
23:17 karolherbst: at least that's how I read it
23:17 karolherbst: :D
23:18 Lyude: mhhh, I'm not sure
23:18 Lyude: it sounds like they might be implying that the virtual layout of the displays is getting reset
23:18 karolherbst: but "but then clicking on left monitor controls right monitor and vice versa." gives it away kinda
23:19 Lyude: so like: imagine they have monitor 1 on the left and 2 on the right, but they resume and 2 is on the left and 1 is on the right
23:19 Lyude: that's what it sounds like to me at least
23:19 Lyude: i guess we'll have to ask
23:20 karolherbst: yeah, I asked
23:21 Lyude: btw: this slimbook is nice
23:21 Lyude: and i've only found one small issue with nouveau that needs fixing on it so far :)
23:32 Lyude: https://queer.party/@Lyude/112187237911812357 karolherbst: i love rust
23:34 fdobridge: <m​arysaka> What a pretty masterpiece
23:40 karolherbst: heh :D
23:40 karolherbst: wait a second...
23:41 karolherbst: is that a circular type dep between C and S?
23:41 karolherbst: I mean it makes sense, but also...
23:41 Lyude: i know, it's kind of wild i'm able to do that
23:42 karolherbst: yeah.. I'm surprised this is legal
23:42 karolherbst: :D
23:58 fdobridge: <!​DodoNVK (she) 🇱🇹> ~~Anything is possible with a legalization pass~~