10:49 fdobridge_: <!​DodoNVK (she) 🇱🇹> I just discovered that using NAK for vertex shaders causes severe freezing issues (in GTA games at least)
10:50 fdobridge_: <!​DodoNVK (she) 🇱🇹> Also in the same games I discovered that fragment shaders are the main performance bottleneck for codegen
14:02 montjoie: hello, I have a GT1030 and according to the feature matrix power management is TODO, does this mean my card will run the same as with nvidia-drivers but always at maximum clock?
14:21 fdobridge_: <!​DodoNVK (she) 🇱🇹> montjoie: Your GPU will run at very low speeds
15:37 montjoie: fdobridge_: so going to nouveau will lead to degraded performance ?
15:41 f_: montjoie: yes
15:43 montjoie: this is sad, is there something that can be done?
15:43 f_: You can help if you wish.
15:43 montjoie: I fear the answer is "reverse engineering of the firmware"
15:43 f_: I guess it's not necessarily that.
15:43 montjoie: or is there some easier part
15:44 f_: https://nouveau.freedesktop.org/TestersWanted.html ?
15:44 f_: https://nouveau.freedesktop.org/HardwareDonations.html ?
15:44 montjoie: I can test at least
15:44 montjoie: but what to test ?
15:45 montjoie: just booting with nouveau as a start ?
18:16 fdobridge_: <g​fxstrand> What kind of freezing? A stutter or `VK_ERROR_DEVICE_LOST`?
18:16 fdobridge_: <g​fxstrand> I'm just gonna leave this here... https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26615 🏎️
18:16 fdobridge_: <!​DodoNVK (she) 🇱🇹> A long stutter (DEVICE_LOST is a crash)
18:17 fdobridge_: <!​DodoNVK (she) 🇱🇹> If I don't use NAK for vertex shaders they're less severe (RADV is even better of course)
18:18 fdobridge_: <g​fxstrand> Yeah, we have zero caching right now. Still odd that codegen is faster. I wonder why.
18:22 fdobridge_: <!​DodoNVK (she) 🇱🇹> I still have a hacked up version of the pipeline caching MR applied
18:23 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand in regards to that UBO MR, what's nvidia doing? Something similar? only bindless UBOs?
18:23 fdobridge_: <k​arolherbst🐧🦀> only ldg.constant?
18:28 fdobridge_: <g​fxstrand> They're promoting
18:28 fdobridge_: <g​fxstrand> IDK if they're doing the fancy macro but they're definitely promoting
18:28 fdobridge_: <g​fxstrand> They're also promoting textures. I need to figure out how that works.
18:29 fdobridge_: <k​arolherbst🐧🦀> ohh.. textures are trivial since kepler
18:29 fdobridge_: <k​arolherbst🐧🦀> you just set the index into the sampler/texture desc buffer
18:29 fdobridge_: <g​fxstrand> Yeah, it's just a table of bindless indices isn't it?
18:30 fdobridge_: <k​arolherbst🐧🦀> the indices point into the tsc/tic buffer _if_ you set the indexing mode to independent
18:30 fdobridge_: <k​arolherbst🐧🦀> like there is a `QMD` option for that
18:31 fdobridge_: <k​arolherbst🐧🦀> not sure how the other mode works
18:31 fdobridge_: <g​fxstrand> Yeah, I need to poke at all that after a bit.
18:31 fdobridge_: <g​fxstrand> If it weren't faster NVIDIA wouldn't be doing it.
18:31 fdobridge_: <k​arolherbst🐧🦀> it's like the bindless handle, just constant
18:31 fdobridge_: <g​fxstrand> Yeah. The hardware probably mega-caches them, though.
18:31 fdobridge_: <k​arolherbst🐧🦀> yeah...
18:32 fdobridge_: <k​arolherbst🐧🦀> or rather..
18:32 fdobridge_: <k​arolherbst🐧🦀> fetches them quicker
18:32 fdobridge_: <k​arolherbst🐧🦀> if you have the constant index in the encoding you don't have to wait until you read the register
18:32 fdobridge_: <k​arolherbst🐧🦀> there are also indirect forms
18:33 fdobridge_: <k​arolherbst🐧🦀> I think they call it "mixed bindless"
18:33 fdobridge_: <k​arolherbst🐧🦀> ehh partial bindless
18:34 fdobridge_: <k​arolherbst🐧🦀> like.. you can have the texture constant and the sampler be indirect
18:34 fdobridge_: <k​arolherbst🐧🦀> there is also a uniform bindless form if using uniform registers
18:35 fdobridge_: <k​arolherbst🐧🦀> but that one is kinda weird...
18:37 fdobridge_: <k​arolherbst🐧🦀> ehh wait.. there is a second buffer for all those handles...
18:37 fdobridge_: <k​arolherbst🐧🦀> right... one which maps from tex instruction to the actual indices.. I forgot about that one 🙂
18:43 fdobridge_: <g​fxstrand> Yeah, right now NAK doesn't know about anything but bindless. 😅
18:46 fdobridge_: <g​fxstrand> It shouldn't be too hard to hook up the other modes, though. It's just a bit more code in nak_nir_lower_tex.
18:46 fdobridge_: <g​fxstrand> And maybe a few more bits packed into my flags struct.
18:47 fdobridge_: <!​DodoNVK (she) 🇱🇹> I just compiled an updated kernel (hopefully it won't explode)
19:02 fdobridge_: <!​DodoNVK (she) 🇱🇹> And I'm back (hopefully vkd3d tests will be a little more stable now)
19:02 fdobridge_: <g​fxstrand> \o/
19:03 fdobridge_: <m​arysaka> ... now I want to do some testing tonight on some games :vibrate:
19:03 fdobridge_: <m​arysaka> totally wasn't considering playing a Hat In Time DLCs tonight
19:05 fdobridge_: <g​fxstrand> Go for it!
19:05 fdobridge_: <g​fxstrand> I'd love some before/after FPS numbers on some things.
19:05 fdobridge_: <g​fxstrand> I imagine it'll be... substantive.
19:06 fdobridge_: <g​fxstrand> IDK how much faster but it should easily show up in FPS numbers
19:06 fdobridge_: <g​fxstrand> I'll figure out robustness2 on Monday
19:06 fdobridge_: <g​fxstrand> IDK what nvidia is doing there
19:06 fdobridge_: <k​arolherbst🐧🦀> what's the issue with robustness2 btw?
19:06 fdobridge_: <g​fxstrand> We need tight bounds checking
19:07 fdobridge_: <k​arolherbst🐧🦀> in what sense?
19:07 fdobridge_: <g​fxstrand> 16B granularity
19:07 fdobridge_: <k​arolherbst🐧🦀> I meant.. how should the bound checking behave
19:07 fdobridge_: <g​fxstrand> anything OOB returns 0
19:07 fdobridge_: <k​arolherbst🐧🦀> on ubos?
19:07 fdobridge_: <g​fxstrand> yup
19:07 fdobridge_: <k​arolherbst🐧🦀> they already do
19:07 fdobridge_: <g​fxstrand> size has alignment requirements
19:08 fdobridge_: <g​fxstrand> or at least I thought it did
19:08 fdobridge_: <g​fxstrand> but maybe it was my base pointers that weren't aligned. 🤔
19:08 fdobridge_: <k​arolherbst🐧🦀> do you use LDC anywhere?
19:08 fdobridge_: <g​fxstrand> For indirect, yes
19:08 fdobridge_: <!​DodoNVK (she) 🇱🇹> Anyway retesting vkd3d(-proton) tests soon
19:08 fdobridge_: <k​arolherbst🐧🦀> okay...
19:08 fdobridge_: <k​arolherbst🐧🦀> that actually impacts OOB behavior then
19:09 fdobridge_: <k​arolherbst🐧🦀> mhh.. default `LDC` should return 0 on OOB as well (and doesn't move into the next cb)
19:09 fdobridge_: <k​arolherbst🐧🦀> what does fail with robustness2? or how does it fail?
19:10 fdobridge_: <k​arolherbst🐧🦀> (default mode as in `.IA`)
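The robustness2 behaviour discussed above ("anything OOB returns 0", checked at 16B granularity) can be pictured with a small CPU-side model. This is only a sketch under stated assumptions: `robust_load_vec4` and `VEC4_BYTES` are hypothetical names, the check is done per 16-byte vec4 load, and nothing here reflects how NVK, NAK or the hardware actually implement it.

```python
import struct

VEC4_BYTES = 16  # assumed granularity: one vec4 of 32-bit floats


def robust_load_vec4(backing: bytes, bound_size: int, offset: int):
    """Model of a bounds-checked UBO read (hypothetical helper).

    `bound_size` is the size of the bound range, which may be smaller
    than the backing allocation. Any load that is not fully inside the
    bound range returns zeros instead of touching memory past it.
    """
    if offset < 0 or offset + VEC4_BYTES > bound_size:
        return (0.0, 0.0, 0.0, 0.0)
    return struct.unpack_from("<4f", backing, offset)


# 32 bytes of backing memory holding the floats 1.0 .. 8.0
backing = struct.pack("<8f", *[float(i) for i in range(1, 9)])

in_bounds = robust_load_vec4(backing, 32, 16)      # second vec4, fully bound
out_of_bounds = robust_load_vec4(backing, 16, 16)  # only first 16 bytes bound
```

The second call is the interesting one: the memory behind the read exists, but the bound range ends at 16 bytes, so a tight check has to return zeros rather than the neighbouring data.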
19:15 fdobridge_: <!​DodoNVK (she) 🇱🇹> I only get vkd3d library segfaults now (probably because of some missing Vulkan functions)
19:15 fdobridge_: <!​DodoNVK (she) 🇱🇹> And my system has survived
19:17 fdobridge_: <!​DodoNVK (she) 🇱🇹> The amount of failures is definitely high :triangle_nvk:
19:17 fdobridge_: <!​DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1183125293414744084/message.txt?ex=6587324b&is=6574bd4b&hm=0780d096e7bac9e30c150b628dcfffea965b05e385bcc0fb935bf216ea0cb50e&
19:26 fdobridge_: <!​DodoNVK (she) 🇱🇹> vkd3d-proton needs a HOST_CACHED and DEVICE_LOCAL memory heap (is it easy to support?)
19:40 fdobridge_: <!​DodoNVK (she) 🇱🇹> Actually this is wrong (let me debug this further)
19:46 fdobridge_: <!​DodoNVK (she) 🇱🇹> Actually it needs both HOST_VISIBLE and HOST_CACHED properties 🤔
20:08 fdobridge_: <g​fxstrand> We might just not be advertising GART memory properly.
20:08 fdobridge_: <!​DodoNVK (she) 🇱🇹> After adding one extra memory property the tests are better now
20:09 fdobridge_: <!​DodoNVK (she) 🇱🇹> 📈
20:09 fdobridge_: <!​DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1183138462916956230/message.txt?ex=65873e8f&is=6574c98f&hm=8fd652883f8c426aac57d0cc0495fc1df7cd31d4ef002b7de4bd5858098778a9&
20:10 fdobridge_: <g​fxstrand> Feel free to make that an MR. If nothing else it'll give me an excuse to think about that.
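For context on the HOST_VISIBLE + HOST_CACHED requirement above, here is a sketch of the usual "find a memory type with all required property flags" search that Vulkan applications perform. The flag values are the `VkMemoryPropertyFlagBits` values from the Vulkan headers; `find_memory_type` and both memory type lists are hypothetical, and treating HOST_CACHED as the "one extra memory property" is an assumption made purely for illustration.

```python
# VkMemoryPropertyFlagBits values from the Vulkan headers
DEVICE_LOCAL = 0x1
HOST_VISIBLE = 0x2
HOST_COHERENT = 0x4
HOST_CACHED = 0x8


def find_memory_type(memory_types, type_bits, required):
    """Return the index of the first allowed memory type that has every
    flag in `required`, or None if no type qualifies (hypothetical helper).

    `memory_types` is a list of property flag masks, one per memory type;
    `type_bits` mirrors VkMemoryRequirements::memoryTypeBits.
    """
    for index, flags in enumerate(memory_types):
        allowed = type_bits & (1 << index)
        if allowed and (flags & required) == required:
            return index
    return None


wanted = HOST_VISIBLE | HOST_CACHED

# Hypothetical driver that exposes no cached host memory
before = [DEVICE_LOCAL, HOST_VISIBLE | HOST_COHERENT]
# Same driver after also advertising HOST_CACHED on the host-visible type
after = [DEVICE_LOCAL, HOST_VISIBLE | HOST_COHERENT | HOST_CACHED]

missing = find_memory_type(before, 0b11, wanted)
found = find_memory_type(after, 0b11, wanted)
```

With the first list the search comes back empty, which is the kind of situation where an application that insists on cached host memory gives up early.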
20:12 fdobridge_: <!​DodoNVK (she) 🇱🇹> I'm not sure how to debug the rest of these issues though (RenderDoc captures are going to be difficult here) :ferris:
20:18 fdobridge_: <g​fxstrand> Those sampler feedback fails are interesting. I would expect that to work. Maybe we have LODs wrong or something? 🤷🏻‍♀️
20:19 fdobridge_: <!​DodoNVK (she) 🇱🇹> This is not native Vulkan though (it's Direct3D 12 running through the vkd3d-proton layer)
20:23 fdobridge_: <k​arolherbst🐧🦀> ohh.. one advantage of const buffers over (cached) ldg.constant is, that you won't need barriers in the scoreboard (as they have an inherent cost)
20:23 fdobridge_: <k​arolherbst🐧🦀> what's the status in nak with all the scoreboarding anyway?
20:24 fdobridge_: <k​arolherbst🐧🦀> but mhh.. not sure if I'm disappointed about the 10% or not...
20:26 fdobridge_: <g​fxstrand> Should be scoreboarding okay
20:26 fdobridge_: <g​fxstrand> What's 10%?
20:26 fdobridge_: <k​arolherbst🐧🦀> well.. the ubo change is more relevant on hardware without ldg.constant anyway
20:26 fdobridge_: <k​arolherbst🐧🦀> your ubo MR
20:27 fdobridge_: <g​fxstrand> Oh, for sure. It's going to take Maxwell from shit to fine (mostly)
20:27 fdobridge_: <k​arolherbst🐧🦀> however, others have done way more code for way smaller perf improvements :ferrisUpsideDown:
20:27 fdobridge_: <g​fxstrand> Yeah, 10% is nothing to spit at.
20:28 fdobridge_: <g​fxstrand> Does anyone know where we are vs. the blob?
20:28 fdobridge_: <g​fxstrand> I've literally never cared to try
20:28 fdobridge_: <k​arolherbst🐧🦀> also smaller shaders and all that
20:28 fdobridge_: <k​arolherbst🐧🦀> phoronix did recently, no?
20:28 fdobridge_: <!​DodoNVK (she) 🇱🇹> ~~Except Maxwell v2~~
20:29 fdobridge_: <k​arolherbst🐧🦀> kepler 780 ti would be a cool thing to try
20:29 fdobridge_: <k​arolherbst🐧🦀> apparently not...
20:29 fdobridge_: <g​fxstrand> Well, yeah. There's no helping Pascal and Maxwell 2
20:29 fdobridge_: <k​arolherbst🐧🦀> which I even have one.. well.. not the 780 ti, but a titan 😄
20:30 fdobridge_: <k​arolherbst🐧🦀> yeah... I think I'll give that ubo MR a try on that one
20:31 fdobridge_: <g​fxstrand> @karolherbst How's Volta? Did you get a chance to play with it?
20:31 fdobridge_: <k​arolherbst🐧🦀> sooo.. a lot of things to do for tomorrow 😄
20:31 fdobridge_: <e​sdrastarsis> theres no perf boost here :blobcatnotlikethis:
20:32 fdobridge_: <k​arolherbst🐧🦀> wanted to try tomorrow
20:32 fdobridge_: <g​fxstrand> Okay, cool
20:32 fdobridge_: <g​fxstrand> No rush
20:33 fdobridge_: <g​fxstrand> I just keep writing if statements and not knowing which side of Volta I should cut on. It really is a cursed GPU.
20:39 fdobridge_: <g​fxstrand> What were you testing with?
20:40 f_: hm what is this bridged to?
20:41 fdobridge_: <!​DodoNVK (she) 🇱🇹> Discord
20:41 f_: oh ok. Thought it was a bridge to Matrix for some reason.
20:43 fdobridge_: <g​fxstrand> There are also matrix bridges but they're typically single user. We don't have a channel bridge.
20:44 fdobridge_: <e​sdrastarsis> No Man's Sky and Strange Brigade
20:44 fdobridge_: <g​fxstrand> Yeah, it's going to depend on the app. If they're using lots of textures or SSBOs or something, it may not be much faster.
20:45 fdobridge_: <g​fxstrand> Hrm... Wasn't DXVK using SSBOs for structure buffers for a while? Does anyone know if it's still doing that?
20:46 fdobridge_: <g​fxstrand> We really want those to be UBOs or at least .constant.
21:00 fdobridge_: <g​fxstrand> The thing that scares me the most about trying to get good perf is when we start fighting things like state change overhead. There's the obvious "don't do more than necessary" but there are a lot more subtle things like "what does rebinding a UBO cost?" that I don't fully understand.
21:01 fdobridge_: <g​fxstrand> Right now I'm assuming MME is free and that state changes are close to free unless they're obvious WFI cases.
21:03 fdobridge_: <g​fxstrand> @karolherbst @airlied I'm about to buy a laptop with an Ada card in it. Are those good and stable now as long as you have GSP?
21:07 fdobridge_: <!​DodoNVK (she) 🇱🇹> We need the `shader*Int64Atomics` stuff for those tests ⚛️
21:08 fdobridge_: <g​fxstrand> Those have been hooked up for a while. Well, buffer. I haven't hooked up shared yet.
21:08 fdobridge_: <g​fxstrand> Because 64-bit shared atomics are dumb
21:10 fdobridge_: <g​fxstrand> Well, by "a while" I mean like a week. 😅
21:11 fdobridge_: <!​DodoNVK (she) 🇱🇹> minStorageBufferOffsetAlignment needs to be 16 too
21:13 fdobridge_: <g​fxstrand> Yeah, I need to look into that. IDK how that's going to interact with CBufs.
21:15 fdobridge_: <!​DodoNVK (she) 🇱🇹> Both AMD and Intel have 4 (as well as Panfrost and Turnip)
21:16 fdobridge_: <g​fxstrand> Oh, storage? Yeah, that I can bring down easily. Make an MR and I'll merge it.
21:16 fdobridge_: <!​DodoNVK (she) 🇱🇹> v3dv is not meeting the requirement though with 32 (Dozen is OKish at 16)
21:17 fdobridge_: <g​fxstrand> We want 16 so we can pull vec4s but I think it's at like 64 right now
21:17 fdobridge_: <!​DodoNVK (she) 🇱🇹> Which it is (I checked the NVK code)
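To make the 16 vs. 64 alignment point above concrete, here is a small sketch of what a minimum offset alignment means for an application binding a storage buffer at some offset. `align_up` and `is_valid_offset` are hypothetical helpers written for this note; the only values taken from the discussion are 16 and 64.

```python
def align_up(value: int, alignment: int) -> int:
    """Round `value` up to the next multiple of a power-of-two alignment."""
    assert alignment > 0 and alignment & (alignment - 1) == 0
    return (value + alignment - 1) & ~(alignment - 1)


def is_valid_offset(offset: int, min_alignment: int) -> bool:
    """True if `offset` is usable given a driver's minimum offset alignment
    (for example minStorageBufferOffsetAlignment)."""
    return offset % min_alignment == 0


# An offset that is fine for vec4 access but not a multiple of 64
offset = 48

ok_at_16 = is_valid_offset(offset, 16)
ok_at_64 = is_valid_offset(offset, 64)

# What the application would have to round to under each limit
rounded_16 = align_up(offset, 16)
rounded_64 = align_up(offset, 64)
```

An offset of 48 is already usable with a limit of 16 but has to be moved to 64 under the larger limit, which a layer translating another API's buffer views cannot always do.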
21:19 fdobridge_: <a​irlied> Should be fine, though you will probably find a cursed one, also not sure we're fully getting laptop panels to light up properly yet, dual GPU laptops seem fine
21:20 fdobridge_: <g​fxstrand> It's dual GPU with Intel
21:21 fdobridge_: <!​DodoNVK (she) 🇱🇹> This is some weird NVK building behavior 🦬
21:21 fdobridge_: <!​DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1183156472624992286/message.txt?ex=65874f55&is=6574da55&hm=454fedd6cb914cee25d05a4911155e28ba46439790b8db6e07fad38d110555c9&
21:21 fdobridge_: <g​fxstrand> Which might be a little broken but that's a good excuse to fix it all.
21:31 fdobridge_: <a​irlied> I've been using a dual GPU Ada for the last few weeks, seems as stable as Ampere
21:33 fdobridge_: <!​DodoNVK (she) 🇱🇹> I crashed my system with NVK_MIN_SSBO_ALIGNMENT set to 16 :cursedgears:
21:35 fdobridge_: <!​DodoNVK (she) 🇱🇹> At first I had an MMU fault but then I got this: `watchdog: Watchdog detected hard LOCKUP on cpu 0` (I couldn't see this message until I rebooted my laptop)
21:39 fdobridge_: <!​DodoNVK (she) 🇱🇹> The proprietary driver has minStorageBufferOffsetAlignment set to 16 (so there must be some bug with NVK/NAK)
21:42 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand anyway, any further thoughts on that spir-v for that movs within a deref chain thing?
21:44 fdobridge_: <k​arolherbst🐧🦀> mhhhh cursed idea
21:46 fdobridge_: <k​arolherbst🐧🦀> always depends on the hardware, but the initial Ada enablement I did on a Intel+Nvidia laptop
21:46 fdobridge_: <k​arolherbst🐧🦀> *was
21:48 fdobridge_: <g​fxstrand> Well, we'll find out in about a week. 😅
21:48 fdobridge_: <g​fxstrand> Thanks for the SPIR-V. I'll look Monday.
21:48 fdobridge_: <g​fxstrand> @karolherbst Do you know when the cbuf alignment rules changed and what they are?
21:48 fdobridge_: <g​fxstrand> 64B seems to work on Turing.
21:56 fdobridge_: <k​arolherbst🐧🦀> goooood question. I don't know for Volta, for Turing+ it's 16 bytes, btw Volta has banks 0-17 available, turing seems to have cb banks 0-17 + 24-31, ampere seems to have 0-17 and maybe also 24-31 (unknown condition)
21:57 fdobridge_: <k​arolherbst🐧🦀> apparently if the `SPA` version is 7.3 there is 24-31 available for Volta...
21:58 fdobridge_: <k​arolherbst🐧🦀> there is this `SET_SPA_VERSION` command in 3D and compute, wondering if it also returns the spa version used.. and if volta can be forced to use 7.3?
22:04 fdobridge_: <!​DodoNVK (she) 🇱🇹> How about Pascal?
22:04 fdobridge_: <g​fxstrand> `minUniformBufferOffsetAlignment = 256` on Titan V. 🤡
22:04 fdobridge_: <k​arolherbst🐧🦀> I guess the change was with turing then 🙂
22:05 fdobridge_: <t​riang3l> I hope there are no fun things like no dynamic descriptor indexing for slots 14 and 15 on TeraScale 😝
22:05 fdobridge_: <k​arolherbst🐧🦀> is it actually important though?
22:06 fdobridge_: <k​arolherbst🐧🦀> but anyway.. should be 16b then 🙂 no idea what nvidia reports
22:07 fdobridge_: <t​riang3l> And in some cases 0 probably, in some cases NaN, for out-of-bounds reads (although dynamic data addressing with the constant cache on TeraScale only works with the D3D9-ish scalar loop counter, so that's not really an interesting case)
22:07 fdobridge_: <g​fxstrand> It changes what limit we can advertise.
22:07 fdobridge_: <g​fxstrand> And getting this one right is kinda important...
22:08 fdobridge_: <t​riang3l> It's 64 on proprietary Turing drivers if I recall correctly, but maybe I recall incorrectly
22:09 fdobridge_: <g​fxstrand> Yeah, I'm looking at gpuinfo.org
22:10 fdobridge_: <g​fxstrand> It's the easiest way to figure this stuff out.
22:25 fdobridge_: <k​arolherbst🐧🦀> oh sure, I was just wondering how important it is to report a smaller number