01:28 airlied[d]: of course now I can't get talos to hang with transfer queue
03:34 mangodev[d]: gfxstrand[d]: nope, have had it the entire time i've used nvk
03:34 mangodev[d]: the problem is the reproducibility
03:34 mangodev[d]: and the fact that it changes over time
03:34 mangodev[d]: lately, it's been real bad
03:34 mangodev[d]: about a month ago, it was okay
03:35 mangodev[d]: youtube is another site that's egregious with the flicker (lots of `backdrop-filter: blur()` and translucency in their new layout)
03:36 mangodev[d]: it's always an upper fragment of the screen that flickers
03:36 mangodev[d]: on a given page, it only flickers in one corner
03:36 mangodev[d]: sometimes it's the top left like in the screen recording, and sometimes (oftentimes) it's in the top right
04:02 gfxstrand[d]: mangodev[d]: And you're on 25.2?
04:31 mangodev[d]: gfxstrand[d]: yes
04:31 mangodev[d]: used it since 25.0
06:48 mohamexiety[d]: I do get flickers in discord too but it’s really hard to reproduce it in a way that I could report it
06:49 mohamexiety[d]: It’s not full screen tho it’s just certain elements that end up flickering at times, like say icons or server avatars or such
11:50 mohamexiety[d]: gfxstrand[d]: thought to test this theory out so tried out cyberpunk on 1440p instead of 4K and this actually didn't matter, it still MMU faulted at the exact same place. so I tried horizon remastered and I guess the mmu faults aren't as deterministic as I thought because I actually got a full run 😮
11:51 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1409505527125512192/image.png?ex=68ad9fd1&is=68ac4e51&hm=32629a9a0dd4d084aac06ba0caaad7482047deeefa55104322805852b1376522&
11:51 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1409505527687544922/image.png?ex=68ad9fd1&is=68ac4e51&hm=fddd010478f3abe94ac9d062567a4047b8e386ef387cdb9263f936713bfe5d5f&
11:51 mohamexiety[d]: the numbers are sad but it did complete
11:52 mohamexiety[d]: now it did mmu fault at first at a different area so I thought maybe that was a fluke and thought to try it again to see if it'd mmu fault at the usual spot but it just didn't
12:00 phomes_[d]: have you tried `NVK_DEBUG=zero_memory` and `NVK_DEBUG=trash_memory`?
12:02 phomes_[d]: mmu faults happening non-deterministically is my general experience. Sometimes those options help with that
12:03 mohamexiety[d]: yeah generally they're not deterministic but this one was just always in the exact same spot in the 3 games I tried so this was surprising
12:03 mohamexiety[d]: will try those
12:03 chikuwad[d]: :o
12:04 chikuwad[d]: trash_memory isn't mentioned in docs/drivers/nvk.rst
12:04 chikuwad[d]: what does it do?
12:05 mohamexiety[d]: I assume the opposite of zero memory, where the memory instead has random values
12:05 phomes_[d]: yes
12:05 phomes_[d]: https://gitlab.freedesktop.org/mesa/mesa/-/commit/0dad7857d8996e8e17d075b43e067950ed3776a2
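As a rough C sketch of what the two fills conceptually do at allocation time (this is not NVK's actual code; the names here are made up):
```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum nvk_debug_fill { FILL_ZERO, FILL_TRASH };

/* zero_memory hides use-of-uninitialized-memory bugs (stale data reads
 * as 0); trash_memory makes them loud by filling with garbage instead */
static void debug_fill_bo(void *map, size_t size, enum nvk_debug_fill mode)
{
   if (mode == FILL_ZERO) {
      memset(map, 0, size);
   } else {
      uint8_t *p = map;
      for (size_t i = 0; i < size; i++)
         p[i] = (uint8_t)rand();
   }
}
```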
12:06 chikuwad[d]: interesting :birdnotes:
12:10 phomes_[d]: for cyberpunk do you get anything like this when launching a new game? like right after configuring the player character
12:10 phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1409510168710090752/Screenshot_From_2025-08-20_16-00-25.png?ex=68ada423&is=68ac52a3&hm=6e588d6303af5984c2557299a0df50eb893adae9a80b6390d935a4994dafc524&
12:10 mohamexiety[d]: on the first run the benchmark looks like that yeah
12:10 mohamexiety[d]: second run it loads properly
12:11 mohamexiety[d]: didn't actually try playing/loading saves/etc
12:14 phomes_[d]: ok. I will focus on other games so we do not duplicate efforts
12:15 phomes_[d]: I see it on Ada so that specific issue is not blackwell only btw
12:16 mohamexiety[d]: phomes_[d]: oh you can look at this, I am focusing on the mmu faults in specific
12:20 phomes_[d]: I should finish my work on VK_EXT_device_fault so we can get the crash diagnostics layer working. It might help. The games that were faulting suddenly stopped doing so, so my test cases disappeared
12:23 mohamexiety[d]: driver got scared
12:24 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1409513743821312021/IMG_0540.mov?ex=68ada778&is=68ac55f8&hm=88341df55ffcdbbe40fce2a8bc362a2805715d162624d6cd0037104e044929bc&
12:24 mohamexiety[d]: mhenning[d]: sorry about the ping but since you do test HZD does this corruption show up in the normal game as well or is it exclusive to the remastered? The weird lines
12:33 mohamexiety[d]: ok so having done 6 runs with `zero_vram` now, 3 1440p and 3 4K, there are no more mmu faults with blackwell :thonk:
12:33 mohamexiety[d]: will see other games
12:37 gfxstrand[d]: mohamexiety[d]: I'll give DA:TV the same treatment and see how that goes. Still surprising since they play fine on other GPUs.
12:39 mohamexiety[d]: cyberpunk 2077 _really_ hates zero_vram. now it does it in menus before even getting to the benchmark
12:39 mohamexiety[d]: [ 4203.239188] nouveau 0000:01:00.0: gsp: mmu fault queued
12:39 mohamexiety[d]: [ 4203.240743] nouveau 0000:01:00.0: gsp: rc engn:00000001 chid:40 gfid:0 level:2 type:31 scope:1 part:233 fault_addr:0000003f22610000 fault_type:00000002
12:39 mohamexiety[d]: [ 4203.240747] nouveau 0000:01:00.0: fifo:d00000:0028:0028:[GameThread[19284]] errored - disabling channel
12:39 mohamexiety[d]: [ 4203.240754] nouveau 0000:01:00.0: GameThread[19284]: channel 40 killed!
12:39 mohamexiety[d]: [ 4366.788104] nouveau 0000:01:00.0: gsp: mmu fault queued
12:39 mohamexiety[d]: [ 4366.789661] nouveau 0000:01:00.0: gsp: rc engn:00000001 chid:42 gfid:0 level:2 type:31 scope:1 part:233 fault_addr:0000003f47ad0000 fault_type:00000002
12:39 mohamexiety[d]: [ 4366.789665] nouveau 0000:01:00.0: fifo:d00000:002a:002a:[GameThread[20523]] errored - disabling channel
12:39 mohamexiety[d]: [ 4366.789672] nouveau 0000:01:00.0: GameThread[20523]: channel 42 killed!
12:39 mohamexiety[d]: maybe third time is the charm though
12:41 mohamexiety[d]: (it wasn't)
12:41 mohamexiety[d]: [ 4488.245588] nouveau 0000:01:00.0: gsp: mmu fault queued
12:41 mohamexiety[d]: [ 4488.247163] nouveau 0000:01:00.0: gsp: rc engn:00000001 chid:42 gfid:0 level:2 type:31 scope:1 part:233 fault_addr:0000003f479a0000 fault_type:00000002
12:41 mohamexiety[d]: [ 4488.247167] nouveau 0000:01:00.0: fifo:d00000:002a:002a:[GameThread[21494]] errored - disabling channel
12:41 mohamexiety[d]: [ 4488.247173] nouveau 0000:01:00.0: GameThread[21494]: channel 42 killed!
12:44 mohamexiety[d]: also typo; zero_memory not zero_vram (I did type it correctly in the launch args tho)
12:46 phomes_[d]: faking the vendorID is also worth a try. Some games enable things based on that. We don't have an env var for it. You can add a driconf option or just hardcode it
12:50 phomes_[d]: no mmu fault even with zero_memory here. What proton version are you on? My tests are on proton experimental bleeding edge
12:54 mohamexiety[d]: yeah the fault is blackwell exclusive and runs fine on ada for me. for proton, how do I check? I am just using whatever came with cachy
12:58 karolherbst[d]: Anybody wants to review some coop matrix stuff so I can make progress? trivial refactor: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36391 and ldsm: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36841
12:58 karolherbst[d]: ehh
12:58 karolherbst[d]: ldsm is here: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36363
12:59 mohamexiety[d]: no dice for avatar. the pushbuf fence stuff might be a good pointer tho? gfxstrand[d]
12:59 mohamexiety[d]: [ 5578.198266] nouveau 0000:01:00.0: gsp: mmu fault queued
12:59 mohamexiety[d]: [ 5578.199814] nouveau 0000:01:00.0: gsp: rc engn:00000001 chid:44 gfid:0 level:2 type:31 scope:1 part:233 fault_addr:0000003e82200000 fault_type:00000002
12:59 mohamexiety[d]: [ 5578.199818] nouveau 0000:01:00.0: fifo:d00000:002c:002c:[afop.exe[26377]] errored - disabling channel
12:59 mohamexiety[d]: [ 5578.199826] nouveau 0000:01:00.0: afop.exe[26377]: channel 44 killed!
12:59 mohamexiety[d]: [ 5578.199843] nouveau 0000:01:00.0: afop.exe[26377]: error fencing pushbuf: -19
12:59 mohamexiety[d]: [ 5578.199848] nouveau 0000:01:00.0: afop.exe[26377]: error fencing pushbuf: -19
12:59 mohamexiety[d]: [ 5578.199855] nouveau 0000:01:00.0: afop.exe[26377]: error fencing pushbuf: -19
13:15 phomes_[d]: mohamexiety[d]: in steam you can set it globally in Settings->Compatibility->Default compatibility tool. You can also set it per game in the game properties->Compatibility. You can install the experimental proton by searching for it in the library. Then go to its properties->Betas and set to bleeding edge. Then use that as the compatibility tool for the game
13:18 gfxstrand[d]: `zero_vram` doesn't help DA:TV but I also have a new theory: I don't think the MMU faults are the cause of the crash. I think the crash might be causing the MMU faults
13:18 mohamexiety[d]: phomes_[d]: got it. thanks!
13:19 mohamexiety[d]: gfxstrand[d]: `trash_memory` doesnt do anything for cyberpunk and avatar, will see horizon now
13:19 mohamexiety[d]: what sort of crash could cause a fault though :thonk:
13:22 mohamexiety[d]: actually hm. do these flags do anything for release mesa?
13:22 mohamexiety[d]: I run horizon in debug but the other 2 were on the system mesa install
13:24 gfxstrand[d]: The pattern I'm seeing in DA:TV is that it hangs for 5s and then an MMU fault gets reported and the game's crash window pops up simultaneously. This indicates to me that the real bug is some sort of timeout: when the fence times out, the game cleans something up and that causes the fault.
13:24 mohamexiety[d]: I see
13:26 gfxstrand[d]: There are a few places that could be coming from. A sync bug in the kernel still looks the most likely. But it could also be a shader difference that's leading to infinite looping somewhere, though I'm not sure what that would be.
13:26 mohamexiety[d]: could check the vkd3d logs for signs of something crashing. on cyberpunk what I saw was that the last thing was a failed vkQueueSubmit with an ERROR_DEVICE_LOST, but I am not sure if that's due to the fault or the cause of it
13:26 gfxstrand[d]: My experiments with `NVK_DEBUG=sync` still have me suspecting the kernel.
13:27 gfxstrand[d]: mohamexiety[d]: Look at timing. Monitor dmesg at the same time (from SSH if you have to) and see when the fault pops up vs. when it freezes.
13:28 mohamexiety[d]: hmm yeah
13:30 mohamexiety[d]: for cyberpunk at least the fault pops up exactly as the screen freezes so will check the logs then
13:36 mohamexiety[d]: nah not really clear there 😐
13:37 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1409532013974786068/cp2077.log?ex=68adb87b&is=68ac66fb&hm=e175af0311267f263510e221fc579e6be4db3a2efa268608694561ad10ce97ab&
13:37 mohamexiety[d]: there is some weirdness with swapchain semaphores though but no clue how important this is
13:46 mohamexiety[d]: kinda annoying and also weird thing here: while at first I could get into the benchmark and mmu fault there, right now on cyberpunk I can't even get past the beginning intro splash screens with the RED ENGINE title etc before it mmu faults there
13:48 mohamexiety[d]: mohamexiety[d]: (it isn't)
14:15 karolherbst[d]: mhhh
14:15 karolherbst[d]: `.maxComputeWorkGroupSize = {1024, 1024, 64},` those limits came from the GL driver, right?
14:16 karolherbst[d]: probably best to not touch it, but I _think_ the X axis can be 2048 actually...
14:17 karolherbst[d]: though nvidia also only reports 1024...
14:41 mohamexiety[d]: yeah NV reports 1024x1024x64 as well
14:43 gfxstrand[d]: karolherbst[d]: We probably copy+pasted from NV, actually.
14:43 gfxstrand[d]: We usually do that over copying from GL
14:43 karolherbst[d]: right...
14:44 karolherbst[d]: thing is.. there is a bit more in the ISA available, but not sure if it's just a bit always being 0 or if there is a way to increase the size.. maybe nvidia never bothered? who knows
14:44 karolherbst[d]: but X is one bit wider than Y
14:45 karolherbst[d]: but cuda docs also say 1024, so whatever
14:46 karolherbst[d]: I'm just confused every time I see it
14:48 gfxstrand[d]: There are lots of cases where there are more bits in the hardware than are actually used.
14:48 gfxstrand[d]: There are also lots of cases where they don't store N-1 so you can go up to 2N-1 in theory but the hardware only supports N
14:51 karolherbst[d]: right
15:14 OftenTimeConsuming: traps: firefox.real[12958] trap int3 ip:7faaae845eaf sp:7ffcff5470b0 error:0 in libglib-2.0.so.0.8400.1[6aeaf,7faaae7f9000+b2000] hmm
15:15 jja2000[d]: steel01[d]: I see your stuff on the tegra ml. If you need me to test anything on Pixel C (smaug) or on a Jetson T186, lmk
15:16 jja2000[d]: Pixel has been a tad annoying to get working properly, but if I can get USB boot working, it suddely becomes much less so :^)
15:18 OftenTimeConsuming: Okay, now that glib is recompiled that crash no longer happens.
15:23 gfxstrand[d]: I suppose it could also be something funny with subc switches... 🤔
15:24 gfxstrand[d]: The thing that bugs me is just how many of the stalls I saw were tiny command buffers that didn't do anything besides flush or invalidate some caches.
15:24 gfxstrand[d]: I've got two working theories on that: One is that they're small enough that it causes two back-to-back signals that are close enough together that the kernel misses the second while handling the IRQ for the first.
15:25 gfxstrand[d]: Second is that there's something wedging earlier and those command buffers are just the trigger somehow.
15:25 gfxstrand[d]: But given that I'm syncing after literally every submit, that second one seems unlikely.
15:26 gfxstrand[d]: Actually... Not every submit...
15:26 gfxstrand[d]: Let me add a print
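A hypothetical C sketch of that first theory (nothing here is real nouveau code; it just illustrates the suspected lost-wakeup ordering):
```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t hw_irq_pending; /* stand-in for the status reg */
static _Atomic uint64_t hw_fence_seqno; /* stand-in for the fence page */
static uint64_t cpu_seen_seqno;

/* if two signals land close enough together, the second can get
 * folded into the IRQ being handled for the first */
static void fence_irq_handler(void)
{
   uint64_t seqno = atomic_load(&hw_fence_seqno); /* sample first */
   /* suspected bug: a second signal landing here is merged into the
    * pending state we are about to clear, and nothing re-samples
    * after the ack, so the waiter never gets woken for it */
   atomic_store(&hw_irq_pending, 0);              /* ack */
   cpu_seen_seqno = seqno;
}
```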
15:29 karolherbst[d]: gfxstrand[d]: any immediate plans to work on the UGPR/GPR/offset IO stuff? Otherwise I'll have time for it like this week or so.
15:31 gfxstrand[d]: karolherbst[d]: I did rebase predicates. That's what I was planning to work on if I can get Blackwell stable.
15:31 gfxstrand[d]: I want to see how predicates mix with pre-RA scheduling.
15:31 gfxstrand[d]: I'm hopeful we can get some real benefit there
15:31 karolherbst[d]: predicates? you mean for small if/else or what?
15:32 gfxstrand[d]: Specifically for bounds-checked `load_global`
15:32 karolherbst[d]: right
15:32 karolherbst[d]: but yeah, I implied that 😄
15:32 gfxstrand[d]: But why I bring that up is because I add a `ldg_nv` intrinsic in that series
15:32 karolherbst[d]: right...
15:32 gfxstrand[d]: `stg_nv` would probably also be helpful but it's loads that are killing us
15:33 karolherbst[d]: yeah
15:33 gfxstrand[d]: Especially with how much VKD3D loves SSBOs
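As a rough illustration of what the predicated form buys (the exact `ldg_nv` semantics here are assumed, not taken from the series):
```c
#include <stdbool.h>
#include <stdint.h>

/* a bounds-checked SSBO read: the branch becomes a predicate and the
 * whole thing maps to a single predicated LDG instead of real
 * control-flow around the load */
static inline uint32_t
load_ssbo_checked(const uint32_t *base, uint32_t idx, uint32_t len)
{
   bool in_bounds = idx < len;        /* becomes the predicate */
   return in_bounds ? base[idx] : 0;  /* one predicated LDG    */
}
```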
15:33 karolherbst[d]: I mostly just want to fold constants into global load/stores; that almost never happens today because it's a bit of a pain
15:33 karolherbst[d]: but might as well do the full thing
15:33 gfxstrand[d]: But yeah, I think loads and offsets and bounds checking is my next area for all this stuff.
15:33 karolherbst[d]: it works out great for STS and LDS as well
15:34 karolherbst[d]: (- the 2nd ugpr source)
15:34 karolherbst[d]: the shared ops are quite capable because they have this stride thing (x4, x8, x16 factor on the GPR address) + UGPR + imm24 offset
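In C terms, the effective-address form being described is roughly this (all names hypothetical):
```c
#include <stdint.h>

/* LDS/STS address: a GPR scaled by x4/x8/x16, plus a UGPR, plus a
 * 24-bit immediate offset */
static inline uint32_t
lds_effective_addr(uint32_t gpr, unsigned stride_log2 /* 2, 3, or 4 */,
                   uint32_t ugpr, uint32_t imm24)
{
   return (gpr << stride_log2) + ugpr + (imm24 & 0xffffff);
}
```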
15:34 gfxstrand[d]: karolherbst[d]: Yeah, I've always hated the hack I did with trying to pull it out at NIR->NAK time.
15:35 karolherbst[d]: yeah...
15:35 karolherbst[d]: though the issue is that on the nir side a few things get optimized to lea as well..
15:35 gfxstrand[d]: Yeah but if we do the translation to `ldg` in NIR, we can do it before LEA
15:35 karolherbst[d]: in the coop things I've seen plain LDS and STS turn into ones with a stride and a constant offset, where before we had neither
15:36 karolherbst[d]: nir_opt_offset does it in my MR
15:36 karolherbst[d]: and I ran that before the lea opts
15:36 karolherbst[d]: soo.. done and done
15:36 karolherbst[d]: and then it's just folding in a shift in
15:36 karolherbst[d]: though I'd also like to make the ugpr offset work
15:37 karolherbst[d]: anyway, I rebased that MR, and just wanted to deal with the other sub-optimal address calcs
15:38 karolherbst[d]: and I think the ugpr thing would be pretty trivial to handle inside `nir_opt_offset`
15:38 karolherbst[d]: or maybe in its own pass if we have nv specific intrinsics
15:38 gfxstrand[d]: karolherbst[d]: Yeah, same. I'm just not sure how much we want to add to the core NIR ops for this.
15:39 karolherbst[d]: yeah...
15:39 gfxstrand[d]: Even adding BASE has me nervous, given how different HW has different things in terms of signedness and the number of bits.
15:39 karolherbst[d]: right
15:40 gfxstrand[d]: Even NAK has the potential to screw it up if we're not super careful because `rZ` vs. an actual register with zero are different because they affect signedness.
15:41 gfxstrand[d]: I don't love that bit of the hardware... I get why they did it that way but still. 😠
15:43 karolherbst[d]: I mean.. either it's a constant address, or it's signed
15:44 karolherbst[d]: but yeah.. I can prototype based on your MR and see where I end up with
15:46 karolherbst[d]: but yeah.. the biggest limiting factor of `nir_opt_offset` is that it treats the offset as unsigned, which is fine if you limit the range.. mhhh but I think it's all doable
15:58 mhenning[d]: mohamexiety[d]: Yes, that happens in the original game. I spent a few hours trying to track it down a month or two ago and I suspect a compiler bug but I didn't manage to fix it.
15:58 mohamexiety[d]: I see, thanks!
15:58 gfxstrand[d]: karolherbst[d]: As long as running `nir_opt_offsets` and then maybe inserting an add later is no worse than not running `nir_opt_offsets` at all, it's still worth doing.
15:59 mohamexiety[d]: will open an issue for it later then and mention both games
15:59 gfxstrand[d]: gfxstrand[d]: Unfortunately, we might lose some CSE opportunities with offsets pulled out. But also, we might gain some.
15:59 karolherbst[d]: gfxstrand[d]: I mean.. how many shaders will have such big offsets anyway, I don't think it practically matters
15:59 karolherbst[d]: but yeah..
15:59 karolherbst[d]: naks own offset handling will take negative offsets into account
15:59 gfxstrand[d]: Yeah, it's always a matter of whether the fixup for the odd case ends up being worse for some reason.
16:00 karolherbst[d]: so it's really just niche corner cases where it matters
16:00 karolherbst[d]: well.. the stats on nir_opt_offsets are pretty promising so far
16:01 karolherbst[d]: but I want to get it working for global and scratch as well
16:01 karolherbst[d]: which is annoying, because opt offsets only deals with 32 bit alu...
16:01 karolherbst[d]: though.. not sure actually...
16:02 karolherbst[d]: ahh yeah, 32 bit only
16:02 gfxstrand[d]: Only 32-bit offsets or only 32-bit addresses?
16:02 karolherbst[d]: addresses
16:02 gfxstrand[d]: Also, for the UGPRs, are those 64 or 32-bit?
16:02 karolherbst[d]: both
16:03 karolherbst[d]: ehh wait...
16:03 gfxstrand[d]: Oh, sweet. I suspect UGPR will help a lot when the descriptor is uniform but the offset isn't.
16:03 karolherbst[d]: the GPR is both
16:03 karolherbst[d]: let me read
16:03 karolherbst[d]: ohh okay.. so
16:04 karolherbst[d]: if you have a GPR _and_ UGPR, the .E modifier affects the UGPR only
16:04 karolherbst[d]: and the GPR has its own .64 modifier
16:05 gfxstrand[d]: Okay, so if you do `.e` but not `.64`, you get 64-bit UGPR + 32-bit GPR?
16:05 karolherbst[d]: correct
16:05 gfxstrand[d]: Okay, that's also pretty useful
16:05 karolherbst[d]: yeah
16:05 karolherbst[d]: per thread offset done in 32 bit is huge
16:05 gfxstrand[d]: Because the base address will usually be a UGPR
16:06 karolherbst[d]: yeah
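A hedged C model of that addressing mode (field names made up; whether the 32-bit offset is sign-extended depends on the encoding details discussed above):
```c
#include <stdint.h>

/* .e without .64: 64-bit uniform base from a UGPR pair plus a 32-bit
 * per-thread GPR offset plus the immediate */
static inline uint64_t
ldg_addr_e32(uint64_t ugpr_base, uint32_t gpr_off, int32_t imm24)
{
   return ugpr_base + gpr_off + (int64_t)imm24;
}
```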
16:06 karolherbst[d]: not sure how that will all fit together, but I can prototype something and see if I get something sane out of it
16:09 karolherbst[d]: LDSM also has the UGPR nice.. so all IO ops are like that
16:09 karolherbst[d]: except shared, which also has this stride on top
16:10 karolherbst[d]: well except constant which is different anyway
16:19 gfxstrand[d]: Yeah, but shared is always 32-bit so that makes things a little different anyway
16:25 karolherbst[d]: do we have an opt that rebalances iadds to group uniform inputs? Might become useful here
16:27 karolherbst[d]: mhhh apparently not
16:29 karolherbst[d]: that should be super trivial in opt algebraic
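The rebalancing in question, illustrated in C (purely illustrative, not an actual pass):
```c
#include <stdint.h>

/* reassociate so the two warp-uniform operands are grouped and can be
 * computed once in a UGPR, leaving a single per-thread add */
static uint32_t addr_before(uint32_t tid, uint32_t ua, uint32_t ub)
{
   return (tid + ua) + ub; /* uniform adds interleaved with divergent */
}

static uint32_t addr_after(uint32_t tid, uint32_t ua, uint32_t ub)
{
   return tid + (ua + ub); /* (ua + ub) is uniform across the warp */
}
```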
16:55 gfxstrand[d]: Okay, it's a kernel bug. I've got a really easy reproducer case now
16:56 mohamexiety[d]: ooh?
16:56 chikuwad[d]: 👀
16:56 gfxstrand[d]: I'll post in a couple minutes
16:58 mohamexiety[d]: <a:vibrate:1066802555981672650>
17:02 gfxstrand[d]: That or I just found a bug in `NVK_DEBUG=push_sync` which I think might be the case
17:02 gfxstrand[d]: Because when I do a `vkWaitForFences()` through Vulkan it's fine but when I do `NVK_DEBUG=push_sync` it very much isn't
17:02 gfxstrand[d]: Wait... Never mind. I'm not resetting
17:02 gfxstrand[d]: But that shouldn't matter
17:03 gfxstrand[d]: Bingo! Got it to fail
17:03 gfxstrand[d]: Okay, we've got a repro case
17:05 gfxstrand[d]: https://gitlab.freedesktop.org/gfxstrand/crucible/-/commits/bench/submit-empty-cmd-wait
17:13 gfxstrand[d]: Doesn't always fail. Sometimes you have to run it a couple times.
17:13 gfxstrand[d]: airlied[d]: ^^
17:15 gfxstrand[d]: Okay, just pushed again. Now there are two versions: One which waits and one which doesn't. The non-wait version needs to be run with `NVK_DEBUG=push_sync` to repro.
18:06 gfxstrand[d]: So that raises the next question: Do I want to go hunting through IRQ code or do I want to plug in a different GPU and hope airlied[d] feels so inclined when he gets online in a couple hours?
18:17 airlied[d]: I will start hunting!
18:18 airlied[d]: I'll try and reproduce when I get going
18:18 gfxstrand[d]: gfxstrand[d]: Easiest way is to do `MESA_VK_ABORT_ON_DEVICE_LOSS=true NVK_DEBUG=push_sync _build/crucible run bench.queue-submit-empty-cmd` with that crucible branch
18:19 gfxstrand[d]: For some reason `NVK_DEBUG=push_sync` is a more reliable reproducer than `vkWaitForFences()`, probably because it burns less CPU on extra stuff and is therefore a tighter loop.
18:35 gfxstrand[d]: This could just be bad headers but I don't see ACQ_CIRC_GEQ in the Blackwell headers in OGK.
18:35 gfxstrand[d]: Or NON_STALL_INTERRUPT
18:36 gfxstrand[d]: They must be stripped. We'd be seeing things blow up if NON_STALL_INTERRUPT didn't exist anymore
18:38 gfxstrand[d]: But I do wonder if Blackwell added some sort of memory ordering bit and that's missing
18:39 gfxstrand[d]: Such that the write is arriving after the interrupt sometimes
18:43 airlied[d]: have you tried to reproduce on non blackwell?
18:43 gfxstrand[d]: Not with my crucible test, no.
18:43 gfxstrand[d]: I can throw it at Ampere easy enough
18:45 gfxstrand[d]: My Ampere box is running a bit of an older kernel but it seems good
18:46 gfxstrand[d]: Let me plug an Ada into my Blackwell box
18:49 gfxstrand[d]: Ada breaks, too
18:49 karolherbst[d]: mhhhh
18:50 karolherbst[d]: I looked at pred_ldg_nv and I'm kinda wondering if that's a good idea. Like we want to be able to turn smaller if/else trees into predication, and I don't think adding a predicate to all instructions just so we can do it in nir is a great approach there
18:50 gfxstrand[d]: karolherbst[d]: Yeah, Mel and I have had that argument
18:51 gfxstrand[d]: On the other hand, it cuts compile times by half
18:51 karolherbst[d]: nothing against adding IO intrinsics to deal with multiple sources, but the predication I feel needs its own solution
18:51 gfxstrand[d]: I'm not convinced it does
18:51 karolherbst[d]: do you want to add pred sources to all alus?
18:52 gfxstrand[d]: No. But 70% of our control-flow in your average DX shader is just load/store_global
18:52 karolherbst[d]: sure, but you still want to do it on anything else as well
18:52 gfxstrand[d]: After you get rid of that, most stuff is pretty straight-line.
18:52 karolherbst[d]: like it's beneficial for if/else clauses with like 5 or 6 instructions each
18:53 gfxstrand[d]: That's what we have `peephole_select` for
18:53 karolherbst[d]: nah, predication is better
18:53 gfxstrand[d]: Well, yes, predication avoids the select. I know this.
18:54 karolherbst[d]: not just that
18:54 karolherbst[d]: it also uses less cycles
18:54 karolherbst[d]: and it also has benefits with .reuse
18:54 gfxstrand[d]: .reuse is gone
18:54 karolherbst[d]: since when?
18:54 gfxstrand[d]: Blackwell
18:54 karolherbst[d]: heh
18:54 karolherbst[d]: still predication needs fewer cycles
18:54 gfxstrand[d]: fewer cycles than what?
18:54 karolherbst[d]: bcsel
18:54 gfxstrand[d]: Yes. Like I said, it gets rid of the sel
18:55 karolherbst[d]: with bcsel you calculate both branches and need to properly schedule
18:55 karolherbst[d]: with predication you only need each branch to properly wait, can even interleave things
18:55 gfxstrand[d]: Yes but that means we need to be able to reason about predication at a fairly high level.
18:55 gfxstrand[d]: Which means we need all the stuff my predication branch adds plus some pretty clever passes.
18:56 karolherbst[d]: I mean, you can do it in nak, it's not that complicated from a higher level pov
18:56 gfxstrand[d]: Like we need the scheduler to know that implicit dependencies don't matter if the two instructions have opposite predicates.
18:56 karolherbst[d]: well.. should do that anyway
18:57 gfxstrand[d]: In any case, yes, I'm aware of all the pieces we need here and starting with `ldg_pred` gets us part-way there. It's not a final solution but it lets us get a bunch of benefit and prove out pieces.
18:57 karolherbst[d]: we can also remove the predicate later if we have a proper solution, that's fine
18:58 gfxstrand[d]: Longer-term, we need something to lower control-flow and then clean up so we don't end up with a bunch of sel and can schedule stuff around. But control-flow lowering is a lot easier to do in NIR except that we can't lower load/store in NIR without adding a predicate to the op because they have side-effects.
18:58 karolherbst[d]: I'll probably have to add special nv variants for scratch and shared as well anyway..
18:58 karolherbst[d]: fun idea
18:59 karolherbst[d]: predicates in nir, but only allowed with unstructured CF 🙃
19:01 karolherbst[d]: but anyway... not sure if it's worth depending on your branch there, given I'll add intrinsics but with different parts added.. can always rebase one or the other way
19:02 karolherbst[d]: but I'm also not feeling like adding... 20 intrinsics to cover all the address spaces + atomics...
19:03 karolherbst[d]: maybe I just add one load/store pair with nir_var_mode as an index?
19:04 gfxstrand[d]: Could, I guess. <a:shrug_anim:1096500513106841673>
19:04 gfxstrand[d]: There's really only 4 per address space and two for local
19:04 gfxstrand[d]: So that's 10 total
19:04 karolherbst[d]: and atomics
19:04 gfxstrand[d]: that includes atomics
19:04 gfxstrand[d]: load, store, atomic, atomic_cas
19:04 karolherbst[d]: atomics need 2 each
19:05 karolherbst[d]: yeah, so local, global and shared, oh right.. local doesn't have atomics
19:05 karolherbst[d]: guess it's 10 then
19:05 karolherbst[d]: yeah maybe it's not too bad
19:05 karolherbst[d]: could then also handle the stride on local at the nir level
19:06 karolherbst[d]: dunno.. I'll see how I'll feel about it all
19:08 gfxstrand[d]: gfxstrand[d]: So I'm wondering if something regressed. My ampere box is on a slightly older version of Ben's 570 branch because that one was more stable for massive CTSing. My Blackwell is on a more recent version because it needed the GB20 patches.
19:08 karolherbst[d]: we also have load_shared_lock_nv, but I guess that one is ... kepler only or something?
19:08 gfxstrand[d]: Assuming Ada and Ampere are roughly equivalent here, that might indicate something regressed during Blackwell enabling.
19:08 gfxstrand[d]: karolherbst[d]: Yeah, it's kepler-only.
19:09 gfxstrand[d]: And no need to try and make that fancy. I don't care too much about Kepler atomic perf
19:09 karolherbst[d]: it doesn't have a ugpr variant anyway
19:09 karolherbst[d]: and the stride is turing+
19:09 gfxstrand[d]: Yeah so no point
19:10 gfxstrand[d]: We can still use the new intrinsics on Kepler with some asserts just to cut down on the size of the from_nir match but we don't need to worry too much about modding out the shared locking things to be fancy
19:10 karolherbst[d]: ohh that reminds me.. fmul has it too
19:10 karolherbst[d]: but different
19:11 karolherbst[d]: would be a nice thing for somebody to hook it up. Fmul can do a constant /8, /4, /2, x2, x4 or x8 in the encoding
19:11 karolherbst[d]: (but not ffma obviously)
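What hooking it up could fold, roughly (illustrative only):
```c
/* a multiply by a small power of two next to another multiply can
 * collapse into fmul's built-in /8../x8 scale modifier instead of
 * costing a second instruction */
static float scaled_mul(float a, float b)
{
   return (a * b) * 4.0f; /* in theory: one FMul with the x4 modifier */
}
```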
19:11 gfxstrand[d]: IDK how much that comes up but, sure, could be interesting.
19:12 karolherbst[d]: yeah....
19:12 karolherbst[d]: would be a good starter project
19:12 gfxstrand[d]: `nir_op_fmul3_small_pow2_nv`
19:12 karolherbst[d]: maybe I should throw it into an issue with a label
19:12 karolherbst[d]: if somebody wants to start hacking on nak, would be small enough to figure it all out or so
19:14 gfxstrand[d]: gfxstrand[d]: I'm gonna shut down both boxes and pull an Ampere out of my CTS box and plug it into my Blackwell box and see if it fails.
19:14 gfxstrand[d]: Unfortunately, IDK if I have a known-good git commit to start bisecting from
19:21 mohamexiety[d]: on games at least Ada didn't MMU fault on my end
19:21 mohamexiety[d]: but that was with 6.16rc1 and mesa 25.1
19:21 mohamexiety[d]: didn't try mesa 25.2 or higher
19:22 gfxstrand[d]: Okay, so I've run it 5 times on Ampere with no fails. I'm gonna plug Ada into my Ampere box and run with the older kernel.
19:23 gfxstrand[d]: It definitely seems worse on Blackwell than Ada
19:28 gfxstrand[d]: mohamexiety[d]: Yeah, I'm suspecting we might have multiple bugs. This IRQ bug is definitely a thing but it may not be responsible for everything we're seeing.
19:28 gfxstrand[d]: gfxstrand[d]: Okay, run like 8 or 10 just failed on Ampere
19:29 airlied[d]: I also experienced the weird delayed fault on Blackwell, and I think the fault actually happens right away; we just get a delay in being told about it
19:31 gfxstrand[d]: Okay, it's not a regression. First try of my test on Ada failed, even on the kernel that previously seemed solid on Ampere.
19:32 gfxstrand[d]: I can get Ampere to fail, it just takes a lot of tries.
19:32 gfxstrand[d]: Blackwell seems the easiest to get to screw up
19:33 gfxstrand[d]: Yeah, Ada is failing every time on my previously super stable gsp570 kernel
19:34 gfxstrand[d]: And Ampere is failing 1 in 10 or so on my Blackwell kernel. So it's not the kernel.
19:34 gfxstrand[d]: IDK if that makes me feel better or worse about the state of things
19:37 karolherbst[d]: mhhh, we could do this: https://gist.githubusercontent.com/karolherbst/17d6bd473fb86439d9de51df224a128f/raw/78f3fce9cb25287c0b7cff9c44b2d6b5c4c12a1c/gistfile1.txt I think
19:37 karolherbst[d]: ehh *.b64
19:37 karolherbst[d]: but as far as I understand things, we can load constant addresses with uldc directly into ugprs
19:38 gfxstrand[d]: Yeah, that's just the ugpr heuristics being dumb
19:39 karolherbst[d]: mhh sure? it's a load_global_constant on the nir side
19:39 karolherbst[d]: I think this will need work in from_nir to make that more optimal
19:39 gfxstrand[d]: Oh, do you mean the `ldg` can write UGPRs?
19:39 karolherbst[d]: dunno
19:39 karolherbst[d]: no
19:39 karolherbst[d]: it's uldc
19:39 karolherbst[d]: but
19:39 karolherbst[d]: you can give it a VA
19:39 marysaka[d]: gfxstrand[d]: hmm is it behaving the same way with 535? :aki_thonk:
19:40 karolherbst[d]: which isn't bindless ubo
19:40 karolherbst[d]: with a 32 bit offset even
19:40 gfxstrand[d]: What is it that `uldc` can do that we're not doing now that's in that snippet?:
19:40 karolherbst[d]: pull from a random VA
19:41 karolherbst[d]: that isn't a bindless ubo
19:41 karolherbst[d]: basically doing ldg.constant, just into a ugpr
19:41 gfxstrand[d]: Oh. Neat. Yeah, we should look into that
19:42 karolherbst[d]: I saw it in the coop matrix shader, so maybe I'll just add this
19:42 gfxstrand[d]: IDK how all it differs from `ldg.constant`. Might have caching implications.
19:42 karolherbst[d]: well
19:42 karolherbst[d]: it writes 0 on a page fault 😄
19:43 karolherbst[d]: I think...
19:43 karolherbst[d]: but yeah.. 64bit ugpr + 32 bit offset loading into a ugpr
19:43 gfxstrand[d]: That's kinda fine
19:43 gfxstrand[d]: I'm a little more concerned about whether or not it loads through the constant cache.
19:43 karolherbst[d]: which is funny, because ldg only knows 24 bit offsets
19:44 karolherbst[d]: right...
19:44 gfxstrand[d]: Right now we assume anything in UBO space may get promoted to the data cache but we assume the reverse never happens.
19:44 karolherbst[d]: let's see if something is mentioned there
19:45 gfxstrand[d]: If it's coherent with the data cache, then we can do it as an optimization whenever we would normally emit `ldg.constant`. If not, we'll have to tag what can and cannot be promoted so as to avoid landing SSBOs there when they shouldn't be.
19:45 karolherbst[d]: this raw uldc form also has an input predicate that sets the result to 0 when true
19:45 karolherbst[d]: so you can do the bound check and feed it into it
19:45 gfxstrand[d]: Yeah
19:46 gfxstrand[d]: Can also do that with a predicate but the 0 behavior has the advantage of not having to zero the destination first, which is neat.
19:46 karolherbst[d]: yeah
19:46 karolherbst[d]: also gets rid of a second op writing the 0
19:46 karolherbst[d]: well.. predicated
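A hedged C model of the raw uldc form as described (semantics assumed from this discussion, not from docs):
```c
#include <stdbool.h>
#include <stdint.h>

/* uniform load from a 64-bit VA plus a 32-bit offset; a true input
 * predicate forces the result to 0 instead of skipping the write */
static inline uint32_t
uldc_from_va(bool zero_pred, uintptr_t va_base, uint32_t off32)
{
   return zero_pred ? 0 : *(const uint32_t *)(va_base + off32);
}
```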
19:47 karolherbst[d]: mhhh
19:47 karolherbst[d]: anyway, not seeing it mentioning anything cache related
19:47 karolherbst[d]: but .constant also doesn't
19:48 karolherbst[d]: but it should be fine? I doubt it has different semantics compared to ldg.constant
19:49 karolherbst[d]: but yeah.. I don't think it's a good idea to use it for anything except load_global_constant
19:50 gfxstrand[d]: karolherbst[d]: I'm not comfortable with that assumption. They're different instructions, the normal variants of which go through different caches that are flushed/invalidated differently on the state side. Why should we assume they magically cross over because you set a bit?
19:51 karolherbst[d]: mhh, fair
19:54 gfxstrand[d]: But also, there are like 2 places where load_global UBO loads come from and we can probably detect both of them.
19:54 gfxstrand[d]: We just can't use it for `load_global` with `CAN_REORDER` set
19:55 gfxstrand[d]: airlied[d]: That's entirely possible, especially if everyone is choosing the same 5s timeout.
19:55 karolherbst[d]: not sure it's a good idea to use it for load_global anyway
19:55 gfxstrand[d]: marysaka[d]: I haven't tried with 535 but generally 570 is more stable than 535
19:56 marysaka[d]: right the MMU fault would likely not be related to GSP firmware anyway
19:56 gfxstrand[d]: The different caches are annoying. A lot of the advancements in NIR in recent years in this area have come from the RADV folks who are just screwed anyway because their different caches are for uniform vs. vector and UBO vs. SSBO doesn't matter.
19:57 gfxstrand[d]: But we have separate constant and data caches.
19:57 gfxstrand[d]: And Intel just got a mega data cache on DG2 and just use that for everything.
20:03 karolherbst[d]: right... I guess the question here is when load_global_constant is even emitted? Like where does it come from, because I don't think Vulkan SPIR-V has a concept of constant memory besides UBOs?
20:04 karolherbst[d]: the weird part is that ldc doesn't have this variant
20:04 gfxstrand[d]: I think it's roughly 3 places:
20:04 gfxstrand[d]: 1. `nir_lower_explicit_io()` on UBOs
20:04 gfxstrand[d]: 2. `nvk_nir_lower_descriptors()`
20:04 gfxstrand[d]: 3. `nak_nir_lower_non_uniform_ldcx()`.
20:04 gfxstrand[d]: karolherbst[d]: Yeah, that's.... odd. That also makes me call into question caching behavior
20:05 karolherbst[d]: yeah...
20:05 gfxstrand[d]: But it's also possible that it's internally going through the bindless path in which case it makes sense that it can only do it uniform.
20:06 karolherbst[d]: possibly.. just won't prefetch more data I guess
20:06 karolherbst[d]: if the bindless ubo form even does it
20:06 karolherbst[d]: but I think it's fair to assume it does
20:06 gfxstrand[d]: But I think if we keep it to `load_global_constant`, we should be safe WRT caches
20:06 karolherbst[d]: yeah
20:08 karolherbst[d]: now with the new shader-db repo, I might just play around and see what's impacted the most and check what difference it makes.. but anyway, got enough ideas to play around for the next couple of weeks 😄
20:10 karolherbst[d]: I can never remember the arg order of lea in the nak prints 🙃
20:12 gfxstrand[d]: We try to keep stuff in HW order but sometimes it's a little different
20:14 karolherbst[d]: the first source gets shifted, right?
20:14 gfxstrand[d]: Look at `Foldable`
20:15 karolherbst[d]: yeah
20:15 snowycoder[d]: We might not need worst_latency at all in the instruction delay calcs and it is probably better for performance.
20:15 snowycoder[d]: If we just assume nobody uses the registers and update that assumption later (i.e. we use an optimistic algorithm) the merge operation becomes idempotent.
20:16 karolherbst[d]: I see a bunch of ldg that operate on an address built from a lea of a shift and a same_bindless_ubo, so I was wondering..
20:16 karolherbst[d]: but prolly not much I can do about it
20:19 gfxstrand[d]: Okay, so I don't have `push_sync` but I wonder how much `NVK_DEBUG=push` would dump before DA:TV crashes.
20:25 gfxstrand[d]: How do I turn off this split lock stuff?
20:26 gfxstrand[d]: gfxstrand[d]: About 1GB
20:27 sonicadvance1[d]: gfxstrand[d]: `echo 0 > /proc/sys/kernel/split_lock_mitigate` or `split_lock_detection=off`
20:29 gfxstrand[d]: sonicadvance1[d]: The later is a boot param?
20:29 sonicadvance1[d]: yea, kernel boot parameter
20:33 gfxstrand[d]: [ 0.000000] Command line: BOOT_IMAGE=(hd1,gpt2)/vmlinuz-6.15.0-rc5-bskeggs-03.01-gb20x+ root=UUID=19e3c1c6-9ca5-4c6a-9688-ee72072aa817 ro rootflags=subvol=root rhgb quiet nvidia_drm.modeset=1 split_lock_detection=off modprobe.blacklist=nvidia
20:33 gfxstrand[d]: [ 0.000000] x86/split lock detection: #DB: warning on user-space bus_locks
20:33 sonicadvance1[d]: oop, typo'd it
20:33 sonicadvance1[d]: 😄
20:36 airlied[d]: okay reproduced on blackwell laptop
20:36 airlied[d]: my ad106 is not being so reproducy
20:36 gfxstrand[d]: Yeah, Blackwell is definitely easier to repro on
20:37 gfxstrand[d]: [ 0.000000] x86/split lock detection: disabled
20:37 gfxstrand[d]: Much better
20:38 sonicadvance1[d]: Might as well be the default option for desktop distros now that both AMD and Intel support it :BlobSweat:
20:39 gfxstrand[d]: At least if you plan on running closed-source Windows apps
20:39 sonicadvance1[d]: Don't worry, Steam Linux also does it by itself 😄
20:39 gfxstrand[d]: Does what? Has it disabled?
20:40 sonicadvance1[d]: Uses split-locks
20:40 gfxstrand[d]: Oh. Fantastic!
20:40 gfxstrand[d]: ANV used to have a split lock if you built it in 32-bit mode.
20:40 gfxstrand[d]: Paulo spent like a week hunting that down. Made a huge difference.
20:41 gfxstrand[d]: "OMG! Why does 32-bit perform so much worse than 64-bit?!?" Split locks...
20:41 sonicadvance1[d]: ryanh@orion-o6:~/.fex-emu/Telemetry$ grep "Split Locks: 1" ./*.telem | wc -l
20:41 sonicadvance1[d]: 41
20:41 sonicadvance1[d]: :BlobSweat:
20:42 sonicadvance1[d]: They definitely aren't great to have around, but making them slower doesn't help anyone but VPS providers.
20:43 karolherbst[d]: what's that split lock stuff anyway?
20:43 airlied[d]: https://lwn.net/Articles/806466/
20:43 sonicadvance1[d]: Atomic that crosses a cacheline, because x86 gives no flips about aligned atomics.
20:44 sonicadvance1[d]: (only cmpxchg8b/16b care about alignment)
20:44 gfxstrand[d]: Well, it gives flips. It punishes you quite badly. But it lets you get away with it because Intel loves their unaligned accesses.
20:44 gfxstrand[d]: And on a big machine, it's a good way to DOS the memory controller
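A minimal C repro of the pattern (the misaligned `_Atomic` cast is itself UB in C; it just mirrors what old compiled game code effectively does):
```c
#include <stdatomic.h>
#include <stdint.h>

_Alignas(64) static char buf[128];

int main(void)
{
   /* bytes 62..65 straddle the cacheline boundary at byte 64, so this
    * LOCK'd RMW becomes a bus lock (split lock) */
   _Atomic uint32_t *p = (_Atomic uint32_t *)(buf + 62);
   atomic_fetch_add(p, 1);
   return 0;
}
```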
20:45 karolherbst[d]: heh
20:45 asuasuasu[d]: yeah, for a desktop system it's kind of meaningless, and i believe a fair number of windows games tend to do this
20:45 karolherbst[d]: wouldn't it be almost faster to align them all in code?
20:45 sonicadvance1[d]: Oh definitely
20:45 gfxstrand[d]: For sure.
20:45 gfxstrand[d]: Good luck doing that on a Windows game from 2020
20:46 karolherbst[d]: fair
20:46 gfxstrand[d]: Or 2024, in this case. 🙃
20:46 karolherbst[d]: amazing
20:46 gfxstrand[d]: I should tell Laura that Frostbite has split locks. Maybe she can report that somewhere. 🤔
20:46 karolherbst[d]: heh
20:47 sonicadvance1[d]: ./ACOdyssey.exe.telem:64byte Split Locks: 1
20:47 sonicadvance1[d]: ./Borderlands2.telem:64byte Split Locks: 1
20:47 sonicadvance1[d]: ./Buddha.bin.x86.telem:64byte Split Locks: 1
20:47 sonicadvance1[d]: ./BurnoutPR.exe.telem:64byte Split Locks: 1
20:47 sonicadvance1[d]: ./Cities.x64.telem:64byte Split Locks: 1
20:47 sonicadvance1[d]: ./DaveTheDiver.exe.telem:64byte Split Locks: 1
20:47 sonicadvance1[d]: ./EASteamProxy.exe.telem:64byte Split Locks: 1
20:47 sonicadvance1[d]: ./Fable Anniversary.exe.telem:64byte Split Locks: 1
20:47 sonicadvance1[d]: ./FC3UpdaterSteam.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./ffxv_s.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./FORSPOKEN.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./GameApp_PcDx11_x64Final.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./GoW.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./GRW.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./GTA5.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./helldivers2.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./HoseItDown.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./mgsvtpp.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./MonsterHunterRise.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./NFS13.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./OldFriend.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./RelicDoW3.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./StreetFighter6.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./TheCrewMotorfest.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./Three_Kingdoms.exe.telem:64byte Split Locks: 1
20:48 gfxstrand[d]: heh
20:48 sonicadvance1[d]: ./trialsrising.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./UnravelTwo.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./upc.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./UplayCrashReporter.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: ./UplayService.exe.telem:64byte Split Locks: 1
20:48 sonicadvance1[d]: It's definitely pervasive.
20:48 sonicadvance1[d]: Oh god, IRC hated that.
20:50 gfxstrand[d]: Okay, so this is interesting... Right before we hit the MMU fault, I see something submitted which uses a QMD at `0x3e662f9f00`. The fault address is `0x3e662fa000`.
20:50 gfxstrand[d]: Maybe QMDs pre-fetch or overrun some now?
20:51 karolherbst[d]: mhhh
20:51 karolherbst[d]: is the QMD even that big?
20:52 karolherbst[d]: ehh wait..
20:52 karolherbst[d]: so that's at +0x100
20:53 gfxstrand[d]: Yes, it's at +1 QMD
20:53 gfxstrand[d]: Which could also mean nothing
20:54 karolherbst[d]: mhhhh
20:54 karolherbst[d]: what QMD version is used?
20:54 gfxstrand[d]: Whatever the Blackwell QMD is
20:54 karolherbst[d]: that's blackwell, right?
20:54 gfxstrand[d]: Pretty sure it's still 256B
20:54 karolherbst[d]: just asking, because `#define NVCEC0_QMDV04_01_OUTER_STICKY_OVERFLOW MW(3071:3071)`
20:54 gfxstrand[d]: uh...
20:55 karolherbst[d]: yeah...
20:55 karolherbst[d]: 5.0 also has it
20:55 karolherbst[d]: `#define NVCEC0_QMDV05_00_OUTER_STICKY_OVERFLOW MW(3071:3071)`
20:55 karolherbst[d]: but 5.1 is weird
20:56 karolherbst[d]: ends with `#define NVCEC0_QMDV05_01_SUB_TASK_GRID_DEPTH(i) MW((1183+(i)*416):(1168+(i)*416))`
20:56 gfxstrand[d]: So, yeah, definitely more than 256B
20:56 gfxstrand[d]: I think we found our culprit!
20:56 karolherbst[d]: you can read out the used QMD version via mme
20:56 karolherbst[d]: but yeah..
20:56 karolherbst[d]: probably just need more space there
20:56 gfxstrand[d]: Annoying
20:57 karolherbst[d]: anyway, easy to verify
20:57 karolherbst[d]: 5.1 looks really odd tho
20:57 karolherbst[d]: so I doubt it's used by default
20:57 gfxstrand[d]: We're using 5.0
20:57 karolherbst[d]: okay
20:58 karolherbst[d]: well.. that looks to be 0x180 sized now
20:58 gfxstrand[d]: But it still goes past 256B
20:58 karolherbst[d]: yeah.. it's bigger since 4.1
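The arithmetic lines up with the fault seen earlier (QMD at 0x3e662f9f00, fault at 0x3e662fa000):
```c
/* the QMDV05_00 header's last field ends at bit 3071, so a QMD is
 * (3071 + 1) / 8 = 384 bytes (0x180), not the 256 (0x100) assumed.
 * A QMD at ...f9f00 then extends to ...fa080, crossing the page
 * boundary at ...fa000 -- exactly where the MMU fault was reported. */
#define QMD_LAST_BIT   3071
#define QMD_SIZE_BYTES ((QMD_LAST_BIT + 1) / 8) /* 384 == 0x180 */
#define OLD_QMD_STRIDE 0x100
```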
20:59 gfxstrand[d]: 4.0 is also bigger
21:00 karolherbst[d]: yeah.., that's hopper?
21:00 gfxstrand[d]: yup
21:01 karolherbst[d]: so what's the new fancy stuff 😄
21:01 mohamexiety[d]: we should be using 5.1 btw
21:01 karolherbst[d]: ohhh..
21:01 karolherbst[d]: cga...
21:01 mohamexiety[d]: we arent
21:01 mohamexiety[d]: but we should be
21:01 karolherbst[d]: uhm..
21:01 karolherbst[d]: why?
21:01 karolherbst[d]: 5.1 looks cursed
21:01 mohamexiety[d]: it's the blackwell_b one
21:02 mohamexiety[d]: while 5.0 was for blackwell_a
21:02 karolherbst[d]: yeah but it looks cursed
21:02 karolherbst[d]: I don't think you wanna use it
21:02 karolherbst[d]: it kinda does weird sub task stuff?
21:03 mohamexiety[d]: I wonder if I confused the two blackwells then
21:03 karolherbst[d]: the hardware supports multiple versions
21:03 karolherbst[d]: 4.1, 5.0 and 5.1
21:03 mohamexiety[d]: but given it's CE97 it should be the one for B
21:03 mohamexiety[d]: but yeah
21:03 karolherbst[d]: it just looks like 5.1 is a bit of work figuring out how it all works
21:15 gfxstrand[d]: Ugh... There are a lot of QMD size assumptions baked in.
21:15 gfxstrand[d]: Oh, well.
21:32 mohamexiety[d]: there's something kinda :thonk: though
21:32 mohamexiety[d]: QMD overflow would explain blackwell faulting
21:33 mohamexiety[d]: but Ada and older use QMD v3 (or 2.something for turing iirc)
21:33 mohamexiety[d]: so why do those fault then
21:36 gfxstrand[d]: Ada still suffers from the fencing issue in the kernel
21:36 gfxstrand[d]: IDK that it suffers from the faulting
21:36 mohamexiety[d]: ahh so that's unrelated yeah
22:12 gfxstrand[d]: Okay, got some patches. I'm CTSing now while I walk home. I'll check DA:TV once I get home if things look good.
22:20 mohamexiety[d]: nice! <a:vibrate:1066802555981672650>
22:45 gfxstrand[d]: Okay, CTS on Blackwell looks good. Fired up DA:TV. I'm headed home for real now
22:47 gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36995
22:48 gfxstrand[d]: mohamexiety[d]: Feel free to throw some more games at it
22:48 mohamexiety[d]: will do tomorrow. very great work and nice catch! ❤️
23:00 airlied[d]: The sync one might not be fencing; it might be the push buffer tracking code
23:01 gfxstrand[d]: airlied[d]: That's entirely plausible
23:02 airlied[d]: The timeout I'm seeing is on job exec, not fence waits, which means no space to submit, I think
23:03 gfxstrand[d]: How's that possible? We're waiting after every submit in userspace. There shouldn't be anything to wait on unless it's the kernel's internal pushbuf pool.
23:06 gfxstrand[d]: But yeah, I could believe those aren't getting tagged correctly.
23:07 gfxstrand[d]: I've also seen weird allocation fails after I've gotten things good and wedged. I guess that could also explain that, potentially.
23:18 x512[m]: What should happen if the queue is full when doing a Vulkan queue submit? Block? Silently grow queue memory?
23:22 gfxstrand[d]: block
23:23 gfxstrand[d]: We have a submit thread that'll take care of queuing things up behind you.
23:23 gfxstrand[d]: Might need to enable the submit thread before you block, though.
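A hedged sketch of that behavior (every name here is a hypothetical stub, not NVK's real code):
```c
#include <errno.h>

struct queue;
struct submit;

/* hypothetical stubs for the two submit paths */
int  kernel_exec(struct queue *q, struct submit *s);        /* direct    */
void enable_submit_thread(struct queue *q);                 /* per above */
int  submit_thread_push(struct queue *q, struct submit *s); /* blocking  */

static int queue_submit(struct queue *q, struct submit *s)
{
   int ret = kernel_exec(q, s);
   if (ret == -EWOULDBLOCK) {
      enable_submit_thread(q);        /* enable before blocking     */
      ret = submit_thread_push(q, s); /* blocks until there's room  */
   }
   return ret;
}
```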
23:24 gfxstrand[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1409679810061533235/snapshot.png?ex=68ae4221&is=68acf0a1&hm=ba6c012887b1bfcb50377c04ae39cfc0e42256bed73f71c07aef4f3f620333e2&
23:25 gfxstrand[d]: And yes, 42 FPS is terrible. We need to figure that out.
23:47 gfxstrand[d]: But 42 FPS is a hell of a lot better than faulting before you even get to the menu.
23:48 airlied[d]: I should test vs bound textures sometime
23:48 gfxstrand[d]: Yeah, this would be a good excuse to rebase that and see how it goes.
23:48 airlied[d]: though it would depend on how the app uses them in flow control
23:49 gfxstrand[d]: I need to make sure a couple things stay on the rails tomorrow but then I'm gonna start trying to revive some perf compiler stuff again.
23:51 airlied[d]: I'll have to hassle you about the coopmat2 nir changes at some point 😛
23:51 airlied[d]: indirect function and funky call
23:55 mhenning[d]: gfxstrand[d]: have you considered getting more fps via code review
23:55 mhenning[d]: (it doesn't need to happen all at once but some incremental progress would be nice)
23:57 gfxstrand[d]: Yes. Your pre-RA scheduler is top of the list
23:58 gfxstrand[d]: And point taken. I need to be better about keeping on top of your stuff.