02:36fdobridge_: <Sid> just confirmed, device lost error happens on apps using vulkan natively as well
02:39fdobridge_: <Sid> in fact it just happened to me with vkcube :frog_gears:
02:46fdobridge_: <Sid> so if I'm reading this right, job timeout is a nouveau scheduler error
02:49fdobridge_: <Sid> https://github.com/torvalds/linux/blob/c0f65a7c112b3cfa691cead54bcf24d6cc2182b5/drivers/gpu/drm/nouveau/nouveau_exec.c#L202-#L217
03:10fdobridge_: <Sid> hm, fascinating
03:10fdobridge_: <Sid> https://github.com/torvalds/linux/blob/c0f65a7c112b3cfa691cead54bcf24d6cc2182b5/drivers/gpu/drm/nouveau/nouveau_sched.c#L15-#L25
03:11fdobridge_: <Sid> (I don't know much about what's going on I'm just throwing shit at the wall until something sticks)
03:19fdobridge_: <Sid> this appears to be where it arises from: https://github.com/torvalds/linux/blob/c0f65a7c112b3cfa691cead54bcf24d6cc2182b5/drivers/gpu/drm/nouveau/nouveau_sched.c#L385C33-L385C33
03:20fdobridge_: <Sid> now to figure out why
03:26fdobridge_: <Sid> ok, time to compile NVK manually :frog_gears:
03:30fdobridge_: <Sid> DEVICE_LOST errors are arising from the nvk queue handling
03:57fdobridge_: <Sid> added a few _debug_printf's to wherever nvk throws VK_ERROR_DEVICE_LOST, and in every game I try it's happening from here
03:57fdobridge_: <Sid> https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/vulkan/nvk_queue_drm_nouveau.c#L259
04:00fdobridge_: <esdrastarsis> But ENODEV is a kernel return value, so the issue is from the kernel space I think
04:01fdobridge_: <Sid> the nvk <-> kernel interaction happens on the few lines above it
04:02fdobridge_: <Sid> however, finding the kernel-side counterpart/handler for that interaction is gonna be a challenge for me e-e
04:07fdobridge_: <Sid> I think this is the moment I should ping someone smarter than me (psst airlied)
04:15fdobridge_: <Sid> nouveau kernel module throws -ENODEV in 107 places, using my approach isn't gonna be ideal
04:17fdobridge_: <Sid> but that's not gonna stop me 😈
04:22fdobridge_: <Sid> ok, since it's a drm related issue, I only need to pick the ENODEVs thrown within nvkm_* functions
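A search like this can be scripted rather than done by reading through all 107 sites by hand. The following is a minimal sketch, not something used in the conversation: it assumes a local kernel checkout, the names `find_enodev` and `only_under` are hypothetical, and filtering on a directory called `nvkm` is only a rough stand-in for "inside nvkm_* functions".

```python
import re
from pathlib import Path

# Matches a literal "-ENODEV" token, e.g. in "return -ENODEV;".
ENODEV_RE = re.compile(r"-ENODEV\b")


def find_enodev(root, only_under=None):
    """List (relative_path, line_number, text) for each -ENODEV in .c/.h files.

    `only_under` (hypothetical filter) keeps only files that have a path
    component with that exact name, e.g. "nvkm".
    """
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in {".c", ".h"}:
            continue
        rel = path.relative_to(root)
        if only_under is not None and only_under not in rel.parts:
            continue
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            if ENODEV_RE.search(line):
                hits.append((rel.as_posix(), lineno, line.strip()))
    return hits
```

Calling `find_enodev("drivers/gpu/drm/nouveau", only_under="nvkm")` from the top of a kernel tree would then list only the candidate lines under the nvkm directory.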
04:33fdobridge_: <Sid> basically, the kernel is returning an -ENODEV to the drmCommandWriteRead on nvk_queue_drm_nouveau.c#L253
04:34fdobridge_: <Sid> but where in the kernel that comes from is where I'm currently stuck
04:46fdobridge_: <mhenning> @tiredchiku It isn't too surprising that you'd get the error in push_submit on the userspace side - a lot of interactions with the kernel pass through that function
04:46fdobridge_: <mhenning> note that the call that returns ENODEV might not even be the interesting one - that just means "the gpu is dead," so the calls right before that error might be the ones that are actually blowing up
04:46fdobridge_: <mhenning> if you can't reproduce it with push_sync, then there's a reasonable chance that it's some sort of timing issue which will be tricky to track down
04:46fdobridge_: <Sid> I can reproduce it 100% consistently, both with and without push_sync
04:47fdobridge_: <Sid> have even attached the dump push_sync generates for one of the games that sees it happen
04:47fdobridge_: <Sid> I can get a fresh one though
04:48fdobridge_: <mhenning> Oh, I didn't realize that
04:49fdobridge_: <Sid> no worries, lot of talk has happened since 😅
04:50fdobridge_: <Sid> let me get a fresh one from a different game, because dirt rally only fails to render and continues to happily play the main menu music
04:52fdobridge_: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1187981068834455552/push_sync-dump-Metal-Hellsinger.log?ex=6598dc96&is=65866796&hm=fbf0022372215d79422aa997075451bad60e13a9c048e1fec7e08bc65ff98117&
05:11fdobridge_: <mhenning> Okay, I looked a bit at the push dumps and nothing stands out to me
05:12fdobridge_: <mhenning> From your dmesg, I think you're on a non-rtx turing card, which is interesting - it's totally possible we don't handle those correctly just yet
05:12fdobridge_: <mhenning> Also, sometimes non-gsp kernels still give better error messages in dmesg, so if you can try with gsp disabled those dmesg logs might be more useful
05:13fdobridge_: <Sid> yes, non-rtx turing card: 1660Ti
05:13fdobridge_: <Sid> and sure, I'll try disabling the GSP
05:18fdobridge_: <Sid> ..game loads (extremely slowly) on non-GSP
05:19fdobridge_: <mhenning> oh, that's interesting
05:19fdobridge_: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1187987859030941766/image.png?ex=6598e2e9&is=65866de9&hm=e6daead84e284b9253a74432b1ecb1594b6b0069bfe97a13cb4bdc4eb2fc7bab&
05:20fdobridge_: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1187987973472534558/dmesg.log?ex=6598e304&is=65866e04&hm=589099882c27329b1102195fd2a804a4b514bc51096e4420fcea6d7d844e69d2&
05:20fdobridge_: <Sid> a trap!
05:20fdobridge_: <Sid> 🪤
05:20fdobridge_: <!DodoNVK (she) 🇱🇹> A LEGO trap
05:21fdobridge_: <Sid> but yeah, I guess this is something for @airlied
05:21fdobridge_: <Sid> dodo can you try MW2012 without GSP to see if you run into device lost?
05:21fdobridge_: <Sid> or any other app/game you had the error with
05:22fdobridge_: <!DodoNVK (she) 🇱🇹> I think I never got that error without GSP
05:22fdobridge_: <mhenning> The trap makes it sound like something is still going wrong on non-gsp, just the error doesn't kill the context there for whatever reason
05:22fdobridge_: <Sid> yeah
05:24fdobridge_: <Sid> sure enough, non-GSP definitely renders more of Dirt Rally than GSP does
05:25fdobridge_: <Sid> main menu background video? render? doesn't show, but the menus and stages themselves load
05:26fdobridge_: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1187989605312638976/image.png?ex=6598e489&is=65866f89&hm=e63d0fda805df610ae8d97208dd011dfdcd6e5c05b1b6d92ee36dd9cba18890f&
05:37fdobridge_: <mhenning> I need to go to sleep, but if you want to keep poking at things, my guess is that you should be looking at the userspace code that's generating the push buffer that fails on gsp.
05:37fdobridge_: <mhenning> If you can find demo code that triggers the issue it could also be helpful to simplify it into a small reproducer
05:38fdobridge_: <Sid> I'll try capturing an apitrace, yeah
05:38fdobridge_: <Sid> I've got time, only 11am for me 😅
05:38fdobridge_: <Sid> have a good night!
05:57fdobridge_: <Sid> two apitraces captured: http://cloud.sloughland.dev/d/492adfc31dde49ffbd86/
05:57fdobridge_: <Sid> currently uploading, 25% done
05:58fdobridge_: <Sid> two apitraces captured: http://cloud.sidonthe.net/d/492adfc31dde49ffbd86/ (edited)
06:00fdobridge_: <redsheep> I haven't quite kept up on what you're testing here, would it be of any use for me to test if dirt rally works with GSP for me? I disabled my iGPU so I am not sure there'd be anything to learn from my hardware on this one.
06:01fdobridge_: <redsheep> Even if I turned it back on my iGPU is AMD so if this is expected to be related to the i915 crashes then I probably can't help
06:03fdobridge_: <Sid> this patchset includes a fix for the system-wide hangs on prime setups. However now we're coming across frequent VK_ERROR_DEVICE_LOST
06:03fdobridge_: <Sid> across games, both DXVK and Vulkan native
06:04fdobridge_: <redsheep> Has anybody tested that patch without being on a prime setup yet? Wonder if that is worthwhile
06:04fdobridge_: <Sid> it also includes fixes for some memory leaks
06:04fdobridge_: <Sid> among other things
06:06fdobridge_: <Sid> but yeah, NVK is returning VK_ERROR_DEVICE_LOST when performing a push_submit to the kernel/DRM infrastructure with GSP enabled
06:06fdobridge_: <Sid> rather frequently
06:08fdobridge_: <Sid> I should maybe check if it also happens in an nvidia-only environment, but I have to leave for drinks with friends in about an hour
06:09fdobridge_: <redsheep> Ok, I might as well test and see if it breaks single gpu with Ada, I will give it a go here in a bit.
06:10fdobridge_: <Sid> so far I've seen device_lost happen with:
06:10fdobridge_: <Sid> Quake 2 EX (dxvk)
06:10fdobridge_: <Sid> Quake 2 EX (vulkan)
06:10fdobridge_: <Sid> Dirt Rally
06:10fdobridge_: <Sid> Sea of Thieves
06:10fdobridge_: <Sid> Cultic
06:10fdobridge_: <Sid> Metal: Hellsinger
06:10fdobridge_: <Sid> Control
06:10fdobridge_: <Sid> that's about all that I've tested
06:10fdobridge_: <Sid> Dodo saw it on Most Wanted 2012 as well
06:11fdobridge_: <redsheep> Any notable games without any issues?
06:12fdobridge_: <Sid> uh, dunno about notable, but Cloudpunk has it happen less frequently
06:13fdobridge_: <Sid> as in I'm able to get in about 30-45 mins of playtime on average before running into DEVICE_LOST
06:14fdobridge_: <Sid> but then again, I also ran into device lost after letting vkcube run for long enough 😅
06:16fdobridge_: <Sid> this is the exact function that's throwing the error, verified by placing a _debug_printf before the return https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/vulkan/nvk_queue_drm_nouveau.c#L241-#L274
06:16fdobridge_: <Sid> not sure about the kernel side though, why the kernel is returning an ENODEV
06:17fdobridge_: <redsheep> Hmm. If it just happens randomly is it possible that it is also an issue on non-gsp, but it runs so much slower that you don't trip it so easily? Or is it crashing really fast?
06:17fdobridge_: <Sid> except for Quake 2 every game has it happen on boot
06:17fdobridge_: <Sid> some crash, some render only a black screen and play the menu music, some just freeze on a black screen
06:18fdobridge_: <Sid> (Control crashes, Cultic and DR play menu music, everything else freezes on a black screen)
06:18fdobridge_: <Sid> Cloudpunk, when it runs into it, freezes on whatever was the last frame and loops ~half a second of audio
06:19fdobridge_: <redsheep> Sounds like differences that would mostly come down to how the different games do their threading
06:19fdobridge_: <Sid> Quake 2 EX shows an error box and gracefully closes
06:19fdobridge_: <Sid> yeah
06:22fdobridge_: <Sid> but yeah, was just uploading a couple apitraces of the game that runs into it the quickest (Metal Hellsinger) to help reproducibility
07:27fdobridge_: <!DodoNVK (she) 🇱🇹> And 2005
07:37fdobridge_: <redsheep> To be honest, right now I am stuck on testing this because I don't understand how the mailing list works; I don't get how to turn this link to the patchset into something I can build. I see it was mentioned that it applies cleanly to 6.7-rc6. I have had success building the linux-git AUR package with .patch files included, but I am currently lost on how to get .patch files here, if that is even a thing.
07:38fdobridge_: <!DodoNVK (she) 🇱🇹> You can use `b4 am`
08:30fdobridge_: <redsheep> I must be doing something wrong. It doesn't seem to apply cleanly for me. Anyway, I decided to test a few more games and Minecraft sees a huge improvement with zink. 84 fps to 360 fps, more than quadruple the performance.
12:40fdobridge_: <Sid> I split the mbox into patch files, I can upload those if you want me to
17:20raket: Hello. What is the difference between the performance level 0d and 0f on a 780ti and what is the best choice to use?
17:28karolherbst: raket: there might not be one, but it might impact max clocks with NvBoost=2
17:28karolherbst: maybe even with 1
17:28karolherbst: but on some GPUs it's quite the same
21:34fdobridge_: <redsheep> Thanks, yeah if you want to upload them I'll give it a try. Wish I understood what went wrong with the mbox.
21:35fdobridge_: <Sid> I *just* closed my machine 15 mins ago e-e
21:35fdobridge_: <Sid> 0305...
21:35fdobridge_: <Sid> I'll upload them in ~5-6 hours
21:35fdobridge_: <redsheep> Ok, no rush
21:36fdobridge_: <Sid> or actually
21:36fdobridge_: <Sid> https://github.com/rrice/shell-scripts/blob/master/mbox-split.py
21:37fdobridge_: <Sid> https://gist.github.com/bonzini/d5bc1946475487167c529f9699e39512 (edited)
21:37fdobridge_: <Sid> that's what I used to split it
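For reference, splitting an mbox into one file per message can also be done with Python's standard `mailbox` module. This is a minimal sketch of the general technique and is not the linked script; the function name `split_mbox` and the output file naming scheme are assumptions made for illustration.

```python
import mailbox
import re
from pathlib import Path


def split_mbox(mbox_path, out_dir):
    """Write each message of an mbox to its own numbered .patch file.

    Returns the list of written paths, in mbox order. Messages are written
    as-is, so any transfer encoding (e.g. quoted-printable) is preserved.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    box = mailbox.mbox(str(mbox_path))
    try:
        for index, msg in enumerate(box, 1):
            subject = str(msg.get("Subject", "no-subject"))
            # Turn the subject into a filesystem-friendly slug.
            slug = re.sub(r"[^A-Za-z0-9]+", "-", subject).strip("-")[:60] or "patch"
            target = out_dir / f"{index:04d}-{slug}.patch"
            target.write_bytes(msg.as_bytes())
            written.append(target)
    finally:
        box.close()
    return written
```

The numeric prefix keeps the files in series order when they are sorted by name, which matters when a packaging script applies them one after another.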
21:38fdobridge_: <redsheep> Cool, I'll see if that fixes it for me
21:56fdobridge_: <redsheep> It's building now, this is just what I needed.
22:19fdobridge_: <redsheep> Out of the games listed I only had Control ready to go, but if that was one of the games crashing rapidly then so far I'd say not being on a prime setup or being on Ada makes the patches pretty stable. Spent 5 minutes in Control changing settings and alt tabbing like crazy with no crashing.
22:52fdobridge_: <karolherbst🐧🦀> though `git am` should just eat those files directly
22:53fdobridge_: <karolherbst🐧🦀> ohh wait, that's for kernel packaging patching stuff?
22:53fdobridge_: <karolherbst🐧🦀> pain
22:53fdobridge_: <karolherbst🐧🦀> (it should just support it anyway 😛 )
22:53fdobridge_: <!DodoNVK (she) 🇱🇹> I see random =09's
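Sequences such as `=09` look like quoted-printable escapes (0x09 is a tab), which mail tools normally decode but plain patch tools do not. As a minimal sketch, assuming the files really are raw mail messages with a `Content-Transfer-Encoding: quoted-printable` header, the body can be decoded with Python's standard `email` package; the function name `decoded_body` is hypothetical.

```python
import email


def decoded_body(raw_message: bytes) -> bytes:
    """Return the message body with its transfer encoding undone.

    Relies on the Content-Transfer-Encoding header; for quoted-printable
    this turns "=09" back into a tab and joins soft line breaks.
    """
    msg = email.message_from_bytes(raw_message)
    return msg.get_payload(decode=True)


raw = (
    b"Subject: [PATCH] example\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"+=09return -ENODEV;\n"
)
print(decoded_body(raw))  # b'+\treturn -ENODEV;\n'
```

This only handles single-part messages; a multipart mail would need each part decoded separately.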
22:54fdobridge_: <redsheep> That's what I was doing with the resulting mbox from b4 am, but it was saying something about an index not matching on an r535 file
22:54fdobridge_: <redsheep> Anyway, another 10 minutes in Talos and 20 minutes of The Witness later I still can't make my known working games crash with those patches
23:10fdobridge_: <redsheep> So... I was preparing a performance comparison of Minecraft on zink+nvk vs the blob, I went to hit save&exit and lo and behold:
23:10fdobridge_: <redsheep> https://cdn.discordapp.com/attachments/1034184951790305330/1188257349790613544/image.png?ex=6599dde4&is=658768e4&hm=512e2ddfaca492f7d13c89ee27df39fa2c4440d43d1b9811ccceb7aa2bd77024&
23:11fdobridge_: <redsheep> That error might be quite a lot more rare with my setup, but it is not unique to prime or little turing after all.
23:12fdobridge_: <!DodoNVK (she) 🇱🇹> Can you check dmesg?
23:12fdobridge_: <redsheep> Sure. For reference I played with zink+nvk saving and exiting a number of times over about 6 hours yesterday before adding these patches without issue
23:14fdobridge_: <redsheep> Yep there it is:
23:14fdobridge_: <redsheep> ```[ 3436.313159] nouveau 0000:01:00.0: Render thread[5387]: job timeout, channel 120 killed!```
23:15fdobridge_: <redsheep> That's the only relevant line
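When comparing reports from several machines, lines like the one above can be pulled out of a saved dmesg log mechanically. The sketch below is an assumption-laden illustration: the regular expression is derived only from the single line quoted above, other kernels may format the message differently, and the name `find_job_timeouts` is hypothetical.

```python
import re

# Pattern modelled on the one dmesg line quoted in the log; not an
# authoritative description of the nouveau message format.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+nouveau\s+(?P<pci>[0-9a-f:.]+):\s+"
    r"(?P<task>.+?)\[(?P<pid>\d+)\]:\s+job timeout, channel (?P<channel>\d+) killed!"
)


def find_job_timeouts(dmesg_text):
    """Return one dict per 'job timeout' line found in the given dmesg text."""
    hits = []
    for line in dmesg_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            hits.append({
                "time": float(m.group("time")),    # seconds since boot
                "pci": m.group("pci"),             # PCI address of the GPU
                "task": m.group("task"),           # name of the submitting task
                "pid": int(m.group("pid")),
                "channel": int(m.group("channel")),
            })
    return hits
```

Fed the line above, this yields the task name `Render thread`, PID 5387 and channel 120, which makes it easier to see whether the same kind of thread is involved across different games.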
23:18raket: karolherbst: ok thanks! i'll go with 0f pstate just because. the pstate reports the same speed on AC on both settings. thanks!
23:24fdobridge_: <!DodoNVK (she) 🇱🇹> Another job timeout 🤔
23:25fdobridge_: <karolherbst🐧🦀> can be some multi-threading bug
23:26fdobridge_: <karolherbst🐧🦀> tried zink+nvidia as well?
23:26fdobridge_: <karolherbst🐧🦀> wouldn't surprise me if that crashes or hangs the GPU as well
23:31fdobridge_: <redsheep> I was going to show zink+nvk vs blob on Windows, I don't have the nvidia drivers installed on linux anymore. What's the preferred approach to pick which driver you boot with if you have nvidia installed?
23:32fdobridge_: <redsheep> I uninstalled it when I wanted to test nvk because I wanted no chance of that driver messing things up
23:40fdobridge_: <redsheep> That comparison didn't turn out how I expected, turns out the nvidia driver does quite well here: 747 fps vs 353 on zink+nvk
23:40fdobridge_: <redsheep> https://cdn.discordapp.com/attachments/1034184951790305330/1188264964268511382/2023-12-23_16.04.18.png?ex=6599e4fc&is=65876ffc&hm=2acab17dfb143e8846eb0c66873860fae03578f89f06622441202592a88eab55&
23:40fdobridge_: <redsheep> https://cdn.discordapp.com/attachments/1034184951790305330/1188264965426131044/2023-12-23_16.38.51.png?ex=6599e4fc&is=65876ffc&hm=452da3085c896f0dcecd3a58bea45f20e07768e245871f85a8cb767d9acd5c9d&
23:48fdobridge_: <esdrastarsis> And it looks like the nvidia version has more shadows
23:48fdobridge_: <redsheep> It was a different time of day, I didn't have the day cycle off