00:26airlied[d]: notthatclippy[d]: every so often on the rtx6000 we get "[ 327.227081] msg: 00000000: 00 00 00 00 00 00 00 00 c0 ad a1 dd 05 00 00 00 ................
00:26airlied[d]: [ 327.227084] msg: 00000010: 05 00 00 00 00 00 00 00 41 53 53 45 52 54 00 00 ........ASSERT..
00:26airlied[d]: [ 327.227086] msg: 00000020: 4e 56 5f 50 47 43 36 5f 41 4f 4e 5f 53 45 43 55 NV_PGC6_AON_SECU
00:26airlied[d]: [ 327.227088] msg: 00000030: 52 45 5f 53 43 52 41 54 43 48 5f 47 52 4f 55 50 RE_SCRATCH_GROUP
00:26airlied[d]: [ 327.227090] msg: 00000040: 5f 30 35 5f 30 5f 47 46 57 5f 42 4f 4f 54 5f 50 _05_0_GFW_BOOT_P
00:26airlied[d]: [ 327.227092] msg: 00000050: 52 4f 47 52 45 53 53 5f 4e 00 00 00 00 00 00 00 ROGRESS_N.......
00:26airlied[d]: [ 327.227094] msg: 00000060: fc ec 7d 01 85 02 00 00 00 00 00 00 00 00 00 00 ..}.............
00:26airlied[d]: " coming up in a nocat record
00:26airlied[d]: it seems like it's telling us something important, any idea what might be generating it in the fw and what we need to do to avoid it?
00:27airlied[d]: GFW_BOOT_PROGRESS is the register we check at devinit time to know the bios is finished, but we never look at it again
00:31airlied[d]: I'd say DPMS turning off HDMI display outputs seems to be related
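For anyone reading along, the ASCII part of that dump can be pulled out with a few lines of Python. This is only a sketch: the bytes are copied by hand from the "msg:" lines above (offsets 0x18 through 0x5f), and treating the payload as NUL-separated strings is an assumption about the record layout, not something the log confirms. The helper name is made up for illustration.

```python
# Bytes copied from the "msg:" lines above, offsets 0x18 through 0x5f.
DUMP = """
41 53 53 45 52 54 00 00
4e 56 5f 50 47 43 36 5f 41 4f 4e 5f 53 45 43 55
52 45 5f 53 43 52 41 54 43 48 5f 47 52 4f 55 50
5f 30 35 5f 30 5f 47 46 57 5f 42 4f 4f 54 5f 50
52 4f 47 52 45 53 53 5f 4e 00 00 00 00 00 00 00
"""


def nul_separated_strings(hex_text):
    """Hypothetical helper: split a hex dump into its non-empty ASCII strings.

    Assumes the strings are separated by NUL bytes, which is a guess
    about the record format rather than a documented layout.
    """
    raw = bytes.fromhex("".join(hex_text.split()))
    return [chunk.decode("ascii") for chunk in raw.split(b"\x00") if chunk]


strings = nul_separated_strings(DUMP)
print(strings)
# ['ASSERT', 'NV_PGC6_AON_SECURE_SCRATCH_GROUP_05_0_GFW_BOOT_PROGRESS_N']
```

The second string stops at "_N", which suggests the name was cut off by a fixed-size field, though that is a guess too.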
02:18airlied[d]: just fyi running the vk cts loop from 14610 on a laptop blackwell, seems to be surviving
05:11mohamexiety[d]: hm I actually didn’t try to reproduce and was just looking at the kernel vmm code. I’ll try as well on laptop Blackwell with 8GB
18:16_lyude[d]: airlied[d]: This might be the error notthatclippy[d] was referring to
18:16_lyude[d]: Also, am I right this is related to the compression work?
18:16_lyude[d]: [ 218.671381] nouveau 0000:c1:00.0: gsp: mmu fault queued
18:16_lyude[d]: [ 218.826186] nouveau 0000:c1:00.0: gsp: rc engn:00000001 chid:12 gfid:0 level:2 type:31 scope:1 part:233 fault_addr:0000003ff7df0000 fault_type:00000002
18:16_lyude[d]: [ 218.826197] nouveau 0000:c1:00.0: fifo:c00000:000c:000c:[nvim-gtk[43479]] errored - disabling channel
18:17mhenning[d]: Not sure. Does it reproduce easily?
18:19_lyude[d]: mhenning[d]: yes, and something very interesting to me about this: the issue I think is happening here is the same one I had to start disabling hw rendering on my text editor for, because it happens when I'm typing code quickly and neovim gtk's completion window gets a bunch of surface updates. Previously it wouldn't crash anything, but it would cause all of the surfaces for neovim gtk to start
18:19_lyude[d]: displaying garbage until I restarted the application
18:19_lyude[d]: something something smells like fences
18:20_lyude[d]: (since I assume the behavior change might literally be from turning compression on?)
18:20mhenning[d]: You can turn off compression by doing this:
18:20mhenning[d]: diff --git a/src/nouveau/vulkan/nvk_image.c b/src/nouveau/vulkan/nvk_image.c
18:20mhenning[d]: index 3cf2ae6d690..e9bebfb512f 100644
18:20mhenning[d]: --- a/src/nouveau/vulkan/nvk_image.c
18:20mhenning[d]: +++ b/src/nouveau/vulkan/nvk_image.c
18:20mhenning[d]: @@ -808,6 +808,7 @@ static bool
18:20mhenning[d]: nvk_image_can_compress(const struct nvkmd_pdev *nvkmd_pdev,
18:20mhenning[d]: const struct nvk_image *image)
18:20mhenning[d]: {
18:20mhenning[d]: + return false;
18:20mhenning[d]: if (nvkmd_pdev->kmd_info.has_compression) {
18:20mhenning[d]: if (image->plane_count > 1 ||
18:20mhenning[d]: image->vk.usage & (VK_IMAGE_USAGE_HOST_TRANSFER_BIT) ||
18:21_lyude[d]: gotcha, I'll build a patched RPM in a sec and see what happens
18:21mhenning[d]: If that doesn't fix it then probably file an issue
18:42marysaka[d]: What GPU are you testing on btw?
18:42marysaka[d]: I tried quite hard to reproduce it without success here :blobcatnotlikethis:
18:44mhenning[d]: Made a comment on the bug, but I reproduce the mmu fault with both a 3060 and a 5060
18:49marysaka[d]: that's odd tbh... I should maybe put my 3060 Ti on my test bench and see
18:50marysaka[d]: my initial assumption was that it was related to comptag not being set on Ampere but if Blackwell is affected that cannot be it
18:53marysaka[d]: (if you want to test, the check is at vmmgp100.c:480, but Hopper got rid of those)
19:13_lyude[d]: Yeah I don't think it's compression related (I think). But I do think I've got a nice solid reproducer for this hang 🙂
19:14_lyude[d]: Though, it does involve having to set up neovim-gtk...
19:26mohamexiety[d]: couldnt reproduce on a 5060 laptop either :/
19:35mohamexiety[d]: i wonder if it's faulty hardware tbh
19:39mhenning[d]: what would be faulty, exactly?
19:40mhenning[d]: It happens with two gpus, so it's not the gpu that's bad. What other kind of hardware failure would show up with compression on but not with compression off?
19:42marysaka[d]: Are the GPU page tables in RAM or VRAM?
19:43_lyude[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14662 another issue filed
20:10airlied[d]: do you have a really fast CPU?
20:10_lyude[d]: airlied[d]: 32 cores 64 threads, threadripper 9970X
20:10_lyude[d]: so, yeah, it's -very- fast lol
20:11_lyude[d]: I'm sure this probably opens me up to a new class of bugs lol
20:32marysaka[d]: _lyude[d]: do you have compression patches on your 6.18.5 tree?
20:33marysaka[d]: because I'm pretty sure those are only in 6.19-rc1 and later
20:33_lyude[d]: nope - I dont think I pulled in compression patches. let me get you the actual list of stuff I pulled in
20:34_lyude[d]: marysaka[d]: https://paste.centos.org/view/171bde75 probably want to save it somewhere since it'll go away after a day
20:35marysaka[d]: I see thanks!
20:35mohamexiety[d]: i have seen something similar to this before, but only corruption. never a MMU fault
20:35_lyude[d]: yeah - the mmu fault is new
20:35marysaka[d]: so not related to the kernel patches there at least
20:36_lyude[d]: it was corruption for me before as well
20:36marysaka[d]: also btw _lyude[d] my patches that you have aren't really ready, the channel error codepath was causing some oops for me so I still need to dig into that :blobcatnotlikethis:
20:37_lyude[d]: ah ok haha, I will go build a kernel without them in just a moment
20:37_lyude[d]: I can check if it actually makes a difference with this as well
20:39marysaka[d]: technically it's really `drm/nouveau/fifo: Remove nvkm_chan lock` causing issues (top patch requires IRQ to be enabled tho)
20:39_lyude[d]: eh, i just removed all of them. I don't actually remember why I added those patches
20:40marysaka[d]: I think it was to test suspend stuffs
20:40_lyude[d]: oh right
20:40_lyude[d]: ...i'll never get over how nice it is that I can just rebuild a kernel rpm in a few minutes when stuff like this comes up 🙂
20:41marysaka[d]: do you have any nice tooling around that? I'm still just doing a dumb modules install and install + regenerating the initramfs with dracut, but it's not great :blobcatnotlikethis:
20:41marysaka[d]: (that or I hot remove nouveau and insert the new one)
20:44_lyude[d]: Usually I either do exactly that, or I build kernel RPMs with mock depending on whether I plan on running the kernel fulltime or not
20:46_lyude[d]: but i do also put "Lyude-Test" in all of my kernel localversion files or something else I can grep for easily to make removing the old kernels as easy as possible
20:48_lyude[d]: also - i just had another crash doing something entirely different while waiting for the kernel build so maybe those patches are related
20:49marysaka[d]: I guess I missed one path that really needed that lock 🙃
20:49_lyude[d]: fwiw though - if it's not crashing I'm still p certain there's a bug there, if the patches are related I assume the behavior will go back to just corrupting surfaces
20:54marysaka[d]: yeah that lock is likely only hiding it...
21:04_lyude[d]: bingo
21:04_lyude[d]: yeah marysaka[d] that's exactly it, it turned into corruption this time
21:05marysaka[d]: uurgh... can you try with only the top patch removed? or maybe cherry-pick the lock one to be sure it's that and not the stop channel codepath being implemented :blobcatnotlikethis:
21:05_lyude[d]: sure thing
21:05_lyude[d]: building things fast is what this machine was made for after all 🙂
21:17_lyude[d]: marysaka[d]: bad news - it's not the lock. I'll try with only the top patch removed
21:18_lyude[d]: though, actually i'm realizing it couldn't be anything but that patch
21:18marysaka[d]: or the stop channel path
21:18marysaka[d]: as this is a "new" thing of that patchset
21:19_lyude[d]: that was what I meant (unless you meant specifically stop and not start/bind?)
21:20marysaka[d]: oh I misread sorry about that...
21:20_lyude[d]: np
21:21marysaka[d]: but yeah I suspect the stop path, bind/start should happen just a bit after where it used to and should not really have side effects
21:22_lyude[d]: do we think that's also the cause of the corruption here?
21:22_lyude[d]: e.g. your patches would fix the corruption if they didn't crash the thing?
21:22marysaka[d]: that would be weird tbh
21:22marysaka[d]: like those patches should only make sure to remove channels that are supposed to be stopped from the runlist
21:23_lyude[d]: gotcha
21:23mhenning[d]: marysaka[d]: want to send me your kernel config? wondering if compression is sensitive to something there
21:24mhenning[d]: _lyude[d]: commented on the bug but does NVK_DEBUG=zero_memory change anything?
21:24marysaka[d]: so to me if it's an MMU fault instead of a corruption, that means it's probably unrelated... or if it's related maybe the channel is unmapped before it's fully stopped? but idk how that's possible as stop channel should be waiting
21:24_lyude[d]: mhenning[d]: let's see
21:25marysaka[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1461471503026753557/config-6.19.0-rc1?ex=696aace7&is=69695b67&hm=f4b818076c515c585536af627bf73d6fc844f9dc5929ba70a202aabd163da857&
21:25marysaka[d]: mhenning[d]:
21:30_lyude[d]: mhenning[d]: will probably run this workaround for the rest of my workday to see if it really fixed the issue. but, with that turned on I'm not seeing any crashes
21:30_lyude[d]: v interesting
21:31mhenning[d]: Okay, it's an app bug if that fixes it. We can either report upstream or add it to driconf
21:32_lyude[d]: if it's an app bug I'm fairly certain it would be with gtk's vulkan renderer
21:32_lyude[d]: probably a good idea to report it
21:32_lyude[d]: we do some custom rendering in neovim-gtk but none of it is with shaders, it all goes through GTK's snapshot system that goes through vulkan
21:33mhenning[d]: yeah, gtk sounds like a good place to start looking
21:34_lyude[d]: fwiw too: it's only nouveau I've seen this behavior with
21:34_lyude[d]: so whatever the issue is I think other drivers do seem to be working around it
21:35mhenning[d]: We also have `NVK_DEBUG=trash_memory` which writes nonzero repeating patterns into memory and can sometimes make it easier to reproduce these issues
21:36_lyude[d]: BTW - since it doesn't crash my machine, if it helps at all I'm happy to try to get a recording of this. though i don't know what tool we use in place of apitrace for this these days
21:38mhenning[d]: for vulkan stuff it's typically renderdoc although there's an unresolved issue with it on nvk main
21:38mhenning[d]: anv and radv both have this in driconf which could be related:
21:38mhenning[d]: <!-- VK_MAKE_VERSION() encode for 4.0.0 to 4.20.2 -->
21:38mhenning[d]: <engine engine_name_match="GTK" engine_versions="16777216:16859138">
21:38mhenning[d]: <option name="vk_wsi_disable_unordered_submits" value="true" />
21:38mhenning[d]: </engine>
21:39mhenning[d]: so worth trying that on nvk to see if it helps
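For reference, the two numbers in `engine_versions` are just the packed Vulkan version values named in the XML comment. A quick sketch of the arithmetic, using the standard `VK_MAKE_VERSION` bit layout (major in bits 22 and up, minor in bits 12 to 21, patch in bits 0 to 11); the Python function names here are made up for illustration and are not part of Mesa or Vulkan:

```python
def vk_make_version(major, minor, patch):
    """Pack a version the way the VK_MAKE_VERSION() macro does."""
    return (major << 22) | (minor << 12) | patch


def vk_split_version(value):
    """Inverse of vk_make_version: unpack into (major, minor, patch)."""
    return (value >> 22, (value >> 12) & 0x3FF, value & 0xFFF)


lo = vk_make_version(4, 0, 0)
hi = vk_make_version(4, 20, 2)
print(f"{lo}:{hi}")          # 16777216:16859138
print(vk_split_version(hi))  # (4, 20, 2)
```

So the range in the driconf entry matches the 4.0.0 to 4.20.2 range in its comment.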
21:39marysaka[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1461474938249805834/config-6.19.0-0.rc5.260115.944aacb6.339.vanilla.fc43.x86_64?ex=696ab01a&is=69695e9a&hm=036727cc535e24bfa12fdc02bcb6100c0fa536974cdacca018c3fd4fbb23af82&
21:39marysaka[d]: marysaka[d]: that's the wrong config, that was my custom-built kernel (even tho it's similar as it's based on the fedora base config), sorry about that :EstelleFacepalm:
21:41_lyude[d]: also mhenning[d] uh. where does that driconf stuff live?
21:42mhenning[d]: src/util/00-mesa-defaults.conf and src/util/00-radv-defaults.conf
21:45mhenning[d]: Related MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38351
21:45_lyude[d]: gotcha
21:59_lyude[d]: mhenning[d]: looks like I was misled, just managed to get the bug to happen with zero_memory
22:00mhenning[d]: okay. still worth trying the driconf stuff
22:02_lyude[d]: mhenning[d]: I forgot to mention I did:
22:02_lyude[d]: diff --git a/src/util/00-mesa-defaults.conf b/src/util/00-mesa-defaults.conf
22:02_lyude[d]: index 4ddd2bf9571..8334e484e64 100644
22:02_lyude[d]: --- a/src/util/00-mesa-defaults.conf
22:02_lyude[d]: +++ b/src/util/00-mesa-defaults.conf
22:02_lyude[d]: @@ -1193,6 +1193,10 @@ TODO: document the other workarounds.
22:02_lyude[d]: <application name="X4 Foundations" executable="X4">
22:02_lyude[d]: <option name="force_vk_vendor" value="-1" />
22:02_lyude[d]: </application>
22:02_lyude[d]: + <!-- VK_MAKE_VERSION() encode for 4.0.0 to 4.20.2 -->
22:02_lyude[d]: + <engine engine_name_match="GTK" engine_versions="16777216:16859138">
22:02_lyude[d]: + <option name="vk_wsi_disable_unordered_submits" value="true" />
22:02_lyude[d]: + </engine>
22:02_lyude[d]: </device>
22:02_lyude[d]: <device driver="r300">
22:02_lyude[d]: <!-- Only one app can use Hyperz at a time. -->
22:02_lyude[d]: didn't seem to make any difference
22:04_lyude[d]: want me to grab a renderdoc?
22:04_lyude[d]: wait- right, issues with nvk head...
22:05mhenning[d]: You can try renderdoc, it works on some things but not others right now
22:05_lyude[d]: gotcha, I'm probably going to head out soon but I can give that a try tomorrow
22:06mhenning[d]: Sure, sounds like a plan
22:41notthatclippy[d]: _lyude[d]: (Sorry folks, I'm on vacation till end of the week. Not ignoring your pings, will read up and respond on Monday)