16:43gfxstrand[d]: How do I tell Proton to pretend I'm an AMD GPU so the game doesn't try to use NVAPI that doesn't work?
16:43gfxstrand[d]: `PROTON_DISABLE_NVAPI=1` just makes the game crash.
16:45pixelcluster[d]: try `PROTON_HIDE_NVIDIA_GPU=1`?
16:48gfxstrand[d]: Thanks
16:49gfxstrand[d]: Trying to figure out how to get *Dragon Age: The Veilguard* to launch.
16:50gfxstrand[d]: It seems to just sit and spin forever
16:50gfxstrand[d]: Oh, shit. My GPU fell off the bus. Should have checked dmesg
16:51gfxstrand[d]: I should probably be trying this on the desktop where things are a tad more robust
17:01gfxstrand[d]: Okay, now that I have my GPU back and Nvidia shit disabled I'm getting illegal instruction encoding errors. That's gonna be annoying to track down...
17:03gfxstrand[d]: I mean, once I track down what shader is failing it's pretty easy to fix them usually but finding that needle in the haystack is the hard part.
17:04gfxstrand[d]: At least I have enough that I can file a bug now
17:09gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12183
17:11gfxstrand[d]: I really wish I had good tooling for narrowing that down within a game. It's not too bad in a CTS test but in a full game it's a PITA. As long as it renders *something* and doesn't crash, you can get a RenderDoc trace and go from there. If it crashes it's way more annoying.
17:12gfxstrand[d]: Maybe I can hack up NVK to log a bunch of stuff like shaders uploaded and pushbufs. That might let me narrow it down.
17:12gfxstrand[d]: I'll need to dump a mess of stuff, though.
17:13pixelcluster[d]: do you not get a program counter or anything?
17:13gfxstrand[d]: It's a GPU crash. I get very little
17:13gfxstrand[d]: If there is a way to get the PC out of the GPU, we don't know what it is.
17:13pixelcluster[d]: oh that's sad
17:13gfxstrand[d]: Not as nice as AMD, I'm afraid
17:24notthatclippy[d]: Did you try Ben's new nouveau tree and the latest gsp.bin? Should give you more info.
17:24gfxstrand[d]: I think I can come up with something, though. I just need to log a pile of stuff to files somewhere. Every shader upload, every pushbuf. Maybe even split pushbufs harder so I can figure out exactly which one hung.
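A rough sketch of the "log a pile of stuff to files" idea, written in Python only to show the shape of it. This is not NVK code: `DumpLogger` and `last_dump` are hypothetical names, and the assumption is that each shader upload or pushbuf submission gets written to its own sequence-numbered file and synced to disk before the GPU sees it, so the highest-numbered file left after a hang points at the likely culprit.

```python
import os
from pathlib import Path


class DumpLogger:
    """Write every blob to its own sequence-numbered file (hypothetical helper)."""

    def __init__(self, out_dir):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)
        self.seq = 0

    def dump(self, kind, data):
        # kind is a free-form label such as "shader" or "pushbuf".
        path = self.out_dir / f"{self.seq:06d}_{kind}.bin"
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            # Force the bytes to disk so the file survives a hang or reset.
            os.fsync(f.fileno())
        self.seq += 1
        return path


def last_dump(out_dir, kind):
    """Return the newest dump of the given kind, or None if there is none."""
    matches = sorted(Path(out_dir).glob(f"*_{kind}.bin"))
    return matches[-1] if matches else None
```

Zero-padding the sequence number keeps a plain lexical sort in submission order. The `fsync` on every write is slow, but it is what makes the last file on disk trustworthy after a crash.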
17:25gfxstrand[d]: notthatclippy[d]: No, I haven't. Where do I find it all?
17:26notthatclippy[d]: gfxstrand[d]: https://discord.com/channels/1033216351990456371/1034184951790305330/1307923889040654360
17:27notthatclippy[d]: I think it has the new log entries already added as well.. lemme check, might need a small patch on top of it.
17:32notthatclippy[d]: Yep, it's there. Just look for this in dmesg, has more info now: https://gitlab.freedesktop.org/bskeggs/nouveau/-/blob/01.03-r565/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r565/fifo.c?ref_type=heads#L39
17:34gfxstrand[d]: Yeah, but I'm not getting faults. I'm getting illegal instruction encoding errors
17:34gfxstrand[d]: What I really need is a 64-bit instruction address to the bad instruction
17:35gfxstrand[d]: I guess that might be possible if I figured out how to write exception handlers. (I think that's a thing?)
17:37notthatclippy[d]: Meh. This info does exist and can be fetched from GSP, but nouveau KMD doesn't do it even in the latest tree...
17:38notthatclippy[d]: It might be an easy patch if we combine it with a "send to NV to decode and get plaintext back" workflow.
17:39notthatclippy[d]: tl;dr there's an RPC you can send to GSP and get a lot of the diagnostics in binary protobuf format, and then decode in userspace. But we don't publish the full protobuf spec for _reasons_ and this bit is part of the not-published bucket.
17:41gfxstrand[d]: Yeah, if we could get something out in debugfs or similar, it'd be useful.
17:43notthatclippy[d]: I'll get back to you on that in about 30 hours. Don't go wasting too much time till then.
17:45notthatclippy[d]: Alternatively... what you could do is run the NV driver stack but hijack the SASS write and replace with your own. Then, when it goes boom, run `nvidia-bug-report.sh`
17:47skeggsb9778[d]: gfxstrand[d]: do you get "rc" messages + your channel being killed when it happens?
17:48skeggsb9778[d]: if so, with r565 if you have nouveau.debug=gsp=debug, you'll (probably) see a whole bunch of POST_NOCAT_RECORD along with the RC_TRIGGERED message from gsp, that appear to have related info in it (though you'll have to look at a hex dump in dmesg...)
17:48skeggsb9778[d]: maybe it'll be hiding in there somewhere, i still need to look into what that POST_NOCAT_RECORD is about properly - i've just silenced it for now
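A small sketch for pulling the relevant lines out of a dmesg capture. It assumes the names mentioned above (`RC_TRIGGERED`, `POST_NOCAT_RECORD`) appear verbatim in the log text, which has not been checked here; `gsp_rc_lines` is a hypothetical helper and it does nothing to decode the hex dump.

```python
# Marker strings taken from the discussion; whether they show up verbatim
# in dmesg output is an assumption.
MARKERS = ("RC_TRIGGERED", "POST_NOCAT_RECORD")


def gsp_rc_lines(dmesg_text, markers=MARKERS):
    """Return the lines of a dmesg capture that mention any marker."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(marker in line for marker in markers)
    ]
```

The input can come from something like `subprocess.run(["dmesg"], capture_output=True, text=True).stdout` after booting with `nouveau.debug=gsp=debug`.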
17:50gfxstrand[d]: I'm building the 565 kernel now
17:52gfxstrand[d]: Then I need to figure out how to in-place update my firmware
17:53skeggsb9778[d]: there's a linux-firmware tree on the same gitlab if that helps
18:01gfxstrand[d]: Yeah. But I'll probably just copy the files over because I don't want to muck about with building my own Fedora package.
18:54mhenning[d]: gfxstrand[d]: I have a hack that runs every nak shader binary through nvdisasm and checks if it failed with an error or not, as a check for instruction encoding issues
18:55mhenning[d]: it gives some false positives because of the graphics instructions that are missing from nvdisasm, but it might be helpful
18:57mhenning[d]: I've also been thinking about writing unit tests of our instruction encodings that encode an instruction, decode it with nvdisasm, and then check if the output is what we expect
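The nvdisasm check described above could look roughly like this outside of Mesa. This is a sketch, not the code from the linked commit: `disasm_ok` is a hypothetical helper, the shader is assumed to already be dumped to a raw binary file, and the `--binary <SMxy>` option and the `SM75` default are assumptions to adjust for the installed CUDA toolkit and target GPU.

```python
import subprocess


def disasm_ok(binary_path, sm="SM75", tool=("nvdisasm",)):
    """Run a raw shader binary through nvdisasm and report whether it decoded.

    Returns (ok, stderr_text). `tool` is the command prefix so a different
    path or a stand-in program can be substituted.
    """
    cmd = [*tool, "--binary", sm, str(binary_path)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # A non-zero exit status is treated as "failed to decode".
    return result.returncode == 0, result.stderr.strip()
```

A failure here is only a hint: as noted, nvdisasm lacks some graphics instructions, so valid shaders can be flagged too. If nvdisasm is not installed, `subprocess.run` raises `FileNotFoundError`.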
19:08gfxstrand[d]: Oh, that might help. Got it in a branch?
19:09mhenning[d]: gfxstrand[d]: https://gitlab.freedesktop.org/mhenning/mesa/-/commit/751fcb42bc841daa98b237c665252aeb30dd9731
19:14gfxstrand[d]: Sweet, thanks!
19:14gfxstrand[d]: I'll give it a try later
19:15gfxstrand[d]: I'm "not working" right now so things are pretty ad-hoc.