12:45fdobridge: <pavlo_it_115> @gfxstrand Hello! So, where should I start my studies? Reading patches or something?
12:47fdobridge: <pavlo_it_115> https://nouveau.freedesktop.org/IntroductoryCourse.html
12:47fdobridge: <pavlo_it_115> As I understand it, you need to start with this?
14:10f_: karolherbst: It happened again.
14:10f_: The issue where it still "works" but where the monitors are slow and barely usable
14:12f_: I think the module crashed.
14:13karolherbst: have anything in dmesg?
14:15f_: Trying to blindly get ssh access
14:15f_: but last time I had this issue there was a bunch of oopses.
14:16f_: I tried switching to a tty and..well..laptop monitor is now off, probably until I reboot
14:17karolherbst: maybe you should set up ssh before running into the problem
14:17f_: yeah..
14:19f_: Oh nice
14:19f_: suspending and getting out of suspend got me to a tty
14:20f_: while it's still semi-hanging, since I have kernel logs on tty I can have a look at some of it
14:20f_: and I see a kernel oops..
14:25f_: I'll try getting some dmesg logs and get back to you.
14:30f_: Got some logs
14:30f_: dmesg was full however so I couldn't get all of it.
14:30f_: Looks like that thing was really broken
14:32f_: karolherbst: https://0x0.st/HRsN.txt
14:32karolherbst: ahh.. invalid display state or something
14:32karolherbst: that's probably for Lyude to look into
14:32f_:looking for his previous dmesg log..
14:33karolherbst: f_: mind filing a proper report on https://gitlab.freedesktop.org/drm/nouveau/-/issues and attaching a full dmesg (via `journalctl --dmesg --no-hostname` or so)
14:33karolherbst: it's kinda odd that it happens randomly, maybe some undefined behavior going on here
14:33f_: Is a bare `dmesg` enough? I don't have journalctl available on my computer.
14:34karolherbst: only if it contains everything since booting
14:34f_: Oh here's some more logs from before https://bin.vitali64.duckdns.org/65b3f2a2
14:35f_: karolherbst: Sure
14:35karolherbst: mhh, okay those seem to be time outs
14:35karolherbst: which kinda makes sense if the hardware isn't programmed correctly
14:35karolherbst: but anyway... a full dmesg is kinda required to get a better picture of the situation
14:36karolherbst: maybe you have a logger logging it somewhere?
14:36f_: I also had some stuff show up on my laptop monitor for a second
14:36f_: Took a picture of it, I'll let tesseract parse it and send it here
14:36karolherbst: I think syslog might save it to `/var/log/kern.log`?
14:36karolherbst: nah.. don't bother with a picture
14:37karolherbst: we really need the full dmesg anyway
14:38fdobridge: <gfxstrand> That's really out of date
14:39fdobridge: <gfxstrand> Reading code wouldn't be a horrible start
14:39fdobridge: <gfxstrand> I've been doing a bit of thinking about possible starter projects.
14:48f_: Oh great
14:48f_: I have more logs.
14:58f_: karolherbst: Do you *really* want the entire log? This machine has been up for ~2 days
15:02f_: and I just let it suspend when I'm not using it.
15:12fdobridge: <zmike.> My big zink starter tasks ticket has consistently yielded dividends
15:12fdobridge: <zmike.> You might consider something similar
15:16karolherbst: f_: yes
15:16f_: ¯\_(ツ)_/¯
15:16f_: Sure, I guess.
15:17f_:reading https://nouveau.freedesktop.org/Bugs.html
15:28f_: karolherbst: https://gitlab.freedesktop.org/drm/nouveau/-/issues/330
15:41karolherbst: mhhhh
15:41karolherbst: 2024-02-22 kern.info: [39508.186315] nouveau 0000:01:00.0: devinit: 0x00006699[0]: script needs OR link
15:43karolherbst: I wonder if the misprogramming is just a side effect from that
16:23f_: mhhhh
16:24f_: Do you think it's a hardware issue?
16:30karolherbst: nah
16:30karolherbst: just us not parsing scripts in the vbios correctly
16:30karolherbst: devinit is kind of a scripting language, and some of the opcodes aren't handled fully or at all
16:30karolherbst: and that just seems to be some of it
16:31karolherbst: and devinit is kinda important to get the hardware into a proper initial state and if that's messed up it _can_ mess up display
16:31karolherbst: there are also scripts called when you connect a display and stuff
16:31karolherbst: it's just hitting this warning: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nvkm/subdev/bios/init.c?h=v6.8-rc5#n105
16:32karolherbst: and "OR link" is something display related anyway
16:35f_: makes sense.
16:36f_: Need a vbios dump?
16:36karolherbst: could help
16:36karolherbst: there will be one at `/sys/class/drm/card0/` or so
16:36f_: Yeah I know where it is.
16:37f_: And it's in debugfs :P
16:39f_: Sent.
18:42fdobridge: <gfxstrand> @karolherbst does `NOUVEAU_GETPARAM_PTIMER_TIME` give us the same timestamp as timestamp queries?
18:45fdobridge: <karolherbst🐧🦀> I hope so?
18:46fdobridge: <karolherbst🐧🦀> at least I'd assume that the timestamp you get through mmio is the same as the query one...
18:49fdobridge: <phomes_> I tested the updated pipelines MR. Still working well for the vulkan games I test with
18:50fdobridge: <!DodoNVK (she) 🇱🇹> How about OW2?
18:51fdobridge: <phomes_> I mostly test on games that do vulkan directly
18:55fdobridge: <phomes_> one thing I would like some feedback on. I have two games that check the vulkan driver version when they see the nvidia vendor id. They just compare it and complain that the driver version is too old. I was working on a patch to add support for the driconf "force_vk_vendor" to override the vendor id. But many games use it because they need to fake the intel id. Maybe we get some unintended behavior if they also start faking it for nvidia. I coul
19:26fdobridge: <gfxstrand> Okay. Good.
19:27fdobridge: <karolherbst🐧🦀> at least we treat it as "GPU time" inside gl
19:27fdobridge: <karolherbst🐧🦀> but the GPU time is also supposed to be synchronized with the host time
19:37fdobridge: <gfxstrand> Okay, that's what I wanted to know
19:38fdobridge: <gfxstrand> Oh, well that's frustrating
19:38fdobridge: <gfxstrand> They shouldn't be checking the driver version unless they're also checking that the driver is `NVIDIA_PROPRIETARY`
20:03fdobridge: <gfxstrand> @airlied dakr: I made you a label: https://gitlab.freedesktop.org/mesa/mesa/-/issues/?sort=created_date&state=opened&label_name%5B%5D=NVK&label_name%5B%5D=New%20UAPI&first_page_size=20
20:16fdobridge: <airlied> We should also work out what an ideal set of queues is and see what is missing to provide that (edited)
20:21fdobridge: <phomes_> They should but that is not what they do :(. Apparently the driconf options file has sections for each driver. So adding this for the two games I noticed this with was not too bad.
20:24fdobridge: <!DodoNVK (she) 🇱🇹> Will realtime queue be included?
20:25fdobridge: <airlied> What is a realtime queue?
20:27fdobridge: <airlied> Ah part of the extension.
20:28fdobridge: <!DodoNVK (she) 🇱🇹> `VK_QUEUE_GLOBAL_PRIORITY_REALTIME_EXT`
20:28fdobridge: <airlied> I'd like to figure out what is needed before the ext is done
20:28fdobridge: <airlied> I kinda expect we need to split the sparse queue out as well
20:30fdobridge: <airlied> Does NVIDIA expose it? Probably needs more reverse engineering
20:34fdobridge: <rinlovesyou> .
20:35fdobridge: <rinlovesyou> it should, but apparently only returns medium queues
20:40fdobridge: <airlied> Well if nvidia doesn't expose it, it's highly unlikely we could
20:43fdobridge: <gfxstrand> They do support it, yes
21:00fdobridge: <phomes_> 27595 removes an assert that I was hitting on X4 at game start. With that MR it starts up and I can get to the menus and launch things. It then crashes with:
21:00fdobridge: <phomes_> gsp: Xid:13 Graphics SM Warp Exception on (GPC 3, TPC 3, SM 0): Out Of Range Address
21:00fdobridge: <phomes_> gsp: Xid:13 Graphics Exception: ESR 0x51df30=0x200000e 0x51df34=0x0 0x51df28=0x4c1eb72 0x51df2c=0x174
21:00fdobridge: <phomes_> gsp: rc engn:00000001 chid:96 type:13 scope:1 part:233
21:00fdobridge: <phomes_> fifo:1eae1001:000c:0060:[X4.exe[76236]] errored - disabling channel
21:02fdobridge: <phomes_> first error from nvk is VK_ERROR_DEVICE_LOST. Any idea of how to debug that further?
21:03fdobridge: <karolherbst🐧🦀> we need a shader trap handler 🥲
21:03fdobridge: <redsheep> What kernel are you on?
21:03fdobridge: <karolherbst🐧🦀> it's a userspace problem probably
21:04fdobridge: <phomes_> 6.8-rc5
21:09fdobridge: <airlied> Yeah it's totally a userspace problem
21:09fdobridge: <airlied> The only open question is whether gsp is more aggressive at killing contexts sometimes than nouveau was
21:22fdobridge: <gfxstrand> I need to look at that one. I'm very confused how it's happening.
21:34fdobridge: <phomes_> Testing without gsp does not crash. No warnings from nvk or in dmesg
21:34fdobridge: <phomes_> I'll try again with gsp, and NAK_DEBUG=print. I will open an issue with the results
21:38fdobridge: <airlied> like either we need to set some flag from userspace to stop gsp dying, or we need to figure out what the non-gsp kernel sets differently in the context
22:54fdobridge: <gfxstrand> I wonder if it's another scratch OOM issue
22:56fdobridge: <karolherbst🐧🦀> ohh.. so on gsp OOM on local memory behaves differently?
22:56fdobridge: <karolherbst🐧🦀> ehh
22:56fdobridge: <karolherbst🐧🦀> OOB
22:57fdobridge: <gfxstrand> I wonder if it's another scratch OOB issue (edited)
22:57fdobridge: <gfxstrand> I don't know. I just know we've got a bunch of crashes that are some sort of local OOB
22:57fdobridge: <gfxstrand> IDK if that's related to the GSP-only crashes or not
22:57fdobridge: <karolherbst🐧🦀> I think that's within what userspace can configure via mmio registers
22:58fdobridge: <karolherbst🐧🦀> (or the kernel)
22:58fdobridge: <karolherbst🐧🦀> there is like a set of registers to configure shader traps
22:58fdobridge: <karolherbst🐧🦀> and it wouldn't surprise me if OOB nuking the channel or just moving on can be done there
22:59fdobridge: <karolherbst🐧🦀> there is this golden context thing in the kernel, which acts as a template for channel creation
22:59fdobridge: <karolherbst🐧🦀> if certain registers aren't set with gsp, that's certainly something you could try to set via `SET_PRIV_REG` and see if that changes things
23:00fdobridge: <gfxstrand> Do we know what those registers are?
23:01fdobridge: <karolherbst🐧🦀> yes and no. They aren't as specific as that, like they are numbered from 0 to 31 (in bits) but we have no idea what those bits are, just that those relate to shader traps
23:01fdobridge: <karolherbst🐧🦀> but that local mem thing can also be something else
23:01fdobridge: <karolherbst🐧🦀> like the fp helper read stuff
23:02fdobridge: <karolherbst🐧🦀> that also all relates to how shader trap handlers work, as there you basically have to stop the kernel from nuking your channel and get the shader trap handler invoked instead
23:02fdobridge: <karolherbst🐧🦀> I think
23:03fdobridge: <karolherbst🐧🦀> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxtu102.c?h=v6.8-rc5#n91
23:03fdobridge: <karolherbst🐧🦀> those are some of the callbacks related
23:03fdobridge: <karolherbst🐧🦀> (on the kernel side)
23:04fdobridge: <karolherbst🐧🦀> but it's kinda weird code overall
23:04fdobridge: <karolherbst🐧🦀> like maybe it's the first write inside `tu102_grctx_generate_r419c0c`?
23:04fdobridge: <karolherbst🐧🦀> I dunno 🙂
23:07fdobridge: <airlied> I think if we can encapsulate the problem in a test and can't figure it out, we would be able to ask nvidia engineers, but we'd need a focused reproducer
23:08fdobridge: <airlied> esp if we can run the test on the prop driver and it works fine