IRC Logs of #nouveau on irc.freenode.net for 2024-11-14

01:01 airlied: Lyude, skeggsb9778[d] : https://lore.kernel.org/dri-devel/20241114004603.3095485-1-airlied@gmail.com/T/#u can I get an rb for that when anyone gets a chance
01:02 Lyude: airlied: Reviewed-by: Lyude Paul <lyude@redhat.com>
01:21 airlied: Lyude: thx
08:47 awilcox: okay, doing some more debugging, I've been able to find out why the ce channel fails to create with -22
08:47 awilcox: nouveau 0000:01:00.0: fifo:000000: engn 5.1[ce] not found
08:51 awilcox: what really confuses me though is:
08:52 awilcox: [493871.920996] nouveau 0000:01:00.0: fifo: PBDMA0: 00004000 [] ch 1 [007fb98000 wayfire[32073]] subc 0 mthd 0000 data 00000000
08:52 awilcox: [493871.921004] nouveau 0000:01:00.0: fifo:000000:0001:0001:[wayfire[32073]] errored - disabling channel
08:53 awilcox: what is the actual error? it doesn't show a read fault or anything I would expect to precede this message based on all the other logs I've found searching online for nouveau debugging tips..
11:24 awilcox: okay so update:
11:24 awilcox: my GT 630 is not a Kepler, it is a Fermi as well, so I do not actually have a Kepler card, it appears - other than the one I'm using in my streaming TV PC (a 730)
11:26 awilcox: the GT 520 does find CE1 when I put it in a Dell XPS, so the 5.1[ce] not found error is either a ppc64 thing, or a big endian thing. unfortunately I don't have the one rare sparc that has a PCIe slot, so I can't test "big endian and not ppc64", and I'm going to doubt highly that anyone here has either a Sun Ultra 45 or a s390x with PCIe slot
11:26 awilcox: I'm going to continue working on the assumption it's an endian bug and not a "ppc64 IOMMU is much stricter" bug, but I'm not sure how wise this really is
11:28 awilcox: https://wiki.raptorcs.com/wiki/POWER9_Hardware_Compatibility_List/PCIe_Devices#NVIDIA it looks like the GTX 760 works in ppc64el, which is Kepler, so.. it's not guaranteed, but it does seem like it should work..
11:40 karolherbst: awilcox: yeah.. I wanted to suggest that you can always try on le first and see if that works
11:41 karolherbst: awilcox: but yeah.. it looks like something goes wrong with command submission, either the driver ends up messing up caches or whatever, or userspace is doing something incorrectly
11:42 karolherbst: "subc 0 mthd 0000 data 00000000" basically means the command buffer as read by the GPU is faulty
11:42 karolherbst: as in
11:42 karolherbst: it's all 0
11:43 karolherbst: in nvk there are a few simple programs running macro stuff, might make sense to try that out, because it's really basic
12:56 awilcox: karolherbst: userspace is this big black box that I don't understand, and can't seem to figure out how to debug properly.. all the pages on the nouveau site seem to be .. hopefully outdated? .. and very pessimistic that anything should work at all, and give no guidance on how to even debug userspace
12:57 awilcox: meanwhile, the mesa docs don't even mention nouveau, so I was just doing egl and mesa debug itself, and it wasn't very revealing.
12:58 awilcox: but yes, the card will at least show me twm on a dell xps, so I know the card is working :)
13:05 karolherbst: yeah.. the wiki is in a sad state
13:07 karolherbst: NOUVEAU_LIBDRM_DEBUG=3 might help
13:07 karolherbst: at least that should dump the command buffers sent to the kernel
13:23 awilcox: yes, that was very instructive: https://bpa.st/VZ2Q
13:23 awilcox: line 1283: nouveau: kernel rejected pushbuf: No such device
13:27 awilcox: which makes me believe it really does expect the CE1 engn to be there, and that's why it is unhappy
13:27 karolherbst: mhh.. I thought we wired up the new dumper...
13:27 awilcox: I've been staring at the driver init code all night (irl stuff is making me pull an all-nighter, so I thought I'd spend that time doing some 'light reading', heh)
13:28 karolherbst: ohh we haven't...
13:28 karolherbst: marysaka[d]: is there a way to use the new dumper with the gallium driver?
13:28 awilcox: and the only thing I thought is: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c#n3321 maybe this could be wrong on BE?
13:28 awilcox: I couldn't figure out where BIT() was defined
13:29 marysaka[d]: karolherbst: uuumm currently no, unless you can get raw command stream dumps and feed that to nv_push_dump tool manually
13:29 karolherbst: awilcox: uhh.. yeah.. potentially, depends on what the kernel is doing and how it's submitting stuff to the GPU
13:29 awilcox: I tried to bring the nvdev_error out of the if so that even if the ret was ENODEV, it'd print, and it didn't print anything, so it isn't making it that far.
13:29 karolherbst: awilcox: you can limit the amount of engines in device/base.c
13:30 karolherbst: marysaka[d]: pain
13:30 awilcox: and I made a small change: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/nouveau/nv50_fence.c#n64 here, adding WARN_ON(ret), and didn't get a warn
13:31 karolherbst: awilcox: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
13:31 karolherbst: search your chipset
13:31 karolherbst: ".ce = { 0x00000003, gf100_ce_new }," means ce engine at idx 0 and 1 exists
13:31 karolherbst: the first value is a bitmask
13:31 karolherbst: you can change it to 1 and then you'd have a single ce
13:32 awilcox: the weird thing is, it correctly has a ce0 and ce1 on the xps
13:32 karolherbst: but if modesetting works, then it kinda works, no?
13:32 awilcox: true
13:33 karolherbst: though not sure if ce is used for kms stuff
13:33 karolherbst: I think something goes wrong with the submission
13:34 awilcox: ok, this is a GF119, so it should be nvd9_chipset
13:34 awilcox: which has .ce = { 0x00000001, gf100_ce_new },
13:35 karolherbst[d]: looks like you only got one
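The bitmask reading karolherbst describes for the `.ce` entries can be sketched in a few lines of Python. This is only an illustration of the bit arithmetic; `engine_instances` is a hypothetical helper, not a function from the kernel or Mesa.

```python
def engine_instances(mask):
    """Return the indices of the bits set in an engine-instance bitmask.

    Hypothetical helper for illustration: bit N set is read as
    "instance N of this engine exists", as described in the chat.
    """
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


# The two masks quoted in the conversation.
print(engine_instances(0x00000003))  # [0, 1] -> ce0 and ce1
print(engine_instances(0x00000001))  # [0]    -> only ce0
```

Under that reading, a chipset entry of 0x00000001 declares a single copy engine, which matches the "looks like you only got one" remark.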
13:35 awilcox: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gf100.c#n921
13:35 awilcox: this is the code that is making it say engn 5.1[ce] not found
13:35 karolherbst: uhh..... `nouveau.debug=debug` might also give you more infos
13:35 awilcox: and it doesn't print that on the dell..
13:35 karolherbst: huh...
13:35 karolherbst: maybe it's a bug in the code
13:35 karolherbst: skeggsb9778[d]: ^^ any thoughts?
13:38 awilcox: https://foxkit.us/linux/nouveau-dmesg-fermi-202411072016.txt
13:38 awilcox: this is only disp=debug, but it does have some infos (it's a few days old, but nothing has changed so far)
13:38 awilcox: not sure if I kept the full debug log, it was very big
13:43 awilcox: ok I did, but it was after an rmmod/modprobe cycle, not a clean boot. I don't know if that matters; the card didn't seem in a particularly bad state (fbdev came back up fine) and the errors were the same https://foxkit.us/linux/nouveau-dmesg-fermi-debug-202411140302.txt
13:48 awilcox: one thing I will also note is that on the dell it says "MM: using COPY for buffer copies" and on ppc64 it says "MM: using M2MF for buffer copies"; I haven't found the significance of this yet, and I guess it's because "failed to create ce channel"..
13:51 karolherbst: ohhhhhh
13:51 karolherbst: mhhh
13:51 karolherbst: yeah...
13:51 karolherbst: COPY won't work on fermi
13:51 karolherbst: awilcox: it needs to use M2MF
13:51 awilcox: weird because the dell is where it was working lol
13:51 karolherbst: on fermi that is
13:51 karolherbst: yeah...
13:51 karolherbst: weird
13:53 awilcox: let me make sure that was the exact output on dmesg on the dell. I still have that
13:55 awilcox: https://bpa.st/MXDA yep, COPY0 sorry
14:08 karolherbst: ohh wait...
14:08 karolherbst: awilcox: yeah... in the kernel that's fine, I thought it's userspace
14:08 karolherbst: userspace can't use COPY
14:08 karolherbst: on fermi that is
14:08 awilcox: ahh I see
14:09 karolherbst: we might be able to support it with custom firmware, but...
14:09 karolherbst: but yeah...
14:09 karolherbst: if it's not using COPY on be, then I guess that's something to follow up on
15:07 karolherbst: awilcox: what mesa version are you using? Because you want to have this: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30455
15:07 awilcox: the debug paste was made with 24.1.7, but I have also been testing main as of... the commit right before my MR
15:08 karolherbst: ehh
15:08 karolherbst: I mean https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29542
15:08 awilcox: which is this: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32050
15:08 karolherbst: you really want to have https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29542 in your tree
15:08 karolherbst: that will parse the command buffer
15:08 karolherbst: (except m2mf I think...)
15:11 awilcox: it looks like m2mf is there
15:14 awilcox: okay, I will retry with main
15:18 awilcox: https://bpa.st/ZRWA doesn't look very different
15:20 karolherbst: awilcox: are you hitting this line? https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/winsys/nouveau/drm/nouveau.c?ref_type=heads#L1138
15:21 karolherbst[d]: or rather
15:21 karolherbst[d]: why aren't you
15:22 awilcox: I see it says vk_push_print, does vk stand for vulkan by chance? vulkan doesn't even build on this platform yet..
15:22 awilcox: it felt, how do I put, "optional" when a compositor doesn't even start yet
15:22 awilcox: if vulkan support is going to be important to debug this, I suppose I can go down another rabbit hole :)
15:22 karolherbst: nah
15:22 karolherbst: it just uses the same code as nvk does
15:23 awilcox: ahh okay
15:23 karolherbst: is dev->info.cls_eng3d 0 by any chance?
15:24 karolherbst: you should have mesa call into nouveau_device_set_classes_for_debug to set it up
15:24 karolherbst: huh....
15:24 karolherbst: actually...
15:24 awilcox: should I set a breakpoint on that function and see?
15:24 karolherbst: fuck...
15:24 karolherbst: :D
15:24 karolherbst: I see it now
15:25 karolherbst: awilcox: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/drivers/nouveau/nvc0/nvc0_screen.c?ref_type=heads#L1472
15:25 karolherbst: can you move that one below the call to nouveau_device_set_classes_for_debug
15:25 karolherbst: so that PUSH_KICK comes after nouveau_device_set_classes_for_debug
15:26 awilcox: should I move the nouveau_device_set_classes_for_debug above it instead?
15:26 awilcox: or is it safe to move the PUSH_KICK down that far?
15:26 karolherbst: maybe move nouveau_device_set_classes_for_debug up
15:29 awilcox: hey it's all pretty now!
15:29 karolherbst: nice
15:30 karolherbst: not that it helps with the problem you are currently seeing 🙃
15:30 awilcox: "Text exceeds pastebin limit of 512 kB"
15:30 karolherbst: but yeah...
15:30 karolherbst: seems like it was broken only for the initial state push
15:30 karolherbst: which... generally never causes issues and has almost no information :D
15:30 karolherbst: but yeah..
15:31 karolherbst: I think the kernel using M2MF instead of COPY breaks command submission
15:31 karolherbst: or... well
15:31 karolherbst: breaks whatever
15:31 awilcox: does libGL/libdrm use a different fd for writing? this is weird..
15:31 awilcox: wayfire 2>&1 >foo.log logs all of wayfire's stuff but libGL is still written to terminal
15:32 pavlo: Hello
15:32 pavlo: !
15:32 karolherbst: awilcox you can use NOUVEAU_LIBDRM_OUT to point to a different file
15:32 awilcox: ah that helps
15:32 karolherbst: it uses stderr by default
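A minimal sketch of driving the two environment variables mentioned above from a script, assuming they behave as described in the chat (NOUVEAU_LIBDRM_DEBUG=3 dumps the command buffers, NOUVEAU_LIBDRM_OUT names a file to write to instead of stderr). The helper names and the log path are hypothetical.

```python
import os
import subprocess


def libdrm_debug_env(log_path, level=3, base=None):
    """Build an environment with the nouveau libdrm debug variables set.

    Hypothetical helper: copies `base` (or the current environment) and
    adds the two variables named in the conversation.
    """
    env = dict(os.environ if base is None else base)
    env["NOUVEAU_LIBDRM_DEBUG"] = str(level)
    env["NOUVEAU_LIBDRM_OUT"] = log_path
    return env


def run_with_pushbuf_dump(argv, log_path):
    """Run a program with the debug variables set; returns the CompletedProcess."""
    return subprocess.run(argv, env=libdrm_debug_env(log_path), check=False)
```

For example, `run_with_pushbuf_dump(["wayfire"], "/tmp/nouveau-pushbuf.log")` would start the compositor with the dump directed to that file, independent of how the compositor's own stdout and stderr are redirected.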
15:32 pavlo: Are you see me?
15:33 pavlo: IDENTIFY
15:33 pavlo: !IDENTIFY
15:33 pavlo: IDENTIFY
15:33 awilcox: https://foxkit.us/linux/nvdrm-wayfire-log-202411140932.txt
15:33 karolherbst: pavlo: yeah, we see what you write
15:33 awilcox: probably not useful but there it is anyway
15:34 karolherbst: awilcox: yeah.. it helps if you run into specific issues later on, but that already worked before 🙃 I just wasn't aware that the initial push to set a default state wasn't using the decoder
15:34 pavlo: Amazingly. I'm from a different IP address, but I still entered the chat only through the username)
15:35 pavlo: @karolherbst Hello! How are you?
15:36 karolherbst: doing alright, but will be afk for a while anyway 🙃
15:37 pavlo: you see my identify ... ?
15:38 awilcox: not sure if I should make an MR with this change or not?
15:38 pavlo: i`m write /msg pavlo identify [my_password], and his send >pavlo< identify ...(password)
15:39 awilcox: 1. no we don't see it, only you see it because it is a private message
15:39 awilcox: 2. you are meant to write /msg nickserv identify yourpassword
15:40 pavlo: Oh, thanks You
15:40 awilcox: but it is more secure if your client supports authentication via "SASL".. help for that is not appropriate for this channel, but do feel free to look up your client docs and see if it supports it
15:40 pavlo: i`m using hexchat
15:41 pavlo: Sorry, i`m newbie
15:41 awilcox: it's okay we all are once :)
15:41 karolherbst: awilcox: OFTC only supports pw or client cert afaik
15:42 awilcox: hrm, that's a bit disappointing
15:43 karolherbst: client certs are fine tho
15:43 awilcox: true
15:44 karolherbst: just.. there are some IRC clients not supporting it
15:44 awilcox: hrm, kernel debugging is only slightly less of a nightmare than I remember it being
15:45 awilcox: at least this is a module I can rmmod/modprobe a lot :) last kernel thing I did was a pci bus driver that hardlocked on probe, which was lovely
15:46 awilcox: whoever is responsible for making it so that rmmod/modprobe a bunch of times actually does work, thank you immensely, btw. it certainly didn't use to work right, and especially on a "niche" platform :)
15:46 pavlo: Do We have something news of Kepler?
15:47 * awilcox thinks back to linux 4.3 on a tesla, watching it oops on rmmod and kill the irq kworker..
17:17 pavlo: Can you please invite me to the nouveau discord server (freedesktop)?
17:19 tiredchiku[d]: https://discord.gg/6xwrx33n
17:20 pavlo: Thanks!
17:35 mhenning[d]: Yeah, kepler nvk isn't expected to work
17:35 mhenning[d]: There's a reason the env var is called NVK_I_WANT_A_BROKEN_VULKAN_DRIVER
17:36 awilfox[d]: Oh, cool, I can use Discord instead of IRC and then actually manage to stay here while rebooting my ppc64.
17:40 karolherbst[d]: I have a counter running on my homeassistant 🙃
17:41 karolherbst[d]: ehh
17:41 karolherbst[d]: bouncer
17:43 awilfox[d]: I have a bouncer for other networks but I have had issues with OFTC in the past so I don't stay connected to it, I only use it when I need to for speaking to specific communities. So, this works better for me 🙂
17:46 karolherbst[d]: ahh
17:53 ermine1716[d]: Oh hi, thanks for the link
18:01 tiredchiku[d]: https://tenor.com/view/mygod-whathaveidone-ohno-cat-explosion-gif-4930325
20:13 airlied[d]: it's pretty unlikely ppc64 things, it'll be all endian at a guess
20:43 karolherbst[d]: does ppc64 have weird alignment rules? Might also have some of them randomly, but I hope not
20:53 awilfox[d]: I wouldn't call them "weird", but then I've been doing PowerPC stuff since ~2004… the only real thing is that structures must be aligned to 16 bytes; there was one instance of systemd doing a weird uint8_t* cast thing to an ioctl that caused it to crash on ppc64 because the uint8_t* value wasn't aligned like a structure.
21:02 airlied[d]: I'd definitely think endian until proven otherwise
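To make the endianness suspicion concrete, here is a small sketch of what happens when a 32-bit word is stored in one byte order and read back in the other. The example value is arbitrary and does not come from the logs; `as_seen_with_wrong_endianness` is a hypothetical helper.

```python
import struct


def as_seen_with_wrong_endianness(word):
    """Pack a 32-bit word big-endian, then read the same bytes little-endian."""
    return struct.unpack("<I", struct.pack(">I", word))[0]


word = 0x20010000  # arbitrary illustrative value, not taken from the logs
print(hex(as_seen_with_wrong_endianness(word)))  # 0x120
```

A word that is byte-swapped once looks like a completely different value to whoever reads it, which is the kind of mismatch an endian bug between a big-endian host and the GPU would produce.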
23:33 awilfox[d]: I made a little progress, I found where things are beginning to go south, hopefully I find more tomorrow 🙂