04:36fdobridge: <benjaminl> how much do we trust the codegen sched implementation for sm50?
04:37fdobridge: <benjaminl> nak sm50 is mostly blocked on encode_alu stuff right now, but I was thinking that if we're gonna need to take new measurements for sched stuff I could start working on a tool for that
04:37fdobridge: <benjaminl> remember there being some discussion a while ago about the codegen sched being unreliable, but not sure if that was for sm50 or just for sm75
06:24fdobridge: <Sid> out of sheer curiosity, would it not be possible to directly use nvidia's opencl/cuda lib for nouveau cuda?
06:31fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Would zink be able to compete with a native OpenGL driver on NVIDIA (because I've heard NVIDIA has dedicated hardware for OpenGL)? 🐸
06:45fdobridge: <Sid> I don't think nv has dedicated cores for GL
06:47fdobridge: <benjaminl> definitely not easily... the kernel side of the nvidia drivers is completely different, so you'd probably need to expose something in the nouveau kernel driver that matches the nvidia kernel uapi
06:48fdobridge: <benjaminl> my not-very-knowledgeable guess is that this would be harder than just implementing a cuda runtime in mesa for a worse outcome
06:50fdobridge: <Sid> I see
09:17fdobridge: <airlied> Like you could port NVIDIA uvm kernel driver to nouveau for lols 🙂
13:06fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Let's see if nouveau explodes after enabling it (I just found a kernel parameter while looking for something else)
16:57fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Why doesn't plugging in a USB-C display adapter wake up my TU117 GPU from D3cold but NVK does (and reenables the display)?
16:57fdobridge: <karolherbst🐧🦀> sounds like an ACPI bug
16:57fdobridge: <karolherbst🐧🦀> or something
16:58fdobridge: <karolherbst🐧🦀> display hotplug events are wired up in the firmware
16:58fdobridge: <karolherbst🐧🦀> so even though the GPU is off, the firmware pokes the OS to enable the GPU
16:58fdobridge: <karolherbst🐧🦀> there might be a bug somewhere there
16:58fdobridge: <karolherbst🐧🦀> lyude: ^^ display hotplug bug
17:01fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Hopefully my laptop's UEFI firmware isn't the cause (it hasn't received an update for long enough that it still has the TPM stutter issue)
17:08fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> On an unrelated note, it took me a really long time to find an adapter that worked with my laptop's USB-C port (I've tried 3 adapters with a DisplayPort port and they all added a weird USB device to lsusb), but I now have this Huawei branded one with an HDMI port (and 2 USB ports and VGA too) that actually works with all of my compatible USB-C devices
18:12fdobridge: <airlied> @gfxstrand btw did you manage to do another run with gsp or reproduce the fence crash?
18:12fdobridge: <gfxstrand> Let me try that now
18:14fdobridge: <gfxstrand> @airlied running now
18:16fdobridge: <gfxstrand> If it's like previous runs, it should last less than 10m
18:20fdobridge: <airlied> if it fails quick, can you try reducing the number of parallels?
18:21fdobridge: <gfxstrand> Yup, it's dead again
18:21fdobridge: <gfxstrand> How many parallels do you want me to run? I've got 36 threads on this machine.
18:22airlied: fdobridge: same fence error?
18:22fdobridge: <airlied> maybe knock it down to half the number and see
18:25fdobridge: <gfxstrand> Yup
18:26fdobridge: <gfxstrand> The fence error appears to be a failed read so, either an OOB channel id or the BO just vanishes.
18:28fdobridge: <gfxstrand> Looks like maybe we're not handling channel creation failure properly?
18:29fdobridge: <gfxstrand> Still, the channel create failing is a problem.
18:30fdobridge: <airlied> okay it's weird we haven't seen that up until now, can you send me dmesg for it?
18:32fdobridge: <gfxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1171517461640773652/fence-dmesg.txt?ex=655cf7a8&is=654a82a8&hm=4b0ed771cf93c629d286f3d050d411224921365cbf76f4ffd2845fc637dfde06&
18:32fdobridge: <gfxstrand> As usual, kernel backtraces are missing lots of frames. 🙄
18:33fdobridge: <airlied> you've tried cts on same kernel without gsp enabled?
18:35fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Ekstrand did show some crazy test results without GSP and NAK so definitely yes
18:35fdobridge: <gfxstrand> Not exactly the same kernel, no.
18:36fdobridge: <airlied> that might be a good data point
18:36fdobridge: <airlied> since that backtrace has no gsp specifics in it
18:41TimurTabi: airlied: if I want to boot with GSP-RM support, do I need to do anything special with grub or command lines? Do I need to copy GSP-RM firmware to initrd or something?
18:43airlied: TimurTabi: for Ada nothing, for turing/ampere, nouveau.config=NvGspRm=1 has to be set, if you've installed a kernel when the firmware is in the right places the initrd should have it
18:43airlied: you might have to reinstall the kernel if you just added fw to the filesystem after the install
18:46TimurTabi: command-line parm was it. It looks like it works! What's a good graphics test to see if it's really running at full speed?
18:48TimurTabi: vblank_mode=0 glxgears shows 6000fps
18:50airlied: seems like it's working, but I've no idea what your non-gsp gears score is :-P
18:50DodoGTA: TimurTabi: SuperTuxKart at 1080p and maximum preset loads the GPU decently well
18:58TimurTabi: airlied: if I omit NvGspRm=1, I get no gui on a TU104.
18:59airlied: TimurTabi: I'm going to guess the pre-gsp firmware fails to load, see if you can get a dmesg :-)
19:00TimurTabi: Not a whole lot:
19:00DodoGTA: TimurTabi: "Error fetching paste"
19:01DodoGTA: Now the paste works
19:01airlied: okay, all seems fine, but no display is weird
19:14TimurTabi: airlied: can you test https://github.com/NVIDIA/linux-firmware/commit/415a5650947396d462033fdf8aa8e6fdf0d9866d to see if it really installs the firmware properly for you? It works for me, but I want a second opinion before I submit a PR
19:18airlied: I probably won't get to it soon, I'm on the road for 2 weeks, but I'll see
19:20TimurTabi: Ok, I'll just work with Mario directly then. I don't think we want to wait 2 weeks.
19:20fdobridge: <gfxstrand> I threw a couple of `BUG_ON()` into the offending function to try and see what's going on. Then I had to dash to an eye appointment. I'll be able to look at the results in half an hour or so.
19:20TimurTabi: Or maybe you can suggest someone else to look at it? I really don't want to screw up a 60MB PR.
19:21airlied: TimurTabi: once the files are in place, we can move them around if anything gets screwed up without another 60MB PR
19:22airlied: I'll see if I can put the patch into the copr I made and replace the one I have
19:42fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> What does this warning mean? :nouveau:
19:42fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1171535044876177499/message.txt?ex=655d0808&is=654a9308&hm=c068b2002f7fa89b56271fe47de49ce06479a69adfbdb85ec0d59c17f11c13ad&
19:44fdobridge: <karolherbst🐧🦀> RIP
19:45fdobridge: <airlied> looks like an acpi warning
19:46fdobridge: <airlied> it's probably not fatal though
19:46fdobridge: <airlied> some ACPI method gave us back an answer we didn't expect
19:47fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Display output works fine (and vkcube too)
19:47fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> But actual games crash 🤔
19:47airlied: TimurTabi: I've put your patches into a fedora copr build, https://copr.fedorainfracloud.org/coprs/airlied/nouveau-gsp/build/6610281/
19:48fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> 🎮
19:48fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1171536606444933191/message.txt?ex=655d097d&is=654a947d&hm=37b323312a4c5aff2bc3935ced2dac73798ad456a3c66e7de0d74d1560adb684&
19:48fdobridge: <airlied> that seems unrelated to games crashing though
19:48fdobridge: <airlied> okay that seems more relatable
20:26fdobridge: <gfxstrand> Just finished a full run without GSP and no crashes.
20:28fdobridge: <airlied> cool I've got a full run on my ampere going fine 40 minutes in with gsp
20:35fdobridge: <gfxstrand> And, GSP OOM...
20:45fdobridge: <gfxstrand> ```patch
20:45fdobridge: <gfxstrand> commit e123f3e78bfc3f05176b97717f60d953f7ee4c7c (HEAD)
20:45fdobridge: <gfxstrand> Author: Faith Ekstrand <firstname.lastname@example.org>
20:45fdobridge: <gfxstrand> Date: Tue Nov 7 14:44:24 2023 -0600
20:45fdobridge: <gfxstrand> nouveau: BUG_ON some invariants in fence_context_new
20:45fdobridge: <gfxstrand> diff --git a/drivers/gpu/drm/nouveau/nv84_fence.c b/drivers/gpu/drm/nouveau/nv84_fence.c
20:45fdobridge: <gfxstrand> index 812b8c62eeba..173edc19cb4d 100644
20:45fdobridge: <gfxstrand> --- a/drivers/gpu/drm/nouveau/nv84_fence.c
20:45fdobridge: <gfxstrand> +++ b/drivers/gpu/drm/nouveau/nv84_fence.c
20:45fdobridge: <gfxstrand> @@ -131,6 +131,10 @@ nv84_fence_context_new(struct nouveau_channel *chan)
20:45fdobridge: <gfxstrand> struct nv84_fence_chan *fctx;
20:45fdobridge: <gfxstrand> int ret;
20:45fdobridge: <gfxstrand> + BUG_ON(priv == NULL);
20:45fdobridge: <gfxstrand> + BUG_ON(priv->bo == NULL);
20:45fdobridge: <gfxstrand> + BUG_ON(nv84_fence_chid(chan) * 16 >= priv->bo->bo.base.size);
20:45fdobridge: <gfxstrand> +
20:45fdobridge: <gfxstrand> fctx = chan->fence = kzalloc(sizeof(*fctx), GFP_KERNEL);
20:45fdobridge: <gfxstrand> if (!fctx)
20:45fdobridge: <gfxstrand> return -ENOMEM;
20:45fdobridge: <gfxstrand> ```
20:45fdobridge: <gfxstrand> @airlied The third `BUG_ON()` triggers. There's your bug.
20:48fdobridge: <gfxstrand> Either we're not properly recycling channel IDs or our fence BO needs to be bigger.
20:48fdobridge: <gfxstrand> I don't know enough about nouveau and GSP to have opinions on which.
20:52fdobridge: <airlied> can you check if throwing a * 8 or something in the nouveau_bo_new in nv84_fence_create helps? also knowing what nv84_fence_chid(chan) is when it blows up?
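A minimal userspace sketch of the invariant behind the third `BUG_ON()`, assuming (as the check suggests) that each channel ID owns a 16-byte slot in the fence buffer. The helper names, the `scale` parameter standing in for the "* 8" experiment, and the idea that the buffer is sized from a channel count are all illustrative assumptions, not the actual nouveau code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes per channel slot; the 16 is taken from the BUG_ON() in the patch. */
#define FENCE_SLOT_BYTES 16u

/*
 * Hypothetical helper: size of a fence buffer holding one slot per channel.
 * `scale` models the "* 8" experiment; sizing from a channel count is an
 * assumption made for this sketch.
 */
static size_t fence_bo_size(uint32_t chan_total, uint32_t scale)
{
    assert(scale > 0);
    return (size_t)FENCE_SLOT_BYTES * chan_total * scale;
}

/*
 * True when the slot for `chid` starts inside a buffer of `bo_size` bytes.
 * This is the negation of the condition the third BUG_ON() tests.
 */
static bool fence_slot_in_bounds(uint32_t chid, size_t bo_size)
{
    return (size_t)chid * FENCE_SLOT_BYTES < bo_size;
}
```

With a made-up count of 128 channels, `fence_slot_in_bounds(128, fence_bo_size(128, 1))` is false, which is the situation the `BUG_ON()` catches; scaling the buffer by 8 makes the same channel ID fit. That only moves the limit, so it does not distinguish between channel IDs not being recycled and the buffer being undersized.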
20:55fdobridge: <gfxstrand> I can after a bit
21:07fdobridge: <airlied> okay I've no idea how that whole chid stuff works, need more learning
21:17fdobridge: <airlied> also a printk with drm->chan_total in it might be good info
21:17fdobridge: <airlied> probably some disagreement with gsp and pre-gsp on some of those
21:18fdobridge: <airlied> Pass: 395042, Fail: 124, Crash: 93, Skip: 1619939, Flake: 45, Duration: 1:05:51, Remaining: 0 on ampere gsp
22:08fdobridge: <karolherbst🐧🦀> cool
22:42fdobridge: <gfxstrand> Time to try with BO size x 8
22:48fdobridge: <airlied> it might also be a turing only thing we have to chase down
22:52fdobridge: <gfxstrand> @airlied Is it possible that this chid thing and our OOM issues are related? Like, maybe there are hidden channels on some cards that we're not accounting for?
22:52fdobridge: <gfxstrand> Or maybe that makes no sense. I don't actually know
23:05fdobridge: <gfxstrand> Well, it's 20m in and hasn't BUG'd yet.
23:11fdobridge: <gfxstrand> Or maybe it's something with my card? This is the 12GB 2060 from hell that I'm running. 😂
23:11fdobridge: <gfxstrand> (It didn't work with nouveau when I first bought it because of mystery firmware problems)
23:26fdobridge: <gfxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1171591449851211877/message.txt?ex=655d3c90&is=654ac790&hm=ae6952025c900226ff41497252241298889aa71e5903a69d57cde015d2e6cb9b&
23:26fdobridge: <gfxstrand> womp womp
23:28fdobridge: <airlied> those are some weird ass places to crash
23:29fdobridge: <airlied> something really crapped on RAX there
23:48fdobridge: <airlied> I wonder if we have with turing gsp some memory corruption or wrong sized struct, I'll give some stuff a spin on a Turing when I return