IRC Logs of #nouveau on irc.freenode.net for 2024-02-29

01:06 fdobridge: <gfxstrand> I appreciate the reference
01:09 RSpliet: Well that's awkward
01:09 RSpliet: Looks like distribution of the GSP is seriously weighing down initramfs, it's now 80MiB an image, effectively meaning my 450MiB boot partition can only hold two kernels instead of 3.
01:10 fdobridge: <redsheep> I have started just doing a 1 GB boot partition instead of 500 MB
01:10 RSpliet: Yeah it's too late for that, it's a 7yo laptop and I'm not particularly keen to reinstall it
01:10 RSpliet: Just to fit a firmware in an initramfs that I'll never need
01:12 RSpliet: Guess I should ask the Fedora people if there's any way to make dracut a bit more picky about what it shovels in the initramfs
01:14 fdobridge: <samantas5855> does the gsp have to be in the efi partition?
01:19 airlied: not sure we can know with nouveau if we need it or not
01:19 fdobridge: <karolherbst🐧🦀> it's the distributions fault to shipping the same firmware files multiple times tho
01:19 fdobridge: <karolherbst🐧🦀> *of
01:20 fdobridge: <karolherbst🐧🦀> could also just compress and copy the firmware files once
01:21 RSpliet: pretty sure the firmware directory contains symlinks and there's only one copy
01:21 RSpliet: it's just that that firmware package is 25MiB
01:21 fdobridge: <karolherbst🐧🦀> I mean putting the same firmware files in multiple initramfs
01:21 RSpliet: Oh sure, well, one per kernel yes
01:21 fdobridge: <karolherbst🐧🦀> could also have one for all kernels
01:22 RSpliet: firmware wise yes, .ko-wise no
01:22 RSpliet: But you could have two initramfs images? idk
01:22 fdobridge: <karolherbst🐧🦀> yeah, but the issue are the firmware files here 😉
01:22 fdobridge: <karolherbst🐧🦀> yeah
01:22 fdobridge: <karolherbst🐧🦀> this all should be possible
01:23 RSpliet: anyway, easiest solution for me would have been to just not pick up the gsp, since it's a 1st gen maxwell in this laptop, but I can't see any command line argument to dracut to tell it to do that
01:23 fdobridge: <karolherbst🐧🦀> at least the first patches for chained initramfs are like from 2008
01:23 airlied: yeah initrd just takes the list from the driver, no way to cut it down
01:23 fdobridge: <karolherbst🐧🦀> you can tell dracut what not to include I think
01:24 RSpliet: you can tell it not to include a driver, can't tell it not to include a specific firmware pkg
01:24 RSpliet: maybe I'd be fine excluding nouveau, since the primary GPU is intel anyway
01:24 fdobridge: <airlied> if it's a laptop and runpm works then you save a few watts
01:25 fdobridge: <karolherbst🐧🦀> why do you have a 450MB /boot/ anyway?
01:25 fdobridge: <airlied> old school installs ftw
01:25 fdobridge: <karolherbst🐧🦀> mhh
01:25 RSpliet: well the exact size is eh, but 450MiB has always been way more than necessary
01:25 fdobridge: <karolherbst🐧🦀> does grub support btrfs these days? 😄
01:26 fdobridge: <karolherbst🐧🦀> subvolumes would just solve this issue
01:27 RSpliet: yeah removing nouveau from the initramfs *halves* the size of the image.
01:27 fdobridge: <karolherbst🐧🦀> but anyway, there are legitimate reasons to have kms working before you can access `/` and that's kinda the painful part here
01:27 RSpliet: appreciate that
01:28 RSpliet: the painful part is the sheer size of the firmware package, which is... not under your control obvs
01:28 fdobridge: <karolherbst🐧🦀> but being held back by bad decisions is also something which we can't really bother with
01:28 fdobridge: <karolherbst🐧🦀> maybe we should have been more aggressive about distributions having to find a solution for this :ferrisUpsideDown:
01:29 fdobridge: <karolherbst🐧🦀> but I'm also sure it would have changed nothing
01:29 karolherbst: RSpliet: are your firmware files compressed btw?
01:30 RSpliet: yes, xz-compressed it's still 25MiB
01:31 karolherbst: mhhh right..
01:31 RSpliet: with "it" being the one gsp firmware bin
01:31 karolherbst: well.. if you feel adventurous, you can always mess with your fs and partition table :D
01:31 RSpliet: Oh, and 12 MiB for the turing one
01:32 RSpliet: So it is just nearly 40MiB in firmwares for the GSP
01:33 RSpliet: Meanwhile vmlinuz is about 15 MiB, so... the ga102 gsp package is substantially bigger than *the Linux kernel*
01:34 RSpliet: NVIDIA should be ashamed of themselves for shipping that
01:34 karolherbst: but anyway.. there is no reason to put those files into initramfs... the boot process could just mount /boot/ and load firmware from there, but I think it's all very cursed atm and not _that_ easy to change?
01:34 karolherbst: I'm sure there is a lot of important code in that file :D
01:34 RSpliet: I think it's up to the distro people to fix that eventually
01:34 karolherbst: I mean.. how big is nvidias driver anyway?
01:35 fdobridge: <redsheep> It's like 800 MB these days
01:35 RSpliet: At this point I'm assuming that the GSP *is* the driver and the .ko is just a shim around it
01:37 airlied: pretty much
01:43 RSpliet: right, taking nouveau out of the initramfs does the trick. It's still loaded during boot so runpm should be all good. Thanks for hearing me moan for a bit
01:59 karolherbst: Lyude: sooo... HDMI 2.1... you think we could implement it based on GSP without needing to know anything about how HDMI 2.1 works?
02:00 fdobridge: <redsheep> I already investigated this and I am fairly sure the answer is yes. It seems to work on openrm with no apparent implementation of anything having to do with the HDMI spec.
02:01 Lyude: yeah I would assume so as well tbh
02:01 fdobridge: <redsheep> All the problematic link training and such seems to be stuffed in the gsp
02:01 karolherbst: mhhhhhhhh
02:01 Lyude: as well - honestly we still could have HDMI 2.1 in the kernel - it's just someone has to reverse engineer it
02:02 karolherbst: so apparently the HDMI forum did the shithead move and turned down AMD's proposal to implement HDMI 2.1 in open source, however... what if 1. we don't ask and 2. we never have seen the spec in the first place and 3. all we do is nothing hdmi related anyway
02:03 karolherbst: Lyude: yeah.. but I think the HDMI folks are in a .. uhm... hostile mood
02:03 fdobridge: <redsheep> It's not against any rules to go read how nvidia does it in openrm right? There's like... nothing there
02:03 karolherbst: I'd agree
02:03 karolherbst: but the HDMI folks might not 🙃
02:03 Lyude: karolherbst: tbh I kind of wish they didn't report on it
02:03 karolherbst: same tbh
02:03 Lyude: karolherbst: it doesn't really matter
02:04 Lyude: If we're not breaking DRM I don't see any issue
02:04 karolherbst: I only have two questions: 1. do we have lawyers and 2. do we have enough money to pay them to win against the HDMI baddies :P
02:04 Lyude: and regardless if we were that would make nvidia's driver in hot water as well
02:04 karolherbst: mhhhh
02:04 karolherbst: good point actually
02:05 karolherbst: sadly I don't have HDMI 2.1 hardware...
02:05 karolherbst: but I'd implement it just for the popcorn
02:06 Lyude: i have a tv
02:06 karolherbst: I mean.. sure, I do as well, but can it do HDMI 2.1?
02:06 Lyude: anyway vesa is cool
02:06 fdobridge: <redsheep> If nvidia has already implemented all of it inside the gsp then we wouldn't even be actually implementing it at all, right? Just making the calls to tell it "Hey do the thing"
02:06 Lyude: (thanks bill)
02:07 Lyude: redsheep: p much
02:07 fdobridge: <redsheep> I am typing this from an HDMI 2.1 TV, I will 100% test it if patches show up
02:07 Lyude: I mean - we might still be able to make some helpers from that because I think intel's driver sort of does the same thing
02:07 karolherbst: I wonder if one can code it blind and it works on first try...
02:07 karolherbst: it's really just setting a different type and do some bw calculation or something, right?
02:08 Lyude: I think
02:08 karolherbst: FRL remains weird tho
02:08 fdobridge: <redsheep> I am not even sure you need to calculate the bandwidth but it has been a few months since I read the code
02:09 Lyude: for nvidia probably not
02:09 karolherbst: mhhh
02:09 fdobridge: <redsheep> FRL is weird? Isn't that the whole point of HDMI 2.1, or were you talking about the VRR parts?
02:09 Lyude: like at least for nova pretty much everything is done in gsp already (at least from what I've seen when going through stuff)
02:09 karolherbst: but don't you have to select one of the FRL modes?
02:09 karolherbst: or is that also done in gsp internally?
02:10 Lyude: i didn't look that closely tbh
02:10 Lyude: but nvidia's driver does it
02:10 Lyude: actually
02:10 fdobridge: <redsheep> I can look again, I am not 100% there
02:10 Lyude: yeah p much anything we need should be there so meh
02:10 Lyude: i'm very much not worried about it
02:10 karolherbst: yeah.. I think there was like 3 modes and you select the correct one or something
02:10 karolherbst: right...
02:10 karolherbst: just doing some copy pasta
02:11 airlied: we don't have a contract with the hdmi consortium
02:11 airlied: so as long as nobody reads a leaked spec and does proper reverse engineering they are unlikely to have any grounds
02:12 Lyude: ^^
02:12 airlied: the question for AMD is what is actually needed to drive the FRL bits on their hw
02:12 karolherbst: I mean.. those are IP related corpos on the forum, they don't care about those details, they sue regardless
02:12 fdobridge: <redsheep> We would only ever have to really RE on something openrm doesn't do, or doesn't do in a way that works for nouveau, right?
02:13 karolherbst: but yeah...
02:13 airlied: karolherbst: I doubt they'd bother to be honest, just a question if anyone care to reverse engineer it
02:13 karolherbst: I also don't think that "but nvidia did it" means anything at all to them. They just don't sue nvidia because they'd be too big :P
02:14 karolherbst: airlied: nah.. I don't trust companies in the IP market
02:14 airlied: pretty sure nvidia just buried it in gsp
02:14 karolherbst: they are trigger happy even if they know they lose, because often they just want to scare folks
02:14 karolherbst: not sure if the HDMI forum itself is like that
02:14 airlied: yeah not sure HDMI is in that range
02:14 karolherbst: but some of the bigger members definitely are
02:15 airlied: I've come close to screwing with them when I was doing HDCP and displaylink work
02:15 karolherbst: ahh
02:15 airlied: I think intel might have just punted it out to a separate chip
02:15 karolherbst: I'd be for implementing it regardless anyway, just to see what happens
02:15 airlied: so you've just got a DP->HDMI PCON that does it all
02:15 karolherbst: heh
02:16 airlied: but AMD I think has native HDMI IP
02:16 karolherbst: yeah.. I think that's what Intel at least used to do
02:16 karolherbst: not sure if they still do it?
02:16 karolherbst: I thought they added native HDMI support recently
02:16 airlied: don't think so
02:16 karolherbst: I thought for 2.0 or 2.1 they did
02:16 fdobridge: <redsheep> On the arc cards it's definitely through another chip
02:16 airlied: but haven't kept track
02:16 karolherbst: ahh, fair enough then
02:17 airlied: anyone got a cheap monitor that can do frl? :-)
02:18 karolherbst: I'd have to check actually...
02:18 fdobridge: <redsheep> I don't think anything cheap has it lol
02:18 airlied: I expect this would be something buried deep in the tech specs
02:18 fdobridge: <redsheep> The best you would likely do would be a used TV
02:18 karolherbst: wait a second...
02:18 karolherbst: I have 4K@120 HDR TV...
02:18 karolherbst: doesn't that like require FRL?
02:18 fdobridge: <redsheep> That's got FRL then
02:18 fdobridge: <redsheep> Yes
02:18 karolherbst: guess I _do_ have the hardware after all
02:18 fdobridge: <redsheep> Unless it has aggressive DSC, yes
02:19 fdobridge: <redsheep> You have to have DSC near max compression for it to work on TMDS
02:19 karolherbst: and 4:2:0
02:19 fdobridge: <redsheep> Oh, that too
02:20 karolherbst: let's see...
02:22 karolherbst: max res: 2160P/120Hz 444 12bit
02:23 fdobridge: <redsheep> Yeah even just 4k120 10bpc at 4:2:0 needs quite a bit of DSC on TMDS https://tomverbeure.github.io/video_timings_calculator
02:24 fdobridge: <redsheep> If that's your max res TMDS is straight up not possible
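The bandwidth argument above can be checked with a little arithmetic. The following is a rough sketch, not taken from the log: it assumes HDMI 2.0 TMDS leaves about 14.4 Gbit/s for video data (18 Gbit/s on the wire, 10 bits sent per 8 bits of data), and it counts active pixels only, so it is a lower bound that ignores blanking. All function names are made up for illustration.

```cpp
#include <cstdint>

// Assumed figure: HDMI 2.0 TMDS moves 18 Gbit/s on the wire, of which
// 14.4 Gbit/s is video data (10 bits are sent for every 8 bits of data).
constexpr std::uint64_t kTmdsPayloadBps = 14400000000ULL;

// 4:4:4 carries three full-rate components per pixel; 4:2:0 averages 1.5.
constexpr std::uint64_t bits_per_pixel_444(std::uint64_t bpc) { return 3 * bpc; }
constexpr std::uint64_t bits_per_pixel_420(std::uint64_t bpc) { return 3 * bpc / 2; }

// Data rate of the active picture only. Blanking is ignored, so a real
// link needs more than this lower bound.
constexpr std::uint64_t active_bps(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t refresh_hz,
                                   std::uint64_t bits_per_pixel) {
    return width * height * refresh_hz * bits_per_pixel;
}

// True if the uncompressed stream fits the assumed TMDS payload budget.
constexpr bool fits_tmds_uncompressed(std::uint64_t bps) {
    return bps <= kTmdsPayloadBps;
}
```

Under these assumptions, 3840x2160 at 120 Hz already exceeds the TMDS budget at 10 bpc 4:2:0 before blanking is counted, and 12 bpc 4:4:4 is more than twice the budget, which is consistent with the conclusion that such a mode needs FRL or heavy DSC.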
02:25 karolherbst: Lyude: if you give me a rough draft on how the code should look like you can throw it at me and I'll finish it or something
04:06 fdobridge: <redsheep> @zmike. Looks like display servers aren't quite vroom yet with !27867. I can confirm I can run my desktop on it though.
04:07 fdobridge: <redsheep> I fixed all of the stuff adding confounding variables, I have my own custom pkgbuilds now
04:14 fdobridge: <airlied> @zmike does that tess test leak a load of memory for you? in trying to fix that I've somehow made things worse
04:16 fdobridge: <redsheep> That ping didn't work, it needs the name selected in the ui
04:19 fdobridge: <redsheep> @gfxstrand Nice work on getting GPL working, as of 27860 I have very greatly improved framerate consistency in a variety of games. For instance, the witness isn't faster, but it's playable now because it's not so choppy.
04:19 fdobridge: <gfxstrand> 🥳
04:20 fdobridge: <airlied> @zmike. oops ^
04:24 fdobridge: <airlied> I think I've worked out the crash, shitty C++ ftw
04:25 fdobridge: <airlied> or rather I've worked out a problem in this code, that might elsewhere cause the crash
04:26 fdobridge: <airlied> I fixed the memory leak by delete[]-ing a ptr they new[] elsewhere. but they create a stack object run, allocate a pointer inside run to data, and push run into a vector, which copies it but keeps the data ptr; then the destructor runs twice, frees it twice, and kills the driver
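The pattern described above (a stack object owning a raw `new[]` pointer, copied into a vector, then destroyed twice) is a classic ownership bug. The following is a minimal sketch of one way to avoid it, not the actual CTS code or fix: `Run` and `collect_runs` are hypothetical names, and the technique shown is simply giving the allocation a single owner via `std::unique_ptr` so that the object can only be moved, never shallow-copied.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the "run" object from the discussion.
struct Run {
    std::size_t size = 0;
    std::unique_ptr<int[]> data;  // sole owner of the allocation

    explicit Run(std::size_t n) : size(n), data(new int[n]()) {}
    // unique_ptr deletes the copy operations, so a Run can only be moved.
    // A shallow copy that shares the pointer can no longer be made.
};

std::vector<Run> collect_runs(std::size_t count, std::size_t len) {
    std::vector<Run> runs;
    for (std::size_t i = 0; i < count; ++i) {
        Run run(len);                    // stack object, as in the description
        run.data[0] = static_cast<int>(i);
        runs.push_back(std::move(run));  // ownership moves into the vector
    }                                    // moved-from run holds nullptr here
    return runs;
}
```

With a raw pointer and a hand-written destructor, the copy stored in the vector and the stack original would both free the same memory. Here the destructor of the moved-from object has nothing left to free.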
05:14 fdobridge: <gfxstrand> CTS bug?
05:20 HdkR: What's the min-spec kernel for NVK?
05:21 fdobridge: <gfxstrand> 6.6
05:21 fdobridge: <gfxstrand> 6.7 if you want the GPU to go brrr
05:22 HdkR: Coolio, my Orin is on 6.7-rc5. I could enable Nouveau on it and find some Ada GPU to plonk in it
05:23 HdkR: Watch the fireworks
05:24 HdkR: RTX 4000 SFF would probably be the best choice
05:24 fdobridge: <airlied> @gfxstrand I don't think the one I found is the bug, but I think CTS has the same bug elsewhere maybe
05:25 fdobridge: <gfxstrand> Where was it? There's not much C++ in Zink and none in NVK
05:39 fdobridge: <airlied> the bug I found is in CTS, just likely CTS has other similar bugs
05:41 fdobridge: <gfxstrand> Ah, okay. Yeah, that's believable
05:41 fdobridge: <gfxstrand> I've found my share of those
05:41 fdobridge: <gfxstrand> There was one particularly fun case with C++ exceptions back in the pre-1.0 days. 😅
05:43 HdkR: Oh, the RTX 4000 SFF is priced very silly.
05:44 HdkR: I guess an RTX 3050 would be the next choice for PCIe powered
05:46 fdobridge: <airlied> just fyi i'll be travelling for a few days, so if I don't work out this kernel thing in the few hours I have tomorrow, it'll be a week or so 😛
05:46 fdobridge: <gfxstrand> 😢
05:47 fdobridge: <gfxstrand> Have you tried putting a spin lock around object lookup?
05:47 fdobridge: <orowith2os> HdkR: I've had the 2060 recommended to me in the past iirc, on par with the 3050 for its value too. I just need to find time to get one
05:47 fdobridge: <gfxstrand> That seems like a massive over-simplification but it really does seem to blow up there a lot.
05:48 fdobridge: <airlied> yeah I think it fixes that weird oops we see in page fault handling
05:48 fdobridge: <airlied> though I'm getting a lockdep with it
05:48 fdobridge: <gfxstrand> 😭
05:49 HdkR: @orowith2os Isn't bus powered, so that wouldn't work.
05:49 fdobridge: <airlied> I think the real problem is probably going to be some TLB type bullshit again
05:50 fdobridge: <orowith2os> HdkR: boo-womp
05:50 fdobridge: <gfxstrand> That sounds plausible
05:50 fdobridge: <gfxstrand> Sounds like you're making progress, anyway
05:50 fdobridge: <orowith2os> I present to thee: a higher capacity PSU
05:50 fdobridge: <airlied> yeah the fact we do an eviction and then never see any interrupts or fences again worries me, I'll add the gsp debug logging patch and see if Timur can spot anything
05:51 fdobridge: <gfxstrand> Yeah. I'm pretty sure I've seen that one, too. I see the map explosion a lot more frequently but sometimes the GPU seems to just disappear. Interrupts ending up in `/dev/null` would explain that.
05:52 HdkR: @orowith2os Orin doesn't take ATX PSUs. It gets power through a 100w USB-C port. I would need to rig some ATX power passthrough. I'd rather just buy a 3050 for a 4000 SFF.
05:52 HdkR: Buy a 3050 or a 4000 SFF*
05:52 fdobridge: <orowith2os> I like your funny words, magic man
05:53 fdobridge: <redsheep> @gfxstrand Just so you know, with my testing yesterday being kind of tainted I retested talos principle on mesa main just now, and with only !27840 applied on top, and I can confirm that it does strangely regress performance still. 74 > 62 fps
05:53 fdobridge: <orowith2os> I just hope it fits, if not I take a hammer and ankle grinder to the case, and call it a day
05:53 fdobridge: <gfxstrand> That's a bummer. Also very confusing as I don't know how that could possibly hurt performance. I believe you, though.
05:54 fdobridge: <orowith2os> I shattered my tempered glass panel so I've been riding a horribly hacked together plexiglass one for a hot minute lol
05:54 fdobridge: <orowith2os> a corner is literally broken off
05:54 fdobridge: <orowith2os> I still need to make a new one
05:54 fdobridge: <orowith2os> been a few years, I think
05:55 HdkR: I guess that's a good question as well. What uarch is getting hammered on the most in NVK today? Anything Volta+ is sort of similar, but if there are any dud cards I should avoid it would be nice to know upfront
05:56 HdkR: Not like I'm going to rip the Titan V out of my other PC for testing NVK though :P
05:57 fdobridge: <gfxstrand> If you're looking for a sure bet, this is my standard test card these days: https://www.amazon.com/dp/B0985X2YR1
05:57 fdobridge: <gfxstrand> I mean, if you want to become the new Volta maintainer... 😛
05:58 HdkR: Only if you're willing to poison the watering hole :P
05:58 HdkR: I can only do testing out of naive curiosity because of that
05:58 fdobridge: <gfxstrand> Oh, right...
05:58 fdobridge: <gfxstrand> I forgot you're NVIDIA poisoned
05:59 HdkR: NVIDIA pilled
05:59 fdobridge: <gfxstrand> 🤷🏻‍♀️
05:59 fdobridge: <orowith2os> Zink-, Wayland-, GNOME-, and Fedora-pilled
05:59 HdkR: Ideally I can just ensure that FEX happily works with NVK
06:00 fdobridge: <airlied> @gfxstrand we leak the cmd buffer descriptor push sets, and the query copies shader btw
06:00 fdobridge: <airlied> or at least the nir from it
06:04 fdobridge: <gfxstrand> Hrm... that's...odd. That all goes through the meta framework and that should clean up on device destroy. I'll have to look into that tomorrow.
06:04 fdobridge: <gfxstrand> Going to plug one into a Raspberry Pi or something?
06:05 fdobridge: <gfxstrand> I have a pi 5 and it has an M.2 slot...
06:07 HdkR: lol no
06:07 HdkR: Pi's PCIe is broken for GPUs
06:07 HdkR: I'm using the Orin board and just avoiding the integrated as much as physically possible
06:07 HdkR: Currently I have a Radeon Pro 7500 in it which works quite well
06:08 HdkR: All the RDNA3+RADV goodness
06:08 HdkR: It's just a shame NVIDIA hasn't made a new Tegra since the A78 cores in this thing are quite slow.
06:11 fdobridge: <gfxstrand> Yeah
06:11 fdobridge: <gfxstrand> nouveau.ko is currently busted on Tegra anyway.
06:13 HdkR: Yea, I wouldn't have expected you to support host1x yet
06:14 fdobridge: <redsheep> @gfxstrand I don't know for sure what happened with my testing yesterday where the witness only went from 33 > 35 fps, but today 27840 massively improves the witness from 31 > 53 fps, so clearly talos is not the whole story
06:15 fdobridge: <redsheep> I tested like 7 times because I didn't believe it but I confirmed in the dxvk hud that it's running on the commits I expect for my testing, and that patch is the only difference
06:17 fdobridge: <redsheep> It's possible that the nvidia driver would also benefit from not doing that on talos, but it's probably entirely pointless since it runs so fast over there
06:19 fdobridge: <airlied> @gfxstrand https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27870 fixes the first one
06:19 fdobridge: <redsheep> As for the difference, maybe something about the new patches in the last 30 hours means it works better in the witness? Or maybe it was using my accidentally enabled igpu and happened to have similar performance
06:23 fdobridge: <redsheep> Another much nastier area shows 29 > 45 fps improvement. The witness is now not only playable, but enjoyable. That 45 fps spot is the worst case I know of.
06:39 fdobridge: <redsheep> If we know it's what nvidia does it might be worth just going ahead with 27840, the 35-70% improvement dwarfs the 16% regression. I will do more testing though to see if I can spot a pattern or get a more general trend.
07:16 fdobridge: <georgeouzou> Congrats for the 1.3 conformance !
07:16 fdobridge: <georgeouzou> :triangle_nvk:
07:38 fdobridge: <!DodoNVK (she) 🇱🇹> I only have 256 MB
07:39 HdkR: Alright, got a GPU on order so I can test out NVK
07:51 dj-death: need more pci-e slots
07:56 fdobridge: <airlied> Need less pcie cards :-p
07:59 HdkR: Need more ARM boards
08:00 fdobridge: <airlied> I just need to test vendor hw return processes more 🙂
08:46 fdobridge: <zmike.> 🤦
08:46 fdobridge: <zmike.> Awesome find though
08:48 fdobridge: <airlied> Now your test list blows up in libdl so I've no idea what it's smoking
08:48 fdobridge: <airlied> But I should trawl CTS for more double frees
08:48 fdobridge: <zmike.> :fullheadache:
08:49 fdobridge: <zmike.> Maybe someday I'll be able to do a run on nvk without getting a truckload of random crashes
08:49 fdobridge: <zmike.> Probably about the same time I fix ms stencil fallback
09:00 fdobridge: <!DodoNVK (she) 🇱🇹> Are these crashes in XFB?
09:54 fdobridge: <prop_energy_ball> could you imagine a world if Google didn't ever touch the CTS
09:54 fdobridge: <prop_energy_ball> it could have been so beautiful
10:00 fdobridge: <fooishbar> you clearly don’t remember the CTS from before they did
10:00 fdobridge: <prop_energy_ball> i was probably like 11 or 12 at that time
10:00 fdobridge: <prop_energy_ball> just going into secondary school
10:00 fdobridge: <prop_energy_ball> =)
10:01 fdobridge: <fooishbar> let me tell you it was not better
10:03 fdobridge: <prop_energy_ball> terrifying
10:22 fdobridge: <zmike.> I don't know what you mean by "aren't quite vroom"
10:23 fdobridge: <zmike.> there's not much more that can be done for basic 2D rendering
10:27 fdobridge: <triang3l> It's insane how Faith and Karol all of a sudden just completely obliterated one of forever-standing fundamental invariants of graphics on Linux 🐸 Congrats :triangle_nvk:!!! :happy_gears:
11:01 fdobridge: <karolherbst🐧🦀> I haven't done all that much tho
11:41 FLHerne: prop_energy_ball: Pre-Google CTS was pretty much meaningless, there were 'conformant' drivers where various core features just didn't work
11:41 FLHerne: not edge cases, the whole extension
11:42 FLHerne: now it might be the other extreme :p
11:43 FLHerne: 2013 "conformance" https://community.khronos.org/t/how-is-opengl-es-conformance-awarded/4524/2
12:05 fdobridge: <Sid> I did the bisect wrong yesterday
12:15 fdobridge: <Sid> 042b5f83841fbf7ce39474412db3b5e4765a7ea7
12:15 fdobridge: <Sid> @airlied https://github.com/torvalds/linux/commit/042b5f83841fbf7ce39474412db3b5e4765a7ea7 seems to be the problematic commit
12:16 fdobridge: <esdrastarsis> @ttabi1
12:16 fdobridge: <Sid> I'm building current torvalds/master with it reverted just to be sure
12:16 fdobridge: <Sid> but that's what bisect spit at me
12:17 fdobridge: <Sid> here's the call trace that dmesg spits at me: https://paste.sidonthe.net/raw/mole-rat-goose
12:18 fdobridge: <Sid> and that results in a full system lockup if I try to do anything using drm
12:19 fdobridge: <Sid> sddm, trying to launch sway from a tty, both freeze and don't let me change to another tty
12:19 fdobridge: <Sid> and if the call trace is in the dmesg, the system requires a hard reset and does not shut down with a simple `sudo poweroff`
12:20 fdobridge: <Sid> fwiw, prime setup (1660Ti laptop)
12:27 fdobridge: <Sid> ...actually
12:28 fdobridge: <Sid> I should probably have just built the tag where I marked the first git bisect bad with this reverted
12:28 fdobridge: <Sid> because afaik runpm=0 also results in a very similar call trace on rc6
12:28 fdobridge: <Sid> but, if this truly is the culprit, I guess both will be solved...
12:42 fdobridge: <Sid> can confirm
12:42 fdobridge: <Sid> that's the culprit
12:43 fdobridge: <Sid> just booted into torvalds/master with it reverted and could log into sway just fine
12:43 fdobridge: <Sid> nothing in the dmesg either
12:44 fdobridge: <Sid> even with runpm=0
12:52 fdobridge: <Sid> I feel like we can't do this on a prime setup
12:52 fdobridge: <Sid> https://github.com/torvalds/linux/commit/042b5f83841fbf7ce39474412db3b5e4765a7ea7#diff-e61b541a31cc0b5ba2f1f0012024fff1eb8d2ce3fa1ad03cedde80afa89bb25dR1054-R1058
12:53 fdobridge: <Sid> maybe those buffers are needed to pull a gpu out of pci suspend
12:56 fdobridge: <Sid> do we have a way for the kernel to check if the gpu isn't driving the/a display?
13:17 fdobridge: <Sid> I was wrong those buffer releases are fine
13:19 fdobridge: <Sid> however these are not run at all
13:20 fdobridge: <Sid> <https://github.com/torvalds/linux/commit/042b5f83841fbf7ce39474412db3b5e4765a7ea7#diff-e61b541a31cc0b5ba2f1f0012024fff1eb8d2ce3fa1ad03cedde80afa89bb25dR2166-R2169>
13:20 fdobridge: <Sid> we hit the runpm bug before those happen
13:27 fdobridge: <Sid> anyway, I feel like that's the furthest I can go right now, I'm gonna take a break and hope someone smarter than me comes and takes a look with all the info I've given
13:38 fdobridge: <tom3026> do you have a trace on what happens? think i just hit the same thing starting kde
13:40 fdobridge: <tom3026> https://gist.githubusercontent.com/gulafaran/427b797405ca11e67acbf25904b9e00f/raw/ca7e40dc4a04252585cc999d50215ce1430fcb55/crash
13:40 fdobridge: <Sid> ^
13:40 fdobridge: <tom3026> null ptr dereference after a bunch of GSP fails
13:40 fdobridge: <tom3026> yeah same
13:40 fdobridge: <Sid> same as mine, yeah
13:41 fdobridge: <Sid> am having dinner, then gonna go see if I can fix it
13:41 fdobridge: <tom3026> @airlied ^
13:41 fdobridge: <Sid> dave and timur have already been pinged
13:44 fdobridge: <tom3026> ah okay
13:48 fdobridge: <Sid> I think I know why it's happening
13:48 fdobridge: <Sid> at a very surface level
13:54 fdobridge: <tom3026> my short fast assumption would be its freeing those things when entering D3 but isnt allocating them again or it shouldnt free them on D3 but it is, and poof null ptr deref
13:58 fdobridge: <rhed0x> why doesnt NVK support STORAGE_BIT for VK_FORMAT_A8_UNORM?
13:58 fdobridge: <rhed0x> pretty sure the proprietary driver does support that
14:03 fdobridge: <triang3l> Touching Mesa's common format code is kinda scary btw because that affects EVERYONE 🥵 R4G4_UNORM can be supported trivially in Terakan and probably RADV, but it doesn't exist in u_format.csv, and I don't have all the hardware in the world for testing 😄
14:04 fdobridge: <rhed0x> what does that have to do with mesa common code?
14:04 fdobridge: <triang3l> but probably nothing
14:04 fdobridge: <rhed0x> support is decided in nil_format.c
14:04 fdobridge: <triang3l> but probably nothing (at least for formats that already exist there) (edited)
14:05 fdobridge: <redsheep> Took the wording from the MR :p
14:05 fdobridge: <redsheep> I don't have a good way to measure the performance of the plumbing for the session itself, but it's still not really fast enough to use... If all the zink bits are in place it's probably NVK performance
14:07 fdobridge: <Sid> this is correct
14:07 fdobridge: <Sid> and also what I'd thought
14:07 fdobridge: <Sid> I might be able to fix it myself
14:09 fdobridge: <Sid> now there's two ways I could fix this...
14:10 fdobridge: <Sid> either we could just not release the buffer that suspend requires
14:10 fdobridge: <Sid> or we could release it and re-initialize it only when suspend is needed
14:18 fdobridge: <Sid> hehehehehe silly `GUD USB Display (DRM_GUD) [N/m/y/?] n`
14:19 fdobridge: <tom3026> or there is some double freeing going on
14:19 fdobridge: <tom3026> the gsp complains way before the trace happens
14:20 fdobridge: <Sid> ?
14:21 fdobridge: <Sid> just doing a quick and dirty test to see if it's actually the buffer that I think it is
14:22 fdobridge: <Sid> that causes the null ptr deref
14:22 fdobridge: <Sid> even though the code suggests it is
14:22 fdobridge: <Sid> I'd like to test it myself
14:22 dakr: karolherbst, regarding [1], I think [2] should fix it. This problem is in v6.7 only, since the scheduler part was re-worked for v6.8.
14:22 dakr: [1] https://gist.githubusercontent.com/karolherbst/a20eb0f937a06ed6aabe2ac2ca3d11b5/raw/9cd8b1dc5894872d0eeebbee3dd0fdd28bb576bc/gistfile1.txt
14:22 dakr: [2] https://paste.centos.org/view/198e28d4
14:26 karolherbst: dakr: okay, we can have a 6.7 only fix, I'll test it next week when I get back to the VM_BIND stuff
14:26 karolherbst: dakr: I might also need VM_BIND support for the old pushbuf exec ioctls, but like in a restricted way so it can work, e.g. no relocs or anything fancy
14:26 karolherbst: but that's something I still need to figure out with airlied_
14:27 karolherbst: porting the gl driver to the new exec uapi might be feasible, but it might also not be worth the effort if it can be supported by the kernel easily
14:27 fdobridge: <tom3026> well i know what line is being null ptr btw
14:28 fdobridge: <Sid> same same
14:28 fdobridge: <Sid> I'm just not sure if we should be releasing wpr_meta
14:28 fdobridge: <Sid> let me see if the open kernel modules do that..
14:29 fdobridge: <Sid> ..tbh
14:29 fdobridge: <Sid> isn't wpr write protect region
14:30 fdobridge: <tom3026> wpr_meta freed here https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c#L1058 , r535_gsp_fini called and when suspend is true https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c#L1998
14:30 fdobridge: <tom3026> null
14:30 fdobridge: <Sid> yeah I saw that too
14:30 fdobridge: <Sid> and the relevant header for wpr_meta says it is used for suspend/resume
14:30 fdobridge: <Sid> drivers/gpu/drm/nouveau/include/nvrm/535.113.01/nvidia/arch/nvalloc/common/inc/gsp/gsp_fw_wpr_meta.h
14:31 fdobridge: <Sid> L64
14:31 fdobridge: <Sid> not freeing that buffer does fix the issue, but
14:31 fdobridge: <Sid> I'm not sure if that's the best approach
14:31 fdobridge: <Sid> since the only other place the buffer is used is for handling suspend
14:32 fdobridge: <Sid> meaning it is very well possible to just reinitialize for suspend, and free after resume
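The "reinitialize for suspend, free after resume" idea can be sketched in a few lines. This is a toy model under stated assumptions, not nouveau code: `GspModel`, `boot`, `release_after_boot`, `suspend`, `resume`, and `kMetaSize` are all hypothetical, and `wpr_meta` here is just a byte buffer standing in for the real structure.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical model of a driver object; not the real nouveau structures.
struct GspModel {
    std::unique_ptr<std::vector<unsigned char>> wpr_meta;  // null once released
};

constexpr std::size_t kMetaSize = 256;  // arbitrary size for the sketch

void boot(GspModel &gsp) {
    gsp.wpr_meta = std::make_unique<std::vector<unsigned char>>(kMetaSize);
}

// Models the buffer being released once boot has finished.
void release_after_boot(GspModel &gsp) { gsp.wpr_meta.reset(); }

void suspend(GspModel &gsp) {
    // Re-create the buffer on demand instead of dereferencing a null pointer.
    if (!gsp.wpr_meta)
        gsp.wpr_meta = std::make_unique<std::vector<unsigned char>>(kMetaSize);
    (*gsp.wpr_meta)[0] = 1;  // stand-in for writing suspend state
}

// After resume the buffer is no longer needed and can be dropped again.
void resume(GspModel &gsp) { gsp.wpr_meta.reset(); }
```

The alternative from the discussion, never releasing the buffer, amounts to deleting `release_after_boot` and the `reset()` in `resume`. That is simpler but keeps the memory allocated for the lifetime of the device.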
14:32 fdobridge: <tom3026> well free it in r535_gsp_booter_unload instead maybe
14:32 fdobridge: <tom3026> 😄
14:33 fdobridge: <Sid> ..wait
14:33 fdobridge: <tom3026> or whatever that last freeing is put it, seems github tricked me a bit
14:34 fdobridge: <tom3026> r535_gsp_dtor
14:35 fdobridge: <Sid> not freeing wpr_meta solves the suspend regression but still breaks drm
14:35 fdobridge: <Sid> can't run sway
14:36 fdobridge: <Sid> meaning there's another buffer that is being freed for no good reason
14:36 fdobridge: <tom3026> yeah reverting that commit kwin booted, vkcube runs but im having still the same spam of gsp "nouveau 0000:01:00.0: gsp: cli:0xc1d00002 obj:0x00730000 ctrl cmd:0x00731341 failed: 0x0000ffff"
14:36 fdobridge: <tom3026> so theres more stuff going on heh
14:39 fdobridge: <marysaka> so NV0073_CTRL_CMD_DP_AUXCH_CTRL failing?
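Mapping the logged `ctrl cmd:0x00731341` to an `NV0073_...` name relies on the class number being visible in the command value. The following is a small sketch, assuming the upper 16 bits of the command carry the class (which matches the `NV0073` prefix above). The helper names are hypothetical, and the meaning of the low 16 bits is deliberately left uninterpreted.

```cpp
#include <cstdint>

// Assumption for illustration: the upper 16 bits of a control command hold
// the class number, matching the NV0073 prefix quoted in the log.
constexpr std::uint32_t ctrl_cmd_class(std::uint32_t cmd) { return cmd >> 16; }

// The remaining low bits identify the command within that class; their
// exact layout is not interpreted here.
constexpr std::uint32_t ctrl_cmd_low_bits(std::uint32_t cmd) { return cmd & 0xffffu; }
```

For the value in the log, this yields class `0x0073` and low bits `0x1341`. The command name itself still has to be looked up in the corresponding control header.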
14:39 fdobridge: <Sid> that's a different issue, I don't have that
14:42 fdobridge: <tom3026> oh well time to shop some food for the little ones, and dinner. when im back sid i expect you have fixed all the bugs
14:42 fdobridge: <tom3026> 😄
14:43 fdobridge: <Sid> *all* the bugs?
14:43 fdobridge: <Sid> sweats
14:43 fdobridge: <Sid> I can fix the stuff that's caused by these buffers being freed, yeah
14:43 fdobridge: <mohamexiety> and here begins the story of "sid and how I got nerdsniped into becoming a kernel maintainer"
14:44 fdobridge: <Sid> I've joked about this before
14:45 fdobridge: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1212772559980994610/image.png?ex=65f30d76&is=65e09876&hm=10cc3594197e9012aeeed28a15eac0974a4de4b9d4a6895c3fb5c2b64275fb14&
14:45 fdobridge: <Sid> but yeah, looks like timur released buffers that are still in use for resume/suspend
14:46 fdobridge: <Sid> ...well, makes sense
14:46 fdobridge: <Sid> resume is doing the same init code
14:47 fdobridge: <Sid> I think I'm gonna go with the approach where we reinit the required buffers on suspend
14:52 fdobridge: <Sid> ...this is not as simple as I thought it'd be!
14:53 fdobridge: <Sid> it's easier to just not clear those buffers...
14:54 fdobridge: <redsheep> @gfxstrand I assume these kinds of results in the witness were more like what you had been hoping for. I think 27840 is probably worth the trade-off, I can't find other games or tests that it slows down
14:56 fdobridge: <gfxstrand> I need to spend more time on it
14:57 fdobridge: <gfxstrand> It's good to know that it's improving things but I need to better understand how/why
14:57 fdobridge: <redsheep> Hmm fair enough. I wouldn't be too surprised to find out it's one of those things Nvidia has game specific rules for
15:00 fdobridge: <Sid> not too hard to check if they do
15:02 fdobridge: <Sid> what's the executable name?
15:04 fdobridge: <redsheep> Talos.exe
15:04 fdobridge: <redsheep> https://steamdb.info/app/257510/config/
15:05 fdobridge: <!DodoNVK (she) 🇱🇹> I think that game has basically become a meme in Vulkan driver circles
15:10 fdobridge: <Sid> uhm
15:10 fdobridge: <Sid> so
15:10 fdobridge: <Sid> initializing the buffers in the suspend logic leads to `can't suspend`
15:10 fdobridge: <pac85> It was the first vk game wasn't it?
15:11 fdobridge: <pac85> It has some bugs regarding swapchain recreation
15:11 fdobridge: <Sid> @airlied @karolherbst I need guidance on how to do this
15:12 fdobridge: <Sid> https://github.com/torvalds/linux/commit/042b5f83841fbf7ce39474412db3b5e4765a7ea7 is freeing buffers that are also required for suspend/resume
15:13 fdobridge: <redsheep> One of the first couple at least, yeah
15:14 fdobridge: <Sid> should we just not free those, or should we re-initialize those when required
15:38 fdobridge: <tom3026> 0xOF those " gsp: cli:0xc1d00002 obj:0x00730000 ctrl cmd:0x00731341 failed: 0x0000ffff " spams like 8 lines everytime the gsp seems to suspend or resume if thats what its called, entering D3?
15:38 fdobridge: <tom3026> curiously this didnt happend on that ampere gpu hrm
15:45 fdobridge: <gfxstrand> Adding a `PIPE_FORMAT` enum is pretty safe. No one is going to have it map to anything in their driver so it won't magically turn anything on.
15:46 fdobridge: <gfxstrand> Because the current tables say it's unsupported. I have no idea whether the hardware supports it or not. I guess if the prop driver advertises support, it probably does.
15:46 fdobridge: <rhed0x> which table says that?
15:47 fdobridge: <gfxstrand> `nil_format.c` which I pretty much directly copied+pasted from the old nouveau driver.
15:48 fdobridge: <rhed0x> D3D12 has mandatory A8 UAVs too, so it pretty much has to support A8 storage
15:49 fdobridge: <Sid> oh
15:49 fdobridge: <triang3l> From what I understand about how it's supposed to work (at least in AMD drivers AFAIK), everyone should automatically get mappings of PIPE_FORMATs to hardware formats based on component sizes, ordering and numeric types
15:50 fdobridge: <Sid> I'm dumb
15:50 fdobridge: <Sid> the solution is to just dealloc the two required buffers at driver unload instead of postinit
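The lifetime problem discussed above can be illustrated with a toy model: a buffer that suspend still needs is either freed right after boot (the regression) or kept until unload (the fix Sid describes). `ToyGsp` and `try_suspend` are hypothetical names; only `wpr_meta` comes from the chat, and nothing here mirrors the real nouveau structures.

```python
class ToyGsp:
    """Toy stand-in for a driver object; not the real nouveau code."""

    def __init__(self, free_early):
        self.free_early = free_early
        # Buffer allocated at init; the name mirrors wpr_meta from the chat.
        self.wpr_meta = bytearray(16)

    def postinit(self):
        if self.free_early:
            self.wpr_meta = None  # models freeing right after boot

    def suspend(self):
        if self.wpr_meta is None:
            raise RuntimeError("suspend needs wpr_meta, but it was freed")
        return "suspended"

    def unload(self):
        self.wpr_meta = None  # models freeing at driver unload instead


def try_suspend(free_early):
    """Boot the toy device, attempt one suspend, then unload it."""
    gsp = ToyGsp(free_early)
    gsp.postinit()
    try:
        return gsp.suspend()
    except RuntimeError as err:
        return f"error: {err}"
    finally:
        gsp.unload()


print(try_suspend(free_early=True))
print(try_suspend(free_early=False))
```

In the toy, moving the free from `postinit` to `unload` is the whole fix; the real patch may differ in which buffers it keeps.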
15:51 fdobridge: <gfxstrand> Yeah, I'm pretty sure that's just AMD. Every other driver I'm aware of has an explicit table.
15:51 fdobridge: <gfxstrand> Oh, cool. Yeah, we may as well turn it on and see if anything explodes
16:25 dakr: airlied_, regarding the nvkm client/object rbtree, the root of the tree is per client. For userspace clients that seems to be fine, since it's only accessed through client init/fini and through abi16, guarded by the client mutex.
16:26 dakr: For drm->client it's accessed in the display init path and in some atomic modeset callbacks.
16:27 dakr: This seems fine as well, but rather hard to say without checking every path (and there are quite some) in detail.
16:28 fdobridge: <tom3026> two? isnt there just one null ptr deref?
16:29 fdobridge: <Sid> for suspend, yes
16:29 fdobridge: <Sid> resume causes another :D
16:29 fdobridge: <tom3026> oh 😄
16:29 fdobridge: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1212798936285319288/0001-drm-nouveau-keep-DMA-buffers-required-for-suspend-re.patch?ex=65f32607&is=65e0b107&hm=b97f9bf2ecb494b10f1b891e1ecfdababf21ef25890ae9ab35353f245005457b&
16:30 fdobridge: <gfxstrand> dakr: It's also accessed by fault handlers, annoyingly.
16:30 dakr: Either way, it seems a bit broken that this API implicitly expects that upper layers take care of concurrency.
16:30 fdobridge: <Sid> this patch on top of rc6 should fix the issues @tom3026
16:31 fdobridge: <tom3026> yeah its applicable to 6.7.6 too
16:31 dakr: gfxstrand, didn't catch that one, that's for sure broken then. Have a pointer for me?
16:31 fdobridge: <Sid> it should be
16:31 fdobridge: <Sid> assuming 6.7.6 has fixes backported from rc6
16:31 fdobridge: <tom3026> yeah it has that commit backported that causes it
16:32 fdobridge: <gfxstrand> dakr: Here's a backtrace:
16:32 fdobridge: <gfxstrand>
16:32 fdobridge: <gfxstrand> [20613.118596] ? die_addr+0x36/0x90
16:32 fdobridge: <gfxstrand> [20613.118609] ? exc_general_protection+0x1dd/0x450
16:32 fdobridge: <gfxstrand> [20613.118622] ? asm_exc_general_protection+0x26/0x30
16:32 fdobridge: <gfxstrand> [20613.118630] ? nvkm_object_search+0x1d/0x70 [nouveau]
16:32 fdobridge: <gfxstrand> [20613.118929] nvkm_ioctl+0xa1/0x250 [nouveau]
16:32 fdobridge: <gfxstrand> [20613.119228] nvif_object_map_handle+0xc8/0x180 [nouveau]
16:32 fdobridge: <gfxstrand> [20613.119516] nouveau_ttm_io_mem_reserve+0x189/0x2e0 [nouveau]
16:32 fdobridge: <gfxstrand> [20613.119890] ttm_bo_vm_fault_reserved+0xa7/0x3b0 [ttm]
16:32 fdobridge: <gfxstrand> [20613.119919] ? mutex_lock+0x12/0x30
16:32 fdobridge: <gfxstrand> [20613.119926] nouveau_ttm_fault+0x69/0xa0 [nouveau]
16:32 fdobridge: <gfxstrand> [20613.120292] __do_fault+0x32/0x120
16:32 fdobridge: <gfxstrand> [20613.120300] do_fault+0x271/0x490
16:32 fdobridge: <gfxstrand> [20613.120305] __handle_mm_fault+0x81e/0xe50
16:32 fdobridge: <gfxstrand> [20613.120315] handle_mm_fault+0x17f/0x360
16:32 fdobridge: <gfxstrand> [20613.120321] do_user_addr_fault+0x1e2/0x670
16:32 fdobridge: <gfxstrand> [20613.120330] exc_page_fault+0x7f/0x180
16:32 fdobridge: <gfxstrand> [20613.120336] asm_exc_page_fault+0x26/0x30
16:32 fdobridge: <Sid> yeah, will work then
16:33 fdobridge: <Sid> oh no, I'm too late
16:33 fdobridge: <Sid> drm is in feature freeze
16:33 fdobridge: <Sid> gonna have to submit to drm-misc-next-fixes
16:33 fdobridge: <Sid> sad
16:34 fdobridge: <tom3026> well it isnt exactly a feature, its kinda fixing a complete broken state :p
16:34 fdobridge: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1212800106840072252/drm-misc-commit-flow.png?ex=65f3271e&is=65e0b21e&hm=8d9af42a69e37c145f36bbc44a2e35c302a0297e9de7cc8f5b41a00270a4c5fa&
16:34 fdobridge: <Sid> from drm-misc committer guidelines
16:34 fdobridge: <Sid> oh wait
16:35 fdobridge: <Sid> I can give it to misc-fixes nvm am dumb
16:35 fdobridge: <gfxstrand> Unless you're the one merging the patch, you don't need to worry too much about it. If you send it to the wrong list, the maintainer will help you figure it out.
16:36 fdobridge: <Sid> I still need to figure out how to send it to this list e-e
16:39 fdobridge: <Sid> or even which list to send it to
16:43 fdobridge: <gfxstrand> I'd send it to drm-misc with `PATCH drm-misc-fixes` and let the maintainers sort it out
16:45 fdobridge: <Sid> so if I'm getting this right, I have to `git format-patch` my commit, and attach the generated patch to a plaintext email, yeah?
16:46 fdobridge: <Sid> there's so little information on how to correctly send a patch :\
16:48 fdobridge: <gfxstrand> `git send-email` is the usual way
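A rough sketch of the two-step flow mentioned above, building the `git format-patch` and `git send-email` command lines without running them. The output directory, patch file name, and recipient address are placeholders, not values from the chat; the subject prefix is the one gfxstrand suggests.

```python
import shlex


def format_patch_cmd(prefix="PATCH drm-misc-fixes", outdir="outgoing"):
    """Command that formats only the most recent commit as a patch file."""
    # -1 selects the last commit; --subject-prefix sets the [...] subject tag.
    return ["git", "format-patch", "-1", f"--subject-prefix={prefix}", "-o", outdir]


def send_email_cmd(patch_path, to_addr):
    """Command that mails one patch file to the given address."""
    return ["git", "send-email", f"--to={to_addr}", patch_path]


# Placeholder path and address; substitute the real list address.
commands = [
    format_patch_cmd(),
    send_email_cmd("outgoing/0001-example.patch", "list@example.org"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

The commands are only printed here; `git send-email` also needs SMTP settings in the git config before it can deliver anything.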
16:55 fdobridge: <gfxstrand> CTS looks happy with enabling it. I'm going to scan through and make sure it's doing actual storage tests and then flip it on.
16:55 fdobridge: <rhed0x> nice
16:55 fdobridge: <gfxstrand> Thanks for pointing that out!
16:57 fdobridge: <rhed0x> now I just need to figure out why dxvk used to be broken when actually using A8
16:57 fdobridge: <rhed0x> (not on NVK)
17:01 fdobridge: <gfxstrand> Hrm... Looks like the CTS for it is spotty at best. Maybe I just need to search harder for test names?
17:10 fdobridge: <tom3026> https://github.com/doitsujin/dxvk/commit/5828f0e2b9b233b32f7b5edb54dc4e04014d3b55 ?
17:10 fdobridge: <rhed0x> i know
17:10 fdobridge: <tom3026> Native A8 breaks Crysis 2/3 Remastered for unknown reasons.
17:10 fdobridge: <tom3026> 😄
17:10 fdobridge: <rhed0x> the question is why it breaks Crysis 2 Remastered
17:11 fdobridge: <rhed0x> (which I can't reproduce after reverting it btw)
17:11 fdobridge: <tom3026> perhaps is HW based
17:11 fdobridge: <rhed0x> apparently it happened on radv too...
17:11 fdobridge: <tom3026> oh
17:13 fdobridge: <tom3026> seems @zmike might know seeing his love letter to VK_FORMAT_A8_UNORM_KHR in https://www.supergoodcode.com/yep/
17:14 fdobridge: <waelunix> Are there any significant differences between open driver versions "nvidia-open-6.7.6-535.43.28" and "nvidia-open-6.7.6-545.29.06" in respect to nvk?
17:14 fdobridge: <rhed0x> neither works with nvk
17:14 fdobridge: <waelunix> I'm trying to get Mesa 24.1 working on my NixOS install and the latest open driver is broken in nixpkgs
17:15 fdobridge: <rhed0x> NVK needs nouveau, not the open source kernel module by nvidia
17:15 fdobridge: <waelunix> Ok, so i wasn't doing anything wrong then. I was afraid I had to compile the open driver _as well_
17:15 fdobridge: <waelunix> Thanks!
17:28 fdobridge: <pavlo_it_115> https://github.com/NVIDIA/open-gpu-kernel-modules/discussions/603
17:28 fdobridge: <pavlo_it_115>
17:29 fdobridge: <pavlo_it_115> Please support this discussion with a comment. NVIDIA needs to pay attention to this! It is very important.
17:31 fdobridge: <Sid> uhm
17:31 fdobridge: <Sid> well...
17:33 fdobridge: <Sid> GSP already supports cards that go as far back as 6 years old
17:33 fdobridge: <waelunix> Are there any kernel cli args i'm supposed to give when running on Turing? (TU116 / GTX 1660 super)
17:33 fdobridge: <esdrastarsis> yeah, turing is old enough
17:34 fdobridge: <esdrastarsis> let pascal die in hell
17:35 fdobridge: <waelunix> Huh, i'm not sure why I passed `nouveau.debug`, is that why i'm not getting graphical display?
17:36 fdobridge: <waelunix> that and `nouveau.config=NvGspRm=1` + `gsp=debug`
17:44 fdobridge: <waelunix> that and `gsp=debug` (edited)
17:44 fdobridge: <waelunix> Actually, maybe it's something else. Is `nvidiafb` supposed to be loaded as well?
17:44 fdobridge: <waelunix> I have terminal output but no wayland compositor is working.
17:44 fdobridge: <redsheep> Nothing Nvidia should load
17:45 fdobridge: <karolherbst🐧🦀> what's `dmesg` saying?
17:47 fdobridge: <waelunix> Hmm, i have to type it out but i see a couple of things of note.
17:47 fdobridge: <waelunix> ```
17:47 fdobridge: <waelunix> nouveau: vgaarb: deactivate vga console
17:47 fdobridge: <waelunix> nouveau: NVIDIA TU116 (168000a1)
17:47 fdobridge: <waelunix> nouveau: bios: version 90.16.48.00.94
17:47 fdobridge: <waelunix> nvidia-gpu: i2c timeout error e0000000
17:47 fdobridge: <waelunix> [drm] Initialized nouveau 1.4.0 20120801 for 0000:01:00.00 on minor 0
17:47 fdobridge: <waelunix> nouveau: [drm] fb0: nouveaudrmfb frame buffer device
17:47 fdobridge: <waelunix> nouveau: DRM: Disabling PCI power management to avoid bug
17:47 fdobridge: <waelunix> ```
17:48 fdobridge: <waelunix> Both niri and Hyprland fail to open. GDM shows a white screen with "Oh no! something has gone wrong." in the middle. I'm assuming both are because it's not getting a GLES context?
17:49 fdobridge: <waelunix> btw, i'm running nouveau (nvk only) + zink. no nouveau gallium
17:49 fdobridge: <karolherbst🐧🦀> mhhh...
17:49 fdobridge: <karolherbst🐧🦀> maybe zink fails to do things? dunno
17:49 fdobridge: <karolherbst🐧🦀> does it work with the gallium driver?
17:49 fdobridge: <Sid> latest kernel has a regression
17:49 fdobridge: <waelunix> I'm on 6.7.6
17:50 fdobridge: <Sid> yes I know what I said
17:50 fdobridge: <waelunix> I thought you meant 6.8-rc?
17:50 fdobridge: <Sid> requires this patch
17:50 fdobridge: <Sid> anything after rc4 needs it
17:50 fdobridge: <pavlo_it_115> Cruelly. Pascal can still produce something adequate for its price. Just support the discussion. (edited)
17:50 fdobridge: <Sid> as does anything after 6.7.4, since stable gets backported fixes from rc
17:50 fdobridge: <waelunix> Ok, i'll try compiling with that patch on
17:51 fdobridge: <Sid> I'm trying to get the patch submitted to the ml
17:51 fdobridge: <redsheep> I haven't managed to make a Wayland session work on NVK+zink yet either, only x11
17:51 fdobridge: <karolherbst🐧🦀> yeah.. I suspect it's something with zink
17:51 fdobridge: <karolherbst🐧🦀> the compositors probably don't pick it up, or mesa is silly or something
17:51 fdobridge: <Sid> btw hdzki are you on a laptop by any chance
17:52 fdobridge: <waelunix> No, desktop with intel igpu and TU116 as nvidia dgpu
17:52 fdobridge: <Sid> yeah, multi-gpu setup, so you do need that patch
17:52 fdobridge: <redsheep> Maybe I just need to carry the one liner to enable modifiers?
17:52 fdobridge: <Sid> assuming your igpu is driving the display
17:52 fdobridge: <karolherbst🐧🦀> you probably need this: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27628
17:52 fdobridge: <redsheep> Could try disabling igpu
17:52 fdobridge: <waelunix> Wasn't that merged? I'm pulling in mesa master
17:53 fdobridge: <karolherbst🐧🦀> and `NOUVEAU_USE_ZINK=1` set
17:53 fdobridge: <waelunix> I'll try this after compiling the kernel 👍
17:53 fdobridge: <karolherbst🐧🦀> here it matters what's considered the "main" device, but that's probably your dGPU
17:54 fdobridge: <karolherbst🐧🦀> but yeah.. without `NOUVEAU_USE_ZINK` you'll need the gl driver as that's loaded by default
17:54 fdobridge: <karolherbst🐧🦀> and you might need the gl driver for the intel gpu, unless you want to use zink there as well
17:55 fdobridge: <karolherbst🐧🦀> though that should just make the iGPUs outputs unavailable
17:55 fdobridge: <waelunix> I wanna see how zink's faring which is why i'm disabling everything gallium except swrast and the bare minimum on nixos
17:56 fdobridge: <karolherbst🐧🦀> right...
17:56 fdobridge: <karolherbst🐧🦀> the thing is, that won't work on the intel side then
17:57 fdobridge: <waelunix> Yeah, no problem 👍 I was mainly thinking of nvk
17:57 fdobridge: <karolherbst🐧🦀> maybe `MESA_LOADER_DRIVER_OVERRIDE=zink` would also work?
17:57 fdobridge: <karolherbst🐧🦀> then it loads zink for every device
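A small sketch of the environment setup described above: `NOUVEAU_USE_ZINK=1` for the nouveau device, with `MESA_LOADER_DRIVER_OVERRIDE=zink` as the optional all-devices variant. The helper name `zink_env` and its parameters are illustrative assumptions; only the two variable names and values come from the chat.

```python
import os


def zink_env(base=None, override_loader=False):
    """Return a copy of the environment with the zink-related variables set."""
    env = dict(os.environ if base is None else base)
    # Ask Mesa to use zink instead of the nouveau GL driver.
    env["NOUVEAU_USE_ZINK"] = "1"
    if override_loader:
        # Alternative from the chat: load zink for every device.
        env["MESA_LOADER_DRIVER_OVERRIDE"] = "zink"
    return env


env = zink_env(base={}, override_loader=True)
for name in sorted(env):
    print(f"{name}={env[name]}")
```

The resulting dict can be passed as `env=` to `subprocess.run` when launching a compositor or a test client, so both see the same settings.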
17:59 fdobridge: <Sid> patch submitted to dri-devel
18:00 fdobridge: <esdrastarsis> There is nothing to discuss, Nvidia will not release this firmware, they only released the GSP, because they put most of the proprietary driver inside it, openrm is only used to "glue" the userspace with the GSP firmware
18:01 fdobridge: <waelunix> Ok i managed to get output from niri.
18:01 fdobridge: <waelunix> ```
18:01 fdobridge: <waelunix> MESA-LOADER: failed to open nouveau: /path/to/dri/nouveau_dri.so
18:01 fdobridge: <waelunix> ```
18:01 fdobridge: <waelunix> I may have misconfigured something from nixos' side 😅
18:01 fdobridge: <waelunix> I can confirm that `/run/opengl-driver/lib/dri` doesn't have `nouveau`, `kms_swrast`, nor `swrast`
18:02 fdobridge: <waelunix> similarly for zink
18:03 fdobridge: <waelunix> Huh, it did build it. Why isn't it hard-linking correctly 🤨
18:09 fdobridge: <karolherbst🐧🦀> yeah... if zink isn't there it might not even try to load it
18:13 fdobridge: <waelunix> If it turns out to be this i'm losing it 😂
18:13 fdobridge: <waelunix> ```diff
18:13 fdobridge: <waelunix> -- hardware.opengl.package = mkDefault pkgs.mesa;
18:13 fdobridge: <waelunix> ++ hardware.opengl.package = mkDefault pkgs.mesa.drivers;
18:13 fdobridge: <waelunix> ```
18:15 fdobridge: <waelunix> Ok that fixed the dri folder issue. But i'm getting a different one now. Why is it loading the old drivers?
18:15 fdobridge: <waelunix> ```
18:15 fdobridge: <waelunix> DRI driver not from this Mesa build ('24.1.0-devel' vs '24.0.1')
18:15 fdobridge: <waelunix> failed to bind extensions
18:15 fdobridge: <waelunix> ```
18:15 fdobridge: <waelunix> I'll see if restarting helps
18:15 fdobridge: <karolherbst🐧🦀> 🥲 I see you are all having fun with nix
18:16 fdobridge: <waelunix> average nix user evening
18:16 fdobridge: <waelunix> i wouldn't recommend it if you want to be anything productive
18:16 fdobridge: <karolherbst🐧🦀> yeah.. that was my conclusion as well...
18:16 fdobridge: <karolherbst🐧🦀> I used to run gentoo with `USE=-*`.....
18:17 fdobridge: <waelunix> Oh it worked? GDM doesn't crash and GNOME opens
18:17 fdobridge: <karolherbst🐧🦀> but at some point in your life, you value your time a bit higher than previously
18:17 fdobridge: <waelunix> But it doesn't look accelerated 😟
18:17 fdobridge: <waelunix> yup, llvmpipe
18:17 fdobridge: <karolherbst🐧🦀> what env var are you using?
18:17 fdobridge: <karolherbst🐧🦀> and is it even set?
18:18 fdobridge: <waelunix> ```MESA_LOADER_DRIVER_OVERRIDE=zink glxgears
18:18 fdobridge: <waelunix> DRI3 not available
18:18 fdobridge: <waelunix> failed to load driver: zink
18:18 fdobridge: <waelunix> Error: couldn't get an RGB, Double-buffered visual
18:18 fdobridge: <waelunix> ```
18:18 fdobridge: <karolherbst🐧🦀> you need it set for the compositor as well
18:18 fdobridge: <karolherbst🐧🦀> so like... best do it globally
18:19 fdobridge: <pavlo_it_115> @waelunix try wayland session
18:19 fdobridge: <pavlo_it_115> this works for me
18:19 fdobridge: <waelunix> This is wayland session...
18:19 fdobridge: <waelunix> actually, no??
18:19 fdobridge: <pavlo_it_115> https://tenor.com/view/eyes-chew-chewing-big-eyes-wide-eyes-gif-8128904
18:19 fdobridge: <karolherbst🐧🦀> anyway.. compositors also try to load the GL driver and if they can't load it, they fall back to llvmpipe
18:20 fdobridge: <karolherbst🐧🦀> and if they fall back, then uhm.. you can't use GL inside
18:20 fdobridge: <karolherbst🐧🦀> and some like gdm _might_ fall back to X if they can't load gl for the GPU they are using
18:22 fdobridge: <waelunix> Ok, so i tried manually launching gnome wayland. Turns out the MESA Loader wasn't gone after all
18:25 fdobridge: <waelunix> Yup, it can't find a GPU
18:25 fdobridge: <waelunix> ```
18:25 fdobridge: <waelunix> Failed to setup: No GPUs found: Failed to open gpu `/dev/dri/card0`: Failed to initialize render device for /dev/dri/card0: Failed to create gbm device: Illegal seek, EGLStream render device requires an EGL display
18:25 fdobridge: <waelunix> ```
18:25 fdobridge: <waelunix> I'm guessing it's because i'm somehow loading 2 different versions of Mesa at the same time
18:26 fdobridge: <Sid> I still think it's trying to render on igpu
18:27 fdobridge: <Sid> either disable that, or set the env var DRI_PRIME for it
18:27 fdobridge: <Sid> use the pci path method
18:27 fdobridge: <Sid> https://wiki.archlinux.org/title/PRIME#For_open_source_drivers_-_PRIME
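A hedged sketch of the PCI path form of `DRI_PRIME` mentioned above: the value is the device's PCI address with `:` and `.` replaced by `_` and a `pci-` prefix. The function name `dri_prime_tag` is made up for illustration, and the address used is only an example.

```python
import re

# Full PCI address: domain:bus:device.function, e.g. 0000:01:00.0
PCI_ADDRESS = re.compile(r"[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]")


def dri_prime_tag(pci_address):
    """Convert 'dddd:bb:dd.f' into the 'pci-dddd_bb_dd_f' form."""
    if not PCI_ADDRESS.fullmatch(pci_address):
        raise ValueError(f"not a full PCI address: {pci_address!r}")
    return "pci-" + re.sub(r"[:.]", "_", pci_address)


# Example address; take the real one from lspci -D on the machine.
print("DRI_PRIME=" + dri_prime_tag("0000:01:00.0"))
```

Naming the device by PCI path avoids depending on the order in which the GPUs were enumerated.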
18:30 fdobridge: <gfxstrand> "EGLStream render device"? That looks like you're getting the NVIDIA blob driver.
18:31 fdobridge: <waelunix> I forgot to blacklist the `nvidiafb` module despite `nouveau` being the active kernel driver 😔
18:31 fdobridge: <marysaka> if you are on nix can't you just entirely disable NVIDIA blob?
18:31 fdobridge: <marysaka> it shouldn't change anything about the derivations as it's already built anyway
18:32 fdobridge: <waelunix> I can, but there are many module settings. I'm guessing one of them is still building nvidia stuff
18:32 fdobridge: <waelunix> I'm not supposed to have nvidia modeSetting enabled i guess
18:32 fdobridge: <waelunix> That's the only nvidia thing still enabled in my config rn
18:37 fdobridge: <waelunix> That didn't help. How am I supposed to blacklist modules from the kernel cli again?
18:37 fdobridge: <Sid> just do `module_blacklist=nvidia` and it'll prevent loading any proprietary module
18:40 fdobridge: <waelunix> I don't have the proprietary driver "installed" anymore. That didn't remove `nvidiafb` 😟
18:41 fdobridge: <Sid> nvidiafb is part of the linux kernel
18:41 fdobridge: <waelunix> so it's likely not causing any issues
18:42 Sid127: correct
18:42 Sid127: nvidiafb: https://github.com/torvalds/linux/tree/master/drivers/video/fbdev/nvidia
18:46 fdobridge: <waelunix> Is there a flag for mesa debug logs ?
18:46 airlied: dakr: btw I did try locking the nvkm client lock around it, but I got lockdep warning
18:46 airlied: (note the nvkm client has a separate lock from the nouveau client)
18:47 Sid127: waelunix: I still think you should either disable the iGPU and drive the display with nouveau, or set the DRI_PRIME env var
18:48 fdobridge: <waelunix> Ok, it's getting weird. I'm not getting errors loading zink anymore in niri (it's not displaying anything) but neither Hyprland nor GNOME wayland open (edited)
18:48 fdobridge: <waelunix> dmesg shows that the dGPU is prime
18:48 Sid127: if niri isn't displaying anything, that means the dgpu is rendering to a port that isn't connected.
18:49 Sid127: dmesg might say that but without the env var, the drivers won't know how to handle it
18:57 fdobridge: <waelunix> Although rerunning glxgears does tell me that it still can't load zink for some reason
18:57 fdobridge: <waelunix> I might be setting the variable incorrectly. but that didn't change anything
18:57 fdobridge: <waelunix> I'm running through HDMI, and there's only 1 HDMI port on TU116
18:59 fdobridge: <waelunix> fuck it, i'm turning nouveau gallium back on
18:59 Sid127: ah, so your display is driven by the gpu already, hm
18:59 fdobridge: <waelunix> let's see if nvk runs at all
18:59 Sid127: if my experience with freeBSD is anything, mesa is separate from mesa-dri
18:59 Sid127: I'm not sure how nixOS handles that
19:00 fdobridge: <waelunix> I'm only setting dri and not dri32. could that be a part of it?
19:00 Sid127: I wouldn't know, sadly..
19:01 fdobridge: <waelunix> I guess there's only one way to find out 🙃
19:01 dakr: airlied, yeah, you're talking about the spinlock in struct nvkm_client, I guess.
19:02 Sid127: @waelunix what drivers are you building again?
19:02 fdobridge: <waelunix> mesa 24.1.0 (main) with `gallium3d=[swrast, radeonsi, zink]` and `vulkanDrivers=[swrast, nouveau]`
19:03 Sid127: ..why radeonsi
19:03 airlied: dakr: yup that one, I should look at the lockdep and see if it just needs a different lock
19:03 fdobridge: <waelunix> Because I had extra vdpau stuff enabled that needed radeonsi (ik it's totally not used)
19:03 Sid127: anyway, afaik nvk still relies on some nouveau gallium code, I think
19:03 Sid127: unless I'm understanding things wrong
19:03 fdobridge: <karolherbst🐧🦀> it doesn't
19:04 fdobridge: <karolherbst🐧🦀> it could also be, that the nouveau gallium driver needs to exist for stuff to work for silly reasons
19:04 Sid127: doesn't nvk still get modifiers from nouveau gl or something
19:04 Sid127: :>
19:04 fdobridge: <karolherbst🐧🦀> only if you use the nouveau gl driver, not if you use zink
19:04 Sid127: I see
19:04 airlied: dakr: actually maybe I just need to use irq versions
19:05 Sid127: and does zink implement modifiers for nvidia?
19:06 dakr: airlied, yeah, just looking at it and I don't really see why this should be a problem.
19:07 dakr: the lock is barely used and where it is, I don't see a conflict with using it for the rbtree.
19:07 airlied: yeah the lockdep warning is about nesting inside another lock using irq saverestore
19:07 airlied: WARNING: possible irq lock inversion dependency detected
19:16 fdobridge: <waelunix> I'm gonna go full nuclear; I'm gonna recompile ~~the universe~~ everything that uses mesa to ensure there's only one version being loaded
19:19 fdobridge: <Sid> @waelunix nouveau dri driver
19:19 fdobridge: <Sid> are you building that
19:19 MotorGold: Test #5 -- audible?
19:19 fdobridge: <Sid> oh wait nvm it's deprecated
19:20 fdobridge: <Sid> MotorGold solid copy
19:20 fdobridge: <pavlo_it_115> I tried using strace to find out what the nvidia installer is doing with the firmwares. There is nothing useful here
19:20 fdobridge: <pavlo_it_115> Maybe someone will be interested
19:20 fdobridge: <pavlo_it_115> https://cdn.discordapp.com/attachments/1034184951790305330/1212841765980078121/install_trace.txt?ex=65f34dea&is=65e0d8ea&hm=f6c56d43bca93f71528d180df24d35a52d1bb95faa54f6e8eaa965c62adfb3a4&
19:20 fdobridge: <waelunix> I'm like 99% sure it's because i'm somehow loading 2 versions of mesa at the same time
19:20 fdobridge: <waelunix> hopefully the nuclear option clarifies this
19:20 MotorGold: oh heavens, thanks fdobridge. There's no error on the webclient when routing from the nouveau page on freedesktop and I had to debug that it requires registration in here
19:21 Sid127: oh, huh
19:21 Sid127: also fwiw MotorGold fdobridge is the irc-discord bridge we use
19:21 fdobridge: <waelunix> Wait a sec, which feature needs `intel-rt`??
19:21 fdobridge: <waelunix> I'm not even specifying that as an option
19:21 fdobridge: <airlied> nice catch on the dtor fix, I'll grab the patch today!
19:21 fdobridge: <Sid> thanks!
19:22 fdobridge: <pavlo_it_115> also along the path /usr/lib/firmware/nvidia/ I selected some of the correct firmwares for me. I don't know what to do with them, I guess it is necessary to analyze them through the okteta
19:22 fdobridge: <pavlo_it_115> https://cdn.discordapp.com/attachments/1034184951790305330/1212842317556220025/63e85be220a9ed5f.png?ex=65f34e6e&is=65e0d96e&hm=b0cee8f5d5d51c39a771c4798c635b8a5687efbe2e2b09c1ab9cf400834506e1&
19:22 fdobridge: <Sid> also please avoid using the name associated with the email I sent it from, if it does come up e-e
19:22 fdobridge: <Sid> iirc lib32 enables it by default for some reason...
19:23 MotorGold: ah thanks Sid127 -- all new to me.
19:24 MotorGold: to wit -- I have an odd bug on my laptop where the external monitor is totally busted if I boot with HDMI in. What's the lowest level disconnect I can do to get the kernel to pick it back up and pass it back to nouveau? I appreciate any help debugging it in advance.
19:24 Sid127: gentle reminder to all my discord friends, please avoid editing your messages
19:25 MotorGold: oh that's what is happening; it's looping like crazy lol
19:25 fdobridge: <tom3026> When merged/fixed https://gitlab.freedesktop.org/drm/nouveau/-/issues/329
19:25 fdobridge: <airlied> @pavlo_it_115 nvidia won't care about the issue you opened, they are not going to put any effort into fixing it, we tried for years
19:26 fdobridge: <airlied> there already are a number of closed issues
19:26 Sid127: MotorGold: what GPU does it have? also afaik the lowest level disconnect you can do is throw it off the PCIe bus
19:26 Sid127: let me check how to do that, I forget...
19:27 MotorGold: Sid127 -- it's one of these https://tech-docs.system76.com/models/oryp8/README.html. I'll see if it's a 70 or 80
19:27 Sid127: got it! you need to do echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
19:27 Sid127: replace the pci ID with the one for your machine, of course
19:27 MotorGold: 01:00.0 VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3070 Mobile / Max-Q] (rev a1)
19:28 MotorGold: o7 lemme try that
19:28 Sid127: should be the same as my command then
19:28 MotorGold: and heyyy lucky, you got the right pci bus yeah lol
19:28 MotorGold: less backslashing for me
19:28 Sid127: after that you have to run echo 1 > /sys/bus/pci/rescan
19:28 MotorGold: AH! yesssss, it turned it off.
19:28 Sid127: to make the kernel pick it up again
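
The remove-then-rescan sequence Sid127 describes could be scripted roughly as below. This is only a sketch: the function name and the `sysfs_root` parameter are made up for illustration (the parameter exists so the logic can be tried against a scratch directory), it needs root on a real system, and the PCI address has to match your own machine.

```python
from pathlib import Path


def pci_remove_and_rescan(address: str, sysfs_root: str = "/sys") -> None:
    """Detach one PCI device, then ask the kernel to re-enumerate the bus.

    `address` is the full PCI address, e.g. "0000:01:00.0".
    `sysfs_root` is a hypothetical knob for dry runs; leave it at "/sys"
    to touch the real hardware.
    """
    bus = Path(sysfs_root) / "bus" / "pci"
    # Equivalent of: echo 1 > /sys/bus/pci/devices/<address>/remove
    (bus / "devices" / address / "remove").write_text("1")
    # Equivalent of: echo 1 > /sys/bus/pci/rescan
    (bus / "rescan").write_text("1")
```

On real hardware this would be `pci_remove_and_rescan("0000:01:00.0")`. As MotorGold reports further down, doing it a second time hung the machine, so treat it as a debugging aid rather than a fix.
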
19:29 MotorGold: YOOOO Sid127 what the hell yessssssssss it worrrrrrrks
19:29 fdobridge: <pavlo_it_115> Ok. I will forward the edited messages
19:29 fdobridge: <pavlo_it_115> But everything would not be so simple. They are encrypted somewhere, but I don't know where. And I don't know where to look for them
19:29 fdobridge: <pavlo_it_115> Maybe me need to take the very first nvidia installer for pascal and maybe there will be something there
19:29 MotorGold: ok... so ... now the hard part... what the heck does that do xD
19:29 Sid127: MotorGold: hm?
19:29 MotorGold: basically, if the HDMI is in, the KMS goes totally busted, bonkers, berserk. Needs a total remove to fix.
19:30 MotorGold: lemme grab some of the messages I was yeeting into the unregistered ether
19:30 Sid127: no idea...
19:30 MotorGold: "why would the syslog / initial tty fail to hand over the screen to nouveau + my-compositor? It's just stuck on the initial console. If I CTRL+ALT+F1, I can see four copies of my terminal prompt all overlaid and wrong. It's like the CRTC is wrong. xrandr doesn't work on wayland."
19:30 MotorGold: "I can see pixels changing on it when I move a window, but it's all within the top 25% of the monitor. and only columns of pixels -- very few. Exactly 8 columns of pixels, 6 rows. Then junk text above (syslog) and below (tty)"
19:30 Sid127: oh, that sounds like a compositor bug, maybe...
19:30 MotorGold: It's a total mess. Unplugging fixes nothing. Restarting gdm fixes nothing. Changing to the tty, again, shows four copies of my terminal. That's definitely a CRTC mismatch right?
19:31 MotorGold: EDID parses, btw. All good. Checksum, etc. Maybe not *right*, but at least syntactically valid
19:33 MotorGold: Sid127, if I disabled compositor entirely and reboot and it still looks totally screwed... definitely a KMS / nouveau issue, right?
19:33 MotorGold: (or lower)
19:34 MotorGold: I think it's KMS + firmware, tbh. I would just love to tackle it at a nouveau level or higher because I have faint hope of getting the other patches in.
19:35 Sid127: if the tty also looks wrong it's likely a kms/nouveau bug, yeah
19:35 Sid127: though it could also be with how your laptop exposes the gpu at init time and if the firmware does something wonky after
19:35 Sid127: laptops be weird
19:36 Sid127: if you could upload a dmesg (via a paste service) that'd be nice
19:36 MotorGold: I'm 100% certain it's the laptop firmware's fault. That I know.
19:36 MotorGold: On the initial fw revision, this didn't happen. Every fw update, it changes how broken it is, but it never fixes.
19:37 Sid127: by that you mean UEFI firmware, right?
19:37 MotorGold: hm. I'll be precise to my level of expertise: System76 loads a firmware onto my motherboard, to my knowledge, and that is faulty.
19:37 MotorGold: EEPROM something this that
19:37 Sid127: also if you need a paste service you can use mine: https://paste.sidonthe.net
19:38 Sid127: eeprom is the hardware memory where UEFI fw usually resides, yeah
19:38 MotorGold: cool. I just know there's ESP firmware and I know it's not that. That's static.
19:38 Sid127: uh
19:39 Sid127: ..I need to look into what firmware system76 ships exactly
19:39 Sid127: but yeah, a dmesg would be nice :>
19:39 MotorGold: https://github.com/pop-os/system76-firmware here's the former, working on the latter! o7
19:39 fdobridge: <pavlo_it_115> It's a waste of time. Look for what you don't know, and you don't know what to do next. I'm sorry that I bothered you here with messages about the Pascal firmware. I now understand the state of the matter. Without NVidia, this is not possible
19:39 fdobridge: <pavlo_it_115> I won't bother you on this topic anymore
19:41 airlied: dakr: doh the problem is the vblank lock in drm isn't irq save and we have some bad interaction with it
19:41 MotorGold: got an image sharing service suggestion, Sid127? I can show you what the screen looks like after the compositor boots.
19:42 Sid127: you should be able to attach files to my paste service :D /half-joking
19:42 Sid127: otherwise, imgur is fine
19:42 MotorGold: lol I'll base64 it 8)
19:42 MotorGold: (I won't)
19:43 MotorGold: https://paste.sidonthe.net/pasta/spider-ant-fish [LOG]
19:44 MotorGold: lol ... imgur crashed my entire firefox 10/10
19:44 airlied: I wonder should we defer the disp irq handling to a workqueue as well
19:44 Sid127: oh, kernel 6.1...
19:45 Sid127: MotorGold: you *might* have a better experience on kernel 6.7+
19:45 MotorGold: https://pasteboard.co/SWzUvCKgP0QI.jpg
19:45 Sid127: since that makes use of the firmware nvidia provides
19:45 MotorGold: I can check if NixOS unstable has it
19:46 Sid127: are you currently on nix stable?
19:47 Sid127: oh
19:47 Sid127: even nix unstable is on 6.6
19:48 Sid127: hang on..
19:48 Sid127: MotorGold: linuxKernel.kernels.linux_6_7
19:48 fdobridge: <tom3026> he doesnt want 6.7.6 tho :p
19:49 Sid127: ..right
19:49 Sid127: heck
19:52 MotorGold: does it let you scry into what patch version it is? all I see is 6.7.
19:52 MotorGold: ah, online it shows it.
19:52 Sid127: I wouldn't know, I'm not on nix... the website says it's 6.7.6, which has a regression affecting laptops and multi-gpu setups
19:53 Sid127: for which I just submitted a patch about 2h ago
19:53 MotorGold: Yeah I think there's been a fairly big regression lately, right? I have been broken on laptop+multi-gpu for a few months
19:53 MotorGold: awesome about the patch, gj
19:53 Sid127: I mean, with kernel 6.1 you're still on the nouveau meant for older cards :P
19:54 Sid127: but yeah, kernel 6.8-rc4 (and 6.7.5) introduced a regression in the suspend/resume logic
19:54 fdobridge: <tom3026> 6.7.6
19:54 Sid127: which laptops use more than desktops for power saving
19:54 fdobridge: <tom3026> 6.7.5 is fine
19:54 Sid127: oh
19:55 Sid127: I don't really follow linux-stable :P
19:55 Sid127: I'm somehow even using zfs on rc6 rn
19:57 MotorGold: I'll try 6.7.5 and report back. Funny enough, the remove only worked once, and now it's hung the entire pc on some hardware resource contention.
19:58 MotorGold: procs stalled, and even a reboot doesn't work. So same solution as above doesn't work twice ... what a thinker
19:58 MotorGold: MMMMM tasty kernel bug. [ 877.227722] ---[ end trace 0000000000000000 ]---
19:58 MotorGold: [ 877.227842] BUG: kernel NULL pointer dereference, address: 0000000000000058
19:58 MotorGold: [ 877.227845] #PF: supervisor write access in kernel mode
19:58 MotorGold: [ 877.227846] #PF: error_code(0x0002) - not-present page
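
The `error_code(0x0002)` in the oops above can be read bit by bit. A small sketch, assuming the standard x86 page-fault error code layout (bit 0 present, bit 1 write, bit 2 user mode, bit 3 reserved-bit violation, bit 4 instruction fetch); the function name is illustrative only.

```python
def decode_pf_error_code(code: int) -> dict:
    """Split an x86 page-fault error code into its low flag bits."""
    return {
        # bit 0 clear means the page was not present at all
        "page_present": bool(code & 0x01),
        # bit 1 set means the faulting access was a write
        "write": bool(code & 0x02),
        # bit 2 clear means the CPU was in supervisor (kernel) mode
        "user_mode": bool(code & 0x04),
        "reserved_bit": bool(code & 0x08),
        "instruction_fetch": bool(code & 0x10),
    }
```

`decode_pf_error_code(0x0002)` reports a kernel-mode write to a not-present page, which agrees with the two `#PF` lines in the paste. The faulting address `0x58` is a small offset from zero, which would be consistent with writing a struct field through a NULL pointer, though the full trace is needed to say where.
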
19:59 Sid127: fun
19:59 MotorGold: security impact: pending lol
19:59 MotorGold: yea it's a nouveau stack trace, actually. guess I crashed the driver something awful
19:59 Sid127: wait, is this on 6.7.5?
20:00 MotorGold: nah, hadn't got there yet 6.1.79
20:00 MotorGold: Reboot to that got stalled on this, is why I'm commenting about it.
20:00 Sid127: right
20:00 MotorGold: Should I paste this massive dmesg bug somewhere? It says "cut here"
20:00 MotorGold: https://paste.sidonthe.net/pasta/bat-bear-pug
20:00 Sid127: but yeah, since 6.7 officially adds support for the RTX xxxx cards..
20:01 MotorGold: oops. ... irc disconnected?
20:01 Sid127: did not
20:01 MotorGold: repeat; not sure where it went: official support for my card?! Hot tamale!
20:02 MotorGold: maybe I hit a bad hexchat key
20:02 Sid127: well, yes :D
20:02 Sid127: kernel 6.7 is where support for the GSP firmware landed
20:02 MotorGold: hm. how hard is it to get NixOS to build it?
20:02 MotorGold: I know it has that magic in it.
20:03 MotorGold: Cause I don't see, what was it, 6.7.5?
20:03 Sid127: I wouldn't know :P
20:03 MotorGold: I only saw the one 6_7 package
20:03 MotorGold: https://search.nixos.org/packages?channel=unstable&from=0&size=50&sort=alpha_asc&type=packages&query=6_7_5
20:03 MotorGold: How about 6.6.18 ?
20:03 Sid127: nope, 6.7+
20:04 MotorGold: Ok cool
20:04 Sid127: so even linuxKernel.kernels.linux_testing would work
20:04 Sid127: but again
20:04 Sid127: latest 6.7 and linux_testing (6.8 release candidates) have a regression
20:04 Sid127: so I'd wait a week or two
20:04 MotorGold: that's just suspend resume?
20:04 Sid127: yes, but
20:04 Sid127: suspend is used for power management on laptops
20:05 MotorGold: This laptop has ... been utter junk for that (it's bugs on bugs with System76 firmware), so I'm trained to never use it
20:05 Sid127: i.e. turning off the gpu when there's no load on it
20:05 MotorGold: hm.
20:05 MotorGold: ok that's ... more concerning.
20:05 Sid127: and the regression makes it so whenever that logic kicks in, the entire rendering infrastructure freezes up
20:05 Sid127: which, for me, was roughly 10 seconds after boot :D
20:05 MotorGold: not my favorite thing for it to do.
20:06 MotorGold: Yes -- I think I've seen that one actually.
20:06 Sid127: the tty still worked, but I couldn't get it to launch anything graphical
20:07 airlied: dakr: so this ends up in the we need to rip out nvif hole :-P
20:12 airlied: at least for any kernel clients
20:17 MotorGold: 6.7.7 good?
20:18 Sid127: that's out?
20:18 MotorGold: no, i was just bumping the number in my head x)
20:18 Sid127: but also no, since my fix is still yet to hit rc
20:18 Sid127: ah
20:18 MotorGold: Just looking to know if there was a target. I'm starting to hack on nixpkgs
20:19 Sid127: if you're gonna compile the kernel yourself, you can include the patch
20:19 Sid127: 6.7.7 is due tomorrow, fwiw
20:19 MotorGold: oh -- I'll wait.
20:19 MotorGold: It's late. lol
20:19 Sid127: meaning 6.7.7 will also not have the fix :D
20:19 MotorGold: xD right
20:20 MotorGold: do you know which rc will? I'll spy for it
20:20 Sid127: assuming airlied grabs the patch for mainlining today
20:20 Sid127: it *could* hit rc7
20:20 Sid127: if not, it'll directly be in 6.8 release
20:21 Sid127: rc7 is due on march 3rd
20:21 Sid127: and 6.8 release on mario
20:21 MotorGold: 6.9 on luigi; gotchu
20:21 MotorGold: I'm on kernel.org -- do they list those dates?
20:21 Sid127: upcoming dates? no
20:22 MotorGold: cool.
20:22 Sid127: but minor versions are on a weekly cycle
20:22 Sid127: so, fairly trivial to estimate
20:22 MotorGold: ah, perf.
20:22 MotorGold: Ok. So hang tight for a bit and this bug should have a bit more to talk about when the actual support hits the tree.
20:22 Sid127: and after 6.8 release it'll be 2 weeks until 6.9-rc1...
20:22 MotorGold: I suppose at that time we can begin shaking some trees and seeing which component team will fall out (kernel/nouveau/s76 firmware)
20:22 Sid127: and then the cycle continues :D
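
The date estimate Sid127 walks through is simple calendar arithmetic. A sketch under the assumptions stated in the chat: weekly release candidates, the final release one week after rc7 (so no rc8), and two weeks from the final release to the next rc1. The year comes from the log date; the function name is invented for illustration.

```python
from datetime import date, timedelta


def estimate_release_dates(rc7: date) -> dict:
    """Project the final release and the next rc1 from an rc7 date."""
    # Assumption: rc7 is the last release candidate of the cycle.
    final = rc7 + timedelta(weeks=1)
    # Assumption: the merge window lasts two weeks before the next rc1.
    next_rc1 = final + timedelta(weeks=2)
    return {"final": final, "next_rc1": next_rc1}


dates = estimate_release_dates(date(2024, 3, 3))
print(dates["final"].isoformat(), dates["next_rc1"].isoformat())
```

These are estimates only; an extra release candidate would shift both dates by a week.
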
20:23 MotorGold: [the circle of life.wav]
20:23 Sid127: but yeah, if you still face the issue on a newer kernel, it'll be worth looking into
20:23 MotorGold: arright. sounds good. You might see me again. Hopefully regardless; I've been taking an interest in nouveau. thanks for the help!
20:24 MotorGold: o/
20:24 Sid127: anytime! if you use discord and find that more convenient you can hop in there too
20:24 Sid127: https://discord.gg/tEUUZhDq
20:26 MotorGold: o7
20:27 Sid127:quietly wonders if MotorGold is into Elite Dangerous
20:33 MotorGold:has heard of it; but does not own it nor play 8c)
20:33 MotorGold: [reading about it] oh this is way too smart for me. I unga bunga with club in dork souls
20:33 Sid127: was just curious, I've only ever seen o7 being used frequently in those circles ^^'
20:34 MotorGold: oh neat! yeah i'm old irc. i've picked up a few things.
20:36 Sid127: am a youngin but have been using IRC on and off for a couple years now, do prefer it to most other chat platforms
21:02 fdobridge: <Sid> wait wtf is microsoft-experimental
21:02 fdobridge: <Sid> as a vulkan driver
21:03 fdobridge: <gfxstrand> Vulkan on top of D3D12
21:03 fdobridge: <rhed0x> dozen
21:04 fdobridge: <Sid> where is this vkd3d-in-reverse used?
21:04 fdobridge: <Sid> oh wait
21:04 fdobridge: <Sid> probably for wsl?
21:04 fdobridge: <pac85> Yes
21:04 fdobridge: <pac85> And other things
21:05 fdobridge: <triang3l> Windows Subsystem for Android, I think, and maybe Vulkan on meme hardware, though I'm not sure if Qualcomm uses it
21:05 fdobridge: <Sid> I see
21:06 fdobridge: <triang3l> where you'll never see dynamic state and GPL/SO probably :frog_gears:
21:07 fdobridge: <!DodoNVK (she) 🇱🇹> *WSL2
21:07 fdobridge: <gfxstrand> The android ecosystem makes me sad
21:11 fdobridge: <Sid> we're doing something wrong with NVK
21:11 fdobridge: <Sid> ```
21:11 fdobridge: <Sid> [ 7178.199725] nouveau 0000:01:00.0: gsp: Xid:13 Graphics SM Warp Exception on (GPC 2, TPC 3, SM 1): Out Of Range Address
21:11 fdobridge: <Sid> [ 7178.199791] nouveau 0000:01:00.0: gsp: Xid:13 Graphics SM Global Exception on (GPC 2, TPC 3, SM 1): Multiple Warp Errors
21:11 fdobridge: <Sid> [ 7178.199855] nouveau 0000:01:00.0: gsp: Xid:13 Graphics Exception: ESR 0x515fb0=0xc03000e 0x515fb4=0x4 0x515fa8=0x4c1eb72 0x515fac=0x174```
21:13 fdobridge: <Sid> also woa, regression
21:14 fdobridge: <Sid> Guilty Gear Strive doesn't launch anymore!
21:25 fdobridge: <redsheep> Ok good so it's not just me, I was worried my system was broken again and didn't want to feel like I was crying wolf, yeah several games that worked now crash such as x64 TF2 and tower unite. I'm not certain it wasn't due to a proton update, but I don't think so.
21:25 fdobridge: <airlied> probably some of the new pipeline paths maybe
21:32 fdobridge: <gfxstrand> seems likely
21:37 fdobridge: <gfxstrand> I'm hoping once we can sort out this locking problem in nouveau.ko and my 2nd 3060 shows up that I'll be more able to properly regression test.
21:38 fdobridge: <gfxstrand> Right now, if the run makes it at least 80% of the way through before my kernel dies, I consider it a good run and merge based on the partial result.
21:39 fdobridge: <redsheep> Is there much in the way of performance regression testing going on elsewhere in mesa?
21:40 fdobridge: <gfxstrand> The Intel team has some stuff and Valve might but perf regression testing is hard
21:41 fdobridge: <Sid> just so we're on the same page, what locking problem exactly?
21:41 fdobridge: <redsheep> I imagine it would be good to have something like apitraces being tested in CI, naturally that would mean needing more ci resources
21:42 fdobridge: <gfxstrand> It's the one that's causing all the fault explosions I'm seeing. I've pasted a half dozen backtraces in the last week.
21:42 fdobridge: <gfxstrand> The object map isn't properly locked
21:43 fdobridge: <Sid> oh, that one
21:43 fdobridge: <Sid> okie
21:44 fdobridge: <Sid> I shall try to poke around and see what's what, albeit right now I still don't really know how the hardware works
21:44 fdobridge: <Sid> this one, I'm guessing
21:45 fdobridge: <Sid> i.e. this one
21:47 fdobridge: <airlied> @gfxstrand not looking trivial to fix unfortunately
21:48 fdobridge: <gfxstrand> @airlied Do we have a hack I can use in the meantime?
21:50 fdobridge: <airlied> https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/899de10b95654a3a0c954fdfdac3be797fd9a169 might fix it, and might not blow up anywhere else (except lockdep)
21:52 fdobridge: <gfxstrand> Building
21:52 fdobridge: <airlied> but the whole nvif abstraction makes things pretty horrible
21:53 fdobridge: <airlied> surprisingly sticking a generic object lookup into a bunch of different paths through the driver causes some issues 😛
21:53 fdobridge: <gfxstrand> Yeah...
21:53 fdobridge: <gfxstrand> I knew something called "ioctl" on the fault path sounded fishy!
21:54 fdobridge: <airlied> I think I can workaround the lockdep with a horrible workqueue offload, but I think it would be bad to offload vblank enables in that way
21:54 fdobridge: <airlied> I think things generally prefer their vblank notifications to be as close as possible to the vblank
21:54 fdobridge: <gfxstrand> If I have to boot with nomodeset, I'll do it
21:55 fdobridge: <airlied> see if it explodes anywhere and we can work out that
21:55 fdobridge: <gfxstrand> kk
21:55 fdobridge: <airlied> just can't upstream it without fixing the lockdeps
21:55 fdobridge: <gfxstrand> Oh, for sure
21:55 fdobridge: <gfxstrand> It needs to be fixed properly.
21:55 fdobridge: <gfxstrand> In the meantime, I'm going bananas over here
22:03 fdobridge: <gfxstrand> Throwing 18 threads at it now
22:07 fdobridge: <Sid> 👀
22:07 fdobridge: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1212883862741454970/image.png?ex=65f3751f&is=65e1001f&hm=f7eee59793f81ac5b3f4ab2c9a542b897922903d5d1d109b5af1889b10374174&
22:12 fdobridge: <pavlo_it_115> Is it some kind of error that is marked "TODO" here?
22:12 fdobridge: <pavlo_it_115> https://cdn.discordapp.com/attachments/1034184951790305330/1212885105748484116/723de8f4346d342d.png?ex=65f37647&is=65e10147&hm=baab581271d7817efa074571a8a602f8d1fea736fd04b58b3f284b157c56294e&
22:12 fdobridge: <pavlo_it_115> https://cdn.discordapp.com/attachments/1034184951790305330/1212885240461140008/39ab4094b51f1d84.png?ex=65f37667&is=65e10167&hm=dc613c1035620a80a6bf49af2d50d2ef77333faf263f6342b96e763f37508de8&
22:13 fdobridge: <pavlo_it_115> It's as if the main page says that - 2D/3D acceleration supported on all GPUs
22:13 fdobridge: <pavlo_it_115> Am I wrong?
22:15 fdobridge: <Sid> main page?
22:22 fdobridge: <redsheep> Do you disable hyper threading? Why run 18 instead of 36 on a 10980xe? Aside from the kernel exploding
22:26 fdobridge: <redsheep> Or is it just known that CTS doesn't benefit from hyperthreading? I'd assume the threads spend quite a bit of time waiting for the GPU with latency and all that, which should be ideal for doubling up the threads
22:30 RSpliet: pavlo_it_115: you're looking at a few X-only obsolete render APIs. AFAIK Wayland wouldn't use them, and X.org uses libmodeset these days (or w/e it's called) by default, accelerating these sorts of things using standard GL instead
22:31 RSpliet: I may have been imprecise, but that's the gist of it
22:32 RSpliet: "glamor", I think, is what the X.org acceleration library is called
22:35 RSpliet: so in short: no, it's not an error, "TODO" is correct, and given its limited use in a wayland and X.org+glamor world, its priority is so low that likely only a hobbyist would pick a task like this up
22:40 fdobridge: <gfxstrand> If I go too high, it pushes the GPU or something too hard and I get a lot of fails
22:40 fdobridge: <gfxstrand> I've got a 2nd identical card showing up next week and I'm going to try running on two cards at the same time.
22:40 fdobridge: <gfxstrand> Hopefully that'll get my runs back down to 45 min or so
22:41 fdobridge: <redsheep> Maybe 36 threads will work across both then
22:41 fdobridge: <gfxstrand> That's the idea
22:41 fdobridge: <gfxstrand> <Insert SLI joke here>
22:49 fdobridge: <pavlo_it_115> https://nouveau.freedesktop.org/
22:51 fdobridge: <pavlo_it_115> Thanks for explanation!
22:55 fdobridge: <airlied> @gfxstrand GPU channel losses or just weird fails?
22:58 fdobridge: <gfxstrand> With lots of threads? Flakes, usually
22:58 fdobridge: <gfxstrand> though maybe it's fine now
22:58 fdobridge: <gfxstrand> I also saw issues where it seemed to run out of contexts or something like that. Like the GPU would just get stuck and everything would start timing out
23:09 fdobridge: <gfxstrand> Lasted an hour and now my whole GPU seems dead in the water. Maybe the IRQ bug?
23:09 fdobridge: <gfxstrand> Before that, though, I got up to
23:09 fdobridge: <gfxstrand> `Pass: 1278505, Fail: 1, Skip: 1752494, Duration: 1:02:15, Remaining: 14:51`
23:10 fdobridge: <gfxstrand> Note the 0 flakes. I never got 0 flakes before. I think our flakes were all faults going bad
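
For a rough sense of how far that run got before the GPU died, the status line gfxstrand pasted can be parsed. This is a sketch that assumes the line always has exactly this shape; the function names and the idea of measuring progress by elapsed versus remaining time are illustrative, not anything the test runner itself reports.

```python
import re


def _to_seconds(clock: str) -> int:
    """Convert "H:MM:SS" or "MM:SS" into a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_status(line: str) -> dict:
    """Pull counts and a time-based progress estimate out of a status line."""
    stats = {k.lower(): int(v)
             for k, v in re.findall(r"(Pass|Fail|Skip): (\d+)", line)}
    elapsed = _to_seconds(re.search(r"Duration: ([\d:]+)", line).group(1))
    remaining = _to_seconds(re.search(r"Remaining: ([\d:]+)", line).group(1))
    stats["tests_seen"] = stats["pass"] + stats["fail"] + stats["skip"]
    # Progress by wall-clock estimate, not by test count.
    stats["time_fraction_done"] = elapsed / (elapsed + remaining)
    return stats
```

Fed the line above, this gives 3,031,000 tests seen and a time-based progress of roughly 81%, which is right around the "at least 80%" bar gfxstrand mentions earlier for accepting a partial run.
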
23:11 fdobridge: <gfxstrand> @airlied Really good progress, IMO. Now we just have to figure out how to fix it for realz
23:11 fdobridge: <gfxstrand> And fix IRQs
23:12 fdobridge: <airlied> that was with REBAR enabled? so you shouldn't have any evictions?
23:13 fdobridge: <airlied> but yeah I suppose it could still hit in REBAR enabled places
23:13 fdobridge: <airlied> btw is the GPU fully dead after you stop running deqp-vk?
23:15 fdobridge: <gfxstrand> Yeah, this is with ReBAR
23:16 fdobridge: <gfxstrand> What do you mean?
23:16 fdobridge: <gfxstrand> It was so broken `sudo systemctl reboot` froze and did nothing
23:16 fdobridge: <airlied> oh okay, anything in dmesg?
23:17 fdobridge: <gfxstrand> It's gone now, sorry
23:17 fdobridge: <gfxstrand> Something about a timeout
23:17 fdobridge: <airlied> ah yeah probably all cpus stuck waiting in drm_release
23:18 fdobridge: <redsheep> Can't you still journalctl filtered to kernel and last boot?
23:18 fdobridge: <redsheep> That's what I've done a few times when I lock up and want to send logs
23:18 fdobridge: <airlied> journalctl -b -1
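
Combining redsheep's suggestion (kernel messages only) with airlied's `journalctl -b -1` (previous boot) could look like the sketch below. The function name is made up; `-k` restricts output to kernel messages and `--no-pager` is added here only so the output can be captured, neither is from the chat. It also assumes the journal is persistent, otherwise there is no previous boot to read.

```python
def previous_boot_kernel_log_cmd(boots_back: int = 1) -> list:
    """Build a journalctl command line for kernel messages of an earlier boot."""
    if boots_back < 0:
        raise ValueError("boots_back must be zero or positive")
    # -b -1 is the boot before the current one, -b -2 the one before that, ...
    return ["journalctl", "-k", "-b", f"-{boots_back}", "--no-pager"]
```

The returned list can be handed to `subprocess.run(..., capture_output=True, text=True)` to grab the log after a lockup and reboot.
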
23:19 fdobridge: <Sid> speaking of rebar...
23:19 fdobridge: <Sid> before: `BAR 1: current size: 256MB, supported: 64MB 128MB 256MB`
23:24 fdobridge: <Sid> one reboot with `modprobe.blacklist=nouveau` and manual `modprobe nouveau` after: `BAR 1: current size: 8GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB`
23:24 fdobridge: <Sid> :>
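
To compare the two `BAR 1` lines Sid pasted, a tiny parser is enough. A sketch, assuming the line format shown in the chat and that the sizes are binary units (1MB = 2**20 bytes); the function names are illustrative.

```python
import re

_UNITS = {"KB": 2**10, "MB": 2**20, "GB": 2**30, "TB": 2**40}


def _size_to_bytes(text: str) -> int:
    """Turn a token such as "256MB" into a byte count."""
    match = re.fullmatch(r"(\d+)(KB|MB|GB|TB)", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def parse_rebar_line(line: str) -> dict:
    """Parse 'BAR n: current size: X, supported: A B C ...'."""
    match = re.search(
        r"BAR (\d+): current size: ([^,\s]+), supported: (.+)", line)
    if match is None:
        raise ValueError("not a resizable BAR line")
    return {
        "bar": int(match.group(1)),
        "current": _size_to_bytes(match.group(2)),
        "supported": [_size_to_bytes(tok) for tok in match.group(3).split()],
    }
```

Run on the "before" and "after" lines, this shows the current BAR 1 size going from 256 MiB to 8 GiB, a 32x increase, with the supported list growing from three sizes to eight.
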
23:25 fdobridge: <gfxstrand> @tiredchiku That's on Turing?
23:26 fdobridge: <Sid> yes
23:26 fdobridge: <Sid> I'm a cursed child
23:26 fdobridge: <Sid> I do cursed things
23:26 fdobridge: <Sid> I modded my UEFI firmware to beat it into submission
23:27 fdobridge: <Sid> shameless plug: https://sidonthe.net/blog/rebar-adventures/
23:35 fdobridge: <rhed0x> > SPIR-V WARNING:
23:35 fdobridge: <rhed0x> > In file ../src/compiler/spirv/spirv_to_nir.c:1526
23:35 fdobridge: <rhed0x> > Image Type operand of OpTypeSampledImage should not have a Dim of Buffer.
23:35 fdobridge: <rhed0x> > 2116 bytes into the SPIR-V binary
23:35 fdobridge: <rhed0x> whats up with that
23:35 fdobridge: <rhed0x> curiously, it seems to be caused by the DXVK HUD shaders which are compiled with glslang
23:36 fdobridge: <gfxstrand> It's bad SPIR-V that we made illegal in like 1.6 or something but up until then Mesa just throws a warning so devs hopefully get annoyed and fix it.
23:37 fdobridge: <rhed0x> is glslang known to produce that?
23:37 fdobridge: <rhed0x> (dxvk targets vulkan 1.2 for its hud shaders, I think that results in spirv 1.3)
23:41 fdobridge: <gfxstrand> It used to
23:44 fdobridge: <gfxstrand> It's been fixed
23:47 fdobridge: <gfxstrand> This `KHR-GL46.tessellation_shader.tessellation_control_to_tessellation_evaluation.gl_tessLevel` fail confuses the shit out of me. It's the simplest tessellation shader ever
23:48 fdobridge: <gfxstrand> Maybe it's reading `gl_TessCoord.xyz` in point mode that's the problem? Seems unlikely...