01:59fdobridge_: <airlied> so we have a locking order problem with the irq lock and the fence lock, we get an irq, we take irq lock then fence lock, we call fence signalling, we take fence lock then irq lock
02:00fdobridge_: <karolherbst🐧🦀> sounds like a deadlock
02:00fdobridge_: <airlied> yes it's what people are seeing with i915
02:00fdobridge_: <karolherbst🐧🦀> I wonder what's so special about prime that we hit this pattern there
02:00fdobridge_: <airlied> prime calls fence signalling
02:00fdobridge_: <karolherbst🐧🦀> oh right..
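The inversion described above (IRQ path: irq lock, then fence lock; signalling path: fence lock, then irq lock) is the classic AB-BA pattern. A minimal userspace sketch of how such an inversion can be detected follows, loosely in the spirit of the kernel's lockdep. The lock names and `record_order` are hypothetical stand-ins, not nouveau code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two locks discussed above. */
enum lock_id { IRQ_LOCK, FENCE_LOCK, NUM_LOCKS };

/* order_seen[a][b] is true once some path took lock b while holding a. */
static bool order_seen[NUM_LOCKS][NUM_LOCKS];

/*
 * Record that `inner` was taken while `outer` was held.
 * Returns false if the opposite order was recorded earlier, meaning two
 * paths could each hold one lock and wait forever for the other.
 */
bool record_order(enum lock_id outer, enum lock_id inner)
{
    assert(outer < NUM_LOCKS && inner < NUM_LOCKS);
    if (order_seen[inner][outer])
        return false;
    order_seen[outer][inner] = true;
    return true;
}
```

Calling `record_order(IRQ_LOCK, FENCE_LOCK)` for the interrupt path and then `record_order(FENCE_LOCK, IRQ_LOCK)` for the signalling path makes the second call return false, which is the situation being debugged here.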
02:02fdobridge_: <airlied> just not sure how best to break the deadlock
02:03fdobridge_: <airlied> might be possible to offload some work to a kworker
02:05fdobridge_: <karolherbst🐧🦀> might not be the worst idea
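One way to read the kworker idea: the interrupt path only records which fences need signalling, and the signalling itself runs later from a context that does not hold the irq lock, so the two locks are never nested in both orders. A rough userspace sketch of that shape, assuming a fixed-size queue; `deferred_signals`, `defer_signal` and `drain_signals` are hypothetical names, and the patch discussed here may well be structured differently:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

/* Fences whose signalling has been postponed (hypothetical structure). */
struct deferred_signals {
    unsigned int fence_ids[MAX_PENDING];
    size_t count;
};

/* "IRQ" side: only note that the fence needs signalling, nothing else. */
bool defer_signal(struct deferred_signals *q, unsigned int fence_id)
{
    assert(q);
    if (q->count == MAX_PENDING)
        return false; /* queue full, caller has to handle it */
    q->fence_ids[q->count++] = fence_id;
    return true;
}

/* "Worker" side: copy the queued ids to `out`, empty the queue, return the count. */
size_t drain_signals(struct deferred_signals *q, unsigned int *out)
{
    assert(q && out);
    size_t n = q->count;
    for (size_t i = 0; i < n; i++)
        out[i] = q->fence_ids[i];
    q->count = 0;
    return n;
}
```

In a real driver the queue would need its own locking and the drain step would be a work item; the sketch only shows the split between recording and signalling.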
02:06fdobridge_: <karolherbst🐧🦀> ~~could also go lockless~~ *hides*
02:09fdobridge_: <airlied> there's already some atomics mixed with spinlocks badly 😛
02:09fdobridge_: <karolherbst🐧🦀> maybe we can get away by just using atomics?
02:09fdobridge_: <karolherbst🐧🦀> 😄
02:14fdobridge_: <Sid> again, open to testing patches :)
02:19fdobridge_: <karolherbst🐧🦀> yeah.. no simple patch sadly, because list manipulations really have to take locks and all that 🥲 more kworker threads is kinda the "simple" way out here, but it's also a moot point doing much refactoring if we are going to move gsp support into a new driver anyway
02:20fdobridge_: <karolherbst🐧🦀> though the question remains if we have to keep gsp support just for ada around 😢
02:26fdobridge_: <airlied> I've got a hacky patch, just need to test it boots 😛
02:27fdobridge_: <karolherbst🐧🦀> ~~let users test it and if they don't reply an hour later you know the answer~~
02:28fdobridge_: <karolherbst🐧🦀> also me every time I give a patch to users to test and they don't reply an hour later: "fuck, hopefully the patch didn't make their system unbootable" where they reply 10 hours later with "sorry, got late, patch works great, thanks"
02:28fdobridge_: <redsheep> I would assume nvidia will keep coming out with new hardware that needs gsp, so if requiring the new driver isn't an option then there's not much else that can be done, right?
02:31fdobridge_: <karolherbst🐧🦀> the thing is rather that ada + gsp + nouveau is shipped
02:31fdobridge_: <karolherbst🐧🦀> and if we have a new driver it might be considered a regression
02:32fdobridge_: <karolherbst🐧🦀> if we drop gsp support in nouveau and say "only the new one"
02:44fdobridge_: <airlied> @sid https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/f118adac44b90451b24345e37b5af6844395febf
02:45fdobridge_: <airlied> @tiredchiku ^
02:46fdobridge_: <airlied> I at least can run talos in prime now and it doesn't lockup
02:48fdobridge_: <Sid> on it
02:48fdobridge_: <Sid> gimme ~30 mins
02:51fdobridge_: <Sid> do I keep the previous patch or no?
02:53fdobridge_: <karolherbst🐧🦀> maybe just try the branch
02:53fdobridge_: <karolherbst🐧🦀> or well.. the commit + its tree
02:53fdobridge_: <airlied> nah drop the old patch
02:53fdobridge_: <airlied> or just try the branch
02:54fdobridge_: <Sid> I'll try the branch then
02:55fdobridge_: <airlied> oh I should drop that from the branch
02:55fdobridge_: <Sid> hm
02:56fdobridge_: <Sid> I could apply the patch onto the 6.7-rc6 tag and it *should* work as well, yeah? I already have the build system set up for that
02:58fdobridge_: <karolherbst🐧🦀> prolly yes
02:58fdobridge_: <airlied> yes that is fine
02:58fdobridge_: <karolherbst🐧🦀> worst case it doesn't boot and you compile it from scratch in 5 minutes
02:58fdobridge_: <Sid> I'll try that first, then will try the branch
02:58fdobridge_: <Sid> you overestimate my coffeelake i5 🐸
02:59fdobridge_: <karolherbst🐧🦀> oh sorry, I meant "5 minutes later"
02:59fdobridge_: <Sid> ah
02:59fdobridge_: <karolherbst🐧🦀> but yeah... my desktop compiles the fedora config in 10 minutes :ferrisUpsideDown:
02:59fdobridge_: <karolherbst🐧🦀> which like.. truly helps with `git bisect`
03:11fdobridge_: <Sid> built and installed
03:14fdobridge_: <Sid> mine really depends on the changes
03:14fdobridge_: <Sid> ccache does its part and re-compiles in ~9 minutes
03:15fdobridge_: <karolherbst🐧🦀> maybe I should start using ccache again, but I hit ccache bugs regularly like 5 years ago, so since then I'm too much of a chicken
03:16fdobridge_: <Sid> reminder to self: open ssh before testing
03:21fdobridge_: <Sid> hm, errors I've never seen before
03:21fdobridge_: <Sid> time to update nvk
03:35fdobridge_: <Sid> ah, dxvk gpu vendor spoofing was the cause for the errors
03:36fdobridge_: <Sid> so, I don't know if it's the NVK update or the kernel patch, but Quake 2 runs *much* better than before
03:37fdobridge_: <Sid> framerate is anywhere between 2x-3x what it was before
03:37fdobridge_: <Sid> in the tutorial, at least. as soon as I load up the first level it's back to 40-50 fps because of probably the map size and effects😅
03:38fdobridge_: <airlied> as long as it doesn't do the system wide hangs 🙂
03:39fdobridge_: <Sid> kernel patch seems to be doing its thing, yeah
03:39fdobridge_: <Sid> no system wide hangs, no game hangs so far either
03:39fdobridge_: <Sid> halfway through the first level, when I could barely make it to the main menu before
03:40fdobridge_: <Sid> there's nothing in the dmesg either
03:40fdobridge_: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1187600648024371261/dmesg.log?ex=65977a4b&is=6585054b&hm=aff4301b3fc23e488bd948251475616bd7c65eae03d47dca4e1c7720857394f9&
03:42fdobridge_: <Sid> depending on the geometry around me the game runs between 20 and 80 fps on my 1660Ti laptop, so.. looks good :D
03:44fdobridge_: <Sid> alt-tabbing got me a minor freeze but the system recovered within 30 seconds
03:45fdobridge_: <Sid> ```
03:45fdobridge_: <Sid> [ 1701.990385] i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe A (start=246028 end=246029) time 249 us, min 1063, max 1079, scanline start 1061, end 1095
03:45fdobridge_: <Sid> [ 1840.506974] nouveau 0000:01:00.0: quake2ex_steam.[24943]: job timeout, channel 24 killed!
03:45fdobridge_: <Sid> [ 1840.506925] Asynchronous wait on fence 0000:00:02.0:kwin_wayland[886]:2401a timed out (hint:intel_atomic_commit_ready [i915])
03:45fdobridge_: <Sid> [ 1840.506924] Asynchronous wait on fence drm_sched:nouveau_sched:c89d5 timed out (hint:submit_notify [i915])
03:45fdobridge_: <Sid> [ 1840.507056] [drm:nouveau_job_submit [nouveau]] *ERROR* Trying to push to a killed entity
03:45fdobridge_: <Sid> ```
03:46fdobridge_: <airlied> okay that's interesting but at least it doesn't die
03:46fdobridge_: <redsheep> Sounds like I should try out ccache, every time I have tried to test something with the kmd I've been using the linux-git aur and it builds every single driver in the entire tree... Takes me a solid 20 minutes with 16 cores.
03:46fdobridge_: <Sid> pretty sure that's an old i915 bug
03:47fdobridge_: <Sid> because I've had that since 5.15
03:48fdobridge_: <Sid> the atomic update failures, I mean
03:50fdobridge_: <Sid> ehm
03:50fdobridge_: <Sid> hang on
03:50fdobridge_: <Sid> ok nvm
03:51fdobridge_: <Sid> no wait hang on..
03:51fdobridge_: <Sid> wrong cmdline in the dmesg :D
03:51fdobridge_: <Sid> PEBKAC
03:53fdobridge_: <Sid> of course there was nothing in the logs, gsp was disabled
03:53fdobridge_: <Sid> I must've booted into the wrong boot option
03:54fdobridge_: <Sid> ..or was it
03:55fdobridge_: <Sid> ok nope, gsp was definitely enabled, I have no idea why the cmdline is wrong at the top of the log
03:55fdobridge_: <Sid> ah, I see
03:55fdobridge_: <Sid> accidentally appended to older logs 🐸
03:56fdobridge_: <Sid> instead of overwriting
03:59fdobridge_: <Sid> thanks for the help dave <3
04:02fdobridge_: <airlied> probably need to drop all these patches on the list and see if I can get them to Linus before 6.7 releases
04:03fdobridge_: <Sid> yeah
04:03fdobridge_: <Sid> good luck :D
04:04fdobridge_: <Sid> if previous releases are an indication you have roughly 10 days until 6.7 is out
04:10fdobridge_: <airlied> nah Linus gave it an extra week
04:11fdobridge_: <Sid> oh, we're getting an rc8 this time?
04:11fdobridge_: <esdrastarsis> yes
04:11fdobridge_: <Sid> neat
04:12fdobridge_: <esdrastarsis> "Then, unless anything odd happens, the final 6.7 release will be Jan 7th, and so the merge window for 6.8 will open Jan 8th." - Linus
04:13fdobridge_: <Sid> no holiday release, makes sense
04:18fdobridge_: <Sid> out of curiosity, what's our plan for GPU stat reporting with and without GSP?
04:21fdobridge_: <airlied> someone reverse engineers how nvidia does it and writes it
04:22fdobridge_: <Sid> ...oh
04:22fdobridge_: <Sid> shouldn't the open source kernel modules from nvidia have pointers on how it's done?
04:23fdobridge_: <airlied> probably not, depends on how they do it
04:23fdobridge_: <airlied> also depends on what stats you mean
04:24fdobridge_: <Sid> clock speeds, temps
04:24fdobridge_: <Sid> power consumption
04:24fdobridge_: <Sid> load
04:27fdobridge_: <airlied> I don't think nvidia expose those with their driver via standard apis
04:28fdobridge_: <airlied> so if they don't expose hwmon, we've no idea how to do it
04:30fdobridge_: <airlied> likely RE'ing nvidia-smi might be the best path
04:32fdobridge_: <esdrastarsis> Or RE the nvml library
04:33fdobridge_: <airlied> yeah or that thing
04:34fdobridge_: <airlied> it's possible nvidia do a bunch of this in userspace, and we'd have to do it in the kernel
04:34fdobridge_: <Sid> I see
04:35airlied: Lyude, karolherbst : bunch of patches on dri-devel/nouveau lists, I'd prefer to land them without too many bikesheds :-)
04:36fdobridge_: <Sid> nvml header is available
04:37airlied: https://lore.kernel.org/dri-devel/20231222043308.3090089-1-airlied@gmail.com/T/#m45fc0006164444df0085844aeb0f7b0b26a4f977
04:37fdobridge_: <Sid> it's part of the cuda toolkit, nvml.h, though I'm not sure how helpful it'll be
04:38fdobridge_: <airlied> would just need to trace nvidia-smi probably at the kernel level and see what objects it creates or GSP calls it makes
04:39fdobridge_: <enigma9o7> How does ANYONE put up with nouveau? I don't get it.
04:39fdobridge_: <enigma9o7> I have been trying really hard to use nouveau, and tonite during a party I had my computer playing music, and of course nouveau causes it to lock up. Not doing anything, just vlc playing a playlist. The fuck.
04:40fdobridge_: <enigma9o7> Takes 3 minutes of silence to reboot and fix it. I have been putting up with it and trying to report issues and trying to figure out how to report it correctly with very little fucking help.
04:40fdobridge_: <enigma9o7> And I know I'm drunk right now but FUCK YOU
04:40fdobridge_: <enigma9o7> It's like, nobody cares that it doesn't work right..
04:40fdobridge_: <enigma9o7> if any other software I reported to the devs locks up, they'd be trying to help, but with this it's like, get logs, blah blah, ignore me the next time I respond, for friggen weeks
04:41fdobridge_: <Sid> I do have open kernel modules with GSP installed, if you could tell me how to trace nvidia-smi I could look into it when I have some free time
04:41fdobridge_: <Sid> would love to contribute, not entirely sure how to 😅
04:42fdobridge_: <enigma9o7> I spent hours trying to learn to build my own kernel, try it, and its WORSE! A ZILLION TIMES WORSE! BUT NOBODY CARES
04:42fdobridge_: <airlied> mostly involves shoving printks into various places in the open kernel driver, but not sure where exactly
04:42fdobridge_: <Sid> ..ah
04:42fdobridge_: <enigma9o7> and it's not like I have some friggen new cpu
04:42fdobridge_: <Sid> got you
04:42fdobridge_: <enigma9o7> gpu
04:42fdobridge_: <enigma9o7> this thing is friggen ancient, theres nothing that shouldnt have been figured out by now
04:42fdobridge_: <enigma9o7> SO FUCK YOU ALL
04:44fdobridge_: <Sid> except we're trying to help as well but progress is naturally slow because we don't have enough info (logfiles) nor the (old) hardware required to reproduce the issue
04:44fdobridge_: <Sid> only so much we can do
04:45fdobridge_: <airlied> I stuck a timeout on them, maybe when they sober up
04:45fdobridge_: <Sid> yeah..
05:21fdobridge_: <Sid> just had another system-wide freeze, when trying to get a screenshot, plasma wayland
05:21fdobridge_: <Sid> will observe and report with logs if it happens again
05:42fdobridge_: <Sid> yeah, happened again, but ~20 mins into a session
06:29fdobridge_: <Sid> ok, false alarm. trusty old undervolt config failed me
07:43fdobridge_: <!DodoNVK (she) 🇱🇹> I guess I need to try this now with my AMD GPU setup
08:48fdobridge_: <!DodoNVK (she) 🇱🇹> This is even better: https://lore.kernel.org/dri-devel/20231222043308.3090089-1-airlied@gmail.com
08:52fdobridge_: <Sid> ♻️
10:41fdobridge_: <!DodoNVK (she) 🇱🇹> :triangle_nvk:
10:41fdobridge_: <!DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1187706511565258802/Screenshot_20231222_123748.png?ex=6597dce2&is=658567e2&hm=2449bc2612d7168770848d519923161ee63fe8b52d9a11501a706597aa8e9459&
10:55fdobridge_: <huntercz122> what about nfsmw 2005?
10:56fdobridge_: <prop_energy_ball> @karolherbst @airlied I'm still really confused about my bisect results: the latest commit sha in the merge commit's range works fine, the merge commit itself does not work, the merge commit~1 works fine
10:57fdobridge_: <!DodoNVK (she) 🇱🇹> 2012 killed my system quicker
10:57fdobridge_: <prop_energy_ball> I've rebooted 10 times for each now so I'm pretty confident in my results
10:59fdobridge_: <prop_energy_ball> inb4 somehow i frogged something up catastrophically
11:00fdobridge_: <!DodoNVK (she) 🇱🇹> With Dave's patches I get job timeouts instead of system freezes (I'm not sure if that's because my GPU is waking up from D3cold)
11:12fdobridge_: <Sid> merge commits can rewrite commit history iirc
11:12fdobridge_: <Sid> and introduce commits in between
11:12fdobridge_: <Sid> the actual "merge commit" itself is an empty commit that summarizes all the changes in one place
11:15fdobridge_: <Sid> example:
11:15fdobridge_: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1187714927805349908/image.png?ex=6597e4b9&is=65856fb9&hm=69f05483d42ea63ede6c47216ca54ed1c14d69550ecc25235466e3a710dd44dc&
11:15fdobridge_: <Sid> https://github.com/EDCD/coriolis/pull/730
11:15fdobridge_: <Sid> merged on 27th dec, but commit date was a month prior, so the commit history got rewritten
11:16fdobridge_: <Sid> idk, git is weird and I'm not too well versed with merges, but that's what I remember
11:16fdobridge_: <prop_energy_ball> Sure, I know that much at least
11:16fdobridge_: <prop_energy_ball> I am guessing maybe there was some conflict in the merge, not sure? That's the only thing that comes to my mind
11:16fdobridge_: <Sid> also doesn't doing a hard reset on a merge commit undo that merge?
11:17fdobridge_: <Sid> either that, or a conflict in the merge, yeah
11:25fdobridge_: <!DodoNVK (she) 🇱🇹> :triangle_nvk: manages to render this scene better than RADV
11:25fdobridge_: <!DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1187717641243205712/Screenshot_20231222_131659.png?ex=6597e740&is=65857240&hm=2504ce617ce9a10a56aa5f8f10df0de4d5997caef364742e49fac05e738a2aa1&
11:47fdobridge_: <!DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1187723066390876230/Screenshot_20231222_134510.png?ex=6597ec4d&is=6585774d&hm=f803970f830930e867d86a7328d9100294e4fa74653a8327f640495476609da5&
11:52fdobridge_: <!DodoNVK (she) 🇱🇹> @airlied With your GSP fixes patchset my system no longer freezes with PRIME on GSP but instead there are job timeouts at these instances which cause channel kills which eventually cause the games to fail with a DEVICE_LOST error (I just got one after playing NFS Most Wanted 2005 for a few minutes)
11:53fdobridge_: <!DodoNVK (she) 🇱🇹> This is still better than having to hard reset the system though (because SSH wasn't working either) so that's definitely progress :triangle_nvk:
11:55fdobridge_: <!DodoNVK (she) 🇱🇹> Note: The game was working fine without any pauses until it suddenly froze ❄️
11:56fdobridge_: <!DodoNVK (she) 🇱🇹> I'm not sure how I can debug this nouveau issue though 🤔
12:01fdobridge_: <!DodoNVK (she) 🇱🇹> With `push_sync` the game gets to the DEVICE_LOST stage much sooner :cursedgears:
12:02fdobridge_: <!DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1187726850068324362/message.txt?ex=6597efd3&is=65857ad3&hm=8db4a17eb2a36612b4d7b7ec9cb34a5f9a0d5fa1dc9ab287d26487de6683ca50&
12:37fdobridge_: <Sid> would it be worth running the game through a debugger and looking through the stack trace when it freezes
12:38fdobridge_: <karolherbst🐧🦀> not necessarily
12:39fdobridge_: <karolherbst🐧🦀> dumping the last push buffer with `NVK_DEBUG=push_sync` should be what you need
12:44fdobridge_: <Sid> how would one do that
12:45fdobridge_: <Sid> I've got some free time and a bunch of games hitting DEVICE_LOST on my hands :D
12:45fdobridge_: <karolherbst🐧🦀> setting it as an env variable before running the game
12:47fdobridge_: <Sid> wait
12:47fdobridge_: <Sid> `NVK_DEBUG=push_sync` should dump the last push buffer?
12:48fdobridge_: <karolherbst🐧🦀> only the ones messing up the GPU context
12:48fdobridge_: <Sid> I see
12:49fdobridge_: <karolherbst🐧🦀> `push_sync` just waits on each submission, without it nvk just moves on. But if this submission fails, it dumps what got sent to the kernel/GPU.
12:49fdobridge_: <Sid> and how would I access the dumps?
12:49fdobridge_: <karolherbst🐧🦀> it prints it to stdout or stderr
12:49fdobridge_: <karolherbst🐧🦀> if you play through steam, you can simply start steam from the command line
12:50fdobridge_: <Sid> huh
12:50fdobridge_: <Sid> I've already got steam running through the cli
12:51fdobridge_: <karolherbst🐧🦀> but I think that was already done here :ferrisUpsideDown: so for you it might look similar if it's the same game
12:51fdobridge_: <Sid> but I don't see anything like it
12:51fdobridge_: <Sid> and no, different game(s)
12:51fdobridge_: <karolherbst🐧🦀> @asdqueerfromeu what's the kernel error again?
12:51fdobridge_: <Sid> ah ok, proton log was hogging it
12:51fdobridge_: <karolherbst🐧🦀> ahh
12:52fdobridge_: <karolherbst🐧🦀> but yeah.. that dump + the error in `dmesg` is what's needed to figure out what goes wrong
12:52fdobridge_: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1187739440681062463/dump-Dirt-Rally.log?ex=6597fb8d&is=6585868d&hm=1a8d19bbfcd115210e8b10cc6cb39888b0267002d909207fcbda32eb30d469b5&
12:52fdobridge_: <!DodoNVK (she) 🇱🇹> A job timeout (nothing else)
12:52fdobridge_: <karolherbst🐧🦀> mhhh
12:52fdobridge_: <Sid> ```
12:52fdobridge_: <Sid> [ 5967.080459] nouveau 0000:01:00.0: drt.exe[21411]: job timeout, channel 24 killed!
12:52fdobridge_: <Sid> [ 5983.464511] nouveau 0000:01:00.0: drt.exe[21411]: job timeout, channel 24 killed!
12:52fdobridge_: <Sid> [ 6022.450367] nouveau 0000:01:00.0: gsp: rc engn:00000001 chid:24 type:13 scope:1 part:233
12:52fdobridge_: <Sid> [ 6022.450377] nouveau 0000:01:00.0: fifo:001001:0003:0018:[drt.exe[21411]] errored - disabling channel
12:53fdobridge_: <Sid> [ 6022.450388] nouveau 0000:01:00.0: drt.exe[21411]: channel 24 killed!
12:53fdobridge_: <Sid> ```
12:53fdobridge_: <karolherbst🐧🦀> let's see...
12:53fdobridge_: <karolherbst🐧🦀> that can mean a shader thread keeps looping or something
12:54fdobridge_: <!DodoNVK (she) 🇱🇹> I don't get the type:13 error
12:54fdobridge_: <!DodoNVK (she) 🇱🇹> It's stderr according to nvk_queue.c
12:56fdobridge_: <karolherbst🐧🦀> huh...
12:57fdobridge_: <karolherbst🐧🦀> why is the value `0x3c`...
13:00fdobridge_: <karolherbst🐧🦀> ohh wait
13:00fdobridge_: <karolherbst🐧🦀> that's the first argument :ferrisUpsideDown:
13:03fdobridge_: <karolherbst🐧🦀> @asdqueerfromeu mind removing those lines and see if it helps? https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/vulkan/nvk_cmd_clear.c?ref_type=heads#L94-98
13:05fdobridge_: <karolherbst🐧🦀> it probably leads to corruptions, but whatever
13:08fdobridge_: <karolherbst🐧🦀> @tiredchiku mhh.. yours is probably really a loop within some shader.. please try running with this: `NVK_USE_NAK=0`
13:08fdobridge_: <karolherbst🐧🦀> might have to also set `MESA_SHADER_CACHE_DISABLE=1`
13:11fdobridge_: <Sid> still the same with both env vars set
13:13fdobridge_: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1187744812850479154/dump-Dirt-Rally-NoNAK.log?ex=6598008e&is=65858b8e&hm=0750c55d6b9ceee545e905f0e2de59aeb17ee0e6b08657a0df10d9ecea041f93&
13:18fdobridge_: <karolherbst🐧🦀> interesting..
13:23fdobridge_: <Sid> emergency shopping run
13:23fdobridge_: <Sid> we're out of whiskey
13:24fdobridge_: <Sid> if there's anything I can do to test further (more than half my installed games run into device_lost) I'd be happy to
13:24fdobridge_: <!DodoNVK (she) 🇱🇹> That didn't do much :cursedgears:
13:24fdobridge_: <!DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1187747503798489198/message.txt?ex=65980310&is=65858e10&hm=96649e49190ef4f464133d32ea8f8b23f55b2441a3bcf32982db2f834bacb38d&
16:37fdobridge_: <karolherbst🐧🦀> at least you get a bit further I think? mhhh
16:38fdobridge_: <!DodoNVK (she) 🇱🇹> The 0x3c call stopped appearing (with or without the hack)
16:39fdobridge_: <!DodoNVK (she) 🇱🇹> So this issue is definitely confusing
16:39fdobridge_: <!DodoNVK (she) 🇱🇹> If I don't use `push_sync` I'm able to play for a few minutes before I get a job timeout
16:42fdobridge_: <gfxstrand> That feels like a sync bug somewhere. `push_sync` is just making it sync more often and probably making the bug more reproducible.
16:43fdobridge_: <gfxstrand> I've seen something I can't fully identify when trying to do full CTS runs with GSP where it fairly reproducibly just stops like 2h into a run.
16:44fdobridge_: <gfxstrand> No errors. No trap. No hang. Just a timeout and it kills the channel.
16:44fdobridge_: <!DodoNVK (she) 🇱🇹> Hopefully this blog post mistake can be fixed at least: https://discord.com/channels/1033216351990456371/1071009185833164810/1187129812763033672
16:45fdobridge_: <!DodoNVK (she) 🇱🇹> Without Dave's patchset I got complete system freezes instead of job timeouts
16:46fdobridge_: <gfxstrand> Yeah. I think this is a different bug. I need to pull Dave's patches myself but I think this is unrelated.
16:46fdobridge_: <gfxstrand> One theory I've got is a 32-bit rollover somewhere
16:47fdobridge_: <!DodoNVK (she) 🇱🇹> Both of the NFS games I tested are actually 32-bit so they're using the 32-bit driver
16:47fdobridge_: <gfxstrand> But I've not spent enough time studying it to have a more refined theory
16:47fdobridge_: <gfxstrand> Different 32-bit.
16:48fdobridge_: <gfxstrand> Like some place in the kernel or firmware where there's a `uint32_t` that's being used as a counter or where a 64-bit counter is getting truncated or something.
16:49fdobridge_: <gfxstrand> But I don't actually know. It's a wild stab in the dark. All I know is that eventually fences just stop working.
16:50fdobridge_: <!DodoNVK (she) 🇱🇹> If this was a firmware issue then OGK would likely be affected by that too
16:50fdobridge_: <gfxstrand> I kinda doubt it's firmware
16:51fdobridge_: <gfxstrand> If it is then we got REALLY unlucky with our firmware version pick because I doubt that would have escaped NVIDIA CI for long.
16:52fdobridge_: <karolherbst🐧🦀> ohhh.. maybe we generate enough fences to hit it? 😄
16:52fdobridge_: <gfxstrand> But until @airlied or dakr is able to figure it out, we won't know. 😢
16:52fdobridge_: <karolherbst🐧🦀> let's see...
16:52fdobridge_: <!DodoNVK (she) 🇱🇹> This definitely escaped the CI though: <https://github.com/NVIDIA/egl-wayland/issues/72#issuecomment-1768190892>
16:53fdobridge_: <gfxstrand> Given how repeatable it is with the CTS, it really does feel like a decent theory. When I was trying to do Ampere runs a few months back, it would stop in the same test every time.
16:54fdobridge_: <gfxstrand> Yeah, but dma-buf isn't something NVIDIA cares about. A deep fence bug that blows up your CUDA run after an hour is.
16:55fdobridge_: <karolherbst🐧🦀> uhhh
16:56fdobridge_: <karolherbst🐧🦀> `dma_fence::seqno` us `u64` where in nouveau we treat it as 32 bit 🥲
16:56fdobridge_: <karolherbst🐧🦀> *is
16:56fdobridge_: <gfxstrand> That sounds problematic...
16:56fdobridge_: <gfxstrand> I'd have to do a pretty thorough audit to decide if that's a real problem, though.
16:57fdobridge_: <karolherbst🐧🦀> ```patch
16:57fdobridge_: <karolherbst🐧🦀> diff --git a/drivers/gpu/drm/nouveau/nv84_fence.c b/drivers/gpu/drm/nouveau/nv84_fence.c
16:57fdobridge_: <karolherbst🐧🦀> index 812b8c62eeba1..1adbb588fcdd1 100644
16:57fdobridge_: <karolherbst🐧🦀> --- a/drivers/gpu/drm/nouveau/nv84_fence.c
16:57fdobridge_: <karolherbst🐧🦀> +++ b/drivers/gpu/drm/nouveau/nv84_fence.c
16:57fdobridge_: <karolherbst🐧🦀> @@ -89,6 +89,8 @@ nv84_fence_emit(struct nouveau_fence *fence)
16:57fdobridge_: <karolherbst🐧🦀> struct nv84_fence_chan *fctx = chan->fence;
16:57fdobridge_: <karolherbst🐧🦀> u64 addr = fctx->vma->addr + nv84_fence_chid(chan) * 16;
16:57fdobridge_: <karolherbst🐧🦀>
16:57fdobridge_: <karolherbst🐧🦀> + WARN_ON(fence->base.seqno > (1ul << 32));
16:57fdobridge_: <karolherbst🐧🦀> +
16:57fdobridge_: <karolherbst🐧🦀> return fctx->base.emit32(chan, addr, fence->base.seqno);
16:57fdobridge_: <karolherbst🐧🦀> }
16:57fdobridge_: <karolherbst🐧🦀>
16:57fdobridge_: <karolherbst🐧🦀> ```
16:57fdobridge_: <gfxstrand> And I've only hit this with GSP. My non-GSP CTS runs completed fine.
16:57fdobridge_: <karolherbst🐧🦀> I think _that_ should warn if we overflow 32 bit
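To illustrate why passing a 64-bit seqno to a 32-bit emit could matter, here is a small standalone sketch. It is not nouveau code: `emit32_value`, `signaled_naive` and `signaled_wrap_aware` are hypothetical helpers, and whether the driver actually compares values like the naive version is exactly what the audit mentioned above would have to establish.

```c
#include <stdbool.h>
#include <stdint.h>

/* What a 32-bit emit path would write for a 64-bit seqno. */
uint32_t emit32_value(uint64_t seqno)
{
    return (uint32_t)seqno; /* upper 32 bits are dropped */
}

/* Naive check: compares the 32-bit hardware value to the full seqno. */
bool signaled_naive(uint32_t hw_value, uint64_t seqno)
{
    return hw_value >= seqno;
}

/*
 * Wrap-aware check: compares only the low 32 bits as a signed distance.
 * Assumes the usual two's complement conversion to int32_t.
 */
bool signaled_wrap_aware(uint32_t hw_value, uint64_t seqno)
{
    return (int32_t)(hw_value - (uint32_t)seqno) >= 0;
}
```

Once the seqno passes 2^32, the naive comparison can never succeed because the hardware value has at most 32 bits, so a fence checked that way would simply never signal, with no error reported.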
16:57fdobridge_: <esdrastarsis> @gfxstrand Deep Rock Galactic is crashing on NVK because it's using compressed textures with 1D, 2D and 3D image types, the game started working after I removed this condition from the code: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/vulkan/nvk_image.c#L228
16:58fdobridge_: <karolherbst🐧🦀> @asdqueerfromeu , @tiredchiku mind running with that kernel patch?
16:58fdobridge_: <esdrastarsis> idk if this is the best approach, RE2 Remake dx11 has the same issue, btw
16:58fdobridge_: <Sid> on it 🫡
16:59fdobridge_: <karolherbst🐧🦀> though...
16:59fdobridge_: <karolherbst🐧🦀> mhhh
16:59fdobridge_: <gfxstrand> Post an MR and I'll CTS it when I get back from holiday.
16:59fdobridge_: <karolherbst🐧🦀> I wonder what we are doing that would overflow that value that quickly
16:59fdobridge_: <Sid> 10 mins to build
16:59fdobridge_: <karolherbst🐧🦀> how long does it take to trigger the bug?
17:00fdobridge_: <gfxstrand> In my CTS runs, like 2h
17:00fdobridge_: <karolherbst🐧🦀> yeah.. I have no doubts that you could run it in the CTS
17:00fdobridge_: <karolherbst🐧🦀> *hit
17:00fdobridge_: <gfxstrand> IDK about games. Sounds like a few minutes with `push_sync`
17:00fdobridge_: <Sid> DEVICE_LOST immediately on launch for:
17:00fdobridge_: <Sid> Dirt Rally, Metal Hellsinger, Sea of Thieves, etc etc
17:00fdobridge_: <karolherbst🐧🦀> yeah...
17:00fdobridge_: <karolherbst🐧🦀> so within seconds that's kinda impressive
17:00fdobridge_: <gfxstrand> Within seconds isn't going to be that bug
17:01fdobridge_: <karolherbst🐧🦀> yeah...
17:01fdobridge_: <karolherbst🐧🦀> the fence overflow is for sure a bug, the question is just if anybody actually hits it 😄
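A quick back-of-the-envelope check of how fast fences would have to be created to exhaust a 32-bit seqno. This assumes the seqno starts at zero and increases by one per fence on a single context, which is an assumption made for the estimate; `fences_per_second_to_wrap` is a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/* Fences per second needed to use up 2^32 seqnos within `seconds`. */
uint64_t fences_per_second_to_wrap(uint64_t seconds)
{
    assert(seconds > 0);
    return (UINT64_C(1) << 32) / seconds;
}
```

For a 2 hour run (7200 seconds) this gives 596523 fences per second, and for a 10 second run 429496729 per second, which is why a failure within seconds of launch is unlikely to be this overflow.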
17:02fdobridge_: <karolherbst🐧🦀> in any case, we should upstream some `WARN_ON` there and think of a proper solution there
17:04fdobridge_: <gfxstrand> Some time in February, I'm going to be deeply motivated to find and fix that bug... 😂
17:05fdobridge_: <karolherbst🐧🦀> 😄
17:05fdobridge_: <karolherbst🐧🦀> well
17:05fdobridge_: <karolherbst🐧🦀> if you know it's that bug, that might motivate you even more
18:17fdobridge_: <gfxstrand> Yeah...
18:18fdobridge_: <gfxstrand> Next project is the pipeline code. That's the last bit of code left in the driver that I really consider prototype quality. Once that's done, it's bug fixing, perf, and conformance for a while.
18:19fdobridge_: <Sid> annoyingly enough I'm unable to get this to apply no matter what I do
18:19fdobridge_: <Sid> even tried making it myself using format-patch
18:19fdobridge_: <karolherbst🐧🦀> prolly copy&paste mess up
18:19fdobridge_: <karolherbst🐧🦀> like
18:19fdobridge_: <karolherbst🐧🦀> tabs vs spaces
18:19fdobridge_: <karolherbst🐧🦀> just add that line yourself
18:19fdobridge_: <Sid> yeah
18:20fdobridge_: <Sid> still, annoying
18:20fdobridge_: <karolherbst🐧🦀> yes...
18:20fdobridge_: <karolherbst🐧🦀> why do tabs exist...
18:44fdobridge_: <!DodoNVK (she) 🇱🇹> Are there any plans for Nvidia-specific API support on nouveau/NVK?
18:46fdobridge_: <karolherbst🐧🦀> depends
18:47fdobridge_: <karolherbst🐧🦀> I think nvenc/nvdec might be something we want to care about because it's actually the only supported API in a few places... but that also depends if vulkan video replaces that soon enough
18:47fdobridge_: <karolherbst🐧🦀> like also how wine layers it all
18:47fdobridge_: <Sid> what about nv specific vulkan extensions
18:47fdobridge_: <karolherbst🐧🦀> sure
18:48fdobridge_: <karolherbst🐧🦀> if they are used
18:48fdobridge_: <karolherbst🐧🦀> and without any alternatives
18:48fdobridge_: <karolherbst🐧🦀> we've implemented a couple of GL_NV extensions as well
18:51fdobridge_: <Sid> I know for a fact stuff like VK_NV_low_latency2 is required for reflex and DLSS frame gen, can't think of others off the top of my head
18:56fdobridge_: <airlied> @prop_energy_ball can you point me at the merge commit again?
19:01fdobridge_: <airlied> also if in prime, try nouveau.runpm=0 to rule out powering down the gpu as causing problems
19:04fdobridge_: <karolherbst🐧🦀> https://gitlab.freedesktop.org/drm/nouveau/-/issues/301
19:04fdobridge_: <Sid> device lost even with the patch
19:06fdobridge_: <!DodoNVK (she) 🇱🇹> Check dmesg
19:06fdobridge_: <!DodoNVK (she) 🇱🇹> Is this for Joshie?
19:09fdobridge_: <Sid> ```
19:09fdobridge_: <Sid> [ 274.063651] nouveau 0000:01:00.0: gsp: rc engn:00000001 chid:24 type:13 scope:1 part:233
19:09fdobridge_: <Sid> [ 274.063660] nouveau 0000:01:00.0: fifo:001001:0003:0018:[drt.exe[3882]] errored - disabling channel
19:09fdobridge_: <Sid> [ 274.063665] nouveau 0000:01:00.0: drt.exe[3882]: channel 24 killed!
19:09fdobridge_: <Sid> [ 401.813723] nouveau 0000:01:00.0: Update Task[3882]: job timeout, channel 32 killed!
19:09fdobridge_: <Sid> ```
19:09fdobridge_: <airlied> so that merge commit does have a conflict resolution around nouveau disp.c
19:09fdobridge_: <Sid> https://cdn.discordapp.com/attachments/1034184951790305330/1187834385588826162/dmesg.log?ex=659853fa&is=6585defa&hm=4debc200561bf36676eb7ec9ca42324803429df26ed7a7de52668a8daf76be3c&
19:11fdobridge_: <!DodoNVK (she) 🇱🇹> So no warn 🐸
19:11fdobridge_: <!DodoNVK (she) 🇱🇹> You definitely aren't hitting the fence limit
19:14fdobridge_: <Sid> yup
19:17fdobridge_: <airlied> though I can't spot any problem on the merge commit
19:27fdobridge_: <prop_energy_ball> 79fb229b8810071648b65c37382aea7819a5f935
19:28fdobridge_: <karolherbst🐧🦀> actually.. should be `>=` 😄
19:28fdobridge_: <prop_energy_ball> I'm not on prime
19:28fdobridge_: <karolherbst🐧🦀> but anyway... I doubt seconds of runtime would trigger that issue
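The `>=` remark is about the boundary value: 2^32 is the first seqno that no longer fits in 32 bits, so a strict `>` check misses it. A tiny sketch with hypothetical helper names, using a 64-bit constant because `1ul << 32` is only well defined where `unsigned long` is 64 bits wide:

```c
#include <stdbool.h>
#include <stdint.h>

/* Check with `>`, as in the debug patch (but with a 64-bit constant). */
bool warns_gt(uint64_t seqno)
{
    return seqno > (UINT64_C(1) << 32);
}

/* Check with the suggested `>=`. */
bool warns_ge(uint64_t seqno)
{
    return seqno >= (UINT64_C(1) << 32);
}
```

With `seqno == 2^32` only `warns_ge` fires, while `UINT32_MAX`, the largest value that still fits, triggers neither.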
19:28fdobridge_: <prop_energy_ball> Display is connected via DP
19:28fdobridge_: <prop_energy_ball> One of my displays via DP works but this one does not
19:28fdobridge_: <prop_energy_ball> There is only 1 display connected at a time also
19:34fdobridge_: <airlied> I wonder if this is just some bw limit being hit on that display, and some changes make it happen now
19:37fdobridge_: <prop_energy_ball> It's a 4K 144Hz display which definitely needs DSC
19:43fdobridge_: <prop_energy_ball> Is there a way to force a safe mode?
19:43fdobridge_: <karolherbst🐧🦀> you can lower the DP version of the display
19:43fdobridge_: <prop_energy_ball> Also 60Hz is the preferred in EDID so shouldn't default to that probably.
19:43fdobridge_: <karolherbst🐧🦀> mhh I wonder...
19:44fdobridge_: <karolherbst🐧🦀> maybe https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/gpu/drm/nouveau?h=v6.7-rc1&id=7f67aa097e875c87fba024e850cf405342300059 broke something?
19:44fdobridge_: <karolherbst🐧🦀> but I think that was fixed by the DP fixes
19:49fdobridge_: <prop_energy_ball> That should be in 6.6 which also works no?
19:49fdobridge_: <karolherbst🐧🦀> yeah...
19:49fdobridge_: <karolherbst🐧🦀> but still...
19:49fdobridge_: <karolherbst🐧🦀> which modes are you getting on 6.6 btw?
19:51fdobridge_: <karolherbst🐧🦀> however
19:51fdobridge_: <karolherbst🐧🦀> doing `git diff 7f67aa097e875c87fba024e850cf405342300059^0 7f67aa097e875c87fba024e850cf405342300059^1 drivers/gpu/drm/nouveau/` gives me that patch 😄
19:51fdobridge_: <prop_energy_ball> Pretty sure it was just 4K 60, I migrated to bed now tho
19:51fdobridge_: <karolherbst🐧🦀> ehh wait
19:51fdobridge_: <karolherbst🐧🦀> wrong syntax
19:52fdobridge_: <karolherbst🐧🦀> how could I select the merge sub trees again...
19:52fdobridge_: <karolherbst🐧🦀> ohh wait
19:52fdobridge_: <karolherbst🐧🦀> I checked with the wrong commit..
19:53fdobridge_: <prop_energy_ball> This would have been infinitely easier if kernel was rebase and not merge oriented
19:54fdobridge_: <karolherbst🐧🦀> mhhh
19:55fdobridge_: <karolherbst🐧🦀> yeah...
19:55fdobridge_: <karolherbst🐧🦀> so there is an ugly trick...
19:55fdobridge_: <karolherbst🐧🦀> ~~do the rebase yourself~~
19:55fdobridge_: <prop_energy_ball> There is a reason every project aside from the kernel has decided merge commits are the root of all evil and has them banned 🐸
19:56fdobridge_: <prop_energy_ball> This is a job for tomorrow Josh
19:56fdobridge_: <karolherbst🐧🦀> 😄
19:56fdobridge_: <karolherbst🐧🦀> you mean like next week josh
19:56fdobridge_: <karolherbst🐧🦀> but maybe you are lucky on this one and it's not rebasing 10k commits
19:56fdobridge_: <prop_energy_ball> Eeeee
19:56fdobridge_: <prop_energy_ball> It's small enough
19:57fdobridge_: <karolherbst🐧🦀> ohh, fine then
20:08fdobridge_: <karolherbst🐧🦀> though I kinda see that a project on that scale can't really do `rebase` unless there is this kernel gitlab and.....
20:08fdobridge_: <prop_energy_ball> It makes sense for mailing list I guess
20:08fdobridge_: <karolherbst🐧🦀> but the current workflow is that Linus merges 10000 trees in a day and if the 9999 remaining ones conflict after the first merge it's game over
20:08fdobridge_: <karolherbst🐧🦀> 😄
20:08fdobridge_: <prop_energy_ball> Marge for kernel wen
20:09fdobridge_: <karolherbst🐧🦀> first gitlab
20:09fdobridge_: <karolherbst🐧🦀> but yeah...
20:09fdobridge_: <karolherbst🐧🦀> ~~we should just go gitlab for all of drm~~
20:09fdobridge_: <karolherbst🐧🦀> maybe next year I just migrate the drm trees to gitlab if nobody else is doing it :ferrisUpsideDown:
20:10fdobridge_: <prop_energy_ball> They are already there, no? Just submitting is still ML
20:10fdobridge_: <karolherbst🐧🦀> nope
20:10fdobridge_: <karolherbst🐧🦀> git is still on the old git infra
20:10fdobridge_: <karolherbst🐧🦀> https://cgit.freedesktop.org/drm/drm-misc/
20:12fdobridge_: <prop_energy_ball> Ah
20:12fdobridge_: <prop_energy_ball> I was thinking of eg amdgfx
20:12fdobridge_: <prop_energy_ball> And amd staging drm next (good name)
20:13fdobridge_: <prop_energy_ball> You should announce deprecation of cgit.freedesktop.org and it will solve itself 🐸
20:14fdobridge_: <karolherbst🐧🦀> 😄
20:14fdobridge_: <karolherbst🐧🦀> I just do it while everybody is off next week or something
20:14fdobridge_: <karolherbst🐧🦀> "santa migrated drm to gitlab"
20:14fdobridge_: <karolherbst🐧🦀> everybody who complains gets no presents next year
20:15fdobridge_: <prop_energy_ball> cgit definitely wins the award for world's most confusing and useless git webui, but GitLab is definitely trying to catch up with the latest redesign where everything is an extra click/hover away for no reason
20:17fdobridge_: <karolherbst🐧🦀> the most confusing part of gitlab is that the history commit order is reverse from e.g. github
20:18fdobridge_: <gfxstrand> NV compute shader derivatives, definitely. I'll probably hook up the SM stuff at some point. Maybe other stuff if it's interesting. It'll depend on what apps want.
20:19fdobridge_: <!DodoNVK (she) 🇱🇹> What primaries does it use?
20:19fdobridge_: <!DodoNVK (she) 🇱🇹> That gets us closer to SM 6.6 support
20:20fdobridge_: <gfxstrand> Shouldn't be hard. It's just a QMD bit, probably.
20:20fdobridge_: <gfxstrand> I just need to find it
20:20fdobridge_: <prop_energy_ball> Just for linear group or quads too?
20:20fdobridge_: <gfxstrand> And do a bit of plumbing
20:20fdobridge_: <gfxstrand> Probably both since I think NV does both.
20:21fdobridge_: <gfxstrand> I need to find the bits, though. 🙃
20:21fdobridge_: <prop_energy_ball> RADV only has linear which bit me once
20:21fdobridge_: <prop_energy_ball> "Oh this is the only one, guess it's probably the one I want then" haha nooooo, need to rewrite the IDs
20:21fdobridge_: <gfxstrand> RADV could do quad if they cared, I thought
20:22fdobridge_: <gfxstrand> On Intel, the x/y/z workgroup ID stuff is totally arbitrary and linear vs. quad just changes how we do the calculation.
20:23fdobridge_: <gfxstrand> On NVIDIA, you actually get XYZ in the shader and you have to compute the linearized version. 🙃
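The linear vs. quad distinction above can be illustrated with a small sketch. This is not NVK or RADV code; the function names are hypothetical, and it assumes the usual Vulkan definition of the local invocation index (x + y * size_x + z * size_x * size_y) together with the grouping rules of the NV compute shader derivatives extension as commonly described: linear mode groups four consecutive invocation indices, quad mode groups 2x2 neighbours in x/y.

```python
def linear_invocation_index(local_id, workgroup_size):
    """Flatten an (x, y, z) local invocation ID into a single index.

    Assumes the Vulkan LocalInvocationIndex ordering: x varies fastest,
    then y, then z.
    """
    x, y, z = local_id
    size_x, size_y, _size_z = workgroup_size
    return x + y * size_x + z * size_x * size_y


def derivative_group_linear(local_id, workgroup_size):
    """Derivative group in linear mode: 4 consecutive linear indices."""
    return linear_invocation_index(local_id, workgroup_size) // 4


def derivative_group_quads(local_id):
    """Derivative group in quad mode: a 2x2 block in x/y (per z slice)."""
    x, y, z = local_id
    return (x // 2, y // 2, z)


if __name__ == "__main__":
    size = (8, 4, 1)  # hypothetical workgroup size
    a, b = (0, 0, 0), (0, 1, 0)  # vertical neighbours
    # Same group in quad mode, different groups in linear mode.
    print(derivative_group_quads(a) == derivative_group_quads(b))
    print(derivative_group_linear(a, size) == derivative_group_linear(b, size))
```

With an 8x4x1 workgroup, vertically adjacent invocations share a derivative group only in quad mode, which is why a shader written for one mode may need its IDs rewritten for the other.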
20:27fdobridge_: <dadschoorse> doesn't vulkan have something like "if workgroup_size.x == subgroup_size, you get a full subgroup with x_ids from 0 to subgroup_size - 1"?
20:27fdobridge_: <dadschoorse> that would make quad hard to implement on amd
21:10fdobridge_: <gfxstrand> Yes
21:11fdobridge_: <gfxstrand> It's part of VK_EXT_subgroup_size_control. It's also in Vulkan 1.2 with SPIR-V 1.4.
21:12fdobridge_: <karolherbst🐧🦀> I'm still having nightmares from just thinking about implementing that part in rusticl...
21:13fdobridge_: <karolherbst🐧🦀> `cl_intel_required_subgroup_size` is the CL ext..
21:13fdobridge_: <karolherbst🐧🦀> (if I understood the vulkan ext correctly)
21:46fdobridge_: <gfxstrand> The CL thing isn't bad. I'm pretty sure I have the SPIR-V bits in a branch somewhere.
21:46fdobridge_: <gfxstrand> The annoying bit is that the compile can fail, IIRC.
21:47fdobridge_: <karolherbst🐧🦀> yeah...
21:47fdobridge_: <karolherbst🐧🦀> the pain parts are in the subgroup core ext anyway
21:49fdobridge_: <karolherbst🐧🦀> `Note as well that some devices may support a limited number of sub-group sizes, and that some devices may not support all language constructs with all sub-group sizes. This means that some kernels may fail compilation with one required sub-group size and succeed with another required sub-group size, even if both sub-group sizes are supported by the device.` 🥲
21:49fdobridge_: <karolherbst🐧🦀> at least it's part of that ext
21:50fdobridge_: <karolherbst🐧🦀> the fun part is just to make the iris driver support that properly 😄
21:50fdobridge_: <karolherbst🐧🦀> ohh and radeonsi supports 32|64
21:50fdobridge_: <gfxstrand> Womp womp...
21:51fdobridge_: <karolherbst🐧🦀> non uniform workgroups are also a pain, especially due to the high interactions with the subgroup support
21:51fdobridge_: <gfxstrand> For all the things we got wrong with subgroups in Vulkan, we didn't manage to frog them up nearly as bad as GL or CL. 🤡
21:51fdobridge_: <karolherbst🐧🦀> and non uniform workgroups actually matter for perf 😦
21:52fdobridge_: <karolherbst🐧🦀> look, CL C has workgroup reduction functions 🥲 (edited)
21:52fdobridge_: <gfxstrand> Yup
21:52fdobridge_: <karolherbst🐧🦀> you can't make a subgroup API worse than that 😄
21:53fdobridge_: <karolherbst🐧🦀> though that might be one feature I'll probably never implement unless something needs it, which I kinda doubt
21:53fdobridge_: <karolherbst🐧🦀> but something is also wrong with my current subgroup implementation... I think.. at least openvino doesn't work if subgroups are supported, but the CTS is fine 😦
22:24fdobridge_: <gfxstrand> :blobcatnotlikethis:
22:34fdobridge_: <karolherbst🐧🦀> I think I should wire up CL subgroups on zink and see if it's also happening there... but maybe the vulkan CTS is way stricter there and it would just work.. 😄