00:00karolherbst: imirkin: and I also wrote an email on how to improve the kernel level patch submission process for random folks, you might want to leave a comment as well
00:00imirkin: i saw
00:01imirkin: i tried to do the proc name thing a long while back
00:01imirkin: but realized i had to modify a LOT of stuff
00:01imirkin: so i gave up :)
00:01karolherbst: I got too annoyed :D
00:01imirkin: it actually seems like you were able to get away without modifying _too_ much
00:01imirkin: i think some of the structure changed since.
00:01karolherbst: but all three MRs should improve the debuggability _and_ user experience by quite a lot
00:02karolherbst: imirkin: maybe
00:02karolherbst: I think I cheated my way in
00:02imirkin: cheating is good too :)
00:02karolherbst: I don't like to touch all of those nvif thingies, but meh...
00:03karolherbst: the nv50 and older stuff isn't even used afaik
00:03karolherbst: ohh nv50 is, but not g82
00:04karolherbst: I'd rather have a "base" struct for those fake ioctls, but oh well.. that's up to skeggsb anyway :p
00:04imirkin: anyways, you probably realize this, but ben's going to have to look at that kernel change :)
00:05karolherbst: hence my email
00:05karolherbst: anyway, you can help out with detecting dead channels from userspace problem :p
00:06karolherbst: I have no clue why we freeze X/whatever sometimes while waiting on fences, but we do and this should make users more happy overall
00:07karolherbst: ohh, with that I can even wire up this robustness thing
00:08imirkin: yeah, i have no idea why X freezes either
00:08karolherbst: well.. I think this is the best we can do for now ¯\_(ツ)_/¯
00:09karolherbst: I don't think there is another place where we just block forever
00:09karolherbst: maybe I overlooked something
00:09imirkin: can you repro the freeze (and then check it's fixed)?
00:10karolherbst: I know I was able to, but not sure what exactly it was...
00:10imirkin: there could be 75 freezes
00:10imirkin: and all you know is you fixed 1 of them
00:13karolherbst: ohh right
00:13karolherbst: this vdpau freeze when seeking
00:14karolherbst: but that might be something else...
00:46karolherbst: mhhh.. I don't remember where it helped to just kill the process anymore :(
00:46karolherbst: some of the CTS tests will do that though :D
00:46karolherbst: let's see
00:55karolherbst: I think it was piglit actually
00:56karolherbst: imirkin: worst case those patches simply improve CTS+piglit runtimes :D
00:57imirkin: the horror!
00:57karolherbst: soo.. I found a case where the screen freezes but only until nouveau manages to kill the channel
00:57karolherbst: but I think there was something
00:57karolherbst: I am aware of a game which froze everything
00:57karolherbst: but that was when looking into this other thing
00:58karolherbst: power gating
00:58karolherbst: I think it was on gm206 where some of Lyude's more experimental patches broke stuff
00:58karolherbst: and I know it froze until I killed the game
00:58karolherbst: maybe I am able to reproduce this tomorrow
00:59imirkin: yeah, it freezes until chan is killed
00:59imirkin: i thought that was the "usual" thing
00:59imirkin: sometimes killing fails
00:59imirkin: and it never recovers
00:59karolherbst: but I also know the case where it kills successfully, but it still freezes
00:59imirkin: or it doesn't detect that it should kill the chan, etc
00:59imirkin: i haven't seen much of that
01:00karolherbst: I know I worked on that months/years ago because I had a stable reproducer
01:04karolherbst: imirkin: e.g. https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/551
01:04karolherbst: channel 6 indicates that the channel doesn't belong to X
01:05karolherbst: but that's one of those freezes where killing userspace unfreezes it
01:05karolherbst: and the channel is apparently already dead
01:06karolherbst: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/457 as well
01:06karolherbst: well.. not sure if killing the process helps here, but..
01:06karolherbst: but the last one sounds promising to reproduce
11:19felco: morning guys! I have a gtx650ti and a gtx1080 and I can get more logs if that helps, just let me know
11:20felco: I can also run tests with patches, I have a spare disk for installing distros, kernels and patches
16:23karolherbst: Lyude: ehh.. where did you have your clockgating patches for maxwell?
16:27karolherbst: or am I confusing things and I managed to break it on kepler?
16:27karolherbst: you remember the time when I found a game's channel breaking and figured out it was the clockgating stuff all along?
16:35karolherbst: ahh yeah...
16:35karolherbst: soo.. let's see..
16:41karolherbst: or was it that? :O https://gitlab.freedesktop.org/lyudess/linux/-/tree/wip/maxwell-clkgate-nopm-v1.1
16:47karolherbst: okay.. cool
19:30karolherbst: Lyude: I just rebased your maxwell clockgating stuff :D and I need it for stuff not even related
20:03karolherbst: imirkin: I have a reproducer for the dead channel -> freeze thing :)
20:03karolherbst: and killing the process makes it unfreeze :)
20:03karolherbst: okay.. nice nice nice
20:03karolherbst: it's not always freezing the desktop though
20:04imirkin: finally got nouveau to freeze
20:04imirkin: shouldn't be _that_ hard :)
20:04karolherbst: if you want it to happen reliably..
20:04karolherbst: I had to dig out Lyude's patches to enable clockgating on maxwell2 :)
20:04karolherbst: and to start a game where it kills the channel within seconds in case it's enabled
20:05imirkin: or try to make queries work
20:05imirkin: that usually causes hangs real quick
20:05karolherbst: the issue with that game is those insanely huge pushbuffers
20:05karolherbst: and I think the engines go idle in the meantime or something
20:05karolherbst: or something
20:05karolherbst: smaller pushbuffers "fix" it
20:05imirkin: i'll make a simple libdrm-using program which insta-hangs
20:05imirkin: if that's helpful
20:06karolherbst: as long as it freezes and you don't run into this libdrm assert crashing the process :)
20:07imirkin: nah, just a plain ol' gpu hang
20:07imirkin: just need to send like 2 commands
20:07imirkin: maybe 3?
20:07karolherbst: I mean I have enough CTS tests which just kill the channel
20:07karolherbst: but they don't freeze the desktop
20:07imirkin: they kill the channel due to some error, right?
20:07karolherbst: that freezing part is usually the difficult part
20:07imirkin: mine will just freeze. no errors.
20:07imirkin: all is well
20:08imirkin: except nothing is working :)
20:08karolherbst: ohh interesting
20:08karolherbst: well the libdrm patches only help if the channel is actually killed
20:08imirkin: no way to do it via GL (without bugs in impl of course)
20:08imirkin: but very easy to put some commands together to do it
20:08karolherbst: btw.. here I just see a "[ 1878.122047] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]" :(
20:08imirkin: yeah, that might happen.
20:08imirkin: i could believe that.
20:08karolherbst: which keeps me thinking about our infamous kepler1 bug
20:09karolherbst: clockgating is causing it :)
20:09karolherbst: and with clockgating disabled it doesn't happen afaik
20:09imirkin: but i could imagine that this would be the end result on fermi+ for my thing
20:09karolherbst: I had _fun_ debugging this one
20:09imirkin: actually no. it shouldn't.
20:09imirkin: all should be well.
20:09imirkin: except nothing will be working.
20:09karolherbst: because how do you get from "game is crashing the channel" to "experimental clockgating patches are the issue"
20:10karolherbst: I still need to improve my patches for blaming the correct process
20:10karolherbst: https://gist.githubusercontent.com/karolherbst/a594806dd95f1f55f6f12f282c615c06/raw/82c33e1291001080b4d880509b00a9b5b7d54df9/gistfile1.txt :(
20:11imirkin: i'll try to put it together tonight
20:11karolherbst: I fixed it for two out of the three places
20:11imirkin: i have a second gpu i can test it on
20:11karolherbst: but this last one is super annoying to fix
20:11imirkin: i do think channel recovery fixes it when you kill the prog
20:11imirkin: but i think if it gets run on the main GPU
20:11karolherbst: but not much we can do about it from within mesa, no?
20:11imirkin: then it freezes everything forever
20:11karolherbst: as long as the kernel doesn't kill it on its own
20:12imirkin: oh, i thought you were looking at the freezing
20:12karolherbst: I do
20:12karolherbst: in my instance the freezing happens despite the kernel killing the channel
20:12karolherbst: I know it only fixes a handful of issues, but it improves CTS run times _and_ fixes some of the freezes
20:13karolherbst: what can be done about those other cases? mhh
20:13karolherbst: especially shaders looping infinitely
20:13karolherbst: those are super annoying
20:14karolherbst: anyway.. all I do with my patches is to ask the kernel if the channel is still alive, and if it's dead just stop waiting on the fence
20:15karolherbst: we will error out sooner or later anyway
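Editor's sketch of the idea karolherbst describes above: while waiting on a fence, ask whether the channel is still alive, and stop waiting once it is dead. This is only an illustration under assumptions, not the actual mesa or libdrm patch. Every `sketch_*` name is hypothetical, and the kernel query is replaced by a plain flag so the block stands alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: none of these names come from mesa or libdrm. */
struct sketch_channel {
    bool killed; /* what the kernel would report for this channel */
};

struct sketch_fence {
    unsigned polls_until_signalled; /* 0 means already signalled */
    struct sketch_channel *chan;
};

enum sketch_wait_result {
    SKETCH_WAIT_SIGNALLED,
    SKETCH_WAIT_CHANNEL_DEAD,
    SKETCH_WAIT_TIMEOUT,
};

/* Stand-in for asking the kernel whether the channel is still alive. */
static bool sketch_channel_dead(const struct sketch_channel *chan)
{
    return chan->killed;
}

/* Stand-in for one poll of the fence; counts down until it signals. */
static bool sketch_fence_poll(struct sketch_fence *fence)
{
    if (fence->polls_until_signalled == 0)
        return true;
    fence->polls_until_signalled--;
    return false;
}

/* Wait for the fence, but give up early once the channel is dead. */
static enum sketch_wait_result
sketch_fence_wait(struct sketch_fence *fence, unsigned max_polls)
{
    assert(fence != NULL && fence->chan != NULL);

    for (unsigned i = 0; i < max_polls; i++) {
        if (sketch_fence_poll(fence))
            return SKETCH_WAIT_SIGNALLED;
        /* A fence on a dead channel is assumed never to signal. */
        if (sketch_channel_dead(fence->chan))
            return SKETCH_WAIT_CHANNEL_DEAD;
    }
    return SKETCH_WAIT_TIMEOUT;
}
```

The fence is polled before the channel check, so a fence that completed before the channel died is still reported as signalled. What the caller should do on `SKETCH_WAIT_CHANNEL_DEAD` is the open question discussed further down.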
20:15karolherbst: libdrm without asserts enabled...
20:15karolherbst: that's a problem
20:15karolherbst: I forgot that part
20:17karolherbst: let me verify that
20:22karolherbst: games ignoring my LD_LIBRARY_PATH thing
20:22karolherbst: how annoying
20:23karolherbst: imirkin: okay.. so here is the deal.. once the channel is dead, it can take a while until we run into this assert(kref) thing
20:23karolherbst: and some processes freeze the desktop once we reach it
20:26karolherbst: okay... more likely with a libdrm compiled without asserts :)
20:30karolherbst: imirkin: sooooo... any good ideas what we should do in case we actually detect the channel is dead? I think with my patches we still rely on something to happen
20:32karolherbst: if we are unlucky the process never crashes, but it will because it tries to dump the pushbuffer or something? dunno
20:34karolherbst: but I think we might also just want to allocate a new channel or something in that case
20:34karolherbst: and hope for the best?
20:34karolherbst: just marking all fences in flight as signalled and move on or something
20:35imirkin: and fail all future submits unless some flag is set
20:35karolherbst: we already do that
20:35karolherbst: the kernel rejects those
20:35karolherbst: but I think the first step was to get the libdrm changes in asap
20:36karolherbst: so we have this convenience function to check for a dead channel
20:36karolherbst: submitting work also gives us the proper value, but doesn't help if a single threaded app waits until the fence wait times out
20:37karolherbst: so this is the gap I want to fix here
20:38karolherbst: it's for a reason that the check for a dead channel is implemented through the DRM_NOUVEAU_GEM_PUSHBUF ioctl :D
20:38karolherbst: works on an older kernel as well, as if there are no buffers attached it's essentially a noop
20:41karolherbst: imirkin: anyway.. my plan would be to get the current stuff in so the libdrm update gets rolled out and play around with recreating the channel on top of the current stuff
20:41karolherbst: I am just wondering if there is a better way of handling it atm
20:55Lyude: karolherbst: fantastic! I'm glad to finally hear that branch is going somewhere
20:55Lyude: if you have any questions about it feel free to let me know
20:55karolherbst: well, it's going nowhere, I just needed it to reliably crash the GPU channel :D
20:55Lyude: ah lol
20:56karolherbst: might be a good idea to try to figure out why that is happening though
22:42karolherbst: not great, but good enough: [ 188.458298] nouveau 0000:01:00.0: Xwayland: channel 8 killed of process glcts!
22:43karolherbst: so apparently we sometimes have to kill the channel even though no trap or fault happened :)
23:07karolherbst: imirkin: anyway, would be cool if you could take a look at the patches :)