00:00 karolherbst: imirkin: and I also wrote an email on how to improve the kernel level patch submission process for random folks, you might want to leave a comment as well
00:00 imirkin: i saw
00:01 imirkin: i tried to do the proc name thing a long while back
00:01 imirkin: but realized i had to modify a LOT of stuff
00:01 imirkin: so i gave up :)
00:01 karolherbst: yeah..
00:01 karolherbst: I got too annoyed :D
00:01 imirkin: it actually seems like you were able to get away without modifying _too_ much
00:01 imirkin: i think some of the structure changed since.
00:01 karolherbst: but all three MRs should improve the debuggability _and_ user experience by quite a lot
00:02 karolherbst: imirkin: maybe
00:02 karolherbst: I think I cheated my way in
00:02 imirkin: cheating is good too :)
00:02 karolherbst: I don't like to touch all of those nvif thingies, but meh...
00:03 karolherbst: the nv50 and older stuff isn't even used afaik
00:03 karolherbst: ohh nv50 is, but not g82
00:04 karolherbst: I'd rather have a "base" struct for those fake ioctls, but oh well.. that's up to skeggsb anyway :p
00:04 imirkin: anyways, you probably realize this, but ben's going to have to look at that kernel change :)
00:05 karolherbst: :D
00:05 karolherbst: hence my email
00:05 karolherbst: anyway, you can help out with the "detecting dead channels from userspace" problem :p
00:06 karolherbst: I have no clue why we freeze X/whatever sometimes while waiting on fences, but we do and this should make users more happy overall
00:07 karolherbst: ohh, with that I can even wire up this robustness thing
00:08 imirkin: yeah, i have no idea why X freezes either
00:08 karolherbst: well.. I think this is the best we can do for now ¯\_(ツ)_/¯
00:09 karolherbst: I don't think there is a different place we just block forever
00:09 karolherbst: maybe I overlooked something
00:09 imirkin: can you repro the freeze (and then check it's fixed)?
00:10 karolherbst: I know I was able to, but not sure what exactly it was...
00:10 imirkin: sure
00:10 imirkin: there could be 75 freezes
00:10 imirkin: and all you know is you fixed 1 of them
00:11 imirkin: ;)
00:13 karolherbst: ohh right
00:13 karolherbst: this vdpau freeze when seeking
00:14 karolherbst: but that might be something else...
00:46 karolherbst: mhhh.. I don't remember where it helped to just kill the process anymore :(
00:46 karolherbst: some of the CTS tests will do that though :D
00:46 karolherbst: let's see
00:55 karolherbst: I think it was piglit actually
00:56 karolherbst: imirkin: worst case those patches simply improve CTS+piglit runtimes :D
00:57 imirkin: the horror!
00:57 karolherbst: soo.. I found a case where the screen freezes but only until nouveau manages to kill the channel
00:57 karolherbst: but I think there was something
00:57 karolherbst: I am aware of a game which froze everything
00:57 karolherbst: but that was when looking into this other thing
00:58 karolherbst: power gating
00:58 karolherbst: I think it was on gm206 where some of Lyude's more experimental patches broke stuff
00:58 karolherbst: and I know it froze until I killed the game
00:58 karolherbst: maybe I am able to reproduce this tomorrow
00:59 imirkin: yeah, it freezes until chan is killed
00:59 imirkin: i thought that was the "usual" thing
00:59 imirkin: sometimes killing fails
00:59 imirkin: and it never recovers
00:59 karolherbst: right
00:59 karolherbst: but I also know the case where it kills successfully, but it still freezes
00:59 imirkin: or it doesn't detect that it should kill the chan, etc
00:59 imirkin: hmmm
00:59 imirkin: i haven't seen much of that
01:00 karolherbst: I know I worked on that months/years ago because I had a stable reproducer
01:04 karolherbst: imirkin: e.g. https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/551
01:04 karolherbst: channel 6 indicates that the channel doesn't belong to X
01:05 karolherbst: but that's one of those freezes where killing userspace unfreezes it
01:05 karolherbst: and the channel is apparently already dead
01:06 karolherbst: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/457 as well
01:06 karolherbst: well.. not sure if killing the process helps here, but..
01:06 karolherbst: but the last one sounds promising to reproduce
11:19 felco: morning guys! I have a gtx650ti and a gtx1080 and I can get more logs if that helps, just let me know
11:20 felco: I can also run tests with patches, I have a spare disk to put distros, kernels and patches on
16:23 karolherbst: Lyude: ehh.. where did you have your clockgating patches for maxwell?
16:27 karolherbst: or am I confusing things and I managed to break it on kepler?
16:27 karolherbst: you remember the time where I found a game's channel breaking and figured out it was the clockgating stuff all along?
16:35 karolherbst: ahh yeah...
16:35 karolherbst: https://github.com/Lyude/linux/commits/wip/kepler+-clockgating-v1r5
16:35 karolherbst: soo.. let's see..
16:41 karolherbst: or was it that? :O https://gitlab.freedesktop.org/lyudess/linux/-/tree/wip/maxwell-clkgate-nopm-v1.1
16:46 karolherbst: indeed
16:47 karolherbst: yeah
16:47 karolherbst: okay.. cool
19:30 karolherbst: Lyude: I just rebased your maxwell clockgating stuff :D and I need it for stuff not even related
20:03 karolherbst: imirkin: I have a reproducer for the dead channel -> freeze thing :)
20:03 karolherbst: and killing the process makes it unfreeze :)
20:03 karolherbst: okay.. nice nice nice
20:03 karolherbst: it's not always freezing the desktop though
20:03 imirkin: woo!
20:04 karolherbst: yeah...
20:04 imirkin: finally got nouveau to freeze
20:04 imirkin: shouldn't be _that_ hard :)
20:04 karolherbst: well
20:04 karolherbst: if you want it to happen reliably..
20:04 karolherbst: I had to dig out Lyude's patches to enable clockgating on maxwell2 :)
20:04 karolherbst: and to start a game where it kills the channel within seconds in case it's enabled
20:04 imirkin: hehe
20:05 imirkin: or try to make queries work
20:05 imirkin: that usually causes hangs real quick
20:05 karolherbst: the issue with that game is those insanely huge pushbuffers
20:05 karolherbst: and I think the engines go idle in the meantime or something
20:05 karolherbst: or something
20:05 karolherbst: smaller pushbuffers "fixes" it
20:05 imirkin: i'll make a simple libdrm-using program which insta-hangs
20:05 imirkin: if that's helpful
20:06 karolherbst: as long as it freezes and you don't run into this libdrm assert crashing the process :)
20:07 imirkin: nah, just a plain ol' gpu hang
20:07 imirkin: just need to send like 2 commands
20:07 imirkin: maybe 3?
20:07 karolherbst: I mean I have enough CTS tests which just kill the channel
20:07 karolherbst: but they don't freeze the desktop
20:07 imirkin: ok
20:07 imirkin: they kill the channel due to some error, right?
20:07 karolherbst: that freezing part is usually the difficult part
20:07 imirkin: mine will just freeze. no errors.
20:07 imirkin: all is well
20:08 imirkin: except nothing is working :)
20:08 karolherbst: ohh interesting
20:08 karolherbst: well the libdrm patches only help if the channel is actually killed
20:08 imirkin: no way to do it via GL (without bugs in impl of course)
20:08 imirkin: but very easy to put some commands together to do it
20:08 karolherbst: btw.. here I just see a "[ 1878.122047] nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]" :(
20:08 imirkin: yeah, that might happen.
20:08 imirkin: i could believe that.
20:08 karolherbst: which keeps me thinking about our infamous kepler1 bug
20:08 karolherbst: well
20:09 karolherbst: clockgating is causing it :)
20:09 imirkin: oh
20:09 karolherbst: and with clockgating disabled it doesn't happen afaik
20:09 imirkin: but i could imagine that this would be the end result on fermi+ for my thing
20:09 karolherbst: I had _fun_ debugging this one
20:09 imirkin: actually no. it shouldn't.
20:09 imirkin: all should be well.
20:09 imirkin: except nothing will be working.
20:09 karolherbst: because how do you get from "game is crashing the channel" to "experimental clockgating patches are the issue"
20:10 karolherbst: I still need to improve my "blaming the correct process" patches
20:10 karolherbst: https://gist.githubusercontent.com/karolherbst/a594806dd95f1f55f6f12f282c615c06/raw/82c33e1291001080b4d880509b00a9b5b7d54df9/gistfile1.txt :(
20:11 imirkin: i'll try to put it together tonight
20:11 karolherbst: I fixed it for two out of the three places
20:11 imirkin: i have a second gpu i can test it on
20:11 karolherbst: but this last one is super annoying to fix
20:11 imirkin: i do think channel recovery fixes it when you kill the prog
20:11 karolherbst: ahh
20:11 karolherbst: well
20:11 imirkin: but i think if it gets run on the main GPU
20:11 karolherbst: but not much we can do about it from within mesa, no?
20:11 imirkin: then it freezes everything forever
20:11 karolherbst: as long as the kernel doesn't kill it on its own
20:12 imirkin: oh, i thought you were looking at the freezing
20:12 karolherbst: I do
20:12 karolherbst: in my instant the freezing happens despite the kernel killing the channel
20:12 imirkin: ah
20:12 karolherbst: *instance
20:12 karolherbst: I know it only fixes a handful of issues, but it improves CTS run times _and_ fixes some of the freezes
20:12 karolherbst: so
20:13 karolherbst: what can be done about those other cases? mhh
20:13 karolherbst: especially shaders looping infinitely
20:13 karolherbst: those are super annoying
20:14 karolherbst: anyway.. all I do with my patches is to ask the kernel if the channel is still alive, and if it's dead just stop waiting on the fence
20:15 karolherbst: we will error out sooner or later anyway
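A minimal sketch of the wait loop karolherbst describes here: ask whether the channel is still alive on every poll, and stop waiting on the fence once it is dead. Everything in it is an assumption made for illustration. `FakeChannel`, `FakeFence`, `channel_is_dead` and `fence_wait` are hypothetical names, not libdrm or Mesa API, and the liveness check is a plain attribute read standing in for the real query to the kernel.

```python
import time


class FakeChannel:
    """Hypothetical stand-in for a GPU channel; not a libdrm or Mesa type."""

    def __init__(self):
        self.dead = False


class FakeFence:
    """Hypothetical fence; it only signals if the GPU finishes the work."""

    def __init__(self):
        self.signalled = False


def channel_is_dead(chan):
    # Assumption: the real patches ask the kernel; here we read a flag.
    return chan.dead


def fence_wait(fence, chan, poll_interval=0.001, max_polls=1000):
    """Return True if the fence signalled, False if we stopped waiting."""
    for _ in range(max_polls):
        if fence.signalled:
            return True
        if channel_is_dead(chan):
            # Nothing will ever signal this fence, so stop waiting on it.
            return False
        time.sleep(poll_interval)
    # Poll budget exhausted without a signal or a dead channel.
    return False
```

The caller still has to decide what to do with a `False` return; as the log notes, later submits will error out anyway, so this only closes the gap where a single-threaded application would otherwise sit in the wait.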
20:15 karolherbst: ohhh
20:15 karolherbst: libdrm without asserts enabled...
20:15 karolherbst: that's a problem
20:15 karolherbst: I forgot that part
20:17 karolherbst: let me verify that
20:22 karolherbst: uhhh
20:22 karolherbst: games ignoring my LD_LIBRARY_PATH thing
20:22 karolherbst: wow
20:22 karolherbst: how annoying
20:23 karolherbst: imirkin: okay.. so here is the deal.. once the channel is dead, it can take a while until we run into this assert(kref) thing
20:23 karolherbst: and some processes freeze the desktop once we reach it
20:23 karolherbst: s/once/until/
20:26 karolherbst: okay... more likely with a libdrm compiled without asserts :)
20:30 karolherbst: imirkin: sooooo... any good ideas what we should do in case we actually detect the channel is dead? I think with my patches we still rely on something to happen
20:32 karolherbst: if we are unlucky the process never crashes, but it will because it tries to dump the pushbuffer or something? dunno
20:34 karolherbst: but I think we might also just want to allocate a new channel or something in that case
20:34 karolherbst: and hope for the best?
20:34 karolherbst: just marking all fences in flight as signalled and move on or something
20:35 imirkin: and fail all future submits unless some flag is set
20:35 karolherbst: we already do that
20:35 karolherbst: the kernel rejects those
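A rough sketch of the recovery idea floated in the last few lines: once the channel is known to be dead, mark every in-flight fence as signalled so that waiters move on, and fail all later submits. The names (`Context`, `DeadChannelError`, `on_channel_dead`) are hypothetical. Per the log it is the kernel that rejects submits on a dead channel; the userspace check below only stands in for that behaviour, and the "unless some flag is set" part of imirkin's suggestion is left out.

```python
class DeadChannelError(Exception):
    """Hypothetical error for a submit against a dead channel."""


class Context:
    """Hypothetical per-process state: one channel plus its pending fences."""

    def __init__(self):
        self.channel_dead = False
        self.in_flight = []  # fences that have not signalled yet

    def submit(self):
        # Assumption: stands in for the kernel rejecting the submit.
        if self.channel_dead:
            raise DeadChannelError("channel is dead, submit rejected")
        fence = {"signalled": False}
        self.in_flight.append(fence)
        return fence

    def on_channel_dead(self):
        # Release every waiter: the GPU will never signal these fences.
        self.channel_dead = True
        for fence in self.in_flight:
            fence["signalled"] = True
        self.in_flight.clear()
```

Recreating the channel, which karolherbst mentions as a possible follow-up, would replace the permanent `channel_dead` state with a reset step; that is not sketched here.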
20:35 karolherbst: but I think the first step was to get the libdrm changes in asap
20:36 karolherbst: so we have this convenience function to check for a dead channel
20:36 karolherbst: submitting work also gives us the proper value, but doesn't help if a single threaded app waits until the fence wait times out
20:37 karolherbst: so this is the gap I want to fix here
20:38 karolherbst: it's for a reason that the check for a dead channel is implemented through the DRM_NOUVEAU_GEM_PUSHBUF ioctl :D
20:38 karolherbst: works on an older kernel as well, as if there are no buffers attached it's essentially a noop
20:41 karolherbst: imirkin: anyway.. my plan would be to get the current stuff in so the libdrm update gets rolled out and play around with recreating the channel on top of the current stuff
20:41 karolherbst: I am just wondering if there is a better way of handling it atm
20:55 Lyude: karolherbst: fantastic! I'm glad to finally hear that branch is going somewhere
20:55 Lyude: if you have any questions about it feel free to let me know
20:55 karolherbst: well, it's going nowhere, I just needed it to reliably crash the GPU channel :D
20:55 Lyude: ah lol
20:56 karolherbst: might be a good idea to try to figure out why that is happening though
22:42 karolherbst: not great, but good enough: [ 188.458298] nouveau 0000:01:00.0: Xwayland[6372]: channel 8 killed of process glcts[7177]!
22:43 karolherbst: so apparently we sometimes have to kill the channel even though no trap or fault happened :)
23:07 karolherbst: imirkin: anyway, would be cool if you could take a look at the patches :)