00:02karolherbst: it's so funny... depending on how I mess up the service file for getty network doesn't even come up.. nice
00:04imirkin: systemd is great.
00:06karolherbst: well in the past openrc failed me more often though :D
00:07karolherbst: but the issue is rather that getty doesn't want to start...
00:15imirkin: i liked back when there was just a rc.M :)
00:15imirkin: much easier to figure out what's happening
00:16imirkin: although getty would always have been managed by init
00:16karolherbst: sure, but it was also slow and processes could just make it even slower
00:16imirkin: via inittab
00:17karolherbst: anyway.. console doesn't seem to corrupt now after suspend.. oh well.. one day I'll figure that shit out
00:17imirkin: i'll take slight slowness once a month over vast every-day system complexity
00:17karolherbst: I did the cycle without having gnome started
00:17karolherbst: there is no error
00:17karolherbst: but something is clearly wrong
00:17ccr:sings "blame gnomeda!"
00:18karolherbst: how do I know you ask? well the output on the display is all garbage
00:18imirkin: do you see something in dmesg about "running init tables"?
00:18imirkin: perhaps the logic for when to do that got messed up
00:19imirkin: supposed to do it on resume
00:19karolherbst: ohh.. good point, let me check
00:20karolherbst: at least it suspends very quickly now
00:21karolherbst: imirkin: okay.. so with init_on_alloc=0 it does work and it does run init tables
00:22imirkin: maybe the problem is actually not with resume
00:22imirkin: but with suspend?
00:22karolherbst: might be.. but when gnome is started it takes a minute to suspend
00:22karolherbst: could be unrelated.. could be not
00:22karolherbst: who knows
00:22imirkin: both ways?
00:22imirkin: or only with init_on_alloc=1
00:22karolherbst: uhm... not sure
00:22imirkin: you just tried it with init_on_alloc=0
00:23imirkin: was it fast to suspend?
00:23karolherbst: but without gnome started as I said
00:23imirkin: and in the past it's been slow to suspend
00:23karolherbst: first let me figure it out without complex userspace started
00:23karolherbst: might make it easier to figure out
00:24karolherbst: it also suspends quickly with init_on_alloc=1 without gnome
00:24karolherbst: and the console is not garbaged, but it could be because now I actually have getty started on a tty...
00:24karolherbst: (which is the thing I tried to check if that makes a difference)
00:26karolherbst: mhhh, it seems like that getty has some form of impact here... or that stuff is super random..
00:27karolherbst: imirkin: okay...
00:27karolherbst: I think I figured it out at least without gnome
00:27karolherbst: stuff works either way
00:27karolherbst: just without getty you get garbage on resume
00:28ccr: while this may be a long shot, I am curious whether disabling pm_async would make any difference .. eg echo 0 > /sys/power/pm_async before suspend/resume
00:28karolherbst: which becomes normal once you start getty
00:30karolherbst: ccr: what does it change?
00:30ccr: when enabled (which I think is the default), devices are shut down asynchronously. and sometimes it may not work.
00:31karolherbst: imirkin: yeah soo.. no matter what init_on_alloc is, the output on the display is all garbage when there is no actual application started
00:31imirkin: karolherbst: if there's literally nothing driving the VT, that makes sense
00:31karolherbst: ccr: but userspace still waits until it's all done, right?
00:31ccr: "If enabled, this feature will cause some device drivers' suspend and resume callbacks to be executed in parallel with each other and with the main suspend thread."
00:31imirkin: logind should always be running, or whatever the thing is
00:31karolherbst: imirkin: yeah, I'd agree in general
00:32imirkin: iirc getty was for serial
00:32imirkin: but maybe it covers everything? been a while since i looked
00:32karolherbst: imirkin: logind only starts when there is a login manager requiring logind
00:32ccr: karolherbst, yes.
00:32karolherbst: which I think also includes getty started on a tty
00:32karolherbst: not sure though
00:32karolherbst: but I guess so, as the sessions do appear in loginctl
00:33ccr: anyway, sorry for interjecting :)
00:33karolherbst: with debug=debug everything is just soooooooo slow :(
00:34karolherbst: imirkin: something is up even with init_on_alloc=0...
00:37karolherbst: ccr: soo... here is what happens: suspend takes like a minute and on resume it breaks inside nouveau
00:37karolherbst: and this only happens if there are userspace applications actually doing OpenGL started
00:37karolherbst: so this explains why with gnome started it triggers
00:37karolherbst: but is there something the kernel does when drivers take too long to suspend?
00:40ccr: no idea. in my personal case, pm_async=1 caused suspend to take 2-3 minutes to finish. using pm_async=0 made it "instant" .. well, as instant as a 10+ year old laptop can be.
00:41ccr: (of any deeper reasons I have no idea, my google-fu at that time did not reveal any clues how to debug async suspend problems)
00:45karolherbst: ccr: wtf...
00:46karolherbst: ccr: why does this help...
00:46ccr: hmm :D sorry if I made this issue more wtf'y
00:47karolherbst: ccr: is there a linux command line option for it?
00:49ccr: not that I know of
00:49ccr: but if pm_async=0 helps, then there may be some issue in the ordering of device suspend/resume that causes a problem
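A minimal sketch of the toggle ccr describes, assuming the knob lives at /sys/power/pm_async as quoted above and that the script runs as root. The helper name `set_pm_async` is made up for illustration; it does the same thing as `echo 0 > /sys/power/pm_async` and returns the old value so it can be restored after the test cycle.

```python
from pathlib import Path

# Path taken from the conversation above; writing to it needs root.
PM_ASYNC = Path("/sys/power/pm_async")


def set_pm_async(enabled: bool, path: Path = PM_ASYNC) -> str:
    """Write "1" or "0" to the pm_async knob and return the previous value.

    `path` is a parameter only so the helper can be pointed at a scratch
    file when experimenting without touching the real sysfs entry.
    """
    previous = path.read_text().strip()
    path.write_text("1" if enabled else "0")
    return previous
```

Typical use would be `old = set_pm_async(False)` before `systemctl suspend`, then `set_pm_async(old == "1")` afterwards to put things back.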
00:49karolherbst: guess it wouldn't hurt to figure out why it hangs then
00:53karolherbst: that moment when you execute "systemctl suspend" in the wrong shell
00:57karolherbst: when you do it again..
01:00ccr: "oops, I did it again"?
01:01imirkin: karolherbst: just don't run your compositor in a shell displayed by that compositor, and then ^Z
01:01imirkin: (i guess not a thing with wayland. but def a thing with X)
01:02karolherbst: but it seems like that with pm_async=0 it still fails with init_on_alloc=1
01:10imirkin: karolherbst: fyi there's a thing to let the suspend abort at the last possible step
01:10imirkin: which helps debug
01:10karolherbst: yeah, I am aware
01:12karolherbst: just .... mhh.. with the serial console not accepting my input that will be annoying to debug
01:13karolherbst: ehh wait
01:13karolherbst: it just prints more stuff
01:14karolherbst: imirkin: it just tells me the same stuff I get on resume: https://gist.githubusercontent.com/karolherbst/89f43454c09dbab11f662a37ff3f86bc/raw/aa92841e73e489c647e742865db7b611473ccdcc/gistfile1.txt
01:14karolherbst: which... isn't telling much
01:14imirkin: karolherbst: yeah, but then you can at least take _some_ platform stuff out of the equation
01:14imirkin: since it never _actually_ suspends
01:14karolherbst: well it does suspend
01:14imirkin: then it's not the thing i had in mind :)
01:14karolherbst: just the console works until you hit the actual suspend
01:15karolherbst: and doesn't stop before
01:16karolherbst: imirkin: I could do a s2idle suspend instead of deep
01:16karolherbst: but I think that's equally broken
01:16imirkin: there's some pm debug thing
01:16imirkin: where it aborts the suspend right before suspending
01:17karolherbst: mhh, but does it give me ssh?
01:17imirkin: looks like /sys/power/pm_test is the thing
01:17karolherbst: yeah well.. without ssh it is a bit pointless.. or well... input device drivers or something
01:17imirkin: you need CONFIG_PM_DEBUG to use it
01:17imirkin: that lets you suspend bits and pieces
01:18karolherbst: so I set it to freezer and only userspace gets frozen and then it resumes again?
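A rough sketch of driving /sys/power/pm_test from a script, assuming a kernel built with CONFIG_PM_DEBUG and root privileges. Reading the file lists the available test levels with the active one in square brackets (for example `[none] core processors platform devices freezer`). The helper names `parse_pm_test` and `set_pm_test` are hypothetical.

```python
from pathlib import Path

# Path named in the conversation above; only present with CONFIG_PM_DEBUG.
PM_TEST = Path("/sys/power/pm_test")


def parse_pm_test(text: str) -> tuple[str, list[str]]:
    """Split pm_test contents into (selected level, all levels).

    The selected level is the token wrapped in square brackets.
    """
    selected = ""
    levels = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        levels.append(token)
    return selected, levels


def set_pm_test(level: str, path: Path = PM_TEST) -> None:
    """Select a test level, refusing names the kernel did not advertise."""
    _, levels = parse_pm_test(path.read_text())
    if level not in levels:
        raise ValueError(f"unknown pm_test level {level!r}; have {levels}")
    path.write_text(level)
```

With `set_pm_test("freezer")` the next suspend attempt should only freeze tasks, wait a few seconds and thaw them again; `set_pm_test("devices")` additionally runs the device suspend callbacks, which is the level that breaks a few lines further down. `set_pm_test("none")` restores normal suspend.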
01:21karolherbst: imirkin: well.. seems to break at "devices"
01:21imirkin: that's not extremely surprising
01:21karolherbst: not very
01:21imirkin: given that it's nouveau that's causing issues
01:22karolherbst: fun thing is.. it does continue without issues though
01:22karolherbst: it still takes a minute
01:22karolherbst: let me rephrase it
01:22karolherbst: it's "less" broken
01:22karolherbst: still getting some [ 281.151031] nouveau 0000:01:00.0: gr: TRAP_MP_EXEC - TP 0 MP 0: 00000010 [INVALID_OPCODE] at 07fac8 warp 6, opcode e0991405 40204780
01:23karolherbst: guess when you don't power cycle, VRAM is still mostly intact
01:24karolherbst: what worries me most is, that kasan doesn't trigger... so either we do something so wrong it is still correct or something super odd is happening
01:24karolherbst: that it takes a minute kind of makes me think we loop somewhere for quite some time
01:25karolherbst: but if we loop that long I'd assume we either go OOB or...... well
01:25karolherbst: obviously don't progress at all
01:25karolherbst: but why does it suspend at some point then?
01:26karolherbst: imirkin: wasn't there a way to let the kernel print stacktraces every X seconds or so?
01:26imirkin: no, but you can write a command to print stacktraces
01:26imirkin: by echoing something to /proc/sysrq-trigger
01:26karolherbst: yeah well...
01:26karolherbst: I want it to do it while it is suspending
01:26imirkin: you can also force it over the serial
01:26imirkin: so maybe it'll work
01:26imirkin: only if console=serial though
01:26karolherbst: I want it to happen after userspace is frozen
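For reference, a small sketch of the "write a command" approach imirkin mentions: poking /proc/sysrq-trigger every few seconds from a root process. The sysrq key `l` asks for a backtrace of all active CPUs (`w` would dump blocked tasks instead). The function name and its parameters are made up for illustration. Note that this is an ordinary userspace loop, so it stops being scheduled once tasks are frozen and would not cover the window karolherbst actually cares about.

```python
import time
from pathlib import Path

# Needs root; the output shows up in the kernel log (dmesg / serial console).
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def dump_backtraces(count: int, interval: float, key: str = "l",
                    path: Path = SYSRQ_TRIGGER, sleep=time.sleep) -> int:
    """Write a sysrq key `count` times, sleeping `interval` seconds in between.

    `path` and `sleep` are parameters only so the loop can be exercised
    against a scratch file without waiting.
    """
    for i in range(count):
        path.write_text(key)
        if i + 1 < count:
            sleep(interval)
    return count
```

Something like `dump_backtraces(count=12, interval=5.0)` started right before the suspend would log backtraces for roughly a minute, or until the freezer stops the process.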
01:27karolherbst: and remember.. my serial console kind of doesn't send over my input :(
01:27karolherbst: guess I need to figure out what's wrong with that stuff and only connect chosen pins
01:28karolherbst: mhh, but if we wouldn't progress soft lockup would trigger..
01:28karolherbst: this issue is getting quite annoying
01:36karolherbst: guess it's printk time
01:40karolherbst: imirkin: anyway.. soo the user figured out that with init_on_alloc=1 it fails to wait on a fence, times out and ttm eviction fails.. at least that's how I understand it
01:42karolherbst: maybe I should start using ftrace.. seems like a nice tool actually
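A hedged sketch of what an ftrace session for this could look like, assuming tracefs is mounted at /sys/kernel/tracing, the kernel has the function_graph tracer built in, and the script runs as root. The `nouveau_*` filter glob and both helper names are illustrative choices, not what was actually used to produce the trace linked below.

```python
from pathlib import Path

# Usual tracefs mount point; older setups use /sys/kernel/debug/tracing.
TRACEFS = Path("/sys/kernel/tracing")


def start_trace(pattern: str = "nouveau_*", tracer: str = "function_graph",
                root: Path = TRACEFS) -> None:
    """Restrict tracing to functions matching `pattern` and switch it on."""
    (root / "tracing_on").write_text("0")            # pause while reconfiguring
    (root / "set_ftrace_filter").write_text(pattern)  # glob over function names
    (root / "current_tracer").write_text(tracer)
    (root / "tracing_on").write_text("1")


def stop_and_read(root: Path = TRACEFS) -> str:
    """Stop tracing and return the contents of the trace buffer."""
    (root / "tracing_on").write_text("0")
    return (root / "trace").read_text()
```

The idea would be to call `start_trace()`, run the suspend/resume cycle, then save `stop_and_read()` to a file and diff the traces between a working and a broken run.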
02:01karolherbst: nice.. https://gist.githubusercontent.com/karolherbst/107e87546bc7553bf859c229de4f2db5/raw/592719d492ec417380e1d570ea5bca63f79bd4b8/gistfile1.txt
02:01karolherbst: I guess I can start working like this
15:32karolherbst: imirkin: sooo.. something is up with fencing
15:34karolherbst: mhh, so we call into nouveau_bo_move, do all the stuff and then ... nothing happens
15:44karolherbst: then we do a sw copy with ttm_move_memcpy ...
15:45karolherbst: and that takes forever.... because... no clue
15:46karolherbst: but maybe drm_memcpy_from_wc just takes a while because it's actually copying stuff
15:49karolherbst: okay... I can confirm that I have the same issue the user had while tracing
15:50karolherbst: hw copy fails (seems to not matter if cipher or not is used), then we do a sw copy and that "succeeds" but stuff still messes up
16:10imirkin: you know who knows this stuff? skeggsb :)
16:10karolherbst: but init_on_alloc _makes_ a difference
16:10imirkin: annoying, right?
16:10karolherbst: I can figure out where
16:13karolherbst: ehhh wait a second...
16:13karolherbst: ahh no, that's just the diff tool being odd
16:26karolherbst: imirkin: okay.. so the first difference I have inside nouveau_fence_sync mhhh
16:26karolherbst: and.. well...
16:26karolherbst: it looks strange
16:27karolherbst: top is alloc=1 a.k.a. broken
16:33karolherbst: I think I got it
16:33karolherbst: just let me verify for a sec
16:45karolherbst: imirkin: fun.... something is really up with the fences
16:47karolherbst: I literally just put some printks inside nouveau_fence_sync and now everything already falls apart when starting gdm
16:49karolherbst: still confused why kasan doesn't spot anything :/
17:00karolherbst: I recompiled the kernel and now everything breaks regardless
17:02ccr: heisenberg called and wants his uncertainty principle back
17:03karolherbst: it makes no sense
17:42karolherbst: maybe I should run memcheck?
17:42karolherbst: maybe VRAM breaks?
17:43imirkin: what does VRAM have to do with it?
17:43imirkin: oh, the fence?
17:43imirkin: fence across suspend/resume?
17:44karolherbst: now I can't even start gnome without hitting timeouts
17:44karolherbst: reason: unknown
17:44karolherbst: just recompiled the kernel between tries
17:45karolherbst: it still smells like some weirdo memory issue, but..... I would have expected kasan to complain unless it's an issue kasan won't detect
17:46karolherbst: mhh.. wasn't there something else?
18:07imirkin: Lyude: i sorta assume you did this, but just in case ... did you test that your backlight fix doesn't break the things you were trying to fix in the first place?
18:08karolherbst: ehhh.. I forgot to enable this vmalloc thing for kasan
18:21karolherbst: pmoreau: do you still have your muxed hybrid nvidia laptop?
18:22pmoreau: I do
18:22karolherbst: pmoreau: mind checking that https://patchwork.freedesktop.org/patch/472288/ doesn't break backlight?
18:22karolherbst: I get the feeling that it could impact a system like yours
18:22karolherbst: Lyude: ^^
18:23pmoreau: Sure, I’ll try that
18:23karolherbst: pmoreau: but the display is disconnected on the inactive GPU, right
18:23karolherbst: well.. would be odd if it's connected on both, hence me wondering
18:24pmoreau: Though, the backlighting is handled by apple-gmux, not Nouveau
18:24karolherbst: was going to ask that
18:24karolherbst: pmoreau: and nouveau can't control it?
18:25karolherbst: pmoreau: so I guess we are probably fine after all.. but do you know of systems where it's not like this?
18:29karolherbst: guess it can also happen on weirdo intel+nv laptops with muxes..
18:29karolherbst: but no clue if those were a thing
18:29karolherbst: imirkin: ..... ahhhhhh
18:30karolherbst: this bug deserves a place in my showcase of the most pita bugs
18:31karolherbst: and I am sure the bugfix will be a one liner
18:31pmoreau: I don’t think Nouveau can control it, no
18:32imirkin: karolherbst: yeah, i suspected it'd be some dumb little thing
18:32imirkin: question is ... WHAT dumb little thing :)
18:32karolherbst: although I get the feeling I am not that far away
18:33pmoreau: I don’t remember how it works on the other MBP I have, if it’s also controlled by apple-gmux or not
18:33karolherbst: pmoreau: if it's apple I suspect apple-gmux kind of
18:33karolherbst: maybe if it's muxless it doesn't?
18:34pmoreau: Let me double check quickly, if there is still some battery left in it
18:38pmoreau: apple-gmux there too
18:38karolherbst: so at least your system won't get broken :)
18:39pmoreau: It’s been a while since I last booted on Linux on that laptop: installed kernel is at 5.8.
18:39pmoreau: Though, I barely use it at all so not too surprising.
18:40pmoreau: Nice! I’m stuck on 5.14 anyway, cause it fails to unlock my encrypted drive with later kernels.
18:41karolherbst: I guess the initramfs just got broken or something
18:41karolherbst: try regenerating the initramfs on an older kernel :)
18:41pmoreau: Could try
18:41karolherbst: please don't
18:42karolherbst: or at least back up your old initramfs :D
18:42karolherbst: imirkin: okay.. soo yeah.. I was right as it seems
18:43karolherbst: nouveau_fence_sync behaves differently
18:58karolherbst: imirkin: we don't get a dma_resv object in the case where it's not working
18:58karolherbst: so we don't sync
18:58karolherbst: and start evicting memory
19:00karolherbst: ehh wait..
19:00karolherbst: maybe I messed up
19:18Lyude: imirkin: yeah I checked
19:18imirkin: Lyude: cool. it can be really tempting to just fix the problem at hand that you forget about the original thing :)
19:19Lyude: imirkin: yes - glad you reminded me lol
19:19Lyude: good news, seems I'll have another unrelated backlight patch in a moment too
19:20imirkin: and that's what everyone loves -- more backlight patches ;)
19:23Lyude: i'm just happy I've finally got the time to go through all of this stuff
19:54karolherbst: Lyude: that reminds me, I had some WIP patches to make backlight support required but it depends on other patches which might or might not have made it upstream
19:55karolherbst: but given that it can cause issues, maybe we should have a flag to be able to disable it? mhh
19:57Lyude: karolherbst: what issues?
19:57Lyude: I'm fairly sure I just fixed all of the backlight issues I know about, and tbh I don't think we should just be randomly disabling functionality like that
19:59karolherbst: Lyude: just backlight related regressions
20:00karolherbst: ohh not to disable it
20:00karolherbst: just to give users a flag to check
20:00karolherbst: or as a workaround
20:00Lyude: oh oops, sorry my brain inserted "default" into your sentence for some reason
20:00Lyude: karolherbst: yeah - would probably be good for us to add something to the module parameters
21:30karolherbst: imirkin: okay.. I am convinced that init_on_alloc isn't causing anything, it just makes random things random in a different way
21:33imirkin: consistently so, curiously
21:35karolherbst: imirkin: well.. now I have the situation that gnome starts and just suspend+resume is broken, no matter the value of init_on_alloc
21:45enyc: Is nouveau relatively static now? What's the situation with very new cards etc.?
21:50Lyude: no, it's still in active development, and nvidia's last public statement was that they would get us firmware ~someday~
21:50imirkin: that statement has remained constant since 2014 or so
21:50imirkin: and that day has yet to come
21:50Lyude: we do get other resources though
21:51Lyude: so I don't think nvidia's entirely abandoned it
21:52imirkin: they just have no intention of allowing nouveau to be usable for doing anything but the bare basics
21:52imirkin: as for the newest cards, nvidia tends to release firmware about 1-2 years after initial release
22:00anarsoul: are you talking about firmware for reclocking?
22:00Lyude: no, just plain display firmware in some cases
22:00Lyude: but also that
22:01imirkin: well, not display - there's no firmware for that
22:01imirkin: but rather gr ctxsw, etc
22:01Lyude: whoops - yes, gr ctxsw
22:01Lyude: that sort of thing
22:01imirkin: otherwise all you can get is literally display
22:01imirkin: i.e. dumb framebuffer
22:03imirkin: enyc: i'd say nouveau is extremely static right now, btw
22:03imirkin: almost no work being done on it
22:03imirkin: most of the work is unfucking the various "refactors" and "improvements" core systems put in
22:04Lyude: we still actively support new hardware as soon as it's possible, I don't think it's fair to call the whole driver static, and there's still work going on for nouveau CI stuff
22:04imirkin: well, that's my perception, as a not-total outsider
22:20karolherbst: Lyude: we don't need firmware for display offloading anymore though
22:20karolherbst: as we have a workaround for that
22:32enyc: hrrrm ok =)