14:42imanho: :karolherbst is this the interesting parts of the driver for me?
14:43karolherbst: imanho: this is just how we do it, nvidia has different ioctls
14:43karolherbst: but yeah.. generally if the driver allows you to mmap that stuff then yes
14:44karolherbst: generally all of this is very driver specific
17:06karolherbst: imirkin: I bet that the difference between init_on_free and init_on_alloc is that this is purely for kernel memory and with init_on_free=1 you'd still get random values from userspace maybe?
17:07imirkin: like i said - needs investigation.
17:07karolherbst: anyway.. I assume we have some if check which does the right thing with random values in it :p
17:08karolherbst: this happens on suspend/resume, right?
17:08karolherbst: nice... let me try my tesla GPUs then
17:09karolherbst: got a new mobo, hoping that those legacy BIOS cards still work :D
17:11karolherbst: ehh.. not cool
17:12imirkin: i'm guessing it's less-than-fully-working?
17:12karolherbst: it does boot at least
17:13imirkin: just no display until nouveau loads?
17:13karolherbst: seems like it
17:13karolherbst: efifb does load though
17:14imirkin: makes sense
17:14imirkin: efifb doesn't know anything about displays
17:14imirkin: it just talks to the efi firmware
17:15imirkin: which is there :)
17:15karolherbst: ehh.. seems like my display was just annoying after all
17:15karolherbst: now it works after unplugging it
17:16imirkin: oh heh
17:17karolherbst: the serial console on this board is strange
17:20karolherbst: at least on my G80 it does seem to work over DVI
17:21karolherbst: imirkin: does anything specific come to mind that those systems have in common?
17:21imirkin: you're testing with a literal G80?
17:21imirkin: well, fwiw i think all the failures were on like G92 or G96's or later
17:22imirkin: at least the GPIO stuff changed around a bunch
17:22imirkin: also G80 didn't have a PCIPHER
17:22karolherbst: ahh nice.. passively cooled G92 here
17:22imirkin: whereas G84+ do (with a handful of weird laptop exceptions where it's fused off)
17:23imirkin: and it's used for copying stuff around, esp on resume
17:23imirkin: also are you doing suspend-to-ram or disk?
17:23karolherbst: to ram
17:23imirkin: ok. i think that's what the other people were doing
17:25karolherbst: something is up
17:25karolherbst: gnome stopped doing... anything
17:25karolherbst: as in it doesn't login completely
17:27karolherbst: ahh yeah
17:27karolherbst: imirkin: kasan complains loudly
17:27karolherbst: ehh wait
17:27karolherbst: it's not kasan
17:27karolherbst: soft lockup
17:27imirkin: whatever it is, it's loud ;)
17:29karolherbst: but gnome already doesn't start, so I assume there is something up for real
17:30karolherbst: can also be this backlight regression ...
17:30karolherbst: let's see what git bisect says
17:35karolherbst: ehhh... I hope my CPU even boots on older kernels :O
17:38karolherbst: mhh [ 395.589275] nouveau 0000:01:00.0: fb: trapped read at 4f31764800 on channel 3 [0f8c4000 Xorg] engine 00 [PGRAPH] client 0a [TEXTURE] subclient 00  reason 00000000 [PT_NOT_PRESENT]
17:38karolherbst: G80 _did_ work
17:38imirkin: G80 uses m2mf to do copies, fwiw
17:38imirkin: rather than PCIPHER
17:39imirkin: (which is a copy engine part of video unit, which can also do encryption)
17:39karolherbst: but something mm related is completely broken as it seems
17:39imirkin: yeah, people have been changing stuff around there and making lots of changes to nouveau, with what i assume is zero testing
17:40imirkin: this is why i keep an older kernel on my box -- upgrading generally causes things to not work
17:40karolherbst: wouldn't be the first regression
17:40imirkin: i'm on 5.6 now which has proven fairly solid
17:40karolherbst: I am sure I won't even be able to boot on that one :D
17:41karolherbst: well maybe boot
17:41karolherbst: got a new i7-12700 here and I could imagine that the x86 code isn't ready for it
17:41imirkin: yea no clue
17:41imirkin: in general these things are pretty backwards-compatible
17:41karolherbst: right... but mono also broke on big.LITTLE configs due to weirdo cache line assumptions
17:42karolherbst: not sure how much the cores are different on alderlake, but...
17:43karolherbst: at least the L2 cache is different in size
17:49karolherbst: now I got a NULL pointer access on 5.14...
17:49karolherbst: but we also had this other weirdo ttm bug
18:42karolherbst: imirkin: ehh... I tested on the wrong branch :(
18:42karolherbst: I was missing the ttm bugfix
18:42karolherbst: yeah.. I thought drm-misc-next would be good, but...
19:30karolherbst: mhhh.. my machine doesn't suspend though...
19:30karolherbst: ahh now it did.. that took long
19:31karolherbst: mhh "[ 127.352945] nouveau 0000:01:00.0: fb: trapped read at 0000000d00 on channel 3 [0f8c4000 Xorg] engine 00 [PGRAPH] client 05 [CCACHE] subclient 00 [CB] reason 00000002 [PAGE_NOT_PRESENT]"
19:31imirkin: that's ... a very low address
19:31imirkin: feels like where the PTE's live?
19:32karolherbst: might be
19:32karolherbst: trying without init_on_alloc=1 now
19:32imirkin: er, PDE's rather
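The two "trapped read" lines quoted in this log share a fixed layout (address, channel, engine, client, subclient, reason), so they can be picked apart mechanically when comparing faults across boots. The following is only a sketch: the pattern is modeled on those two quoted lines, other kernel versions may word the message differently, and the name `parse_fault` is made up for this example.

```python
import re

# Pattern modeled on the two "trapped read" lines quoted in this log.
# Assumption: other kernels may format the message differently.
FAULT_RE = re.compile(
    r"fb: trapped (?P<access>read|write) at (?P<addr>[0-9a-f]+) "
    r"on channel (?P<chan>\d+) \[(?P<inst>[0-9a-f]+) (?P<proc>[^\]]+)\] "
    r"engine (?P<engine_id>[0-9a-f]+) \[(?P<engine>[^\]]*)\] "
    r"client (?P<client_id>[0-9a-f]+) \[(?P<client>[^\]]*)\] "
    r"subclient (?P<subclient_id>[0-9a-f]+)\s+"
    r"(?:\[(?P<subclient>[^\]]*)\]\s+)?"  # the subclient name is absent in one sample
    r"reason (?P<reason_id>[0-9a-f]+) \[(?P<reason>[^\]]*)\]"
)


def parse_fault(line):
    """Return the fault fields as a dict, or None if the line does not match.

    parse_fault is a hypothetical helper, not part of nouveau or any tool.
    """
    m = FAULT_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["addr"] = int(fields["addr"], 16)  # faulting address as an integer
    fields["chan"] = int(fields["chan"])
    return fields


if __name__ == "__main__":
    sample = (
        "[ 127.352945] nouveau 0000:01:00.0: fb: trapped read at 0000000d00 "
        "on channel 3 [0f8c4000 Xorg] engine 00 [PGRAPH] client 05 [CCACHE] "
        "subclient 00 [CB] reason 00000002 [PAGE_NOT_PRESENT]"
    )
    fault = parse_fault(sample)
    print(hex(fault["addr"]), fault["client"], fault["reason"])
```

Sorting the parsed faults by `addr` makes a very low address such as the one above stand out from ordinary buffer addresses.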
19:32karolherbst: I hope init_on_alloc doesn't affect device memory in weird ways...
19:33karolherbst: wondering if kasan even triggers on init_on_alloc...
19:35karolherbst: imirkin: the thing is.. I think the kernel aborts suspending
19:36karolherbst: not sure if this is purely informational or if something gives up
19:36imirkin: no, that's something that can def happen
19:36imirkin: you're probably the first person trying suspend-to-ram with a G80
19:37imirkin: and only like the second trying nouveau with a G80 :)
19:37karolherbst: ehh it's a G92 now
19:37karolherbst: the G80 worked fine
19:37imirkin: what if you boot with nouveau.config=cipher=0
19:37karolherbst: welll.. I think.. didn't try on 5.17-rc2
19:37imirkin: that should make it use m2mf for copies too
19:38karolherbst: our error handling is catastrophic :(
19:38karolherbst: I think we need to be better about when to give up.. like if a channel is throwing out errors constantly every second for minutes, maybe we should just let it be and kill it :D
19:42karolherbst: imirkin: sure that this option still exists?
19:42imirkin: i'm sure that an option like it exists
19:42imirkin: but perhaps not spelled precisely like that
19:42imirkin: maybe cypher?
19:42imirkin: or something else?
19:43imirkin: or crypt maybe?
19:43karolherbst: ahh maybe
19:43imirkin: yeah. crypt sounds more like it actually
19:43imirkin: dunno. should be able to rtfs to sort it out
19:43imirkin: there's also a wiki page that covers it
19:43karolherbst: at least it rings a bell
19:44imirkin: oh, but it hasn't been updated for the new terminology
19:46karolherbst: then it should be cipher, as the engine is called cipher
19:47karolherbst: but booting with it didn't change anything
19:47imirkin: ok. that's what i thought it got updated to
19:47imirkin: on boot iirc it says what engine it uses for copies
19:47imirkin: does it still say that it uses cipher for copies?
19:47imirkin: hopefully not :)
19:47karolherbst: "[ 16.146750] nouveau 0000:01:00.0: DRM: MM: using M2MF for buffer copies"
19:47imirkin: ok. so it worked.
19:48karolherbst: yeah... okay, but suspend taking like a minute is wrong regardless
19:48karolherbst: but kasan doesn't print anything, so most likely a valid bug somewhere
19:52karolherbst: imirkin: but it seems to be quite the old regression :/
19:53karolherbst: "Downgrading to 4.19 (4.19+105+deb10u11) works. I had similar problems with 5.8." :O
19:53imirkin: well, the init_on_* thing got introduced in 5.2
19:53imirkin: (or thereabouts)
19:53imirkin: which is what "caused" the problem
19:53imirkin: (obv not, but made it apparent)
19:53karolherbst: so might be broken since forever
19:53imirkin: which is why i was more keen on understanding how the new feature works precisely
19:54imirkin: to understand what things it can affect
19:54imirkin: to hopefully reduce the scope from "everything"
19:54karolherbst: ohh wait...
19:54karolherbst: some user was smart
19:54karolherbst: apparently it stalls inside nv50_disp_atomic_commit_tail
19:54imirkin: there's a user who wrote a book-length report on his findings
19:54imirkin: tbh i didn't have the time to go through it all
19:55imirkin: but it seemed like he made progress in investigating stuff
19:55imirkin: if not actual progress in figuring out what the issue is
19:56karolherbst: thing is.. I don't get the eviction failed message
19:59karolherbst: but it feels like that memory is trashed
20:00karolherbst: "TRAP_MP_EXEC - TP 0 MP 0: 00000010 [INVALID_OPCODE] at 07ff80 warp 0, opcode fffeffff ffffffff" :)
20:01imirkin: certainly doesn't look _great_
20:01imirkin: it's consistent with his findings about ttm stuff somewhere
20:01imirkin: feels like we're not restoring stuff correctly?
20:01karolherbst: I think we do something terrible at suspend
20:01karolherbst: why does it take a minute for starters?
20:02karolherbst: if it takes like 5 seconds.. okay, not great, but not terrible
20:02karolherbst: but it shouldn't take longer than that
20:22karolherbst: and something is up with the linux serial console driver :(
20:22karolherbst: at random points it starts printing garbage
22:15karolherbst: imirkin: mhhhh.. I suspect it's something channel related
22:16karolherbst: sooo.. suspending without anything running besides the console seems fine
22:16karolherbst: maybe we mess up halting/stopping channels or they continue too soon or something stupid like that?
22:16karolherbst: but that still doesn't explain why suspending takes so long if something is running
22:23imirkin: sounds like a whole barrel of fun you just poked
22:23imirkin: esp since none of this is reproducing the original issue people were having :)
22:24karolherbst: esp since my serial console really messes up... well... on the software side :(
22:27karolherbst: ohhhh... the heck is it doing
22:29karolherbst: mhhh.. maybe some pins are wrongly connected after all?
22:31karolherbst: no.. seems alright
22:32imirkin: could be flow control setting?
22:32imirkin: 8N1 is the common thing
22:32karolherbst: console=ttyS0,115200n8 on the host
22:32imirkin: hardware vs software flow control? dunno
22:32karolherbst: both disabled
22:32imirkin: maybe the cable can't handle it?
22:32karolherbst: it usually works
22:32karolherbst: I just get random crap
22:32imirkin: 115.2 is high
22:32imirkin: try 56k?
22:33imirkin: or even, *gasp*, 9600
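For comparing the host setting against minicom and getty, it can help to spell out what a string like `console=ttyS0,115200n8` encodes. A minimal sketch, assuming the serial option layout from the kernel's serial console documentation (speed, then parity `n`/`o`/`e`, then data bits, then an optional `r` for RTS flow control, with 9600n8 as the default when options are omitted); `parse_console` and `SerialOptions` are names invented for this example.

```python
import re
from typing import NamedTuple


class SerialOptions(NamedTuple):
    device: str
    baud: int
    parity: str      # 'n' (none), 'o' (odd) or 'e' (even)
    bits: int        # number of data bits
    rts_flow: bool   # True if the trailing 'r' (RTS flow control) is present


# Assumed layout: speed, optional parity, optional data bits, optional 'r'.
_OPTS_RE = re.compile(r"(?P<baud>\d+)(?P<parity>[noe])?(?P<bits>\d)?(?P<flow>r)?")


def parse_console(arg):
    """Parse a 'console=ttyS0,115200n8' style argument (hypothetical helper)."""
    prefix = "console="
    if not arg.startswith(prefix):
        raise ValueError("not a console= argument: %r" % arg)
    device, _, opts = arg[len(prefix):].partition(",")
    # Defaults assumed from the kernel documentation: 9600n8, no flow control.
    baud, parity, bits, flow = 9600, "n", 8, False
    if opts:
        m = _OPTS_RE.fullmatch(opts)
        if m is None:
            raise ValueError("unrecognised serial options: %r" % opts)
        baud = int(m["baud"])
        parity = m["parity"] or parity
        bits = int(m["bits"]) if m["bits"] else bits
        flow = m["flow"] is not None
    return SerialOptions(device, baud, parity, bits, flow)


if __name__ == "__main__":
    print(parse_console("console=ttyS0,115200n8"))
```

With the setting quoted above this gives 115200 baud, no parity, 8 data bits and no RTS flow control, which is the 8N1-style configuration the client side would need to match.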
22:33karolherbst: well.. it does work perfectly fine up to a point where my client interprets random crap in a way the client breaks
22:33karolherbst: but as you see I get dmesg output
22:33karolherbst: and also sysrq
22:33karolherbst: and random stuff
22:33imirkin: right. it works perfectly until it doesn't
22:34karolherbst: in minicom it looks like this e.g.: https://gist.github.com/karolherbst/6efb57f95248188b94d88e81f43bf86e
22:34karolherbst: and I used the same settings on the old mobo
22:34imirkin: welcome to the world of hardware
22:34imirkin: hardware sucks.
22:34karolherbst: anyway, this one has a proper plug so I don't even have to rewire
22:35karolherbst: and if the pins were mirrored I shouldn't even get anything
22:38karolherbst: I am trying with flow control enabled
22:38karolherbst: maybe that works better
22:41imirkin: i've always done it with hardware flow control
22:41imirkin: coz i use the fancy new-fangled 16650 UARTs :)
22:41karolherbst: heh.. now it blocked boot :D guess I can't hit random keys
22:43karolherbst: something is up though as I can't send anything ...
22:44ccr: I remember running into odd case with some HP network switches, if they were already powered on, the serial interface would not sometimes work. or at least it didn't work with linux PC box + minicom. sometimes it started working when banging random keys/data, sometimes it required booting the switch while the serial cable was connected.
22:44karolherbst: ccr: yeah.... it's weird
22:45karolherbst: I think my laptop is sending random crap... or the usb thingy
22:46ccr: never know about those, somehow I have a gut-feeling that the modern cheap usb-to-serial implementations are probably worse than "real" integrated serial chips
22:47ccr: but perhaps there are good ones and bad ones, shrug
22:47karolherbst: the botherboard has a real one on it
22:47ccr: botherboard :D
22:47karolherbst: but to communicate with it I have a RS232 usb thingy
22:47karolherbst: I receive everything correctly as it seems, but sending stuff doesn't really work
22:49ccr: I have one of those usb-serial converters as well, due to laptop not having a rs232 port. I can't remember which chip it has but it is supported by kernel.
22:52karolherbst: mhhh maybe getty is causing issues here
22:52imirkin: make sure the getty config matches
22:52imirkin: iirc the default for getty used to be like 9600 or 38400
22:53karolherbst: I do see the proper login prompt though
22:53imirkin: perhaps it's not expecting you to send with some setting? dunno
22:53karolherbst: no idea
22:54karolherbst: now ssh doesn't start
22:57karolherbst: nope.. it was nm
23:01karolherbst: or maybe it's systemd writing to the console...
23:15karolherbst: yeah... I think it's getty