11:18 cosurgi: imirkin: another crash. And again nothing of value in Xorg.17.log :(
11:18 cosurgi: imirkin: but this time I got some message in the text VT:
11:18 cosurgi: (ROX-Session:9700): GLib-CRITICAL **: Source ID 112 was not found when attempting to remove it
11:18 cosurgi: -2
11:18 cosurgi: in virtual void libgtkui::X11InputMethodContextImplGtk::SetSurroundingText(const string16&, const gfx::Range&)-2-2-2mplemented reached
11:18 cosurgi: file at /var/log/Xorg.17.log for additional information.(EE) (EE) Server terminated with error (1). Closing log file.o check the log
11:18 cosurgi: xinit: connection to X server lost
11:18 cosurgi:
11:20 cosurgi: no, wait! There is something in Xorg.17.log ! I looked in the wrong file Xorg.17.log.old earlier. But despite dbgsym package there are no line numbers in the stacktrace, just some function names :(
11:20 cosurgi: imirkin: that's the Xorg.17.log : https://paste.ubuntu.com/p/jjHbBf6S4s/
11:20 cosurgi: imirkin: that's dmesg: https://paste.ubuntu.com/p/kJtTnzPDnw/
11:22 cosurgi: damn. Why xserver did not recognize dbgsym package :/
11:22 cosurgi: Xorg, I mean.
11:23 cosurgi: I will reinstall the entire recompiled package, maybe that will help.
11:23 cosurgi: The stacktrace is there, but not informative enough
11:39 cosurgi: whoops
11:39 cosurgi: after reinstalling xserver-common*.deb xserver-xorg-core-dbgsym*.deb all my xservers died when I switched to them :(
11:42 cosurgi: that was a sad experience. But maybe now the stacktrace will work?
11:44 cosurgi: good thing that critical calculations I am running inside screen :)
11:46 cosurgi: oh.
11:46 cosurgi: One of the old xservers is still running. I bet it will crash when I switch to it.
11:54 cosurgi: yep, it did.
12:06 cosurgi: wtf. I can now run only one xserver at a time?
12:45 cosurgi: ok. I found out the real reason.
12:46 cosurgi: probably the stacktrace will be fixed now too.
12:47 cosurgi: Turns out that /usr/lib/xorg/Xorg.wrap is what I need, because it takes care of permissions for users who startx from VT. And it is present in xserver-xorg-legacy*.deb
12:48 cosurgi: and so it turns out that I also didn't have xserver-xorg-legacy-dbgsym_1.19.2-1+deb9u5_amd64.deb installed
12:48 cosurgi: And that must be the reason why there was no stacktrace
12:48 cosurgi: So I hope it's good now.
12:49 cosurgi: But I got angry a bit. I wish I could buy amdgpu now ;> But for now I will keep trying to give you useful debug data & stacktraces ;)
12:53 imirkin: yeah, all this stuff is pretty annoying
12:56 cosurgi: imirkin: thanks for the psychological support, it means a lot to me :/
12:56 cosurgi: :)
12:56 imirkin: when you update a binary while it's running, that binary can get unhappy
12:56 imirkin: (or its libraries)
12:57 imirkin: i never precisely tracked down why that might be -- seems like it should be fine -- but experimentally it does happen esp with Xorg
12:57 cosurgi: TBH that was a bit surprising. I got so used to the fact that ext4 keeps the deleted files in some cache, so that the running binaries can keep running despite them being deleted.
12:59 cosurgi: looks like for some reason Xorg goes behind ext4 cache
13:00 imirkin: well, not ext4, but rather VFS
13:00 imirkin: and it doesn't really have a way of circumventing it
13:00 imirkin: which means it's either actively tracking the inodes
13:00 imirkin: or ... something
13:01 imirkin: i'm sure there's a simple explanation
13:01 imirkin: i just don't know what it is
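The VFS behaviour discussed above can be shown with a small userspace sketch. It assumes a POSIX system and that the upgrade replaces the file by writing a new one and calling rename() over the old path; the file name `binary` and the function name are made up for illustration. The already-open handle keeps reading the old inode, while anything opened by path afterwards gets the new file. Whether a late open-by-path (for example a module loaded after the upgrade) is what actually upsets Xorg is not established in this log; it is only one possible explanation.

```python
import os
import tempfile


def read_after_replace(directory):
    """Open a file, replace it on disk via rename(), then read the old handle."""
    path = os.path.join(directory, "binary")  # hypothetical file name
    with open(path, "w") as f:
        f.write("old contents")

    # Stands in for a process that opened the file before the upgrade.
    old = open(path)
    old_inode = os.fstat(old.fileno()).st_ino

    # Write the new version next to the old one and rename it into place.
    tmp = path + ".new"
    with open(tmp, "w") as f:
        f.write("new contents")
    os.replace(tmp, path)

    new_inode = os.stat(path).st_ino
    old_data = old.read()  # still served from the old, now unlinked inode
    old.close()

    with open(path) as f:  # a fresh open by path sees the new inode
        new_data = f.read()
    return old_data, new_data, old_inode != new_inode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(read_after_replace(d))
```

The old inode is only freed once the last open handle or mapping goes away, which is a property of the VFS rather than of ext4 specifically.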
13:19 imirkin: skeggsb: did you mean to leave this in? type = 0; /*XXX: need to confirm stuff works with depth enabled... */
13:22 imirkin: skeggsb: btw, did you mean to flip DP_CONDITION 7 from always-true to always-false?
13:23 imirkin: or is it the one where the bytes that follow disable execution?
13:23 skeggsb: imirkin: yes, i meant to leave that in.. we've always overridden the value, just wanted to add a note as to why
13:23 imirkin: skeggsb: really? didn't look like that in the old code
13:23 skeggsb: - u32 type = depth << 24;
13:23 skeggsb: -
13:23 skeggsb: - type = 0x00000001; /* PAGE_ALL */
13:24 skeggsb: ignored :P
13:24 imirkin: lol
13:24 imirkin: well-hidden
13:24 imirkin: i totally glazed over that bit, even knowing that i should be looking for it
13:24 skeggsb: that was deliberate, but not very obvious that it wasn't a mistake
13:25 skeggsb: RM doesn't seem to make use of it (or didn't when i looked last), though nvidia-uvm does
13:25 skeggsb: and yeah, for the scripts i've seen, GENERIC_CONDITION will do the same thing it did before by skipping opcodes for unknown conditions
13:25 skeggsb: false is correct though from what nvidia told me
13:26 imirkin: ok
16:54 cosurgi: usually Xorg was crashing when I left it while gimp was rotating a huge 16000x4000 image a little, to make the horizon horizontal.
16:54 cosurgi: And it was crashing when I was switching back, after this rotation was finished.
16:55 cosurgi: Not the first time though.
16:55 cosurgi: I had to do this several times.
17:42 imirkin_: skeggsb: do you run afoul of stuff like this in the kernel? https://cgit.freedesktop.org/mesa/mesa/commit/?id=129a9f4937b8f2adb4d37999677d748d816d611c
17:58 Lyude: imirkin_: the other day you showed me a register that could be used to check the firmware post status of a GPU, 0x2240c, do you have any idea if we could read that value just using pci_read_config_byte() ?
17:58 Lyude: in the kernel I mean
17:59 Lyude: I'd like to use it for implementing an early quirk for this laptop with the GPU power cycling issues on reboot
18:16 Lyude: hm, I think I see what I'd need to do now: i'd actually need to map the io space for the mmio registers temporarily, then read the register value from there with ioread
18:20 cosurgi: is there a danger that you will read an old value? i.e. one that already changed?
18:20 Lyude: was that aimed at me
18:20 Lyude: ?
18:20 cosurgi: yes
18:20 cosurgi: or it does not apply?
18:21 cosurgi: Lyude: maybe my question does not make sense though :)
18:21 Lyude: cosurgi: don't think it applies, this is being done from the PCI quirks which happen waaay before nouveau is loaded
18:21 cosurgi: ok
18:22 Lyude: (we have to do it that early because the GPU being left on on this laptop causes other problems before nouveau gets loaded)
18:32 karolherbst: Lyude: the PCI config is at 0x88000
18:32 karolherbst: and I doubt we can read any other value through the pci config
18:33 Lyude: karolherbst: yeah, that's what I figured out. Read a little higher up, I did figure out a way I can get the mmio space mapped so I can access 0x2240c
18:34 Lyude: (mellanox seems to also do something similar for one of its early boot quirks, so it should be OK to do)
18:34 karolherbst: Lyude: do you know a way how to fully reset the pci bus/device/whatever inside the kernel?
18:34 Lyude: yes
18:35 Lyude: echo 1 > /sys/class/drm/card$N/device/reset
18:35 Lyude: if you mean from a kernel driver, it's pci_reset_function()
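For the userspace route, a minimal sketch of writing to the sysfs `reset` attribute is below. The function names and the `sysfs_root` parameter are hypothetical and exist only so the code can be pointed at a scratch directory; on a real system this needs root, the attribute is only present when the kernel knows a reset method for the device, and resetting a GPU that is in use is likely to upset whatever is driving it.

```python
import os


def reset_path(card_index, sysfs_root="/sys"):
    """Build the path of the PCI reset attribute for a DRM card."""
    return os.path.join(
        sysfs_root, "class", "drm", f"card{card_index}", "device", "reset"
    )


def trigger_reset(card_index, sysfs_root="/sys"):
    """Write "1" to the reset attribute, like `echo 1 > .../reset`."""
    path = reset_path(card_index, sysfs_root)
    with open(path, "w") as f:
        f.write("1")
    return path


if __name__ == "__main__":
    # Only prints the path; call trigger_reset(0) as root to actually reset.
    print(reset_path(0))
```

The in-kernel equivalent mentioned in the log, pci_reset_function(), is not shown here because it cannot be exercised outside a driver.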
18:35 karolherbst: mhh, okay, because my absolute assumption on what's happening in regards to runpm as well is that the kernel more or less uses a cached or stale pci config inside the kernel and misconfigures $something
18:36 Lyude: karolherbst: yeah I'm suspicious of that now as well
18:36 karolherbst: mhh, pci_reset_function sounds interesting
18:36 Lyude: karolherbst: if you have any machines with the runpm bug you should definitely give that a shot
18:36 karolherbst: I am wondering if sticking a pci_reset_function somewhere before suspending/shortly after resuming might just fix stuff
18:36 Lyude: it manages to fully reset the gpu on this p50 and actually get it into a coherent state
18:36 karolherbst: ahh, nice
18:36 karolherbst: my main laptop has that runpm bug :)
18:36 karolherbst: that's why I have those scripts
18:37 Lyude: karolherbst: mind giving the reset stuff a shot today?
18:37 karolherbst: today is a bit full of meetings, but maybe I find some time after :)
18:37 Lyude: alright
18:37 karolherbst: mhhh
18:37 karolherbst: "Some devices allow an individual function to be reset without affecting other functions in the same device. The PCI device must be responsive to PCI config space in order to use this function."
18:37 Lyude: i will honestly throw a party with free booze if that fixes our runpm issues
18:37 karolherbst: so that's not gonna fly after resuming :/
18:37 karolherbst: or
18:38 karolherbst: it does some more magic
18:38 Lyude: karolherbst: we can get to the pci config space at the point we try setting up secboot though can't we?
18:38 karolherbst: well, the issue is rather that after resuming we only read out garbage :)
18:38 karolherbst: or nothing at all
18:39 karolherbst: but I suspect we have to do some reset before suspending actually
18:39 karolherbst: Lyude: any suggestion on where to put the __pci_reset_function?
18:39 karolherbst: ...
18:39 karolherbst: pci_reset_function
18:39 Lyude: karolherbst: probably immediately after we enable the PCI device
18:40 Lyude: and before we actually try to touch any device state with nvkm
18:40 karolherbst: that's way too late for the runpm issue
18:40 karolherbst: not even the PCI subsystem is able to do anything
18:41 Lyude: I wonder if we could reset the card immediately after suspending it, but before finally putting the device in D3
18:41 karolherbst: yeah, something like that
18:41 karolherbst: there is no state to preserve anyway
18:41 Lyude: we might be able to reset it earlier even, I'm not entirely sure
18:42 karolherbst: maybe we can do that in our pm_suspend method
18:46 Lyude: karolherbst: btw, I'm also fairly sure that the pci reset stuff might be calling acpi methods to do things
18:46 Lyude: but i'm not 100% sure, just noticed that when it's called on this P50 the first thing it says is that the power state got changed to D0 by ACPI
18:46 imirkin_: Lyude: i think you figured it out
18:46 imirkin_: but basically if you want to access MMIO, you get to play with BAR's
18:46 imirkin_: i recommend using the helpers
18:47 imirkin_: although doing it by hand is Lots Of Fun (tm)
18:47 Lyude: imirkin_: where are those?
18:47 imirkin_: i just mean like mapping mmio regions
18:47 Lyude: no i mean the helpers
18:47 imirkin_: i dunno. how does nouveau map them
18:47 imirkin_: do that :)
18:47 Lyude: imirkin_: ah, yeah that's what I'm doing :p
18:48 karolherbst: nouveau should use ioremap, no?
18:48 Lyude: yep
18:48 Lyude: ioremap(mmio_base, 0x102000)
18:48 karolherbst: yeah, then you skipped most of the fun parts :p
18:48 Lyude: there's some special pci quirk-time magic I need to do before that to get the device to let me map mmio as well
18:48 Lyude: but there's pci helpers for that
18:54 Lyude: karolherbst: I think the one thing I might still need to know though, what value gets returned if you read a part of the mmiospace that isn't mapped? since I'm assuming on boots where the GPU isn't left on, it's likely 0x2240c won't even be accessible
18:55 karolherbst: Lyude: what do you mean?
18:55 karolherbst: you can't read non mapped areas through virtual memory anyway
18:56 Lyude: karolherbst: oh-wait, I might just be being silly. When nvapeek reads 0x2240c and the GPU isn't posted it returns "..."
18:56 Lyude: but I just realized "..." is just 0x0 isn't it
18:56 karolherbst: well, on error we read 0x0
18:56 karolherbst: or something
18:56 Lyude: that's the value I'm wondering about :p
18:56 karolherbst: ... means 0
18:57 Lyude: ahh, cool
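A rough userspace analogue of peeking at a register such as 0x2240c is to mmap the BAR0 resource file from sysfs and read four bytes at the offset. This is only a sketch: the function name is made up, the caller has to supply the path to the device's `resource0` file, root is required, and Python does not promise that the read is issued as a single 32-bit access the way a proper MMIO tool or the kernel's ioread32() would. A result of 0 corresponds to the "..." that the log says nvapeek prints.

```python
import mmap
import os
import struct


def read_reg32(resource_path, offset):
    """Read a little-endian 32-bit value at `offset` from a mappable file."""
    # mmap offsets must be aligned, so map from the aligned base below offset.
    gran = mmap.ALLOCATIONGRANULARITY
    base = offset - (offset % gran)
    delta = offset - base

    fd = os.open(resource_path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, delta + 4, access=mmap.ACCESS_READ, offset=base) as m:
            return struct.unpack_from("<I", m, delta)[0]
    finally:
        os.close(fd)


if __name__ == "__main__":
    import sys
    # Usage: script.py /sys/bus/pci/devices/<address>/resource0
    if len(sys.argv) == 2:
        print(hex(read_reg32(sys.argv[1], 0x2240C)))
```

Because the function takes any mappable file, it can be tried against an ordinary file first before pointing it at real hardware.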
19:07 Lyude: nice, imirkin_: register works perfectly for the quirk :)
19:10 imirkin_: yay
19:10 Lyude: i'm honestly still blown away by the fact I actually found a fix for this...
19:11 karolherbst: Lyude: is pci_reset_function the fix?
19:11 Lyude: karolherbst: for the P50 issue yeah
19:11 karolherbst: mhhhhhhhh
19:11 karolherbst: when are you calling it?
19:12 Lyude: karolherbst: waaaaaaaay before nouveau loads, drivers/pci/quirks.c declared with DECLARE_PCI_FIXUP_CLASS_FINAL()
19:12 karolherbst: I see
19:12 Lyude: karolherbst: yeah, it has to be done that early because otherwise we'll start getting spurious interrupts from it being left on, which causes us to disable the IRQ it's on and in turn also disable the touchpad by accident
19:13 karolherbst: ahh
19:13 karolherbst: I see
19:13 karolherbst: I guess we need a quirk for post resume/boot and pre suspend in the end... hopefully that also fixes the runpm stuff
19:13 karolherbst: would make sense
19:13 Lyude: luckily too I've got the quirk limited to the specific P50 SKUs with the issue, so for everyone else it should make no difference
19:14 Lyude: hooray for subdevice device/vendor
19:14 karolherbst:is wondering why he didn't try pci_reset_function all this time :/
19:14 Lyude: i'm wondering why I didn't know about this either tbh
19:30 Lyude: [ 1.541302] pci 0000:01:00.0: quirk_lenovo_thinkpad_p50_nvgpu_survives_reboot+0x0/0xd0 took 1003737 usecs
19:30 Lyude: slow workaround though
19:30 imirkin_: limit it to just the GP107M or whatever pci ids?
19:31 Lyude: imirkin_: wrong gpu but yeah I already did that, limited it to this specific gpu + lenovo's subsystem vendor/device combo
19:32 imirkin_: surprising that mapping a bar is so slow
19:32 imirkin_: or is it the reset that takes a while
19:32 Lyude: it's the reset yeah
19:32 imirkin_: since it re-reads config space and whatnot
19:33 imirkin_: a reset is a pci-level thing, right?
19:33 imirkin_: i.e. it enables some pci signal
19:33 Lyude: imirkin_: yeah
19:33 imirkin_: which flips the board into reset mode
19:33 imirkin_: and then out
19:33 Lyude: pci_reset_function()
19:33 karolherbst: mhhhh
19:33 karolherbst: that really sounds like something that could fix the runpm issues we've got
19:33 imirkin_: i mean at the electrical level
19:34 karolherbst: just do a pci_reset_function before returning from pm_suspend
19:34 Lyude: imirkin_: honestly all I know is after pci_reset_function() is entered the laptop performs a magic spell and the GPU is healthy again :p
19:34 Lyude: im assuming it's probably at an electrical level
19:34 imirkin_: ;)
19:34 karolherbst: mhh, nouveau_pmops_runtime_suspend
19:34 imirkin_: fair enough
19:35 imirkin_: you seemed low-level-enough-inclined that i figured you might know
19:35 Lyude: imirkin_: i might be able to figure out if I looked a bit more closely, but acpi is a pain :p
19:35 imirkin_: no worries
19:35 imirkin_: and yes, acpi *is* quite painful
19:35 karolherbst: Lyude: wrong, ACPI is _fun_ :p
19:35 Lyude: karolherbst: hehe, it is both
19:35 imirkin_: i _still_ can't read it
19:35 Lyude: at the very least the debugging tools are seriously useful
19:35 karolherbst: at least it gives you those nice debugging features
19:36 imirkin_: although i consider that a feature
19:36 Lyude: imirkin_: i don't think anyone really can, sometimes you're just lucky and the function you need has a legible 4 letter acronym and not some random acronym that could mean literally anything
19:36 karolherbst: yeah, ACPI variable names are weird
19:36 karolherbst: Lyude: just use made-up names in bug reports
19:36 karolherbst: and hopefully your name becomes a thing
19:37 Lyude: hehe
22:03 Lyude: karolherbst / skeggsb : at long last, patch for the P50 GPU power cycling issue posted on linux-pci/nouveau ML
22:03 HdkR:claps
22:04 Lyude: HdkR: next up is seeing if we can use this to help out with the pascal rpm stuff :)
22:04 karolherbst: :P
22:05 karolherbst: skeggsb: do we actually have a way on which commands the GPU/engines are stuck?
22:05 karolherbst: *to figure out
22:06 karolherbst: or at least what the engines are doing while we get the ctxsw timeout?
22:06 HdkR: Lyude: Neat. My Lenovo w540 had one of those Quadros. I can't remember which
22:07 Lyude: HdkR: did rpm ever cause issues on it?
22:07 Lyude: afaik this is the only machine (and only a very specific subset of P50 skus even) i've managed to hit this issue on
22:07 skeggsb: not really, for methods, it doesn't really work like that.. chances are it'll hang in response to a "draw" command, and the million things that happen when you send one. that's completely ignoring that it's pipelined and probably processing stuff for the future
22:07 HdkR: I don't remember. I was also running the blob on it at the time
22:08 Lyude: ahh
22:08 skeggsb: as for what it's doing.. i'm sure nvidia can read the thousand of regs and figure it out, we don't have that much info...
22:08 karolherbst: right
22:08 karolherbst: but we could ask for that info ;)
22:08 skeggsb: good luck with that ;)
22:08 karolherbst: :D
22:08 Lyude: ^
22:08 imirkin_: karolherbst: you could do a kick + wait after every draw
22:08 skeggsb: that would go firmly into "secret hidden internals", and we don't even have docs on the method front-end :P
22:08 karolherbst: well, I have this choice: a. convince nvidia to tell us b. work through 100.000 lines of parsed pushbuffers :)
22:08 Lyude: everything is "secret hidden internals"
22:08 imirkin_: karolherbst: MESA_DEBUG=flush will do the kick, but you can also alter it to do a glFinish() equivalent
22:09 Lyude: including the externals
22:09 karolherbst: ahhhh
22:09 imirkin_: karolherbst: if you can still get it to hang with that, you'll know exactly which draw is responsible
22:09 karolherbst: smaller push buffers might be helpful here :)
22:09 skeggsb: Lyude: in fairness, they're getting better about the front-end stuff
22:09 karolherbst: imirkin_: good idea, thanks!
22:09 Lyude: skeggsb: front-end being?
22:10 karolherbst: Lyude: the methods we send from mesa
22:10 imirkin_: skeggsb: btw, in case you missed it -- looks like gcc9 + struct init is going to cause you pain
22:10 Lyude: ah
22:10 imirkin_: skeggsb: see https://cgit.freedesktop.org/mesa/mesa/commit/?id=129a9f4937b8f2adb4d37999677d748d816d611c
22:10 karolherbst: essentially everything we do on a channel
22:11 skeggsb: yeah, we have headers/info for the display side of that, and for compute.. nothing for the 3d class unfortunately :P
22:11 Lyude: yeah
22:11 karolherbst: imirkin_: heh.. :/
22:11 karolherbst: what if there are no breaks
22:11 Lyude: i'm not really impressed by them releasing stuff that we're probably figuring out anyway and doesn't help us out with the actual problems nouveau has right now
22:11 Lyude: erm, *really big
22:12 Lyude: not actual, obv all problems are problems
22:12 karolherbst: :)
22:12 karolherbst: Lyude: by definition