15:34 karolherbst: imirkin: you can invite dri-logger into the channel
15:34 karolherbst: just tested it
15:34 karolherbst: now it's there again :)
16:10 half-beard: cosurgi, hey bud, I managed to get my GTX1060 working again :)
16:10 half-beard: So I'm going to leave the 8400GS for now.
16:10 half-beard: Next issue. I'm running GTX1060 with nouveau driver. My GPU has a feature called fan stop. How do I make it work with nouveau?
16:12 karolherbst: half-beard: by stopping the fan :p you can manually control the fan, but normally it's not advised to do that, because of heat generation
16:13 karolherbst: "fan stop" just means that below a certain threshold the fan stops
16:13 karolherbst: half-beard: you could open a bug and attach your vbios.rom file and I could take a look and see how we have to implement it
16:13 karolherbst: or where we actually have to change our code
16:13 karolherbst: _but_
16:13 karolherbst: it's a 1060 and it requires signed firmware to control the fan anyway
16:13 karolherbst: so I don't think we can do anything right now
16:14 karolherbst: still waiting on the PMU firmware files
16:14 cosurgi: half-beard: congrats :)
16:16 cosurgi: imirkin: gdb is still waiting for you. I continue my work in another xserver :)
16:16 cosurgi: (I have plenty :-D)
16:17 cosurgi: In fact, I observed that this crash happens mostly in the morning when the screens wake up. So when I finish my work at night I switch VT to the xserver which can crash. And in the morning it crashes, then I switch back to VT with my work, and continue working :)
16:19 cosurgi: fingers crossed it won't crash while VT is on the xserver in which I work :]
16:22 letterrip: hi all - could you point me to directions for using ssh to debug xserver? I used to have it set up so I could log in via my phone over my local wifi connection but seem to have misplaced the directions
16:23 letterrip: i'm pretty sure i got the directions from here for last time i set it up
16:25 cosurgi: I'm going to paste my notes.txt about that.
16:25 cosurgi: # git clone upstream git://anongit.freedesktop.org/nouveau/xf86-video-nouveau
16:25 cosurgi: cd ~/xf86-video-nouveau/nouveau-xf86-video-nouveau
16:25 cosurgi: make clean
16:25 cosurgi: ./autogen.sh CFLAGS='-g -O0'
16:25 cosurgi: make
16:25 imirkin_: cosurgi: dri-logs were down, so i don't have anything you said after i left... feel free to pastebin the conversation
16:26 cosurgi: imirkin_: Oh, hello back!
16:26 cosurgi: okay. hold on.
16:27 imirkin_: skeggsb: sigh ... another long weekend, another failed opportunity to test the 1024 lut thing.
16:27 cosurgi: imirkin_: https://paste.ubuntu.com/p/8NHxWqHtqy/
16:27 cosurgi: imirkin_: we can continue gdb session.
16:28 cosurgi: imirkin_: this is gdb output so far: https://paste.ubuntu.com/p/BC8QCHxV5C/ this is Xorg.log: https://paste.ubuntu.com/p/hTS5jzSFgZ/
16:29 imirkin_: yeah, reading...
16:29 imirkin_: ok, we're in business...
16:29 cosurgi: letterrip: ps auxw | grep Xorg to find the PID, then, `gdb -p PID`, then `handle SIGUSR1 nostop` and `handle SIGPIPE nostop` and `c` and gdb session is working.
16:29 imirkin_: bo->map = 0x7f443a226000, size = 0x600000
16:29 letterrip: cosurgi, thanks
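The attach steps cosurgi lists can be collapsed into a single gdb invocation. This is only a sketch: the helper name `build_gdb_argv` and the PID 1234 are made up for illustration, and it assumes gdb's `-p` and `-ex` options, with `continue` as the long form of `c`.

```python
import shlex


def build_gdb_argv(pid):
    """Build a gdb command line that attaches to `pid` and resumes it.

    SIGUSR1 and SIGPIPE are set to "nostop" so gdb does not halt the
    X server every time one of those signals arrives.
    """
    return [
        "gdb", "-p", str(pid),
        "-ex", "handle SIGUSR1 nostop",
        "-ex", "handle SIGPIPE nostop",
        "-ex", "continue",
    ]


# 1234 is a placeholder; use the PID found with `ps auxw | grep Xorg`.
print(shlex.join(build_gdb_argv(1234)))
```

Run the printed command from an ssh session rather than from a terminal inside the X server being debugged, since that server is stopped whenever gdb halts it.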
16:30 imirkin_: er, size = 0x6000000
16:30 imirkin_: src=0x7f443a523bf0
16:30 cosurgi: ok. what should I do with src=0x7f443a523bf0 ?
16:31 imirkin_: nothing. just note to self.
16:31 cosurgi: OK :)
16:31 imirkin_: my beautiful theory ... full of fail.
16:31 imirkin_: so that pointer *is* in the range it's supposed to be in
16:31 imirkin_: i.e. it's between map and map+size
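The range check imirkin_ describes can be reproduced with the three values quoted above. A minimal sketch, assuming the corrected size of 0x6000000; the function name `in_mapping` is made up for illustration.

```python
def in_mapping(ptr, base, size):
    """Return True if ptr lies inside the half-open range [base, base + size)."""
    return base <= ptr < base + size


base = 0x7F443A226000  # bo->map
size = 0x6000000       # corrected size from the log
src = 0x7F443A523BF0   # source pointer at the time of the SIGBUS

print(in_mapping(src, base, size))
print(hex(src - base))  # offset of src into the mapping
```

The pointer being inside the mapping is what rules out the simple "copy ran past the end of the buffer" theory.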
16:33 imirkin_: skeggsb: any great ideas on how that can happen? i.e. we map a bo, and the pagefault handler is returning a SIGBUS?
16:33 imirkin_: but nothing in the kernel logs
16:33 cosurgi: and it usually happens under high CPU load when screens wake up from 'xset dpms force off'
16:33 imirkin_: but this is an entirely desktop setup, right?
16:34 imirkin_: i.e. there's no runpm turning off the gpu somehow
16:34 cosurgi: what do you mean?
16:34 cosurgi: oh. How would I check this?
16:34 cosurgi: never heard about runpm
16:34 imirkin_: is this a desktop or a laptop?
16:34 cosurgi: desktop
16:34 imirkin_: then no runpm for the gpu :)
16:34 imirkin_: (runpm = runtime power management)
16:34 imirkin_: allows you to power off various devices
16:34 cosurgi: never heard about a laptop which could handle such 3 screens :)
16:34 imirkin_: obviously requires the platform to play along
16:35 cosurgi: is there any cat /proc/... that could tell us for sure if that's not power management problem?
16:35 imirkin_: it's definitely not.
16:36 cosurgi: but sometimes it happens while I resize gvim. Not always when screens are off.
16:36 cosurgi: gvim is refreshing slowly (because of high CPU load) and then it crashes.
16:36 cosurgi: gvim is refreshing slowly (because of high CPU load) and then xserver crashes.
16:37 imirkin_: yeah, i mean it's some kind of condition
16:37 imirkin_: something is getting exhausted, but i'm not sure what.
16:37 cosurgi: one time it happened when I was scrolling pastebin in chromium.
16:37 imirkin_: and normally it would scream bloody murder in that case
16:37 imirkin_: and yet you don't have any such screams in dmesg
16:37 cosurgi: oh. I didn't paste my last dmesg!
16:38 cosurgi: dmesg: https://paste.ubuntu.com/p/csMcpz2XVd/
16:38 cosurgi: From Xorg.log ( https://paste.ubuntu.com/p/hTS5jzSFgZ/ ) we should look around [3791615.920] timestamp.
16:39 cosurgi: but yeah. There's nothing there, completely.
16:41 cosurgi: maybe we need to poke around gdb a bit more? Session is still active.
16:41 imirkin_: yeah, all your errors are about link training failures
16:41 imirkin_: not about memory allocation failures
16:41 cosurgi: what is link training?
16:41 imirkin_: DP is a fancy protocol
16:41 cosurgi: displayport ? ouch.
16:41 imirkin_: search for "link training", i'm sure there's a thing that explains it better than i ever could
16:42 imirkin_: basically there's a step of "make sure the physical media can support the requested symbol rate"
16:42 cosurgi: That rings a bell.
16:43 cosurgi: I wasn't sure which 5 meter DP cables to buy. So one is a light fiber, the two others are copper from different producers. They all work. But the copper ones have problems more often.
16:43 imirkin_: like if both sides can support 5.4Gbps, but they're connected by pringles cans and a wire, then you'll have to use one of the lower rates :)
16:44 imirkin_: (yeah, i know, the pringles can thing works for audio, with the physical vibrations transmitted over the wire, but ... same principle.)
16:44 cosurgi: "some problems" means that rarely the display is wrong. Turning screen on/off always helps.
16:44 cosurgi: Definitely that falls into "link training" kind of problems.
16:44 imirkin_: yeah. DP has an elaborate description of what you're supposed to do under various conditions
16:44 imirkin_: i'm sure we handle like 5% of those
16:45 imirkin_: the majority case is "things work" though, so it's not too bad.
16:45 imirkin_: and the DP spec itself is closed, so... we can only guess.
16:45 imirkin_: and it's one of those things where ignorance is bliss :)
16:46 cosurgi: maybe we could add some loop which tries to solve the problems, and screams in the logs instead of crashing?
16:46 imirkin_: it's starting to sound like a kernel-side problem
16:46 cosurgi: This "link training" thing sounds like it should work after a few tries.
16:47 imirkin_: patches welcome.
16:47 cosurgi: uh-huh :)
16:47 cosurgi: So you are sure it's not inside nouveau?
16:47 imirkin_: what is?
16:47 cosurgi: the problem which leads to my crash.
16:47 imirkin_: the current troubles? inside the nouveau/ttm kernel modules, i suspect.
16:48 imirkin_: https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/nvkm/engine/disp/dp.c#L344
16:48 imirkin_: that's the current training code.
16:49 cosurgi: ok. Why don't we see messages like OUTP_ERR(&dp->outp, "training failed"); ?
16:49 imirkin_: [3303335.472839] nouveau 0000:04:00.0: disp: outp 03:0006:0f44: training failed
16:49 imirkin_: you mean like that?
16:49 cosurgi: ahh! It's there!
16:49 cosurgi: sorry
16:50 cosurgi: last one is at 3764283.289062, while crash happened 3788584.630
16:50 cosurgi: what are the units in this timestamp?
16:50 cosurgi: Is that 2 seconds ?
16:51 cosurgi: difference?
16:53 cosurgi: hmm no. That looks like a 6 hour difference: (3764283-3788584)/3600 = -6.75
16:53 cosurgi: Maybe link training failed when I was sleeping, then xserver crashed in the morning.
16:53 cosurgi: It had 6 more hours to retry.
16:53 cosurgi: But screens were turned off.
16:53 imirkin_: seconds since boot
16:53 imirkin_: but not accurate seconds
16:53 imirkin_: internal kernel time counter seconds
16:53 imirkin_: as opposed to ntp-sync'd seconds
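The arithmetic on the two timestamps can be checked directly. A small sketch, treating both values as kernel seconds since boot, which per imirkin_ are only approximately wall-clock seconds; the function name `hours_between` is made up for illustration.

```python
def hours_between(earlier, later):
    """Return the gap between two seconds-since-boot timestamps, in hours."""
    return (later - earlier) / 3600.0


last_training_failed = 3764283.289062  # last "training failed" line in dmesg
crash = 3788584.630                    # time of the crash

gap = hours_between(last_training_failed, crash)
print(f"{gap:.2f} hours between the last training failure and the crash")
```

So the last training failure was logged hours before this crash, not seconds.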
16:55 cosurgi: I turn off the screens when I go to sleep. Because one of the screens refuses to go into proper dpms sleep. It wakes up every 3 minutes. Says there is no signal then turns off again.
16:55 imirkin_: annoying.
16:55 cosurgi: Maybe that's not a problem with the monitor, but the kernel is checking connection?
16:55 imirkin_: is it one of the fiber ones?
16:55 cosurgi: No. The copper one.
16:55 imirkin_: double-odd.
16:55 cosurgi: The light fiber one works the best.
16:56 imirkin_: and presumably the monitors themselves are identical?
16:56 cosurgi: yes. Completely
16:56 cosurgi: Only cables are different.
16:57 cosurgi: It seems that a working workaround for me is to switch to text VT when I go away.
16:57 cosurgi: Worse if it happens during work.
16:58 cosurgi: however. I could confirm our working theory by checking dmesg for "training failed" before each crash.
16:59 imirkin_: i doubt it matters
16:59 imirkin_: your message was like 20000 seconds away
16:59 cosurgi: oh?
16:59 imirkin_: or 2000?
16:59 imirkin_: some large number.
16:59 cosurgi: 6 hours.
17:00 cosurgi: It could have crashed during the night.
17:00 cosurgi: While I was sleeping.
17:00 cosurgi: Ah no!
17:00 cosurgi: This crash which we investigate now is *live*
17:00 cosurgi: it happened while we were talking
17:00 cosurgi: and there were no "training failed" messages about that time.
17:02 cosurgi: So we have two options: (1) link training did not produce any message, (2) it's not a link training problem.
17:02 cosurgi: ?
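Checking dmesg for "training failed" around a crash time, as cosurgi proposes, could look roughly like this. It is a sketch only: it assumes dmesg lines start with a `[seconds.micros]` prefix, the 60 second window is an arbitrary choice, and the second sample line is invented for the example.

```python
import re

# Matches "[3303335.472839] message"; the leading timestamp may be space-padded.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def training_failures_near(log_text, when, window=60.0):
    """Return (timestamp, message) pairs for "training failed" lines
    logged within `window` seconds of `when`."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        if "training failed" in msg and abs(ts - when) <= window:
            hits.append((ts, msg))
    return hits


sample = """\
[3303335.472839] nouveau 0000:04:00.0: disp: outp 03:0006:0f44: training failed
[3303340.000000] example: unrelated line (made up for this sketch)
"""

print(training_failures_near(sample, 3303335.0))    # near the logged failure
print(training_failures_near(sample, 3788584.630))  # near the crash time
```

An empty result near the crash time would fit option (2), though it cannot rule out option (1), a training failure that logged nothing.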
17:03 cosurgi: imirkin_: recall that gdb session is still active :)
17:03 imirkin_: yeah. i doubt you can get too much more out of that one
17:06 cosurgi: ok. I have now startx running with -O0 -g
17:06 cosurgi: So we will not have 'optimized out' messages, right?
17:06 cosurgi: I did this:
17:06 cosurgi: ./autogen.sh CFLAGS='-g -O0'
17:07 cosurgi: Also I wanted to make sure: this is the repository I should follow: git://anongit.freedesktop.org/nouveau/xf86-video-nouveau ?
17:07 cosurgi: and this is the commit which I have running: ec2b45d Bump version to 1.0.16
17:08 cosurgi: I will keep this gdb session for a while, maybe you will think of something.
17:08 cosurgi: I will close it when we get another crash :)
17:09 imirkin_: right.
17:11 cosurgi: ok. Thanks a lot :)
17:31 letterrip: hi all - after a freeze - I do SysRq+r; then try and get a terminal via ctrl+alt+f1
17:31 letterrip: but no terminal appears
17:32 letterrip: the SysRq+b definitely works
17:33 letterrip: (by freeze I mean that no input seems available; mouse pointer still moves; but nothing else functions)
17:33 letterrip: any ideas/suggestions?
17:34 imirkin_: nouveau's handling of hangs is pretty piss-poor
18:49 karolherbst: the mesa bits, yes
18:49 imirkin_: kernel bits too
18:49 karolherbst: no, the kernel is fine in most cases
18:49 imirkin_: i get hangs, i have to ssh in and kill the process, and if i'm too late, the whole display server
18:49 karolherbst: most freezes I've encountered were userspace only issues
18:49 karolherbst: imirkin_: that's userspace
18:49 imirkin_: the X server gets hung.
18:49 imirkin_: by the kernel.
18:49 karolherbst: well, the channel is dead
18:50 karolherbst: but yeah
18:50 karolherbst: the kernel _should_ notify userspace about it
18:50 karolherbst: and I've written patches
18:50 karolherbst: but nobody reviewed them
18:51 karolherbst: but I don't like the designs I've chosen so far, but still
18:51 karolherbst: it's easily fixable
18:51 karolherbst: even asked skeggsb for help to get around all that nvif layering to implement a proper ioctl
18:55 karolherbst: anyway.. the kernel has everything in place, it just needs to be exposed to userspace
18:55 karolherbst: but every time I look at that nvif code I just want to run away