15:34karolherbst: imirkin: you can invite dri-logger into the channel
15:34karolherbst: just tested it
15:34karolherbst: now it's there again :)
16:10half-beard: cosurgi, hey bud, I managed to get my GTX1060 working again :)
16:10half-beard: So I'm going to leave the 8400GS for now.
16:10half-beard: Next issue. I'm running GTX1060 with nouveau driver. My GPU has a feature called fan stop. How do I make it work with nouveau?
16:12karolherbst: half-beard: by stopping the fan :p you can manually control the fan, but normally it's not advised to do that, because of heat generation
16:13karolherbst: "fan stop" just means that below a certain threshold the fan stops
16:13karolherbst: half-beard: you could open a bug and attach your vbios.rom file and I could take a look and see how we have to implement it
16:13karolherbst: or where we actually have to change our code
16:13karolherbst: it's a 1060 and it requires signed firmware to control the fan anyway
16:13karolherbst: so I don't think we can do anything right now
16:14karolherbst: still waiting on the PMU firmware files
16:14cosurgi: half-beard: congrats :)
16:16cosurgi: imirkin: gdb is still waiting for you. I continue my work in another xserver :)
16:16cosurgi: (I have plenty :-D)
16:17cosurgi: In fact, I observed that this crash happens mostly in the morning when the screens wake up. So when I finish my work at night I switch VT to the xserver which can crash. And in the morning it crashes, then I switch back to VT with my work, and continue working :)
16:19cosurgi: fingers crossed it won't crash while VT is on the xserver in which I work :]
16:22letterrip: hi all - could you point me to directions for using ssh to debug xserver? I used to have it setup so I could login via my phone over my local wifi connection but seem to have misplaced the directions
16:23letterrip: i'm pretty sure i got the directions from here for last time i set it up
16:25cosurgi: I'm going to paste my notes.txt about that.
16:25cosurgi: # git clone upstream git://anongit.freedesktop.org/nouveau/xf86-video-nouveau
16:25cosurgi: cd ~/xf86-video-nouveau/nouveau-xf86-video-nouveau
16:25cosurgi: make clean
16:25cosurgi: ./autogen.sh CFLAGS='-g -O0'
16:25imirkin_: cosurgi: dri-logs were down, so i don't have anything you said after i left... feel free to pastebin the conversation
16:26cosurgi: imirkin_: Oh, hello back!
16:26cosurgi: okay. hold on.
16:27imirkin_: skeggsb: sigh ... another long weekend, another failed opportunity to test the 1024 lut thing.
16:27cosurgi: imirkin_: https://paste.ubuntu.com/p/8NHxWqHtqy/
16:27cosurgi: imirkin_: we can continue gdb session.
16:28cosurgi: imirkin_: this is gdb output so far: https://paste.ubuntu.com/p/BC8QCHxV5C/ this is Xorg.log: https://paste.ubuntu.com/p/hTS5jzSFgZ/
16:29imirkin_: yeah, reading...
16:29imirkin_: ok, we're in business...
16:29cosurgi: letterrip: ps auxw | grep Xorg to find the PID, then, `gdb -p PID`, then `handle SIGUSR1 nostop` and `handle SIGPIPE nostop` and `c` and gdb session is working.
16:29imirkin_: bo->map = 0x7f443a226000, size = 0x600000
16:29letterrip: cosurgi, thanks
16:30imirkin_: er, size = 0x6000000
16:30cosurgi: ok. what should I do with src=0x7f443a523bf0 ?
16:31imirkin_: nothing. just note to self.
16:31cosurgi: OK :)
16:31imirkin_: my beautiful theory ... full of fail.
16:31imirkin_: so that pointer *is* in the range it's supposed to be in
16:31imirkin_: i.e. it's between map and map+size
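Note: the range check imirkin_ describes can be reproduced with the numbers quoted in the log. This is only a minimal sketch; it uses the corrected size 0x6000000, and the helper name in_mapping is made up for illustration.

```python
def in_mapping(ptr: int, map_base: int, size: int) -> bool:
    """Return True if ptr lies inside [map_base, map_base + size)."""
    return map_base <= ptr < map_base + size

bo_map = 0x7f443a226000   # bo->map, as quoted from the gdb session
bo_size = 0x6000000       # size, after the correction in the log
src = 0x7f443a523bf0      # the source pointer cosurgi asked about

inside = in_mapping(src, bo_map, bo_size)
offset = src - bo_map     # how far into the mapping the pointer is
print(inside, hex(offset))
```

The pointer sits well inside the mapping, which is why the out-of-range theory was dropped.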
16:33imirkin_: skeggsb: any great ideas on how that can happen? i.e. we map a bo, and the pagefault handler is returning a SIGBUS?
16:33imirkin_: but nothing in the kernel logs
16:33cosurgi: and it usually happens under high CPU load when screens wake up from 'xset dpms force off'
16:33imirkin_: but this is an entirely desktop setup, right?
16:34imirkin_: i.e. there's no runpm turning off the gpu somehow
16:34cosurgi: what do you mean?
16:34cosurgi: oh. How would I check this?
16:34cosurgi: never heard about runpm
16:34imirkin_: is this a desktop or a laptop?
16:34imirkin_: then no runpm for the gpu :)
16:34imirkin_: (runpm = runtime power management)
16:34imirkin_: allows you to power off various devices
16:34cosurgi: never heard about a laptop which could handle such 3 screens :)
16:34imirkin_: obviously requires the platform to play along
16:35cosurgi: is there any cat /proc/... that could tell us for sure that it's not a power management problem?
16:35imirkin_: it's definitely not.
16:36cosurgi: but sometimes it happens while I resize gvim. Not always when screens are off.
16:36cosurgi: gvim is refreshing slowly (because of high CPU load) and then it crashes.
16:36cosurgi: gvim is refreshing slowly (because of high CPU load) and then xserver crashes.
16:37imirkin_: yeah, i mean it's some kind of condition
16:37imirkin_: something is getting exhausted, but i'm not sure what.
16:37cosurgi: one time it happened when I was scrolling pastebin in chromium.
16:37imirkin_: and normally it would scream bloody murder in that case
16:37imirkin_: and yet you don't have any such screams in dmesg
16:37cosurgi: oh. I didn't paste my last dmesg!
16:38cosurgi: dmesg: https://paste.ubuntu.com/p/csMcpz2XVd/
16:38cosurgi: From Xorg.log ( https://paste.ubuntu.com/p/hTS5jzSFgZ/ ) we should look around [3791615.920] timestamp.
16:39cosurgi: but yeah. There's nothing there, completely.
16:41cosurgi: maybe we need to poke around gdb a bit more? Session is still active.
16:41imirkin_: yeah, all your errors are about link training failures
16:41imirkin_: not about memory allocation failures
16:41cosurgi: what is link training?
16:41imirkin_: DP is a fancy protocol
16:41cosurgi: displayport ? ouch.
16:41imirkin_: search for "link training", i'm sure there's a thing that explains it better than i ever could
16:42imirkin_: basically there's a step of "make sure the physical media can support the requested symbol rate"
16:42cosurgi: That rings a bell.
16:43cosurgi: I wasn't sure which 5 meter DP cables to buy. So one is a light fiber, the two others are copper from different producers. They all work. But the copper ones have problems more often.
16:43imirkin_: like if both sides can support 5.4Gbps, but they're connected by pringles cans and a wire, then you'll have to use one of the lower rates :)
16:44imirkin_: (yeah, i know, the pringles can thing works for audio, with the physical vibrations transmitted over the wire, but ... same principle.)
16:44cosurgi: "some problems" means that rarely the display is wrong. Turning screen on/off always helps.
16:44cosurgi: Definitely that falls into "link training" kind of problems.
16:44imirkin_: yeah. DP has an elaborate description of what you're supposed to do under various conditions
16:44imirkin_: i'm sure we handle like 5% of those
16:45imirkin_: the majority case is "things work" though, so it's not too bad.
16:45imirkin_: and the DP spec itself is closed, so... we can only guess.
16:45imirkin_: and it's one of those things where ignorance is bliss :)
16:46cosurgi: maybe we could add some loop which tries to solve the problems, and screams in the logs instead of crashing?
16:46imirkin_: it's starting to sound like a kernel-side problem
16:46cosurgi: This "link training" thing sounds like it should work after a few tries.
16:47imirkin_: patches welcome.
16:47cosurgi: uh-huh :)
16:47cosurgi: So you are sure it's not inside nouveau?
16:47imirkin_: what is?
16:47cosurgi: the problem which leads to my crash.
16:47imirkin_: the current troubles? inside the nouveau/ttm kernel modules, i suspect.
16:48imirkin_: that's the current training code.
16:49cosurgi: ok. Why don't we see messages like OUTP_ERR(&dp->outp, "training failed"); ?
16:49imirkin_: [3303335.472839] nouveau 0000:04:00.0: disp: outp 03:0006:0f44: training failed
16:49imirkin_: you mean like that?
16:49cosurgi: ahh! It's there!
16:50cosurgi: last one is at 3764283.289062, while crash happened 3788584.630
16:50cosurgi: what are the units in this timestamp?
16:50cosurgi: Is that 2 seconds ?
16:53cosurgi: hmm no. That looks like a 6 hour difference: (3764283-3788584)/3600 = -6.75
16:53cosurgi: Maybe link training failed when I was sleeping, then xserver crashed in the morning.
16:53cosurgi: It had 6 more hours to retry.
16:53cosurgi: But screens were turned off.
16:53imirkin_: seconds since boot
16:53imirkin_: but not accurate seconds
16:53imirkin_: internal kernel time counter seconds
16:53imirkin_: as opposed to ntp-sync'd seconds
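Note: since the dmesg stamps are seconds since boot, the gap between the last "training failed" message and the crash can be worked out directly. A small sketch using the two timestamps quoted in the log (the variable names are illustrative only):

```python
last_training_failed = 3764283.289062  # seconds since boot, from dmesg
crash = 3788584.630                    # seconds since boot, time of the crash

gap_seconds = crash - last_training_failed
gap_hours = gap_seconds / 3600         # 3600 seconds per hour
print(f"{gap_seconds:.0f} s = {gap_hours:.2f} h")
```

This matches the 6.75 hour estimate; as noted in the log, the kernel counter is not NTP-synced, so the figure is approximate.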
16:55cosurgi: I turn off the screens when I go to sleep. Because one of the screens refuses to go into proper dpms sleep. It wakes up every 3 minutes. Says there is no signal then turns off again.
16:55cosurgi: Maybe that's not a problem with the monitor, but the kernel is checking connection?
16:55imirkin_: is it one of the fiber ones?
16:55cosurgi: No. The copper one.
16:55cosurgi: The light fiber one works the best.
16:56imirkin_: and presumably the monitors themselves are identical?
16:56cosurgi: yes. Completely
16:56cosurgi: Only cables are different.
16:57cosurgi: It seems that a working workaround for me is to switch to text VT when I go away.
16:57cosurgi: Worse if it happens during work.
16:58cosurgi: however. I could confirm our working theory by checking dmesg for "training failed" before each crash.
16:59imirkin_: i doubt it matters
16:59imirkin_: your message was like 20000 seconds away
16:59imirkin_: or 2000?
16:59imirkin_: some large number.
16:59cosurgi: 6 hours.
17:00cosurgi: It could have crashed during the night.
17:00cosurgi: While I was sleeping.
17:00cosurgi: Ah no!
17:00cosurgi: This crash which we investigate now is *live*
17:00cosurgi: it happened while we were talking
17:00cosurgi: and there were no "training failed" messages about that time.
17:02cosurgi: So we have two options: (1) link training did not produce any message, (2) it's not a link training problem.
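Note: checking dmesg for "training failed" shortly before each crash could be scripted roughly as below. This is a hedged sketch, not an existing tool: the function name failures_before and the 60 second window are assumptions, and the only sample line is the one quoted earlier in the log.

```python
import re

# dmesg lines look like "[3303335.472839] nouveau ...: training failed"
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

def failures_before(lines, crash_time, window=60.0, needle="training failed"):
    """Return timestamps of lines containing needle that were logged
    at most `window` seconds before crash_time."""
    hits = []
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # skip lines without a "[seconds.micros]" prefix
        stamp, text = float(m.group(1)), m.group(2)
        if needle in text and 0 <= crash_time - stamp <= window:
            hits.append(stamp)
    return hits

sample = [
    "[3303335.472839] nouveau 0000:04:00.0: disp: outp 03:0006:0f44: training failed",
]
print(failures_before(sample, crash_time=3788584.630))
```

An empty result for a given crash would point toward option (2), or at least toward option (1) with nothing logged.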
17:03cosurgi: imirkin_: recall that gdb session is still active :)
17:03imirkin_: yeah. i doubt you can get too much more out of that one
17:06cosurgi: ok. I have now startx running with -O0 -g
17:06cosurgi: So we will not have 'optimized out' messages, right?
17:06cosurgi: I did this:
17:06cosurgi: ./autogen.sh CFLAGS='-g -O0'
17:07cosurgi: Also I wanted to make sure: this is the repository I should follow: git://anongit.freedesktop.org/nouveau/xf86-video-nouveau ?
17:07cosurgi: and this is the commit which I have running: ec2b45d Bump version to 1.0.16
17:08cosurgi: I will keep this gdb session for a while, maybe you will think of something.
17:08cosurgi: I will close it when we get another crash :)
17:11cosurgi: ok. Thanks a lot :)
17:31letterrip: hi all - after a freeze - I do sysrq+r; then try and get a terminal via ctrl+alt+f1
17:31letterrip: but no terminal appears
17:32letterrip: the sysrq + b definitely works
17:33letterrip: (by freeze I mean that no input seems available; mouse pointer still moves; but nothing else functions)
17:33letterrip: any ideas/suggestions?
17:34imirkin_: nouveau's handling of hangs is pretty piss-poor
18:49karolherbst: the mesa bits, yes
18:49imirkin_: kernel bits too
18:49karolherbst: no, the kernel is fine in most cases
18:49imirkin_: i get hangs, i have to ssh in and kill the process, and if i'm too late, the whole display server
18:49karolherbst: most freezes I've encountered were userspace only issues
18:49karolherbst: imirkin_: that's userspace
18:49imirkin_: the X server gets hung.
18:49imirkin_: by the kernel.
18:49karolherbst: well, the channel is dead
18:50karolherbst: but yeah
18:50karolherbst: the kernel _should_ notify userspace about it
18:50karolherbst: and I've written patches
18:50karolherbst: but nobody reviewed them
18:51karolherbst: but I don't like the designs I've chosen so far, but still
18:51karolherbst: it's easily fixable
18:51karolherbst: even asked skeggsb for help to get around all that nvif layering to implement a proper ioctl
18:55karolherbst: anyway.. the kernel has everything in place, it just needs to be exposed to userspace
18:55karolherbst: but every time I look at that nvif code I just want to run away