00:02 RSpliet: I mean, there could be 1001 reasons why the engines crap themselves and either crash or busy-loop infinitely. Missing some status register when checking interrupt reasons sounds plausible, but that bit is just a wild guess. I don't expect there's too much value in looking at the falcons, they're just the grandson visiting granny's house with her requested groceries only to be scarred for life as he discovers she's been dead for three days.
00:02 skeggsb: the fw has nothing to do with it, it's hw that detects that an engine hasn't context switched in an appropriate amount of time and signals that error
00:02 skeggsb: and yes, it could literally be *anything*. it's Host (PFIFO) going "hmm, i told the engine to switch channels, and it still hasn't"
00:03 karolherbst: okay... but then the questions: why would an engine be upset about a vertex buffer inside vram, but system mem is fine?
00:04 skeggsb: it wouldn't be, the issue is something else.. possibly triggered by that change *somehow*, but it's not the reason
00:05 karolherbst: we already guessed that using sys mem might have slowed down things enough to delay something... but this means it's quite hard to track down what it actually is :/
00:08 RSpliet: karolherbst: What other differences are there? Listen to skeggsb over me, he's had his nose in GPUs for 10 years while I'm just thinking about generic concepts, but... :-p For vertex buffers in VRAM you require a buffer upload, presumably in the process a fence is created in the shader launch descriptor... whereas a sysmem buffer just needs to be mapped in? Could the engine be waiting for a fence that's never signalled?
00:09 RSpliet: (errr... not the descriptor, but anywhere in the command stream. I guess)
00:09 karolherbst: well the application(s) seems to use client vbufs as well
00:09 karolherbst: just checked with one
00:10 karolherbst: what's curious is that with faster vram the error usually triggers later in the apitraces I have :/
00:12 RSpliet: On the contrary then, maybe it forgets to wait for a fence. Cards with faster VRAM have a higher probability that the upload went through in time?
00:13 karolherbst: I think for uploads the pcie bus speed is actually more important, but yeah, that changes as well
00:14 karolherbst: or well both are more or less important I guess, but usually pcie is the more common bottleneck here
00:14 karolherbst: huh, wait, that's unusual
00:20 karolherbst: there is always a "mmu: user: getref 0 mapref 0 sparse 0 shift: 17 align: 0 size: 0000000000880000" before the timeout, but I think that's a big coincidence
00:21 karolherbst: point is, I have no idea how to actually debug that
00:25 RSpliet: I don't think I know what that means. Other than that it's saying something about a buffer of 8.5MiB?
00:25 karolherbst: yeah...
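(Side note on the 8.5MiB reading above: a quick arithmetic check of the numbers in that mmu line. The size and shift values are copied from the log; treating "shift" as log2 of the page size is an assumption here, not something the log itself states.)

```python
# Values copied from the quoted mmu log line.
size = 0x0000000000880000
shift = 17

# Assumption: "shift" is log2(page size), so 17 would mean 128 KiB pages.
page_size = 1 << shift
size_mib = size / (1024 * 1024)
pages = size >> shift

print(f"size      = {size} bytes = {size_mib} MiB")
print(f"page size = {page_size // 1024} KiB")
print(f"pages     = {pages}")
```

(Under that assumption the request covers 68 pages of 128 KiB, which is consistent with "a buffer of 8.5MiB" and says nothing about which buffer it is.)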
00:25 karolherbst: anyway, it would be good to know _which_ engine(s) are the issue here
00:26 karolherbst: skeggsb: is there any way to figure that out?
00:32 RSpliet: karolherbst: that message originates from nvkm_vmm_get_locked from the looks of it
00:33 karolherbst: I seriously think that mmu message is super useless to debug that
00:35 HdkR: karolherbst: Random out of the blue question. Did the threading fixes for Nouveau ever get upstreamed? :D
00:35 karolherbst: no
00:35 karolherbst: they aren't even finished
00:35 HdkR: ah
00:35 RSpliet: Idk. It's information, and if this is the exact allocation of the vertex buffer (and not a few allocations earlier) it requests a "shift" value that I hope is still adequate for the GPU you're investigating ;-)
00:36 karolherbst: RSpliet: same thing happens with vertex buffers inside sys mem
00:36 karolherbst: and there are tons of those messages overall
00:36 RSpliet: Yeah, every allocation is logged... so it's unlikely that allocation is the right one
00:37 karolherbst: maybe it's not even an allocation at all but something else?
00:37 karolherbst: I just want to have a solid way on how to debug those stupid issues
00:37 karolherbst: it's not like those are rare or something
00:38 karolherbst: I was actually trying out several games and hit that issue with 3 games in a row
00:38 karolherbst: and all three games are magically fixed when I put the vertex buffers into sysmem
00:41 RSpliet: The odd thing is that it sounds like a matter of probability. Like, it doesn't happen for every vertex buffer that's allocated in VRAM, but after so many frames/whatever it suddenly fails?
00:42 karolherbst: who knows
00:42 karolherbst: maybe it's not even the vertex buffer which is the problem
00:44 RSpliet: "what's curious is that with faster vram the error usually triggers later in the apitraces I have :/" <- for given GPU it always fails at the same call in the apitrace?
00:45 karolherbst: mhhh, well
00:45 karolherbst: not really
00:45 karolherbst: but
00:45 karolherbst: it's not entirely random either
00:45 karolherbst: it's not that random so you think you could bisect the call
00:45 karolherbst: but if you reach your bisect goal it moved like 2k calls earlier
00:45 karolherbst: or something
00:46 karolherbst: it's quite deterministic, but not quite
00:46 karolherbst: and changing the pstate moves the general area quite a lot (multiple frames)
00:47 RSpliet: Since it's vertex buffers that expose the anomaly, what can you tell about them. When it crashes, how many of those vertex buffers have been allocated? What are their virtual addresses, physical addresses, sizes, alignments...?
00:47 RSpliet: (I'm in bed btw, but this would be my line of reasoning)
00:48 karolherbst: and I am sure this won't lead us anywhere really. There are essentially hundreds of those
00:48 karolherbst: maybe even thousands
00:49 karolherbst: and to be honest, I don't even want to have to spend that much time of guessing or debugging if the driver/firmware could just tell us what's up
00:49 karolherbst: so I want to focus on that first
00:49 karolherbst: so, fine, the context switching didn't happen
00:49 karolherbst: question is: why?
00:49 karolherbst: which engines didn't do whatever it has to do?
00:50 karolherbst: apparently the firmware or something else has to be able to tell us
00:50 karolherbst: and if there is no way to figure that out, that's also a nice answer to have
00:51 skeggsb: the driver *tells* you which engine(s) have stalled in the logs
00:51 skeggsb: not sure why that's even a question :P
00:52 karolherbst: huh? I am sure I would have seen that
00:53 skeggsb: there should be messages about runlists/engine recovery being scheduled
00:54 karolherbst: okay, let me check again
00:58 karolherbst: mhh okay, I see
00:58 karolherbst: runlist 0 and engines 7,0
01:00 skeggsb: so it's gr/grce, more likely gr itself than the copy engine
01:01 karolherbst: yeah... would be cool to have the engine names directly in that message though
01:03 karolherbst: anyway, now it would be also helpful to know what they are up to
01:05 karolherbst: skeggsb: any idea on how to track that down? I assume there is not really a register telling us that directly?
01:08 karolherbst: worst case we could check the last submitted pushbuffers or something
01:44 karolherbst: okay nice.. now I need to parse them commands :)
02:00 karolherbst: mhhh, maybe I should try mmt before that though
02:00 karolherbst: but I kind of thought that it doesn't really work anymore with nouveau
02:07 karolherbst: heh, it just aborts and doesn't do much...
02:20 imirkin: you can flip drm->nvif=false
02:20 imirkin: which helps
02:20 imirkin: but it still dies
02:20 imirkin: (or maybe not - i did have the local changes)
03:08 karolherbst: imirkin: I think I will still write my own more or less hacky pb parser as running mmt slows things down so much :/
03:08 karolherbst: just dumping whatever gets passed to PUSH_DATA should be enough here anyway
03:56 karolherbst: mhh, we have some direct writes into the buffer as well outside of PUSH_DATA :/ oh well
03:57 karolherbst: at least these are the last bytes written: https://gist.githubusercontent.com/karolherbst/671d29f871350a17a712ec73a6fd9e3f/raw/1fb3613d7a0524270e900ccf3e1d039f0d2727c7/gistfile1.txt
03:57 karolherbst: still want to write up rnndb
03:57 karolherbst: *wire
04:03 karolherbst: imirkin: you wouldn't mind a DEBUG option for mesa where we can just dump the pushbuffer, right? mmt is so painful for seriously big applications :)
04:27 imirkin: that's fine
04:27 imirkin: might be easier to stick into libdrm though
04:27 imirkin: iirc it might already have something
04:44 imirkin: karolherbst: btw, still working on that piglit test. got side-tracked by ... not-work things. once i have some confidence that dropping the << ms thing doesn't hurt, i'll push it out
04:44 imirkin: i'm beyond confused about how this all works though
04:44 imirkin: in part because we cheat regarding drawing to and texturing from MS
04:44 imirkin: we're going behind the card's back, and that can never end well
04:44 imirkin: esp when we don't know what we're doing
13:18 karolherbst: ohh, there is NOUVEAU_LIBDRM_DEBUG
13:19 karolherbst: let's see if I can use that.. but normally it prints it out all string based, which is a little bit slow to parse with sscanf :/
14:09 imirkin: karolherbst: probably better to improve it a bit and have it dump to a side file
14:10 karolherbst: yeah... but the bigger problem in all that is parsing those files anyway
14:10 karolherbst: adjusting libdrm isn't really the problem here
14:15 imirkin: :)
14:15 imirkin: the side file could have an easy-to-parse format
14:16 karolherbst: yeah.. currently just writing a u32 binary file with the commands
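(A minimal sketch of the "u32 binary file" idea, for readers following along. The file layout here, a flat sequence of little-endian 32-bit words with no header, is an assumption; the actual dump format used in this session is not shown in the log. The helper names write_words and read_words are hypothetical.)

```python
import struct

def write_words(path, words):
    """Write each command word as a little-endian u32 (assumed layout)."""
    with open(path, "wb") as f:
        for w in words:
            f.write(struct.pack("<I", w & 0xFFFFFFFF))

def read_words(path):
    """Read the dump back as a list of ints, ignoring a trailing partial word."""
    with open(path, "rb") as f:
        data = f.read()
    # A crash mid-write can leave 1-3 stray bytes at the end; drop them.
    usable = len(data) - (len(data) % 4)
    return [w for (w,) in struct.iter_unpack("<I", data[:usable])]
```

(Dropping a trailing partial word is a design choice for dumps that end abruptly when the channel dies; flushing the file on the writer side is still needed to make sure the last words reach disk.)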
14:16 karolherbst: still have to figure out how all that rnndb stuff works out
14:17 karolherbst: I already know I need different rnn contexts for each subchannel... but now I am wondering how I get the correct subchannel object names
14:18 karolherbst: like which obj-class to use for the SW methods
14:19 karolherbst: ohh wait, there is just the nv1 version anyway
14:19 imirkin: you look at the binding on method 0
14:20 imirkin: that gives you the class to use for that channel
14:20 karolherbst: ahhh :)
14:20 karolherbst: NV1_NULL is the name
14:21 karolherbst: imirkin: right, but I meant for the very first method
14:21 karolherbst: SUBCHAN.OBJECT sounds about right
14:21 imirkin: yes
14:21 imirkin: but since you'll have to update the rnn context anyways
14:21 imirkin: you kinda have to have special-case handling for it
14:22 karolherbst: right
14:22 karolherbst: or I just have one context for each subchannel
14:22 imirkin: you definitely need that anyways
14:23 imirkin: fun fact - you can execute methods without binding any class
14:23 imirkin: if they're the "subchan" methods
14:24 karolherbst: fun :/
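(The per-subchannel bookkeeping discussed above can be sketched roughly like this. It assumes the input is already a list of (subchannel, method, value) writes, that a write to method 0 binds a class to that subchannel, and that the class id sits in the low 16 bits of the written value. The name table is supplied by the caller; the entry used in the demo is a made-up placeholder, not a real rnndb name.)

```python
def track_bindings(writes, class_names=None):
    """Yield one printable line per method write, prefixed by the bound class.

    writes: iterable of (subc, mthd, value) tuples.
    class_names: optional {class_id: name} table (hypothetical, caller-provided).
    """
    class_names = class_names or {}
    bound = {}  # subchannel -> class id currently bound to it
    for subc, mthd, value in writes:
        if mthd == 0x0000:
            # Assumption: method 0 carries the class id in its low 16 bits.
            bound[subc] = value & 0xFFFF
        cls = bound.get(subc)
        if cls is None:
            # Methods can arrive on a subchannel with nothing bound yet.
            label = "UNBOUND"
        else:
            label = class_names.get(cls, f"class_{cls:04x}")
        yield f"{label}[{subc}].{mthd:#06x} = {value:#010x}"

demo = [(0, 0x0000, 0xB197), (0, 0x238C, 1), (1, 0x0100, 2)]
for line in track_bindings(demo, {0xB197: "EXAMPLE_3D"}):
    print(line)
```

(Keeping one entry per subchannel means the method-0 special case only has to update a dictionary, and writes to a subchannel that was never bound still get printed instead of being dropped.)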
14:32 karolherbst: okay... seems to work with hardcoded channel names for now :)
14:33 karolherbst: imirkin: what's the difference between those SQ/NI/1I thingies though? SQ I imagine that each write increases the method, but what are NI and 1I doing?
14:34 imirkin: NI = non-incrementing
14:34 imirkin: 1I = increment the first time, but not later times ;)
14:34 imirkin: i.e. one-increment
14:34 karolherbst: ahhh
14:34 imirkin: it's all in the hwdocs
14:34 imirkin: look for GF100 command formats
14:34 karolherbst: I guess those are the most common use cases
14:35 imirkin: gtg
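(A rough sketch of how the SQ/NI/1I modes described above change which method each data word lands on. The bit layout, with mode in bits 29-31, count in bits 16-28, subchannel in bits 13-15 and method/4 in bits 0-12, is taken from memory of Mesa's nvc0 packet-header helpers and should be checked against the hwdocs before relying on it. The "IL" immediate mode is not mentioned in the chat and is included only as an assumption.)

```python
# Mode field values (assumed): 1 = SQ, 3 = NI, 4 = IL, 5 = 1I.
MODES = {1: "SQ", 3: "NI", 4: "IL", 5: "1I"}

def decode_header(word):
    """Split a command header word into its fields (assumed layout)."""
    mode = (word >> 29) & 0x7
    return {
        "mode": MODES.get(mode, f"unknown({mode})"),
        "count": (word >> 16) & 0x1FFF,
        "subc": (word >> 13) & 0x7,
        "mthd": (word & 0x1FFF) << 2,
    }

def expand(words):
    """Yield (subc, mthd, value) for every method write in a word stream."""
    i = 0
    while i < len(words):
        h = decode_header(words[i])
        i += 1
        if h["mode"] == "IL":
            # Immediate form: the count field itself is the data.
            yield h["subc"], h["mthd"], h["count"]
            continue
        if h["mode"] not in ("SQ", "NI", "1I"):
            raise ValueError(f"unhandled header {words[i - 1]:#010x}")
        mthd = h["mthd"]
        for n in range(h["count"]):
            if i >= len(words):
                return  # stream ended in the middle of a group
            yield h["subc"], mthd, words[i]
            i += 1
            # SQ bumps the method every word, 1I only after the first word,
            # NI never.
            if h["mode"] == "SQ" or (h["mode"] == "1I" and n == 0):
                mthd += 4

# Demo: a 1I group of three words starting at method 0x238c on subchannel 0.
hdr = 0xA0000000 | (3 << 16) | (0 << 13) | (0x238C >> 2)
for subc, mthd, value in expand([hdr, 0x10, 0x20, 0x30]):
    print(f"subc {subc} mthd {mthd:#06x} = {value:#x}")
```

(With 1I the first word goes to the starting method and every later word goes to the next method, which fits a "set position once, then stream data" pattern.)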
15:00 karolherbst: nice, it's working :)
15:08 karolherbst: imirkin: last commands sent: https://gist.githubusercontent.com/karolherbst/01a99d487e9ece848fb1ba72f91b3350/raw/6f281ace6c6c5686d8f82e46f78a66d32f96e043/gistfile1.txt
15:08 imirkin_: that's just constbuf data
15:10 imirkin_: mode PB_CMD_FIFO_1I size 29 channel 3D mthd 238c
15:10 imirkin_: but there are only 19 commands...
15:10 imirkin_: which probably means it hung earlier
15:10 imirkin_: do you mark submits anywhere?
15:10 karolherbst: only 19?
15:11 karolherbst: that file is like 10k lines
15:11 imirkin_: the last group
15:11 karolherbst: ahh
15:11 imirkin_: sorry
15:11 karolherbst: no, I didn't mark the submits yet
15:11 imirkin_: ok, submits are going to be SUPER important for this
15:11 karolherbst: mhh, I'd expect that an earlier submit could also cause some issues?
15:11 karolherbst: but yeah, marking them should be helpful
15:12 imirkin_: either the last, or second-to-last, or some interaction
15:12 karolherbst: imirkin_: but I'd assume it could happen since the last GRAPH.SERIALIZE, no?
15:12 imirkin_: but either way, knowing what actually got sent and what's just in a buffer would be great
15:12 imirkin_: mmmmaybe
15:12 karolherbst: ohh, it's just what got sent
15:12 imirkin_: that's impossible
15:12 karolherbst: nothing not submitted
15:12 imirkin_: that last group is impossible.
15:12 karolherbst: why?
15:13 imirkin_: that would imply we don't PUSH_SPACE correctly somewhere ... or something
15:13 karolherbst: mhhh
15:13 karolherbst: could be an unflushed write
15:13 karolherbst: let me check
15:15 karolherbst: let me just put some fflush in the dumping code, just to be sure
15:17 imirkin_: btw - minor little feature request for this thing
15:17 imirkin_: instead of
15:17 imirkin_: print
15:17 imirkin_: GP104_3D.VERTEX_BEGIN_GL = ...
15:17 imirkin_: that way it's immediately obvious which class is being used
15:18 karolherbst: yeah, makes sense
15:18 imirkin_: also, "channel 3D" is ... weird.
15:18 imirkin_: where is the "3D" coming from?
15:18 karolherbst: hard coded string
15:18 imirkin_: you're talking about the subchannel there?
15:18 karolherbst: yeah
15:18 imirkin_: or the literal channel?
15:18 imirkin_: ok
15:20 karolherbst: mhh rnndb only knows GF100_3D :)
15:20 karolherbst: ohh there is GK104..
15:20 karolherbst: what's the proper name for GM204?
15:21 karolherbst: GM204_3D
15:25 imirkin_: well, you can get the proper name from rnndb
15:25 cosurgi: imirkin_: welp. Another xserver died on me. But the old version 1.0.15. I will start a new one 1.0.16. Do you want to see the error this time?
15:25 imirkin_: can't hurt
15:26 imirkin_: we can see if it was the same thing as last time
15:27 imirkin_: i've definitely had X servers up for a couple months at a time. but my usage is probably not as intensive.
15:27 cosurgi: yeah. here's dmesg: https://paste.ubuntu.com/p/NPW4Sg5Mty/ crap! There's nothing in dmesg!
15:27 imirkin_: iirc xorg log had something
15:27 imirkin_: [last time]
15:29 cosurgi: imirkin_: yeah, and here's xorg log: https://paste.ubuntu.com/p/wDn85WDwyp/
15:30 imirkin_: curious.
15:30 imirkin_: this is something else, i think
15:30 cosurgi: I don't remember if I had compton running there. I didn't use that xserver for over a week. Then today I started switching back and forth between that one and the 1.0.16 one.
15:30 imirkin_: note how it's entirely inside Xorg too
15:30 cosurgi: and working on two things in parallel
15:30 imirkin_: inside the "VTEnter" handler
15:30 imirkin_: i.e. something when you switched back
15:30 imirkin_: there were also funny allocation errors while the VT was not active
15:32 cosurgi: yeah. I guess there wasn't compton running.
15:32 imirkin_: the two are likely related
15:32 imirkin_: i.e. some allocation failed, and then sadness on re-entry into the VT
15:33 imirkin_: but ... more symbols required to actually tell what and where
15:33 cosurgi: how can I give you more symbols?
15:33 imirkin_: you probably strip the Xorg binary?
15:33 imirkin_: "don't do that" :)
15:33 cosurgi: definitely not by hand, by myself.
15:33 imirkin_: not with your bare hands?
15:33 cosurgi: I also have this *-dbg.deb package
15:33 imirkin_: bit by bit
15:33 imirkin_: byte by byte
15:34 cosurgi: xserver-xorg-video-nouveau-dbgsym_1.0.15-3_amd64.deb
15:34 cosurgi: this one.
15:34 imirkin_: xorg itself
15:34 imirkin_: xserver-xorg-dbg
15:34 imirkin_: or ... something
15:35 imirkin_: debian creates very confusing names
15:35 imirkin_: but at least they achieve their goal -- of making things hard to find and debug.
15:37 cosurgi: hm. Inside this package I find this file: /usr/lib/debug/.build-id/11/f7db6795951feb143bc923a610ce9dc5b5586a.debug
15:37 cosurgi: and when I look inside it with midnight commander it shows me a lot of debug symbols
15:39 cosurgi: imirkin_: http://janek.kozicki.pl/tmp/ef04ed2344ec4dd344fec140dfd1639932feb4.debug
15:39 cosurgi: imirkin_: maybe that's what you need?
15:39 imirkin_: what do i do with this?
15:39 imirkin_: what i need is a symbolized stack trace
15:39 imirkin_: which Xorg does by default
15:40 imirkin_: unless it doesn't have symbols
15:41 cosurgi: hm. Okay. I don't know how to make this stack trace symbolized.
15:41 cosurgi: But perhaps when I startx this time I can make sure it will have symbols.
15:42 cosurgi: hm, wait. Strange
15:42 cosurgi: last week I had symbols in the stack trace, right?
15:42 cosurgi: Maybe that has something to do with me reinstalling xserver-xorg-video-nouveau
15:42 imirkin_: not sure that you did
15:43 cosurgi: A version 1.0.15 crashed, but a new package 1.0.16 was already installed and 1.0.15 was removed.
15:43 cosurgi: So when 1.0.15 crashed today, the files with symbols were removed.
15:43 imirkin_: that package in no way affects this backtrace
15:43 imirkin_: the symbols are in xorg
15:43 imirkin_: not xf86-video-nouveau
15:44 cosurgi: ah, so maybe I don't have the *dbgsym*deb package for xorg installed, let me check.
15:44 karolherbst: imirkin_: oh wow... that paste I sent you, it wasn't even an entire submit
15:45 cosurgi: hm, this package I have installed: xserver-xorg-video-nouveau-dbgsym
15:45 imirkin_: right...
15:45 imirkin_: which has the debug symbols for nouveau_drv.so presumably
15:46 imirkin_: which you'll see is nowhere in that stack trace
15:49 cosurgi: imirkin_: ok, I know the problem. The version of this package which I have installed is 1:1.0.16-1
15:50 imirkin_: again, not the right package.
15:50 imirkin_: you're looking at nouveau
15:50 imirkin_: you need to look at xorg
15:51 cosurgi: something like xserver-xorg 1:7.7+19 ?
15:51 imirkin_: that's probably a meta-package
15:51 karolherbst: imirkin_: https://gist.githubusercontent.com/karolherbst/e85f3fa4798cd86dc4325a4d68c3749a/raw/5eedc08ccd0e6de00b929d4394ac55eab673506a/gistfile1.txt
15:51 imirkin_: cosurgi: check what package contains the Xorg binary
15:51 imirkin_: you want that one.
15:51 imirkin_: karolherbst: ok, that makes a lot more sense
15:52 imirkin_: i.e. way it ends
15:52 karolherbst: :)
15:52 karolherbst: well I also added the PB submissions
15:52 karolherbst: and their length
15:52 cosurgi: dpkg -L shows: xserver-xorg-core: /usr/bin/Xorg
15:52 imirkin_: cosurgi: ok, then that's what you want the debug version of
15:53 cosurgi: ok, I'm looking for it.
15:54 imirkin_: karolherbst: ok, and what happens here?
15:54 imirkin_: you get the timeout?
15:54 karolherbst: yeah
16:00 karolherbst: imirkin_: do you think it might make debugging easier if we limit the max size of pushbuffers a little? those 100k lines of print is a bit too much I guess
16:01 karolherbst: or can we actually do that?
16:04 karolherbst: uff, I have a wild idea
16:04 karolherbst: I run that with vertex buffers into sys mem and diff the prints...
16:12 karolherbst: I guess we aren't living in a time yet where we can just diff 100Mb+ sized files
16:12 karolherbst: ohh, actually, it kind of seems to work, although quite slowly
16:17 cosurgi: imirkin_: whew. I have this: xserver-xorg-core-dbgsym, now my stacktrace should look better :)
16:18 imirkin_: yay
16:18 cosurgi: imirkin_: it wasn't in devuan repo. Guys from #devuan told me to pull it directly from debian.
16:18 imirkin_: sounds like fun
16:18 * imirkin_ uses gentoo
16:19 cosurgi: I had to add one line to /etc/apt/sources.list, this one: deb http://debug.mirrors.debian.org/debian-debug/ stretch-debug main non-free-contrib
16:19 imirkin_: it gives me the flexibility i need while still providing enough structure to not go mad
16:19 cosurgi: yeah, gentoo is nice for devs :)
16:19 imirkin_: you sound like a developer...
16:19 orbea: there is a very thin line between user and dev :)
16:20 cosurgi: I'm a user of linux, and a developer of yade which runs on linux :) https://yade-dem.org
16:20 cosurgi: gentoo is nice for linux developers :)
16:20 cosurgi: I used gentoo for one year maybe two.
16:21 imirkin_: meh. i used it since like ... 2003
16:21 imirkin_: or 2004?
16:21 cosurgi: :))
16:21 cosurgi: I think I tried it about 2000, or something ;)
16:21 imirkin_: it's evolved ever so slightly since then
16:21 cosurgi: and tried slack in 1998 or something ;)
16:21 karolherbst: my gentoo is the only system not causing my troubles all that much :p
16:21 imirkin_: yeah, i used slack before gentoo
16:21 karolherbst: *me
16:22 imirkin_: where i basically installed the base, and then built everything else from scratch
16:22 imirkin_: gentoo did all that, but easier.
16:22 cosurgi: then I stuck with debian, because I like pkg management ;P And now switched to devuan because *KILL* systemd.
16:22 imirkin_: gentoo = no systemd as well :)
16:23 karolherbst: ohh, I use it with systemd
16:24 cosurgi: any other *dbgsym* packages which I should have installed?
16:24 imirkin_: dunno. probably all :)
16:24 cosurgi: ok! Let's wait till next crash ;)
16:24 imirkin_: if you ever want to debug anything
16:24 cosurgi: ah. All of them? Erp.
16:25 cosurgi: I suppose I have not enough free space ;>
16:25 imirkin_: there's gotta be a way to tell apt/whatever to just always get the debug packages too
16:25 cosurgi: now I have xserver-xorg-core-dbgsym, xserver-xorg-video-nouveau-dbgsym
16:26 karolherbst: mhhhh weird, why does the CB content differ?
16:26 karolherbst: ohh, nvm
16:31 karolherbst: imirkin_: "X_COUNT = 0x1aaaa0" isn't that a little too high?
16:31 karolherbst: for a copy?
16:33 karolherbst: mhh, although.. should be fine I guess
16:33 prOMiNd: guys is it possible that rtx gpus have more than one fan controller?
16:34 karolherbst: maybe?
16:34 prOMiNd: pretty odd situation where I use index0 to set fan but it does nothing, using index1+ it works
16:35 karolherbst: well
16:35 karolherbst: that's driver stuff anyway
17:13 prOMiNd: karolherbst, no thats NV stuff, you have no way of knowing which index is fan settled on
17:13 prOMiNd: nvml gives just FAN, no index, xnvctrl responds to whatever fan settings you have set with default index0
17:14 prOMiNd: punk you nvidia! :)
17:14 imirkin_: probably in the vbios tables
17:15 prOMiNd: not going that way again
17:22 karolherbst: ohh and only signed firmware has access to the fans anyway
17:22 karolherbst: imirkin_: soo ehm, used the maxwell copy class, but that doesn't change anything as well :/
17:23 karolherbst: anyway, no idea what I really should be looking for in the pushbuffers, it looks more or less saneish
17:24 * karolherbst is wishing we would have a way smaller way to reproduce this issue
17:29 imirkin_: smaller is nicer.
17:29 imirkin_: karolherbst: btw, do you have a GK208B?
17:30 karolherbst: in the office
17:30 imirkin_: and/or a GK208
17:30 karolherbst: I think
17:30 karolherbst: I have some kepler2 GPU, not entirely sure which one
17:30 imirkin_: when you're in the office next, could you try to confirm the same CTXSW_TIMEOUT issue that i got?
17:31 karolherbst: which test?
17:31 imirkin_: one theory is that it affects everything (kepler and/or kepler2). another theory is that it's GK208B specific because we miss some bit of init.
17:31 imirkin_: dEQP-GLES31.*.primitives_generated_instanced
17:32 karolherbst: was it tested on maxwell?
17:32 karolherbst: I think I only checked with pascal
17:32 imirkin_: by you :)
17:32 imirkin_: oh
17:32 imirkin_: then no.
17:33 karolherbst: but I also have access to a gk106 here at home
17:33 karolherbst: so I could check there as well
17:34 imirkin_: that'd be useful
17:34 imirkin_: i don't have any kepler1's myself
17:42 karolherbst: apparently maxwell2 is fine as well
17:43 karolherbst: uff, that kepler machine, still running 4.16 :)
17:44 karolherbst: imirkin_: do you know if that's some kind of regression or did it always happen?
17:44 karolherbst: the installed mesa here is quite outdated
17:49 karolherbst: imirkin_: I think it hangs on gk106 as well
17:50 karolherbst: "deqp-gles31[7710]: failed to idle channel 2 [deqp-gles31[7710]]" :)
17:54 imirkin_: yay
17:54 imirkin_: so not just a GK208B thing
17:55 imirkin_: karolherbst: you also get a CTXSW_TIMEOUT right? not some other thing?
17:56 karolherbst: I don't think the 3.16 kernel was as explicit about that
17:56 imirkin_: oh. 3.16
17:56 imirkin_: not 4.16?
17:56 karolherbst: uhm.. could be 4.16
17:56 imirkin_: but definitely 16? :)
17:56 karolherbst: yeah
17:57 karolherbst: it's 4
17:57 karolherbst: I get a "fifo: channel 2 [deqp-gles31[7710]] kick timeout"
17:57 imirkin_: 2, 3, 4, 5 -- who knows ;)
17:57 karolherbst: I think it's essentially the same
17:58 karolherbst: I want to do a proper clean installation on that machine at some point anyway
17:58 karolherbst: until then I more or less ignore that one
17:58 imirkin_: well, does primitives_generated work? without the _instanced?
17:59 karolherbst: yes
18:28 cosurgi: ok. so I started this xserver again, and I keep working on two different things. While one is compiling I switch to the second one. Lots of switching back and forth.
18:29 cosurgi: Let's see when it will crash again.
18:29 cosurgi: imirkin_: I prefer that it does not crash so I do not start compton, right? :)
18:30 imirkin_: well, one theory was that xf86-video-nouveau 1.0.16 should help that
18:30 cosurgi: yeah, but the second crash of 1.0.15 was different. And now it is 1.0.16.
18:31 cosurgi: btw, I still have some 1.0.15 running, in less often used sessions. And now that I have debug symbols for Xorg, this may someday yield some useful stack trace for you.
19:33 Lyude: oh, I just realized I missed your response imirkin_: 2019-02-08 19:59:56 imirkin_ Lyude: define 'scratch'
19:33 Lyude: any mmio register that retains a value written to it that isn't known to control anything and isn't used by nouveau/vbios for anything
19:47 cosurgi: imirkin_: welp. That xserver just crashed again. And in a weird way. It left some rubbish in the text VT. Some of it got executed by zsh. Also, there is nothing in Xorg log!
19:47 cosurgi: this is dmesg (I removed all the rubbish from my USB-UPS, those are the empty lines): https://paste.ubuntu.com/p/dG4j599bTJ/
19:48 cosurgi: imirkin_: this is Xorg.log: https://paste.ubuntu.com/p/BJnkgtGdWP/
19:49 cosurgi: also, I should show you the photos of text VT after this crash, hold on.
19:49 cosurgi: I will have those photos uploaded soon.
19:50 imirkin_: i don't see a crash in xorg
19:50 cosurgi: imirkin_: I wasn't using compton definitely. In fact I was only using gimp
19:50 imirkin_: i think the rest of the messages are semi-normal
19:51 cosurgi: $ grep -E "1\.0\.16" ./X*
19:51 cosurgi: ./Xorg.14.log:[1702853.481] compiled for 1.19.2, module version = 1.0.16
19:51 cosurgi: ./Xorg.16.log:[1109162.349] compiled for 1.19.2, module version = 1.0.16
19:51 cosurgi: The Xorg.16.log is one session. The Xorg.14.log is the one which crashed. All other sessions use 1.0.15
19:51 cosurgi: This is correct file.
19:52 cosurgi: And there is no crash. I agree.
19:52 cosurgi: Except that it crashed.
19:53 Lyude: nice, so I at least figured out a way to get the GPU left on and unload nouveau :)
19:53 Lyude: looks like a kexec without calling shutdown() does the trick
19:54 imirkin_: great way to get it to scribble over random memory...
19:54 Lyude: imirkin_: oh certainly
19:55 cosurgi: imirkin_: so I switched to that VT, saw the xserver for half a second, then it crashed and I saw this: http://janek.kozicki.pl/tmp/crash1.jpg
19:56 cosurgi: imirkin_: but then I could not type anything. I had to log in remotely and type `chvt 2` as root. Then I switched again to this VT and then I saw this: http://janek.kozicki.pl/tmp/crash2.jpg
19:57 cosurgi: weird thing is that the first screen looks almost exactly what I had in that VT before starting xserver. And the second one, is what I usually see after a crash.
19:59 cosurgi: imirkin_: also notice the mouse pointer on crash1.jpg, it's still there!
20:00 cosurgi: That's a super-weird crash without a stacktrace or whatever.
20:02 cosurgi: imirkin_: I was trying to prepare some wallpapers in gimp in that xsession. There are photos of size about 16000x4000 which I rotate, crop, etc in gimp.
20:02 cosurgi: I do this while yade is compiling in my main xserver.
20:02 karolherbst: imirkin_: duh... I saw that the trace causes the text area to increase in size... now I just set it to an appropriate size and guess what that changes as well?
20:03 karolherbst: mhh, nvm, got the channel crash in the other trace again now :/
20:03 karolherbst: it's still weird
20:08 cosurgi: imirkin_: do you think that indeed there was no crash, but gimp somehow asked Xserver to exit gracefully?
20:09 cosurgi: When starting again this xserver I got a new message in dmesg
20:09 cosurgi: Feb 11 21:04:05 absurd kernel: [1711862.380352] WARNING: CPU: 22 PID: 8191 at drivers/gpu/drm/nouveau/nvif/vmm.c:71 nvif_vmm_put+0x65/0x70 [nouveau]
20:10 cosurgi: here's the full message: https://paste.ubuntu.com/p/mDYRhyQxqR/ but it's not a message from this crash. But it's a message from startx again, after this crash.
20:16 karolherbst: uff, this game even requires GL_RGBA16F :/
20:17 karolherbst: imirkin_: there isn't anything weird with GL_RGBA16F, is there?
20:18 imirkin_: not especially
20:18 imirkin_: float formats have all kinds of clamping BS
20:18 imirkin_: but we cover that fairly well
20:19 imirkin_: on the vertex side, it's pretty normal too...
20:19 karolherbst: right, but do we even do fp16 textures yet?
20:19 imirkin_: "yet"?
20:19 imirkin_: we've always done fp16 textures
20:19 imirkin_: even nv30 does fp16 textures (although we don't expose it)
20:19 imirkin_: but ARB_texture_float is part of GL 3.0
20:19 karolherbst: ohh, there it is
20:19 karolherbst: I didn't find it in the table
20:19 imirkin_: :)
20:19 karolherbst: imirkin_: mhh, weird, because mesa 18.2 is kind of required for that game
20:20 karolherbst: otherwise I get "GL_INVALID_VALUE in glTexImage2D(internalFormat=GL_RGBA16F)"
20:20 imirkin_: was it perhaps compiled without --enable-texture-float?
20:20 karolherbst: duh...
20:20 karolherbst: thanks
20:20 imirkin_: yw
20:22 cosurgi: imirkin_: did you have look at those photos? Or you say that without stacktrace we are at complete loss?
20:23 imirkin_: cosurgi: i didn't. sorry, busy with Real Things (tm)
20:23 cosurgi: ok.
20:23 cosurgi: Will let you know when another crash happens. I have a feeling that gimp working on 16000x4000 photos is the reason. So it might happen pretty soon again.
20:23 imirkin_: oh yeah
20:24 imirkin_: i think if the pixmap is too big, we explode
20:24 imirkin_: or at least wouldn't be surprised if that were the case
20:24 imirkin_: 16000x4000 is ... not small.
20:24 cosurgi: I do not zoom on these images. But perhaps gimp stores entire pixmap in memory.
20:24 imirkin_: probably does
20:25 cosurgi: yeah, it's huge. I made some panoramic photos during holidays, and I wanted to cut a part of it for wallpaper.
20:25 karolherbst: _fun_ with 18.0 we actually misrender something and the channel gets killed like immediately
20:30 imirkin_: yay!
20:31 imirkin_: i guess later versions don't just add bugs -- they also fix some!
20:32 gnarface: cosurgi: gimp caches some static volume of the most recently opened files ALL in memory. you can change the setting, but i think the default is 64MB
20:33 gnarface: i think if you delete your ~/.gimp directory it should ask you again on startup, but maybe that's a distro-specific thing
20:34 cosurgi: hm. When resizing one photo gimp warned me that image will use 700MB.
20:34 gnarface: i have no idea what would happen if you are working on a single image larger than the cache value.
20:35 cosurgi: For now I will let xserver crash. Maybe we will eventually get some stack trace ;)
20:35 cosurgi: but thanks anyway :)
20:36 gnarface: eh, it might be related, it might not, i dunno. but i do remember, that yes you're right, gimp is doing that.
20:36 gnarface: gimp is caching the whole thing
20:37 gnarface: or trying, anyway
20:38 cosurgi: imirkin already told me to switch to amdgpu :) I will do that when I have mondy for a GPU. And then I will pick one with largest RAM available, just for gimp ;)
20:38 cosurgi: s/mondy/money/
20:39 gnarface: well, it's conceivable that you can avoid the crash by just changing a gimp setting for disk caching or something
20:39 gnarface: the only real tradeoff would be startup times
20:40 gnarface: well, i guess that's not true. filters and post-processing would be slow too, but my point is it should still work
20:40 gnarface: gimp 2.8, right?
20:46 cosurgi: yes, 2.8.18
20:47 cosurgi: hmm
20:47 cosurgi: but the problem is with video-ram (my GPU has 6GB). I have no problem with RAM at all: I have 128GB, and 320GB swap.
20:48 cosurgi: I think the problem is with pixmax stored in vram.
20:48 cosurgi: *pixamps
20:48 cosurgi: https://docs.gimp.org/2.8/en/gimp-using-setup-tile-cache.html
20:49 cosurgi: from this I conclude that this cache is about RAM, and not about vram.
20:50 cosurgi: Edit->Preferences->Environment, I see Tile cache size=65998286 kb, that's 64GB, and I think that's fine with me. Half of my RAM
20:52 cosurgi: gnarface: so I suppose this wont help, right?
20:52 gnarface: cosurgi: hmm. no, i guess perhaps not.
20:53 gnarface: but i actually don't know enough about it to be sure it would be a waste of time to fiddle with the value
20:54 gnarface: wait 64GB though? i don't remember it being 64GB by default, i could have sworn it was in MB...
20:56 gnarface: hmm. here it's only set to 8GB, which is also half my RAM...
20:56 gnarface: i see it right next to a setting for "maximum undo memory" maybe that's what i was getting it confused with...
20:56 gnarface: it doesn't seem right though, i'm forgetting something else about this
20:57 gnarface: "maximum new image size" is set to 128MB here
20:58 gnarface: i dunno, maybe it's more different from older versions than i thought
21:00 cosurgi: gnarface: yeah. I also have "maximum new image size" is set to 128MB. Just the "Tile cache size" is set to half of RAM.
21:01 cosurgi: and actually that "maximum new image size" is the size which is accepted without this dialog box "Are you sure?"
21:10 karolherbst: so uhm.. what do I do when an application running inside gdb segfaults, but the address it tried to read from is a valid one?
21:17 gnarface: check permissions then suspect the hardware?
21:18 gnarface: i don't know, i'm sorry, i'll shut up now
21:47 Lyude: wow
21:47 Lyude: i just found a fix for that special laptop that I've been trying to debug that wasn't power cycling its gpu
21:47 Lyude: i've never felt so powerful in my life
21:48 cosurgi: Lyude: *contratulations* :-D
21:49 cosurgi: Lyude: *congratulations* :-D
21:49 cosurgi: (:
21:49 Lyude: thank you!