00:23imirkin: w00t! my theory for that weird xfb GTF test panned out
00:42imirkin: Passed: 1742/1742 (100.0%)
00:42imirkin: for the gl45-gtf-master.txt tests
00:54imirkin: karolherbst: ok. i think we pass all CTS and GTF tests now, at least individually.
01:01airlied: imirkin: nice
01:01imirkin: now just have to pass when cts-runner does it :)
02:16imirkin: skeggsb: any clever ideas on how to debug the memory error?
02:18imirkin: (or non-clever ... i'll take anything, really)
02:20airlied: imirkin: is it a nouveau kernel memory error or something?
02:20imirkin: well ... it's an error
02:20imirkin: and it's reported by the kernel
02:20imirkin: but it doesn't quite point the finger
02:21imirkin: except squarely at me, since i'm the one debugging it
02:21imirkin: (and it laughs, tauntingly...)
02:22airlied: okay so unlikely in the CTS itself
02:22imirkin: highly unlikely
02:22airlied:is memory corrupting the cts with vallium, haven't worked it out yet
02:22imirkin: the tests in question all pass when run individually
02:23imirkin: and i can repro the hang when running the cts-master.txt list with glcts
02:23imirkin: [actually, double-checking that now]
02:24airlied: imirkin: probably worth trying to bisect it a bit
02:24airlied: by removing files from the test list
02:24imirkin: double-checking that it hangs with the full list, then will start doing that
02:24imirkin: there are a few super-long tests
02:24imirkin: like image formats, and texture swizzles
02:25imirkin: KHR-GL45.copy_image.functional takes like 10 minutes =/
02:25imirkin: (i dunno, i haven't timed it, but a LONG time)
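[editor's note] The bisection airlied suggests (removing entries from the test list until the hang no longer reproduces) could be scripted roughly as below. This is only a sketch: `reduce_caselist` and the `reproduces` callback are hypothetical names, and the callback is assumed to run the given subset of tests in order and return True when the hang still occurs. How the runner is actually invoked is left out on purpose.

```python
from typing import Callable, List, Sequence


def reduce_caselist(tests: Sequence[str],
                    reproduces: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `tests` (order preserved) while `reproduces` stays True."""
    current = list(tests)
    chunk = max(1, len(current) // 2)
    while True:
        i = 0
        while i < len(current):
            # Try the list with one chunk of tests removed.
            candidate = current[:i] + current[i + chunk:]
            if candidate and reproduces(candidate):
                current = candidate   # chunk was not needed, drop it
            else:
                i += chunk            # chunk is needed, keep it and move on
        if chunk == 1:
            return current
        chunk = max(1, chunk // 2)
```

Since the long-running tests make each attempt expensive, dropping large chunks first keeps the number of full runs down before the search narrows to single tests.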
02:36imirkin: airlied: skeggsb: https://hastebin.com/wodexefuba.css
02:37imirkin: pretty vanilla. an OK-looking address being read from.
02:39skeggsb: i say add some printf's to figure out what buffer the address belongs to
02:39skeggsb: then see if/how userspace is screwing up validation :P
02:39imirkin: add some printf's where? you mean like just run with NOUVEAU_PUSHBUF_DEBUG or whatever it's called?
02:40skeggsb: i guess that works too, that'll help narrow it down
02:40imirkin: this is a GP108 running on kernel 5.0 -- anything jump to mind as having been fixed in that time frame that could affect this?
02:41skeggsb: i'd say there's a >85% chance this is a userspace issue
02:42imirkin: i'm thinking something delightful, like lingering bugs in like code page switching
02:43imirkin: coz there's not a lot of stuff that persists across contexts
02:43imirkin: and iirc this thing creates a fresh context for each test
02:44skeggsb: but with the same 'screen' right? we have a lot of stuff that persists there
02:44imirkin: yes, same screen
02:44imirkin: not a lot
02:44imirkin: a little
02:45skeggsb: far more than there should be :P
02:45imirkin: it's all stuff that makes sense
02:45imirkin: code page, surface descriptor list
02:46skeggsb: the push buffer does *not* make sense, and it's a big part of the MT issues
02:46imirkin: MT issues are caused by lack of ability to control kicks precisely
02:47imirkin: that was where both karol and i ended up giving up on our independent attempts
02:47skeggsb: threads stomping over each other's push buffer happens well before that
02:47imirkin: not if you add code to avoid that
02:47imirkin: or give them separate push bufs
02:47skeggsb: yep, i know :P
02:47skeggsb: i'll fix it... after volta+turing
02:48imirkin: and explicit VA management while you're at it
02:48imirkin: also, i'd like a pony
02:48skeggsb: i can think of better things than a pony, but, sure
03:19imirkin: heh. that high bit we set really trips up the pushbuf_dump logic
03:19imirkin: it's encoded into length, so it thinks the length is ... high
03:21imirkin: gr, i feel like i'm just going to be debugging this pushbuf dumper thing
03:24imirkin: probably ok time to spend though -- it's useful, but hasn't kept up with modern realities
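[editor's note] The dumper problem described here, a flag bit packed into the same word as the length, can be illustrated with a small sketch. The bit position below is an assumption chosen for illustration and is not taken from the log; `split_length` is a hypothetical helper, not part of any real dumper.

```python
FLAG_BIT = 1 << 23           # ASSUMED flag position, for illustration only
LENGTH_MASK = FLAG_BIT - 1   # low bits that carry the real length


def split_length(word: int) -> tuple:
    """Separate the flag from the length in a packed length word."""
    return bool(word & FLAG_BIT), word & LENGTH_MASK


packed = FLAG_BIT | 0x40
print(hex(packed))            # naive reading of the length: 0x800040
print(split_length(packed))   # (True, 64)
```

A dumper that masks the flag off before using the length would walk 0x40 bytes here instead of trying to walk several megabytes.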
03:30imirkin: gah, just realized that we don't actually print the *useful* buffer info in there
03:37airlied: make sure you don't have some leaky list
03:38imirkin: well, i just added printing of offset + size to each buffer attached to the pushbuf
03:39imirkin: which i suppose is bogus for pre-nv50
03:39imirkin: but ...
03:39imirkin: i wonder if we should split the drivers
03:39imirkin: at the nv50 line
03:40imirkin: there's so little overlap between the two, it seems
04:09imirkin: skeggsb: what's the significance of "client 1d [GPC0/T1_7]"?
04:10skeggsb: nfi what that one refers to exactly, but it's the specific block that did the memory access
04:11imirkin: so that address is used in an earlier test as the compute launch descriptor
04:11imirkin: (the test passes)
04:16imirkin: and then the next test passes, and the one after *that* test sets its own compute launch descriptor
04:16imirkin: also that's not the *precise* address...
04:17imirkin: the address for the original launch descriptor is 3aa0200, while the error is at 3aa0000
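[editor's note] The relationship between the two addresses can be checked with a little arithmetic. The sketch below assumes a 4 KiB granularity purely for illustration. Nothing in the log says the fault address is reported rounded down to a page, so treat `page_base` as a hypothetical helper for exploring that guess.

```python
PAGE_SIZE = 0x1000  # ASSUMED granularity, for illustration only


def page_base(addr: int, page_size: int = PAGE_SIZE) -> int:
    """Round an address down to the start of its page."""
    return addr & ~(page_size - 1)


desc_addr = 0x3aa0200    # original launch descriptor address (from the log)
fault_addr = 0x3aa0000   # address in the kernel error (from the log)

print(hex(desc_addr - fault_addr))         # 0x200
print(page_base(desc_addr) == fault_addr)  # True
```

The descriptor sits 0x200 bytes into the region that starts at the faulting address. That is consistent with the fault landing in the same scratch allocation, but it does not prove it.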
04:35imirkin: ok, so i'm going to have to go ahead and disagree with the hardware on this one
05:29imirkin: skeggsb: best i can tell, the test triggering the issue is the one *after* the one that uses that 3aa0000 address.
05:30imirkin: there's no reference to 3aa in the commands submitted by that last test
05:30imirkin: the buffer is a nouveau_scratch buffer, so perhaps something nefarious is happening
05:30imirkin: it's only ever used as a launch descriptor
05:30imirkin: (from what i can tell)
05:31imirkin: by the time the final test is doing a CS launch, it uses a different launch descriptor address
05:31imirkin: which i can see in the cmdstream
05:31imirkin: so ... something very awkward is happening somewhere
05:31skeggsb: hmm, i wonder if HW does something with the previous descriptor when launching the next one
05:32imirkin: there _is_ some sort of talk about "qmd" rings or whatever, right?
05:32skeggsb: QMD is what we call "launch descriptor", but yeah, there's a bunch of stuff in there we don't use/know what it is
05:33imirkin: #define NVC0C0_QMDV01_07_SEQUENTIALLY_RUN_CTAS MW(370:370)
05:33imirkin: hopefully we don't set that guy
05:33imirkin: and there's various talk of circular queues
05:34imirkin: yeah. all kinds of nastiness.
05:35imirkin: desc->unk0 = 0x40;
05:35imirkin: not scary at all.
05:40imirkin: #define NVC1C0_QMDV02_01_QMD_GROUP_ID MW(133:128)
05:40imirkin: assuming we're using V02_01 QMD
05:57imirkin: ok - random try time - moving up the launch desc address setting to earlier in the function
05:57imirkin: maybe something highly unexpected makes use of it
06:18imirkin: nope. same fail, same place.
06:30imirkin: ok. plan b. reset it to 0.
06:52imirkin: well, writing a 0 to the launch descriptor definitely doesn't help
06:52imirkin: either the address is sticking around from something else, or there's more going on
06:52imirkin: guessing the latter =/
14:15Rabid_Raven: hi guys, anyone around?
14:15diogenes_: Rabid_Raven, just ask :)
14:15Rabid_Raven: ha! you're here too
14:15diogenes_: lol i'm everywhere
14:16Rabid_Raven: Question: Does Nouveau support re-clocking on the 970m?
14:16Rabid_Raven: because that would totally fix all of my issues, methinks
14:17imirkin: Rabid_Raven: which GPU is that? GM20x or GM10x?
14:17imirkin: if GM20x - no. if GM10x - yes.
14:18Rabid_Raven: gm204 apparently
14:19imirkin: you should see it in lspci
14:19Rabid_Raven: yeah but i'm not on my personal laptop
14:19imirkin: same marketing name gets used for all kinds of chips, not always easy to tell based on what's online
14:20Rabid_Raven: i'm imagining that it's probably the gm2xx either way since the laptop is a late 2014 model and the chip came out then
14:20Rabid_Raven: that's too bad. i really want to fix the issue and kill a second bird with the same stone by removing blobs from my computer
14:20Rabid_Raven: I could manage with a slight deterioration in gaming performance
14:20imirkin: well, in the best case, it'd be a 40% deterioration
14:21imirkin: i.e. if reclocking worked
14:22imirkin: that said, if it's a laptop, and the fan is controlled by EC, there's actually a possibility of reclocking
14:22imirkin: the main reason it doesn't work on desktop is that we can't control the fan
14:22Rabid_Raven: makes sense and totally not your fault
14:23Rabid_Raven: the fact that you guys were able to reverse-engineer everything is a blessing already
14:23Rabid_Raven: however, i'm curious as to whether this would affect me: https://cgit.freedesktop.org/nouveau/xf86-video-nouveau/commit/?id=e472b47d15634a864c8c981ed588d882aceaf26b
14:23imirkin: karolherbst: maybe you can come up with some instructions for those people?
14:23imirkin: Rabid_Raven: not at all - that's for GP10x GPUs
14:23karolherbst: imirkin: related to what?
14:23Rabid_Raven: alright, i'll wait
14:23Rabid_Raven: i have no interest in chucking the laptop for at least another year or two
14:23imirkin: karolherbst: reclocking GM20x
14:23karolherbst: ohh maxwell2 reclocking stuff?
14:24karolherbst: I actually wrote a patch some time ago where users can enable that feature behind a module flag... but
14:24karolherbst: the thing itself is a bit tricky
14:24karolherbst: and power readings are broken
14:24Rabid_Raven: actually, perhaps you can help me get to the bottom of my issue
14:25Rabid_Raven: in trying to get wake from sleep working on my laptop, i tried a variety of strings in /etc/default/grub in addition to blacklisting nouveau entirely. nothing worked. the computer goes to sleep but refuses to wake
14:25karolherbst: huh, weird
14:25Rabid_Raven: however, i then removed all nvidia, restarted and tried to sleep again. the result was entirely the same with nouveau: sleep, no wake
14:25karolherbst: does it turn on?
14:25Rabid_Raven: perhaps i'm looking at the wrong thing
14:26Rabid_Raven: yeah it turns on when i lift the lid
14:26Rabid_Raven: but black screen
14:26imirkin: try ssh'ing in?
14:26imirkin: maybe it's just the backlight being dumb
14:26karolherbst: Rabid_Raven: hybrid or dedicated mode?
14:26imirkin: or an eDP panel with link-training issues?
14:27Rabid_Raven: karolherbst, not sure if i'm answering that right but i can tell you that you can't change the gpu through a simple software change. you have to restart the computer when going from intel to nvidia and back
14:27Rabid_Raven: it's an msi gt72. in windows, i can switch to intel if i press a button on the left side of the laptop but it restarts the machine
14:27Rabid_Raven: if i suddenly decide i need the 970m, i press the same button but again, restart
14:28karolherbst: Rabid_Raven: if you boot with intel, suspend/resume works, right?
14:28Rabid_Raven: (obviously, this functionality is not available to me in linux)
14:28Rabid_Raven: karolherbst, haven't tried but i know that intel, in general, causes no issue for people
14:28karolherbst: mind booting with intel and checking with lspci if there are two GPUs?
14:29Rabid_Raven: i actually can't do that since there is no windows on the machine any longer which would have enabled me to switch the gpu, heh
14:29karolherbst: you can use both GPUs if they get reported
14:29Rabid_Raven: so i'm stuck with the 970m for better or worse
14:29karolherbst: is there some option in the firmware settings?
14:29Rabid_Raven: nope, that would have been fantastic
14:31karolherbst: ufff, yeah.. did you check if somebody reverse engineered this feature and has some software flip for that?
14:33karolherbst: Rabid_Raven: https://forums.gentoo.org/viewtopic-t-1027986-start-0.html
14:34karolherbst: that might work
14:36Rabid_Raven: it also comes with a disclaimer at the bottom suggesting not to use it with different bios :)
14:36Rabid_Raven: heh, not sure which i am running so i will have to check once home
14:36Rabid_Raven: but thanks for that
14:36karolherbst: EC reverse engineering is _fun_
14:36karolherbst: did that a few times
14:36karolherbst: but normally offsets are stable because otherwise you need drivers handling all of that
14:37karolherbst: but.. uff
14:37Rabid_Raven: as much as i love msi, the complexity of getting the machine to work right in linux is discouraging enough to prevent me from looking at the same brand for the next one
14:37karolherbst: there are a few companies who care more about linux support
14:37Rabid_Raven: karolherbst, any recommendation?
14:37karolherbst: normally dell and lenovo are the first that come to mind as having decent support. There is system76 as well and various others
14:37Rabid_Raven: considering gaming is still a factor
14:37karolherbst: but, it's not a recommendation ;)
14:37karolherbst: just a list of vendors caring more
14:38Rabid_Raven: i was thinking i'd go to dell. my wife swears by lenovo for some ridiculous reason but even she, with linux, finds her trackpad no longer working after waking from sleep
14:38karolherbst: I think HP also cares a bit more...
14:38Rabid_Raven: yeah, i wouldn't touch hp with a ten-foot pole
14:38karolherbst: I wouldn't buy lenovo solely for political reasons, it's a terrible company making terrible choices
14:39Rabid_Raven: karolherbst, i just find their hardware to be lacking in quality
14:39Rabid_Raven: build quality i mean
14:39karolherbst: ohh, well, I wouldn't care as much about that
14:39karolherbst: on some machines they started to install an "ad daemon", with a lenovo system CA inserting ads in all http and https traffic. They did this like 4 or 5 years ago
14:40Rabid_Raven: you would if the hinges connecting your screen to the base broke and the panel was left hanging like it did for my wife's older lenovo
14:40karolherbst: but such companies are just dead for me
14:40Rabid_Raven: very cheap plastic trash
14:41karolherbst: Rabid_Raven: https://arstechnica.com/information-technology/2015/02/lenovo-pcs-ship-with-man-in-the-middle-adware-that-breaks-https-connections/
14:41Rabid_Raven: that lenovo, by the way, was a good machine but because the screen was no longer attached to the machine, it was embarrassing to even own
14:41Rabid_Raven: well, that's windows for you
14:42karolherbst: no, that's vendor installing crap software by default :p
14:42Rabid_Raven: sure, but such trash is only available on windows
14:42karolherbst: if a company cares more about ad revenue than the security of their products, they are simply a bad company
14:42karolherbst: you can do the same on linux
14:42Rabid_Raven: you could, but nobody ever dares
14:42Rabid_Raven: (as far as we know)
14:43karolherbst: well, the thing is, that everybody just installs their own linux anyway
14:43karolherbst: so any preinstalled system is gone
14:43Rabid_Raven: alright, have to monitor some kids for 15 minutes
14:43Rabid_Raven: thanks for all the help. ttyl
15:08Rabid_Raven: just got paid 15 minutes for staring at a wall, more or less
15:08Rabid_Raven: exciting stuff
15:20imirkin_: fun job
15:24Rabid_Raven: we need to monitor areas from time to time
18:10rtwld: hi to all
18:12rtwld: after upgrading the kernel (debian 10), i have a problem with the graphics driver. Unpredictable freezes happen when using programs such as firefox, libreoffice, geogebra
18:14rtwld: i had to set the exa acceleration manually, card is an old geforce 8200, lspci gives NVIDIA Corporation C77 [GeForce 8200] (rev a2)
18:15imirkin_: did you only upgrade the kernel, or also some userspace packages?
18:16imirkin_: and if just the kernel, what was the old version, and what's the new version?
18:22rtwld: new installation on a new disk
18:22imirkin_: if you're using like gnome or kde or something - this is highly expected
18:23imirkin_: if you just want a working system and aren't too fussed about acceleration, just remove nouveau_dri.so or stick LIBGL_ALWAYS_SOFTWARE=1 into your /etc/environment
18:24imirkin_: right yeah, but firefox/libreoffice/etc all use GL now for reasons unknown
18:26rtwld: without .conf in /etc/X11/xorg.conf.d i got another method of acceleration and that resulted in a complete freeze with firefox (only Sysreq worked!)
18:27rtwld: now, as root i can start firefox, but as another normal user i must use --safe-mode
18:28rtwld: with gzdoom i got random freezes, doom playable only with lzdoom
18:30rtwld: i tried the nvidia legacy but the result was even worse: freeze immediately after login and no Sysreq
18:37imirkin_: should have EXA accel
18:37imirkin_: that's really the only option
18:37imirkin_: (other than NoAccel)
18:38imirkin_: this is a nvaa, isn't it?
18:38imirkin_: i thought it was one of those nv68 or whatever's
18:38imirkin_: that should work moderately well...
18:54rtwld: yes it's nvaa, should a normal user be in the video group?