00:01pllz: in particular all seems to start at "[337499.219] (EE) NOUVEAU(0): failed to set gamma: Permission denied"
00:01imirkin: pllz: are you on a MCP77/MCP79 with 32MB or less of stolen memory?
00:01imirkin: oh, you have the log there
00:10pllz: iirc it happened months ago too, but i didn't give it much relevance... i thought it was a casual issue and didnt need to log again into a different tty for some time, so i forgot about it. as now its always reproducible on my system
00:18imirkin: basically you can't do anything when you don't have "master"
00:18imirkin: if you have a situation where an application doesn't surrender master while X starts running again, it dies
00:18imirkin: however it doesn't sound like that's what happened here
00:18imirkin: so ... not sure what happened
00:23pllz: i tried the same steps on a machine with totally different hw, i.e. startx on tty1, switch to tty2, login and logout, and the X session was still there
00:24pllz: i know this is not enough to imply its nouveau stuff
00:25pllz: i'd just like to report it somewhere, but i dont know whether it's an X thing, a window manager thing, or video driver, or something else... too many elements
00:29imirkin: and the whole gamma save/restore thing is wildly confusing
00:33pllz: imirkin, ok it's sleep time for me... thanks for your support, if you have some suggestions or ideas let me know, i'll stay in here for some time
00:33pllz: im glad i found a puzzle for your mind :D
00:35imirkin: well unfortunately for you i won't have much time to think about it
00:37pllz: oh np, i'll try asking some friends to crash their x session (obviously without warning them) and see if we can get something more useful
00:43imirkin: pllz: patches always welcome ;)
03:51Lyude: mupuf: btw: https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&highlight_names=&date=2018-01-15&show_html=true I think they are right; it seems like slcg is blcg but it takes a little longer to kick in
03:53Lyude: oh wait I think I meant to link this https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&highlight_names=agd5f%3BLyude&date=2018-01-15 this irc log viewer is new tom e
03:53Lyude: *to me
03:54Lyude: But yeah; I see the power consumption with slcg seems to lower a little more slowly than blcg does
08:25mupuf: Lyude: this would make sense
08:25mupuf: and SLCG could indeed be for SRAMs (registers)
08:25mupuf: SRAM Logic CG
08:26mupuf: and Block-Level CG
10:22pllz: imirkin, mhhh 2yrs ago: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=801655
10:23pllz: the maintainer says it's not nouveau related, though
11:00feaneron: Lyude, really amazing work with clock-gating. Truly deserves a free <whatever your favorite drink is> out of gratitude
11:01feaneron: reading the patches, trying to learn something
11:01feaneron: so many magic values all around...
11:08RSpliet: mupuf: sure the S doesn't stand for "subdevice"?
11:10RSpliet: which makes sense right? The individual Blocks gate when idle, and when all of them idle you can gate the whole subdevice
11:11RSpliet: Lyude: ^ that
12:15RSpliet: feaneron: that's unfortunately the nature of clock gating. We roughly know what the magic values mean (they're just delay times), but determining them takes a lot of automated engineering that only NVIDIA can do. I suspect there's no such thing as a wrong value (well... within constraints), but all "we" can do is copy and make grateful use of their tweaking passes.
12:15RSpliet: Judging by the Tegra code they've pushed, it's unlikely that they tweak these values during run-time :-)
14:50volodia: I want to contribute. What you could say about Volta and problems with Secureboot?
14:51volodia: Thanks for the replies, I will read them from the backlog, CU!
14:51imirkin_: what kind of problems did you have in mind?
14:53volodia: @imirkin: I heard about problems with RE on newest platform for support interoperability...
14:54volodia: @imirkin_: I heard about problems with RE on newest platform for support interoperability...
14:54imirkin_: well, anything after GM200 needs signed firmware in order to operate most engines
14:54imirkin_: (well, not most by number, but most by usefulness)
14:55imirkin_: is volta publicly available yet? in any case, i dunno that much if any work on it has been done
14:55imirkin_: and to get beyond plain display will require nvidia to release firmware
15:00pmoreau: imirkin_: The only Volta card available on the market (as far as I’m aware), is the new Titan V, at ~$3,000.
15:01imirkin_: is it actually available, or just announced?
15:01pmoreau: I think it is available, but maybe on short supply?
15:03imirkin_: not like this is some definite thing, but i don't see it on newegg
15:03pmoreau: Hum, apparently it is still on “Notify me” on the US NVIDIA page. On the Swedish one, it says “Not in stock”.
15:04pmoreau: “Out of stock” on the french webpage.
15:04pmoreau: So, I think it was available.
17:04Lyude: feaneron: thank you! and yeah, tons of magic :P
17:04Lyude: although the more I look at this the less magic it seems
17:04Lyude: RSpliet: also yeah; that sounds about right
17:05Lyude: as well: got bored last night so I read through some of the fermi reclocking traces with CG_CTRL being used, I think I might have an idea what's happening but I'd need to play around with reclocking on fermi myself to confirm
17:05RSpliet: Lyude: which Fermi cards do you have to play with?
17:06Lyude: RSpliet: nvc8, maybe some mobile ones as well but I haven't gone through the RH laptops and checked
17:06Lyude: btw; see the section in my notes "Fermi CG_CTRL maintenence"
17:07RSpliet: Wait... presumably, contrary to the information on Wikipedia, those are DDR3 cards?
17:08Lyude: nvc8? I'm not sure but I can plug my card in and check, sec
17:08RSpliet: No rush
17:09cpuheater: I don't know much about IRC etiquette so I'm just gonna chirp in here: I'm trying to turn off my graphics card while keeping the rest of the system running. 1. What would be the proper way to do this? 2. When I do `echo -n '0000:01:00.0' > /sys/bus/pci/drivers/nouveau/unbind` I get a kernel oops, is this expected or should I report it?
17:09Lyude: RSpliet: btw; I have full traces for that card (no reclocking yet though) otherwise on the vbios repo already; I try to trace all the NV hardware I get my hands on if I have the blob installed
17:09RSpliet: I might have one or two at home, been meaning to test Bens work on them, but haven't found the time yet
17:09imirkin_: cpuheater: disconnect vtcon first
17:09Lyude: there's an easier way imirkin_ cpuheater
17:10Lyude: let me find the command I use
17:10Lyude: echo 1 > /sys/class/drm/card0/device/remove
17:10Lyude: that also unbinds vtcon if applicable
17:10RSpliet: Lyude: And today isn't going to be the day either for me ;-)
17:10imirkin_: hmmmm ... interesting.
17:10Lyude: mhm, I've been using it a good bit for reclocking stuff
17:11cpuheater: Lyude: so apart from vtcon this would also unbind all other things using the card? Like the wayland compositor?
17:11Lyude: No, you need to stop those services beforehand either way
17:11Lyude: there are some problems right now with nouveau unload though iirc
17:12Lyude: I know my machine becomes unhappy if I go into wayland, go back into fbcon then remove the card
17:13cpuheater: Ok then I'm gonna try your command quickly (I'm writing on that machine so I'll be back in a few minutes)
17:14cpuheater: Oh btw what would be the corresponding command to get the card back up again?
17:15cpuheater: Will `echo 1 > /sys/bus/pci/rescan` do the job?
17:15imirkin_: if you unbind the driver, you have to rebind manually
17:15imirkin_: i.e. echo 'the-same-thing' > /.../nouveau/bind
17:15imirkin_: not 100% sure if a rescan is a good move
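The unbind/rebind sequence described here can be sketched as a small shell helper. This is only a sketch under assumptions: `pci_driver_ctl` and the `SYSFS_ROOT` override are hypothetical names introduced for illustration (the override exists so the function can be tried against a scratch directory instead of the real `/sys`). On a real system it needs root, and vtcon plus anything else using the card has to be released first.

```shell
# Hypothetical helper: write a PCI address to a driver's bind/unbind file.
# SYSFS_ROOT is an assumption made for testing; it defaults to /sys.
pci_driver_ctl() {
    action=$1               # "bind" or "unbind"
    addr=$2                 # PCI address, e.g. 0000:01:00.0
    driver=${3:-nouveau}    # driver directory under /sys/bus/pci/drivers
    root=${SYSFS_ROOT:-/sys}

    case "$action" in
        bind|unbind) ;;
        *) echo "usage: pci_driver_ctl bind|unbind ADDR [DRIVER]" >&2
           return 2 ;;
    esac

    ctl="$root/bus/pci/drivers/$driver/$action"
    if [ ! -e "$ctl" ]; then
        echo "no such control file: $ctl" >&2
        return 1
    fi

    # Same effect as `echo -n ADDR > .../unbind`: no trailing newline.
    printf '%s' "$addr" > "$ctl"
}
```

With that in place, `pci_driver_ctl unbind 0000:01:00.0` followed later by `pci_driver_ctl bind 0000:01:00.0` mirrors the manual rebind, since the same address string goes into both files.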
17:16cpuheater: ok I'll try
17:21cpuheater: ok I still got a kernel oops: `kernel: IP: nouveau_backlight_exit+0x2b/0x70 [nouveau]`
17:23Lyude: cpuheater: that is a bug
17:23Lyude: cpuheater: what card are you using?
17:25Lyude: RSpliet: btw; this card appears to be gddr5 https://paste.fedoraproject.org/paste/9GEPp-uLc0z~vMnudoCNgw
17:26cpuheater: Lyude: GK107M [GeForce GT 750M Mac Edition] (rev a1)
17:26cpuheater: nouveau-fw 325.15
17:27cpuheater: oh that is the firmware package, not the actual driver
17:28Lyude: I'm more concerned about the kernel version, could you also get a dmesg of the splat?
17:28cpuheater: yeah right, the kernel version is 4.14.13-1-ARCH
17:29cpuheater: well the problem with dmesg is that it resets after a reboot, so I only have the part that is visible in journalctl
17:30cpuheater: but I can dump the dmesg to a file right after killing the driver (just have to reboot again)
17:31Lyude: ah; your best bet is usually netconsole or kdump for that sort of thing
17:32Lyude: also I've got a laptop with that same GPU on it, going to see if I can get it to crash here
17:34cpuheater: Ok yeah later today I'll have access to another laptop so I could SSH, but for now I guess dump-to-file will do the job
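For the netconsole route, the module takes one parameter string of the form `[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]`. A minimal sketch that builds that string follows. `netconsole_param` is a hypothetical helper name, and the interface name and IP address used with it are placeholders, not values from this log. The source port, source IP, and target MAC are left empty so the kernel falls back to its defaults.

```shell
# Hypothetical helper: print a netconsole= parameter string.
# Arguments: target IP (required), local interface, target UDP port.
netconsole_param() {
    tgt_ip=$1
    dev=${2:-eth0}        # placeholder default interface name
    tgt_port=${3:-6666}   # 6666 is netconsole's default target port

    if [ -z "$tgt_ip" ]; then
        echo "usage: netconsole_param TARGET_IP [DEV] [PORT]" >&2
        return 2
    fi

    # Empty src-port, src-ip and target MAC: kernel defaults apply.
    printf 'netconsole=@/%s,%s@%s/\n' "$dev" "$tgt_port" "$tgt_ip"
}
```

On the crashing machine this could be used as `modprobe netconsole "$(netconsole_param 10.0.0.2 eth0)"`, with something like `nc -u -l 6666` listening on the second machine (the exact netcat flags differ between variants).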
17:44cpuheater: Hmm I'm currently unable to obtain a dmesg log of the crash because bash seems to terminate at the line `echo 1 > /sys/class/drm/card0/device/remove` (even though it is clearly executed)
17:45Lyude: btw: installing fedora onto the machine now, should know in a little bit if I can reproduce it here or not
17:45cpuheater: Ok cool
17:46cpuheater: I'll check back in a few hours, and then I should also have a secondary PC
17:54Lyude: btw nrr you said that kernel update made the problems a lot less frequent
17:54Lyude: running 4.15-rc8 on this test machine with my kepler and not seeing any crashes when resizing windows anymore
17:55nrr: Lyude: correct. it went down from literally every gnome-shell session to ~once per day
17:56nrr: i had a loaded weekend this past weekend, and my sleep schedule is kinda sorta back on track now, so i'll (finally) see about isolating what actually did it sometime this week.
17:56nrr: i got as far as setting up some kprobes and promptly had to drop things. :P
17:57nrr: if you can, though, try giving one of your displays a 90° rotation and see what happens then since that's part of the configuration i was running when i had egregiously bad problems.
17:57Lyude: that's probably just making the race condition easier to hit
18:05Lyude: nrr: got it to happen
18:05Lyude: I just wasn't running in wayland apparently
18:22nrr: Lyude: \o/
20:02RSpliet: Lyude: ah, GF110, that's a different code name. I kind of sort of suspect that most Fermi's are either DDR3 or GDDR5 like with Kepler
20:05RSpliet: We kind of don't have GDDR3 ram calc code either
20:14cpuheater: Lyude: I retrieved the dmesg output from after the nouveau crash: https://pastebin.com/mZJueLJn
20:14cpuheater: this was obtained after typing `echo 1 > /sys/class/drm/card0/device/remove`
20:18Lyude: cpuheater: alright; i've also gotta look at a PTE bug that's causing some nasty kernel panics as well so I may or may not have enough time today to check it out a little more
20:24cpuheater: Lyude: ok great, however I'm usually not on IRC so I might miss your updates. Maybe we can move later updates to the mailing list?
20:25Lyude: cpuheater: mhm; that works for me. Could you send a message to the ML and cc me in it? (lyude @AT@ redhat.com)
20:26cpuheater: Yes I'll do that, I'll just include the dmesg and the info I gave here before
20:26Lyude: cool, thanks!
21:18vedranm: imirkin_: good judgement, radeon kmod has the same rmmod issue as nouveau, and it doesn't reboot nicely after it fails (details: https://bugs.freedesktop.org/show_bug.cgi?id=104608)
21:18vedranm: wonder if that could be addressed
21:18imirkin_: of course it can
21:18vedranm: imirkin_: you mean, linux could ignore modules broken by rmmod somehow?
21:30Processus42: Hello again guys. I am trying to build efifb with my Gentoo. I believe the right option is "CONFIG_FB_EFI". However, this does not build the module efifb. After reading some Kconfig files, it appears that efifb is mentioned as a "Legacy" driver. I have the kernel 4.9. What should I do? Does it mean the thing you called "efifb" is already built into my kernel?
21:33vedranm: imirkin_: OK
21:33imirkin_: vedranm: but patches could be submitted so that it doesn't do that :)
21:34imirkin_: Processus42: just because it is written, doesn't make it so
21:34imirkin_: efifb is not in the least bit legacy
21:34imirkin_: it's 100% required for all but the weirdest setups if you have an efi boot
21:35imirkin_: do you have CONFIG_FB_EFI=y in your .config?
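One way to answer that question is to read the option straight out of the kernel configuration. The sketch below is illustrative only: `kconfig_state` is a hypothetical helper, and the default `.config` path is an assumption (on a running system `/proc/config.gz` can be decompressed instead, if the kernel was built with `CONFIG_IKCONFIG_PROC`). A result of `builtin` means the code is compiled into the kernel image, so no separate module file is produced. Whether `CONFIG_FB_EFI` can be built as a module at all depends on that kernel's Kconfig, which this log does not confirm.

```shell
# Hypothetical helper: report how a Kconfig option is set in a config file.
# Prints "builtin" for =y, "module" for =m, "unset" otherwise.
kconfig_state() {
    opt=$1
    cfg=${2:-.config}

    if [ ! -r "$cfg" ]; then
        echo "cannot read $cfg" >&2
        return 1
    fi

    # -F: fixed string, -x: whole line must match, -q: quiet.
    if grep -qFx "${opt}=y" "$cfg"; then
        echo builtin
    elif grep -qFx "${opt}=m" "$cfg"; then
        echo module
    else
        echo unset
    fi
}
```

For example, `kconfig_state CONFIG_FB_EFI /usr/src/linux/.config` (the path is a guess at a typical Gentoo layout) would print `builtin` for a config that contains `CONFIG_FB_EFI=y`.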
21:41vedranm: imirkin_: I know, when I decide to learn kernel hacking, I will start from there :)
21:47Processus42: imirkin_: Yes, I have the configuration set to yes
21:50imirkin_: Processus42: ok, tbh i don't really remember what your issue was
21:53Processus42: imirkin_: Originally, when nouveau was getting loaded, I lost my screen output, and my displays went to sleep. It was solved with config=NvForcePost=1. And someone (I believe it was you imirkin) told me that it wasn't necessary to load the module with the latter option if efifb was loaded before nouveau.
22:01imirkin_: Processus42: ok, so your displays work until nouveau is loaded right?
22:01imirkin_: Processus42: and you're using an efi boot?
22:28DottorLeo: what is the support status about Nvidia 610M (Fermi)? Need manual reclocking?
22:33imirkin_: DottorLeo: yes. and not upstream.
22:34DottorLeo: uhm ok i'll wait :)
22:36DottorLeo: i have an Intel HD3000 that works fine but is stuck to 3.3 and i'd like to use the more powerful card for the extra (OGL 4.5 and i think it's faster than the integrated one)
22:42Processus42: imirkin_: Yes, they are working until nouveau is loaded. And yes I am using EFI boot :)
22:43imirkin_: ok, so that's a different issue than the one i thought you had
22:44Processus42: I have no issue right now. Since I'm using the option config=NvForcePost=1
22:44Processus42: However, you talked about efifb also solving the problem in another way. Or maybe I misunderstood.
22:45Processus42: Oh, wait, maybe you are talking about the fact that efifb is loaded and that I still need to use the option. Right ?
22:47BootI386: imirkin: Can you explain me again why when the GPU hangs in one app, it hangs the whole graphic stack too (even the TTYs)?
22:47imirkin_: BootI386: ever been on a ferris wheel?
22:49imirkin_: then explain why when one seat stops, that causes the others to stop
22:49BootI386: It's stuck waiting one context and can't switch to others?
22:49imirkin_: that explanation should carry over to the graphics scenario nicely ;)
22:49imirkin_: the fucking thing is stuck
22:50imirkin_: there isn't a separate gpu for each application
22:50imirkin_: there's just one total
22:50imirkin_: and it's stuck.
22:51BootI386: And there are no things such as watchdogs? Interrupts?
22:51imirkin_: interrupts. we got 'em
22:52imirkin_: long story short, error recovery is one of the hardest things out there
22:52imirkin_: getting the gpu in a working state after an error is not trivial.
22:52imirkin_: you don't just clear the "stuck" bit and move on with life
22:52imirkin_: to do it properly, you have to power it off and on again. which is impossible with modern hardware.
22:53imirkin_: often it's enough to cycle the PMC enable bits.
22:53imirkin_: but that requires reconfiguring a ton of state.
22:54BootI386: Isn't it better than remaining stuck?
22:55Booti386: Is it currently implemented?
22:56Booti386: Ah, that explains :D
22:56Booti386: Is there a reason to that?
22:56Booti386: (Except it's really hard)
22:56imirkin_: well, that's never a reason
22:56imirkin_: the more common reason is "no one's done it"
22:57Booti386: Oh. Ok.
22:57imirkin_: (which if you ask "why", there can be reasons like "it's a giant pain")
22:57Booti386: Why is it a giant pain?
22:57imirkin_: you're like a small child who keeps asking why...
22:58Booti386: (I'm wondering if I can try to. Maybe if it's painful enough I might be interested. :))
22:58skeggsb: we actually do the recover-the-GPU part fairly alright for a lot of errors, we just fail at all the nasty plumbing and userspace recovery parts
22:59imirkin_: skeggsb: i dunno. in practice, it's gotten much worse for me since the error recovery stuff was added
22:59imirkin_: since now almost any error kills my box
22:59skeggsb: that's... strange
22:59imirkin_: whereas before it would kinda hobble along and i could often kill the offending app in time
22:59skeggsb: i can generally kill the offending app and all is good
22:59imirkin_: from the same machine?
22:59imirkin_: or ssh'd in?
23:00imirkin_: i can ssh in and kill the app and then things come back
23:00Booti386: Do you keep copy of all the state outside of the GPU?
23:00skeggsb: no, ssh'd.. that'd have been worse before recovery though, killing the bad channel wouldn't have unstuck the gpu before that
23:00imirkin_: but the machine in question's display is frozen until i do
23:00imirkin_: (mouse might move, i forget)
23:01imirkin_: i'm not saying that the stuff you did was a bad idea
23:01imirkin_: and yeah, before if i didn't get to it in time, baaaad things would happen
23:01imirkin_: but now i'm guaranteed a freeze that i have to walk over to another box for to kill the process
23:01skeggsb: i'm genuinely concerned/curious, not offended :P
23:02imirkin_: this was happening most recently with the code segment missing bug
23:02imirkin_: it would throw a PTE error, the recovery would catch it, and i'd have to ssh in and kill the process to get shit going again
23:02Booti386: Why don't you directly kill the app if it keeps a channel stuck for too long? Wouldn't that avoid the need to have an ssh server?
23:03skeggsb: imirkin_: it'd be interesting to know exactly what was stuck where when that happens
23:03imirkin_: skeggsb: i agree
23:03imirkin_: unfortunately the cost of that is getting things stuck on my box ;)
23:03skeggsb: i assume X or whatever is waiting for something
23:03imirkin_: anyways, i can repro at will
23:03imirkin_: and it does seem like there's no ill effect
23:03imirkin_: other than having to have another machine handy with the kill -9
23:04imirkin_: my repro, btw, is running Dirt Racer with bindless forced on, but without the fix to make sure the "old" code segment was attached to the pushbuf submit
23:04imirkin_: er, Dirt Rally?
23:04imirkin_: i forget. dirt something.
23:04skeggsb: would probably just deliberately sabotage the 3d driver ;)
23:05Booti386: Why don't you directly kill the app if it keeps a channel stuck for too long? Wouldn't that avoid the need to have an ssh server?
23:05imirkin_: yeah, getting things to not hang is so hard
23:05imirkin_: that just getting something to hang should be pretty easy
23:05imirkin_: Booti386: we don't know which app.
23:08Booti386: Oh. Right. Xorg creates the buffer (or whatever), then sends it to the app (prime fd to handle? I think?) and nouveau has no idea which app it was given to.
23:09Booti386: Or maybe I'm just plain wrong :D
23:11imirkin_: i'm unclear on all the subtleties
23:11imirkin_: but apparently it's "not that easy"