02:14dboyan: By the way, is nouveau accepting gsoc project this year? Also, does the nouveau section on the x.org idea page still apply?
02:14dboyan: That section seems to be last updated one or two years ago
02:14imirkin: dboyan: sure, if a qualified candidate applies with a reasonable project that someone is willing to mentor
02:20dboyan: okay, thanks. Probably I'll apply for mesa or nouveau or something, just trying to get familiar with how they work these days
02:22imirkin: dboyan: i think the general policy is to only accept gsoc applications from people who have made prior contributions (don't have to be enormous ones of course)
02:22imirkin: also ensure that you have a mentor lined up
02:29dboyan: I did make some contributions to mesa but haven't touched nouveau before. Most of what I know came from digging into the recent compute shader bug that I reported.
02:29dboyan: But I did find something interesting
02:30imirkin: yeah, i haven't looked at your updated findings yet in detail
02:30imirkin: i've been pretty sick this week, although i'm suddenly feeling a lot better. hopefully i'll be back to 100% tomorrow. that'd be nice...
02:32dboyan: oh I'm sorry about that
02:32imirkin: i'll survive. hopefully.
02:35dboyan: Do you know why nouveau sets the sched value of the join instruction to 0x0? The nvidia driver on my gk208 generates 0x2f
02:36imirkin: there is no such thing as a join instruction
02:36imirkin: there's a join flag though
02:36imirkin: (which in nvdisasm shows up as .S btw)
02:38dboyan: yeah, join flag. I only learned about it a few days ago, but most instructions with the join flag are 'join nop' though
02:38imirkin: in which case it's NOP.S
02:39imirkin: anyways, you can look at the SchedDataCalculator in emit_nvc0.cpp
02:39imirkin: for how it determines this stuff
02:39imirkin: please note that it's not perfect
02:39imirkin: also note that i didn't write it :)
02:41dboyan: It says, if (insn->op == OP_JOIN || insn->join) insn->sched=0x00
02:41imirkin: ok then.
02:42dboyan: I know, that's why there's a gsoc idea on instruction scheduling ;-)
02:42imirkin: that's for scheduling instructions
02:42imirkin: not for computing sched codes
02:42imirkin: i.e. if you have instructions a;b;c should you instead emit them as c;b;a
02:42imirkin: or whatever other order
02:43dboyan: yeah, but even the scheduling info seems not perfect now
02:45dboyan: actually I altered that to 0x2f as a hack, I saw no problem at least with Portal 2, and it seems to run just a little bit faster when playing dem files
02:45imirkin: well, it'd need some investigation as to what's better there
02:45dboyan: I agree
02:47imirkin: however there should be significant gains possible from even a mild amount of instruction scheduling work
02:50dboyan: I know there has been work done on maxwell. Maybe I'll study that a little bit these days when I have time, and read some books about compilers.
02:52imirkin: again, that has nothing to do with scheduling instructions
02:52imirkin: it has to do with providing delay info to the execution units
02:52imirkin: in compilers, there's the idea that instruction order matters...
02:52imirkin: here's a simple example
02:52imirkin: a = b + c + d + e
02:53imirkin: now, you could do this as
02:53imirkin: t1 = b + c; t2 = d + e; a = t1 + t2;
02:53imirkin: er hm. no. this is a bad example.
02:53imirkin: (rather, it's an example of something else)
02:53dboyan: I can get what you're saying
02:54imirkin: depending on how you order things, you can end up with more or less values live at a time
02:54imirkin: which means more or fewer registers required
02:54imirkin: and also you can use up what are effectively delay slots by performing useful work that you have to do anyways
02:55dboyan: yeah, about instruction delays and dependencies
02:55dboyan: Do you know what reference materials are available either on web or in books about this topic? I think I'd better learn with things like that
02:56imirkin: well, you could look at the scheduler that i965 uses for some inspiration
02:56imirkin: unfortunately it's all an extremely difficult topic
02:56imirkin: because it is so intertwined with register usage
02:56imirkin: and unlike CPUs, where you have N registers, and it doesn't matter how many you use
02:57imirkin: the number of registers you use on a GPU affects the parallelism of the program
02:58dboyan: I see, thanks a lot for the detailed information
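The point about ordering and live values can be sketched in a few lines of Python. This is a toy model and an assumption for illustration only, not nouveau's IR or its register allocator: each instruction is a hypothetical (dest, sources) pair, and a value counts as live from its definition until its last use.

```python
# Toy model (illustrative assumption, not nouveau's IR): an instruction is
# (dest, [sources]); a value is live from its definition to its last use.

def max_live(instrs, live_out=()):
    """Return the peak number of simultaneously live values."""
    # Values needed after the sequence stay live past the last instruction.
    last_use = {v: len(instrs) for v in live_out}
    for i, (_, srcs) in enumerate(instrs):
        for s in srcs:
            last_use[s] = max(last_use.get(s, -1), i)

    live, peak = set(), 0
    for i, (dest, srcs) in enumerate(instrs):
        # Sources whose last use is this instruction die here.
        live -= {s for s in srcs if last_use[s] == i}
        # The result becomes live only if something reads it later.
        if last_use.get(dest, -1) > i:
            live.add(dest)
        peak = max(peak, len(live))
    return peak

# r = (a + b) + (c + d), with a..d produced by loads.
loads_first = [
    ("a", []), ("b", []), ("c", []), ("d", []),
    ("t1", ["a", "b"]), ("t2", ["c", "d"]), ("r", ["t1", "t2"]),
]
interleaved = [
    ("a", []), ("b", []), ("t1", ["a", "b"]),
    ("c", []), ("d", []), ("t2", ["c", "d"]), ("r", ["t1", "t2"]),
]

print(max_live(loads_first, live_out=["r"]))   # 4
print(max_live(interleaved, live_out=["r"]))   # 3
```

In this model the loads-first order needs one more register, but it also leaves more independent work between each load and its first use, which is the delay-slot trade-off mentioned above.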
08:28mupuf: dboyan: glad to hear you are willing to participate
14:33mmaret: Hi !
14:33mmaret: I already came here with the same issue but I've some progress I can share
14:34mmaret: I'm using a K520 (GK104) on an Amazon AWS GRID instance.
14:34mmaret: inserting the nouveau module produces, on some instances, a huge number of IRQs
14:34mmaret: like 100000/sec
14:35mmaret: I'm trying to understand where all those IRQs come from
14:36mmaret: using ftrace, I can get a trace showing that nvkm_therm_intr is called very often
14:37mmaret: (1500/sec, so far less than 100000/sec. But maybe that's normal because ftrace is active. Heisenberg spotted...)
14:37mmaret: no other nvkm-related interrupts are visible in my trace
14:38mmaret: And I don't know what to do with this information
14:39mmaret: my kernel knowledge is limited, and it's even worse with my nvidia knowledge ...
14:39mmaret: So any suggestion will be appreciated
14:52mmaret: In case it was not clear enough, this card is available to me in a Xen guest. I do not have access to the Xen host
15:00RSpliet: mmaret: interesting... I think mupuf might be the best to help
15:00RSpliet: but can you just double check (using sensors) what the temperature and trip points of your GPU are?
15:01RSpliet: and make sure you're using a very very modern kernel
15:05mmaret: currently i'm testing a 4.4. But i've the same issue with a 4.10-rc5
15:07RSpliet: mmaret: If you have the opportunity of getting sensor information using the 4.10 rc that'd be very helpful. I've lost track (like presumably most people here) of all the changes already made that could impact behaviour since 4.4
15:07mmaret: ok I'll do that
15:07RSpliet: thanks a lot!
15:11mmaret: The hardest part is to get an instance from amazon at this time of the day :)
15:12mupuf: well, I really need to stop using ptimer.alarm
15:12mupuf: it is too difficult to use properly and to multiplex
15:12mupuf: will use the kernel timers ... but that requires writing an abstraction and a re-implementation for nouveau to work in the userspace
15:13RSpliet: mupuf: I was hoping it'd simply be a weird corner case caused by a more fancy fan or a missing/misinterpreted trip point from the VBIOS ;-)
15:15RSpliet: (or an odd Amazon change to the PCB that forbids nouveau from touching the fan config at all...)
15:16mupuf: RSpliet: I doubt it :s
15:16mupuf: 1500 irq/s :o
15:17mmaret: only 1500/sec with the trace activated. otherwise /proc/interrupts shows about 100000/sec for nvkm
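A per-second figure like the one above can be derived from two snapshots of /proc/interrupts. The following is a minimal sketch, assuming the usual layout (a CPU header row, then one row per IRQ with per-CPU counters followed by the controller and handler names); the sample rows and the helper names irq_count and irq_rate are hypothetical.

```python
def irq_count(snapshot, name):
    """Sum the per-CPU counters of every row that lists handler `name`."""
    total = 0
    for line in snapshot.splitlines()[1:]:  # skip the CPU header row
        fields = line.split()
        if len(fields) < 2 or name not in fields[1:]:
            continue
        for field in fields[1:]:
            if not field.isdigit():
                break  # counters end where the controller name starts
            total += int(field)
    return total


def irq_rate(before, after, seconds, name):
    """Average interrupts per second between two snapshots."""
    return (irq_count(after, name) - irq_count(before, name)) / seconds


# Hypothetical snapshots taken 2 seconds apart (not real data).
SAMPLE_BEFORE = """\
           CPU0       CPU1
 24:       1000        500   PCI-MSI 49152-edge      nvkm
 25:         10          0   PCI-MSI 512000-edge      eth0
"""
SAMPLE_AFTER = """\
           CPU0       CPU1
 24:     101000     100500   PCI-MSI 49152-edge      nvkm
 25:         12          0   PCI-MSI 512000-edge      eth0
"""

print(irq_rate(SAMPLE_BEFORE, SAMPLE_AFTER, 2.0, "nvkm"))  # 100000.0
```

On a live system the two snapshots would come from reading /proc/interrupts twice with a sleep in between. Unlike ftrace, this adds essentially no overhead to the interrupt path itself.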
15:20RSpliet: mupuf: well, if nouveau doesn't do anything about the actual reason of the IRQ, it'll keep screaming for more attention
15:20mmaret: I finally got an instance with a 4.10-rc10
15:20mmaret: sensors give a temp° of 29°
15:22RSpliet: hah, that looks plausible and valid
15:22mmaret: I'm not very confident that the thermal part could generate so many IRQs, but that's the only one I can get in a trace
15:23mmaret: any idea about how I can get more information
15:23RSpliet: mmaret: perhaps you could take a copy of the VBIOS ( a binary that you can copy from /sys/kernel/debug/dri/<number>/vbios.rom ) and send it to mmio dot dumps at gmail etc etc
15:23RSpliet: mupuf: anything else that might be useful?
15:24mmaret: I forgot to mention that inserting nouveau with config="NvMSI=0" reduces the number of IRQs to about 1000/sec (instead of 100000/sec)
15:25RSpliet: ah okay. Not sure if that's because of bus inefficiency or a clue
15:25RSpliet: mupuf: think there's some debugging info that he can extract from the kernel to find out what the IRQs are trying to say? could you propose a good nouveau.debug="" param to continue investigation ;-)
15:26RSpliet: (mmaret: bear with us btw, we're both entangled in full-time jobs not related to nouveau.)
15:27mupuf: nouveau.debug=ptherm could help
15:27mupuf: but honestly, can't remember now
15:27mupuf: and I am still at work
15:27mupuf: so this will have to wait
15:28mmaret: I completely understand. Thanks again for your help and pointers
15:28mupuf: mmaret: thanks for the initial debug!
15:29mupuf: this is quite valuable!
15:29RSpliet: mmaret: try and obtain a kernel log when booting with nouveau.debug=ptherm - get the dmesg and send it along with your vbios to the mmio dumps e-mail address (or attach it to a bug report on freedesktop). That way mupuf can look at it in his own time :-)
15:30mmaret: btw, about the vbios, I noticed that some machines don't have the issue but print the same vbios version in dmesg as the ones with issues
15:30RSpliet: oh sorry
15:30mupuf: yes, I was about to say that making a bug report would be good
15:30RSpliet: nouveau.debug="PTHERM=debug" would be the right way to go
15:30mmaret: it's a bit old now and missing some new information that I will complete !
15:31mupuf: amazing... I have been ignoring this bug for a long time it would seem
15:33mmaret: RSpliet, nouveau.debug="PTHERM=debug" should be given on nouveau modprobe or kernel cmdline ?
15:34RSpliet: mmaret: which ever way nouveau is loaded
15:34mmaret: ok thx
15:35RSpliet: [ 53.971504] nouveau 0000:00:03.0: therm: FAN target request: 0% [ 53.971508] nouveau 0000:00:03.0: therm: FAN update: 0
15:35RSpliet: that already provides quite a clue :-P
15:38mmaret: Hopefully I don't burn a card with every test :)
15:40RSpliet: with 29 degrees, I'm sure you'll be fine ;-)
15:41RSpliet: What's your motivation for getting nouveau to run on Amazon in the first place btw? Did you already get the warning that we can't do OpenCL yet?
15:41mmaret: The cards are doing nothing. Even X is not started. Otherwise aws might have already knocked on my door :)
16:29karolherbst: mupuf: I tried to suggest to ben not to use ptimer anymore, but his response was "nvidia does it and if the ptimer/gpu is screwed we are screwed anyway"
16:42imirkin: mmaret: iirc delroth had a similar issue when he was playing around with it for dolphin's CI... did i already mention this to you? i don't remember whether he had any workaround...
16:48mupuf: karolherbst: maybe, maybe
16:48mmaret: imirkin, Yes, you mentioned it! But he did not respond
16:48mupuf: but nvidia is not multiplexing it
16:48mmaret: I'll retry
16:49karolherbst: mupuf: I don't really see the point in using the GPU timers on the host, if we could simply use the host timers. Or are those too unreliable?
16:55mmaret: imirkin, good memory btw ;)
16:56imirkin: mmaret: not a lot of crazies running around trying to get nouveau working on aws :p
16:59mmaret: oh, another one. AWS also has instances with 4 GPUs on them. Running nouveau on them produces a kernel crash.
16:59mmaret: but I've not done any research on those
17:00imirkin: this might come as some surprise, but nouveau doesn't get a lot of workout on such systems
17:02mmaret: you don't have such an expensive system in your basement just for testing ? :)
17:04imirkin: well, i do regularly run nouveau on 3-gpu systems
17:04imirkin: my current configuration: https://hastebin.com/ohuciqonis.go
17:05mmaret: are you playing some game on it ?
17:07mmaret: like, if I give you a Kerbal Space Program license, chances are that a few bugs will be solved in mesa ? Or that you will be absorbed by the game and lose any interest in nouveau ?
17:08imirkin: i actually very seldom play games, and even more rarely anything more graphically intensive than minesweeper
17:09imirkin: these GPUs are there for ease of testing
17:09imirkin: that's why they're from different families :)
17:09mmaret: less fun but more interesting :)
17:10mmaret: I have to collect my kid. Thanks for the help and the chat
17:17AndrewR: karolherbst, hello. I tried to apply your patches from https://lists.freedesktop.org/archives/nouveau/2016-November/026622.html to my 4.10-rc7+ build (because now I have new motherboard w/ pci-e 2.0 slots!) but X start locked-up :/ reverting both patches restored X ...As far as I understand those patches will be part of drm-next/4.11 nouveau ..so, I think I like to debug this early....
17:22AndrewR: karolherbst, http://pastebin.com/ri7KmNzs - lspci -vvv . It looks like card already running at pci-e 2.0 speeds? May be UEFI/BIOS set it up this way?
17:27imirkin: AndrewR: is that without the patches in question?
17:28AndrewR: imirkin, yes, without
17:30imirkin: AndrewR: surprising. G92 was still made at a time when PCIe 2.0 was new and flaky...
17:31AndrewR: imirkin, I wonder if there is a way to downspeed it temporarily (sorry, not looked at earlier code for enabling pcie link speed changes in general)?
17:32imirkin: wait, i thought karol's code should only be run on pstate changes...
17:32imirkin: maybe not?
17:34AndrewR: imirkin, I also added my 'patch' for enabling pstate changes, but they themselves seem to work..and as far as I see there is still no auto reclocking, so I definitely was not changing pstate during those tests with pcie patches
17:35AndrewR: imirkin, https://paste.fedoraproject.org/551183/ - currently running with this
17:37imirkin: yeah, that won't affect any boot-time stuff
17:37AndrewR: imirkin, strangely enough, unigine series of benchmarks tend to hang now on highest settings even w/o any pstate changes ..may be PSU not powerful enough for new hw...or this pcie 2.0 by default thing really not that stable ...
17:38imirkin: or nouveau is doing something wrong - always my favorite
18:00karolherbst: AndrewR: I would need the output of dmesg
18:02AndrewR: karolherbst, with patches applied, I assume ..moment, I was rebuilding kern for another debugging (sometimes something tends to hang on resume from STR ...sometimes it survives 3-4 suspend/resume cycles, sometimes it hangs at first..)
18:02karolherbst: AndrewR: could you then also boot with nouveau.debug=pci=debug
18:03AndrewR: sure ...
18:04karolherbst: oddly enough, your GPU seems to be on 5.0 GT/s pcie already, weird
18:06AndrewR: karolherbst, I only have this mb for ~day, not played with all those knobs in UEFI setup ....
18:07karolherbst: imirkin: I know that some of those cards report wrong speeds through their mmio regs, this could confuse nouveau maybe? Sadly all g92 I tested this with didn't show any issues
18:07karolherbst: and getting an mmiotrace today is a bit tricky sadly
18:08AndrewR: karolherbst, may be if I change something there it will come up at 1.0 speeds ... But then I'm not sure if hang is sane idea depending on "BIOS" settings ... (sorry, I know this is not your fault..just ..a lot of components to try and disable/change..for example this new iommu thing ...)
18:09karolherbst: ohh, it will be most likely my fault, cause I failed to detect something, maybe
18:09karolherbst: who knows
18:09karolherbst: the dmesg output would help
18:11AndrewR: karolherbst, reboot
18:13karolherbst: ohhh crap
18:13karolherbst: imirkin: g84: .msi_rearm = nv46_pci_msi_rearm, g92: .msi_rearm = nv40_pci_msi_rearm,
18:14imirkin: iirc there's a reason for that
18:15karolherbst: nv40 does nvkm_pci_wr08(pci, 0x0068, 0xff);
18:15karolherbst: nv46 does pci_write_config_byte(pdev, 0x68, 0xff);
18:16karolherbst: buggy implementation I guess?
18:16imirkin: something like that
18:16karolherbst: maybe we could get pcie v2 to work on g8x this way as well :D but honestly... it's just broken there, like completely
18:17imirkin: g8x doesn't support pcie v2 properly
18:17karolherbst: I know
18:17karolherbst: the mmio interface is there, but well
18:24karolherbst: AndrewR: we found the issue
18:24karolherbst: AndrewR: drm/nouveau/nvkm/subdev/pci/g92.c replace ".msi_rearm = nv40_pci_msi_rearm," with ".msi_rearm = nv46_pci_msi_rearm,"
18:24karolherbst: then it should work
18:25AndrewR: karolherbst, thanks!
18:25AndrewR: but I botched my kern build :/ (something with newly enabled lock debugs)
18:26karolherbst: need to write the patch by this weekend
18:26karolherbst: so you really should test it today if that fixes it
18:26AndrewR: karolherbst, https://paste.fedoraproject.org/551204/ - dmesg from 4.10.rc3
18:26karolherbst: airlied: there will be a bugfix for nouveau most likely for 4.11
18:27imirkin: karolherbst: why should that help?
18:27karolherbst: I have no clue, but it's there for a reason
18:27karolherbst: those things are odd
18:28karolherbst: anyway, there will be a fix one way or the other
18:29AndrewR: karolherbst, in one hour I hopefully will have new kernel with this change and bootable
18:29karolherbst: imirkin: anyway, g94+ always used nv40_pci_msi_rearm
18:31karolherbst: AndrewR: okay, I will drive home in the meanwhile
19:34AndrewR: imirkin, patch from Karol seems to be working (at least X start now)
19:36AndrewR: https://paste.fedoraproject.org/551240/ - current dmesg
19:43AndrewR: imirkin, and on pstate change ... nouveau 0000:01:00.0: pci: set link to 2.5GT/s x0 :)
20:10karolherbst: ohh wait, we still have time for 4.11 ... silly me
20:11karolherbst: AndrewR: are you able to increase the pstate and so also the pcie link speed?
20:39gregory38: does anyone have a hint of what to look at in gallium hud to find a GPU bottleneck
20:40imirkin: there are a ton of counters exposed that you can find via GALLIUM_HUD=help
20:40imirkin: but sorry, no, nothing particularly specific =/
20:40gregory38: yes but I don't know which ones to look at
20:40imirkin: sad to say, i don't have a ton of experience with that
21:27NanoSector: hmm, just checked if my gpu would turn off with pcie_port_pm=off, but it still throws "DRM: failed to idle channel 0 [DRM]"
21:29NanoSector: uh, derp, this is kernel 4.9
21:29NanoSector:updates to 4.10rc7
21:32imirkin: there's some weird acpi_osi=whatever thing you have to use now
21:32NanoSector: oh this is after suspend btw, it works on a 'fresh' boot
21:34NanoSector: imirkin, should I try setting that to Windows?
21:35imirkin: there's a specific set of strings
21:35imirkin: i think acpi_osi=! acpi_osi='Windows 2012' or something along those lines.
21:35imirkin: there's a bug about it
21:40nyef`: Bug: ACPI tables are still being stupid about supported OSI for certain features?
21:40NanoSector: hmm I can find https://bugs.freedesktop.org/show_bug.cgi?id=94725, is this what you're referring to?
21:40NanoSector: someone recommends adding acpi_osi="!Windows 2013"
21:42imirkin: NanoSector: see https://bugzilla.kernel.org/show_bug.cgi?id=156341#c22
21:42imirkin: depends on the specific acpi thing
21:43NanoSector: imirkin, ah, I see, thanks
21:44NanoSector: 2013 doesn't work. this is a laptop that shipped with windows 10 so I suppose I should try 2015?
21:46imirkin: i think that doing acpi_osi=! acpi_osi=foo is different than !foo
21:46imirkin: do what the comments say :p
21:46imirkin: esp the ones from Peter Wu
21:50NanoSector: hmm i see
21:50NanoSector: so acpi_osi is just telling acpi what OS version it is
21:51imirkin: i have no idea how it works
21:51NanoSector: that makes two
21:51imirkin: which is why you should just follow directions
21:52NanoSector: i' m working on it :P
21:54NanoSector: ok setting acpi_osi=! acpi_osi="Windows 2009" doesn't fix the nouveau issue but causes my touchpad, backlight, and keyboard backlight to stop working upon resume :x
21:55imirkin: so... almost perfect ;)
21:55imirkin: try 2016 instead of 2009
21:55imirkin: and/or other values
21:56NanoSector: i'll try everything between 2009 and 2016
21:57NanoSector: 2016 breaks backlight but not touchpad
21:58NanoSector: doesn't fix nouveau
22:04NanoSector: nope, none between 2009 and 2016 work
22:04NanoSector: and only 2015 makes my backlight and all that work