01:32AndrewR: imirkin, for example PhoneWire from http://www.humus.name/index.php?page=3D&start=0 runs into a wine bug: https://bugs.winehq.org/show_bug.cgi?id=42953 and the demos from http://developer.download.nvidia.com/SDK/10/direct3d/samples.html mostly crash... only StencilRoutedKBuffer, Denoising, and Clipmaps show any window at all.
03:27imirkin: AndrewR: hmmm ... DX10 should not require independent blend
03:27imirkin: that may be a DX10.1 feature
03:29AndrewR: imirkin, maybe the demo just used a dx11->dx 10.x setup...
03:30imirkin: yeah no clue
03:30imirkin: feels like a wine-internal fail
03:33AndrewR: imirkin, https://pastebin.com/BZxArnv4 - yeah .... so, maybe after one or two more years even this will work :}
03:34imirkin: this looks like something quite silly going on
03:35imirkin: e.g. a "different" mask is set for rt4-7, but also none is bound, which means it doesn't matter
03:35imirkin: or ... something like that
03:35imirkin: nobody notices, since i doubt these GPUs are too popular
03:38AndrewR: imirkin, anyway, it's all just for fun for me ... still, for some reason my machine can hardly stay up for more than 3-5 days now (with suspend-to-ram). Does your multi-GPU monster PC suspend/resume at all?
03:38imirkin: dunno, never tried
03:38imirkin: i imagine not.
03:39AndrewR: imirkin, :}
10:24alkisg: Hi guys, this card started segfaulting in Ubuntu 18.04, while it was working fine in 16.04: 01:00.0 VGA compatible controller : NVIDIA Corporation NV5 [Riva TNT2 Model 64 / Model 64 Pro] [10de:002d] (rev 15)
10:24alkisg: The Xorg.log is at http://termbin.com/haem
10:24alkisg: Do you know of any such recent issues/patches that I could test, or should I file a bug report about it?
10:25alkisg: It works at the login screen, and crashes right after login where I assume compositing etc are used. Btw, it's still segfaulting with NoAccel, while modesetting works fine.
10:25alkisg: So far 3 schools reported the issue, and I proposed to them to use modesetting as a temporary workaround...
10:31Sarayan: Hi all. A number of questions. First, is there a git for the reference linux kernel (thinking drm here) for nouveau?
10:31Sarayan: freedesktop wiki points to one not updated since 2015, so urgh
10:32Sarayan: https://cgit.freedesktop.org/nouveau/xf86-video-nouveau/ has its last commit in march, so are things slow or are they happening somewhere else?
10:33Sarayan: (otoh the xorg driver may not need much)
10:40pendingchaos: I believe things are slow because it does not need much
10:40pendingchaos: I think the code is purposely simple and direct
10:41Sarayan: it's really hard to be sure where work happens
10:42Sarayan: nouveau does not support my new laptop like I'd like, but it's non trivial to see where I should help
10:42pendingchaos: if you're talking about https://github.com/skeggsb/linux/, there are some newer commits in the other branches
10:44Sarayan: oh, I see, master is irrelevant then
10:44Sarayan: heh, why not
10:44Sarayan: that ensures basing on linus' releases after all
10:55Sarayan: so, bumblebee seems dead, great
10:56gnarface: what about dri_prime?
10:56gnarface: something like that...
10:56Sarayan: good question
10:56Sarayan: It's hard to find up-to-date docs
10:57Sarayan: seems my stuff is "modern" optimus, where the nvidia gpu does its things and transfers in the intel memory for display
11:19RSpliet: Sarayan: that "modern" optimus just works out of the box on a new Linux installation
11:19RSpliet: Gnome Shell even has a "launch on GPU" menu-item in the right-click menu of an application. Bumblebee is dead because it's obsolete
11:20Sarayan: RSpliet: It should be that good? I'm getting an oops on nvkm_pmu_reset timeouting
11:20RSpliet: It should be. That doesn't mean there's no bugs in nouveau ;-)
11:20Sarayan: GP108M/GeForce MX150
11:20Sarayan: of course :-) But it's hard to know how it should be
11:21RSpliet: Didn't NVIDIA *just* release new firmware files for that GPU?
11:21Sarayan: I have no idea, where should I see that?
11:21Sarayan: While I'm not new to linux or reverse engineering for that matter, I'm new to nouveau itself
11:21RSpliet: Hmm... no that's for other Pascal GPUs
11:21RSpliet: My bad :-D
11:22RSpliet: Are you on a 4.18 kernel?
11:23karolherbst: RSpliet: Nvidia didn't
11:24karolherbst: ohh, you posted that link
11:24RSpliet: Eh oh... just saw the author of the patch. Care to elaborate?
11:24karolherbst: we use the gp108 acr firmware on other pascal GPUs as it fixes bugs
11:24RSpliet: And Gourav Samaiya signed it off :-P
11:25karolherbst: like performing secboot
11:25karolherbst: yeah well
11:25karolherbst: that's nvidia's "yeah... it is correct to do that"
11:25karolherbst: anyway, we had random secboot issues on other pascal gpus
11:26karolherbst: like I couldn't perform secboot without some local hacks
11:26karolherbst: using the gp108 stuff fixed that
11:26Sarayan: are gp108 and gp108m close, or is it just marketing fluff on very different hardware?
11:26RSpliet: It's marketing fluff on the same hardware pretty much :-P
11:27karolherbst: mobile GPUs tend to have less SMX' enabled
11:27karolherbst: or lower clocks enabled
11:27karolherbst: but physically those are pretty much the same chips
11:27Sarayan: yeah, you don't want to melt the laptop case
11:27karolherbst: RSpliet: did anybody actually look into whether it's possible to enable more SMs?
11:27karolherbst: like generally?
11:28RSpliet: karolherbst: Not as far as I'm aware. I don't think it's nouveau's aim to frustrate NVIDIAs business model...
11:28RSpliet: ... well :-P
11:29karolherbst: I really don't care much about that :p
11:29karolherbst: you still have the heat problem
11:30karolherbst: the difference isn't all that much anyhow
11:30RSpliet: And the "we now have 15 SMs running at the lowest clock speed with a limp DRAM bus"
11:30RSpliet: instead of 12 :-P
11:30karolherbst: +25% perf is what I see
11:30karolherbst: maybe >D
11:30RSpliet: On Kepler yeah, on Pascal... ;-)
11:31karolherbst: but this is more significant on low end GPUs
11:31karolherbst: like gk208 ones
11:31karolherbst: where you either have 1 or 2 SMX'
11:31RSpliet:stares at his GK107s
11:31Sarayan: that signed firmware issue, it's not something that would have fazed anyone ten years ago. Is it a matter of purity (don't rip off firmwares from windows drivers, open source or bust, etc) or are there real technical reasons?
11:32karolherbst: sometimes a GPU chip has "broken" parts
11:32karolherbst: and you can disable them
11:32karolherbst: but it also allows you to sell the same hw for more to customers willing to pay more
11:32karolherbst: kind of a mix of both
11:32karolherbst: ohh wait
11:32RSpliet: karolherbst: usually that's done using one-time fuses, not firmwares.
11:32karolherbst: signed firmware issue
11:33RSpliet: Oh ah, i thought you were talking about signed firmware :-)
11:33karolherbst: Sarayan: well.. some people say there are real technical reasons behind it
11:33RSpliet: Sarayan: There's an official reason for signed firmware... the kind of stuff that comes out of the behind of a male cow
11:33karolherbst: nobody was able to convince me it is the case
11:33Sarayan: No, I mean a reason for nouveau not to use the windows ones
11:34karolherbst: doesn't matter on nvidia
11:34Sarayan: which must do everything linux needs
11:34karolherbst: the linux and windows drivers are basically equal
11:34karolherbst: except there is no dx support on linux
11:34RSpliet: Sarayan: We could technically even strip them from the Linux driver. Just... nobody's identified them and extracted them. And we wouldn't be able to redistribute them, which means every nouveau user would have to run a firmware-cutter tool on a 100MiB blob to get their Linux driver working
11:34Sarayan: true, I meant the ones that come with the official closed-source nvidia drivers
11:35karolherbst: same thing
11:35Sarayan: Not being able to redistribute them is an interesting nontrivial legal issue, not a technical one (even if legal is important :-)
11:36karolherbst: RSpliet: there might be a loophole though, because nvidia allows redistribution in other form for distributions. I don't know if this allows to repackage firmwares :D
11:36RSpliet: karolherbst: tell your boss you need to clip and redistribute firmwares...
11:37karolherbst: I already know his answer, but it only makes sense, because I know stuff you don't :p
11:37RSpliet: Sarayan: there's no technical issue. Anything the official driver can do, we can mimic. It's just a matter of effort.
11:38karolherbst: ohh interesting news btw: the switch homebrew devs released their toolkit with mesa/nouveau enabled for hw accelerated OpenGL 4.3 support
11:38karolherbst: RSpliet: using mesa/nouveau against the prop nvidia kernel on the switch os: https://twitter.com/fincsdev/status/1036650654566109185
11:39RSpliet: Not surprising it works, but pretty cool they poured in the effort!
11:39Sarayan: ok, any recommendations to track a nvkm_pmu_reset failure or I should just poke at it?
11:39Sarayan: (poke meaning recompile a kernel, add printks to see where it breaks, etc)
11:40karolherbst: RSpliet: they had big issues within their egl backend code, which limited them to 1 fps or something.. super trivial fixes, but it took time to find the mistakes... nearly killed the effort as they just didn't find the cause for quite a long time
11:40Sarayan: at least I can modprobe/rmmod nouveau and trigger the failure while everything else continues working
11:40karolherbst: Sarayan: ohh, wait, mhh maybe
11:40karolherbst: Sarayan: do you have a log somewhere?
11:41karolherbst: I think I might know what the issue might be
11:41Sarayan: karol: dmesg? something else?
11:41Sarayan: I can reproduce at will
11:41Sarayan: I currently have a vanilla arch
11:43Sarayan: did a rmmod and a modprobe in that log
11:43karolherbst: what kernel?
11:44Sarayan: Haven't tried building a custom kernel in ages, will have to I guess though
11:45karolherbst: I don't think we have much code added up on top of the 4.18 stuff
11:45RSpliet: Sarayan: if it's a 3rd party module, you can build nouveau out-of-tree... Not that compiling a kernel is that much more difficult, but might save you some time
11:46RSpliet: "ahem" 3rd party module -> loadable module
11:46RSpliet: (in contrast to a built-in module)
11:46Sarayan: yeah, it's a loadable module
11:46Sarayan: I probably can even get the exact compiled sources through the arch build system
11:47karolherbst: Sarayan: does it help to modprobe with config=NvForcePost=1?
11:50Sarayan: still timeouting in dmesg, want me to try restarting X?
11:52pendingchaos: karolherbst: it seems yuzu (one of the switch emulators) have come to the conclusion that there are 32 1-bit control code registers, not 1 1-bit control code register
11:52pendingchaos: (on Maxwell/Pascal)
11:53Sarayan: og.kervella.org/log2.txt fwiw
11:53karolherbst: pendingchaos: "control code"?
11:54pendingchaos: flags register(s)?
11:54pendingchaos: I think that's a more correct name
11:54RSpliet: We call them predicate registers?
11:54karolherbst: I see
11:54pendingchaos: not predicate registers
11:54karolherbst: I mean, we kind of know that there is more than one bit
11:54karolherbst: but we just dont use it
11:56pendingchaos: from what I've heard, it was believed that it was one 1-bit register on Fermi and later?
11:56karolherbst: might be, I just know for sure that with volta having multiple carry bits is a thing
11:56RSpliet: pendingchaos: I've been looking at kepler assembly lately, and I've definitely seen $p0 and $p1 in demmt code... so I don't think that's quite right :-)
11:56karolherbst: RSpliet: flags, not predicates
11:56Sarayan: restarted X, xrandr only sees one provider
11:57pendingchaos: predicates are a bunch of boolean registers used for conditions
11:57pendingchaos: flags are used for storing the carry from addition and such
11:57RSpliet: Ah ok, didn't really get there was a distinction on NVIDIA HW
11:58karolherbst: of course there is :p
11:58karolherbst: it is a special reg called $flags
11:58karolherbst: you can even read it out
11:58karolherbst: which you have to do for like trap handlers and so on
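The predicate/flags distinction described above can be illustrated with a small sketch. This is only a toy model in Python, not NVIDIA's actual ISA: the function names are hypothetical, and it simply shows a carry bit produced by an addition (the kind of thing a flags register holds) next to a boolean produced by a comparison (the kind of thing a predicate register holds).

```python
MASK32 = 0xFFFFFFFF  # width of one register in this toy model


def add32_with_carry(a, b, carry_in=0):
    """32-bit add returning (result, carry_out); the carry is what a flag stores."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def add64_via_32bit_halves(a, b):
    """64-bit add built from two 32-bit adds chained through the carry flag."""
    lo, carry = add32_with_carry(a & MASK32, b & MASK32)
    hi, _ = add32_with_carry(a >> 32, b >> 32, carry)
    return (hi << 32) | lo


def set_predicate_less_than(a, b):
    """A predicate is just a boolean condition later used to guard instructions."""
    return a < b


if __name__ == "__main__":
    print(add32_with_carry(0xFFFFFFFF, 1))
    print(hex(add64_via_32bit_halves(0xFFFFFFFF, 1)))
    print(set_predicate_less_than(3, 5))
```

In this model the carry only exists as a side effect of arithmetic, while the predicate is computed explicitly, which is roughly why the two are treated as separate kinds of state.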
11:59RSpliet: Sarayan: yeah that has a lot to do with the timeout in your kernel logs. Try actually rebooting with the nouveau.config="NvForcePost=1" kernel param... Just to make sure it's not stuck in an invalid state from earlier attempts...
11:59karolherbst: the problem is, that usually on laptops we always post the GPU
12:00karolherbst: maybe it is indeed stuck in some weirdo state
12:00karolherbst: which I doubt
12:00RSpliet: So do I, but it's worth verifying
12:00karolherbst: I am sure there is some other stupid issue again
12:00karolherbst: maybe even within the firmware
12:00karolherbst: the problem is, that PMU stuff we run is from the vbios
12:03karolherbst: mhh, okay so we have a few errors that we can't really recover from
12:04karolherbst: mhh, weird
12:05karolherbst: and then it leads to CTXSW_TIMEOUTs as we aren't able to context switch those shaders
12:05karolherbst: because they are already dead?
12:05karolherbst: or ... something?
12:11karolherbst: imirkin: we should bound check c mem access, shouldn't we?
12:11karolherbst: getting things like INVALID_CONST_ADDR_LDC in some piglit tests
12:14karolherbst: anyway, seems like I am hitting some cases that we don't really recover from that well
13:06Sarayan: sorry, had to do a thing, back
13:08Sarayan: added the option in modprobe.conf, rebooting
13:13Sarayan: No noticeable difference, see og.kervella.org/log3.txt
13:37karolherbst: imirkin: do you think bound checking indirect c access may add a significant perf penalty? I doubt many applications are actually doing it that much, but.... seems like shaders could actually do this and it causes hangs and engine resets
13:37karolherbst: Sarayan: "nouveau: unknown parameter 'NvForcePost' ignored"
13:37karolherbst: it isn't NvForcePost
13:38karolherbst: it is config=NvForcePost
13:40imirkin: karolherbst: don't worry about bound-checking c
13:40imirkin: those errors aren't "bad" errors
13:40karolherbst: well, nouveau seems to reset the engines though
13:40imirkin: that's bad
13:40imirkin: it shouldn't
13:40karolherbst: it doesn't do it because of that
13:40karolherbst: but because of "fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]"
13:42karolherbst: seems like the fp is doing something and the pipeline aborts/hangs? dunno
13:42karolherbst: doesn't seem to happen for the vp tests
13:42karolherbst: no indirect access inside the vp ...
13:43Sarayan: karol: subtle, lemme fix that
13:44Sarayan: ohhh, the timeout seems gone
13:44Sarayan: lemme reboot to see if xrandr is happier
13:45karolherbst: nice, that test just crashes with nir... it does some hard coded -0x20 in the array access, oh well
13:45karolherbst: anyway, we get that ctxsw_timeout
13:46karolherbst: annoying part is, it takes a few seconds and sometimes doesn't happen at all, so X just hangs
13:46karolherbst: well in the former case it hangs for a few seconds (~5), so this is acceptable
13:47karolherbst: imirkin: any idea if that error could mess up something inside the pipeline? Or is it probably caused by a faulty handling of the trap? or do you have no idea?
13:47Sarayan: Damn, no change, og.kervella.org/log4.txt
13:47Sarayan: (but without the unknown parameter)
13:48Sarayan: options nouveau config='NvForcePost=1'
13:48karolherbst: did you rebuild initramfs?
13:48karolherbst: also, check "cat /sys/module/nouveau/parameters/config"
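A quick way to run the sysfs check suggested above and see what the module actually received is sketched below. It assumes the `config` value is a comma-separated list of `key=value` items and that an unset parameter reads back as `(null)`; both are assumptions, and the helper names are hypothetical.

```python
from pathlib import Path

# Path mentioned in the conversation; only exists while nouveau is loaded.
CONFIG_PATH = Path("/sys/module/nouveau/parameters/config")


def parse_nouveau_config(text):
    """Split a config string such as 'NvForcePost=1' into a dict.

    Assumption: items are comma-separated key=value pairs.
    """
    text = text.strip()
    if text in ("", "(null)"):  # assumption: how an unset parameter reads back
        return {}
    options = {}
    for item in text.split(","):
        if not item:
            continue
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


def read_loaded_config(path=CONFIG_PATH):
    """Return the parsed config of the loaded module, or None if unreadable."""
    try:
        return parse_nouveau_config(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(read_loaded_config())
```

If stray quote characters from the modprobe configuration ended up in the value, they would show up in the parsed key or value here, which makes that kind of mistake easy to spot.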
13:49Sarayan: I don't like these quotes
13:49Sarayan: remove 'em?
13:49karolherbst: yeah, might help
13:49karolherbst: no idea what they change
13:51Sarayan: rebuilt the initramfs and rebooting
13:53karolherbst: mhh, at least robust_buffer_access_behavior is clear about out of bound array access: "In the sub-sections described above for array, vector, matrix and structure accesses, any out-of-bounds access produced undefined behavior."
13:54Sarayan: still no change, og.kervella.org/log5.txt
13:54Sarayan: galibert@titi:~ #6 >sudo cat /sys/module/nouveau/parameters/config
13:54karolherbst: "However, if robust buffer access is enabled via the OpenGL API, such accesses will be bound within the memory extent of the active program. It will not be possible to access memory from other programs, and accesses will not result in abnormal program termination."
13:54Sarayan: feels like that video card is sulking
13:54karolherbst: imirkin: seems like we actually have to?
13:54karolherbst: I mean, do bound checks
13:54imirkin: hw does it
13:55karolherbst: imirkin: ahh, okay
13:55imirkin: so no need
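For reference, the behaviour the quoted spec text asks for can be sketched in software. This is only an illustration of the idea, since the hardware does the check itself as noted above; returning zero for out-of-range reads and dropping out-of-range writes is one possible behaviour chosen here as an assumption, and the function names are hypothetical.

```python
def robust_load(buffer, index, default=0):
    """Read buffer[index], returning a default instead of faulting when out of range."""
    if 0 <= index < len(buffer):
        return buffer[index]
    return default  # assumption: out-of-range reads yield zero


def robust_store(buffer, index, value):
    """Write buffer[index] if in range; silently drop the write otherwise."""
    if 0 <= index < len(buffer):
        buffer[index] = value
        return True
    return False


if __name__ == "__main__":
    consts = [10, 20, 30]
    print(robust_load(consts, 1))
    print(robust_load(consts, -0x20))  # e.g. a hard-coded negative offset
```

The point is that an out-of-bounds index never touches memory outside the buffer and never terminates the program, which matches the guarantee in the quoted text.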
13:55karolherbst: wondering about that ctxsw timeout
13:55karolherbst: I mean, we get the trap
13:55karolherbst: maybe we should just disable that one?
13:55karolherbst: and move on with life?
13:55imirkin: i'm sure it has nothing to do with the sched_error
13:55imirkin: despite being proximate in your logs
13:55karolherbst: I don't see where that sched error might come from elsewhere
13:56imirkin: but feel free to disable it.
13:56karolherbst: this is a super trivial shader_test file
13:56imirkin: or nouveau is being over-active re traps
13:56karolherbst: I think this is the case
13:57imirkin: i didn't have issues before...
13:57karolherbst: does the test hang on your gpus?
13:57karolherbst: maybe it is some maxwell+ thing?
13:57karolherbst: or only happens there
13:58imirkin: i dunno
13:58imirkin: i don't have time now to play with it
14:19karolherbst: oh wow, how annoying is that. I disabled the faults inside the kernel, but something enables them again
14:30karolherbst: ... I really start to get super annoyed by those signed firmwares
14:31karolherbst: or is there something we might do from within mesa?
14:31karolherbst: (I mean, the firmwares also set those values somewhere)
14:45imirkin: karolherbst: iirc we force-enable it at screen creation time
15:01karolherbst: imirkin: you wouldn't happen to know where we are doing it? that 0x3fffff write into 0x02d0 inside nvc0_magic_3d_init looks suspicious
15:01imirkin: something like that.
15:01imirkin: i don't remember
15:09karolherbst: ohh wait, I found it I think
15:09karolherbst: ahh yeah
15:09karolherbst: in nve4_compute
15:09karolherbst: SET_SHADER_EXCEPTIONS 0xffffffff
15:10karolherbst: nice, without it, the gpu doesn't hang
15:12karolherbst: okay, that kind of messes with the hardware
15:12karolherbst: even now having that bit enabled doesn't mess it up, just disabling setting it from inside mesa
15:12karolherbst: I guess we have to enable it for every channel?
15:14karolherbst: imirkin: what do you say, we change that sw method into something where the kernel decides what is safe to enable? Or do we want to not enable bits we know hang gpus or there is no point in enabling them?
15:16karolherbst: ohh right, the kernel just enables everything anyway
15:17karolherbst: through that sw method I mean
15:19karolherbst: nice, now I don't get those ctxsw_timeouts anymore
15:23karolherbst: imirkin: out of bound access to l gives us a OOR_ADDR :/ this is annoying
15:23karolherbst: or is OOR_ADDR only for l access in the first place?
15:24karolherbst: this happens for arrays which are lowered to l
19:27karolherbst: imirkin: ohh, it seems like those ctxsw_timeouts were actually caused by my trap_handler patches and enabling debugging on the GPU actually messed it up...
19:36karolherbst: sooo apparently I am able to invoke the trap handler from fragment shaders
19:38karolherbst: but there is one thing the GPU really doesn't like (or ctxsw)
19:42karolherbst: mhh maybe the address is wrong now of the trap handler...
19:42karolherbst: oh well
19:50Sarayan: bad vectors are always a real problem
22:07karolherbst: pendingchaos: mhhh, that tressfx just seems to work for me...
22:09karolherbst: but I am not sure if it is actually using tressfx.. hard to tell
22:13pendingchaos: it should say so in the settings menu?
22:14pendingchaos: and maybe the preferences file
22:14pendingchaos: (it might just be a number)
22:15karolherbst: yeah, I know. But I see no visual difference
22:16pendingchaos: that sounds wrong
22:16pendingchaos: is there a performance difference perhaps?
22:16karolherbst: mhh, didn't check
22:18pendingchaos: it looks like something you would be able to tell the difference with: https://www.blogcdn.com/www.engadget.com/media/2013/03/tressfx-3-1-13-03.jpg
22:21karolherbst: no idea what is going wrong
22:22pendingchaos: maybe it's something that's only used if the hair setting is set to TressFX and if some other quality setting is high enough
22:22pendingchaos: seems unlikely though
22:22karolherbst: I guess I try maxed out settings
22:22karolherbst: I am on a gp107 though... so I assume the worst
22:22pendingchaos: though when I tried it and it hanged, I think it might have been on otherwise lowest quality settings?
22:24karolherbst: I know I had some issues on a kepler GPU
22:24karolherbst: maybe it works on pascal?
22:24karolherbst: or doesn't?
22:24karolherbst: or maybe some extension check?
22:24karolherbst: or feature?
22:25pendingchaos: I don't think it worked on Pascal
22:25pendingchaos: (it hung)
22:44pendingchaos: TressFX isn't hanging for my for some reason
22:44pendingchaos: not sure if the hair looks correct though
22:46pendingchaos: *for me for some reason
22:59pendingchaos: the hair was probably wrong when I ran it with TressFX just then
22:59pendingchaos: I think the ponytail was missing
23:07karolherbst: pendingchaos: yeah, I know that issue. It happened on my kepler from time to time
23:24pendingchaos: I think I'm having some hanging problems with Tomb Raider though
23:25pendingchaos: I'm no longer sure if TressFX has anything to do with it? Tomb Raider seems to have worked fine every single time before now except once when I enabled TressFX
23:46pendingchaos: it seems in the past two hangs (one with TressFX, the other without), a shader eviction warning is printed at the same time it hangs
23:47karolherbst: well, that shouldn't matter
23:47karolherbst: but yeah, on my maxwell I was able to crash the channel as well
23:47pendingchaos: yeah, it shouldn't
23:48pendingchaos: I wonder if it has anything to do with it though
23:58karolherbst: I got a "fifo: fault 00 [READ] at 000000013343f000 engine 00 [GR] client 07 [GPC0/T1_2] reason 00 [PDE] on channel 13 [017d97f000 TombRaider]"
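When comparing several such fault messages across hangs, it can help to pull the fields apart. The sketch below parses only the exact layout of the line quoted above; the regular expression and the field names are assumptions made for illustration, not something taken from the kernel source.

```python
import re

# Pattern modelled on the single fault line quoted in the log (assumption).
FAULT_RE = re.compile(
    r"fifo: fault (?P<access_id>[0-9a-f]+) \[(?P<access>[^\]]*)\] "
    r"at (?P<addr>[0-9a-f]+) "
    r"engine (?P<engine_id>[0-9a-f]+) \[(?P<engine>[^\]]*)\] "
    r"client (?P<client_id>[0-9a-f]+) \[(?P<client>[^\]]*)\] "
    r"reason (?P<reason_id>[0-9a-f]+) \[(?P<reason>[^\]]*)\] "
    r"on channel (?P<channel>-?\d+) \[(?P<inst>[0-9a-f]+) (?P<name>[^\]]*)\]"
)


def parse_fifo_fault(line):
    """Return a dict of fault fields, or None if the line does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["addr"] = int(fields["addr"], 16)  # faulting address as an integer
    return fields


if __name__ == "__main__":
    sample = (
        "fifo: fault 00 [READ] at 000000013343f000 engine 00 [GR] "
        "client 07 [GPC0/T1_2] reason 00 [PDE] on channel 13 "
        "[017d97f000 TombRaider]"
    )
    print(parse_fifo_fault(sample))
```

Grouping parsed faults by `reason` and `client` is one simple way to check whether two hangs look like the same problem.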
23:58karolherbst: anyway, what I mainly want to test is if that notification stuff just ends up killing the application and Xorg won't freeze (or not for too long)
23:59karolherbst: so the user won't end up hard resetting the machine
23:59pendingchaos: I'm getting the same
23:59karolherbst: yeah, requires patches to the entire stack though... so not that easy to test
23:59karolherbst: currently setting it up on my machine with the maxwell GPU