15:16angry_dev: hey all, i'm getting SCHED_ERRORs a few times a day that sometimes lock up my system (gentoo/gnome) and was wondering if i could get pointed in the right direction
15:17angry_dev: NVIDIA Corporation GK107 [GeForce GTX 650] (rev a1) and kernel 4.16.13
15:17Lyude: angry_dev: try updating to the latest 4.16 kernel first
15:18angry_dev: okay, will do
16:15pendingchaos: karolherbst: could you perhaps look at these patches sometime: https://gist.github.com/pendingchaos/3eb3e2a0007a442912a59d4234cf0bdc?
16:15pendingchaos: about the LateAlgebraicOpt one: it seems it helps because LateAlgebraicOpt doesn't optimize adds with immediates
16:15pendingchaos: and there are no immediates before LoadPropagation
16:15pendingchaos: so perhaps, if that one is pushed, the checks could be removed
16:17karolherbst: pendingchaos: ohhhhhh. I know why
16:17karolherbst: I fixed that in a different patch differently
16:18karolherbst: pendingchaos: https://github.com/karolherbst/mesa/commit/a12472f107209000ac013c2f23423ca272d129c4 and https://github.com/karolherbst/mesa/commit/56d13872868547c90cd0412b732ecb693cdfdbcc
16:18pendingchaos: that's only for constant loads though? it doesn't optimize adds with immediates
16:18karolherbst: if you apply those patches first and check the effectiveness of moving the pass up, you should see that it should get close to 0%
16:18karolherbst: two patches ;)
16:20karolherbst: I think with those two patches applied your patch moving the pass might actually produce worse code in avg
16:20karolherbst: I think Split64BitOpPreRA is the one pass producing some combinations which LateAlgebraicOpt is able to improve
16:21pendingchaos: after creating a temporary mov for insnCanLoad() and removing the checks, it has a similar effect to moving the pass
16:22karolherbst: well, my patches have a bigger impact
16:22karolherbst: and fewer hurts
16:23karolherbst: well except on the gprs side
16:23karolherbst: but this is always painful to avoid
16:24karolherbst: maybe the other passes are also generating some code which could be cleaned up by LateAlgebraicOpt
16:24karolherbst: I think I would rather investigate this a bit more
16:25karolherbst: pendingchaos: regarding the merging of 32 bit constant loads, the gprs increase is quite huge
16:26pendingchaos: yeah, it's a bit of a questionable patch
16:26karolherbst: then we kill only a few movs, right?
16:26karolherbst: movs are so cheap, that they don't matter compared to any kind of texture/memory operation anyway
16:26pendingchaos: a few loads/stores
16:26karolherbst: and killing parallelism is always bad :/
16:27pendingchaos: it should only touch them
16:27pendingchaos: not movs
16:27karolherbst: well, loads, because you can't store to const mem
16:28karolherbst: is there maybe an application hit significantly by this?
16:29karolherbst: might be worth benchmarking a little
16:29karolherbst: if pixmark_piano is hit, always worth checking this
16:29karolherbst: because the result is like super stable
16:29karolherbst: and you can easily notice change in a 0.5% range with high confidence
16:30karolherbst: even 0.1% if you do a few more runs
16:30pendingchaos: I don't think so? at least not with some Unigine benchmarks IIRC
16:30pendingchaos: I haven't used pixmark_piano
16:32karolherbst: pixmark_piano is great
16:32karolherbst: you can run it like 10 times in a row and always get the same result
16:32RSpliet: pixmark piano also has an unreasonably large shader. You can see pretty much anything in it you want. It's like using Gem5 to propose an architectural change that gives you 10% more perf that you'll never see happening in real life
16:33karolherbst: well, actually not
16:33RSpliet: (which is coincidentally why a lot of university research sucks :-P but that's a different matter)
16:33karolherbst: because if you do something stupid, you get a negative feedback
16:33karolherbst: of course you don't know if the thing you changed is good for other applications though
16:35karolherbst: RSpliet: but a lot of other benchmarks just suck, because they might have a 5% error margin by default
16:35karolherbst: which kind of sucks for micro optimizations
16:36pendingchaos: not sure if pixmark_piano would be useful for something changing just constant buffer loads, apparently it's based off a shadertoy shader
16:36pendingchaos: haven't tried yet though
16:36RSpliet: Let me put it differently: pixmark_piano's DRAM-bw-to-instruction ratio is not representative of real applications. This puts an emphasis on developing peephole optimisations, which is currently not where applications' bottlenecks really are.
16:37RSpliet: I'm not telling you not to play more with the peephole opts, as long as you realise the changes will have limited impact on the bulk of the users :-)
16:39RSpliet: well, that disregards the fact that the state of nouveau scares off gamers to binary land, so the real users are probably just running a browser and a desktop compositor, so let's say prospective users that we'd get if only we got proper DVFS support for all cards :-P
16:40karolherbst: I really should get back to my patches
16:40RSpliet: which ones? DVFS? NIR? Insn scheduling?
16:41karolherbst: dvfs and other clocking related ones
16:41RSpliet: Haha, cool. I'll let RH and you set your schedule :-P happy to see a growing workforce for nouveau these days
16:41karolherbst: my hope with nir is that we can kind of take advantage of all the good CFG based opts, but I never really cared about perf there and a few things still suck big time
16:42karolherbst: zcull would be a big win
16:42RSpliet: CFG based == code motion? Yeah I'd like to see that happen. It would provide such a good basis for instruction scheduling
16:43karolherbst: RSpliet: we have quite a lot of opts we don't do which are all kind of control flow or program flow based
16:43karolherbst: like we don't kill stores after the last emit in a geometry shader
16:44karolherbst: or I am sure we could do plenty of opts in regards to discard as well
16:44RSpliet: karolherbst: that sounds like a reachability, DCE problem...
16:44karolherbst: well, ...
16:44karolherbst: not if you loop all of a sudden
16:45RSpliet: I know too little about the matter to say something more sane about it :-)
16:45karolherbst: and discard is also painful if you have branched code and some semi conditional discard
16:45karolherbst: but still instructions inside that block
16:46karolherbst: but anyway, no idea how much impact this all could have
16:46karolherbst: RSpliet: there was this discussion about a nir discard opt and where d3d shaders put discard at the bottom
16:47karolherbst: this could explain why we suck with those eon ported games
16:50karolherbst: or uhm, it was dxvk related actually
16:56nyef: ... I'm going to have to figure out how to tolerate working with C++ if I want to get in on the interesting compilery bits, aren't I? That sucks.
17:11pendingchaos: karolherbst: no changes in pixmark_piano for the constant load combination patch
17:13pendingchaos: also: comparing the LateAlgebraicOpt movement vs your two patches: https://pastebin.com/raw/wD7TU3WX
17:14pendingchaos: your patches seem to have a large increase in size for hitmanpro/1464?
17:16karolherbst: yeah might be, I didn't look too closely I think
17:21karolherbst: pendingchaos: anyway, I plan to take a deeper look at the patches this weekend
17:21pendingchaos: seems it's the "nv50/ir: handle SHLADD in IndirectPropagation" patch decreasing the size of the shader from 150 instructions to 132 instructions
17:24kiljacken: To all the clever heads: Is there an easy way to identify writes to falcon io space in an mmiotrace?
17:25karolherbst: kiljacken: not if the write is done on the falcon
17:28kiljacken: It's from the driver (or at least I'd assume so). I'm looking into the issue I brought up three weeks ago, of clearing interrupts from nouveau failing in some piece of startup code on gp104, so I thought I'd take a look at what the nvidia driver sends to the card
17:30kiljacken: So my first goal is just locating the equivalent write to clear falcon interrupts in the mmiotrace of the nvidia driver, while slowly getting a grip on the tooling
17:31karolherbst: well, each falcon has some interrupt regs, you could check with rnndb at which offset those are
17:31karolherbst: but those should also just show up inside demmio
17:33pendingchaos: karolherbst: your patches with indirect propagation for shladd: https://pastebin.com/raw/sEJqimKy
17:33pendingchaos: they seem just as effective as the LateAlgebraicOpt move patch
18:01kiljacken: From what I can gather PSEC2 would be the falcon responsible for secboot? Or am I getting that wrong?
18:13kiljacken: Ahhh, the names in the readthedocs.io falcon io space documentation don't match the names in rnndb, that explains a lot
18:32kiljacken: Hmm, nouveau and nvidia seem to be sending rather different firmware blobs to PSEC2
18:55kiljacken: Okay, so it appears the firmware blob is actually loaded to the card despite the timeout I'm experiencing. But the HS bootloader returns an error code.
18:55kiljacken: That doesn't seem great
20:04RSpliet: nyef: well, the compiler isn't quite MVC style C++... presumably for performance reasons. You'll have to tolerate not just C++, but the codegen compiler in general ;-)
20:04RSpliet: Most bits are pretty straightforward though, quite pragmatically designed