00:03karolherbst: but I think I need to get a better understanding of TLDS anyway
00:10karolherbst: imirkin, skeggsb_: do you know how to generate TLD from PTX code?
00:55karolherbst: ohhh okay, tlds is a bit different
00:56karolherbst: so src0 is a single one
00:56karolherbst: and src1 is the double one
00:56karolherbst: uhm... no
00:56karolherbst: it is a bit more weird
01:08karolherbst: "mov $r0 0x0 0xf mov $r1 0x0 0xf tlds lz nodep 0x0 $r0 $r0 $r1 0x8 t2d g"
01:09imirkin: karolherbst: i think if you follow the code, it's actually a bit different
01:09imirkin: but i agree that it's a bit confusing
01:10imirkin: that ... = ms thing is for the ms case
01:10karolherbst: well, I have a CTS test failing which is super trivial
01:10imirkin: but then we set it to levelZero later on
01:10imirkin: note that implicitly !levelZero = level-explicit-lod
01:10imirkin: except for ms, where it's actually the sample
01:10karolherbst: ahh, right
01:10karolherbst: makes sense
01:10karolherbst: anyway, that's not the issue I have right now
01:10karolherbst: something like "texfetchs 2D $r0 $s0 r f32 $r0 u32 $r0 $r0 (8)" doesn't seem to work
01:10karolherbst: and I don't know why
01:11imirkin: how does nvdisasm decode it?
01:11karolherbst: I mean, the same
01:11imirkin: can i see it?
01:12karolherbst: /*0010*/ TLDS.LZ RZ, R0, R0, R0, 0x0, 2D, R; /* 0xda40000ff0070000 */
01:12imirkin: ok, so
01:12imirkin: that's actually a texelFetch(lod=0)
01:12imirkin: is that what you want?
01:12imirkin: i.e. what's the input code?
01:12karolherbst: TXF_LZ TEMP.x, TEMP, SAMP, 2D is the TGSI
01:13karolherbst: all vals are 0
01:13imirkin: ok. that sounds right ;)
01:13imirkin: i wonder if the same arg can't be reused
01:13imirkin: you can teach RA about shit like that
01:13imirkin: there's a thing to make sure that dst and src are different
01:13imirkin: that's slightly far-fetched though, i guess =/
01:14karolherbst: the same dest and src can be used
01:14karolherbst: nvidia: 00000018: f0170000 da42008f tlds lz nodep 0x0 $r0 $r0 $r1 0x8 t2d r
01:14imirkin: note how it uses different src's
01:14karolherbst: might be a different shader though
01:14imirkin: i was just giving an example for how RA can know about things like that
01:14karolherbst: for no apparent reason
01:14imirkin: shouldn't be too hard to teach it
01:14karolherbst: the most absurd thing is even that:
01:15karolherbst: movs 0 into $r0,$r1 and $r2
01:15karolherbst: then tlds lz nodep 0x0 $r0 $r0 $r2 0x8 t3d r
01:17karolherbst: imirkin: InsertConstraintsPass::addConstraint is for the def/src stuff, right?
01:18karolherbst: uhm... it isn't used
01:18imirkin: i don't remember
01:19imirkin: let's see....
01:19karolherbst: for the dest == src thing we use addRegPreference
01:19imirkin: addHazard is the thing for dst vs src
01:19imirkin: the way that works is it adds a fake use after the instruction
01:20karolherbst: yeah well
01:20karolherbst: that trick won't work with srcs
01:20imirkin: which makes that reg not eligible for dst selections
01:20imirkin: so what we want here
01:20imirkin: is to insert a nop
01:20imirkin: or something
01:20imirkin: which SETS the reg to some other value
01:20imirkin: and therefore the RA will think they're truly different things
01:21imirkin: i do believe that addConstraint will do something like it
01:21karolherbst: it uses that OP_CONSTRAINT I never saw
01:21imirkin: worth a shot
01:21imirkin: src/gallium/drivers/nouveau/codegen/nv50_ir.h: OP_CONSTRAINT, // copy values into consecutive registers
01:22imirkin: which will work out fine with 1 dst and 1 src
01:22imirkin: assuming it works :)
01:22imirkin: it does get pushed onto the constrList...probably fine
01:23imirkin: but untested.
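The fake-use trick described above for addHazard can be illustrated with a toy allocator. This is only a sketch under stated assumptions: `allocate`, `add_hazard`, and the interval format are hypothetical names made up for illustration and are not nouveau's RA code. A value's interval is `(definition position, last use position)`, and a register counts as free for a new value once its previous owner's last use is not later than the new definition.

```python
def allocate(intervals):
    """Toy greedy allocator (hypothetical, not nouveau's RA).

    intervals maps a value name to (def_pos, last_use_pos).
    Returns a dict mapping each value name to a register number.
    """
    assigned = {}
    busy_until = []  # busy_until[r] = last use position of the value held in r
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        for reg, last_use in enumerate(busy_until):
            if last_use <= start:  # previous owner is dead, reuse the register
                busy_until[reg] = end
                assigned[name] = reg
                break
        else:
            busy_until.append(end)  # no free register, take a new one
            assigned[name] = len(busy_until) - 1
    return assigned


def add_hazard(intervals, src, insn_pos):
    """Pretend src is used once more right after the instruction at insn_pos."""
    out = dict(intervals)
    start, end = out[src]
    out[src] = (start, max(end, insn_pos + 1))
    return out


# A texture fetch at position 2 reads "coord" and defines "texel".
intervals = {"coord": (0, 2), "texel": (2, 5)}
plain = allocate(intervals)
hazard = allocate(add_hazard(intervals, "coord", 2))
print(plain, hazard)
```

Without the fake use the destination reuses the source register; with it, the source stays live past the instruction and the destination lands elsewhere. As noted in the discussion, this only separates a dst from a src, not two srcs from each other.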
01:26karolherbst: okay yeah, "i - i" doesn't make much sense
01:26karolherbst: nice "1: texfetchs 2D $r0 $s0 r f32 $r0 u32 $r1 $r0 (8)"
01:28karolherbst: imirkin: wondering if the same is true for TEXS, but the full CTS run was fine
01:28karolherbst: so yeah, less fails now
01:30karolherbst: imirkin: maybe the second source actually needs to be the "next" reg
01:30karolherbst: mhh, no, that would be stupid
01:31karolherbst: I think it is something else
01:43karolherbst: actually 2D was fine
01:43karolherbst: 2D_ARRAY wasn't
01:43karolherbst: and the log was kind of screwed
01:51karolherbst: okay, found the mistake
01:58karolherbst: for 2DArray we have to use texfetchs 2D_ARRAY $r0 $s0 r f32 $r0 u32 $r0 $r2d
02:25imirkin: karolherbst: oh, that can happen too... there's at least one instance of that
02:25imirkin: in that case, just coalesce
02:25imirkin: and then at emit time, do n, n+1
02:27karolherbst: imirkin: or that: https://github.com/karolherbst/mesa/commit/f4d91b3d93f3fcb1c7397a11e29db82d0df90d84#diff-70ffabf33cc5a58af0db0774e917b125R2228
02:30imirkin: that's dangerous
02:30imirkin: should be done based on number of arguments.
02:31karolherbst: you mean the TexTarget.getArg() stuff?
02:31karolherbst: uhm, getArgCount actually
02:32karolherbst: yeah, I know, just keeping things simple for now
02:33karolherbst: for the defs I should bitcount the mask as well
02:34karolherbst: although the original code uses "tex->srcCount(0xff)"
02:39imirkin: yeah, srcCount.
02:39imirkin: everything else is lies.
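The "bitcount the mask" idea for the defs can be sketched as below. This is an illustrative helper only: `def_count` is a hypothetical name, and the 4-bit component mask layout (one bit per written component) is an assumption, not something confirmed in this log.

```python
def def_count(mask: int) -> int:
    """Number of destination components written, assuming a 4-bit
    per-component write mask (hypothetical layout)."""
    return bin(mask & 0xF).count("1")


for mask in (0x8, 0x3, 0xF):
    print(hex(mask), def_count(mask))
```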
11:19RSpliet: karolherbst: I like the sqrt patch :-P No chance of that existing on 1st gen Maxwell is there?
11:19karolherbst: RSpliet: not according to nvdisasm
11:20RSpliet: karolherbst: perhaps I could double-check with a two-line OpenCL/CUDA program on my laptop at some point...
11:20karolherbst: RSpliet: it gives a 10% perf boost in pixmark_piano, but unigine heaven was like 1% at most
11:20karolherbst: RSpliet: well, I already did
11:20RSpliet: Ah ok... I won't bother in that case :-D
11:20karolherbst: RSpliet: do you know clcc?
11:20RSpliet: very very vaguely. You can feed it ptx right?
11:20karolherbst: lets you compile OpenCL kernels to PTX code
11:20RSpliet: oh right
11:21karolherbst: makes a lot of things really easy to RE
11:21RSpliet: Yeah that's a lot easier than valgrind-mmt
11:21karolherbst: or writing your own PTX code
11:21karolherbst: a glsl to SASS thing would be nice as well
11:22RSpliet: sounds sassy...
11:22karolherbst: but mhh, something is odd with tld4s...
11:24karolherbst: those scalar texture ops are really nice
11:24karolherbst: if you understand how they work I mean
11:25karolherbst: forgot to emit the color mask
13:22karolherbst: mhh, KHR-GL45.geometry_shader.limits.max_combined_texture_units fails
15:09nikos: Hi, I'm getting pretty bad tearing during scrolling in firefox and video playback. I'm using a NV117 card with i3wm, tried with and without compton running. Any clues?
15:11diogenes_: nikos, try compiz
15:13orbea: or compton
19:00karolherbst: one furmark shader changed from 21 to 20 gprs
19:00karolherbst: 622 -> 646 points
19:20mooch2: karolherbst, hey, any ideas on how i could speed shit up with nouveau on my gm107?
19:22karolherbst: mooch2: plenty ideas
19:22karolherbst: mooch2: try the XMAD patches
19:22mooch2: aight, anything else?
19:22mooch2: like, stuff i could work on, maybe?
19:22karolherbst: instruction scheduler
19:22karolherbst: figuring out why things are slow
19:23karolherbst: figure out ZCULL
19:24karolherbst: what we really miss is a way to really micro benchmark stuff to know what exactly makes things bad
19:24karolherbst: but I think instruction scheduler + ZCULL would be plenty helpful already
19:32karolherbst: mooch2: you can test the patches I just sent out
19:33karolherbst: but don't expect much
19:33karolherbst: more like a 1% speedup in avg
19:33karolherbst: okay, now the xmad stuff
19:33HdkR: GM107 is a pretty low end part already
19:34karolherbst: HdkR: sure, but GDDR5 isn't that slow
19:35karolherbst: ohh there were DDR3 ones as well
19:35karolherbst: DDR3 is really pointless
19:35mooch2: how do i tell which memory type i have?
19:35karolherbst: mooch2: nouveau prints it out
19:35HdkR: The DDR4 GT 1030 should die as well
19:35mooch2: where tho?
19:36karolherbst: HdkR: well, at least it is DDR4
19:36karolherbst: but yeah
19:36karolherbst: HdkR: but nvidia tends to clock higher than what your sys mem is clocked to though
19:36mooch2: ah good, i have gddr5
19:36HdkR: Doesn't stop it from murdering the memory bandwidth compared to GDDR :P
19:37HdkR: That DDR3 GM107 is just...really bad
19:37karolherbst: mooch2: good, then your GPU isn't that bad, 950 ti?
19:37mooch2: 750 ti
19:37mooch2: it runs doom 4 at like 50 fps medium
19:37karolherbst: I just wanted to correct myself
19:37mooch2: ah lol
19:37karolherbst: mooch2: yeah... try those xmad patches
19:37karolherbst: although they only really help with integer MULs/MADs
19:38mooch2: uh, how do i get doom 4 running on linux tho? wine 3.13 can't run it *shrug*
19:38karolherbst: chances are, there are some
19:38karolherbst: mooch2: I am sure it can
19:38karolherbst: ohh wait
19:38karolherbst: we don't have vulkan
19:38mooch2: doom 4 supports opengl :/
19:38karolherbst: doesn't matter
19:39karolherbst: all dx10-12 games are basically translated for free
19:39mooch2: the opengl and vulkan engines are different executables
19:39mooch2: okay, fair enough
19:39karolherbst: you just need to use vulkan
19:39karolherbst: there is a dx10 to dx11 conversion library, and a dx11 to vulkan one
19:39mooch2: well, is that kernel feature that nouveau vulkan needs merged yet?
19:39karolherbst: and then we have that upstream wine dx12 to vulkan thing
19:39karolherbst: mooch2: yes
19:40mooch2: oh :c
19:40mooch2: well, is there any other way i can measure nouveau's performance other than running emulators? :p
19:40karolherbst: run native games?
19:43mooch2: i don't own any linux native games besides tf2 lol
19:43karolherbst: you don't want to know how many I own
19:43mooch2: i literally JUST switched over fully
19:43mooch2: nyef, that's... probably not too gpu-intensive either
19:43nyef: Details, details.
19:44karolherbst: I think I might even start this week with the vulkan driver then....
19:44karolherbst: or at least the preparations
19:44HdkR: xmoto :D
19:44mooch2: HdkR, teeworlds? xmoto?
19:44mooch2: what are these?
19:44karolherbst: moving codegen out of gallium
19:44HdkR: Native Linux games
19:44karolherbst: which will be quite painful
19:44mooch2: karolherbst, no i was talking to HdkR
19:46mooch2: HdkR, neither of those look gpu-intensive AT ALL
19:46mooch2: are there any gpu-intensive native 3d games?
19:46mooch2: i want to be gpu-bound here, not cpu-bound
20:05HdkR: mooch2: Can't you run most anything with AA or higher resolution to become bounded with a GM107? :P
20:06mooch2: eh, okay
20:06mooch2: i guess i'll run dolphin then :p
20:06HdkR: Dolphin will most benefit from that xmad optimization
20:06mooch2: with like, over... 4x IR maybe? i know on the blob, i could go up to like 6x
20:07mooch2: oh? nice
20:10HdkR: Turns out that most games don't use integer mads as much as Dolphin. Need to start looking at compute tasks to end up like that
21:10RSpliet: mooch2: I've played with instruction scheduling quite a while ago. Worked decently on Kepler, same code had no significant impact on my GM107M
21:13RSpliet: Similarly, I had played with bank-aware register allocation, which when done right should make a difference... but didn't
21:13RSpliet: So... they're nice to play with, but could not be the bottleneck at this point
21:13RSpliet: Or my policies were too simplistic :-P
21:18karolherbst: RSpliet: On kepler instruction scheduling has a big impact, because you are more likely to dual issue if you do something better
21:18karolherbst: or that's what I assume was the actual effect of your pass
21:19karolherbst: RSpliet: and on maxwell you need the proper sched stuff anyway, don't know if you played with it before or after we added that
21:20karolherbst: but, we could actually count the stalls on maxwell and try to reduce that
21:20RSpliet: karolherbst: I tried stuff after hakzsam's work was merged
21:21RSpliet: With a simple "issue load/stores early, then try to schedule instructions that don't depend on them for a long while"
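The policy quoted above ("issue load/stores early, then try to schedule instructions that don't depend on them") can be sketched as a toy list scheduler. This is not RSpliet's patch: `schedule`, the tuple format, and the priority scheme are assumptions made for illustration, latencies are ignored, and the input is assumed to be an acyclic dependency graph.

```python
def schedule(insns):
    """Toy list scheduler (hypothetical sketch).

    insns: list of (name, is_load, deps) tuples in program order.
    Among instructions whose deps are satisfied, pick loads first, then
    instructions that do not directly consume a load result, then the rest.
    """
    loads = {name for name, is_load, _ in insns if is_load}

    def priority(insn):
        name, is_load, deps = insn
        if is_load:
            return 0
        return 2 if loads & set(deps) else 1

    done, order = set(), []
    remaining = list(insns)
    while remaining:
        ready = [i for i in remaining if set(i[2]) <= done]
        pick = min(ready, key=priority)  # min keeps program order on ties
        remaining.remove(pick)
        done.add(pick[0])
        order.append(pick[0])
    return order


program = [
    ("a", False, []),
    ("ld0", True, ["a"]),
    ("use0", False, ["ld0"]),
    ("b", False, []),
    ("ld1", True, []),
    ("c", False, ["b"]),
]
order = schedule(program)
print(order)
```

In this example both loads are hoisted as early as their dependencies allow, and the consumer of `ld0` is pushed behind the independent work.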
21:22RSpliet: I saw it make a positive difference on pixmark piano I think, but nothing on real games or benchmarks (and it took a lot of tweaking to not regress one of the other pixmark ones)
21:23RSpliet: But I'd expect the impact to be smaller for real games as they are more likely to have <32GPRs, hence more opportunities to hide DRAM access latencies through thread parallelism rather than pipeline parallelism
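The "<32 GPRs" threshold can be made concrete with a rough occupancy estimate. The hardware figures below are assumptions not stated in this log (65536 32-bit registers per SM, at most 64 resident warps, 32 threads per warp), and register allocation granularity is ignored, so treat this as a back-of-the-envelope sketch rather than an exact model.

```python
# Assumed per-SM limits (not taken from this log).
REGS_PER_SM = 65536
MAX_WARPS_PER_SM = 64
THREADS_PER_WARP = 32


def resident_warps(gprs_per_thread: int) -> int:
    """Warps that fit on one SM when limited only by register usage."""
    by_regs = REGS_PER_SM // (gprs_per_thread * THREADS_PER_WARP)
    return min(MAX_WARPS_PER_SM, by_regs)


for gprs in (20, 32, 64, 128):
    print(gprs, resident_warps(gprs))
```

Under these assumptions a shader using up to 32 GPRs can keep the full 64 warps resident, while a shader with many registers leaves fewer warps to switch to when one stalls on memory.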
21:24RSpliet: Perhaps armed with a stall counter you could tweak policies much easier :-) I never did things too fancy, just a bit of kick-and-monitor
21:29karolherbst: I think we really have other things to worry about right now
21:30karolherbst: RSpliet: increasing the dual issuing rate made quite a big difference on kepler though
21:31karolherbst: but it was still only around 3%
21:31RSpliet: karolherbst: 3% is pretty good.
21:31karolherbst: doesn't make a difference in the end
21:31karolherbst: except we have like 10 things which improve by that
21:32karolherbst: I got the dual issue rate to increase from 21% to 27% in pixmark_piano
21:32RSpliet: Yeah, that's the nature of the beast. And they get progressively more complex to tackle
21:32karolherbst: sooo.. yeah
21:33karolherbst: I am sure if we get close to 50%, that this would make a huge difference
21:33RSpliet: Dual issue rate of 20%/40% (depending on how you look at it) I believe is pretty good
21:33karolherbst: 20% is pretty bad actually
21:33RSpliet: I was amazed by how much the difference can be in this OpenCL kernel of mine for proper loop unrolling
21:34karolherbst: I am kind of looking into improving the code generation when coming from nir
21:34RSpliet: with 20/40% I mean two in five instructions are issued in pairs. 60% as individual. That's the rate I found for a set of OpenCL kernels I wrote
21:34karolherbst: as we get all those fancy CFG based opts there
21:36karolherbst: RSpliet: well, the hardware can do 100% in pairs
21:36karolherbst: sure, that's not possible for most kernels/shaders
21:36karolherbst: but I am sure that you can hit 50% in pairs quite easily
21:37karolherbst: I mean, in pixmark_piano I got to <46% issued alone
21:37RSpliet: 40% is what I got from the NVIDIA compiler on my OpenCL kernels, that's my reference point.
21:37karolherbst: which means 54% in pairs ;)
21:38RSpliet: Pixmark is a weird benchmark. 4000 lines of shader assembly, tons of registers. That means there's lots of opportunities to issue pairs with no register dependencies among each other.
21:38karolherbst: let me check what our rate is in avg
21:39karolherbst: still, it isn't a shader with a high rate
21:40RSpliet: I'm sure that my scheduling patches improve dual issue by separating such insns with register dependencies, scheduling them further apart, making it more likely for adjacent insns to pair up
21:40RSpliet: Or... I at least tested with code to do that
21:40karolherbst: RSpliet: I see shaders where we even do 86% dual issuing
21:40RSpliet: we == nouveau?
21:41RSpliet: Are these non-trivial shaders?
21:42karolherbst: 51 instructions, although let me check for others
21:43karolherbst: RSpliet: https://gist.githubusercontent.com/karolherbst/68bc6b3ff69fb22f239808041b2b4767/raw/0c54d81081986e64aebc1efe73b80defacf917f4/gistfile1.txt
21:43karolherbst: dual issues / instruction count
21:43karolherbst: 0.1 means 20% in pairs, 10% issued with the 0x04 sched
21:44karolherbst: but the bigger ones seem to be around 60%
21:44karolherbst: and that's without my pass
21:45karolherbst: avg seems to be around 50%
21:45karolherbst: or something
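The two ways of quoting the dual issue rate ("20%/40%", "0.1 means 20% in pairs") follow from simple arithmetic: each dual issue covers two instructions. A minimal sketch, with `issue_breakdown` as a hypothetical helper name:

```python
def issue_breakdown(dual_issues: int, instructions: int):
    """Return (dual issues per instruction, fraction of instructions issued
    in pairs, fraction issued alone). Each dual issue covers two instructions."""
    ratio = dual_issues / instructions
    paired = 2 * ratio
    return ratio, paired, 1 - paired


# 5 dual issues in a 50-instruction shader: ratio 0.1, i.e. 20% in pairs.
ratio, paired, alone = issue_breakdown(5, 50)
print(ratio, paired, alone)
```

Read the other way round, "46% issued alone" corresponds to 54% of instructions issued in pairs, or a ratio of 0.27 dual issues per instruction.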
21:49karolherbst: RSpliet: anyhow, the effort put into this isn't really worth the time for us right now
21:49karolherbst: I am sure figuring out ZCULL would be worth it
21:50karolherbst: or figure out where our actual bottlenecks are
21:50karolherbst: or something like that XMAD thing, which is just much faster for compute like shaders
21:57RSpliet: yeah... I must admit I know too little about graphics to say anything meaningful about that
21:58RSpliet: But... could we also be too cautious with launching independent shaders in parallel (stuff that Maxwell can do to a limited degree)?
21:58RSpliet: E.g. stick too many fences and syncs in the command FIFOs
22:41karolherbst: RSpliet: the actual issue we have is, that we don't know