00:03 karolherbst: but I think I need to get a better understanding of TLDS anyway
00:10 karolherbst: imirkin, skeggsb_: do you know how to generate TLD from PTX code?
00:55 karolherbst: ohhh okay, tlds is a bit different
00:56 karolherbst: so src0 is a single one
00:56 karolherbst: and src1 is the double one
00:56 karolherbst: uhm... no
00:56 karolherbst: actually
00:56 karolherbst: it is a bit more weird
01:08 karolherbst: huh
01:08 karolherbst: "mov $r0 0x0 0xf mov $r1 0x0 0xf tlds lz nodep 0x0 $r0 $r0 $r1 0x8 t2d g"
01:09 imirkin: karolherbst: i think if you follow the code, it's actually a bit different
01:09 imirkin: but i agree that it's a bit confusing
01:09 karolherbst: mhhh
01:10 imirkin: that ... = ms thing is for the ms case
01:10 karolherbst: well, I have a CTS test failing which is super trivial
01:10 imirkin: but then we set it to levelZero later on
01:10 imirkin: note that implicitly !levelZero = level-explicit-lod
01:10 imirkin: except for ms, where it's actually the sample
01:10 karolherbst: ahh, right
01:10 karolherbst: makes sense
01:10 karolherbst: anyway, that's not the issue I have right now
01:10 karolherbst: something like "texfetchs 2D $r0 $s0 r f32 $r0 u32 $r0 $r0 (8)" doesn't seem to work
01:10 karolherbst: and I don't know why
01:11 imirkin: how does nvdisasm decode it?
01:11 karolherbst: correctly
01:11 karolherbst: I mean, the same
01:11 imirkin: can i see it?
01:12 karolherbst: /*0010*/ TLDS.LZ RZ, R0, R0, R0, 0x0, 2D, R; /* 0xda40000ff0070000 */
01:12 imirkin: ok, so
01:12 imirkin: that's actually a texelFetch(lod=0)
01:12 imirkin: is that what you want?
01:12 imirkin: i.e. what's the input code?
01:12 karolherbst: TXF_LZ TEMP[1].x, TEMP[0], SAMP[0], 2D is the TGSI
01:13 karolherbst: all vals are 0
01:13 imirkin: ok. that sounds right ;)
01:13 imirkin: i wonder if the same arg can't be reused
01:13 karolherbst: yeah
01:13 imirkin: you can teach RA about shit like that
01:13 imirkin: there's a thing to make sure that dst and src are different
01:13 imirkin: that's slightly far-fetched though, i guess =/
01:14 karolherbst: the same dest and src can be used
01:14 karolherbst: nvidia: 00000018: f0170000 da42008f tlds lz nodep 0x0 $r0 $r0 $r1 0x8 t2d r
01:14 imirkin: note how it uses different src's
01:14 karolherbst: might be a different shader though
01:14 karolherbst: yeah
01:14 imirkin: i was just giving an example for how RA can know about things like that
01:14 karolherbst: for no apparent reason
01:14 imirkin: shouldn't be too hard to teach it
01:14 karolherbst: the most absurd thing is even that:
01:15 karolherbst: movs 0 into $r0,$r1 and $r2
01:15 karolherbst: then tlds lz nodep 0x0 $r0 $r0 $r2 0x8 t3d r
01:17 karolherbst: imirkin: InsertConstraintsPass::addConstraint is for the def/src stuff, right?
01:18 karolherbst: uhm... it isn't used
01:18 imirkin: i don't remember
01:19 imirkin: let's see....
01:19 karolherbst: for the dest == src thing we use addRegPreference
01:19 imirkin: addHazard is the thing for dst vs src
01:19 karolherbst: huh
01:19 imirkin: the way that works is it adds a fake use after the instruction
01:20 karolherbst: yeah well
01:20 karolherbst: that trick won't work with srcs
01:20 imirkin: which makes that reg not eligible for dst selections
01:20 imirkin: right.
01:20 imirkin: so what we want here
01:20 imirkin: is to insert a nop
01:20 imirkin: or something
01:20 imirkin: which SETS the reg to some other value
01:20 imirkin: and therefore the RA will think they're truly different things
01:20 karolherbst: ohhh
01:21 imirkin: i do believe that addConstraint will do something like it
01:21 karolherbst: it uses that OP_CONSTRAINT I never saw
01:21 imirkin: yeah
01:21 imirkin: worth a shot
01:21 imirkin: src/gallium/drivers/nouveau/codegen/nv50_ir.h: OP_CONSTRAINT, // copy values into consecutive registers
01:22 imirkin: which will work out fine with 1 dst and 1 src
01:22 karolherbst: ohh
01:22 imirkin: assuming it works :)
01:22 imirkin: it does get pushed onto the constrList...probably fine
01:23 imirkin: but untested.
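A toy illustration of the fake-use trick described above for addHazard. This is not nouveau's register allocator: `allocate` and the tuple-based instruction format are made up for this sketch, which only assumes that a destination may take over the register of a source that is read for the last time by the same instruction.

```python
def allocate(instrs):
    """Toy register allocator (hypothetical, not nouveau's RA).

    instrs: list of (dst, srcs) using value names; dst may be None.
    Returns a dict mapping each defined value to a register index.
    """
    # Index of the last instruction that reads each value.
    last_use = {}
    for idx, (_dst, srcs) in enumerate(instrs):
        for s in srcs:
            last_use[s] = idx

    reg_of, free, next_reg = {}, [], 0
    for idx, (dst, srcs) in enumerate(instrs):
        # A source read for the last time here gives its register back
        # before the destination is chosen, so the two may share it.
        for s in dict.fromkeys(srcs):
            if s in reg_of and last_use[s] == idx:
                free.append(reg_of[s])
        if dst is not None:
            if free:
                reg = min(free)
                free.remove(reg)
            else:
                reg = next_reg
                next_reg += 1
            reg_of[dst] = reg
    return reg_of


plain = [("a", []), ("b", ["a"])]      # b = op(a)
hazard = plain + [(None, ["a"])]       # fake use of "a" after the instruction

print(allocate(plain))
print(allocate(hazard))
```

Without the fake use, "b" lands in the register "a" just released; with it, "a" stays live past the instruction and "b" gets a different register. As noted in the chat, this only separates a dst from a src, not two srcs from each other.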
01:24 karolherbst: mhh
01:26 karolherbst: okay yeah, "i - i" doesn't make much sense
01:26 karolherbst: nice "1: texfetchs 2D $r0 $s0 r f32 $r0 u32 $r1 $r0 (8)"
01:28 karolherbst: imirkin: wondering if the same is true for TEXS, but the full CTS run was fine
01:28 karolherbst: mhh
01:28 karolherbst: so yeah, less fails now
01:30 karolherbst: mhhhhhhh
01:30 karolherbst: imirkin: maybe the second source actually needs to be the "next" reg
01:30 karolherbst: mhh, no, that would be stupid
01:31 karolherbst: I think it is something else
01:43 karolherbst: actually 2D was fine
01:43 karolherbst: 2D_ARRAY wasn't
01:43 karolherbst: and the log was kind of screwed
01:51 karolherbst: duh....
01:51 karolherbst: okay, found the mistake
01:58 karolherbst: for 2DArray we have to use texfetchs 2D_ARRAY $r0 $s0 r f32 $r0 u32 $r0 $r2d
02:25 imirkin: karolherbst: oh, that can happen too... there's at least one instance of that
02:25 imirkin: in that case, just coalesce
02:25 imirkin: and then at emit time, do n, n+1
02:27 karolherbst: imirkin: or that: https://github.com/karolherbst/mesa/commit/f4d91b3d93f3fcb1c7397a11e29db82d0df90d84#diff-70ffabf33cc5a58af0db0774e917b125R2228
02:30 imirkin: that's dangerous
02:30 imirkin: should be done based on number of arguments.
02:31 karolherbst: you mean the TexTarget.getArg() stuff?
02:31 karolherbst: uhm, getArgCount actually
02:32 karolherbst: yeah, I know, just keeping things simple for now
02:33 karolherbst: for the defs I should bitcount the mask as well
02:34 karolherbst: although the original code uses "tex->srcCount(0xff)"
02:39 imirkin: yeah, srcCount.
02:39 imirkin: everything else is lies.
02:42 karolherbst: okay
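A minimal sketch of the "bitcount the mask" idea for the defs. It assumes a 4-bit component write mask (one bit per component); the helper name `def_count` is hypothetical and not part of the codegen.

```python
def def_count(mask):
    """Number of destination components enabled in a 4-bit write mask."""
    return bin(mask & 0xF).count("1")


for m in (0x1, 0x3, 0x8, 0xF):
    print(hex(m), def_count(m))
```

Counting set bits gives the number of results actually written, which is the defs-side counterpart of counting sources with srcCount.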
11:19 RSpliet: karolherbst: I like the sqrt patch :-P No chance of that existing on 1st gen Maxwell is there?
11:19 karolherbst: RSpliet: not according to nvdisasm
11:20 RSpliet: karolherbst: perhaps I could double-check with a two-line OpenCL/CUDA program on my laptop at some point...
11:20 karolherbst: RSpliet: it gives a 10% perf boost in pixmark_piano, but unigine heaven was like 1% at most
11:20 karolherbst: RSpliet: well, I already did
11:20 RSpliet: Ah ok... I won't bother in that case :-D
11:20 karolherbst: RSpliet: do you know clcc?
11:20 RSpliet: very very vaguely. You can feed it ptx right?
11:20 karolherbst: lets you compile OpenCL kernels to PTX code
11:20 RSpliet: oh right
11:21 karolherbst: makes a lot of things really easy to RE
11:21 RSpliet: Yeah that's a lot easier than valgrind-mmt
11:21 karolherbst: or writing your own PTX code
11:21 karolherbst: a glsl to SASS thing would be nice as well
11:22 RSpliet: sounds sassy...
11:22 karolherbst: but mhh, something is odd with tld4s...
11:24 karolherbst: those scalar texture ops are really nice
11:24 karolherbst: if you understand how they work I mean
11:24 karolherbst: ohhhh
11:25 karolherbst: forgot to emit the color mask
13:22 karolherbst: mhh, KHR-GL45.geometry_shader.limits.max_combined_texture_units fails
15:09 nikos: Hi, I'm getting pretty bad tearing during scrolling in firefox and video playback. I'm using a NV117 card with i3wm, tried with and without compton running. Any clues?
15:11 diogenes_: nikos, try compiz
15:13 orbea: or compton
19:00 karolherbst: nice
19:00 karolherbst: one furmark shader changed from 21 to 20 gprs
19:00 karolherbst: 622 -> 646 points
19:20 mooch2: karolherbst, hey, any ideas on how i could speed shit up with nouveau on my gm107?
19:22 karolherbst: mooch2: plenty ideas
19:22 karolherbst: mooch2: try the XMAD patches
19:22 mooch2: aight, anything else?
19:22 mooch2: like, stuff i could work on, maybe?
19:22 karolherbst: instruction scheduler
19:22 karolherbst: figuring out why things are slow
19:23 karolherbst: figure out ZCULL
19:24 karolherbst: what we really miss is a way to really micro benchmark stuff to know what exactly makes things bad
19:24 karolherbst: but I think instruction scheduler + ZCULL would be plenty helpful already
19:32 karolherbst: mooch2: you can test the patches I just sent out
19:33 karolherbst: but don't expect much
19:33 karolherbst: more like a 1% speedup in avg
19:33 karolherbst: okay, now the xmad stuff
19:33 HdkR: GM107 is a pretty low end part already
19:34 karolherbst: HdkR: sure, but GDDR5 isn't that slow
19:35 karolherbst: ohh there were DDR3 ones as well
19:35 karolherbst: *sigh*
19:35 karolherbst: DDR3 is really pointless
19:35 mooch2: how do i tell which memory type i have?
19:35 karolherbst: mooch2: nouveau prints it out
19:35 HdkR: The DDR4 GT 1030 should die as well
19:35 mooch2: where tho?
19:35 karolherbst: dmesg
19:35 mooch2: ah
19:36 karolherbst: HdkR: well, at least it is DDR4
19:36 karolherbst: but yeah
19:36 karolherbst: HdkR: but nvidia tends to clock higher than what your sys mem is clocked to though
19:36 karolherbst: so....
19:36 mooch2: ah good, i have gddr5
19:36 HdkR: Doesn't stop it from murdering the memory bandwidth compared to GDDR :P
19:37 HdkR: That DDR3 GM107 is just...really bad
19:37 karolherbst: yeah
19:37 karolherbst: mooch2: good, then your GPU isn't that bad, 950 ti?
19:37 mooch2: 750 ti
19:37 karolherbst: uhm
19:37 karolherbst: right
19:37 mooch2: it runs doom 4 at like 50 fps medium
19:37 karolherbst: I just wanted to correct myself
19:37 mooch2: ah lol
19:37 karolherbst: mooch2: yeah... try those xmad patches
19:38 mooch2: aight
19:38 karolherbst: although they only really help with integer MULs/MADs
19:38 karolherbst: but
19:38 mooch2: uh, how do i get doom 4 running on linux tho? wine 3.13 can't run it *shrug*
19:38 karolherbst: chances are, there are some
19:38 karolherbst: mooch2: I am sure it can
19:38 karolherbst: ohh wait
19:38 karolherbst: no
19:38 karolherbst: we don't have vulkan
19:38 mooch2: doom 4 supports opengl :/
19:38 karolherbst: doesn't matter
19:38 mooch2: oh?
19:39 karolherbst: all dx10-12 games are basically translated for free
19:39 mooch2: the opengl and vulkan engines are different executables
19:39 mooch2: okay, fair enough
19:39 karolherbst: you just need to use vulkan
19:39 karolherbst: there is a dx10 to dx11 conversion library, and a dx11 to vulkan one
19:39 mooch2: well, is that kernel feature that nouveau vulkan needs merged yet?
19:39 karolherbst: and then we have that upstream wine dx12 to vulkan thing
19:39 karolherbst: mooch2: yes
19:39 karolherbst: uhm
19:40 karolherbst: no
19:40 mooch2: oh :c
19:40 mooch2: well, is there any other way i can measure nouveau's performance other than running emulators? :p
19:40 karolherbst: run native games?
19:42 mooch2: like?
19:43 mooch2: i don't own any linux native games besides tf2 lol
19:43 karolherbst: ...
19:43 nyef: Tuxracer?
19:43 karolherbst: you don't want to know how many I own
19:43 mooch2: i literally JUST switched over fully
19:43 mooch2: nyef, that's... probably not too gpu-intensive either
19:43 nyef: Details, details.
19:44 karolherbst: I think I might even start this week with the vulkan driver then....
19:44 HdkR: Teeworlds
19:44 karolherbst: or at least the preparations
19:44 mooch2: yays!!!
19:44 HdkR: xmoto :D
19:44 mooch2: HdkR, teeworlds? xmoto?
19:44 mooch2: what are these/
19:44 mooch2: *?
19:44 karolherbst: moving codegen out of gallium
19:44 HdkR: Native Linux games
19:44 karolherbst: which will be quite painful
19:44 mooch2: karolherbst, no i was talking to HdkR
19:46 mooch2: HdkR, neither of those look gpu-intensive AT ALL
19:46 mooch2: are there any gpu-intensive native 3d games?
19:46 mooch2: i want to be gpu-bound here, not cpu-bound
20:05 HdkR: mooch2: Can't you run most anything with AA or higher resolution to become bounded with a GM107? :P
20:06 mooch2: eh, okay
20:06 mooch2: i guess i'll run dolphin then :p
20:06 HdkR: Dolphin will most benefit from that xmad optimization
20:06 mooch2: with like, over... 4x IR maybe? i know on the blob, i could go up to like 6x
20:07 mooch2: oh? nice
20:10 HdkR: Turns out that most games don't use integer mads as much as Dolphin. Need to start looking at compute tasks to end up like that
21:10 RSpliet: mooch2: I've played with instruction scheduling quite a while ago. Worked decently on Kepler, same code had no significant impact on my GM107M
21:13 RSpliet: Similarly, I had played with bank-aware register allocation, which when done right should make a difference... but didn't
21:13 RSpliet: So... they're nice to play with, but could not be the bottleneck at this point
21:13 RSpliet: Or my policies were too simplistic :-P
21:18 karolherbst: RSpliet: On kepler instruction scheduling has a big impact, because you are more likely to dual issue if you do something better
21:18 karolherbst: or that's what I assume was the actual effect of your pass
21:19 karolherbst: RSpliet: and on maxwell you need the proper sched stuff anyway, don't know if you played with it before or after we added that
21:20 karolherbst: but, we could actually count the stalls on maxwell and try to reduce that
21:20 RSpliet: karolherbst: I tried stuff after hakzsams work was merged
21:21 RSpliet: With a simple "issue load/stores early, then try to schedule instructions that don't depend on them for a long while"
21:22 RSpliet: I saw it make a positive difference on pixmark piano I think, but nothing on real games or benchmarks (and it took a lot of tweaking to not regress one of the other pixmark ones)
21:23 RSpliet: But I'd expect the impact to be smaller for real games as they are more likely to have <32GPRs, hence more opportunities to hide DRAM access latencies through thread parallelism rather than pipeline parallelism
21:24 RSpliet: Perhaps armed with a stall counter you could tweak policies much easier :-) I never did things too fancy, just a bit of kick-and-monitor
21:29 karolherbst: yeah
21:29 karolherbst: I think we really have other things to worry about right now
21:30 karolherbst: RSpliet: increasing the dual issuing rate made quite a big difference on kepler though
21:31 karolherbst: but it was still only around 3%
21:31 RSpliet: karolherbst: 3% is pretty good.
21:31 karolherbst: well
21:31 karolherbst: doesn't make a difference in the end
21:31 karolherbst: except we have like 10 things which improve by that
21:32 karolherbst: I got the dual issue rate to increase from 21% to 27% in pixmark_piano
21:32 RSpliet: Yeah, that's the nature of the beast. And they get progressively more complex to tackle
21:32 karolherbst: sooo.. yeah
21:33 karolherbst: I am sure if we get close to 50%, that this would make a huge difference
21:33 RSpliet: Dual issue rate of 20%/40% (depending on how you look at it) I believe is pretty good
21:33 RSpliet: I
21:33 karolherbst: 20% is pretty bad actually
21:33 RSpliet: I was amazed by how much the difference can be in this OpenCL kernel of mine for proper loop unrolling
21:33 karolherbst: yeah
21:34 karolherbst: I am kind of looking in improving the codegen generation when coming from nir
21:34 RSpliet: with 20/40% I mean two in five instructions are issued in pairs. 60% as individual. That's the rate I found for a set of OpenCL kernels I wrote
21:34 karolherbst: as we get all those fancy CFG based opts there
21:36 karolherbst: RSpliet: well, the hardware can do 100% in pairs
21:36 karolherbst: sure, that's not possible for most kernels/shaders
21:36 karolherbst: but I am sure that you can hit 50% in pairs quite easily
21:37 karolherbst: I mean, in pixmark_piano I got to <46% issued alone
21:37 RSpliet: 40% is what I got from the NVIDIA compiler on my OpenCL kernels, that's my reference point.
21:37 karolherbst: which means 54% in pairs ;)
21:38 RSpliet: Pixmark is a weird benchmark. 4000 lines of shader assembly, tons of registers. That means there's lots of opportunities to issue pairs with no register dependencies among each other.
21:38 karolherbst: let me check what our rate is in avg
21:39 karolherbst: still, it isn't a shader with a high rate
21:40 RSpliet: I'm sure that my scheduling patches improve dual issue by separating such insns with register dependencies, scheduling them further apart, making it more likely for adjacent insns to pair up
21:40 RSpliet: Or... I at least tested with code to do that
21:40 karolherbst: RSpliet: I see shaders where we even do 86% dual issuing
21:40 RSpliet: we == nouveau?
21:41 karolherbst: yes
21:41 RSpliet: Are these non-trivial shaders?
21:42 karolherbst: yes
21:42 karolherbst: 51 instructions, although let me check for others
21:43 karolherbst: RSpliet: https://gist.githubusercontent.com/karolherbst/68bc6b3ff69fb22f239808041b2b4767/raw/0c54d81081986e64aebc1efe73b80defacf917f4/gistfile1.txt
21:43 karolherbst: dual issues / instruction count
21:43 karolherbst: 0.1 means 20% in pairs, 10% issued with the 0x04 sched
21:44 karolherbst: but the bigger ones seem to be around 60%
21:44 karolherbst: and that's without my pass
21:45 karolherbst: avg seems to be around 50%
21:45 karolherbst: or something
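The arithmetic behind the two ways of quoting the dual issue rate, as a small sketch. It assumes the reading given in the chat: the gist lists "dual issues / instruction count", and every dual issue covers two instructions, so the share of instructions issued in pairs is twice that ratio. The function names are made up for this example.

```python
def pair_stats(dual_issues, instruction_count):
    """Return (ratio, paired_share, alone_share) as fractions of 1."""
    ratio = dual_issues / instruction_count   # the number listed in the gist
    paired = 2 * ratio                        # each dual issue holds two insns
    alone = 1 - paired
    return ratio, paired, alone


def paired_from_alone(alone_share):
    """Share of instructions issued in pairs, given the share issued alone."""
    return 1 - alone_share


ratio, paired, alone = pair_stats(10, 100)
print(f"ratio {ratio:.2f}: {paired:.0%} in pairs, {alone:.0%} alone")
print(f"46% alone means {paired_from_alone(0.46):.0%} in pairs")
```

This is why a ratio of 0.1 corresponds to 20% of instructions being issued in pairs, and why the "20%/40%" figure earlier in the discussion describes the same shader from the two viewpoints.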
21:49 karolherbst: RSpliet: anyhow, the effort put into this isn't really worth the time for us right now
21:49 karolherbst: I am sure figuring out ZCULL would be worth it
21:50 karolherbst: or figure out where our actual bottlenecks are
21:50 karolherbst: or something like that XMAD thing, which is just much faster for compute like shaders
21:57 RSpliet: yeah... I must admit I know too little about graphics to say anything meaningful about that
21:58 RSpliet: But... could we also be too cautious with launching independent shaders in parallel (stuff that Maxwell can do to a limited degree)?
21:58 RSpliet: E.g. stick too many fences and syncs in the command FIFOs
22:41 karolherbst: RSpliet: the actual issue we have is, that we don't know