02:50 karolherbst: imirkin: I think this NV50PostRaConstantFolding still needs a lot of work :/
02:51 karolherbst: if (i->getDef(0)->reg.data.id >= 64 || i->getSrc(0)->reg.data.id >= 64) break;
02:51 karolherbst: for example
02:56 RSpliet: karolherbst: what's so bad about that? the imm notation doesn't let you address beyond reg 63..
02:57 karolherbst: RSpliet: also on those kepler gen2 chips?
02:57 RSpliet: oh that you'd have to verify in the docs
03:06 karolherbst: imirkin: so for gk110 there is an assert(!isLIMM(i->src(1), TYPE_F32)); check for MAD :/
03:09 karolherbst: RSpliet: by the way, could a post RA Pass leave holes in the used gprs? like if something removes all references to $r5, would $r5 be just unused and the max gprs used count not "fixed"?
03:09 RSpliet: karolherbst: potentially... yes
03:10 karolherbst: RSpliet: so there should be a last pass which detects such holes and replaces the highest reg id with that hole
03:10 RSpliet: karolherbst: or perhaps easier, just do a linear renaming pass, although that could get tricky with input, output and splits
03:11 karolherbst: ohhh wait
03:11 karolherbst: there is also stuff like vfetch...
03:12 karolherbst: nah, I am not doing this :d
03:12 RSpliet: think you can easily find out whether it'd be worth it?
03:12 karolherbst: yes
04:04 RSpliet: forgive me for throwing this out here: https://github.com/jbush001/NyuziProcessor
04:18 glennk: does it matter if there are holes?
04:19 glennk: think the physical regs are dynamically allocated on demand either way
04:25 karolherbst: glennk: no idea
04:25 karolherbst: I was just wondering about that
04:26 karolherbst: but I think I will try to write up some dual issuing pass, I just have no good idea how to do that the right way :/
04:26 RSpliet: glennk: it does matter, because holes could appear post-RA; demand is not updated
04:28 karolherbst: RSpliet: other question
04:28 karolherbst: RSpliet: could nouveau demand 5 registers, when 0,5,6,9 and 16 are used?
04:29 RSpliet: I'd assume not - but double check with the code, mwk or imirkin :-P
04:29 karolherbst: mhh anyway, I think I will rather spend my time on a dual-issuing pass, because I get the feeling that it is worth more
04:29 karolherbst: the blob has a really high dual issuing rate in general (above 35%)
04:30 karolherbst: where nouveau seems to be around 20%
04:44 karolherbst: yay... https://gist.github.com/karolherbst/3c528aeb055be8ca86aa
04:45 karolherbst: and I just found a pretty annoying scenario of this
04:46 karolherbst: mul ftz f32 %r1215 %r1170 %r1195; 230: mul ftz f32 %r1222 %r1215 0.500000; 233: mad ftz f32 %r1232 %r1215 0.500000 neg %r1195; 239: mad ftz f32 %r1252 %r1215 -0.500000 %r1242
04:49 karolherbst: mhh, mul => mul_x2^-1 could safe us one instruction here :/
04:49 karolherbst: but it is a pain to detect this
04:49 karolherbst: *save
04:59 kugel: who would be the person to ask about https://bugs.freedesktop.org/show_bug.cgi?id=93732 ?
05:01 imirkin: kugel: it's not at all clear what your setup is there...
05:02 kugel: GeForce 650 Ti with SW cursor enforced through xorg.conf
05:02 imirkin: solution: don't do that.
05:03 kugel: SW cursor is enforced automatically in another setup (reverse prime) so not a solution for me
05:03 imirkin: there's a reason that the *very* first acceleration that was done by graphics cards was a separate cursor
05:05 kugel: I understand that...I don't use SW cursor by choice
05:05 imirkin: anyways... not the faintest clue how sw cursor works or how it'd interact with dri
05:05 kugel: i currently can't use reverse prime
05:08 karolherbst: imirkin: do you think it is worth updating this optimization? https://github.com/chrisbmr/Mesa-3D/commit/b2e5a4b5ff2e37d84e1ef29d5af367721113cb24
05:10 imirkin: karolherbst: did you just reimplement my indirect load propagation?
05:11 karolherbst: no?
05:11 karolherbst: this isn't mine
05:11 karolherbst: and it is from 2012
05:11 imirkin: oh
05:11 imirkin: oh this is calim's thing
05:11 karolherbst: yeah
05:11 imirkin: yes, i've reimplemented it
05:11 karolherbst: okay
05:11 imirkin: without realizing that he had already written it
05:11 karolherbst: so it is already done I assume
05:11 imirkin: and i think in a much better way
05:12 imirkin: but what do i know
05:12 karolherbst: both optimizations by the way?
05:12 imirkin: no, i never got around to adding the shift-add
05:12 imirkin: that's ISCADD btw
05:12 karolherbst: though I don't know if the add,shl=>shl,add does much
05:12 imirkin: it's a single op
05:12 imirkin: which does a << imm + b
05:12 karolherbst: ohhh
05:13 karolherbst: is iscadd supported currently?
05:13 imirkin: no
05:13 imirkin: not at all
05:13 karolherbst: okay
05:13 imirkin: would have to be plumbed through everywhere
05:14 karolherbst: yeah, but I was thinking more about whether the emitter already supports it
05:14 RSpliet: is that stuff used in shaders a lot? it sounds like a useful op for compute (composing bitfields), but graphics too?
05:14 karolherbst: no idea
05:14 imirkin: RSpliet: addresses
05:15 imirkin: basically you want to do like a[5].asdf
05:15 imirkin: but a has a stride of 16, and asdf has an offset of 4
05:15 imirkin: so you do i << 4 + 4
05:15 RSpliet: ah yes!
05:15 imirkin: not _super_ common, but does come up a bunch
05:16 imirkin: floating point multiply is more common :)
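The address computation imirkin describes (element stride 16, field offset 4) can be sketched as a single shift-add. This is only a model of the arithmetic, not of how the hardware op is encoded; `shiftAdd` and `fieldOffset` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper modelling a fused shift-add: (a << shift) + b.
inline uint32_t shiftAdd(uint32_t a, unsigned shift, uint32_t b)
{
   assert(shift < 32);
   return (a << shift) + b;
}

// Byte offset of a field that sits 4 bytes into element i of an array
// whose elements are 16 bytes apart, i.e. the a[i].asdf case from the chat.
inline uint32_t fieldOffset(uint32_t i)
{
   return shiftAdd(i, 4, 4);
}
```

Note that in C++ the literal spelling `i << 4 + 4` parses as `i << 8`, because `+` binds tighter than `<<`; the parentheses in `shiftAdd` are what make it match the intended `(i << 4) + 4`.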
05:23 karolherbst: imirkin: there is not even a conflict with calims rescheduling patch :/
05:23 karolherbst: but segfaults :D
05:23 karolherbst: well make clean will do the trick I guess
05:24 imirkin: it was WIP :)
05:24 karolherbst: not the first part :p
05:24 imirkin: not pushed = WIP
05:24 karolherbst: :D
05:24 karolherbst: k
05:27 karolherbst: imirkin: I think I might try to make a really trivial pass to increase dual issuing: can i be dual issued with i->next? if not, check if swapping i->next with i->next->next helps, as long as those instructions don't depend on each other
05:27 imirkin: we do that already
05:27 karolherbst: ohh really?
05:27 imirkin: er hmmm
05:27 imirkin: maybe we don't?
05:28 karolherbst: no idea, where would it be done?
05:28 imirkin: we do something similar on nv50
05:29 karolherbst: wow
05:29 imirkin: CodeEmitter::prepareEmission
05:29 karolherbst: calims rescheduling helps :O
05:29 imirkin: in nv50_ir_target.cpp
05:29 karolherbst: dual issued instructions: 180k => 280k
05:29 karolherbst: pixmark_piano frame time dropped by 4ms
05:29 karolherbst: from 60 to 56 ms
05:31 karolherbst: imirkin: but guess what, unigine crashes cause of spilling :D
05:32 karolherbst: okay, and my patch helps with that
05:33 karolherbst: imirkin: I think we really want to have that: https://github.com/karolherbst/mesa/commit/4538c5c1a59952e97b657784d0419f1cfea8b052
05:33 imirkin: karolherbst: yeah. i thought we did that.
05:34 imirkin: didn't you write that like... a long time ago?
05:34 karolherbst: yeah
05:34 karolherbst: that's the patch
05:34 imirkin: never sent it?
05:34 karolherbst: right
05:34 imirkin: doh
05:34 karolherbst: totally forgot about it
05:34 karolherbst: will send it now
05:36 karolherbst: sent out
05:37 mwk: RSpliet: the used GPRs need to be contiguous
05:37 karolherbst: imirkin: I fear a big rescheduling pass might always mess up somewhere
05:37 mwk: you cannot use $r0, $r2, $r3 and claim you only use 3 regs
05:37 karolherbst: imirkin: especially if this pass ignores spilling
05:39 RSpliet: mwk: I reckoned, thanks for the confirmation
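mwk's point about contiguity can be put as arithmetic: the count a shader has to claim is the highest used register id plus one, so holes below that id are still paid for. A minimal sketch, assuming registers are identified by plain integer ids; `requiredGprCount` and `findHoles` are hypothetical helpers, not nouveau functions.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Registers a shader has to request when the allocation must be the
// contiguous range $r0..$rN: highest used id plus one.
inline int requiredGprCount(const std::vector<int> &usedIds)
{
   if (usedIds.empty())
      return 0;
   return *std::max_element(usedIds.begin(), usedIds.end()) + 1;
}

// Ids below the highest used one that nothing references ("holes").
inline std::vector<int> findHoles(const std::vector<int> &usedIds)
{
   std::vector<int> holes;
   const int count = requiredGprCount(usedIds);
   for (int id = 0; id < count; ++id)
      if (std::find(usedIds.begin(), usedIds.end(), id) == usedIds.end())
         holes.push_back(id);
   return holes;
}
```

With the ids from the earlier question (0, 5, 6, 9 and 16) this gives a count of 17 rather than 5, which is why a renaming pass that fills holes could pay off.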
05:40 imirkin: karolherbst: rescheduling has to take register pressure into account
05:41 karolherbst: which is not trivial to do :/
05:41 imirkin: i thought we discussed a way
05:41 imirkin: keeping track of live values within the bb
05:41 karolherbst: I was there already
05:41 karolherbst: and
05:42 karolherbst: in the end, I tripled gpr usage
05:42 karolherbst: by accident
05:42 karolherbst: I tried to reduce it real hard :/
05:42 imirkin: you need to work on your commit messages.
05:42 karolherbst: I know
05:47 karolherbst: imirkin: okay, but prepareEmission doesn't take dual issuing into account I assume
05:47 imirkin: no
05:48 imirkin: it's just a demonstration of a pass that does swapping
05:48 karolherbst: k
05:48 imirkin: to accommodate various hw requirements
05:48 imirkin: in nv50's case, attempting to minimize binary size
05:48 imirkin: since 4-byte encoded instructions have to run in pairs
05:48 jeremySal: imirkin: I checked the ViewportIndex test and it's also just writing to a[0x68]
05:48 karolherbst: ahhh
05:49 imirkin: jeremySal: yeah, i assumed it would
05:49 imirkin: jeremySal: the interesting bit is the viewport mask thing
05:49 jeremySal: Should I write a test case for writing the value from the tessellation shader?
05:49 imirkin: which allows you to broadcast a single primitive to multiple viewports
05:49 imirkin: no, i'm sure it's the same locations in all the shader stages
05:50 karolherbst: imirkin: I am thinking about doing a post-ra pass, which just assumes the current instructions will be emitted and tries to reorder them a bit better. And if that helps noticeably, we could use that until someone spends months doing it the right way
05:50 imirkin: huh?
05:50 imirkin: look at prepareEmission - it does not take *months* to copy that :p
05:50 karolherbst: :D
05:50 imirkin: or you mean do real scheduling?
05:50 karolherbst: yes
05:50 imirkin: oh yeah wtvr
05:51 karolherbst: imirkin: ohh I see the key is i->isCommutationLegal
05:51 imirkin: :)
05:53 karolherbst: imirkin: dual issuing has been a thing since kepler?
05:54 karolherbst: ohh k, if (getChipset() >= 0xe4)
06:09 imirkin: mmmm unclear if fermi has it
06:09 imirkin: or how it works
06:23 karolherbst: mhh what was the best way to swap instructions in a BB again? :/ I hoped for a util::swapInstructions function somehow :/
06:24 imirkin: check the function that already does this?
06:24 imirkin: bb->permuteAdjacent
06:25 karolherbst: ohh k thanks
06:27 jeremySal: imirkin: gl_ViewportMask[0]=1 OK but gl_ViewportMask[viewport]=1 BAD?
06:27 imirkin: mmmm
06:27 imirkin: i think you only ever want [0]
06:27 imirkin: it's a bitmask
06:27 imirkin: and there are only ever 16 viewports
06:27 imirkin: this stuff is defined like that for the hypothetical situation where you have, say, 100 viewports
06:27 imirkin: that bitmask will be multiple 32-bit values
06:28 jeremySal: oh I see
06:28 imirkin: same with gl_SampleMask{,In}
06:28 jeremySal: it's not an array of bits :)
06:28 imirkin: it's a bit of arrays :)
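The bitmask layout imirkin describes (an array of 32-bit words, one bit per viewport) can be sketched as an index computation. `MaskBit` and `viewportMaskBit` are hypothetical names for illustration; the sketch only shows why 16 viewports never need more than element [0].

```cpp
#include <cassert>
#include <cstdint>

struct MaskBit {
   unsigned word;   // index into the array of 32-bit mask words
   uint32_t bit;    // bit to set within that word
};

// Maps a viewport number to its position in a bitmask stored as an
// array of 32-bit values.
inline MaskBit viewportMaskBit(unsigned viewport)
{
   return MaskBit{ viewport / 32u, uint32_t(1) << (viewport % 32u) };
}
```

Every viewport from 0 to 15 lands in word 0; a second word would only come into play from viewport 32 upwards, which is the hypothetical 100-viewport case.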
06:31 jeremySal: does << 0 fill in the shader language or wrap around?
06:31 karolherbst: imirkin: isCommutationLegal just allowed 1: linterp pass f32 $r0 a[0x74] (8) 2: ld u32 $r1 c0[0x34] (8) to be swapped :/
06:35 karolherbst: or is this okay? :/
06:35 karolherbst: mhh weird
06:35 karolherbst: then something else is odd
06:45 mwk: okay, so there's no i2i64 instruction
06:45 mwk: good that's confirmed...
06:51 karolherbst: imirkin: 3% more dual issuing in the first unigine heaven scene :/
06:51 karolherbst: I hoped for more
06:57 jeremySal: imirkin: It seems like gl_ViewportMask is stored in a[0x3a0]
07:22 karolherbst: imirkin: any idea how this can be made a bit smarter? https://github.com/karolherbst/mesa/commit/9661c9bccf45dfe80bd11f8e9ce536fc27be99bc
07:25 mwk: whee, another hw bug
07:25 mwk: cvt u64 -> f32/f64 with neg modifier has wrong sense of rm/rp rounding modes
07:26 RSpliet: karolherbst: are you sure you need an extra pass for this, rather than moving the original pass to post-post-RA-optimisations?
07:26 mwk: ie. it negates *after* rounding, as opposed to s64/u32/s32/u16/s16 -> f*, which negate before rounding
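Why the order of negation and rounding matters for directed rounding modes can be shown with a small model. This sketch rounds to integers with `floor`/`ceil` as stand-ins for rm/rp, which is an assumption made for readability; the hardware rounds to the float mantissa, but the asymmetry is the same.

```cpp
#include <cassert>
#include <cmath>

enum class Round { Minus, Plus };   // stand-ins for the rm / rp modes

// Directed rounding to an integer grid (model only).
inline double roundTo(double x, Round mode)
{
   return mode == Round::Minus ? std::floor(x) : std::ceil(x);
}

// Negate first, then round: the behaviour described for s64/u32/s32/...
inline double negThenRound(double x, Round mode) { return roundTo(-x, mode); }

// Round first, then negate: the behaviour described for u64 sources.
inline double roundThenNeg(double x, Round mode) { return -roundTo(x, mode); }
```

For x = 2.5 under rm, negating first gives -3 while negating afterwards gives -2, which is what rp would have produced on the negated input. That is the "wrong sense" of rm/rp mwk mentions.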
07:27 karolherbst: RSpliet: no idea, I want to do that post-ra first just to check how high the gain is or if this opt needs more information
07:27 karolherbst: RSpliet: in fact this only reordered two instructions in a 3850 instruction program :/
07:34 mwk: huh
07:34 mwk: I found a u64/s64 -> f16 convert insn
07:34 mwk: curiouser and curiouser.
07:38 mwk: hm, no, wait
07:38 mwk: it's a u32/s32 -> f16 duplicate
07:39 mwk: with some missing flag
07:39 karolherbst: RSpliet: ohhh I ran this pass only for the first bb :(
07:40 karolherbst: nice
07:41 karolherbst: now I also see the result
07:41 karolherbst: RSpliet: 180k dual issued => 200k dual issued
08:04 karolherbst: hakzsam: thanks for the counter work by the way, now it also helped me a bit :D
08:04 hakzsam: karolherbst, nice to hear :)
08:06 karolherbst: RSpliet: which original pass do you mean by the way?
08:07 RSpliet: karolherbst: sorry, I read your code wrongly, didn't see you tried pairwise local sched decisions
08:07 karolherbst: ahh k
08:07 RSpliet: it seems though as if this approach might conflict with a desire for scheduling for minimum liveness - to reduce the number of registers used
08:08 karolherbst: anyway here is the newest version: https://github.com/karolherbst/mesa/commit/dccd14057ca79458a7679fc4aaec9e6a76ec4dbb
08:08 karolherbst: if you have any ideas how to gain more out of this
08:08 karolherbst: please feel free to share :)
08:08 karolherbst: RSpliet: this is post ra
08:08 karolherbst: there is no change in reg usage
08:08 karolherbst: I just swap instructions which don't depend on each other
08:09 karolherbst: like add $r2 $r1 $r1; mul $r4 $r1 $r3
08:09 RSpliet: mmm, I take it that that constraint is part of isCommutationLegal then?
08:09 karolherbst: yeah
08:10 RSpliet: makes sense
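The "don't depend on each other" condition can be sketched as a register-overlap test on two adjacent post-RA instructions. This is a simplified model and not the actual isCommutationLegal code: the `Insn` struct and `canSwap` are hypothetical, and real instructions also carry predicates, flags and memory effects that are ignored here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical post-RA instruction: just the register ids it writes and reads.
struct Insn {
   std::vector<int> defs;
   std::vector<int> srcs;
};

// True if any register id appears in both lists.
inline bool overlaps(const std::vector<int> &a, const std::vector<int> &b)
{
   return std::any_of(a.begin(), a.end(), [&](int r) {
      return std::find(b.begin(), b.end(), r) != b.end();
   });
}

// Two adjacent instructions (a before b) may trade places if neither
// reads or overwrites a register the other one writes.
inline bool canSwap(const Insn &a, const Insn &b)
{
   return !overlaps(a.defs, b.srcs) &&   // b reads a's result (RAW)
          !overlaps(b.defs, a.srcs) &&   // b overwrites a's input (WAR)
          !overlaps(a.defs, b.defs);     // both write the same reg (WAW)
}
```

The pair from the chat, `add $r2 $r1 $r1; mul $r4 $r1 $r3`, passes this test because the two only share the source $r1.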
08:10 karolherbst: it actually improves performance a bit in pixmark_piano so I guess it works
08:10 karolherbst: also the max gpr count stays the same
08:10 RSpliet: good stuff
08:11 karolherbst: but the benefit isn't as big as I hoped ./
08:11 mwk: huh, cvt f64 -> u64 is nice and sets an "overflow/underflow/nan" flag in CC
08:11 karolherbst: my goal is 30% dual issueing :D
08:11 mwk: as opposed to f32 -> u32
08:13 mwk: and f64 -> f64 preserves NaN payload... f32 just always stuffs 0x7fffffff everywhere
08:13 mwk: except it converts SNaN to QNaN
08:13 karolherbst: RSpliet: I want to keep that pass as simple as possible though, the "better" work should be then a real rescheduling pass
08:13 mwk: seems someone started taking IEEE 754 seriously
08:14 karolherbst: :D
08:14 RSpliet: karolherbst: good stuff
08:15 karolherbst: RSpliet: is there something from the post-ra thing which gets removed in the emit step by the way?
08:16 RSpliet: what do you mean?
08:16 karolherbst: because then I could also do stuff like : ins1, removed_by_emit, ins2, ins3 => ins1, removed_by_emit, ins3, ins2
08:17 karolherbst: RSpliet: like phi instructions which get removed in the RA step
08:17 RSpliet: oh idk, I think not tbh
08:23 karolherbst: set and mov can't be dual issued?
08:29 karolherbst: set u8 $p0 ge s32 $r11 2 <== this instruction is fixed because it sets a predicate?
08:43 karolherbst: mupuf: wanna check if you see any benefit from this commit in julia or other non memory bottlenecked stuff? https://github.com/karolherbst/mesa/commit/dccd14057ca79458a7679fc4aaec9e6a76ec4dbb
08:49 karolherbst: RSpliet: running the pass twice: 200k->220k dual issued
08:51 mwk: and of course f32 -> u64 works differently than f32 -> u32
08:51 mwk: it supports denormals and the nice CC overflow flag
08:53 imirkin_: jeremySal: could you keep track of your various mmt's -- you could be missing bits of it... like what bit gets set in the shader header?
08:53 imirkin_: jeremySal: and are there additional items that get set in the cmd stream?
08:53 imirkin_: jeremySal: look for unknown methods which get set when using the viewportmask thing compared to when not
08:59 jeremySal: imirkin: I am storing the mmts produced, I just don't really know what's significant
09:00 imirkin_: yeah, as expected :)
09:00 imirkin_: would you terribly mind posting them somewhere, and i can upload to a shared location?
09:00 imirkin_: preferably including the shader_test's that they're traces of
09:01 jeremySal: imirkin: the string "unk" shows up in both traces
09:01 jeremySal: a lot
09:01 imirkin_: that's fine
09:01 imirkin_: you're looking for differences
09:01 imirkin_: not similarities ;)
09:02 jeremySal: yeah, I ran one shader where it writes an array output
09:02 jeremySal: another where it writes the ViewportMask
09:03 imirkin_: it should compare one where it writes ViewportMask to one that writes ViewportIndex
09:03 jeremySal: oh, ok
09:03 imirkin_: (i think)
09:04 imirkin_: for all i know, nothing extra needs to be done
09:04 imirkin_: the question does come of where in the shader header this should go though
09:04 imirkin_: i assume it's one of the "obvious" bits
09:04 imirkin_: but hunting around for that stuff is a *huge* chore
09:06 karolherbst: hakzsam: what could I do to improve the "gallium_hud: all queries are busy after 8 frames, can't add another query" situation?
09:07 imirkin_: karolherbst: you should be careful not to *mess up* dual-issue
09:07 imirkin_: karolherbst: i.e. if i + next can be dual-issued, you should skip ahead by 2
09:08 karolherbst: imirkin_: we don't know that at that point
09:08 imirkin_: karolherbst: huh?
09:08 karolherbst: it could be that the first one can't be dual issued
09:08 imirkin_: if (!next || next->fixed || next->asFlow() || target->canDualIssue(i, next))
09:08 imirkin_: + continue;
09:08 karolherbst: because it is the 7th instruction in a block
09:08 karolherbst: which is never dual issued
09:08 jeremySal: imirkin: r
09:08 karolherbst: then we would skip the first of the next block
09:08 jeremySal: http://www.columbia.edu/~jas2312/dump.tar.xz
09:08 imirkin_: karolherbst: that's less likely
09:09 imirkin_: karolherbst: i'm talking about let's say you have A, B, C
09:09 imirkin_: A + B can dual-issue, C + B can dual-issue, but A + C can't
09:09 imirkin_: nor can B + C
09:09 imirkin_: you might end up trying to swap B,C
09:10 imirkin_: so once you've decided that A + B can dual-issue, you should jump directly to C
09:10 karolherbst: how?
09:10 karolherbst: if A+B can be dual issued
09:10 karolherbst: I check B,C,D
09:10 karolherbst: but I can only swap C and D at that point
09:11 karolherbst: the current instruction won't be swaped around
09:12 karolherbst: I only swap if A + B can't be dual issued, but A+C can
09:13 karolherbst: but I also added a A+B+C+D check now, because there is a significant benefit from this
09:13 karolherbst: so if A+D can be dual issued, I swap to A+D+B+C
09:13 imirkin_: karolherbst: if A + B can be dual-issued
09:13 imirkin_: then you hit the continue case
09:13 imirkin_: then i becomes B
09:13 imirkin_: and the logic repeats
09:14 imirkin_: at which point you might decide to swap B+C
09:14 imirkin_: but perhaps A + C can't be dual-issued
09:14 imirkin_: so you lose that dual-issue
09:14 karolherbst: no, I never lose any dual issue from the instruction I already handled
09:14 imirkin_: dual-issue takes 2 instructions
09:14 imirkin_: but you process instructions one at a time
09:15 karolherbst: if I look at A+B+C+D
09:15 karolherbst: and after I am done with A
09:15 karolherbst: B stays where it is
09:15 karolherbst: so the only change possible is A+B+D+C
09:15 karolherbst: and this can't affect A
09:15 karolherbst: and can only benefit B
09:16 karolherbst: If I would skip to C, I might lose the benefit of B+D being dual issuable
09:16 karolherbst: and A in a bad spot where it won't dual issue with B
09:17 karolherbst: so I can in fact lose the opportunity of dual issuing an instruction when I jump by 2
09:19 karolherbst: or I made a mistake and didn't see it, :/
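The pass being argued about can be sketched roughly as follows. This is a toy model under loud assumptions: `canPair` is a made-up stand-in for the target's real dual-issue rules (here: different execution units and no direct dependency), `independent` is a reduced dependency test, and the loop advances by one instruction as karolherbst describes, rather than skipping two after a successful pair as imirkin suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Insn {
   char unit;   // hypothetical execution unit tag
   int def;     // register written
   int src;     // register read
};

// Assumption for this sketch only: two instructions pair if they use
// different units and the second does not read the first one's result.
inline bool canPair(const Insn &a, const Insn &b)
{
   return a.unit != b.unit && b.src != a.def;
}

// Reduced dependency test between two adjacent instructions.
inline bool independent(const Insn &a, const Insn &b)
{
   return a.def != b.src && b.def != a.src && a.def != b.def;
}

// If bb[i] cannot pair with its successor but could pair with the one
// after it, swap those two, provided they do not depend on each other.
// Returns the number of swaps performed.
inline int improvePairing(std::vector<Insn> &bb)
{
   int swaps = 0;
   for (std::size_t i = 0; i + 2 < bb.size(); ++i) {
      if (canPair(bb[i], bb[i + 1]))
         continue;
      if (canPair(bb[i], bb[i + 2]) && independent(bb[i + 1], bb[i + 2])) {
         std::swap(bb[i + 1], bb[i + 2]);
         ++swaps;
      }
   }
   return swaps;
}
```

Because bb[i] itself is never moved, a pair already formed with its predecessor stays intact; whether stepping by one or by two gives more pairs overall is exactly the open question in the exchange above.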
09:23 * mwk likes the Tesla fp64 instructions
09:23 mwk: they behave so much better than fp32
09:26 karolherbst: mwk: I guess they are much slower though
09:26 mwk: karolherbst: can't have everything :(
09:28 karolherbst: hakzsam: I get this already when just having inst_executed and inst_issued2 :/
09:30 mwk: ugh, I'll need another refactor now
10:04 wvuu: is the overhead of mem copy a lot?
10:04 imirkin_: compared to?
10:05 wvuu: In the context of the new AMD design HSA, where many diverse vendors are on board.
10:05 imirkin_: zero-copy is a great thing to have, yes
10:05 wvuu: imirkin_: don't you want to change your nick back without the underscore?
10:05 wvuu: your nick has been split :/
10:06 imirkin_: that said it's not magic -- if you want to transfer data between CPU-side memory and GPU-side memory, no amount of HSA will avoid a copy :)
10:06 imirkin_: nah, i just have a clone.
10:06 wvuu: so aside from coding simplicity will there be performance gains?
10:06 imirkin_: between what and what? :)
10:06 wvuu: I've seen the round trips that current graphics code does and it is crazy.
10:06 wvuu: from current method and HSA.
10:07 imirkin_: current method and HSA for...?
10:07 imirkin_: HSA doesn't magically make anything faster
10:07 imirkin_: it just unifies virtual memory areas
10:07 wvuu: ok I know there are other, more advanced uses such as modeling etc, but for the sake of simplicity, an emulator or a PC game.
10:07 imirkin_: thoroughly unlikely
10:07 wvuu: oh really?
10:08 imirkin_: i probably haven't considered all the various effects, but i can't think of anything offhand
10:08 wvuu: so the extra code layer that is now required to pass everything between CPU and GPU back and forth is not really an overhead?
10:08 imirkin_: HSA does enable programming models other than GL's that would benefit on a UMA platform, but... meh
10:09 imirkin_: the slowness comes from passing data back and forth
10:09 imirkin_: not from the code which performs the movement
10:09 wvuu: I see.
10:09 imirkin_: HSA doesn't eliminate the GPU's need to access the data
10:09 imirkin_: and GPUs today can DMA from system memory directly
10:10 imirkin_: without any sort of HSA
10:12 wvuu: so where's the benefit, portability as it eliminates some arch specific code?
10:12 imirkin_: it only eliminates code if everyone does it
10:12 imirkin_: otherwise it just adds code :)
10:13 imirkin_: from what i can tell it's *primarily* a marketing stunt
10:13 wvuu: lol
10:13 imirkin_: the benefits will come primarily in compute applications
10:13 imirkin_: but not for graphics apps
10:13 wvuu: that's what I figured.
10:14 wvuu: oh sad, I was expecting some magical performance boost.
10:14 imirkin_: convenient to create a bunch of structures on the CPU, have them point to one another, and then just ship that data to the GPU and have the GPU be able to also chase the same pointers
10:14 wvuu: overclock.net --> 'HSA + AMD APUs means 500% increases in application'
10:16 wvuu: that gives the impression that the high end emulator could run at 100% speed, 120FPs, 4k custom textures, etc.
10:16 karolherbst: hakzsam: with the pgpu stuff from apitrace, I just write the output out, check which are the most expensive draw calls and check what shaders are attached to that?
10:18 jeremySal: imirkin: not trying to bug you, but I think you may have missed my message that I uploaded the vertex shaders and mmts: http://www.columbia.edu/~jas2312/dump.tar.xz
10:19 imirkin_: jeremySal: i didn't miss it... i saw it... but then forgot all about it :)
10:20 jeremySal: up to an isomorphism
10:21 karolherbst: wut
10:21 karolherbst: https://gist.github.com/karolherbst/af38f5cfc8968b549d92
10:21 karolherbst: why are glClear calls so expensive?
10:22 karolherbst: these are the top 7 calls in my saints row iv trace
10:22 imirkin_: probably because they end up having to wait on something
10:22 karolherbst: ohhh wait
10:22 karolherbst: wrong column....
10:23 karolherbst: the third one is a timestamp :/
10:23 imirkin_: hehehe
10:23 karolherbst: now it looks sane
10:23 karolherbst: there is one expensive glclear though
10:23 imirkin_: jeremySal: http://people.freedesktop.org/~imirkin/traces/gm206/ -- thanks!
10:24 karolherbst: that trace is 7GB by the way :/
10:26 jeremySal: imirkin: is ... something done?
10:27 imirkin_: i just uploaded your files to my traces dir :)
10:28 imirkin_: jeremySal: basically it'd be cool if you went through all the new stuff that GM20x added and traced it
10:28 imirkin_: there's conservative raster now too, whatever that is
10:28 imirkin_: and sample locations
10:28 jeremySal: Is there an nvidia marketing term or an open version or something that corresponds to GM20X features?
10:28 jeremySal: *opengl version
10:28 imirkin_: maxwell 2nd generation maybe?
10:28 imirkin_: you can also go here: https://www.opengl.org/registry/
10:29 imirkin_: and look towards the end for GL_NV_stuff
10:31 karolherbst: imirkin_: who does something like that? :/ Temp[2] = uintBitsToFloat(uvec4(greaterThanEqual(vec4(uintBitsToFloat(1056964608u), uintBitsToFloat(1056964608u), uintBitsToFloat(1056964608u), uintBitsToFloat(1056964608u)), Temp[2])) * 0xFFFFFFFFu);
10:32 imirkin_: karolherbst: HLSL -> GLSL converters
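What that generated GLSL expression computes per component can be unpacked in a few lines. A sketch with hypothetical helper names (`geMask`, plus a C++ `uintBitsToFloat` mirroring the GLSL built-in): the constant 1056964608u is the bit pattern 0x3F000000, i.e. 0.5f, and the multiply by 0xFFFFFFFFu turns a 0/1 comparison result into an all-zeros/all-ones mask, which is how HLSL-style booleans are represented.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterprets the bits of a 32-bit unsigned value as a float,
// like the GLSL built-in of the same name.
inline float uintBitsToFloat(uint32_t bits)
{
   float f;
   std::memcpy(&f, &bits, sizeof f);
   return f;
}

// One lane of the expression: all ones if threshold >= x, else zero.
inline uint32_t geMask(float threshold, float x)
{
   return uint32_t(threshold >= x) * 0xFFFFFFFFu;
}
```

So each component of Temp[2] ends up holding the mask for `0.5 >= Temp[2].c`, reinterpreted as a float.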
10:32 karolherbst: I suppose
10:33 karolherbst: but they enable GL_ARB_separate_shader_objects when it is there
10:34 karolherbst: and then like nothing really depends on that :/
10:34 imirkin_: layout (location=x) does
10:34 imirkin_: on varyings
10:34 imirkin_: i think.
10:34 jeremySal: imirkin: I found a list http://blog.icare3d.org/2014/09/maxwell-gm204-opengl-extensions.html
10:35 imirkin_: yeah that sounds right
11:03 karolherbst: imirkin_: okay, that is odd, same glsl, no loop, no branching, just a bunch of instructions, but the blob has a higher instruction count than mesa :/
11:05 imirkin_: what can i say... our compiler is better? :)
11:10 mwk: so apparently fp64 instructions are NaN-preserving... except sNaNs are converted to qNaNs
11:11 karolherbst: mhhh
11:11 karolherbst: imirkin_: the blob doesn't use floor?
11:11 karolherbst: and it does a few cvts
11:12 karolherbst: imirkin_: instruction counts: https://gist.github.com/karolherbst/a58c9fd70f9c5cf52c47
11:12 karolherbst: the shader is rather small too, so maybe it is possible to get some information out of that
11:16 karolherbst: imirkin_: the only really obvious change is that they do much more between tex and texbar
11:16 imirkin_: karolherbst: yeah that's way better
11:16 imirkin_: allows the texture latency to be hidden
11:17 karolherbst: imirkin_: I could do a post_ra pass like the dual-issue one, which just moves those texs up and texbars down
11:17 karolherbst: :D
11:17 imirkin_: karolherbst: well, the texbar's are added post-ra
11:18 imirkin_: the key is to move the uses of the tex as far away as possible before doing RA
11:18 karolherbst: imirkin_: okay, so the texbars are already added as late as possible
11:18 imirkin_: texbar's are added as needed
11:18 karolherbst: so only the texs need to be moved up before RA
11:18 imirkin_: basically a tex doesn't write to its dst regs immediately
11:18 imirkin_: it does it "eventually"
11:18 imirkin_: and a texbar will serialize it
11:18 imirkin_: so if you do $r1 = tex()
11:19 imirkin_: and then try to use $r1, the tex may or may not have completed
11:19 imirkin_: so we throw in a texbar
11:19 imirkin_: and there are additionally annoying WaW scenarios
11:19 imirkin_: where you *don't want* to use the tex's output
11:19 imirkin_: but you just happen to use $r1
11:19 imirkin_: the tex can still come in and smash your value
11:20 imirkin_: WaW = write-after-write btw
11:20 karolherbst: k
11:20 imirkin_: there used to be some holes in that logic
11:20 imirkin_: but i rewrote it to be much dumber
11:20 imirkin_: and guaranteed to work
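The read-after-write and write-after-write cases imirkin lists can be modelled with a simple scan. This is a deliberately conservative toy, in the spirit of "much dumber and guaranteed to work", and not nouveau's actual texbar logic: the `Insn` struct and `barrierPoints` are hypothetical, each instruction writes a single register, and one barrier is assumed to wait for every outstanding tex.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct Insn {
   bool isTex;              // result arrives "eventually"
   int def;                 // register written
   std::vector<int> srcs;   // registers read
};

// Returns the indices of instructions that need a barrier in front of them.
inline std::vector<std::size_t> barrierPoints(const std::vector<Insn> &code)
{
   std::set<int> pending;   // registers with a tex result still in flight
   std::vector<std::size_t> points;
   for (std::size_t i = 0; i < code.size(); ++i) {
      const Insn &insn = code[i];
      bool hazard = pending.count(insn.def) != 0;     // write-after-write
      for (int s : insn.srcs)
         hazard = hazard || pending.count(s) != 0;    // read-after-write
      if (hazard) {
         points.push_back(i);
         pending.clear();   // conservative: wait for everything
      }
      if (insn.isTex)
         pending.insert(insn.def);
   }
   return points;
}
```

The further the first hazardous instruction sits from the tex, the more work is available to hide the texture latency, which is the motivation for moving tex uses away before RA.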
11:20 karolherbst: okay
11:20 jeremySal: imirkin: How do I enable something like glEnable(GL_CONSERVATIVE_RASTERIZATION_NV); in the shader_test language
11:20 karolherbst: then what are the sources of a tex instruction?
11:21 karolherbst: those "$r2:$r3" thingies?
11:24 imirkin_: regs 2 and 3?
11:24 imirkin_: jeremySal: mmmmm.... iirc there's a "enable ASDF"
11:24 imirkin_: jeremySal: but it might be that you can't do some of the stuff and will have to write actual code
11:25 karolherbst: imirkin_: ohh that was the layout from demmt
11:25 karolherbst: tex 2D $r10 $s0 f32 $r0t $r0d
11:25 imirkin_: karolherbst: right
11:25 imirkin_: $r0d = $r0:$r1
11:26 Jayhost: karolherbst how would you recommend learning about power management. Engine and memory reclocking?
11:27 karolherbst: Jayhost: depends which area
11:27 karolherbst: Jayhost: and what the goal is
11:29 mwk: so hmm... I need a uint128_t
11:30 mwk: with addition, multiplication and shifts
11:30 mwk: I could throw in GMP, but that sounds like an overkill..
11:31 imirkin_: there's a gcc __int128_t type
11:31 mwk: imirkin_: which is, quite inconveniently, not supported on 32-bit platforms
11:31 imirkin_: that is quite inconvenient...
11:31 imirkin_: #if 64bit? :)
11:32 mwk: that'd be quite a useful extension otherwise
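For the three operations mwk lists, a 128-bit value can be built from two 64-bit halves without relying on the gcc extension. A minimal sketch using only standard 64-bit arithmetic; `U128`, `add128`, `mul64x64` and `shl128` are hypothetical names, and the multiply covers only the 64x64 to 128 bit case.

```cpp
#include <cassert>
#include <cstdint>

struct U128 {
   uint64_t hi;
   uint64_t lo;
};

// 128-bit addition (wraps on overflow of the high word).
inline U128 add128(U128 a, U128 b)
{
   U128 r;
   r.lo = a.lo + b.lo;
   r.hi = a.hi + b.hi + (r.lo < a.lo ? 1 : 0);   // carry out of the low word
   return r;
}

// Full 64x64 -> 128 bit product from four 32x32 -> 64 bit partial products.
inline U128 mul64x64(uint64_t a, uint64_t b)
{
   const uint64_t aLo = a & 0xFFFFFFFFu, aHi = a >> 32;
   const uint64_t bLo = b & 0xFFFFFFFFu, bHi = b >> 32;

   const uint64_t p0 = aLo * bLo;
   const uint64_t p1 = aLo * bHi;
   const uint64_t p2 = aHi * bLo;
   const uint64_t p3 = aHi * bHi;

   // Middle column: a sum of three values below 2^32, so it fits in 64 bits.
   const uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

   U128 r;
   r.lo = (mid << 32) | (p0 & 0xFFFFFFFFu);
   r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
   return r;
}

// Left shift by n bits, 0 <= n <= 127.
inline U128 shl128(U128 v, unsigned n)
{
   assert(n < 128);
   if (n == 0)
      return v;
   if (n >= 64)
      return U128{ v.lo << (n - 64), 0 };
   return U128{ (v.hi << n) | (v.lo >> (64 - n)), v.lo << n };
}
```

On 64-bit targets the same results could be cross-checked against the compiler's 128-bit type behind an `#if`, as imirkin hints.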
11:35 Jayhost: karolherbst on gtx750ti if I do cat /sys/class/drm/card0/device/pstate I only have 2 frequencies. When I echo to 1450mhz my screen will go black. I assume this is because engine/memory reclocking are not implemented. I would like to do what I can to build and test experimental reclocking code.
11:35 karolherbst: ahh maxwell
11:36 karolherbst: mhh
11:36 karolherbst: actually this should kind of work, just maybe the voltage is too low
11:36 mwk: hmm
11:36 karolherbst: we have that problem on kepler
11:36 mwk: I can do dadd without needing 128-bit ints, let's do that first
11:36 karolherbst: and it is a pain
11:37 karolherbst: Jayhost: first you need to ssh into the machine and check what dmesg gives you after you try to change the pstate
11:37 karolherbst: Jayhost: did you compile your kernel yourself already?
11:38 karolherbst: Jayhost: there is a nvkm_volt_map function inside nouveau/nvkm/subdev/volt/base.c
11:38 karolherbst: replace info.min with info.max
11:38 karolherbst: that should usually help
11:38 karolherbst: if not, then you can go all creative and figure out what's wrong :)
11:40 Jayhost: karolherbst very cool
11:43 karolherbst: imirkin_: shouldn't isCommutationLegal return true for mul f32 %r313 %r301 c2[0x94] and tex 2D $r9 $s0 f32 %r320 %r306 %r309 ? or is there something hidden somewhere I don't see
11:43 imirkin_: i don't see why not
11:43 imirkin_: it might just bail on tex's entirely, dunno
11:49 mwk: huh, you learn something new every day
11:49 mwk: apparently 0-0 is 0, except with "round to -inf" rounding mode, then it's -0
11:52 karolherbst: makes sense somehow
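The signed-zero behaviour mwk mentions is standard IEEE 754 and can be observed on the CPU as well. A small sketch using `<cfenv>`; `zeroMinusZero` is a hypothetical helper, and the `volatile` qualifiers are there on the assumption that they keep the compiler from folding the subtraction at compile time (strictly speaking, rounding-mode changes also want floating-point environment access enabled in the compiler).

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Computes (+0.0) - (+0.0) under the given <cfenv> rounding mode and
// restores the previous mode afterwards.
inline double zeroMinusZero(int roundingMode)
{
   const int saved = std::fegetround();
   std::fesetround(roundingMode);
   volatile double a = 0.0, b = 0.0;
   volatile double r = a - b;   // evaluated at run time in the chosen mode
   std::fesetround(saved);
   return r;
}
```

Checking the result with `std::signbit` distinguishes +0 from -0, which compare equal under `==`.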
11:59 karolherbst: imirkin_: http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/codegen/nv50_ir.cpp#n510
11:59 karolherbst: idA and idB are both -4
11:59 karolherbst: because you know, id is -1
11:59 imirkin_: right, this can only work post-ra
11:59 karolherbst: ahhh
12:00 karolherbst: makes sense somehow
12:00 karolherbst: I guess there is no pre-RA version of isCommutationLegal?
12:00 imirkin_: i guess not
12:01 karolherbst: imirkin_: then I check if the sources of the tex are in the previous instruction, and if they are not, they should be safe to swap? Or is there some corner case I have to respect?
12:02 imirkin_: basically yeah... if the ValueRef's of op1's uses don't appear in op2, then you can swap
12:03 imirkin_: er, op1's dst's uses
12:03 hakzsam: karolherbst, yeah, you can do that way
12:25 karolherbst: jikes, I moved a join :O
12:38 imirkin_: huh. that should have been a flow instruction... i think
12:39 karolherbst: currently I am still failing with those use/ref things :/
12:44 pmoreau: Anyone still needing Reator?
12:54 karolherbst: pmoreau: is it on ?
12:55 karolherbst: ohh hakzsam
12:55 karolherbst: pmoreau: I guess he is still using it
12:55 hakzsam: karolherbst, no, have fun
12:55 karolherbst: :D
12:55 karolherbst: k
12:56 karolherbst: I thought you always run 10 hour processes on reator and collect the data the next day :p
13:01 karolherbst: mhhh
13:01 karolherbst: slowly I think maybe the compiler isn't the issue we have :/
13:02 RSpliet: karolherbst: it's one of many tiny factors
13:02 RSpliet: and it differs from app to app obviously
13:02 karolherbst: mhh, I doubt that it is all scheduling though
13:03 RSpliet: scheduling becomes important for big shaders, if you can lower the GPR you can increase parallelism - better ways to hide mem latency resulting in improved throughput
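RSpliet's link between GPR count and parallelism can be made concrete with a back-of-the-envelope occupancy estimate. A sketch only: `residentWarps` is a hypothetical helper, all hardware limits are passed in as parameters, and allocation granularity and other per-SM limits are ignored. The figures in the usage note are illustrative inputs, not values confirmed for any particular chip.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on warps that can be resident at once, limited either by
// the register file or by a fixed hardware cap.
inline int residentWarps(int gprsPerThread, int regFileSize,
                         int threadsPerWarp, int maxWarps)
{
   assert(gprsPerThread > 0 && threadsPerWarp > 0);
   const int byRegs = regFileSize / (gprsPerThread * threadsPerWarp);
   return std::min(byRegs, maxWarps);
}
```

For example, with an assumed register file of 65536 entries, 32 threads per warp and a cap of 64 warps, a shader using 32 GPRs reaches the cap, while one using 44 GPRs is limited to 46 warps. Fewer resident warps means less latency hiding even when nothing spills.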
13:03 karolherbst: RSpliet: yeah, I moved the tex instruction up now
13:03 karolherbst: gives me a steady 1.5% perf increase in unigine
13:03 RSpliet: that's pretty damn good
13:03 karolherbst: but I think we miss something big
13:04 RSpliet: or 25 small things
13:04 RSpliet: what's the reg usage of unigine shaders?
13:04 karolherbst: well
13:04 karolherbst: there are hundreds
13:04 karolherbst: but most of them are below 32
13:04 RSpliet: what's the worst?
13:05 RSpliet: (nicht der wurst...)
13:05 karolherbst: 44
13:05 RSpliet: oh that's not terrible I think
13:05 karolherbst: no
13:05 karolherbst: there is no spilling
13:05 RSpliet: sorry if I state the obvious, but it's not just spilling that matters
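(Editor's sketch of why GPR count matters even without spilling: the register file is shared by all resident warps, so more registers per thread means fewer warps available to hide memory latency. The defaults below, 65536 registers and 64 resident warps per SMX with 32 threads per warp, are assumed Kepler figures and should be checked against the docs. Register allocation granularity is ignored, and `resident_warps` is a made-up name.)

```python
def resident_warps(gprs_per_thread, regfile=65536, warp_size=32, max_warps=64):
    # Warps that fit in the register file, capped by the hardware warp limit.
    # All default values are assumptions about a Kepler SMX.
    by_registers = regfile // (gprs_per_thread * warp_size)
    return min(max_warps, by_registers)

for gprs in (32, 44, 63):
    print(gprs, resident_warps(gprs))
```

Under these assumptions a shader at 32 GPRs can still fill the SMX, while the 44-GPR worst case mentioned above already gives up some resident warps.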
13:06 karolherbst: maybe we're really missing something z buffer related...
13:07 karolherbst: I bet we could check how many instructions the blob executes in total with my trace
13:07 karolherbst: and if the perf difference is close enough :/
13:08 RSpliet: can you get metrics like L1, L2 misses, pipeline stalls?
13:08 karolherbst: don't know
13:08 karolherbst: I only played around with instruction count and dual issuing
13:10 RSpliet: perf counters should help with that
13:10 karolherbst: yeah
13:11 karolherbst: funny how only this tex pass helps unigine
13:11 karolherbst: all the other stuff I did doesn't change a thing
13:12 RSpliet: it's in the margin
13:12 karolherbst: well I also didn't look at unigine shaders
13:13 karolherbst: but I hoped maybe the dual issue pass might make a difference, but maybe unigine is too much memory bound for this
13:14 karolherbst: mhh, the tex pass makes spilling worse :/
13:15 imirkin_: coz you're extending the live intervals
13:15 karolherbst: total local used in shared programs : 5673 -> 5877 (3.60%)
13:15 karolherbst: yeah I know
13:15 karolherbst: but I hoped it wouldn't be that much
13:19 RSpliet: "Scheduling is hard!" - any Real-Time Systems bloke
13:20 karolherbst: maybe I shouldn't move it that far high
13:20 karolherbst: any ideas how many instructions should be between the tex and the texbar?
13:20 karolherbst: or I could just limit it to 8 instructions max
13:20 RSpliet: karolherbst: likely that depends on the GPR usage
13:21 karolherbst: I don't want to be smart now :D
13:21 karolherbst: I am sure that moving the tex up 30 instructions might be too much in any case
13:22 RSpliet: I personally think this tex-texbar distance should be seen in a broader sense of scheduling
13:23 RSpliet: rather than as an ad-hoc peephole-pass
13:23 RSpliet: (although I do appreciate you experimenting with it ;-))
13:23 karolherbst: RSpliet: there is no texbar in pre-ra ;)
13:24 RSpliet: sure
13:24 karolherbst: uhhhh
13:24 karolherbst: saints row IV really benefits from this :O
13:24 karolherbst: most expensive call before: 163600352, after: 96687712
13:25 karolherbst: and the numbers are smaller in general
13:26 karolherbst: RSpliet: yeah, I already told imirkin_ that I only want to do some dumb passes which kind of improve the situation. A real scheduling pass will take months
13:26 karolherbst: I won't spend more than a few days on this, so
13:26 karolherbst: and gaining 1% from a day of work is good enough :)
13:28 karolherbst: nice
13:28 imirkin_: i guess it'd be helpful for me to have hw on which i could actually benchmark stuff
13:28 imirkin_: o well
13:28 karolherbst: :D
13:45 glennk: imirkin_, yeah generally the low end parts are so crippled on fill rate and memory that the shader ends up almost not mattering
13:46 kugel: hi. i upgraded to 4.4 and pstate sysfs moved. how do i use debugfs?
13:47 imirkin_: kugel: same way
13:48 imirkin_: same file, just in debugfs now
13:48 imirkin_: also i think that's only in 4.5-rc1, not in 4.4
13:48 kugel: perhaps, I'm on karolherbst's stable_reclock_kepler branch
13:48 kugel: /sys/kernel/debug/dri/65/pstate?
13:48 imirkin_: ah ok
13:49 karolherbst: ä
13:49 karolherbst: ...
13:49 kugel: i seem to remember that debugfs works differently and that you need to create files/subdirs first but seems I was misinformed
13:49 imirkin_: you're thinking of configfs
13:50 kugel: duh! right, that's it
13:50 imirkin_: (or the person providing the info was)
13:51 karolherbst: mhh my tex pass isn't so bad after all when I reduce the distance
13:51 karolherbst: helped 1 259 1534 1534 // hurt 1 946 982 982
13:51 karolherbst: :D
13:51 RSpliet: glennk: interestingly, if I look at the Kepler desktop GPUs listed on wikipedia there seems to be between 14 and 23GB/s per SMX
13:52 RSpliet: I kind of had the assumption that effective bandwidth would decrease as theoretical bw rises (eg. the gap between low-end GPUs with 14GB/s and the high end with 23GB/s is only theoretical)
13:52 karolherbst: RSpliet: I really doubt that though :/
13:53 RSpliet: so as far as keeping the SMX fed with data, I wouldn't expect such a massive difference - although a higher resolution of course puts a relatively high strain on the low-end cards
13:54 RSpliet: hmm, I said "only theoretical", I meant "not as big as the numbers indicate"
13:54 imirkin_: any half-way decent gpu is like $150 =/
14:04 karolherbst: hakzsam: is there a perf counter which tells us how many shaders are run per frame?
14:05 hakzsam: karolherbst, not currently
14:05 karolherbst: mhh okay
14:05 karolherbst: though I would have to compare that with the blob
14:06 imirkin_: there probably ought to be a draw count
14:06 imirkin_: although a single shader might be reused
14:06 karolherbst: mhh any idea what would be a good way to do this? I suspect that the blob does zculling/zbuffers stuff better and therefore executes less shaders in total
14:06 karolherbst: would be nice to verify that
14:06 imirkin_: you can have queries around the number of fragment invocations/etc
14:06 imirkin_: with ARB_pipeline_statistics_query
14:06 karolherbst: nice
14:07 karolherbst: I hope the blob does execute 50% less shaders :)
14:32 karolherbst: there is no way to do that with apitrace yet, right?
14:47 karolherbst: imirkin_: that ARB_pipeline_statistics_query thing seems to be really interesting :)
14:47 karolherbst: I just need to check how I can add this to apitrace
14:48 imirkin_: iirc it might support it, not sure
14:49 karolherbst: imirkin_: I don't think so though
14:49 jeremySal: imirkin: why would I get the error "Bad enable/disable enum at GL_CONSERVATIVE_RASTERIZATION_NV" with this shader: http://pastebin.com/pf6EKtnM ?
14:50 jeremySal: Is this a limitation of piglit, of my driver installation, or of my script?
14:50 karolherbst: imirkin_: I think it only supports GL_AMD_performance_monitor and GL_INTEL_performance_query
14:50 hakzsam: karolherbst, yeah
14:51 karolherbst: hakzsam: I think it would really make sense to add support for ARB_pipeline_statistics_query too
14:52 imirkin_: jeremySal: mmmm... your piglit was built with a gl.xml that didn't have that enum :(
14:53 imirkin_: jeremySal: try leaving off the GL_ bit of it
14:53 imirkin_: karolherbst: any objections to this rewording? http://hastebin.com/raw/xetakofabe
14:54 karolherbst: imirkin_: nope, seems fine, thanks :)
14:55 karolherbst: and for today I am gone anyway :), I hope tomorrow will be as productive as today :D
14:55 jeremySal: imirkin: same error without the GL_ prefix in the error
14:56 imirkin_: jeremySal: ugh, looks like the patch never landed to look at the larger list :(
14:58 imirkin_: jeremySal: you can just add it to tests/shaders/shader_runner.c -- see the enable_table .
16:38 imirkin: mupuf: could i trouble you to stick a fermi into reator at some point? i can't figure out this images stuff with just guessing
22:56 jeremySal: imirkin: I have a test and dump for the NV_conservative_raster extension
23:02 jeremySal: Although I'm not really sure the best way to do the test, so I drew a line with slope 45-epsilon degrees, and checked whether the pixel at 1,0 is colored
23:02 jeremySal: (45 minus epsilon) degrees
23:45 jeremySal: imirkin: if I want to set the polygonmode (for NV_fill_rectangle) will I have to create a C test?