00:00 karolherbst: ohh wait, mhh, no I compared the wrong outputs..
00:05 karolherbst: imirkin_: seems like between two algebraicopts there should be one localCSE run
00:05 karolherbst: otherwise those refcount == 1 checks don't work
00:08 karolherbst: imirkin_: with that: changes between 2 and 3 iterations: helped 0 8 189 189
00:08 karolherbst: no hurt
00:08 karolherbst: :)
00:32 karolherbst: and now, why does running DCE hurt some shaders...
00:45 karolherbst: okay nice, it is getting better :)
01:55 karolherbst: imirkin_: do you think a slct => min optimization could give us any benefit with no effect on instruction or gpr count?
02:03 RSpliet: karolherbst: I doubt it
02:03 karolherbst: me too, I just wanted to be sure. I found a lot of slct which can be reduced to mins, but there is no benefit except that a slct would become a min
02:04 RSpliet: the only potential improvement I could possibly imagine is saving 4 bytes of code in a few distinct cases
02:04 karolherbst: mhh
02:04 RSpliet: on nv50
02:04 karolherbst: well it could remove some other code in theory
02:05 karolherbst: in fact it goes like that: slct lt a b c, where a and c have the same sign
02:05 karolherbst: and b is 0
02:05 karolherbst: but I am not sure if b matters that much
02:06 karolherbst: mhh
02:06 karolherbst: and the code would be much too complex just for that
02:08 RSpliet: generally not worth it
02:09 RSpliet: I was secretly hoping your tex - texbar patch was a compelling motivator to focus on scheduling instead of small peephole optms :-P
02:09 karolherbst: it didn't change much
02:09 karolherbst: like nothing really
02:10 RSpliet: prefetching is quite valuable in a lot of cases, although GPUs are a lot more resistant to latency than CPUs
02:10 karolherbst: yeah
02:10 karolherbst: I don't think that the tex location is the problem here
02:10 karolherbst: I am still suspecting that something zcull/zbuffer related isn't perfect
02:11 karolherbst: because the perf difference gets bigger the more complex the scene to render is
02:11 karolherbst: which could of course be caused by other stuff as well
02:11 RSpliet: imirkin_: 'd you think a greedy "when in doubt, schedule the insn with the biggest positive impact on liveness first" is a useful strategy? or is there literature that contains better plans? :-P
02:11 karolherbst: RSpliet: I was there already
02:11 karolherbst: in fact it's not
02:12 RSpliet: did you manage to observe a reduced GPR?
02:12 karolherbst: more like a x3 gpr
02:12 RSpliet: then you were doing it wrong :-P
02:12 karolherbst: and that was what I did: I manage a list of live values
02:12 karolherbst: and schedule those first with a use count of 1
02:12 karolherbst: seems fine, right?
02:12 RSpliet: not sure what your definition of use count is
02:13 karolherbst: well if there is one use, then use count is 1
02:13 karolherbst: like if the value is only used by one instruction
02:13 RSpliet: oh, yes that's an excellent way of maximising liveness
02:13 karolherbst: :D
02:14 karolherbst: I noticed
02:14 RSpliet: no what you'd rather do is determine for an instruction what the net effect of liveness is. like mad %4 %1 %2 %3 could make 1, 2 and 3 obsolete while only spawning a single var
02:14 karolherbst: RSpliet: https://github.com/karolherbst/mesa/commit/5da5109b50cbb23da73f74f1897ef1c5ced15cb5
02:14 RSpliet: so your net effect would be -2
02:14 karolherbst: ohh yeah, that would be smart
02:14 karolherbst: ...
02:15 RSpliet: it has the unfortunate downside of pushing all ld's and tex's to later in your program, and if that introduces latency you might be optimising for the wrong target :-)
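The net-liveness heuristic RSpliet describes can be sketched roughly as follows. This is a minimal Python illustration, not nouveau code: the `Insn` record and the `net_liveness` helper are hypothetical names, and the sketch assumes we already know how many unscheduled instructions still read each value.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, defined values, source values.
Insn = namedtuple("Insn", ["op", "defs", "srcs"])

def net_liveness(insn, remaining_uses):
    """Change in the number of live values if `insn` is scheduled now.

    remaining_uses maps a value name to the number of not-yet-scheduled
    instructions that still read it (an assumption of this sketch).
    """
    # A source dies here only if this instruction is its last reader.
    killed = sum(1 for s in set(insn.srcs) if remaining_uses.get(s, 0) == 1)
    return len(insn.defs) - killed

mad = Insn("mad", defs=["%4"], srcs=["%1", "%2", "%3"])
print(net_liveness(mad, {"%1": 1, "%2": 1, "%3": 1}))  # -2
print(net_liveness(mad, {"%1": 2, "%2": 1, "%3": 1}))  # -1, %1 stays live
```

A greedy scheduler built on this would pick the ready instruction with the lowest value, which is also why it tends to push ld and tex instructions (net effect +1) towards the end.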
02:16 karolherbst: yeah...
02:16 karolherbst: because I know that something like that takes months to do
02:16 karolherbst: I was hoping with some minor post-ra stuff we could at least hide the latency better
02:17 karolherbst: but it didn't change a thing
02:20 karolherbst: RSpliet: does that make any sense or can the first move be removed? not $p0 mov u32 $r20 $r18; not $p0 mov u32 $r21 $r19
02:21 karolherbst: ohh wait
02:21 karolherbst: ...
02:21 karolherbst: I think I am still a bit tired
02:29 karolherbst: okay, that can be replaced: max ftz f32 $r25 abs $r19 $r63
03:10 karolherbst: RSpliet: ex2(preex2(mul(lg2(a), b))) => mul(a,b) ?
03:15 karolherbst: ohh wait...
03:15 karolherbst: this is a^b
03:15 RSpliet: eh? (ln2(a)*b)^2 != a*b
03:15 karolherbst: yeah I was playing around in wolfram alpha and just messed it up
03:15 karolherbst: mhh
03:15 karolherbst: but this opt might be useful for b==2 and maybe b==3 too
03:16 karolherbst: a mul should be better than ex2 and lg2
03:16 karolherbst: well in my case b is 3
03:18 RSpliet: pardon my screw-up btw, should've been: 2^(ln2(a)*b)
03:19 karolherbst: no worries
03:20 karolherbst: but this is a^b
03:20 karolherbst: for b==2 this can be simply optimized to a single mul, which would be really nice
03:20 karolherbst: and with b==3 we would have two muls
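The identity being discussed is a^b = 2^(b * log2(a)), so small integer exponents can be replaced by plain multiplications. A minimal Python sketch, assuming a > 0; the function names are made up for illustration and do not correspond to anything in nouveau:

```python
import math

def pow_via_sfu(a, b):
    # Mirrors the lowered form ex2(mul(lg2(a), b)); only valid for a > 0 here.
    return 2.0 ** (math.log2(a) * b)

def pow2(a):
    return a * a        # one mul

def pow3(a):
    return a * a * a    # two muls

a = 1.7
print(math.isclose(pow_via_sfu(a, 2), pow2(a), rel_tol=1e-12))  # True
print(math.isclose(pow_via_sfu(a, 3), pow3(a), rel_tol=1e-12))  # True
```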
03:21 karolherbst: what is the difference between mul ftz and mul dnz by the way?
03:23 RSpliet: is that rounding mode?
03:25 karolherbst: I guess so
03:26 RSpliet: idk, it's bound to be encoded somewhere in envyds
03:26 RSpliet: *envydis
04:44 karolherbst: RSpliet: wow, that ex2/lg2 opt just gave me 1% perf in pixmark_piano
04:46 karolherbst: and 15 other shaders are also affected :)
04:47 karolherbst: all 15 from the game deadcore :/ yeah well
05:18 karolherbst: uhhh, with a nouveau mesa change I made my intel screen flicker in a really odd way :O
05:37 imirkin: karolherbst: an opt case for OP_POW in opnd() seems like it'd be very beneficial
05:37 imirkin: karolherbst: you can go up to pow(4) i think -- all of those should just be 2 mul's
05:38 karolherbst: imirkin: so I would catch those lg2, ex2 thingies with that?
05:38 imirkin: no
05:38 imirkin: OP_POW
05:38 imirkin: which gets lowered to lg2/ex2 after the opt passes
05:38 karolherbst: oh yeah makes sense
05:39 karolherbst: anyway, I found a serious issue, or maybe I did something terrible
05:39 karolherbst: imirkin: why could that cause artefacts in games: https://github.com/karolherbst/mesa/commit/e34eec30117a1101ca828d39d52f94b05babaa5a
05:40 karolherbst: ohh wait, I know why this could, :/
05:40 karolherbst: maybe I messed something up
05:55 karolherbst: imirkin: okay this patch on mesa master causes some bad stuff in borderlands. Something AlgebraicOpt related :/
05:55 imirkin: the opt passes aren't ready to be run in a loop?
05:55 karolherbst: mhh maybe
05:56 imirkin: i dunno, that's weird though
05:56 karolherbst: maybe LocalCSE causes AlgebraicOpt to do something strange
05:58 karolherbst: nope, it is algebraic opt
06:00 karolherbst: imirkin: and I don't think it has much to do with running them in a loop, because each opt shouldn't affect the result of the shader, should it?
06:00 RSpliet: correct
06:00 imirkin: each opt expects certain input
06:00 imirkin: and if it receives a new kind of input, it may not be ready for it
06:00 imirkin: in principle, you're right - the opts shouldn't affect the result of the shader
06:01 imirkin: but for example AlgebraicOpt isn't used to run *after* ModifierFolding
06:01 imirkin: so it might not check modifiers everywhere it should
06:02 karolherbst: ohhh
06:02 karolherbst: that makes sense
06:04 karolherbst: ... awesome, not even the instruction count changes
06:10 karolherbst: imirkin: it doesn't seem like anything changes though...
06:11 imirkin: instruction count != instructions...
06:11 karolherbst: I ran every shader_test through shader_runner with NV50_PROG_DEBUG=3
06:11 karolherbst: and diffed every output
06:12 karolherbst: I will clean build, test again and then I don't know
06:14 imirkin: weird
06:16 karolherbst: ....
06:16 karolherbst: guess what
06:16 karolherbst: the game is 32bit
06:18 karolherbst: ahh nice, now I have 438 changed shaders
06:36 karolherbst: imirkin: I guess I'll have to trace and find the draw call painting the weird stuff I saw
06:37 imirkin: hmmmm.... are some of our opts broken on 32-bit?
06:38 imirkin: there shouldn't be any differences in shader compilation.... except *very* occasional diffs due to set ordering
06:38 imirkin: we have an unordered set of pointers
06:38 karolherbst: yeah I know
06:38 imirkin: oh, but iirc i fixed that source of non-determinism
06:38 karolherbst: I think this is something really rare
06:38 imirkin: (in the SpillCodeInserter)
06:38 karolherbst: because
06:38 karolherbst: most of the stuff looks fine
06:43 karolherbst: imirkin: same result with glretrace
06:44 imirkin: well there shouldn't be any diffs between 32-bit and 64-bit compilation
06:44 imirkin: so if you have ANY shaders that compile differently on 32- and 64-bit
06:44 imirkin: i'd like to see the diffs
06:44 karolherbst: if I find one...
06:45 imirkin: i thought you had 438 of them
06:46 karolherbst: ohh now I know what you meant.. no I was only rebuilding my 32bit build for the game
06:46 karolherbst: and left my 64bit build alone
06:46 karolherbst: so shader_runner was using the same binaries
06:46 imirkin: oh i see
06:46 imirkin: "oops" :)
06:47 RSpliet: imirkin: who are you citing there?
06:47 imirkin: does it matter?
07:20 karolherbst: imirkin: so I think I will find the shader in a moment :), only 300 calls to check
07:20 imirkin: good times, eh
07:20 imirkin: fwiw i like to just do NV50_PROG_DEBUG=1 glretrace
07:20 karolherbst: thing is
07:20 imirkin: with both things
07:20 imirkin: and then diff -u the output :)
07:21 karolherbst: you won't see anything out of that
07:21 karolherbst: there are like thousands of add->mad opts
07:21 imirkin: :(
07:22 karolherbst: well I could also just partly disable those opts and see which exactly is causing it...
07:25 karolherbst: apitrace doesn't help that much though :/
07:28 karolherbst: handleAdd or handleRCP
07:29 karolherbst: though they could change something which changes something else later
07:29 karolherbst: mhh
07:29 karolherbst: seems like handleAdd
07:30 imirkin: that's the annoying bit with optimizations :)
07:30 imirkin: perhaps it tries to make an add neg a neg b?
07:30 karolherbst: ohh wait, I turned everything off after the second algebraic
07:30 imirkin: which iirc isn't supported for... integers?
07:30 imirkin: i forget
07:30 karolherbst: I could disable this
07:30 karolherbst: and it is handleAdd for sure
07:30 imirkin: but it should be checking for it
07:31 karolherbst: for what should I check?
07:31 karolherbst: ohh
07:31 karolherbst: algebraic handleAdd only does this tryADDToMADOrSAD thing
07:31 imirkin: dunno, there are target checks for all this stuff.
07:32 imirkin: perhaps the tryADD* forgets to check one of the target callbacks
07:32 imirkin: or forgets to copy in one of the modifiers
07:32 karolherbst: I will disable mads for now, maybe it is a sad thing...
07:32 imirkin: no
07:32 imirkin: we never emit sad's
07:33 karolherbst: oh
07:34 karolherbst: imirkin: should I just print all conversions in tryADDToMADOrSAD?
07:34 karolherbst: add->print and both sources
07:35 imirkin: you should try to figure it out yourself :p
07:35 karolherbst: :D
07:36 imirkin: i think you have a decent understanding of this stuff by now
07:37 imirkin: check the target for the various restrictions
07:37 imirkin: there's isModSupported and a few other useful bits
07:37 imirkin: *read* the functions so that you better understand the restrictions yourself
07:38 karolherbst: imirkin: what I see sometimes: add(sat mul, sat mul) => mad()
07:39 karolherbst: no idea if that's bad or not
07:39 imirkin: don't think that works
07:39 imirkin: there's no sat modifier on the mul part of the mad
07:39 karolherbst: yeah, I thought as much
07:40 karolherbst: so if the muls have a sat, I shouldn't do the conversion
07:40 imirkin: yep
07:40 karolherbst: k
07:41 imirkin: and it of course never saw this before since previously there were just OP_SAT's in place
07:41 imirkin: but with ModifierFolding, those get folded in
07:43 karolherbst: \o/
07:43 karolherbst: that was it
07:55 mupuf: karolherbst: you should monitor the execution time of the compilation too
07:56 mupuf: it is not super important, but it still needs to be monitored :p
07:56 karolherbst: mupuf: I fear the overhead of shader-db is too big though
07:56 karolherbst: I did it once and didn't see anything noticeable
07:56 karolherbst: everything under <2% change
07:57 mupuf: good-enough then :p
07:57 karolherbst: eon games are linking shaders every frame, these are a total pain :/
07:59 karolherbst: but I am getting really near to blob performance now :)
07:59 karolherbst: I will reach 80% for sure
08:00 Tom^: wait, what are you doing :o
08:00 karolherbst: currently at 76.6% with pixmark_piano
08:00 karolherbst: Tom^: yeah well, tuning the nouveau compiler a bit
08:01 Tom^: cool
08:03 karolherbst: my target is 90% though
08:04 karolherbst: imirkin: lower pow to mul,mul for 0-4?
08:05 karolherbst: ohh wait, 0 doesn't make much sense
08:05 karolherbst: 1-4 then :)
08:05 karolherbst: .
08:05 karolherbst: .... okay 2-4
08:17 imirkin_: karolherbst: well, 0 makes sense - the result should be 1 iirc
08:17 imirkin_: karolherbst: 1 it should just be x
08:17 imirkin_: etc
08:18 karolherbst: yeah
08:18 mupuf: karolherbst: what was the perf of pixmark_piano before?
08:18 karolherbst: mupuf: master: 972, my stuff: 1027, blob: 1340
08:19 mupuf: very nice :)
08:19 imirkin_: karolherbst: since it gets lowered to a bunch of ops, perhaps it makes sense to go up to power of 16. dunno.
08:19 karolherbst: yeah, maybe
08:20 imirkin_: and sfu ops suck a big one too
08:20 karolherbst: imirkin_: there are only two combinations anyway, mul(a, a) or mul(a, mul), and everything up to 16 is a combination of both
08:20 imirkin_: :)
08:21 imirkin_: any integer power in fact
08:22 karolherbst: :O
08:22 karolherbst: even 1024?
08:22 karolherbst: :D
08:22 imirkin_: even 1023!
08:22 karolherbst: yeah but I doubt that makes sense
08:22 karolherbst: would be funny if that would be faster though
08:24 imirkin_: def more accurate!
08:24 karolherbst: mupuf: that's my favourite one of my patches: https://github.com/karolherbst/mesa/commit/e0e7c09f15215ef5737d62f58e7997a731468ae7 :)
08:24 imirkin_: anyways, that's why i picked 16 as a max
08:24 imirkin_: i think it's a good trade-off
08:27 karolherbst: imirkin_: should the case for 1 op = OP_MOV? or rather OP_CVT?
08:27 karolherbst: I never know what to use of these
08:27 imirkin_: use CVT when there's some modifier
08:27 imirkin_: use MOV when there's no modifier
08:27 karolherbst: okay
08:27 imirkin_: pow can't have any modifiers
08:27 imirkin_: so just use mov
08:27 karolherbst: nice, okay
08:27 mupuf: karolherbst: oh, you already have a proto for it
08:28 mupuf: that is really nice!
08:28 imirkin_: btw, dnz = treat nan as "0" for multiplication
08:28 karolherbst: mupuf: and because this is post-ra, it doesn't even affect gpr count or something :D
08:28 imirkin_: this is important because we want 0^0 = 1 for silly reasons
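The 0^0 case can be traced through the lowered lg2/mul/ex2 sequence. The Python model below follows imirkin_'s one-line description of dnz ("treat nan as 0 for multiplication"); the real hardware semantics may differ in detail, and `lg2`, `ex2`, `mul_ieee`, `mul_dnz` and `pow_lowered` are stand-in names for this sketch only.

```python
import math

def lg2(x):
    # log2 with the limit value at zero, standing in for the hardware op.
    return -math.inf if x == 0.0 else math.log2(x)

def ex2(x):
    return 2.0 ** x

def mul_ieee(a, b):
    return a * b                        # -inf * 0 is NaN under IEEE 754

def mul_dnz(a, b):
    r = a * b
    return 0.0 if math.isnan(r) else r  # modelled on "treat nan as 0"

def pow_lowered(a, b, mul):
    return ex2(mul(lg2(a), b))

print(pow_lowered(0.0, 0.0, mul_ieee))  # nan
print(pow_lowered(0.0, 0.0, mul_dnz))   # 1.0
```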
08:28 mupuf: the commit message is unclear though (+22%, but how far are you from the blob?)
08:28 karolherbst: imirkin_: ohh okay
08:28 karolherbst: mupuf: there is a hw limit
08:29 karolherbst: mupuf: we can't dual schedule more than 3 out of 7 instructions
08:29 karolherbst: mupuf: mesa master already hits above 40% in unigine heaven
08:29 karolherbst: just pixmark_piano was somehow bad
08:29 mupuf: yeah, but there is the more practical limit I guess which is how well the program maps to the dual issue capabilities
08:30 RSpliet: mupuf: nothing scheduling can't solve :-P
08:30 karolherbst: yeah, but you can get close to 43% easily
08:30 mupuf: RSpliet: ah ah
08:30 mupuf: karolherbst: then it would be nice to say from which number you went to which
08:30 karolherbst: mupuf: with that patch I am at 42.7% dual issued in pixmark_piano
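Taking the "3 out of 7 instructions" limit at face value, the ceiling and how close 42.7% comes to it are simple arithmetic. A small Python check, using only the numbers quoted in the conversation:

```python
# Upper bound quoted above: at most 3 of every 7 instructions dual-issued.
max_dual_issue = 3 / 7
measured = 0.427  # pixmark_piano with the patch, as reported

print(f"ceiling:  {max_dual_issue:.1%}")             # ceiling:  42.9%
print(f"achieved: {measured / max_dual_issue:.1%}")  # achieved: 99.6%
```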
08:30 karolherbst: ohh right
08:31 mupuf: because otherwise, it is kind of hard to judge :)
08:31 karolherbst: :D
08:31 karolherbst: right
08:31 karolherbst: but
08:31 karolherbst: this shows that the dual issue stuff seems to work pretty good actually
08:31 karolherbst: because if it wouldn't you would notice
08:32 mupuf: :)
08:32 mupuf: great work nonetheless!
08:33 karolherbst: the pass is a bit expensive though :/ I already increase it by a lot by just checking the next one
08:33 karolherbst: so maybe being so smart there doesn't really pay off
08:34 mupuf: maybe you can call it when you notice the dual-issue rate is low?
08:34 mupuf: does it improve heaven for instance?
08:36 karolherbst: mupuf: mesa can only know when it is emitting the binary, and if we check every instruction already, then we can improve the situation at the same time
08:36 karolherbst: so checking how good dual issuing is will already take like 80% of the cpu time this thing would
08:37 karolherbst: mupuf: not really, heaven is pretty good without it already
08:37 karolherbst: mupuf: you can dual issue pretty much usually, I never saw any instruction move more than 3 places away in pixmark_piano, and its shader has like 3800 instructions
08:41 mupuf: ah ah, it is a beast :p
08:42 RSpliet: karolherbst: here's a heuristic that might be interesting: try and shove tex instructions upwards to the last slot in the (previous) scheduling block
08:42 RSpliet: as it can't be dual issued anyway :-P
08:42 karolherbst: RSpliet: doesn't matter
08:43 karolherbst: two texs might be problematic in a block though
08:44 mupuf: well, this week, I accidentally partly-solved the biggest TODO on my list... ensure that the variance of each run is within the confidence_margin we set, and then using the student-t technique to find if two data-sets are the same or not (eg. was there a perf change or not).
08:44 RSpliet: oops
08:44 mupuf: Why do I say "partly"? Because it assumes a gaussian distribution
08:44 mupuf: which I know is WRONG whenever we get CPU limited or too close to the TDP
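One common form of the student-t comparison mupuf mentions is Welch's t-test. The sketch below is an assumption about the general technique, not mupuf's actual tooling; the sample values are made up purely to exercise the function, and it inherits the Gaussian assumption he warns about.

```python
import math
import statistics

def welch_t(xs, ys):
    """Welch's t statistic and approximate degrees of freedom."""
    vx = statistics.variance(xs) / len(xs)   # squared standard error of xs
    vy = statistics.variance(ys) / len(ys)   # squared standard error of ys
    t = (statistics.mean(xs) - statistics.mean(ys)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(xs) - 1) + vy ** 2 / (len(ys) - 1))
    return t, df

# Synthetic benchmark scores, for illustration only.
before = [100.0, 101.0, 99.0, 100.5, 99.5]
after = [105.0, 106.0, 104.0, 105.5, 104.5]
t, df = welch_t(before, after)
print(t, df)
```

A large |t| for the given degrees of freedom suggests the two runs really differ; the threshold would come from a t table at the chosen confidence level.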
08:45 mupuf: should not be too much of an issue on nouveau
08:46 mupuf: but on intel hw, with such a low TDP and fast reclocking, it is a pain to get stable results :s
08:46 karolherbst: :/
08:47 mupuf:was pretty pleased with the variance for the tests on nouveau :D
08:47 mupuf: well, I guess it is good that I test and design based on the worst case!
08:47 mupuf:has a small NUC to do most of his tests without blocking his main machine
09:04 karolherbst: imirkin_: I think 16 is too much
09:04 karolherbst: ohh wait, there can be muls shared :/
09:06 imirkin_: 16 = a2 = a*a, a4 = a2*a2, a8 = a4*a4, a16 = a8 * a8
09:06 imirkin_: so... 4 mul's
09:06 imirkin_: vs 3 sfu ops (preex2, ex2, lg2) + 1 mul
09:06 imirkin_: seems like a good trade
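The mul counts quoted here fall out of square-and-multiply. A minimal Python sketch that returns both the result and the number of multiplications used; `pow_by_muls` is a hypothetical helper, and square-and-multiply is not always the shortest possible chain (15 needs 6 muls this way, while 5 are enough).

```python
def pow_by_muls(a, n):
    """Return (a**n, number of multiplications) for an integer n >= 1."""
    result, muls = None, 0
    base = a
    while n:
        if n & 1:
            if result is None:
                result = base            # first factor is a copy, no mul
            else:
                result = result * base
                muls += 1
        n >>= 1
        if n:
            base = base * base           # squaring for the next bit
            muls += 1
    return result, muls

for n in (2, 3, 4, 16, 1023, 1024):
    print(n, pow_by_muls(2, n)[1])
# 2 1, 3 2, 4 2, 16 4, 1023 18, 1024 10
```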
09:06 karolherbst: yeah I wasn't thinking clearly
09:12 karolherbst: I should prepare for tomorrow, but I am soo tired :/
09:12 imirkin_: ah going to fosdem?
09:12 karolherbst: yeah
09:12 imirkin_: cool
09:22 karolherbst: now I get 1035 points, but I didn't change a thing :/ (except I wrote the pow to MUL pass)
09:27 karolherbst: imirkin_: well I had this before: https://github.com/karolherbst/mesa/commit/8d09cd8a6fe76ea0785466ac1a793b809c4e7ee6
09:28 karolherbst: ....
09:28 karolherbst: ohhh
09:28 karolherbst: I didn't want to write that :/
09:28 karolherbst: nvm
09:28 karolherbst: just an old patch
09:30 imirkin_: you need to ping me less often... please try to only do so when you have a question that you've given some thought to solving already.
09:30 karolherbst: yeah I know, but this was by accident now :/
10:45 karolherbst: okay, I don't think I find anything anymore in pixmark_piano, so next most obvious thing to do would be zcull for nvc0
11:05 karolherbst: RSpliet: I think I got a pretty naive rescheduling algorithm: just iterate over all instructions in a block, and when i removes more live values than i->prev, swap them
11:06 imirkin_: you want to keep track of the # of live values
11:06 imirkin_: if you always minimize the # of live values, you end up using very few registers and having no ability for execution parallelism
11:07 karolherbst: I thought using fewer registers means more threads
11:07 karolherbst: but I was thinking of just trying that out and see how that goes
11:08 imirkin_: there's a trade-off
11:08 karolherbst: there is a max amount of threads I suppose
11:09 imirkin_: look at the getThroughput stuff
11:09 imirkin_: that should give you an idea
11:09 imirkin_: basically there's an ideal minimum delay between instructions
11:09 imirkin_: s.t. the next instruction of that type would be ready to execute
11:09 imirkin_: without having to wait for a unit to become available
11:09 glennk: no more of those BFE etc things the front end emits a bunch of ops for?
11:10 imirkin_: huh?
11:10 imirkin_: i made an opt for shift + and -> bfe/bfi
11:10 karolherbst: nice
11:10 imirkin_: as well as detecting byte extractions and using convert
11:11 karolherbst: did you measure the benefit already?
11:11 imirkin_: no, but it was a ton fewer ops in some situations
11:11 imirkin_: i pushed it a long time ago
11:12 imirkin_: i doubt it affects too many shaders
11:12 karolherbst: ohhh okay
11:15 karolherbst: I think I will try out to figure out that zculling first. It sounds like it could give a huge benefit and I don't think anybody worked on that really the last two years?
11:15 imirkin_: i've never looked at it
11:16 glennk: zcull = hierz?
11:16 imirkin_: maybe?
11:16 martm: yeah
11:16 karolherbst: it is something nouveau doesn't do for nvc0, so yeah
11:16 karolherbst: pushing fake zcull data should show us pretty fast what it does though
11:17 glennk: http://developer.download.nvidia.com/GPU_Programming_Guide/GPU_Programming_Guide_G80.pdf page 43
11:17 karolherbst: yeah I read that already
11:18 imirkin_: glennk: is that the same thing as hierz?
11:18 karolherbst: I think zculling is based on triangles or something bigger
11:18 glennk: pretty much
11:18 glennk: i think piano is a bit of a special case, its just a single quad and all the work is in the fragment shader
11:18 karolherbst: yeah
11:19 karolherbst: I think this is more for lateZ or how that would be called in that case
11:19 karolherbst: ohh that should be EarlyZ :D
11:20 imirkin_: zcull will help something like heaven (maybe)
11:20 glennk: probably, stuff using lots of shadow maps should notice the difference
11:20 karolherbst: mhh
11:21 karolherbst: but I think pixmark_piano could be a good start to see what the blob does regarding zcull/earlyZ
11:21 karolherbst: even if there is no benefit
11:21 glennk: gears is probably better for that
11:22 karolherbst: ...
11:22 karolherbst: gears is bottlenecked by pcie on my machine
11:22 glennk: not if you fullscreen it
11:22 karolherbst: pretty sure it still is
11:22 imirkin_: glennk: he's on optimus
11:22 glennk: or well, on hybrid laptops it would be
11:26 karolherbst: glennk: pixmark_piano has 65 calls per frame :)
11:27 karolherbst: and two draw calls
11:28 karolherbst: ohh the second draw call just draws tux
11:29 glennk: karolherbst, yes, but its got a few thousand ops pixel shader, you won't be able to see if zcull is active or not
11:30 glennk: ...and no overdraw or z buffer
11:30 karolherbst: yeah, I know, I just want to know what the blob does in that case
11:30 imirkin_: no z = no zcull
11:30 imirkin_: basically for zcull you need to allocate an aux buffer
11:31 imirkin_: and use it in conjunction with the depth texture
11:31 glennk: similar to radeon's htile
11:32 imirkin_: at least that's my guess
11:32 imirkin_: the size of this buffer needs to be based on the size of the depth texture
11:32 imirkin_: i guess each miplevel of the texture needs this buffer
11:32 imirkin_: or maybe just forget being clever and stick it on the fb
11:33 imirkin_: and when binding a new fb, just reset the zcull info and move on
11:33 glennk: not sure what happens when shaders read z
11:33 karolherbst: well my first step is to somehow push zcull related stuff to the hw and see how I can mess stuff up
11:33 karolherbst: glennk: the pdf tells you ;)
11:33 imirkin_: glennk: well, on fermi it's probably a little diff than on tesla
11:34 karolherbst: ohh right, yeah well
11:34 glennk: older hardware is probably quirkier to get going
11:34 karolherbst: I just want to see a change for now, that's all
11:34 imirkin_: i guess the zcull buffer should be allocated when clearing a fb
11:35 imirkin_: i dunno
11:35 glennk: karolherbst, no, it doesn't for that case
11:35 imirkin_: but then what happens if someone manually writes to the texture...
11:36 glennk: does nouveau do fast clears on z?
11:36 imirkin_: fast is a relative term
11:36 imirkin_: i don't know if we have ZBC stuff set up properly
11:37 imirkin_: but we use the "simple" clear stuff rather than drawing "by hand"
11:37 glennk: not-write-every-pixel-byte-when-clearing
11:37 imirkin_: but what that clear does, who knows
11:37 glennk: easy enough to measure if you get clear rates >> memory bandwidth
12:27 pmoreau: Oh god!! Of course clover wasn't going to find any device, I never thought that Nouveau wasn't loaded by default on Reator… --" (which makes sense, but…)
12:28 pmoreau: hakzsam: Ok, got the same error inside the SchedDataCalculator…
12:29 imirkin_: karolherbst: fyi this is the change i see from your shl+shr patch: http://hastebin.com/hamureqeru.coffee
12:29 karolherbst: imirkin_: nice
12:29 imirkin_: mind if i just replace your section with this one in the commit log?
12:30 imirkin_: imho it's nice to run against a "standard" shader-db
12:30 karolherbst: go ahead
12:30 pmoreau: Still, weird: why don't I get that error when compiling the same program for NVC0 using nouveau-compiler?
12:31 imirkin_: pmoreau: fermi doesn't run sched data
12:31 imirkin_: pmoreau: only kepler+
12:32 pmoreau: Hum, maybe the Kepler was picked up then
12:32 imirkin_: karolherbst: you're sometimes nouveau@ and sometimes git@... you should probably pick one
12:33 karolherbst: mhhh
12:33 karolherbst: right, I guess my mesa is still at git@
12:33 imirkin_: is there one you prefer? i can fix it up
12:33 karolherbst: nouveau
12:33 imirkin_: k
12:33 karolherbst: I changed all the stuff to that
12:34 karolherbst: also I can't filter so easily if I get messages to git@ :/
12:34 imirkin_: maybe some old commits still had the old one and you cherry-picked
12:35 imirkin_: you can fix that stuff up with git commit --author btw
12:35 karolherbst: yeah maybe
12:35 karolherbst: I will check that for the future stuff
12:35 pmoreau: Right, there is an nve4 in the bt, so it picked the GK106. https://phabricator.pmoreau.org/P84
12:36 karolherbst: okay, one part I figured out already: ZCULL_HEIGHT and ZCULL_WIDTH are the depth buffer? size rounded up to 20
12:36 imirkin_: 0x20 presumably?
12:36 karolherbst: no
12:36 karolherbst: 20
12:36 imirkin_: that's... highly surprising.
12:36 karolherbst: a 1024x640 buffer created 1040x640
12:36 pmoreau: Ah great, I have the crash with nouveau_compiler as well, needed to test Kepler rather than Fermi. :-)
12:37 karolherbst: pmoreau: which one?
12:37 imirkin_: weird.
12:37 imirkin_: i don't think it's to 20 though
12:37 pmoreau: karolherbst: GK106
12:37 imirkin_: maybe it's +16 though
12:37 karolherbst: pmoreau: I meant patch
12:37 pmoreau: Ah, my SPIR-V stuff
12:37 karolherbst: imirkin_: but why not for the height?
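The competing guesses can be checked against the one observed sample (1024x640 becoming 1040x640). A short Python sketch; the three hypotheses are just the ones floated in the conversation, and a single data point cannot confirm any of them, only rule some out.

```python
def round_up(value, multiple):
    # Smallest multiple of `multiple` that is >= value.
    return -(-value // multiple) * multiple

width, height = 1024, 640
observed = (1040, 640)

hypotheses = {
    "round up to 20": lambda v: round_up(v, 20),
    "round up to 0x20": lambda v: round_up(v, 0x20),
    "add 16": lambda v: v + 16,
}

matches = []
for name, f in hypotheses.items():
    predicted = (f(width), f(height))
    print(name, predicted, predicted == observed)
    if predicted == observed:
        matches.append(name)
```

Only rounding up to a multiple of 20 reproduces both dimensions here, since 640 is already a multiple of 20; trying more buffer sizes, as suggested, would be needed to tell it apart from other rules.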
12:38 imirkin_: karolherbst: check the logic in nvc0_state_validate
12:38 imirkin_: looks like it rounds up to 224
12:38 imirkin_: er hm, that's not it either
12:38 imirkin_: i dunno.
12:38 imirkin_: try diff sizes :)
12:38 karolherbst: :D
12:38 karolherbst: right
12:39 karolherbst: there are two more interesting things though
12:39 karolherbst: ZCULL_ADDRESS and ZCULL_LIMIT
12:39 karolherbst: 0x1140000 and 0x1160000
12:40 karolherbst: which I assume is the location of the depth buffer or something?
12:40 imirkin_: nvc0_validate_zcull just does + 1 << 17
12:40 imirkin_: you should look at it :)
12:41 imirkin_: pick up where calim left off rather than try everything from scratch
12:41 karolherbst: ohh right
12:41 martm: tried to look from google did not even find anything , what is dEQP?
12:41 karolherbst: imirkin_: but there was nothing at screen creation time
12:42 imirkin_: http://patchwork.freedesktop.org/patch/71771/
12:42 imirkin_: srlsly?
12:42 imirkin_: .../nouveau/codegen/nv50_ir_peephole.cpp.save | 3932 ++++++++++++++++++++
12:43 airlied: lols git fail
12:43 karolherbst: ....
12:44 karolherbst: sorry for that
12:45 karolherbst: thing is, I have no clue when to call nvc0_validate_zcull :/ just at screen creation time where the other zcull stuff is? (the blob seems to push stuff there) or would there be another place reasonable for that
12:48 imirkin_: karolherbst: it needs to be done based on the framebuffer i think
12:48 karolherbst: ohh yeah, okay, I slowly get this stuff
12:48 karolherbst: I need to add a flag down in that file
12:48 karolherbst: and dirty |= it
12:48 imirkin_: in nvc0_state.c
12:48 imirkin_: you need to figure out where to store the zcull buffer
12:48 imirkin_: maybe it should be stored with the fb
12:49 imirkin_: or maybe... something else
12:49 imirkin_: i dunno :)
12:49 imirkin_: maybe the zcull buffer should only be allocated into the fb when a clear is done
12:49 imirkin_: otherwise i have no idea how to initialize it
12:50 karolherbst: k, first segfault :O
12:51 karolherbst: ohh nv50_surface is NULL
12:55 karolherbst: mhhh, I was hoping that stuff randomly disappears when I push some silly data in there :/
13:12 imirkin_: karolherbst: i pushed a handful of your patches
13:12 karolherbst: yeah I saw
13:13 imirkin_: the ones that i felt were super-safe
13:13 karolherbst: yeah I noticed :)
13:13 imirkin_: the others need testing and thought
13:26 imirkin_: interesting. i get these results by disabling the DCE that st/mesa does: http://hastebin.com/obunidowaw.coffee
13:31 karolherbst: interesting
13:31 RSpliet: imirkin_: what? have you spotted examples of difference?
13:32 imirkin_: RSpliet: the st/mesa stuff can cause various stupidity to appear
13:32 imirkin_: although i'm surprised that its *dce* does that
13:33 imirkin_: disabling the cp that st/mesa does makes things way worse :(
13:33 imirkin_: disabling CP i get: http://hastebin.com/ijelusimoq.coffee :(
13:34 RSpliet: sure, but that might be compensated for by running an extra round of CP in nouveau?
13:34 RSpliet: as for the DCE, I'm rather curious to understand what kind of stupidity it is that causes the difference :-P
13:34 pmoreau: hakzsam: Think I found the error: the pass initialises an array A with the nb of BBs present, and use the BB's id to access the array.
13:35 imirkin_: RSpliet: yeah it's odd
13:35 pmoreau: hakzsam: There are two BBs, however they have for id 0 and 2 resp. So the id 2 is wreaking havoc…
13:36 pmoreau: Argh, because BB of id 1 is the exit BB, which doesn't get any instruction… :-/
13:37 pmoreau: So it most likely gets removed, but the BBs aren't renumbered?
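The failure mode pmoreau describes (a table sized by the number of basic blocks but indexed by block id) is easy to reproduce in miniature. The Python sketch below is illustrative only; the variable names are made up and the two fixes shown are generic options, not what the SchedDataCalculator actually does.

```python
bb_ids = [0, 2]   # id 1 was the exit block, which no longer exists

# Sizing by count: two slots, but id 2 needs a third.
by_count = [0] * len(bb_ids)
try:
    by_count[2] = 1
    overflowed = False
except IndexError:
    overflowed = True   # in C++ this would be an unchecked out-of-bounds write

# Option 1: size the table by the largest id instead.
by_max_id = [0] * (max(bb_ids) + 1)
by_max_id[2] = 1

# Option 2: renumber the surviving blocks densely before indexing.
dense = {old: new for new, old in enumerate(sorted(bb_ids))}
print(overflowed, len(by_max_id), dense)  # True 3 {0: 0, 2: 1}
```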
13:41 imirkin_: well at least it's no longer hilariously bad now that i've fixed indirect arrays
13:45 karolherbst: imirkin_: mhhh, changing gpr count has a rather big perf impact
13:46 karolherbst: I just increased the count by 10 and got a perf impact of 6ms frame time
13:50 karolherbst: 49 -> 59 max gpr: 1035 -> 984 points
13:52 RSpliet: karolherbst: sure, instead of having 6 warps in flight concurrently, you now only have 5... that's bound to have some impact on your efficient mem bw
13:52 RSpliet: *effective
13:54 karolherbst: ahh okay
13:55 Jayhost: karolherbst I have dmsg and kmsg of gm107 lockout after pstate change. Did you want to see or should I poke around.
13:55 karolherbst: I have no idea how that stuff works on maxwell
13:55 RSpliet: karolherbst: on kepler, 48 should be the tipping point for 7 warps... think you can shave two registers off that usecase? :-P
13:56 pmoreau: \o/ hello_world works on Kepler now! But not on Fermi O.O
13:57 karolherbst: RSpliet: well I just cut out two and see how that goes
13:57 karolherbst: meh it cuts something important off
13:58 RSpliet: eh yes, you'd want to do that with proper optimisations instead
13:58 karolherbst: but the blob uses 56 regs in total :/
13:59 RSpliet: karolherbst: yeah that's fine, still gives them 6 warps :-P
13:59 RSpliet: there's 65536 GPRs per SMX, divide that by 192 and you'll have the GPR per SIMD lane
14:00 RSpliet: (a very awkward 341,33 it seems, wtf?)
14:00 RSpliet: anyway, floor(341,33 / shader GPR use) gives you the number of concurrent warps
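RSpliet's rule of thumb can be applied to the register counts mentioned in this exchange. A small Python sketch using only the figures quoted above (65536 GPRs per SMX, 192 lanes); it ignores any other hardware limits, and `concurrent_warps` is a name chosen for this example.

```python
GPRS_PER_SMX = 65536
LANES_PER_SMX = 192

def concurrent_warps(gprs_per_thread):
    # floor((65536 / 192) / gprs), computed in integers to avoid rounding.
    return GPRS_PER_SMX // (LANES_PER_SMX * gprs_per_thread)

for gprs in (48, 49, 56, 59):
    print(gprs, concurrent_warps(gprs))
# 48 7, 49 6, 56 6, 59 5
```

This matches the discussion: going from 49 to 59 GPRs drops from 6 to 5 warps, 48 is the tipping point for 7, and the blob's 56 still yields 6.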
14:01 imirkin_: so i'm guessing that disabling DCE partly disables some of the later passes, which in turn is what allows nouveau to do better
14:01 Jayhost: If anyone can point me in the right direction. Fifo Engine fault. Maxwell. Dolphin-emu. after pstate frequency change.
14:01 karolherbst: the active_warps counter tells me 8.3G
14:02 karolherbst: and 24k warps launched
14:02 karolherbst: and I have 5 SMX by the way
14:02 RSpliet: Jayhost: don't do pstate frequency changes on Maxwell, it's supposed to be unsupported
14:03 RSpliet: karolherbst: no idea what the active warps counter is; certainly not on a single SMX
14:03 Jayhost: RSpliet, can I work fixing it? Starting point?
14:05 RSpliet: Jayhost: reverse engineer the clock tree? sorry mate, that's not an easy task - interpret and fix nvkm/subdev/clk/*.c if you're up for it
14:05 RSpliet: I can't help you with it
14:06 Jayhost: Cool. Thanks RSpliet and karolherbst
14:15 RSpliet: skeggsb_: is there a particular reason why you hooked up GM107 to gk104_clk_new ? or is that an artefact from copy-pastry during the last big rewrite? :-)
14:40 imirkin_: gr. i found a shader where the *renumbering* that st/mesa does greatly helps codegen ... somehow.