00:00karolherbst: ohh wait, mhh, no I compared the wrong outputs..
00:05karolherbst: imirkin_: seems like between two algebraicopts there should be one localCSE run
00:05karolherbst: otherwise those refcount == 1 checks don't work
00:08karolherbst: imirkin_: with that: changes between 2 and 3 iterations: helped 0 8 189 189
00:08karolherbst: no hurt
00:32karolherbst: and now, why does running DCE hurt some shaders...
00:45karolherbst: okay nice, it is getting better :)
01:55karolherbst: imirkin_: do you think a slct => min optimization could give us any benefit with no effect on instruction or gpr count?
02:03RSpliet: karolherbst: I doubt it
02:03karolherbst: me too, I just wanted to be sure. I found a lot of slcts which can be reduced to mins, but there is no benefit except that a slct would become a min
02:04RSpliet: the only potential improvement I could possibly imagine is saving 4 bytes of code in a few distinct cases
02:04RSpliet: on nv50
02:04karolherbst: well it could remove some other code in theory
02:05karolherbst: in fact it goes like that: slct lt a b c, where a and c have the same sign
02:05karolherbst: and b is 0
02:05karolherbst: but I am not sure if b matters that much
02:06karolherbst: and the code would be much too complex just for that
02:08RSpliet: generally not worth it
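A small sketch of the slct-to-min identity discussed above, under the assumed semantics that `slct (lt) d, a, b, c` means `d = a if c < 0 else b` (the exact nv50 IR operand order is not spelled out in the chat, so treat this as a model, not the driver's definition):

```python
def slct_lt(a, b, c):
    # assumed slct-with-lt semantics: pick a when c is negative, else b
    return a if c < 0 else b

def can_fold_to_min(a, b, c):
    # the fold karolherbst describes: b is 0 and a, c share the same sign
    return b == 0 and (a < 0) == (c < 0)

# if c < 0 then a < 0 too, so slct picks a == min(a, 0);
# if c >= 0 then a >= 0, so slct picks 0 == min(a, 0)
for a, c in [(-3.0, -1.0), (2.5, 4.0), (0.0, 0.0)]:
    assert can_fold_to_min(a, 0.0, c)
    assert slct_lt(a, 0.0, c) == min(a, 0.0)
```

As RSpliet notes, the win is at best a few bytes of code, which is why the fold was judged not worth the complexity.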
02:09RSpliet: I was secretly hoping your tex - texbar patch was a compelling motivator to focus on scheduling instead of small peephole optms :-P
02:09karolherbst: it didn't change much
02:09karolherbst: like nothing really
02:10RSpliet: prefetching is quite valuable in a lot of cases, although GPUs are a lot more resistant to latency than CPUs
02:10karolherbst: I don't think that the tex location is the problem here
02:10karolherbst: I am still suspecting that something zcull/zbuffer related isn't perfect
02:11karolherbst: because the perf difference gets bigger the more complex the scene to render is
02:11karolherbst: which could of course be caused by other stuff as well
02:11RSpliet: imirkin_: 'd you think a greedy "when in doubt, schedule the insn with the biggest positive impact on liveness first" is a useful strategy? or is there literature that contains better plans? :-P
02:11karolherbst: RSpliet: I was there already
02:11karolherbst: in fact it's not
02:12RSpliet: did you manage to observe a reduced GPR?
02:12karolherbst: more like a 3x gpr count
02:12RSpliet: then you were doing it wrong :-P
02:12karolherbst: and that was what I did: I manage a list of live values
02:12karolherbst: and schedule those first with a use count of 1
02:12karolherbst: seems fine, right?
02:12RSpliet: not sure what your definition of use count is
02:13karolherbst: well if there is one use, then use count is 1
02:13karolherbst: like if the value is only used by one instruction
02:13RSpliet: oh, yes that's an excellent way of maximising liveness
02:14karolherbst: I noticed
02:14RSpliet: no what you'd rather do is determine for an instruction what the net effect of liveness is. like mad %4 %1 %2 %3 could make 1, 2 and 3 obsolete while only spawning a single var
02:14karolherbst: RSpliet: https://github.com/karolherbst/mesa/commit/5da5109b50cbb23da73f74f1897ef1c5ced15cb5
02:14RSpliet: so your net effect would be -2
02:14karolherbst: ohh yeah, that would be smart
02:15RSpliet: it has the unfortunate downside of pushing all ld's and tex's to later in your program, and if that introduces latency you might be optimising for the wrong target :-)
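RSpliet's "net liveness effect" heuristic can be sketched in a few lines. All names here are made up for illustration; the real codegen scheduler tracks liveness quite differently:

```python
# Net liveness effect of scheduling an instruction: it defines one new
# value, and kills every source for which this is the last remaining use.
# mad %4 %1 %2 %3 with all three sources dying here has net effect -2.
def net_liveness(instr, use_counts):
    defs = len(instr["defs"])
    killed = sum(1 for s in instr["srcs"] if use_counts[s] == 1)
    return defs - killed

mad = {"defs": ["%4"], "srcs": ["%1", "%2", "%3"]}
# every source has exactly one remaining use -> all three die here
assert net_liveness(mad, {"%1": 1, "%2": 1, "%3": 1}) == -2
# %1 is still used elsewhere -> only %2 and %3 die
assert net_liveness(mad, {"%1": 2, "%2": 1, "%3": 1}) == -1
```

This also makes karolherbst's earlier mistake visible: scheduling by "source has use count 1" alone greedily extends values' lifetimes instead of ending them, whereas the net effect counts both the value created and the values killed.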
02:16karolherbst: because I know that something like that takes months to do
02:16karolherbst: I was hoping with some minor post-ra stuff we could at least hide the latency better
02:17karolherbst: but it didn't change a thing
02:20karolherbst: RSpliet: does that make any sense or can the first move be removed? not $p0 mov u32 $r20 $r18; not $p0 mov u32 $r21 $r19
02:21karolherbst: ohh wait
02:21karolherbst: I think I am still a bit tired
02:29karolherbst: okay, that can be replaced: max ftz f32 $r25 abs $r19 $r63
03:10karolherbst: RSpliet: ex2(preex2(mul(lg2(a), b))) => mul(a,b) ?
03:15karolherbst: ohh wait...
03:15karolherbst: this is a^b
03:15RSpliet: eh? (ln2(a)*b)^2 != a*b
03:15karolherbst: yeah I was playing around in wolfram alpha and just messed it up
03:15karolherbst: but this opt might be useful for b==2 and maybe b==3 too
03:16karolherbst: a mul should be better than ex2 and lg2
03:16karolherbst: well in my case b is 3
03:18RSpliet: pardon my screw-up btw, should've been: 2^(ln2(a)*b)
03:19karolherbst: no worries
03:20karolherbst: but this is a^b
03:20karolherbst: for b==2 this can be simply optimized to a single mul, which would be really nice
03:20karolherbst: and with b==3 we would have two muls
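The identity being discussed checks out numerically: lowering `pow(a, b)` to `ex2(b * lg2(a))` (with preex2 as a range-reduction step in between) equals `a**b` for positive `a`, so for small integer exponents plain muls give the same value. A quick sanity check, assuming ex2/lg2 behave like base-2 exp/log:

```python
import math

# pow(a, b) as the hardware lowers it: ex2(b * lg2(a)) == a ** b for a > 0,
# so b == 2 collapses to a*a (one mul) and b == 3 to a*a*a (two muls).
def pow_via_ex2_lg2(a, b):
    return 2.0 ** (b * math.log2(a))

for a in (0.5, 1.0, 3.0, 7.25):
    assert math.isclose(pow_via_ex2_lg2(a, 2), a * a)
    assert math.isclose(pow_via_ex2_lg2(a, 3), a * a * a)
```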
03:21karolherbst: what is the difference between mul ftz and mul dnz by the way?
03:23RSpliet: is that rounding mode?
03:25karolherbst: I guess so
03:26RSpliet: idk, it's bound to be encoded somewhere in envyds
04:44karolherbst: RSpliet: wow, that ex2/lg2 opt just gave me 1% perf in pixmark_piano
04:46karolherbst: and 15 other shaders are also affected :)
04:47karolherbst: all 15 from the game deadcore :/ yeah well
05:18karolherbst: uhhh, with a nouveau mesa change I made my intel screen flicker in a really odd way :O
05:37imirkin: karolherbst: an opt case for OP_POW in opnd() seems like it'd be very beneficial
05:37imirkin: karolherbst: you can go up to pow(4) i think -- all of those should just be 2 mul's
05:38karolherbst: imirkin: so I would catch those lg2, ex2 thingies with that?
05:38imirkin: which gets lowered to lg2/ex2 after the opt passes
05:38karolherbst: oh yeah makes sense
05:39karolherbst: anyway, I found a serious issue, or maybe I did something terrible
05:39karolherbst: imirkin: why could that cause artefacts in games: https://github.com/karolherbst/mesa/commit/e34eec30117a1101ca828d39d52f94b05babaa5a
05:40karolherbst: ohh wait, I know why this could, :/
05:40karolherbst: maybe I messed something up
05:55karolherbst: imirkin: okay this patch on mesa master causes some bad stuff in borderlands. Something AlgebraicOpt related :/
05:55imirkin: the opt passes aren't ready to be run in a loop?
05:55karolherbst: mhh maybe
05:56imirkin: i dunno, that's weird though
05:56karolherbst: maybe LocalCSE causes AlgebraicOpt to do something strange
05:58karolherbst: nope, it is algebraic opt
06:00karolherbst: imirkin: and I don't think it has much to do with running them in a loop, because each opt shouldn't affect the result of the shader, should it?
06:00imirkin: each opt expects certain input
06:00imirkin: and if it receives a new kind of input, it may not be ready for it
06:00imirkin: in principle, you're right - the opts shouldn't affect the result of the shader
06:01imirkin: but for example AlgebraicOpt isn't used to run *after* ModifierFolding
06:01imirkin: so it might not check modifiers everywhere it should
06:02karolherbst: that makes sense
06:04karolherbst: ... awesome, not even the instruction count changes
06:10karolherbst: imirkin: it doesn't seem like anything changes though...
06:11imirkin: instruction count != instructions...
06:11karolherbst: I ran every shader_test through shader_runner with NV50_PROG_DEBUG=3
06:11karolherbst: and diffed every output
06:12karolherbst: I will clean build, test again and then I don't know
06:16karolherbst: guess what
06:16karolherbst: the game is 32bit
06:18karolherbst: ahh nice, now I have 438 changed shaders
06:36karolherbst: imirkin: I guess I'll have to trace and find the draw call painting the weird stuff I saw
06:37imirkin: hmmmm.... are some of our opts broken on 32-bit?
06:38imirkin: there shouldn't be any differences in shader compilation.... except *very* occasional diffs due to set ordering
06:38imirkin: we have an unordered set of pointers
06:38karolherbst: yeah I know
06:38imirkin: oh, but iirc i fixed that source of non-determinism
06:38karolherbst: I think this is something really rare
06:38imirkin: (in the SpillCodeInserter)
06:38karolherbst: most of the stuff looks fine
06:43karolherbst: imirkin: same result with glretrace
06:44imirkin: well there shouldn't be any diffs between 32-bit and 64-bit compilation
06:44imirkin: so if you have ANY shaders that compile differently on 32- and 64-bit
06:44imirkin: i'd like to see the diffs
06:44karolherbst: if I find one...
06:45imirkin: i thought you had 438 of them
06:46karolherbst: ohh now I know what you meant.. no, I was only rebuilding my 32bit build for the game
06:46karolherbst: and left my 64bit build alone
06:46karolherbst: so shader_runner was using the same binaries
06:46imirkin: oh i see
06:46imirkin: "oops" :)
06:47RSpliet: imirkin: who are you citing there?
06:47imirkin: does it matter?
07:20karolherbst: imirkin: so I think I will find the shader in a moment :), only 300 calls to check
07:20imirkin: good times, eh
07:20imirkin: fwiw i like to just do NV50_PROG_DEBUG=1 glretrace
07:20karolherbst: thing is
07:20imirkin: with both things
07:20imirkin: and then diff -u the output :)
07:21karolherbst: you won't see anything out of that
07:21karolherbst: there are like thousands of add->mad opts
07:22karolherbst: well I could also just partly disable those opts and see which exactly is causing it...
07:25karolherbst: apitrace doesn't help that much though :/
07:28karolherbst: handleAdd or handleRCP
07:29karolherbst: though they could change something which changes something else later
07:29karolherbst: seems like handleAdd
07:30imirkin: that's the annoying bit with optimizations :)
07:30imirkin: perhaps it tries to make an add neg a neg b?
07:30karolherbst: ohh wait, I turned everything off after the second algebraic
07:30imirkin: which iirc isn't supported for... integers?
07:30imirkin: i forget
07:30karolherbst: I could disable this
07:30karolherbst: and it is handleAdd for sure
07:30imirkin: but it should be checking for it
07:31karolherbst: for what should I check?
07:31karolherbst: algebraic handleAdd only does this tryADDToMADOrSAD thing
07:31imirkin: dunno, there are target checks for all this stuff.
07:32imirkin: perhaps the tryADD* forgets to check one of the target callbacks
07:32imirkin: or forgets to copy in one of the modifiers
07:32karolherbst: I will disable mads for now, maybe it is a sad thing...
07:32imirkin: we never emit sad's
07:34karolherbst: imirkin: should I just print all conversions in tryADDToMADOrSAD?
07:34karolherbst: add->print and both sources
07:35imirkin: you should try to figure it out yourself :p
07:36imirkin: i think you have a decent understanding of this stuff by now
07:37imirkin: check the target for the various restrictions
07:37imirkin: there's isModSupported and a few other useful bits
07:37imirkin: *read* the functions so that you better understand the restrictions yourself
07:38karolherbst: imirkin: what I see sometimes: add(sat mul, sat mul) => mad()
07:39karolherbst: no idea if that's bad or not
07:39imirkin: don't think that works
07:39imirkin: there's no sat modifier on the mul part of the mad
07:39karolherbst: yeah, I thought as much
07:40karolherbst: so if the muls have a sat, I shouldn't do the conversion
07:41imirkin: and it of course never saw this before since previously there were just OP_SAT's in place
07:41imirkin: but with ModifierFolding, those get folded in
07:43karolherbst: that was it
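The bug found above has a simple numeric counterexample: folding `add(sat(a*b), sat(c*d))` into a mad drops the saturation on the multiply, because mad has no sat modifier on its mul part. A sketch of the miscompile (operand values chosen here just to expose it):

```python
# sat clamps to [0, 1]; the mad fold loses the clamp on the first product.
def sat(x):
    return max(0.0, min(1.0, x))

a, b, c, d = 2.0, 1.0, 0.25, 1.0
correct = sat(a * b) + sat(c * d)   # sat(2.0) + sat(0.25) = 1.0 + 0.25
folded = a * b + sat(c * d)         # mad(a, b, sat(c*d)) = 2.0 + 0.25
assert correct == 1.25
assert folded == 2.25
assert correct != folded            # the fold changes the shader's result
```

This is exactly the "new kind of input" failure mode imirkin described: before ModifierFolding there were explicit OP_SATs in the stream, so tryADDToMADOrSAD never had to consider a sat modifier on its mul sources.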
07:55mupuf: karolherbst: you should monitor the execution time of the compilation too
07:56mupuf: it is not super important, but it still needs to be monitored :p
07:56karolherbst: mupuf: I fear the overhead of shader-db is too big though
07:56karolherbst: I did it once and didn't see anything noticeable
07:56karolherbst: everything under <2% change
07:57mupuf: good-enough then :p
07:57karolherbst: eon games are linking shaders every frame, these are a total pain :/
07:59karolherbst: but I am getting really near to blob performance now :)
07:59karolherbst: I will reach 80% for sure
08:00Tom^: wait, what are you doing :o
08:00karolherbst: currently at 76.6% with pixmark_piano
08:00karolherbst: Tom^: yeah well, tuning the nouveau compiler a bit
08:03karolherbst: my target is 90% though
08:04karolherbst: imirkin: lower pow to mul,mul for 0-4?
08:05karolherbst: ohh wait, 0 doesn't make much sense
08:05karolherbst: 1-4 then :)
08:05karolherbst: .... okay 2-4
08:17imirkin_: karolherbst: well, 0 makes sense - the result should be 1 iirc
08:17imirkin_: karolherbst: 1 it should just be x
08:18mupuf: karolherbst: what was the perf of pixmark_piano before?
08:18karolherbst: mupuf: master: 972, my stuff: 1027, blob: 1340
08:19mupuf: very nice :)
08:19imirkin_: karolherbst: since it gets lowered to a bunch of ops, perhaps it makes sense to go up to power of 16. dunno.
08:19karolherbst: yeah, maybe
08:20imirkin_: and sfu ops suck a big one too
08:20karolherbst: imirkin_: there are only two combinations anyway, mul(a, a) or mul(a, mul), and everything up to 16 is a combination of both
08:21imirkin_: any integer power in fact
08:22karolherbst: even 1024?
08:22imirkin_: even 1023!
08:22karolherbst: yeah but I doubt that makes sense
08:22karolherbst: would be funny if that would be faster though
08:24imirkin_: def more accurate!
08:24karolherbst: mupuf: that's my favourite one of my patches: https://github.com/karolherbst/mesa/commit/e0e7c09f15215ef5737d62f58e7997a731468ae7 :)
08:24imirkin_: anyways, that's why i picked 16 as a max
08:24imirkin_: i think it's a good trade-off
08:27karolherbst: imirkin_: should the case for 1 be op = OP_MOV? or rather OP_CVT?
08:27karolherbst: I never know which of these to use
08:27imirkin_: use CVT when there's some modifier
08:27imirkin_: use MOV when there's no modifier
08:27imirkin_: pow can't have any modifiers
08:27imirkin_: so just use mov
08:27karolherbst: nice, okay
08:27mupuf: karolherbst: oh, you already have a proto for it
08:28mupuf: that is really nice!
08:28imirkin_: btw, dnz = treat nan as "0" for multiplication
08:28karolherbst: mupuf: and because this is post-ra, it doesn't even affect gpr count or something :D
08:28imirkin_: this is important because we want 0^0 = 1 for silly reasons
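Following imirkin's description of dnz ("treat NaN as 0 for multiplication"), the reason it matters for pow is visible in a few lines. `mul_dnz` below is a model of that description, not the hardware's exact definition:

```python
import math

# pow(a, b) is lowered to ex2(b * lg2(a)).  For a = b = 0 the plain
# multiply computes 0 * lg2(0) = 0 * -inf = NaN, but the dnz multiply
# yields 0 instead, and ex2(0) = 1 gives the desired 0**0 == 1.
def mul_dnz(x, y):
    if math.isnan(x) or math.isnan(y):
        return 0.0
    r = x * y
    return 0.0 if math.isnan(r) else r

lg2_zero = float("-inf")                       # lg2(0)
assert math.isnan(0.0 * lg2_zero)              # plain mul: NaN poisons ex2
assert 2.0 ** mul_dnz(0.0, lg2_zero) == 1.0    # dnz mul: ex2(0) == 1
```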
08:28mupuf: the commit message is unclear though (+22%, but how far are you from the blob?)
08:28karolherbst: imirkin_: ohh okay
08:28karolherbst: mupuf: there is a hw limit
08:29karolherbst: mupuf: we can't dual schedule more than 3 out of 7 instructions
08:29karolherbst: mupuf: mesa master already hits above 40% in unigine heaven
08:29karolherbst: just pixmark_piano was somehow bad
08:29mupuf: yeah, but there is the more practical limit I guess which is how well the program maps to the dual issue capabilities
08:30RSpliet: mupuf: nothing scheduling can't solve :-P
08:30karolherbst: yeah, but you can get close to 43% easily
08:30mupuf: RSpliet: ah ah
08:30mupuf: karolherbst: then it would be nice to say from which number you went to which
08:30karolherbst: mupuf: with that patch I am at 42.7% dual issued in pixmark_piano
08:30karolherbst: ohh right
08:31mupuf: because otherwise, it is kind of hard to judge :)
08:31karolherbst: this shows that the dual issue stuff seems to work pretty well actually
08:31karolherbst: because if it wouldn't you would notice
08:32mupuf: great work nonetheless!
08:33karolherbst: the pass is a bit expensive though :/ I already increased it by a lot just by checking the next instruction
08:33karolherbst: so maybe being so smart there doesn't really pay off
08:34mupuf: maybe you can call it when you notice the dual-issue rate is low?
08:34mupuf: does it improve heaven for instance?
08:36karolherbst: mupuf: mesa can only know when it is emitting the binary, and if we check every instruction already, then we can improve the situation at the same time
08:36karolherbst: so checking how good dual issuing is will already take like 80% of the cpu time this thing would
08:37karolherbst: mupuf: not really, heaven is pretty good without it already
08:37karolherbst: mupuf: you can usually dual issue just fine, I never saw any instruction move more than 3 places away in pixmark_piano, and its shader has like 3800 instructions
08:41mupuf: ah ah, it is a beast :p
08:42RSpliet: karolherbst: here's a heuristic that might be interesting: try and shove tex instructions upwards to the last slot in the (previous) scheduling block
08:42RSpliet: as it can't be dual issued anyway :-P
08:42karolherbst: RSpliet: doesn't matter
08:43karolherbst: two texs might be problematic in a block though
08:44mupuf: well, this week, I accidentally partly-solved the biggest TODO on my list... ensure that the variance of each run is within the confidence_margin we set, and then using the student-t technique to find if two data-sets are the same or not (eg. was there a perf change or not).
08:44mupuf: Why do I say "partly"? Because it assumes a gaussian distribution
08:44mupuf: which I know is WRONG whenever we get CPU limited or too close to the TDP
08:45mupuf: should not be too much of an issue on nouveau
08:46mupuf: but on intel hw, with such a low TDP and fast reclocking, it is a pain to get stable results :s
08:47mupuf:was pretty pleased with the variance for the tests on nouveau :D
08:47mupuf: well, I guess it is good that I test and design based on the worst case!
08:47mupuf:has a small NUC to do most of his tests without blocking his main machine
09:04karolherbst: imirkin_: I think 16 is too much
09:04karolherbst: ohh wait, there can be muls shared :/
09:06imirkin_: 16 = a2 = a*a, a4 = a2*a2, a8 = a4*a4, a16 = a8 * a8
09:06imirkin_: so... 4 mul's
09:06imirkin_: vs 3 sfu ops (preex2, ex2, lg2) + 1 mul
09:06imirkin_: seems like a good trade
09:06karolherbst: yeah I wasn't thinking clearly
09:12karolherbst: I should prepare for tomorrow, but I am soo tired :/
09:12imirkin_: ah going to fosdem?
09:22karolherbst: now I get 1035 points, but I didn't change a thing :/ (except I wrote the pow to MUL pass)
09:27karolherbst: imirkin_: well I had this before: https://github.com/karolherbst/mesa/commit/8d09cd8a6fe76ea0785466ac1a793b809c4e7ee6
09:28karolherbst: I didn't want to write that :/
09:28karolherbst: just an old patch
09:30imirkin_: you need to ping me less often... please try to only do so when you have a question that you've given some thought to solving already.
09:30karolherbst: yeah I know, but this was by accident now :/
10:45karolherbst: okay, I don't think I find anything anymore in pixmark_piano, so next most obvious thing to do would be zcull for nvc0
11:05karolherbst: RSpliet: I think I got a pretty naive rescheduling algorithm: just iterate over all instructions in a block, and when i removes more live values than i->prev, swap them
11:06imirkin_: you want to keep track of the # of live values
11:06imirkin_: if you always minimize the # of live values, you end up using very few registers and having no ability for execution parallelism
11:07karolherbst: I thought using fewer registers means more threads
11:07karolherbst: but I was thinking of just trying that out and see how that goes
11:08imirkin_: there's a trade-off
11:08karolherbst: there is a max amount of threads I suppose
11:09imirkin_: look at the getThroughput stuff
11:09imirkin_: that should give you an idea
11:09imirkin_: basically there's an ideal minimum delay between instructions
11:09imirkin_: s.t. the next instruction of that type would be ready to execute
11:09imirkin_: without having to wait for a unit to become available
11:09glennk: no more of those BFE etc things the front end emits a bunch of ops for?
11:10imirkin_: i made an opt for shift + and -> bfe/bfi
11:10imirkin_: as well as detecting byte extractions and using convert
11:11karolherbst: did you measure the benefit already?
11:11imirkin_: no, but it was a ton fewer ops in some situations
11:11imirkin_: i pushed it a long time ago
11:12imirkin_: i doubt it affects too many shaders
11:12karolherbst: ohhh okay
11:15karolherbst: I think I will try to figure out that zcull thing first. It sounds like it could give a huge benefit and I don't think anybody has really worked on it in the last two years?
11:15imirkin_: i've never looked at it
11:16glennk: zcull = hierz?
11:16karolherbst: it is something nouveau doesn't do for nvc0, so yeah
11:16karolherbst: pushing fake zcull data should show us pretty fast what it does though
11:17glennk: http://developer.download.nvidia.com/GPU_Programming_Guide/GPU_Programming_Guide_G80.pdf page 43
11:17karolherbst: yeah I read that already
11:18imirkin_: glennk: is that the same thing as hierz?
11:18karolherbst: I think zculling is based on triangles or something bigger
11:18glennk: pretty much
11:18glennk: i think piano is a bit of a special case, it's just a single quad and all the work is in the fragment shader
11:19karolherbst: I thinks this is more for lateZ or how that would be called in that case
11:19karolherbst: ohh that should be EarlyZ :D
11:20imirkin_: zcull will help somethign like heaven (maybe)
11:20glennk: probably, stuff using lots of shadow maps should notice the difference
11:21karolherbst: but I think pixmark_piano could be a good start to see what the blob does regarding zcull/earlyZ
11:21karolherbst: even if there is no benefit
11:21glennk: gears is probably better for that
11:22karolherbst: gears is bottlenecked by pcie on my machine
11:22glennk: not if you fullscreen it
11:22karolherbst: pretty sure it still is
11:22imirkin_: glennk: he's on optimus
11:22glennk: or well, on hybrid laptops it would be
11:26karolherbst: glennk: pixmark_piano has 65 calls per frame :)
11:27karolherbst: and two draw calls
11:28karolherbst: ohh the second draw call just draws tux
11:29glennk: karolherbst, yes, but it's got a pixel shader with a few thousand ops, you won't be able to see if zcull is active or not
11:30glennk: ...and no overdraw or z buffer
11:30karolherbst: yeah, I know, I just want to know what the blob does in that case
11:30imirkin_: no z = no zcull
11:30imirkin_: basically for zcull you need to allocate an aux buffer
11:31imirkin_: and use it in conjunction with the depth texture
11:31glennk: similar to radeon's htile
11:32imirkin_: at least that's my guess
11:32imirkin_: the size of this buffer needs to be based on the size of the depth texture
11:32imirkin_: i guess each miplevel of the texture needs this buffer
11:32imirkin_: or maybe just forget being clever and stick it on the fb
11:33imirkin_: and when binding a new fb, just reset the zcull info and mvoe on
11:33glennk: not sure what happens when shaders read z
11:33karolherbst: well my first step is to somehow push zcull related stuff to the hw and see how I can mess stuff up
11:33karolherbst: glennk: the pdf tells you ;)
11:33imirkin_: glennk: well, on fermi it's probably a little diff than on tesla
11:34karolherbst: ohh right, yeah well
11:34glennk: older hardware is probably quirkier to get going
11:34karolherbst: I just want to see a change for now, that's all
11:34imirkin_: i guess the zcull buffer should be allocated when clearing a fb
11:35imirkin_: i dunno
11:35glennk: karolherbst, no, it doesn't for that case
11:35imirkin_: but then what happens if someone manually writes to the texture...
11:36glennk: does nouveau do fast clears on z?
11:36imirkin_: fast is a relative term
11:36imirkin_: i don't know if we have ZBC stuff set up properly
11:37imirkin_: but we use the "simple" clear stuff rather than drawing "by hand"
11:37imirkin_: but what that clear does, who knows
11:37glennk: easy enough to measure if you get clear rates >> memory bandwidth
12:27pmoreau: Oh god!! Of course clover wasn't going to find any device, I never thought that Nouveau wasn't loaded by default on Reator… --" (which makes sense, but…)
12:28pmoreau: hakzsam: Ok, got the same error inside the SchedDataCalculator…
12:29imirkin_: karolherbst: fyi this is the change i see from your shl+shr patch: http://hastebin.com/hamureqeru.coffee
12:29karolherbst: imirkin_: nice
12:29imirkin_: mind if i just replace your section with this one in the commit log?
12:30imirkin_: imho it's nice to run against a "standard" shader-db
12:30karolherbst: go ahead
12:30pmoreau: Still, weird: why don't I get that error when compiling the same program for NVC0 using nouveau-compiler?
12:31imirkin_: pmoreau: fermi doesn't run sched data
12:31imirkin_: pmoreau: only kepler+
12:32pmoreau: Hum, maybe the Kepler was picked up then
12:32imirkin_: karolherbst: you're sometimes nouveau@ and sometimes git@... you should probably pick one
12:33karolherbst: right, I guess my mesa is still at git@
12:33imirkin_: is there one you prefer? i can fix it up
12:33karolherbst: I changed all the stuff to that
12:34karolherbst: also I can't filter so easily if I get messages to git@ :/
12:34imirkin_: maybe some old commits still had the old one and you cherry-picked
12:35imirkin_: you can fix that stuff up with git commit --author btw
12:35karolherbst: yeah maybe
12:35karolherbst: I will check that for the future stuff
12:35pmoreau: Right, there is an nve4 in the bt, so it picked the GK106. https://phabricator.pmoreau.org/P84
12:36karolherbst: okay, one part I figured out already: ZCULL_HEIGHT and ZCULL_WIDTH are the depth buffer size? rounded up to 20
12:36imirkin_: 0x20 presumably?
12:36imirkin_: that's... highly surprising.
12:36karolherbst: a 1024x640 buffer created 1040x640
12:36pmoreau: Ah great, I have the crash with nouveau_compiler as well, needed to test Kepler rather than Fermi. :-)
12:37karolherbst: pmoreau: which one?
12:37imirkin_: i don't think it's to 20 though
12:37pmoreau: karolherbst: GK106
12:37imirkin_: maybe it's +16 though
12:37karolherbst: pmoreau: I meant patch
12:37pmoreau: Ah, my SPIR-V stuff
12:37karolherbst: imirkin_: but why not for the height?
12:38imirkin_: karolherbst: check the logic in nvc0_state_validate
12:38imirkin_: looks like it rounds up to 224
12:38imirkin_: er hm, that's not it either
12:38imirkin_: i dunno.
12:38imirkin_: try diff sizes :)
12:39karolherbst: there are two more interesting things though
12:39karolherbst: ZCULL_ADDRESS and ZCULL_LIMIT
12:39karolherbst: 0x1140000 and 0x1160000
12:40karolherbst: which I assume is the location of the depth buffer or something?
12:40imirkin_: nvc0_validate_zcull just does + 1 << 17
12:40imirkin_: you should look at it :)
12:41imirkin_: pick up where calim left off rather than try everything from scratch
12:41karolherbst: ohh right
12:41martm: tried to look on google and did not even find anything, what is dEQP?
12:41karolherbst: imirkin_: but there was nothing at screen creation time
12:42imirkin_: .../nouveau/codegen/nv50_ir_peephole.cpp.save | 3932 ++++++++++++++++++++
12:43airlied: lols git fail
12:44karolherbst: sorry for that
12:45karolherbst: thing is, I have no clue when to call nvc0_validate_zcull :/ just at screen creation time where the other zcull stuff is? (the blob seems to push stuff there) or would some other place be more reasonable for that
12:48imirkin_: karolherbst: it needs to be done based on the framebuffer i think
12:48karolherbst: ohh yeah, okay, I slowly get this stuff
12:48karolherbst: I need to add a flag down in that file
12:48karolherbst: and dirty |= it
12:48imirkin_: in nvc0_state.c
12:48imirkin_: you need to figure out where to store the zcull buffer
12:48imirkin_: maybe it should be stored with the fb
12:49imirkin_: or maybe... something else
12:49imirkin_: i dunno :)
12:49imirkin_: maybe the zcull buffer should only be allocated into the fb when a clear is done
12:49imirkin_: otherwise i have no idea how to initialize it
12:50karolherbst: k, first segfault :O
12:51karolherbst: ohh nv50_surface is NULL
12:55karolherbst: mhhh, I was hoping that stuff would randomly disappear when I push some silly data in there :/
13:12imirkin_: karolherbst: i pushed a handful of your patches
13:12karolherbst: yeah I saw
13:13imirkin_: the ones that i felt were super-safe
13:13karolherbst: yeah I noticed :)
13:13imirkin_: the others need testing and thought
13:26imirkin_: interesting. i get these results by disabling the DCE that st/mesa does: http://hastebin.com/obunidowaw.coffee
13:31RSpliet: imirkin_: what? have you spotted examples of difference?
13:32imirkin_: RSpliet: the st/mesa stuff can cause various stupidity to appear
13:32imirkin_: although i'm surprised that its *dce* does that
13:33imirkin_: disabling the cp that st/mesa does makes things way worse :(
13:33imirkin_: disabling CP i get: http://hastebin.com/ijelusimoq.coffee :(
13:34RSpliet: sure, but that might be compensated for by running an extra round of CP in nouveau?
13:34RSpliet: as for the DCE, I'm rather curious to understand what kind of stupidity it is that causes the difference :-P
13:34pmoreau: hakzsam: Think I found the error: the pass initialises an array A with the nb of BBs present, and use the BB's id to access the array.
13:35imirkin_: RSpliet: yeah it's odd
13:35pmoreau: hakzsam: There are two BBs, however their ids are 0 and 2 resp. So id 2 is wreaking havoc…
13:36pmoreau: Argh, because BB of id 1 is the exit BB, which doesn't get any instruction… :-/
13:37pmoreau: So it most likely gets removed, but the BBs aren't renumbered?
13:41imirkin_: well at least it's no longer hilariously bad now that i've fixed indirect arrays
13:45karolherbst: imirkin_: mhhh, changing gpr count has a rather big perf impact
13:46karolherbst: I just increased the count by 10 and got a perf impact of 6ms frame time
13:50karolherbst: 49 -> 59 max gpr: 1035 -> 984 points
13:52RSpliet: karolherbst: sure, instead of having 6 warps in flight concurrently, you now only have 5... that's bound to have some impact on your efficient mem bw
13:54karolherbst: ahh okay
13:55Jayhost: karolherbst I have dmsg and kmsg of gm107 lockout after pstate change. Did you want to see or should I poke around.
13:55karolherbst: I have no idea how that stuff works on maxwell
13:55RSpliet: karolherbst: on kepler, 48 should be the tipping point for 7 warps... think you can shave two registers off that usecase? :-P
13:56pmoreau: \o/ hello_world works on Kepler now! But not on Fermi O.O
13:57karolherbst: RSpliet: well I just cut out two and see how that goes
13:57karolherbst: meh it cuts something important off
13:58RSpliet: eh yes, you'd want to do that with proper optimisations instead
13:58karolherbst: but the blob uses 56 regs in total :/
13:59RSpliet: karolherbst: yeah that's fine, still gives them 6 warps :-P
13:59RSpliet: there's 65536 GPRs per SMX, divide that by 192 and you'll have the GPR per SIMD lane
14:00RSpliet: (a very awkward 341.33 it seems, wtf?)
14:00RSpliet: anyway, floor(341.33 / shader GPR use) gives you the number of concurrent warps
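RSpliet's occupancy arithmetic, worked through; this is the simplified per-lane model from the chat (real hardware allocates registers per warp with some granularity), but it reproduces the warp counts mentioned above:

```python
# Kepler SMX occupancy sketch: 65536 GPRs per SMX, 192 SIMD lanes
# -> ~341.33 registers per lane; resident warps = floor(341.33 / regs).
GPRS_PER_SMX = 65536
LANES = 192

def concurrent_warps(regs_per_thread):
    return int((GPRS_PER_SMX / LANES) // regs_per_thread)

assert concurrent_warps(48) == 7   # the "tipping point" for 7 warps
assert concurrent_warps(49) == 6   # karolherbst's 49-gpr shader
assert concurrent_warps(56) == 6   # the blob's 56 regs: still 6 warps
assert concurrent_warps(59) == 5   # +10 gprs cost a whole warp
```

This is why bumping the shader from 49 to 59 registers dropped pixmark_piano from 1035 to 984 points: one fewer warp in flight to hide memory latency.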
14:01imirkin_: so i'm guessing that disabling DCE partly disables some of the later passes, which in turn is what allows nouveau to do better
14:01Jayhost: If anyone can point me in the right direction. Fifo Engine fault. Maxwell. Dolphin-emu. after pstate frequency change.
14:01karolherbst: the active_warps counter tells me 8.3G
14:02karolherbst: and 24k warps launched
14:02karolherbst: and I have 5 SMX by the way
14:02RSpliet: Jayhost: don't do pstate frequency changes on Maxwell, it's supposed to be unsupported
14:03RSpliet: karolherbst: no idea what the active warps counter is; certainly not on a single SMX
14:03Jayhost: RSpliet, can I work fixing it? Starting point?
14:05RSpliet: Jayhost: reverse engineer the clock tree? sorry mate, that's not an easy task - interpret and fix nvkm/subdev/clk/*.c if you're up for it
14:05RSpliet: I can't help you with it
14:06Jayhost: Cool. Thanks RSpliet and karolherbst
14:15RSpliet: skeggsb_: is there a particular reason why you hooked up GM107 to gk104_clk_new ? or is that an artefact from copy-pastry during the last big rewrite? :-)
14:40imirkin_: gr. i found a shader where the *renumbering* that st/mesa does greatly helps codegen ... somehow.