02:58 airlied: imirkin: nve7 in mbp
02:58 airlied: imirkin: if it happens again I'll probably try and restart gnome-shell instead of reboot
04:06 mupuf: imirkin: less instruction is of course better but GPUs are designed to mask the memory latency, so if you are memory bound, any change in the number of instructions won't change much
04:06 mupuf: as in nothing at all
04:06 mupuf: unless at some points the memory usage drops under 100% :p
04:17 RSpliet: mupuf: but if less instructions lead to a lower register use, we can schedule more warps at a given point in time
04:17 RSpliet: hence having more threads to hide the latency with
04:17 mupuf: definitely
04:18 mupuf: that's the very interesting thing about nvidia's architecture :)
04:18 RSpliet: I take it from your statement that Intel doesn't have that flexibility? :-P
04:18 mupuf: nor AMD :p
04:18 mupuf: AFAIR
04:18 RSpliet: nor ARM I guess
04:18 mupuf: there is an equivalent concept
04:19 mupuf: it is the SIMD8/16/32
04:19 mupuf: you basically need to compile your program multiple times, with 8, 16 or 32 threads
04:19 mupuf: but you will always share the same number of regs
04:20 mupuf: so, if you don't have enough regs, you may not be able to use the SIMD32 or 16
04:20 mupuf: there is no global register file
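RSpliet's point about register use and the number of schedulable warps can be put into numbers with a rough sketch. The register file size and the warp cap below are assumed values picked for illustration, not figures from the chat, and real hardware allocates registers in coarser granules than this.

```python
# Illustrative numbers only: the register file size and warp cap are
# assumptions for this example, not values quoted in the conversation.
REGFILE_REGS = 65536   # 32-bit registers per multiprocessor (assumed)
WARP_SIZE = 32         # threads per warp
MAX_WARPS = 64         # cap on resident warps (assumed)

def resident_warps(regs_per_thread):
    """Warps that fit in the register file at a given per-thread register use."""
    by_regs = REGFILE_REGS // (regs_per_thread * WARP_SIZE)
    return min(MAX_WARPS, by_regs)

for regs in (64, 48, 32):
    print(regs, "regs/thread ->", resident_warps(regs), "warps")
```

Under these assumptions, dropping from 64 to 32 registers per thread doubles the number of warps available to hide memory latency.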
07:25 karolherbst: imirkin: did you push the patches you told me about yesterday?
08:05 karolherbst: mhhh in furmark I get "Mesa: User error: GL_INVALID_OPERATION in glUniform1("Layer"@7 is float, not int)"
08:31 karolherbst: imirkin: I think I might have found a possible optimization in this shader code:
08:31 karolherbst: https://gist.github.com/karolherbst/b7a3adb9a42bf688afa3
08:31 karolherbst: ohh it is the tgsi
08:31 karolherbst: this line: https://gist.github.com/karolherbst/b7a3adb9a42bf688afa3#file-gistfile1-txt-L26
08:33 karolherbst: shouldn't the compiler be able to eliminate those 0 somehow?
08:33 karolherbst: also this: https://gist.github.com/karolherbst/b7a3adb9a42bf688afa3#file-gistfile1-txt-L65
08:34 karolherbst: I thought $r63 is always 0
08:34 karolherbst: isn't a add ftz f32 $r12 $r12 $r63 pointless then somehow?
08:42 imirkin: karolherbst: my changes should def not have affected that glUniform thing
08:42 imirkin: mmmm... add y, x, 0 should get optimized into mov y, x
08:43 karolherbst: imirkin: in that case it is add x, x, 0
08:43 imirkin: karolherbst: unfortunately you cut out the most interesting part
08:43 imirkin: which is all the lines that come before what you pasted, starting with the tgsi
08:43 karolherbst: thats a lot, but okay :D
08:44 imirkin: the nouveau optimizations don't try to achieve a fixed point, so it's conceivable that something happens after ConstantFolding which makes such an opt possible
08:44 karolherbst: https://gist.github.com/karolherbst/b7a3adb9a42bf688afa3 done
08:44 imirkin: although... unlikely
08:45 karolherbst: imirkin: I just think the post-RA dump makes it a lot easier to find optimization possibilities by eye, that's all
08:45 karolherbst: especially in a ca. 4k instruction binary
08:45 imirkin: 61: add ftz f32 %r7722 %r7721 0.000000 (0)
08:45 imirkin: that's not easier to find?
08:46 karolherbst: well
08:46 karolherbst: $r63 or 0.0000000, does it matter? :D
08:46 imirkin: anyhooo.... no clue why that'd happen
08:46 imirkin: normally we should have taken care of that
08:46 karolherbst: yeah, I thought so too
08:46 karolherbst: this seems way too trivial
08:46 imirkin: i wonder if it's a left-over of some other opt that doesn't get cleaned up
09:01 karolherbst: imirkin: any idea how I can investigate this? in which pass does this have to be optimized away?
09:03 Tom^: imirkin: btw did you notice that i had same minecraft layer on 11.0.6
09:04 imirkin: Tom^: nope... if you could make a modestly sized apitrace (preferably with a low resolution), that'd be super
09:04 imirkin: karolherbst: ConstantFolding::opnd, look for case OP_ADD
09:04 Tom^: roger
09:05 imirkin: Tom^: basically my GPU has like 1/100th the power of yours, and it's convenient if i can play through the trace quickly ;)
09:05 Tom^: you have to learn to compensate opensource slowness with money.
09:05 Tom^: =D
09:06 karolherbst: :D
09:06 imirkin: burn
09:07 imirkin: i take a different approach - if i'm spending time developing drivers, for free, i'm not going to pay $$ for hardware for that privilege
09:07 Tom^: elementaldemo takes parameters for resolution?
09:07 imirkin: er wait
09:08 imirkin: if it's just elemental i can run it myself
09:08 imirkin: i had meant for minecraft
09:08 imirkin: but like i said, on my GF108, elemental is fine. so it's either a kepler+ issue or a kepler2 issue. i have a kepler2 at work, so i can check it out on monday.
09:09 Tom^: ok so no apitrace then?
09:09 imirkin: nah
09:09 karolherbst: imirkin: if (i->usesFlags()) break; could that trigger that?
09:09 imirkin: karolherbst: sure, but it doesn't use flags
09:09 karolherbst: k
09:09 imirkin: karolherbst: that's for like when you want to addc
09:09 imirkin: (i.e. carry)
09:10 karolherbst: ahh
09:10 karolherbst: k
09:11 imirkin: karolherbst: probably the leftovers of 42: MAD TEMP[5].xyz, TEMP[2].yzxx, IMM[4].xxyy, -TEMP[5].xyzz
09:11 imirkin: IMM[4].x = 0
09:11 imirkin: so that becomes ADD 0, -TEMP[5].xyzz
09:12 karolherbst: ahhhh
09:12 karolherbst: so this pass has to be run again later on?
09:12 karolherbst: or just bad ordering
09:12 imirkin: and someone forgets to fix it up
09:12 imirkin: well, the pass should keep running
09:12 imirkin: i thought it'd fix it up, but maybe not? or there's something subtle?
09:12 imirkin: or i'm just plain wrong
09:12 imirkin: ;)
09:13 karolherbst: I could just run the pass at the end again
09:13 imirkin: anyways, i'll let you play with it
09:13 karolherbst: where is the invocations of all those passes?
09:14 karolherbst: ohhh found it
09:14 karolherbst: yeah lol
09:14 karolherbst: running the pass again cuts one instruction
09:14 karolherbst: yeah
09:15 imirkin: ok, so that means that the pass isn't working the way i thought it did
09:15 imirkin: or something funky's going on
09:15 imirkin: the solution is NOT to run the same pass 100x
09:15 karolherbst: this one is gone now
09:15 karolherbst: :D
09:15 karolherbst: or inside a loop until nothing changes :p
09:15 imirkin: right. so that's called running to a fixed point
09:15 imirkin: which is actually pretty common
09:15 karolherbst: thing is
09:15 imirkin: however these passes have been carefully structured so that a single run through gets you 99.9% of the benefit
09:16 karolherbst: now I get a segfault :(
09:16 karolherbst: at ../../../../../src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp:818
09:16 imirkin: although i'm debating whether i should create an algebraic pass and run that + constant folding to a fixed point... dunno
09:18 karolherbst: insn wasn NULL in src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp:818 :/
09:18 karolherbst: *was
09:20 karolherbst: imirkin: okay, it seems to be messed up inside ConstantFolding
09:20 karolherbst: running it just twice without anything else in between also eliminates this instruction
09:23 karolherbst: imirkin: ever measured the effect of all passes together?
09:23 imirkin: ?
09:23 karolherbst: I just disabled all, just to compare performance
09:23 imirkin: you can use NV50_PROG_OPTIMIZE=0/1/2 to disable some of them
09:23 karolherbst: in furmark: 7700 => 3900 instructions and 30 => 50 fps
09:23 imirkin: uhhh... hopefully that's disabled -> enabled
09:24 imirkin: not vice-versa :)
09:24 karolherbst: yes
09:24 karolherbst: :D
09:24 karolherbst: no it is fine, it is good to know that the effect is _that_ big though
09:24 imirkin: NV50_PROG_OPTIMIZE=1 should give you a lot of the simpler opt
09:24 imirkin: which gets you a ton of the benefit
09:25 imirkin: basically the compiler is written with those passes in mind... so the earlier stages generate pretty crap code
09:25 imirkin: because they know the later opts will clean it up
09:26 karolherbst: :D
09:29 karolherbst: imirkin: that's also strange: https://gist.github.com/karolherbst/b7a3adb9a42bf688afa3#file-gistfile1-txt-L8373-L8374
09:29 karolherbst: r11= -r10; r12 = r11 * r11
09:31 karolherbst: the negation doesn't make much sense, does it?
09:31 imirkin: yeah, but i don't think we handle that case too well
09:31 karolherbst: later there is r11 = r11 * r12
09:31 imirkin: let me check
09:31 karolherbst: but the latter one can be r11 = -r10 * 12
09:31 karolherbst: and the neg just removed
09:31 imirkin: hmmmmmmm
09:32 imirkin: ModifierFolding should propagate that one
09:32 karolherbst: *r11 = -r10 * r12
09:32 imirkin: unless it doesn't for some reason
09:32 imirkin: OP_MUL can take neg on either arg
09:32 karolherbst: I get the feeling this shader is a gold mine :D
09:32 imirkin: and so ModifierFolding should have propagated that into r12 = -r10 * -r10
09:32 imirkin: every shader is a goldmine
09:32 karolherbst: yeah but the small ones are a bit boring
09:33 imirkin: there's a lot of stuff that codegen handles well
09:33 imirkin: and there's a lot more that it handles properly in theory, but there's some dumb bug preventing it
09:33 imirkin: and then there's a WHOLE lot more that it just doesn't handle at all
09:33 imirkin: the things you're pointing out are things it should handle but apparently doesn't for one stupid reason or another
09:34 karolherbst: imirkin: maybe if ConstantFolding::visit changed something, just rerun on the "current" instruction, which might have changed
09:34 imirkin: i thought it did
09:34 imirkin: does it not?
09:34 karolherbst: nope
09:34 karolherbst: next = i->next; is the first thing in the loop
09:34 karolherbst: and if() else if() chains
09:34 imirkin: ConstantFolding::foldAll(Program *prog)
09:34 imirkin: ah i see
09:34 imirkin: so it runs itself up to 2x
09:35 imirkin: or make the OP_MAD handling a bit cleverer
09:35 imirkin: shouldn't be too hard
09:35 imirkin: it's a super-special case
09:35 imirkin: /* Move the immediate to the second arg, otherwise the ADD operation
09:35 imirkin: * won't be emittable
09:35 imirkin: */
09:36 imirkin: just add a check if the immediate is 0 and make it into a MOV instead
09:36 imirkin: er wtf... i re-run expr() on it
09:36 imirkin: oh, but only if they're BOTH immediates
09:36 karolherbst: in the else branch?
09:36 imirkin: yeah, should re-run opnd()
09:36 imirkin: like this
09:37 karolherbst: first I try to write a small tgsi to catch the case :D
09:38 imirkin: http://hastebin.com/gajabemava.coffee
09:38 imirkin: karolherbst: try that?
09:39 karolherbst: yep
09:39 karolherbst: that works
09:39 imirkin: =]
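The problem discussed above (a MAD with a zero immediate becomes an ADD with 0, which then never gets folded because the rewritten instruction is not looked at again) can be reproduced in a toy model. This is only a sketch of the idea; it is not the content of the hastebin patch and not the real `ConstantFolding::opnd` code, and the tuple encoding of instructions is invented for the example.

```python
def fold_operand(insn):
    """One folding step on (op, srcs); returns (new_insn, changed)."""
    op, srcs = insn
    if op == "mad" and 0.0 in srcs[:2]:
        # a * 0 + c -> add 0, c (the product vanishes)
        return ("add", [0.0, srcs[2]]), True
    if op == "add" and srcs[0] == 0.0 and not isinstance(srcs[1], float):
        # move the immediate to the second slot
        return ("add", [srcs[1], 0.0]), True
    if op == "add" and srcs[1] == 0.0:
        # x + 0 -> mov x
        return ("mov", [srcs[0]]), True
    return insn, False

def fold_once(insn):
    """Fold a single time and move on, leaving later opportunities behind."""
    new_insn, _ = fold_operand(insn)
    return new_insn

def fold_rerun(insn):
    """Keep folding the same instruction until nothing changes."""
    changed = True
    while changed:
        insn, changed = fold_operand(insn)
    return insn

mad = ("mad", ["r2", 0.0, "-r5"])
print(fold_once(mad))    # the leftover add with a zero immediate
print(fold_rerun(mad))   # folded all the way down to a mov
```

Re-running the operand fold on the rewritten instruction is what removes the leftover add in this model.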
09:39 karolherbst: would be interesting to see what shader-db says about that
09:45 karolherbst: imirkin: okay now we could take care of this one: https://gist.github.com/karolherbst/b7a3adb9a42bf688afa3#file-gistfile1-txt-L8372-L8376 :D
09:46 karolherbst: imirkin: is there any difference in mul $r1 $r2 $r2 or mul $r1 neg $r2 neg $r2?
09:49 karolherbst: this makes total sense though: https://gist.github.com/karolherbst/1293948dd982b332424f
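On the question of whether `mul $r1 $r2 $r2` and `mul $r1 neg $r2 neg $r2` differ: numerically they should not, since in IEEE 754 arithmetic the sign of a product is the XOR of the operand signs and the magnitudes are unchanged. The check below is a small sanity test using Python doubles rather than the GPU's f32 with ftz, so it illustrates the arithmetic only and says nothing about instruction encoding or cost.

```python
import math
import struct

def bits(x):
    """Raw bit pattern of a double, so that +0.0 and -0.0 compare as different."""
    return struct.pack("<d", x)

def same_product(x):
    """True if (-x) * (-x) is bit-identical to x * x."""
    return bits((-x) * (-x)) == bits(x * x)

samples = [0.0, -0.0, 1.5, -2.25, 1e-310, math.inf]
print(all(same_product(x) for x in samples))
```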
10:05 ob-sed: hi
10:05 ob-sed: i was curious about nvidia 700 series firmware
10:05 karolherbst: 700?
10:05 ob-sed: is it something the driver has to upload to the card on every boot.. or is there a persistence mechanism for the firmware to live on in the card ?
10:05 karolherbst: don't you mean 900?
10:05 ob-sed: yeah like GTX 760, 770, 780 etc
10:06 karolherbst: mhh you mean for video acceleration then?
10:06 ob-sed: 900 is the latest series yeah, but i mean 700 series
10:06 karolherbst: mhh you don't need any firmware except for video acceleration and the normal kernel mechanism is used for that
10:07 karolherbst: or what do you mean by "firmware"?
10:10 ob-sed: karolherbst: ah well i mean like on an nvidia gpu
10:10 karolherbst: you mean the vbios then?
10:10 ob-sed: yeah the vbios
10:11 karolherbst: it is already on the gpu
10:11 ob-sed: and also does it have firmware thats uploaded at each boot (like the cpu microcode) or does it stay in flash on the card
10:11 ob-sed: ahh yeah true, the vbios is in flash on the gpu
10:11 karolherbst: it just stays on the card
10:11 ob-sed: yeah
10:11 karolherbst: and nouvea reads it out from the gpu
10:11 ob-sed: on the newer 900 series is nvidia requiring that the vbios code be digitally signed as well ?
10:11 karolherbst: *nouveau
10:12 karolherbst: ob-sed: well with the gen2 maxwell cards the driver needs to upload a signed vbios to the card
10:12 karolherbst: which isn't needed before
10:12 ob-sed: ohhhh i see
10:12 Tom^: cant you get it from the blob?
10:12 ob-sed: ok so there is only _one_ set of firmware to be run on the card right, the vbios ?
10:12 karolherbst: they are signed though I think, but the signature doesn't have to be valid
10:12 karolherbst: Tom^: well tell us where to look :p
10:12 Tom^: :p
10:12 ob-sed: lol it has to be signed, but the signature doesnt have to be valid lol ?
10:13 ob-sed: what do you mean by that:) ?
10:13 karolherbst: ob-sed: you can upload a vbios to the gpu which isn't signed
10:13 karolherbst: and it would work
10:13 ob-sed: ahh interesting
10:13 karolherbst: but that has other reasons
10:13 ob-sed: but now we cannot do that with the 900 series ?
10:13 karolherbst: because the vbios is there mainly for the driver on pre 900 gpus
10:13 ob-sed: ah right
10:14 karolherbst: the vbios does more on the 900 cards, I am no expert with this though
10:27 karolherbst: imirkin: running all passes again: 3901 => 3839 instructions
10:30 karolherbst: imirkin: yeah, and the neg thing also disappears by that
10:39 karolherbst: imirkin: running AlgebraicOpt after the last DeadCodeElim seems to cut some further instructions 3901 => 3861
10:41 karolherbst: seems to basically eliminate some adds
10:41 karolherbst: mul ftz f32 $r18 $r18 0.200000 + add ftz f32 $r18 $r12 neg $r18 => mad ftz f32 $r18 neg $r18 0.200000 $r12
11:41 RSpliet: karolherbst: how much does execution time increase?
11:42 karolherbst: mhhh
11:42 karolherbst: no idea, should I render one frame with glxgears?
11:42 karolherbst: ohh wait, some application was able to do that
11:43 karolherbst: ...
11:45 RSpliet: I'd just add some time() calls before/after optimisations
11:45 RSpliet: so you can compare easily
11:45 RSpliet: I assume it doesn't hurt
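RSpliet's suggestion of timing the optimisations before and after a change could look roughly like the sketch below. The real passes are C++ in the nouveau codegen; `dummy_pass` here is a stand-in invented for the example, and only the wrap-with-a-timer pattern is the point.

```python
import time

def timed(pass_fn, program):
    """Run one pass and report how long it took, in milliseconds."""
    start = time.perf_counter()
    result = pass_fn(program)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

def dummy_pass(program):
    # Stand-in for a real optimisation pass: drop "nop" entries.
    return [insn for insn in program if insn != "nop"]

program = ["mul", "nop", "add", "nop"]
result, ms = timed(dummy_pass, program)
print(result, f"{ms:.3f} ms")
```

Comparing the totals with and without the extra pass runs gives the compile-time cost that has to be weighed against the instructions saved.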
11:46 karolherbst: well
11:46 karolherbst: RSpliet: there is a simpler approach
11:47 karolherbst: run ModifierFolding after ConstantFolding (again?) and AlgebraicOpt after the last DeadCodeElim (again?)
11:48 karolherbst: yeah, both have to be run twice :/
11:48 RSpliet: sure, the more you run it, the more you fold
11:49 RSpliet: the question is how much longer the runtime is going to be ;-)
11:49 karolherbst: RSpliet: this is enough to cut instructions from 3902 to 3839 https://gist.github.com/karolherbst/1d3fb263893127dcd51a
11:49 RSpliet: I *know* that
11:50 karolherbst: :p k
11:50 karolherbst: maybe
11:50 karolherbst: it makes more sense to find which optimisations produce non optimal code
11:50 RSpliet: but there could be quite a lot of code hidden behind the RUN_PASS calls, esp. since it is likely to traverse through your entire code tree
11:50 karolherbst: yes
11:50 karolherbst: it does
11:50 RSpliet: you'll never get "optimal code", you can get "decent code"
11:51 RSpliet: or "quite all right code" :-)
11:52 RSpliet: don't get me wrong, it's a useful exercise, but make sure you understand all the parameters
11:52 RSpliet: it'd probably help further doing DeadCodeElim after AlgebraicOpt (to clean up)
11:53 karolherbst: yeah, I don't think that running the pass over and over again is a good idea in general
11:53 karolherbst: yeah was thinking that
11:53 john_cephalopoda: Hi
11:53 karolherbst: but well
11:53 karolherbst: in my case no dead code was produced
11:53 karolherbst: just add+mul => mad optimisations
11:53 RSpliet: depends, if it takes a second it isn't, if it takes less than 16ms it might be worth it :-P
11:53 karolherbst: that's why I think that somewhere just not that good code is produced
11:54 karolherbst: I will check what the second ModifierFolding run changes
11:55 karolherbst: it isn't that much though
11:55 karolherbst: ahh right
11:55 karolherbst: it eliminates those negs
11:56 karolherbst: ohhh
11:56 karolherbst: RSpliet: ModifierFolding moves negs/abs and stuff into the instructions?
11:56 RSpliet: I assume it does
11:57 karolherbst: mhhh
11:57 karolherbst: okay, then I think it is easy to figure out which optimisation we could improve
12:05 karolherbst: RSpliet: maybe you see it? https://gist.github.com/karolherbst/225de4b290d8176d0837
12:05 karolherbst: okay, the last 4 are the same :/
12:09 karolherbst: okay, I think I have it
12:11 karolherbst: seems like ConstantFolding produces some NEGs
12:11 karolherbst: and these are optimized away by ModifierFolding
12:14 karolherbst: so there seems to be optimizations like mul $r1 -1 => neg $r1, but this neg could be just folded into the source of other instructions
12:18 RSpliet: what the ... oh wow, okay, that was unexpected
12:19 RSpliet: finding values in the VBIOS that in trace looked an awful lot like training values :-P
12:24 RSpliet: esp since it looks like a script, not a table
12:51 imirkin: ooh nice find
12:52 RSpliet: yes, I still hate it though
12:52 RSpliet: it means I now have to write a parser for scripts, to turn it into a sequence of memx commands
12:54 imirkin: ;)
12:54 RSpliet: preferably without relying on the gf100_ramfuc struct defining all the registers memx might write to
12:54 RSpliet: because who knows we need this for gk104
12:54 RSpliet: or your gt21-something
12:54 imirkin: karolherbst: yeah, so there are a few unfortunate situations where constantfolding generates opportunities for some of the other passes
12:55 imirkin: not really sure what to do about those without sticking a few passes and running them to a fixed point
12:55 imirkin: wouldn't be the worst thing in the world, tbh
12:56 imirkin: i.e. take algebraic + modifier + constant and run them until they make no progress
12:56 imirkin: it doesn't happen too often though
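The idea of taking a few passes and running them until they make no progress can be sketched as a small driver loop. The two passes below are toy stand-ins (not the real AlgebraicOpt, ModifierFolding or ConstantFolding), the tuple instruction format is invented, and "progress" is taken to mean simply that the program changed, which is one possible definition among several.

```python
def fold_mad_zero(program):
    """("mad", dst, a, b, c) with b == 0 becomes ("add", dst, c, 0.0)."""
    out = []
    for insn in program:
        if insn[0] == "mad" and insn[3] == 0.0:
            out.append(("add", insn[1], insn[4], 0.0))
        else:
            out.append(insn)
    return out

def add_zero_to_mov(program):
    """("add", dst, a, 0.0) becomes ("mov", dst, a)."""
    out = []
    for insn in program:
        if insn[0] == "add" and insn[3] == 0.0:
            out.append(("mov", insn[1], insn[2]))
        else:
            out.append(insn)
    return out

def run_to_fixed_point(program, passes, max_rounds=16):
    """Repeat the pass list until a whole round changes nothing."""
    rounds = 0
    progress = True
    while progress and rounds < max_rounds:
        progress = False
        for p in passes:
            new_program = p(program)
            if new_program != program:
                progress = True
                program = new_program
        rounds += 1
    return program, rounds

start = [("mad", "r1", "r2", 0.0, "r3")]
passes = [add_zero_to_mov, fold_mad_zero]   # deliberately the "wrong" order
print(fold_mad_zero(add_zero_to_mov(start)))   # single run leaves an add
print(run_to_fixed_point(start, passes))       # fixed point reaches the mov
```

The `max_rounds` cap is a safety net for passes that keep undoing each other; the last round always does no work, which is part of the compile-time cost of this approach.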
13:25 karolherbst: imirkin: mhh sounds a bit painful, or should we say progress is instruction count went down?
13:26 karolherbst: later we could do something weighted or just cycles needed for each instructions in total
13:26 karolherbst: fun like that
13:27 karolherbst: imirkin: thing is, dead code elimination also allows some algebraic optimizations
13:27 karolherbst: this is really weird
13:27 imirkin: ugh
13:27 imirkin: coz in a few places we were lazy and look at refcounts?
13:27 karolherbst: I guess so
13:27 karolherbst: will look at the diff
13:28 karolherbst: algebraic => dead code: 3879. dead code => algebraic: 3839
13:29 imirkin: i believe it... i'm just annoyed
13:29 karolherbst: yeah, it is pretty much actually :/
13:29 karolherbst: thats what bothers me most
13:29 karolherbst: if it would be like 1 or 2 like the case I found first
13:29 karolherbst: then yeah, well bad luck
13:29 karolherbst: but this is a bit more serious
13:30 karolherbst: more than 1 %
13:32 karolherbst: imirkin: fun fact: the dead code pass doesn't actually remove the instruction count
13:32 karolherbst: *reduce
13:32 karolherbst: although it moves code pieces from branch to branch
13:33 karolherbst: ohh wait
13:34 karolherbst: imirkin: what is the first nv50 dump by the way? and what is the second?
13:34 karolherbst: pre/post RA?
13:34 karolherbst: or pre/post this reg thingy
13:34 imirkin: assuming you have NV50_PROG_DEBUG=1, then yes
13:34 imirkin: pre and post RA
13:34 karolherbst: k
13:35 karolherbst: pre RA: 5191/4247/4247 instructions, post RA: 3879/3879/3839 instructions, where the columns are: no last dead code / last dead code / algebraic after dead code
13:35 karolherbst: this is fun to figure out
13:36 karolherbst: mhhh
13:36 karolherbst: okay nice
13:36 karolherbst: imirkin: it seems the algebraic pass only merges mul/adds into mads
13:36 imirkin: it does a few other things depending on the shader
13:36 karolherbst: yeah I know, but I meant in my case
13:37 karolherbst: don't see any other changes so far
13:37 karolherbst: and they are always i: mul i+1: add
13:37 karolherbst: maybe dead code just removes something in between and then the algebraic pass can merge them?
13:39 karolherbst: ohh wait
13:39 karolherbst: that doesn't look right :O
13:40 karolherbst: imirkin: is this right? https://gist.github.com/karolherbst/6d47f53cf62e053c75f5
13:41 karolherbst: ohhh
13:41 karolherbst: it is
13:41 karolherbst: I am stupid
13:41 karolherbst: I thought the first one is an add :D
13:41 imirkin: seems fine
13:42 karolherbst: yeah, I didn't read carefully enough
13:42 karolherbst: imirkin: https://gist.github.com/karolherbst/dff00388e2cf317864d9
13:43 karolherbst: seems like the algebraic thing can't handle those negs?
13:43 karolherbst: ohh wait
13:43 karolherbst: right, the $r15791 is dead code
13:45 karolherbst: okay, it seems like I need to look earlier where this neg comes from, because r15790 is always negated
13:52 karolherbst: imirkin: ohhhh
13:52 karolherbst: found it
13:52 karolherbst: imirkin: https://gist.github.com/karolherbst/dff00388e2cf317864d9
13:53 karolherbst: 192: add ftz f32 %r15792 %r15783 %r15791 (0) => 192: add ftz f32 %r15792 %r15783 neg %r15790 (0)
13:53 karolherbst: and the neg instruction has no use anymore
13:53 karolherbst: so it is just left there
13:53 imirkin: which is fine... DCE should take care of it
13:53 karolherbst: yeah right
13:54 karolherbst: but the mul/add can be merged
13:54 karolherbst: so instead of just merging the neg into the add
13:54 karolherbst: we should merge the source of neg also in it
13:54 karolherbst: if it's a mul
13:54 karolherbst: and do a mad
13:56 karolherbst: mul $r3 $r1 $r2; neg $r4 $r3; add $r6 $r5 $r4 => mad $r6 neg $r1 $r2 $r5
13:57 karolherbst: imirkin: how does that sound?
13:57 imirkin: that sounds fine... isn't that what's happening?
13:58 karolherbst: mhh no
13:58 karolherbst: it doesn't do the add+mul => mad merge
13:58 karolherbst: I just add another algebraic pass after the dead code thing, so that it is happening
13:59 karolherbst: what bothers me is, that the neg even if it's dead code, prevents the add+mul merge too
14:02 karolherbst: imirkin: which function does the neg+add => add neg thing? handleNeg or handleAdd?
14:02 karolherbst: ohh there is no handleNeg :D
14:02 imirkin: ModifierFolding
14:02 imirkin: somewhere in there
14:02 imirkin: i don't look at that pass too often
14:02 karolherbst: the comments are out of date too?
14:03 karolherbst: the ones which tells you what a pass does
14:04 karolherbst: ohh right
14:06 karolherbst: so I add the support for ADD(a, NEG(MUL(b, c))) -> MAD(neg b, c, a) inside AlgebraicOpt
14:07 imirkin: what for?
14:07 imirkin: it should already be handled via modifier folding + algebraic opt
14:07 imirkin: oh i see
14:07 imirkin: coz ... grr
14:07 imirkin: hmmmmmmmm should modifier folding go *before* algebraic opt? needs some thought
14:08 karolherbst: imirkin: currently there is a neg in between add/mul
14:08 karolherbst: so you have something like add/neg/mul
14:08 imirkin: right, i get it
14:08 karolherbst: and neg uses the add result
14:08 karolherbst: ;)
14:08 karolherbst: k
14:09 imirkin: what if you stick modifierfolding before algebraicopt?
14:10 karolherbst: mhhh
14:10 karolherbst: still 3901 instructions
14:10 karolherbst: but the neg is gone actually
14:10 karolherbst: ohh wait, it always gone
14:14 karolherbst: imirkin: that is, what algebraicopt gets: https://gist.github.com/karolherbst/d968c33b69696fd50c74
14:18 mupuf: RSpliet: nice finding!
14:21 RSpliet: tnx
14:23 imirkin: karolherbst: ah right... coz of the "src0->refCount() == 1" heuristic
14:23 imirkin: in AlgebraicOpt::tryADDToMADOrSAD
14:24 imirkin: all of the refcount checks are pretty much hacks
14:24 imirkin: thing is you're trying to make global decisions with only local info
14:24 karolherbst: why not just do a mad out of that?
14:24 imirkin: it's not an exact science
14:24 imirkin: for example
14:25 imirkin: mul foo, bar, 0.5; add baz, foo, 0.5
14:25 imirkin: is better than
14:25 imirkin: mov a, 0.5; mov b, 0.5; mad baz, bar, a, b
14:25 imirkin: esp on nv50 you (basically) can't have an immediate in a mad
14:25 imirkin: on nvc0 it makes more sense
14:26 karolherbst: yeah right, but is mul foo, bar, 0.5; add baz, foo, 0.5 better than mul foo, bar, 0.5; mad bar, 0.5, 0.5?
14:26 karolherbst: or is it the "same"?
14:27 imirkin: mad can only have 1 immed arg
14:27 imirkin: only the second one
14:27 karolherbst: ohhhh
14:28 karolherbst: okay then "mul foo, bar, baa; add baz, foo, bab" and "mul foo, bar, baa; mad bar, baa, bab"
14:32 karolherbst: imirkin: okay so the mad optimization might be unlucky when there are two immediates in the end and the mul is still used somewhere else
14:34 karolherbst: ... I am the only one with connection problems, right?
14:34 RSpliet: *white noise*
14:36 karolherbst: imirkin: I will think of something
14:37 karolherbst: maybe if the neg and mul have refcount == 1 we can do the mad out of it
14:40 imirkin: hence the current checks in there ;)
14:40 imirkin: except if modifierfolding passes the neg through
14:40 imirkin: then that value has 2 uses
14:40 imirkin: and i'd rather not sprinkle modifier folding all over the place for the sheer fun of it
15:29 imirkin: RSpliet: fyi i have a GF108 vbios here which also uses only the 8f opcode
15:30 RSpliet: imirkin: I was looking for your GDDR5 GT21x card, but I don't seem to have its VBIOS
15:31 RSpliet: it wouldn't have the script in the same place probably, but not surprised if it's there... somewhere
15:33 RSpliet: oh there we are, yes confirmed for GT21x GDDR5 as well :-)
15:33 imirkin: RSpliet: http://people.freedesktop.org/~imirkin/traces/nva3/
15:34 imirkin: trace also there if you're interested
15:35 imirkin: looks like it's gone on nvdX
15:35 imirkin: but it's there on all nvcX's
15:35 imirkin: and the nva3 too apparently
15:35 RSpliet: yeah, it seemed like a "oh my god we need to stick this extra information somewhere for NVA3 GDDR5" hack :-P
15:36 imirkin: hehe
15:41 RSpliet: imirkin: well, gone on NVDx, or are all those cards just DDR3?
15:41 RSpliet: the script seems to be quite GDDR5 specific
15:41 imirkin: could be both :)
16:57 imirkin: skeggsb: ping
17:09 skeggsb: imirkin: hey
17:10 imirkin: skeggsb: any thoughts on how to deal with nv50cal_space errors? how should throttling be done?
17:10 imirkin: skeggsb: my thought is that it'd be ok for the kernel to block...
17:11 skeggsb: well, i find it very surprising that with a lot of small IBs, free slots don't clear fast enough..
17:11 skeggsb: are you *certain* the GPU hasn't hung?
17:12 imirkin: well, it can clearly happen.... i wouldn't be surprised if the submit just had too many IBs on its own
17:12 imirkin: a single submit might take up multiple IB slots right?
17:12 imirkin: with nouveau_pushbuf_data()
17:12 imirkin: no :)
17:13 imirkin: but there's no indication that it has
17:13 imirkin: nothing in dmesg, etc
17:13 skeggsb: that doesn't necessarily mean anything :P but, it'd be useful (and easy) to confirm before going that route
17:13 skeggsb: remove the timeout from nv50_dma_push_wait(), see if it hangs forever
17:14 imirkin: wait, there's a timeout? hm
17:14 imirkin: should have read the code
17:15 skeggsb: oh, and from READ_GET().. it's also possible the accounting goes horribly wrong somehow and gets things confused
17:15 imirkin: ret = nouveau_dma_wait(chan, req->nr_push + 1, 16);
17:15 imirkin: so... what if nr_push is just really big :)
17:15 imirkin: how many slots are there in the first place?
17:15 skeggsb: yes, that would be bad
17:16 skeggsb: 1024, iirc
17:16 skeggsb: that's changeable too
17:16 imirkin: yeah, so with every indexed draw on nv50, we use up an extra slot
17:16 imirkin: so that's 2 slots per draw
17:17 skeggsb: userspace should probably flush before it gets to that point
17:17 imirkin: (regular + data)
17:17 imirkin: yeah... i was just thinking about that
17:17 imirkin: if nrpush > ... 100? kick?
17:17 imirkin: or 64?
17:17 skeggsb: whatever random number you choose really
17:17 skeggsb: as long as it's less :P
17:18 imirkin: than 1024 :)
17:18 imirkin: the IB limit is per channel right? not global?
17:18 skeggsb: the kernel will consume 1 slot to insert a fence
17:18 skeggsb: yeah, it's per-channel
17:18 imirkin: i also want to start keeping track of how much vram is used in a "current" submit
17:18 skeggsb: we currently hardcode 8KiB of space for the GPFIFO
17:19 imirkin: i'm running into issues of submit fail on my nva3 with 512mb vram
17:19 skeggsb: (that's the IB slots)
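The slot arithmetic from this exchange, plus a possible userspace-side "kick when nr_push gets too big" counter, can be sketched as follows. The 8 KiB, 1024 slots, one fence slot and two slots per indexed draw come from the conversation; the bytes-per-slot figure is merely what those numbers imply, and `PushCounter` with its threshold of 64 is a hypothetical helper, not existing libdrm or Mesa code.

```python
GPFIFO_BYTES = 8 * 1024       # "8KiB of space for the GPFIFO" (from the chat)
IB_SLOTS = 1024               # "1024, iirc" (from the chat)
BYTES_PER_SLOT = GPFIFO_BYTES // IB_SLOTS   # implied by the two figures above
KERNEL_FENCE_SLOTS = 1        # the kernel consumes one slot for a fence
SLOTS_PER_INDEXED_DRAW = 2    # regular + data, on nv50

def max_indexed_draws(slots=IB_SLOTS):
    """Upper bound on indexed draws that fit into one submit."""
    return (slots - KERNEL_FENCE_SLOTS) // SLOTS_PER_INDEXED_DRAW

class PushCounter:
    """Hypothetical counter that flushes before the slot count grows too large."""
    KICK_THRESHOLD = 64       # arbitrary pick, just "less than 1024"

    def __init__(self):
        self.nr_push = 0
        self.kicks = 0

    def push(self, slots):
        if self.nr_push + slots > self.KICK_THRESHOLD:
            self.kick()
        self.nr_push += slots

    def kick(self):
        self.kicks += 1
        self.nr_push = 0

counter = PushCounter()
for _ in range(100):
    counter.push(SLOTS_PER_INDEXED_DRAW)
print(BYTES_PER_SLOT, max_indexed_draws(), counter.kicks, counter.nr_push)
```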
17:19 imirkin: which i'm semi-moderately convinced is due to it trying to bunch up too many draws together
17:19 imirkin: but i've done no analysis on that
17:19 imirkin: but again, it should be moderately easy to loop through the current pushbuf buf list + bufctx and count up the vram
17:19 imirkin: and if it's... x% of total vram, submit
17:19 imirkin: right?
17:20 skeggsb: libdrm should already do that
17:20 imirkin: er what?
17:20 imirkin: when?
17:20 imirkin: i do remember there's that nouveau_available_vram thing, but i assumed that was in conjunction with something else
17:21 skeggsb: when you tell libdrm to "validate" a group of buffers for a submission, it'll check if any new buffers can fit with the ones previously queued, and flush if not
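The validate-then-flush behaviour skeggsb describes can be modelled in a few lines. This is a sketch of the description only: `Submission`, `validate` and `flush` are hypothetical names, not libdrm's API, and the byte budget stands in for whatever limit the real code uses.

```python
class Submission:
    """Hypothetical model of 'flush if the new buffers do not fit'."""

    def __init__(self, vram_budget):
        self.vram_budget = vram_budget
        self.queued = {}      # buffer name -> size
        self.flushes = 0

    def validate(self, buffers):
        """Queue `buffers`; flush first if they do not fit with what is queued."""
        new = {n: s for n, s in buffers.items() if n not in self.queued}
        if sum(self.queued.values()) + sum(new.values()) > self.vram_budget:
            self.flush()
            new = dict(buffers)
        if sum(new.values()) > self.vram_budget:
            return False      # does not fit even on its own
        self.queued.update(new)
        return True

    def flush(self):
        self.flushes += 1
        self.queued.clear()

sub = Submission(vram_budget=512)
print(sub.validate({"a": 200, "b": 200}), sub.flushes)
print(sub.validate({"b": 200, "c": 100}), sub.flushes)
print(sub.validate({"d": 100}), sub.flushes)
print(sub.validate({"e": 600}), sub.flushes)
```

Buffers already queued are not counted twice, and a group that cannot fit even after a flush makes the validate fail in this model.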
17:22 imirkin: ah clever.
17:22 imirkin: and if that submit should happen to fail, the validate fails too?
17:22 skeggsb: i believe so, i haven't looked at that stuff in a *long* time.. calim and i spent ages trying to figure out how to make it all work, and that's what we came up with
17:23 imirkin: fair enough
17:23 imirkin: i do get ttm_validate failures though
17:23 skeggsb: the kernel could handle stuff better if the initial attempt to fit stuff in fails (ie. as a last resort, kick everything out and try again)
17:24 skeggsb: fragmentation and stuff is a problem too, where pinned scanout buffers randomly in the middle of memory can fuck things up too
17:24 imirkin: right
17:24 skeggsb: well, the latter shouldn't be a problem on g80 and up actually, since we can use non-contiguous allocations
17:25 imirkin: not too worried about pre-g80
17:25 skeggsb: me neither
17:26 skeggsb: but i temporarily forgot that >=g80 doesn't have that problem :P
17:26 imirkin: i do prefer to keep them working ok
17:26 imirkin: but i doubt many people use ancient nvidia gpu's outside of this channel
17:26 skeggsb: indeed
17:26 imirkin: [esp with nouveau]
17:27 skeggsb: not suggesting we break them :P
17:27 imirkin: only with a hammer
17:47 imirkin: skeggsb: hm, i owe you a review of stuff, don't i... will try to get that done today
17:49 skeggsb: yep, and i owe you fifo fixes :P they've gotten a tad more involved than i'd have liked, but i should have them pushed soon
17:49 imirkin: skeggsb: and that will make parallel piglit reliable?
17:50 imirkin: skeggsb: including but not limited to nv50?
17:50 skeggsb: i still have to find a board that still has issues; even with the patches already in my tree, the two boards i was using worked fine
17:50 skeggsb: one fermi, one kepler
17:50 skeggsb: i'll look over tesla too
17:50 imirkin: hmmmm iirc i tried those patches and my gk208 still fell over
17:50 imirkin: potentially due to unrelated reasons
17:50 skeggsb: yeah, i recall you saying
17:51 skeggsb: i'll get airlied to retrieve mine from the ppc machine and see if i can reproduce tomorrow
17:52 airlied: you might be waiting for me to come in :)
17:57 imirkin: skeggsb: perhaps try on a shittier gpu?
17:58 skeggsb: i'll plug in a variety and see what happens
17:58 imirkin: a GK107 should match the shittiness of my GK208
17:59 skeggsb: i <3 my gk107, it lets me play sc2 :P
18:00 imirkin: hehe
21:00 Arbition: did someone say gk208?