03:47karolherbst: mhhh "optimize neg(add(bool, 1)) to bool" shouldn't affect max gpr count at all, but rather reduce it; sadly I have some shaders with hurt gpr count from that :/ maybe RA messes up a bit?
05:52karolherbst: is there another instruction which has either 0 or -1 as the result besides set and slct?
05:55RSpliet: neg(abs(clamp(*expr*))), but that sounds a bit farfetched :-P
05:56karolherbst: RSpliet: well I have 720 shaders with neg(add(set, 1))
05:57karolherbst: RSpliet: and this instruction cut overall: 1922121 -> 1911112 (-0.57%)
05:57karolherbst: but I guess neg(abs(clamp())) might be really rare
06:01karolherbst: RSpliet: clamp seems to be handled a bit ugly inside the ir?
06:03karolherbst: ohh, now I know what you meant
06:05karolherbst: RSpliet: well I found no neg(abs(sat()))
06:38karolherbst: imirkin: DCE in reverse: 80574 -> 52187 visit calls
06:39imirkin: karolherbst: yay :)
06:39imirkin: karolherbst: can you do it for the other one too?
06:39karolherbst: this->run calls
06:39karolherbst: yeah, the post_ra isn't as important here though
06:39karolherbst: only 10% less
06:39imirkin: huh. surprising.
06:39imirkin: oh, probably because there's a lot less code to dead-eliminate
06:42karolherbst: but DCE doesn't have a big impact overall sadly :/
06:42karolherbst: hopefully the eon games benefit from that somewhat
06:42karolherbst: compiling shaders each frame... insane
06:43karolherbst: imirkin: what is the negative impact of "WARNING: out of code space, evicting all shaders" ?
06:43karolherbst: all shaders removed and used ones uploaded again?
06:45imirkin: yeah, just reuploaded, not recompiled
06:46karolherbst: so if that only happens at application start there is no harm I suppose
06:49imirkin: yeah, it's *very* rare that it happens during gameplay
06:49imirkin: but it can happen of course
06:58karolherbst: imirkin: k, sent the worthy patches to the ML, but I guess it will take some time until they appear :/
06:58karolherbst: or not...
07:01karolherbst: until they appear, these are the ones I sent: https://github.com/karolherbst/mesa/commits/to_upstream
07:01karolherbst: without the nouveau_compiler stuff commit
07:08karolherbst: imirkin: did you know that the shader-db thingy has a race condition somewhere?
07:08karolherbst: or the nouveau compiler :/
07:08imirkin: didn't use to...
07:08karolherbst: happens also without my changes
07:08karolherbst: imirkin: https://gist.github.com/karolherbst/db3056aa6995dac5ecad
07:09karolherbst: I cut at the top
07:09imirkin: figure out which shader is killing it
07:09karolherbst: that only happens sometimes
07:09karolherbst: that's my problem, so
07:10karolherbst: the files are in the mmap :D
07:10karolherbst: talos_principle/1918.shader_test and talos_principle/1945.shader_test loaded
07:11RSpliet: karolherbst: now that you added a PostRADeadCodeElim, could you tell me how beneficial the simple bit of code in lines 2950 onward still is? https://github.com/karolherbst/mesa/commit/441fec8203e6a9256fc2c60c87dd47a0360e4cc0
07:12RSpliet: eg: if you remove that bit of logic, does it impact the number of insn? and if not: could you remove it to simplify everything? :-P
07:12imirkin: karolherbst: do they crash on their own?
07:12imirkin: and/or make valgrind complain loudly
07:14karolherbst: imirkin: wait, this actually makes sense
07:14karolherbst: imirkin: "ERROR: Unexpected token in shaders/talos_principle/1918.shader_test"
07:15karolherbst: maybe something doesn't get cleaned up the right way or something
07:15karolherbst: RSpliet: k, will try out
07:16RSpliet: as far as I'm concerned, even if it increments DCE calls, it's still worth it for the sake of code readability and separation of tasks etc
07:16karolherbst: RSpliet: yeah I know
07:17karolherbst: RSpliet: this post ra thing isn't called that often anyway
07:17RSpliet: sorry, being verbose :-)
07:18karolherbst: RSpliet: doesn't change a thing after removing your dead code stuff
07:19RSpliet: good, get rid of it (comments included) :-D
07:20karolherbst: RSpliet: https://github.com/karolherbst/mesa/commit/fbb26d2762c1e3b29de52b7f41bd99004edcd08b
07:21RSpliet: good stuff
07:21karolherbst: I bet isDead still needs some work, but it didn't seem like anything broke from my stuff
07:21karolherbst: tested only saints row and bioshock though
07:27RSpliet: karolherbst: if you're concerned: in the code you just removed there was a condition "if (vtmp->getInsn()->bb)" - do you know what that condition is for, and whether it should be considered in the new PostRaDCE?
07:28imirkin: that is a VERY important condition
07:28imirkin: took me a lot of time to track that one down.
07:30RSpliet: I'd have to read up on the definition of a "split" to know exactly what it's for, but I take it that iterating over instructions following i->prev will run over those splits as well?
07:31imirkin: mmmmmm.... no
07:31imirkin: they should be out of the bb
07:31imirkin: RA removes them
07:31RSpliet: ah, so for the just-added DCE that condition is superfluous :-)
07:31imirkin: split is to take a 2-, 3-, or 4-wide register
07:32imirkin: and return a bunch of 1-wide registers
07:32imirkin: i.e. the opposite of a merge
07:33RSpliet: ah, yes. That makes sense - to be gone post-RA :-)
07:34imirkin: but there can still be references to them
07:34imirkin: even though they've been removed from the BB
07:34imirkin: HOWEVER trying to delete them will attempt to remove them from the bb again and crash
07:34imirkin: i looked into the situation and fixing it seemed much harder than throwing that cond in
07:35RSpliet: karolherbst could always make your future life easier by adding in an assert() in the postRaDCE :-D
07:45karolherbst: so what should I change in the postRADCE?
07:46karolherbst: imirkin: well couldn't we just add a bb check in delete_Instruction then?
07:46karolherbst: mhh unigine heaven seems to run faster with my patches
07:47RSpliet: karolherbst: if you iterate over i->prev post-RA you may not encounter them. If you do, RA messed up. Could be worth an assertion in the PostRaDCE
07:47RSpliet: how much faster?
07:47karolherbst: not much
07:47karolherbst: I could remember the old score wrongly though
07:47karolherbst: could be that the compilation time got shorter and the score goes up a bit, no idea
07:48karolherbst: I want to eat now so I can only check later
07:48RSpliet: I have no idea how instruction count would translate to actual perf changes, but likely the effect is very limited given most properly written games are mem-bound
07:51RSpliet: but then there could be interesting positive effects related to icaches and TLBs when reducing the program size
07:51RSpliet: it's in the margin, but I'm beginning to wonder whether nouveau's perf is behind the official driver (after all the PM work) not because of some big missing features, but rather for getting all the margins wrong :-P
07:52glennk: thats probably only relevant for shadertoy.com shaders
07:53glennk: isn't it a good chunk of cpu overhead though?
07:54RSpliet: nouveau perf limits? hard to tell, it could seem that way if it busy-waits a lot for the GPU to finish its tasks
07:54imirkin: RSpliet: there are a handful of gpu features we're not using
07:55glennk: no i mean driver overhead in api calls vs the blob
07:55karolherbst: RSpliet: pixmark_piano has a _big_ perf difference
07:55karolherbst: and there is nearly no memory operation involved at all
07:55imirkin: glennk: i suspect a lot of it is also better buffer management
07:55glennk: at least at some point in the past it went far enough to JIT bits of the GL api
07:56imirkin: glennk: that was for immediate vertex submission though..
07:56imirkin: which really was a huge bottleneck
07:56glennk: right, that was always fun to debug
07:57karolherbst: pixmark_piano tgsi: https://gist.github.com/karolherbst/06a88b33b17e19e13f9f
07:58imirkin: karolherbst: do you have some of the tgsi opts disabled? like register renumbering?
07:58karolherbst: nouveau: 17 fps, nvidia: 23 fps
07:58glennk: not terribly interesting to look at the tgsi
07:58karolherbst: ohh the difference isn't as huge as I thought
07:58karolherbst: imirkin: don't think so
08:00glennk: imirkin, guess some diff can be down to the shader voting thing
08:00imirkin: could be
08:04karolherbst: maybe some zculling stuff is missing and therefore the performance is so different for pixmark_piano :/
08:13karolherbst: imirkin: seems like I don't get any difference in unigine heaven and I just remembered my old score wrongly or something else changed perf
08:13imirkin: well, mupuf claims that my change to generate IMAD's is bad
08:14imirkin: [for perf, not correctness]
08:15karolherbst: imirkin: funny though: I get more perf in pixmark_piano from one of the changes I didn't send out :/
08:16karolherbst: could be running those optimisations multiple times or something
08:16karolherbst: imirkin: imads?
08:16imirkin: integer mad
08:16karolherbst: ohh okay
08:16imirkin: i pushed it foolishly thinking that fewer ops = better
08:17Tom^: as soon as i get the gpu cooler i ordered, i'll report how much mesa-git has regressed :p
08:17mupuf: imirkin: well, it was a reasonable assumption :D
08:18imirkin: Tom^: well this was a while ago
08:18mupuf: imirkin: I checked this morning that I did not reverse the fps readings, which would have meant that perf improved
08:18karolherbst: mupuf: wanna test this branch? https://github.com/karolherbst/mesa/commits/to_upstream
08:18mupuf: but no, it is not the case, the score really regressed
08:19imirkin: that would have been pretty funny
08:19imirkin: karolherbst: anyways, this is the commit mupuf fingered: f97f755192210ce3690e67abccefa133d398d373
08:19karolherbst: imirkin: I will run this on my shader-db
08:20mupuf: imirkin: it would have :D But I wanted to make sure because it indeed bothers me
08:21mupuf: karolherbst: hmm, I can try your branch but I doubt we will see any change
08:21mupuf: but hey, I would love to be wrong, so let's test!
08:21karolherbst: though a perf regression might be possible, because a good change doesn't mean that the generated code gets better
08:22karolherbst: imirkin: after I reverted your change: total instructions in shared programs : 1895185 -> 1895842 (0.03%)
08:22karolherbst: total gprs used in shared programs : 251739 -> 251737 (-0.00%)
08:22karolherbst: so mhh
08:23karolherbst: imirkin: but can you think of any reasons why "neg(add(bool, 1)) to bool " should hurt a shader?
08:23karolherbst: I get some hurt ones, but only because some stuff gets reordered
08:23karolherbst: and RA allocates more regs in total
08:23karolherbst: though the optimization itself shouldn't hurt the max gpr count at all
08:24karolherbst: mupuf: nice thanks :)
08:25imirkin: pick a hurt shader (preferably small) and try to see what went wrong
08:31glennk: my guess: longer register live interval
08:34mupuf: possible, since this shader (computing fractals) should be quite big
08:35mupuf: it is possible that the shader started spilling
08:37karolherbst: mupuf: it was in another one
08:37karolherbst: imirkin: well I convert the neg into a mov
08:37karolherbst: imirkin: maybe I could simply delete both instructions there?
08:37imirkin: the mov should get propagated with LoadPropagation
08:37imirkin: if it can
08:38imirkin: glennk: yeah could be
08:38karolherbst: imirkin: can I change the destination of an instruction?
08:38karolherbst: like I move the destination from the neg into the set and delete the neg and add?
08:38imirkin: but you probably don't want to
08:38karolherbst: *neg and and
08:38imirkin: you can def do that
08:38imirkin: don't manually delete instructions
08:39imirkin: they'll get cleaned up
08:39karolherbst: okay, so I just move the destination
08:39karolherbst: and the neg and and become dead
08:40imirkin: well.... careful
08:40imirkin: yeah you don't want to do that
08:40imirkin: you want to transform the neg into the set
08:40imirkin: and leave the other set alone
08:40imirkin: otherwise something might have been using the result of the first set
08:41imirkin: oh but the point is that the two results are equivalent?
08:41imirkin: but things could still have been using the other def
08:41imirkin: it's a sticky situation
08:41imirkin: best off to just convert the neg into the new set
08:41imirkin: and let CSE take care of it
08:46wvuu: is tesselation done on nouveau?
08:46imirkin: wvuu: for fermi and kepler, yes. for maxwell there are a few outstanding bits.
08:47wvuu: how to find out which one I have?
08:47imirkin: lspci -nn -d 10de:
08:48imirkin: GFxxx = fermi, GKxxx = kepler, GMxxx = maxwell
08:48imirkin: GPxxx = you have a pre-release pascal board, but you'd probably know if that was the case :)
08:50imirkin: kepler. should work.
08:52wvuu: ohh.. another question, I use to get working vdpau with RADEON, but with nvidia I am awfully confused which one uses: VAAPI or VDPAU?
08:53imirkin: vdpau. recently there's also been va-api support added, but not in any released mesa versions.
08:53imirkin: wvuu: have a look at http://nouveau.freedesktop.org/wiki/VideoAcceleration/
08:53imirkin: i should probably update that with the VA-API info
08:55wvuu: so should I use VAAPI or VDPAU?
08:55orbea: vaapi is mostly intel I think
08:55wvuu: also does 'vdpauinfo' work for checking the status of vaapi?
08:55orbea: vdpau for nvidia
08:56wvuu: orbea: right, but vaapi has changed to being a general open source accel API, from what I remember
08:56imirkin: vdpau is better.
08:56imirkin: use vdpau.
08:58wvuu: ouch, firmware...
08:58wvuu: oh ok in repo.
09:02karolherbst: imirkin: ohh good idea
09:02loonycyborg: isn't vaapi implemented in terms of vdpau anyway?
09:03karolherbst: Tom^: there was a 1.2GB divinity update :O
09:03loonycyborg: at least for nvidia cards..
09:03imirkin: loonycyborg: there are va-api <-> vdpau adapters yes
09:04imirkin: no clue how they deal with the MPEG4 thing
09:04imirkin: probably just do the hacky thing? or just plain don't work correctly? dunno.
09:04Tom^: karolherbst: :o
09:04imirkin: apparently there's some issue in the VA-API MPEG4 stuff which prevents it from working on at least AMD, but i suspect nvidia as well.
09:06Tom^: karolherbst: "ArtPack, Design Documents, Combat pdf, Soundtrack and DigitalMap" so collectors edition for those who didnt have it before.
09:07mupuf: karolherbst: btw, 0 change in perf
09:07mupuf: for xonotic, pixmark julia fp32 and gputest:plot3d
09:09wvuu: absolutely awesome, it works
09:10imirkin: wvuu: if you're looking for improved perf, you might want to update to kernel 4.4 and boot with nouveau.pstate=1 -- that will enable a mechanism that will allow you to manually switch between perf levels
09:10imirkin: but get ready for hangs, since for some people it just doesn't work.
09:10imirkin: and for others it doesn't hang but also doesn't change clocks :)
09:11wvuu: imirkin: plz stop!! even MOAR performance??!!
09:12mooch: i wish nvidia would just release pascal already
09:12mooch: ALONG WITH A NEW PASCAL TITAN WHOOP WHOOP
09:12mooch: 32 gb titan y or something
09:13imirkin: then we'd be behind by 2 whole generations
09:13mooch: it could happen
09:13mooch: btw, how does nouveau perform on a titan z
09:13mooch: since i know it doesn't work on the titan x
09:13wvuu: speaking of performance, I read that graphics cards did NOT have floating point operations, and that the registers are 32bit.
09:14mupuf: wvuu: you obviously misread
09:14mupuf: they did not have integer operations for years :D
09:14mupuf: just float
09:14wvuu: are floating point cards out yet? For prosumers?
09:14imirkin: wvuu: the registers are, in fact, 32-bit.
09:15imirkin: mupuf: integer ops have been around since tesla, so... 2006.
09:15wvuu: right, that's integer, and recently they were slowly adding those. But only for high end.
09:15wvuu: But for the 50-400 cards?
09:15imirkin: wvuu: various cards have various performance characteristics, but starting with fermi, they all have identical ISA's (within a generation), and identical shader-accessible functionality
09:15wvuu: the mid-range?
09:16imirkin: (fermi was the first DX11 gpu)
09:16mooch: i have a fermi
09:16mooch: it's not as good as my gcn card for dolphin
09:16wvuu: but at the 40-200 price range?
09:16mooch: then again, radeons are apparently really good at integer stuff
09:17imirkin: mooch: in case you're comparing fermi perf on nouveau to amd perf, it's not a fair comparison - nouveau doesn't reclock.
09:17mooch: no, i was comparing windows perf
09:17imirkin: ah ok
09:17wvuu: how to find out the clocking info of my card? is there a nouveau cli perf monitoring package?
09:17mooch: both my cards are entry-level too
09:18mooch: though the gcn card is more recent
09:18imirkin: wvuu: all this stuff is in the early days. boot with nouveau.pstate=1
09:18imirkin: wvuu: and you should have a file, /sys/class/drm/card0/device/pstate
09:18wvuu: I am not yet in 4.4
09:18imirkin: cat it, will show you the perf levels (current will be AC:)
09:18imirkin: you can echo a level id into it and it will attempt to switch
09:19wvuu: are there cards with 64bit registers?
09:19imirkin: i don't think so
09:19imirkin: fermi+ use pairs of regs to do fp64 math
09:20imirkin: i believe that's the situation on all fp64-supporting GPUs
09:20imirkin: some earlier AMD DX11 chips don't have hw fp64 support, but all nvidia DX11 chips have it.
09:21mooch: aren't nvidia gpus vliw or something?
09:22imirkin: *tries to remember what vliw means*
09:22wvuu: what's the difference between 32bit and 64bit? I mean what will improve? What will take advantage of 64bit registers?
09:22imirkin: i don't think so -- 64-bit instructions
09:22imirkin: wvuu: nothing.
09:23imirkin: wvuu: you could do 64-bit math a bit more effectively if you had a dedicated 64-bit ALU, but... meh nobody does that on GPUs
09:23karolherbst: mupuf: k, good to know though
09:23karolherbst: mupuf: could you also check my nouveau_opts branch?
09:23karolherbst: I ran some optimizations in a loop there and saw better perf in pixmark_piano
09:23linkmauve1: wvuu, if anything, the current move is towards 16-bit floats.
09:24karolherbst: but there seems to be other drawbacks to that, so I don't want to send them out yet
09:27mupuf: karolherbst: sure, it is compiling now
09:28mupuf: but again, I doubt I will see any improvements
09:29mupuf: the FPS readings are so low that a 0.2% change in instruction count is not going to affect anything
09:29karolherbst: mupuf: I had a significant change in pixmark_piano
09:29mupuf: how many FPS do you get?
09:29karolherbst: a drop of 2ms rendering time (62ms->60ms)
09:29mupuf: because I have ... 0.1 fps with piano :D
09:30mupuf: no kidding
09:30karolherbst: I have like 17 with my kepler
09:30karolherbst: mupuf: fullscreen?
09:30mupuf: it renders 3 frames and then it is killed
09:30mupuf: nope, 720p windowed
09:30karolherbst: I did 1024x640
09:31karolherbst: 1080p I get like 6 fps
09:31karolherbst: this benchmark is a real beast
09:31mupuf: after reclocking?
09:31karolherbst: fully reclocking
09:31karolherbst: mupuf: by the way: memory clock doesn't matter here
09:31mupuf: I have not reclocked the nvd9
09:31mupuf: it is at boot clocks
09:32karolherbst: mupuf: does ezbench measure fps or frame rendering time?
09:32karolherbst: because the latter might give a higher accuracy
09:32karolherbst: especially with those benchmarks
07:32mupuf: for gputest, it does: # frames rendered / execution time
09:33mupuf: and yes, it can be improved by asking to dump the frame time (or using env_dump's frametime dumper), but it would use more cpu
09:33karolherbst: I see
09:33mupuf: or disk, in the case of the triangle test which has 1k FPS
09:34mupuf: the test can be improved, if you want
09:34karolherbst: imirkin: what do I have to copy over when I change the neg to a set?
09:34mupuf: ezbench takes as an input whatever the test returns
09:34mupuf: ms and fps are the common ones
09:34imirkin: karolherbst: oh wait... eternal sadness...
09:34imirkin: karolherbst: ok
09:35imirkin: better solution.
09:35mupuf: but you can measure GB/s if you have fun with the memory bw
09:35karolherbst: yay, I just got 6 bottles of beer for free :)
09:35imirkin: let a = set->getDef(0), b = neg->def(0) (not getDef())
09:35karolherbst: those marketing guys, really
09:36mupuf: karolherbst: no change for xonotic
09:36imirkin: er hm
09:37mupuf: karolherbst: no change for julia_fp32
09:37karolherbst: imirkin: you know what, I just look for instructions which have neg(and( as the source and just change the source ...
09:37imirkin: karolherbst: right ok -- b.replace(a, false);
09:37imirkin: that will replace all uses of neg's def with the def from the set
09:38karolherbst: neg->def(0).replace(set->getDef(0), false) ?
09:38mupuf: karolherbst: pareil pour plot3d
09:39mupuf: sorry, I meant "no change for plot3d"
09:39mupuf: *was reading french*
09:40imirkin: karolherbst: yes.
09:40imirkin: karolherbst: i think that should work
09:41karolherbst: imirkin: strange, it hurt 4 more shaders, but reduced instruction count for 28 more :/
09:41imirkin: karolherbst: hmmmmm.... odd.
09:41karolherbst: I will check what is going on here
09:42imirkin: the problem with making the neg into a set is that a set has to be a CmpInstruction
09:42karolherbst: yeah I figured
09:43karolherbst: imirkin: okay, with the change two moves got removed
09:43karolherbst: and RA messes up
09:43imirkin: "messes up"?
09:44karolherbst: imirkin: https://gist.github.com/karolherbst/0bfdaec0ff2f47d6de3e
09:45karolherbst: imirkin: well more gprs getting used
09:45karolherbst: and the only change is two removed movs
09:46karolherbst: imirkin: in the RA nothing really changes except two phi nodes
09:46imirkin: karolherbst: more mov's = fewer gpr's
09:46imirkin: mov's break live ranges
09:46karolherbst: imirkin: yeah but in that case it doesn't make sense
09:51karolherbst: ohh wait
09:51karolherbst: maybe this change only hurts shaders which got helped before
09:51karolherbst: but those hurt ones don't get better :/
10:15karolherbst: imirkin: mhh but with the replace the generated code looks better now
10:15karolherbst: no MOVs generated
10:15karolherbst: and in one shader a block of 9 instructions just got cut out :D
10:16karolherbst: imirkin: I guess with reordered instructions we should concentrate on reducing the max gpr count in use
10:17karolherbst: because that would be a rather simple thing to do as long as we don't know which order is better
10:17imirkin: with instruction reordering, register pressure is important to take into account, yes
10:17imirkin: thing is that register usage limits are non-linear
10:18imirkin: e.g. 16 may be better than 17, but 15 and 16 might be the same
10:18karolherbst: yeah, I kind of thought about this already
10:18pmoreau: This is so true
10:19imirkin: we could determine arbitrary limits and make nv-report compute things based on those
10:19imirkin: i just decided looking at the overall register count made more sense for now
10:19karolherbst: the good thing about this neg(and()) optimization is that it makes this possible: set(neg(and(set()))) => set
10:19pmoreau: From my experience with CUDA kernels, limits are around 2^x.
10:20imirkin: karolherbst: yep. because we know how to fold multiple SET's
10:20imirkin: when comparing to 0
10:20karolherbst: imirkin: slct is the other boolean operation, or is there any other?
10:20imirkin: the idea is to try to make relatively orthogonal optimizations, so that they can all work together in tandem
10:20imirkin: karolherbst: i don't think so
10:20karolherbst: k, my pass with slct doesn't change a thing anyway
10:20imirkin: karolherbst: i mean there's OP_SET_AND and so on
10:21karolherbst: imirkin: so and(set()) => set_and?
10:21imirkin: and(set(), set()) -> set(), set_and()
10:21imirkin: only with a predicate though
10:27karolherbst: is there any optimization for mul(mul(a, a), c)? I see that really often and was wondering whether the hardware can do something with that
10:27imirkin: only if c is a small power of 2
10:28karolherbst: not really :/
10:32karolherbst: imirkin: I think in the next version I will also include that patch, because there is really no harm: https://github.com/karolherbst/mesa/commit/f7e952a6bb03fa9a0864b2a10316dd7a7061b134
10:33karolherbst: this usually cuts movs away because they get turned into immediates
10:34imirkin: karolherbst: no need to pass 's' in, also please name it imm2
10:35karolherbst: and I have to return, because I changed the OP
10:43Jayhost: looks like I got the mmiotrace to work
10:45pmoreau: What did you do?
10:46Jayhost: Debian 8 standard kernel 3.16 but recompiled with CONFIG_MMIOTRACE=y
10:47karolherbst: imirkin: I am wondering whether it's worth changing mad(mul(a, a), 2, b) to add(mul_x2^1(a, a), b)
10:50imirkin: karolherbst: probably is... but should be done as part of a more generic expression rejigger pass
10:51karolherbst: yeah I mean it could be also 0.5, or did you mean something else?
10:52imirkin: i mean you should have a pass that takes algebraic equations with mul's, add's, etc
10:52imirkin: and is able to rejigger them
10:58mwk: clarity up, performance down.
10:58mwk: maybe if I sprinkle "static inline" everywhere...
10:59karolherbst: mwk: static should be enough though
10:59mwk: they already are static :(
10:59karolherbst: since gcc-4 inline isn't the same anymore...
11:00karolherbst: mwk: I really doubt that adding inline changes anything
11:00karolherbst: because even with inline, gcc decides itself anyway
11:05karolherbst: imirkin: the x2^c thing is done through the subOp right?
11:05imirkin: karolherbst: no, there's a property... ->factor or something
11:05karolherbst: ahh okay
11:05mwk: hmm, I'll change all these abort calls to assert
11:05mwk: NDEBUG may work wonders
11:14karolherbst: imirkin: can't I do i->setSrc(0, bld.mkOp1v(OP_MUL, src->dType, src->getSrc(0), src->getSrc(1)));?
11:14karolherbst: because the getSrc(0)->getInsn() becomes linterp pass x2^1 f32 %r7553 a[0x70] (0) :/
11:15imirkin: clear out the postFactor :)
11:15karolherbst: I set it
11:15karolherbst: I want to set it
11:15imirkin: you can only set it on a mul
11:15karolherbst: I just want it to become a mul
11:16karolherbst: I thought if I set the first argument to OP_MUL the source becomes a mul not a linterp
11:16imirkin: linterp mul? that's not a mul
11:16imirkin: that's a linterp :)
11:17karolherbst: imirkin: so what would be your expected out of this:
11:17karolherbst: i->setSrc(0, bld.mkOp1v(OP_MUL, src->dType, src->getSrc(0), src->getSrc(1)));?
11:24karolherbst: ohhh the first argument is the destination :/
11:35jeremySal: imirkin: Is it a good idea to use GLFW to write my test case for piglit?
11:38imirkin: jeremySal: nope
11:38imirkin: jeremySal: use the piglit framework
11:39imirkin: it solves all of these issues
11:39imirkin: also for that particular one you might even be able to get away with just using shader_runner
11:39imirkin: whereby you write a short text file with the shaders and a bit of setup
11:39imirkin: you'll find tons of examples in piglit
11:46jeremySal: imirkin: Do you by any chance know which debian package contains the nvidia opengl headers?
11:47imirkin: why do you need any of that?
11:47jeremySal: Don't I need to install the opengl headers to compile piglit?
11:48jeremySal: I assumed nvidia provided the interface
11:48imirkin: you do need the GL headers
11:48imirkin: i suspect they're in some dev package somewhere
11:49imirkin: hm, mesa would have them, but that might conflict with nvidia
11:49imirkin: maybe there *is* an nvidia dev package with headers
11:49jeremySal: yeah, that's what I'm trying to understand
11:49imirkin: sorry, don't know too much about debian
11:49imirkin: normally the GL provider should also provide GL.h & co
11:49jeremySal: ok, I figured it was worth a question
12:00karolherbst: imirkin: mhhh instruction count up and frame time down...
12:01imirkin: for what?
12:01karolherbst: this mad,mul thing
12:01karolherbst: 979->986 points
12:01imirkin: which mad/mul thing?
12:01imirkin: the imad thing?
12:02karolherbst: mad(mul(a, a), 2, c) => add(mul_x2^1(a, a), c)
12:02imirkin: ah well that makes sense
12:02imirkin: mad is slower than mul
12:02imirkin: the postfactor comes for free in the mul
12:03imirkin: since it's basically just a shift on the mantissa
12:03karolherbst: imirkin: thing is, now there is mul/mul/add instead of mul/mad
12:03karolherbst: because the mul isn't dead
12:03imirkin: ah sad
12:03karolherbst: but I'm not being smart in that optimization
12:04karolherbst: maybe mad is also slower than mul+add.... which would be odd?
12:05imirkin: yeah dunno
12:08karolherbst: ohh wait, one mov got removed...
12:23karolherbst: imirkin: now I also added this check for 0.5 and the perf increased even further...
12:23karolherbst: but not as much
12:24imirkin: karolherbst: fwiw we try to detect postfactor situations already
12:24imirkin: you should figure out why it's not working and improve it
12:25karolherbst: imirkin: I think such an optimization should be done if we don't increase instruction count and just replace the mul/mad with mul/add
12:25karolherbst: this would be safe enough
12:35glennk: i should add those postfactor bits to r600 some day
12:35glennk: another one of those sorta-done local branches :-/
12:38imirkin: glennk: you should work with shaders that have more than 128 tgsi regs
12:38imirkin: that seems more important :p
12:39glennk: oh it handles that, it's when you have local arrays that eat up all the register space that it's a problem
12:39glennk: and its 124 * 4 scalars rather than 128
12:40imirkin: do you determine if arrays have indirect indexing?
12:40imirkin: coz otherwise you can treat them as non-arrays
12:40imirkin: and also do you rebase the indirect arrays down to 0?
12:41glennk: sb handles that, but it never gets the chance as the driver bails before it runs
12:41glennk: fixing it is basically rewriting r600_shader.c to go tgsi -> sb
12:42glennk: and at that point i may as well be doing spir-v to sb
12:49jeremySal: imirkin: I notice there is a common naming scheme for many of the functions used in the piglit tests, like 'glViewportArrayvNV' and 'glViewportArrayv'. Is there somewhere it's defined what "NV" or "v" means?
12:49imirkin: v = vector
12:49imirkin: NV = NVIDIA
12:49imirkin: jeremySal: but - good news - you don't have to worry about that
12:50jeremySal: how so?
12:50imirkin: jeremySal: you need to figure out how gl_ViewportIndex is set from vertex shaders
12:50imirkin: which is purely within the compiler
12:50imirkin: the thing is that setting viewport/layer was supported since nv50, but only from geometry shaders
12:50imirkin: in gm20x, it's now possible to set it in vertex/fragment shaders
12:50imirkin: the question is whether it just uses the same output slot as on geometry
12:50imirkin: or if there's something more to it
12:51imirkin: and if there's anything in the cmdstream to enable it
12:51jeremySal: okay, but I didn't see any existing shader tests using gl_ViewportIndex
11:51imirkin: i.e. does it just write a[0x64] and a[0x68] from the vertex shader?
12:52imirkin: have a look at the ones in tests/spec/amd_vertex_shader_viewport_index/execution
12:52imirkin: forget the execution bit of it
12:52imirkin: you should be able to easily adjust this to the ARB_viewport_layer_array thing
12:53imirkin: similarly for the AMD_vertex_shader_layer ext, which will have stuff with gl_Layer
12:53jeremySal: ok thanks, I was searching for the wrong string
12:54imirkin: oh and there are also simpler tests in
12:54imirkin: where are they
12:54imirkin: i def remember writing them...
12:54imirkin: ah there ya go
12:54imirkin: these tests set things in the vs/gs and make sure they come out in the frag shader
12:55jeremySal: I see
12:55imirkin: should be easy to adapt the ones that test AMD_vertex_shader_layer functionality to the new ARB_whatever ext
12:55jeremySal: Ok, I'll give it a shot
12:56imirkin: those should just be able to be run through shader_runner and that's it
12:57jeremySal: Is there a way to view the visual output when piglit runs these shaders?
12:59imirkin: it should display :)
13:00jeremySal: Ok, I'm not sure how to run shader_runner. I just ran piglet and it output test results.
13:00imirkin: bin/shader_runner foo.shader_test
13:00glennk: ./bin/shader_runner path
13:00jeremySal: okay, thanks!
13:18jeremySal: So all shader_test files in the directory are run? There is not some file which lists all the tests to run?
13:21imirkin: no, you just give it one file to run
13:22jeremySal: For the shader_runner
13:22imirkin: there are clever mechanisms for bundling it all up and running all the tests
13:22imirkin: but don't worry about that
13:22imirkin: you're trying to run one test
13:22jeremySal: yeah, I'm trying to figure out a good place to put the files I'm working on
13:22imirkin: ./ :)
13:23jeremySal: Is there a reference for the shader_test language?
13:23imirkin: add to it if you need new things
13:23imirkin: defining a language/etc takes time and effort
13:23jeremySal: I don't need new things, I'm just trying to understand the commands in the examples
13:24imirkin: which nobody really feels like investing into it
13:24jeremySal: Like why does it clear twice?
13:24imirkin: which one?
13:24imirkin: full path
13:24jeremySal: clear color 0.0 0.0 0.0 0.0
13:24imirkin: (so i can just paste it in rather than go around looking for it
13:24imirkin: first one doesn't clear - just sets the clear color
13:25jeremySal: oh I see
13:25jeremySal: Why is the draw rect command "-1 -1 2 2"
13:26imirkin: that tells it to draw a quad starting at (-1,-1) and width=2, height=2
13:26imirkin: the defaults are set up s.t. this will end up covering the whole screen
13:26jeremySal: Is there a reason why it's not "0 0 1 1"
13:26imirkin: that'd only draw a quarter of the screen
13:26jeremySal: oh, the origin is at the center?
13:27imirkin: NDC, which stands for normalized device coordinates, is all -1..1
13:27imirkin: on all 3 axes
13:27jeremySal: oh, thanks
13:27imirkin: but obviously you can set a projection matrix/etc to adjust all that
13:27imirkin: fragment shader sees everything in window coordinates though
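The NDC-to-window mapping discussed above can be sketched in Python (a hypothetical helper, not piglit code) to show why `draw rect -1 -1 2 2` fills the whole window while `0 0 1 1` covers only a quarter:

```python
def ndc_rect_to_window(x, y, w, h, vp_width, vp_height):
    """Map a piglit 'draw rect x y w h' (NDC coords) to window pixels.

    NDC runs -1..1 on each axis; the default viewport transform maps
    that range to 0..vp_width and 0..vp_height.
    """
    def to_px(v, size):
        return (v + 1.0) / 2.0 * size

    return (to_px(x, vp_width), to_px(y, vp_height),
            to_px(x + w, vp_width), to_px(y + h, vp_height))

# 'draw rect -1 -1 2 2' covers the full window
print(ndc_rect_to_window(-1, -1, 2, 2, 250, 250))   # (0.0, 0.0, 250.0, 250.0)
# '0 0 1 1' only covers the upper-right quarter
print(ndc_rect_to_window(0, 0, 1, 1, 250, 250))     # (125.0, 125.0, 250.0, 250.0)
```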
13:38jeremySal: imirkin:Should it matter if I put the #require in the vertex shader vs the geometry shader?
13:38jeremySal: I mean, the #extension : enable
13:39imirkin: ah, well you won't have a GS at all right?
13:39imirkin: just vertex + frag
13:39jeremySal: Well, I'm starting from the example which has a geometry shader
13:39imirkin: do the example which has just vertex + frag
13:39imirkin: and replace AMD_vertex_shader_layer with ARB_i_forget_the_name
13:40imirkin: right :)
13:40imirkin: layer_viewport_array probably
13:40imirkin: since layer is a thing and viewport array is a thing
13:40jeremySal: tbh, I am confused about the semantics of all these similarly labeled things
13:41imirkin: any particular bit of confusion, or just confused by the gathering darkness?
13:41jeremySal: haha, so the extension is called "ARB_shader_viewport_layer_array"
13:41imirkin: so it is
13:41imirkin: if they had asked me, i would have objected
13:42jeremySal: whereas the existing examples use the extension ARB_fragment_layer_viewport
13:42imirkin: that's a different ext
13:43jeremySal: as well as ARB_viewport_array
13:43imirkin: that allows you to read gl_ViewportIndex in fragment shader
13:43imirkin: ARB_viewport_array allows you to write gl_ViewportIndex in geometry shader
13:43imirkin: [and some other things]
13:43jeremySal: so is it that arb_shader_viewport_layer_array is amending TWO existing extensions?
13:43jeremySal: one with layers and one with viewports?
13:46jeremySal: Also the documentation for ARB_shader_viewport_layer_array refers to AMD_vertex_shader_layer and AMD_vertex_shader_viewport, neither of which seem to be used in the existing tests
13:46jeremySal: I'm not sure if these are aliases or separate extensions
13:48imirkin: take a look at tests/spec/arb_fragment_layer_viewport/viewport-vs-write-simple.shader_test
13:48imirkin: hope that clears things up.
13:55karolherbst: imirkin: do you know where that postFactor stuff is?
13:55karolherbst: because I don't see it
13:59karolherbst: but that is only for MULs right?
13:59imirkin: i guess
13:59imirkin: but it should be able to start with a MAD
13:59imirkin: maybe it doesn't realize that?
14:00karolherbst: well it only handles MULs
14:01imirkin: karolherbst: i guess it might make sense to just do it for OP_MAD directly
14:02imirkin: if imm.isPow2() and othersrc->refCount() == 1
14:02imirkin: then move the postfactor up
14:02imirkin: and flip this op into an add
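The peephole imirkin sketches can be modeled in a few lines of Python (this is an illustrative model with made-up names, not the actual nv50_ir C++ implementation; real MAD immediates are floats, modeled here as ints):

```python
def is_pow2(x):
    return x > 0 and (x & (x - 1)) == 0

def try_mad_to_add(imm, other_src_refcount):
    """mad(a, imm, b) with a power-of-two imm, where the mul feeding
    'a' has refcount 1, can become add(a', b) by moving the factor
    onto that earlier mul as a x2^k postfactor.

    Returns k (imm == 2**k) if the rewrite applies, else None.
    """
    if is_pow2(imm) and other_src_refcount == 1:
        return imm.bit_length() - 1
    return None

print(try_mad_to_add(2, 1))   # 1 -> mad becomes add, upstream mul gets x2^1
print(try_mad_to_add(3, 1))   # None: 3 is not a power of two
print(try_mad_to_add(2, 2))   # None: source is shared, can't retag it
```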
14:34karolherbst: imirkin: I think I have bad news :/ I looked a bit deeper into that
14:35karolherbst: imirkin: and just by replacing mad(mul) with add(mul_x2^1(mul)) it is getting faster
14:35karolherbst: no other changes
14:35imirkin: ok, so ... mad really just sucks then?
14:36karolherbst: I don't know
14:36imirkin: can you try nuking the algebraic opt that creates mad's in the first place?
14:36karolherbst: maybe there is something else
14:36karolherbst: mhhh I can try
14:36imirkin: and also just don't create them at all in the first place in nv50_ir_from_tgsi
14:37imirkin: i.e. make OP_MAD create a mul + add
14:37imirkin: make OPCODE_MAD create a mul + add
14:37imirkin: leave FMA alone if it exists (i forget)
14:38biker_rat: My kepler pstate file disappeared under rc1.
14:38karolherbst: biker_rat: debugfs
14:38imirkin: and no more need for nouveau.pstate=1
14:39pmoreau: It seems weird that MUL + ADD would be faster than MAD. What would be the point of MAD then?
14:40imirkin: well, fma
14:40imirkin: i agree that it's odd though
14:40karolherbst: imirkin: is there a simple way to not create MADs at all?
14:40karolherbst: imirkin: or just rewrite the case in Converter::handleInstruction?
14:42imirkin: just emit mul + add in there yea
14:43karolherbst: mhh looks messy, I just translate them in algebraicopt
14:43imirkin: ok wtvr
14:43karolherbst: I already did most of the stuff for the mul_x2^1 opt
14:43karolherbst: just needs minor modifications and I am done
14:49karolherbst: imirkin: okay, performance is worse now
14:49imirkin: yay :)
14:49imirkin: the world is not completely insane
14:49karolherbst: yeah I think there is something else
14:52karolherbst: imirkin: maybe mul_x2^1 is just faster than mul?
14:52karolherbst: for whatever reasons
14:52imirkin: seems unlikely
14:52karolherbst: I meant, mul_x2^1+add faster than mad
14:53airlied: I remember i965 had some issue where producing MADs defeated some other optimisation
14:53karolherbst: imirkin: okay, here is the change: 82: mad ftz f32 $r7 $r7 2.000000 $r11 => 82: mul x2^1 ftz f32 $r0 $r0 $r9 + 83: add ftz f32 $r11 $r12 $r0 (8)
14:53karolherbst: imirkin: three times in a row
14:53airlied: and made things seem worse, I could be wrong though
14:53karolherbst: so 82 mad => 82 mul, 83 add
14:53karolherbst: 83 mad => 84 mul, 85 add
14:54imirkin: airlied: mmmmm could be. i don't think that's what's going on here though
14:54imirkin: airlied: how did your images thing go?
14:54karolherbst: imirkin: !!! I think I got it
14:55karolherbst: imirkin: https://gist.github.com/karolherbst/152ec763aa6b64783e46 look carefully
14:56imirkin: not sure that i'm seeing it
14:57imirkin: a whole extra mul is being folded in?
14:57karolherbst: I was more thinking about pipelining
14:57imirkin: i mean, it's basically doing mul + mad -> mul + add
14:57imirkin: which does seem like it'd be faster
14:57imirkin: but you said it was mul + mad -> mul + mul + add
14:57imirkin: which would not be faster
14:57karolherbst: the source mul is far away by the way
14:58karolherbst: imirkin: I will try to only optimize one of them
14:58karolherbst: maybe this will give us more information
14:59imirkin: yeah, could be a scheduling thing then
14:59imirkin: for now you should limit it to refcount==1
14:59karolherbst: the thing is
14:59karolherbst: the source has a bigger refcount
14:59karolherbst: it is still all there
14:59karolherbst: this change actually adds 3 instructions
15:05karolherbst: imirkin: okay seems like if I only change one of them I get the same perf increase/3, so each replacement seems equally good
15:06imirkin: moral of the story - don't stick mad's next to one another
15:06glennk: <airlied> I remember i965 had some issue where producing MADs defeated some other optimisation
15:06glennk: i had this on r600 when i tried using the *2
15:06karolherbst: imirkin: could be
15:06karolherbst: imirkin: but then replacing the middle one should give us a slightly bigger perf increase than the other two right?
15:07karolherbst: but it didn't
15:07karolherbst: we should just not put muls/mads after each other
15:07karolherbst: for better pipelining
15:07imirkin: don't put multiple mul's together, mad or not
15:07karolherbst: let's see what my nvidia dump of that shader says :D
15:08airlied: imirkin: I've written code that should do image stores, so far it does not :)
15:08glennk: most of the nv ops have shader-visible latencies for values to become available don't they?
15:08karolherbst: imirkin: yep
15:08karolherbst: it tries
15:08imirkin: airlied: hehehe. i know the feeling.
15:08airlied: imirkin: on r600 only the frag and compute shaders can do images
15:08glennk: radeon hides most of that in the hardware so you can issue dependent ops back to back
15:08airlied: and the images are bound to the same thing as color outputs
15:08imirkin: airlied: well the way i have it set up, that's what i'm doing on fermi as well
15:09glennk: airlied, MEM_RAT you mean?
15:09karolherbst: imirkin: sometimes it even does this: https://gist.github.com/karolherbst/eb059776753718ef5c3c
15:09imirkin: airlied: ah, no, separate binding points on fermi. just only 8 of them total.
15:09airlied: glennk: yeah the RATs
15:09airlied: imirkin: we have 12 binding points, but fglrx only exposes 8 ever
15:09imirkin: could access them in vertex shader as well but... not worth exposing it
15:09glennk: i think on cayman there's some alternate way to do scatter/gather reads?
15:09glennk: airlied, yeah the other 4 are specifically RAT only
15:10karolherbst: imirkin: can I do stuff like mov $r63 $r63?
15:10karolherbst: ohh well, would be optimized away anyway
15:10imirkin: sure, any mov to $r63 will have no effect though
15:10glennk: xor $r63 $r63
15:10imirkin: $r63 is the zero register
15:11karolherbst: I know
15:11karolherbst: I am just thinking how we could try that theory out
15:11glennk: kind of funny having the zero register be something not r0
15:11karolherbst: imirkin: k, same idea convert mads to mul/add when the next instruction is a mul/mad
15:16imirkin: we coudl be calculating the sched data wrong
15:16karolherbst: yeah could be
15:16imirkin: that seems like the most likely thing tbh
15:18mwk: imirkin: well, I found something with my testing
15:18glennk: also, any spilling at all = bad bad bad
15:19mwk: apparently pre-G200 GPUs properly support red/atom.inc/dec s32
15:19karolherbst: imirkin: actually I already found a difference: https://github.com/karolherbst/mesa/commit/c4394f28a2adaa736f9d57414fdc60dbfbf385ff
15:19karolherbst: but I doubt that changes anything at all
15:20mwk: if you use positive second arg, it works like u32; if you use negative, it keeps increasing/decreasing the argument in [s2, 0) range, wrapping at ends
15:20mwk: and if you attempt to use it on an argument outside that range, it jumps to s2
15:21imirkin: i dunno if i'll be looking at tesla compute shaders anytime soon...
15:21mwk: no idea why they removed it on G200
15:21mwk: doesn't seem to be an accidental feature
15:22imirkin: ran out of die area? who knows
15:23imirkin: doesn't seem like this would be so big though
15:24mwk: correction, range is [s2, 0]
15:34karolherbst: imirkin: because I am so smart I removed the sched data from the blob dump I have :/ meh
15:36imirkin: karolherbst: yes, that was genius
15:39karolherbst: playing around with the scheduling stuff is really dangerous by the way :/
15:49glennk: karolherbst, wrong -> hang?
15:50karolherbst: glennk: if you dual issue every instruction you get 10 times higher frame times
15:50glennk: better than a cold reset
15:53karolherbst: no, I find a cold reset better than this
15:53karolherbst: because this you don't notice
15:53karolherbst: imagine 1% of all instructions get wrongly dual issued
15:53karolherbst: you will never find out
15:54karolherbst: hakzsam: do you think there might be _some_ way to check if something gets wrongly dual issued through some "nvidia internal" counters or anything?
15:55hakzsam: karolherbst, not sure if this kind of information is presently exposed
15:56karolherbst: hakzsam: I am wondering, because the perf impact for disabling dual issuing isn't that big, maybe we just do it sometimes wrong
15:58hakzsam: karolherbst, mmh.. maybe have a look at inst_issued_X via the HUD
15:59hakzsam: these counters might help, but not sure if they expose all you want :)
15:59karolherbst: hakzsam: what is inst_issued1 and inst_issued2?
16:00karolherbst: got it
16:00hakzsam: karolherbst, number of single/dual instructions issued
16:02karolherbst: hakzsam: funny, they seem to help actually
16:02karolherbst: hakzsam: I dual issue everything, but inst_issued1 still shows stuff
16:02hakzsam: karolherbst, cool :)
16:03karolherbst: another thing: same instruction count
16:03karolherbst: but the perf is just horrible
16:03hakzsam: karolherbst, please report any issues you find with perf counters btw
16:04karolherbst: yeah, when I find some...
16:04hakzsam: I'm sure, you will
16:05karolherbst: okay okay...
16:05karolherbst: hakzsam: idea!
16:05karolherbst: we can get the total amount of instructions per frame, right?
16:06karolherbst: and if the ratio we dual issue doesn't fit with the ratio we get, then something might be fishy
16:06karolherbst: glennk: "gallium_hud: all queries are busy after 8 frames, can't add another query" :(
16:06karolherbst: hakzsam: ^
16:06karolherbst: I only did GALLIUM_HUD=inst_issued1,inst_issued2,inst_executed
16:07hakzsam: karolherbst, let me check
16:08hakzsam: karolherbst, on kepler?
16:08karolherbst: hakzsam: okay fun time: 1G inst executed, 660k issued1, 180k issued2, where are the others?
16:08karolherbst: hakzsam: yes
16:09karolherbst: hakzsam: when I disable dual issuing: 1G issued1
16:09karolherbst: hakzsam: 1000 - (660 + 180) = 160k wrongly dual issued?
16:10hakzsam: karolherbst, 660+(180*2) is better ;)
16:10karolherbst: mhh yeah, makes sense
16:11karolherbst: ohhhh okay
16:11karolherbst: another thing
16:11hakzsam: karolherbst, okay well, this HUD error happens sometimes when there is a synchro problem.. don't pay much attention
16:11karolherbst: yes, that really helps by the way
16:11karolherbst: actually this really helps...
16:12karolherbst: hakzsam: dual issue everything: executed: 1G, issued2: 1G issued1: 660k
16:13karolherbst: so perfect dual issuing: executed == issued2 * 2 + issued1
16:13karolherbst: if the right side is higher => something wrong
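The counter accounting karolherbst works out here can be written as a tiny sanity check (a sketch using the log's numbers; the function name is made up):

```python
def issue_discrepancy(executed, issued1, issued2):
    """Each inst_issued2 event covers two instructions, so under perfect
    counting issued1 + 2*issued2 == inst_executed. A nonzero result
    suggests either mis-paired dual issues or counter-readout noise."""
    return (issued1 + 2 * issued2) - executed

# numbers from the log, in thousands: issued side 1022k vs 1015k executed
print(issue_discrepancy(1015, 662, 180))   # 7
```

As noted just below in the log, a small positive residue like this could simply be inaccuracy while reading the counters out.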
16:14karolherbst: but mhh, I really would like to display all three
16:14karolherbst: but I can't :/
16:14karolherbst: the graphs don't update
16:16karolherbst: hakzsam: can I disable those k/M things in the hud?
16:16hakzsam: karolherbst, you could use apitrace if you want more precision because you can replay a trace and monitor perf counters via GL_AMD_performance_monitor
16:16karolherbst: I really would like to have the real numbers
16:16hakzsam: the code has been merged a few days ago IIRC
16:17karolherbst: which code?
16:17hakzsam: karolherbst, you can't
16:17hakzsam: this one
16:19hakzsam: karolherbst, http://www.x.org/wiki/Events/XDC2015/Program/pitoiset_perf_counters.pdf
16:19hakzsam: slide 44
16:19hakzsam: it's a short howto
16:19karolherbst: hakzsam: I just hacked the hud a bit
16:19karolherbst: now I get the full numbers
16:20karolherbst: and issued1+issued2*2 is a bit higher than the total count
16:21karolherbst: 1015k vs 1022k
16:23karolherbst: but this could be also due to inaccuracy while reading it out
16:26hakzsam: yeah, that's possible
16:28karolherbst: yay! I found something which doesn't fit nouveaus stuff
16:28karolherbst: the third add isn't dual issued here
16:28karolherbst: but nouveau would
16:29imirkin: they're wrong!
16:29imirkin: can't do dual-issue with a limm?
16:30karolherbst: maybe not
16:30karolherbst: another found
16:31karolherbst: update gist
16:31karolherbst: the mul before the add with limm isn't dual issued either
16:31karolherbst: ohh but
16:32karolherbst: okay theory
16:33karolherbst: imirkin: you can only dual issue with a next instruction having a limm when you have a (l)imm
16:33karolherbst: reload gist, I think I have enough samples
16:33imirkin: right, can't mix limm'ness
16:33karolherbst: ohh, but the first have also a imm
16:33karolherbst: okay, then either both limm
16:33karolherbst: or none
16:34imirkin: look at the second example
16:34karolherbst: the third one is bothering me
16:34imirkin: oh, it's not a limm
16:34karolherbst: ohh right, 0x0 is $r63
16:35imirkin: why aren't the last two instructions in the 2nd example dual-issued?
16:35karolherbst: the last is never
16:36karolherbst: and the add can't be with the mul
16:36karolherbst: add has a limm
16:36imirkin: lines 16 + 17 in the gist
16:36karolherbst: ohh right, instruction 16 dual issuing with 17
16:36imirkin: it isn't
16:36imirkin: 0x4 = dual-issue right?
16:37imirkin: those two instructions get 0x28 + 0x20
16:37karolherbst: 17 depends on 16
16:37imirkin: right ok
16:37imirkin: and we detect that
16:37karolherbst: I think
16:37karolherbst: actually I don't know if that matters
16:38karolherbst: would make sense though
16:38imirkin: perhaps just no add + add if they're different limm-ness?
16:38imirkin: coz yeah, third example mixes it up
16:39karolherbst: nouveau currently doesn't check if the result depends on the instruction before :/
16:39karolherbst: I am sure it doesn't matter though
16:40imirkin: wait what?
16:40imirkin: oh wtf!
16:40imirkin: it needs to check
16:41karolherbst: the blob actually cares about that :O
16:41karolherbst: I just checked like 100 instructions and the result is never used in the next one
16:41karolherbst: this can have a different reason
16:42imirkin: ah no. i think nouveau accounts for it
16:42imirkin: that's the delay >= 0 bit
16:43karolherbst: ohh right
16:43karolherbst: k, then this is checked
16:43imirkin: at least i think
16:43karolherbst: should be fine
16:43karolherbst: otherwise we would have noticed
16:44imirkin: probably the reason why the other thing isn't dual-issued as well
16:44karolherbst: dual issue only when delay == 0
16:45karolherbst: yep, found another example
16:45karolherbst: so 0x20 means delay == 0
16:45karolherbst: okay, what's important is, if the next instruction has a limm
16:46karolherbst: but only...
16:46karolherbst: imirkin: update and check the 5th example
16:47imirkin: ok, so it just hates limm and non-limm adds
16:47karolherbst: no, I think that isn't it
16:47karolherbst: "mul ftz rn f32 $r0 $r1 $r0" "add ftz f32 $r1 neg $r10 0x3d99999a"
16:47karolherbst: mul not dual issued
16:47karolherbst: add overwrites $r1
16:48karolherbst: but delay is 0?
16:49karolherbst: wait no
16:49karolherbst: the overwrite doesn't matter
16:49karolherbst: example 2
16:49imirkin: the last 2 adds aren't dual-issued
16:49karolherbst: fma is dual issued though the next mul overwrites
16:49imirkin: er wait
16:49imirkin: wrong pair
16:49karolherbst: imirkin: yeah
16:49karolherbst: the max already is ;)
16:50karolherbst: anyway, we should do this: https://github.com/karolherbst/mesa/commit/c4394f28a2adaa736f9d57414fdc60dbfbf385ff
16:50karolherbst: currently nouveau sometimes dual issues the seventh instruction
16:50karolherbst: and the blob obviously doesn't
16:50imirkin: uhhh... are you sure?
16:50imirkin: it's not about which instruction you are
16:50imirkin: it's about alignment
16:51karolherbst: imirkin: very sure: https://gist.github.com/karolherbst/a47e85c06a665b76949b
16:51karolherbst: last one: never
16:51karolherbst: it doesn't effect performance
16:52karolherbst: if you dual issue the 7th one you can't hurt it
16:52karolherbst: so I think the hardware just ignores it anyway
16:52karolherbst: but maybe the delay data can help
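The pairing rules pieced together from the blob traces in this exchange can be collected into one predicate (a working theory inferred from the examples above, not a confirmed hardware rule; all names are made up):

```python
def can_dual_issue(a_has_limm, b_has_limm, delay, is_last_in_block):
    """Dual-issue conditions as inferred from the blob dumps:
    - the pair must agree on limm-ness: both use a long immediate,
      or neither does ("can't mix limm'ness"),
    - the second instruction must not wait on the first (delay == 0),
    - the last instruction of a sched block is never dual-issued
      (the blob never pairs the seventh one)."""
    if a_has_limm != b_has_limm:
        return False
    if delay != 0:
        return False
    if is_last_in_block:
        return False
    return True

print(can_dual_issue(True, True, 0, False))    # True: both limm, no wait
print(can_dual_issue(False, True, 0, False))   # False: mixed limm-ness
print(can_dual_issue(True, True, 0, True))     # False: last slot never pairs
```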
16:56imirkin: robclark: fyi, we check for num_crtc == 0 for gpu's that really just don't have any crtc's at all, i.e. no display hardware.
16:56imirkin: robclark: nvidia gets rid of it on some 3d-only accelerator models
16:56imirkin: and accessing it when it's not there causes all sorts of issues
16:57karolherbst: imirkin: sixth example :/
16:57karolherbst: dual issued add although the next one has a limm
16:59jeremySal: imirkin: Okay, I think I understand all of the involved extensions, and I think I have made the proper test case. However, piglit is telling me that I don't have the ARB_shader_viewport_layer_array extension. I think this is related to using the mesa headers?
17:00imirkin: jeremySal: are you using the blob?
17:00jeremySal: I am using the blob
17:00imirkin: jeremySal: pastebin glxinfo
17:00jeremySal: But piglit was complaining about missing opengl headers
17:00jeremySal: so I installed the mesa headers
17:00jeremySal: It's on another computer
17:00jeremySal: but "server glx vendor string NVIDIA CORPORATION
17:00imirkin: ok, so what's the issue again?
17:01imirkin: you just have to write a short shader_test file
17:01imirkin: build shader_runner
17:01imirkin: and run + trace it
17:01robclark: imirkin, hmm, so maybe you need to check for num_connectors==0 too (for optimus laptops w/ no outputs wired up to gpu)
17:01jeremySal: yes, but the extension is not available
17:01jeremySal: the one I'm supposed to test
17:02jeremySal: Maybe it's only in recent versions of the blob?
17:02imirkin: robclark: yeah probably. actually a lot of those optimus chips are the display-less ones.
17:02imirkin: jeremySal: that's why i'm asking you to pastebin glxinfo
17:02jeremySal: ok, one sec
17:02imirkin: jeremySal: and yeah, you need a moderately recent blob version
17:03imirkin: i think the 35x series ought to have it
17:03robclark: imirkin, hmm, ok.. well, seems like a num_connectors check wouldn't be a horrible idea.. although I haven't actually seen any issue like this reported on intel+nv laptops, which I guess are a bit more common than intel+radeon ;-)
17:04karolherbst: imirkin: but I think the blob tries real hard to dual issue 3 instructions per block
17:04imirkin: jeremySal: it actually exposes GL_AMD_vertex_shader_layer -- just trace the piglits as-is :)
17:05karolherbst: imirkin: and nouveau dual issues less than that
17:05karolherbst: imirkin: maybe a reorder pass which just tries to increase dual issuing might be worth it?
17:05imirkin: robclark: i wasn't entirely sure what the issue you ran into was, but was just pointing out the reason why we try hard not to touch display things when there are no crtc's -- the hw is just plain fused off, causes errors if you touch it
17:07jeremySal: imirkin: "global variable gl_FragColor is deprecated after version 120"
17:08imirkin: jeremySal: is that just a warning or an actual error?
17:08jeremySal: error, it fails
17:08jeremySal: Failed to compile
17:08imirkin: jeremySal: ok, above main, add "out vec4 color;", and then instead of gl_FragColor = ... do color = ...
17:09imirkin: strictly speaking it is right that it's deprecated after version 120, but not compiling is a little harsh...
17:09imirkin: maybe that's the core profile behaviour
17:10jeremySal: curious: how does it know that the out4 color is the actual color you want to render?
17:10imirkin: jeremySal: "out" tells it it's an output
17:10jeremySal: imirkin: so it assumes the first output is the color? Sorry, I'm not familiar with this.
17:11imirkin: jeremySal: no worries... fragment shaders can only output colors :)
17:11jeremySal: ok nice
17:11imirkin: there are also some built-ins, like gl_FragDepth and gl_SampleMask but those are special
17:13jeremySal: is there a standard way to kill shader_runner? CTRL+C? exit the window?
17:14imirkin: esc exits
17:14imirkin: i think you want -auto
17:15jeremySal: shader_runner doesn't seem to take args
17:15imirkin: put -auto at the end
17:15imirkin: trust me :)
17:17jeremySal: So I ran valgrind directly on shader_runner and it produced a 9MB dump, but it seems like demmt won't read it (??)
17:17jeremySal: "unknown type: 0x1b"
17:18imirkin: you ran it with --tool=mmt right?
17:18jeremySal: wait... I was passing the filename as an arg instead of piping it in
17:19jeremySal: yes I did
17:19imirkin: you want --log-file=foo.mmt
17:19jeremySal: huh, it gives me a segfault when I try to pipe the output of demmt to a file
17:20imirkin: don't do that
17:20imirkin: just use the default less pager
17:20imirkin: anyways... sometimes things change with blob versions
17:20imirkin: and the demmt tool isn't perfect
17:20jeremySal: The less log looks good
17:20imirkin: if you xz -9 the mmt file and upload it somewhere i can see what's up
17:21imirkin: probably some issue late in the file
17:22jeremySal: k, I'm uploading
17:30imirkin: looks like i messed something up
17:32imirkin: jeremySal: ok, pull. crash fixed.
17:33imirkin: yep, looks like they just write to a[0x64]
17:46imirkin: jeremySal: the next order of business is to figure out how https://www.opengl.org/registry/specs/NV/viewport_array2.txt works -- gl_ViewportMask
17:59jeremySal: imirkin: Where could I find this in the log?
17:59imirkin: search for START_ID
17:59imirkin: that will show you the shaders
17:59imirkin: go until you find the right one
17:59imirkin: and try to match up the code you wrote to what the ops are doing
18:12jeremySal: imirkin: is there a list of the instructions output by demmt? For example I can't find "st" in the list of vertex shader instructions
18:15imirkin: what list are you looking at?
18:15imirkin: either way, "st" = "store"
18:15jeremySal: Idk, it was DX8 vertex shader instructions
18:16jeremySal: the best list I could find
18:16imirkin: DX8 = 2001
18:16imirkin: so... a bit out of date
18:16imirkin: also, those would be DX8 shader model instructions, not actual gpu instructions
18:16imirkin: for maxwell, envydis uses mostly nvdisasm's mnemonics (which is a tool that comes with the cuda tools)
18:16jeremySal: I see, is there a reason why nothing else turns up on google?
18:17imirkin: yeah, this stuff isn't really publicly documented
18:18jeremySal: I can find a document describing how to implement the "missing" vertex shader instructions, but not the non-missing ones
18:19imirkin: what missing instructions?
18:19jeremySal: not "missing" per se, just how to do things that don't have an instruction primitive
18:20jeremySal: but are simple enough that people expect there to be
18:20imirkin: yeah, nothing you'll find in google searches will be of any use to anything we do
18:20jeremySal: I can't believe I can't find a list of nvasm instructions
18:20jeremySal: I mean doesn't nvidia provide documentation?
18:21imirkin: nvidia provides a fake-o isa, aka PTX ISA: http://docs.nvidia.com/cuda/parallel-thread-execution/
18:21imirkin: but it doesn't map 1:1 to any gpu's ops
18:21imirkin: although there can be many similarities at times
18:21imirkin: it's also compute shader focused, so won't have any of the bits relating to GL pipeline items
18:22jeremySal: oh cool
18:23jeremySal: what does the sched instruction do?
18:27imirkin: it's not an actual instruction
18:27imirkin: it just provides scheduling info to the executor
18:28imirkin: instructions are loaded in groups of 32 bytes, and each one starts with an 8-byte sched info descriptor
18:28imirkin: showing it as an instruction in the parser was the simplest way of showing that
18:29imirkin: since we process stuff 8 bytes at a time, sometimes a piece of sched data will look a lot like a real instruction and we'll decode it as such. but it's really sched data.
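The layout imirkin describes can be sketched as a toy slicer (a simplified illustration with invented names, not a real disassembler):

```python
import struct

def split_sched_groups(code):
    """Slice an instruction stream the way described above: 32-byte
    groups, each led by one 8-byte sched descriptor followed by three
    8-byte instruction words."""
    groups = []
    for off in range(0, len(code), 32):
        words = struct.unpack_from("<4Q", code, off)
        groups.append({"sched": words[0], "instrs": list(words[1:])})
    return groups

# one 32-byte group of dummy data: sched word 0xdead, instructions 1..3
blob = struct.pack("<4Q", 0xdead, 1, 2, 3)
print(split_sched_groups(blob))
```

A dumb decoder that walks 8 bytes at a time would feed `0xdead` through the instruction tables too, which is exactly the mis-decode imirkin mentions.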
18:43Jayhost: Is anyone here to guide on reclocking path?
18:44orbea: please stop
18:45Jayhost: Accident. It looks like I should get a different client.
18:45peb`: shit happens sometimes :)
18:48orbea: Jayhost: for reclocking, 'cat /sys/class/drm/card0/device/pstate' and then echo one of the values to the file. example: 'echo 07 > /sys/class/drm/card0/device/pstate'. Results will vary depending on your kernel version
18:49orbea: also add nouveau.pstate=1 to your boot
18:49imirkin: i think he's looking on info for how to make it work, not on how to operate it once it does work
18:50orbea: the question was a bit vague
18:52Jayhost_: peb` third time I fat fingered recently
18:53Jayhost_: question was supposed to be reclocking path for new card | maxwell | gm107
18:54lanteau: imirkin: not sure if you're around, was told on the debian-powerpc mailing list that you might be the guy to talk to on my nouveau woes
18:57imirkin: i suppose something's not working?
18:57imirkin: were you the guy who mailed the list about a NV40 recently?
18:58lanteau: imirkin: for better or for worse, yes that's me
18:59imirkin: i remember glancing at it and having no clue
18:59imirkin: it's trying to do a blit, but then hits some sort of protection fault? but why?
18:59lanteau: by all accounts (by my limited understanding of nouveau and its hardware support), the NV40 should work right?
19:00imirkin: it def *ought* to work
19:00imirkin: you could also try nvidiafb
19:00imirkin: instead of nouveau
19:00imirkin: that will get you an accelerated fbdev
19:00imirkin: but no nouveau accel of any sort
19:01imirkin: may be worth noting that while i did make mesa kinda-sorta run on at least my NV34
19:01imirkin: it's definitely far from perfect
19:01imirkin: there's a lot of confusion in my head about what is what endian when and where
19:01lanteau: hmm, I just want a usable 2D desktop, complete software rendering doesn't seem to be the ticket
19:02imirkin: well, i use the same 2D desktop i used (almost) 20 years ago, and i did it just fine back then without a space-age video accelerator
19:03imirkin: does "2d desktop" include something like a compositor?
19:03imirkin: perhaps a GL-based compositor?
19:03imirkin: if so i'd strongly advise against that
19:03lanteau: I guess in my mind, it's one of those things where the NV40 *should* work with nouveau and now I feel like I need to figure out why it doesn't
19:03imirkin: it definitely *should*
19:04imirkin: are you using something like gnome-shell?
19:04imirkin: or kde plasma
19:05lanteau: well, it's failing when Debian is trying to launch lightdm, so I wouldn't think lightdm would be the cause of any issues...
19:05imirkin: yeah iirc lightdm doesn't do anything particularly dumb
19:05imirkin: sadly they killed gdm -- it uses gnome-shell now :(
19:06lanteau: ahh, that explains the move to lightdm
19:06imirkin: used to be a perfectly fine replacement for xdm, and then, poof. but i digress.
19:06imirkin: you might have mentioned this, but are you using xf86-video-nouveau 1.0.12?
19:06imirkin: yes you are.
19:06lanteau: I went through the trouble to get a kernel with 4k page sizes, I know that was a sticking point for some
19:07imirkin: yeah, there's a lot of confusion in nouveau around gpu vs host page tables
19:07imirkin: er, pages
19:07imirkin: gpu pages are all 4k
19:09lanteau: The Xorg log file didn't seem to reveal anything in my opinion. Saw a lot of normal looking nouveau messages in there
19:10imirkin: yeah, i mean... we get that protection fault, and then we hang.
19:10imirkin: this is nv40, so before there was a gpu mmu...
19:10imirkin: not 100% sure what a protection fault means there... we went beyond the dma descriptor?
19:14urjaman: ... hmm, interested in something in my dmesg that starts with this: [27504.324606] nouveau 0000:02:00.0: fb: trapped read at 0020506100 on channel 2 [1fb16000 Xorg] engine 00 [PGRAPH] client 0a [TEXTURE] subclient 00  reason 00000002 [PAGE_NOT_PRESENT]
19:15urjaman: i was looking for a device name of a disk :P
19:16imirkin: urjaman: unfortunately that error provides no real additional information beyond that an error happened :(
19:18urjaman: http://d11mgdpsdcgrvc.cloudfront.net/hmm123.txt did come with a bit more data, but i suppose it's all just like "make: error 2" :P
19:19lanteau: imirkin: is there anything I can try on the system to give guys more information that would help in figuring this out?
19:19urjaman: this just seemed different to what i'd seen before so thought i'd give it an ask
19:19imirkin: urjaman: the fact that the dma pusher came up with errors at the same time leads me to believe it's one of those "wtf" errors we have no idea about.
19:20urjaman: ok (do tell me if i'm being a bother)
19:20imirkin: urjaman: if you can repro this, i'd be interested. otherwise there's not a whole lot i can do
19:21imirkin: lanteau: unfortunately i'm not sure where to begin
19:21urjaman: ok, cannot (most likely) because i didn't even notice it ...
19:22urjaman: it's quite certain that some kind of error will happen sooner or later here though, i just don't have a single case (though it's mostly web browsing related, flash or firefox :P)
19:22imirkin: yeah, and that's the problem
19:22lanteau: hmm okay
19:23imirkin: the error definitely happens, but i have no clue what brings it on
19:23imirkin: and in order to analyze it, i'd need a full mmt trace along with the errors in dmesg
19:24urjaman: and yeah that'd get very big quite quickly i assume
19:25urjaman: but ok i'll keep it in mind if i find something to crash it quickly with
19:27imirkin: lanteau: i would see if nvidiafb (another driver in the linux kernel) fares better
19:28imirkin: lanteau: also is this a recent issue? did this work in older kernels? like 3.17 or so? (some ppc stuff got broken in 3.18 or 3.19 or so)
19:29lanteau: Nope, tried in 3.16 with no luck
19:31imirkin: hm ok
19:32imirkin: does the console work ok?
19:32imirkin: you could probably get things going if you booted with nouveau.noaccel=1
19:38lanteau: yeah nouveau.noaccel=1 works, I get to lightdm
19:49imirkin: lanteau: right, that makes sense... it's all cpu-rendered
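
[editor's note: for reference, one way to make the `nouveau.noaccel=1` boot parameter persistent. This is a sketch assuming a GRUB-based Debian install; file paths and the existing contents of the variable may differ on your system.]

```shell
# /etc/default/grub -- append the nouveau option to the default kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet nouveau.noaccel=1"

# regenerate the grub config, then reboot
sudo update-grub
```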
19:49lanteau: would nvidiafb be all cpu rendered as well?
19:51jeremySal: imirkin: What is the ipa instruction? I assume it doesn't involve beer brewing :)
19:53imirkin: jeremySal: sadly no. interpolate.
19:53imirkin: lanteau: i think so, yeah...
19:53imirkin: you still get a hw cursor either way :)
19:54lanteau: lol so probably no perceivable difference between nvidiafb and nouveau.noaccel=1
19:58imirkin: unlikely to be any... nouveau might be more flexible since you still get modesetting and whatnot
19:58imirkin: not sure how well that works with nvidiafb
19:58imirkin: i know quite little about fbdev
19:59imirkin: i do think i heard that people with PCIe G5's got their nv4x's up and running though
19:59imirkin: so this might be something specific to AGP nv4x :(
19:59imirkin: which makes it even weirder
19:59imirkin: and even more in skeggsb territory
20:00lanteau: well I have another G5 with a PCIe GeForce 6600...so I guess I need to try that one now too
20:02jeremySal: imirkin: What does $p0 mean before an instruction?
20:03jeremySal: and what does a[0x64] mean? The value at address 64 offset by a?
20:03imirkin: jeremySal: predicated
20:04imirkin: i.e. only executed if $p0 is true
20:06imirkin: a is "shader output memory space"
20:06imirkin: for all the non-fragment shaders
20:06imirkin: you might find this useful: http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_program.c
20:06imirkin: note that: case TGSI_SEMANTIC_LAYER: return 0x064;
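
[editor's note: the exchange above can be illustrated with a rough disassembly fragment. The syntax, opcodes, and register names below are illustrative only, not exact envydis output for any particular chip.]

```
set $p0 lt f32 $r0 $r1        // $p0 = ($r0 < $r1)
$p0 mov b32 $r2 0x3f800000    // predicated: executes only when $p0 is true
st b32 a[0x64] $r3            // a[] = shader output space (non-fragment stages);
                              // 0x064 is the TGSI_SEMANTIC_LAYER slot per nvc0_program.c
```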
20:12lanteau: well imirkin thanks for your help, I'll use noaccel for now and maybe talk to skeggsb sometime when he is around. I'll have to report back on whether or not nouveau works on my G5 with the PCIe 6600
20:18imirkin: lanteau: oh, one more idea -- try booting with nouveau.config=NvAGP=0
20:18imirkin: lanteau: that will disable AGP entirely which isn't *great* for performance, but might be the source of some of your problems
20:19imirkin: (although hm, ppc normally gets agp disabled anyways... perhaps it already is for you)
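
[editor's note: a quick way to check whether the parameter actually took effect after booting with it, since ppc may already have AGP disabled. Sketch only; the parameter must have been added to the kernel command line, e.g. by editing the entry at the GRUB menu for a single boot.]

```shell
# after booting with the extra parameter, confirm it made it onto the command line
grep -o 'nouveau.config=NvAGP=0' /proc/cmdline
```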
20:23lanteau: imirkin: nope, unfortunately that didn't help
20:53imirkin: skeggsb: perhaps worth testing if you have an AGP nv4x, booting that on x86 with agp disabled...