00:10mupuf: imirkin: pretty sure there is a GF108 already (c1)
00:23mwk: I seem to have lost my "run compute shit on Fermi" nva program
00:24mwk: otoh, I found 15 distinct copies of peek.c/nvapeek.c
00:25mwk: and that's not including variations like "peek from an NV3", which used to be a different program due to a different PCI vendor id...
00:27mwk:tries connecting one more ancient hdd
00:48mwk: ah, found it... on an almost dead hdd
00:51karlmag: aaah... data mining.. or - computer archaeology... I'm so not looking forward to doing that when I get my new file servers online :-P (Got a couple piles of old disks lying about :-P )
00:56mwk: the good news is, that machine is up and running again
00:56mwk:hasn't connected it to the grid since the move
00:57mwk: the bad news is, it now boots from NFS :)
02:21karolherbst: imirkin: I think I will do the tex pass also post-ra to not hurt the binary too much, so that it is a clear win situation. Any objections to this?
03:10karolherbst: Tom^: I think I have a working dynamic reclocking thing now, only memory stuff needs some tweaking, but generally it shouldn't have the issues you encountered anymore
03:14karolherbst: are there any other instructions which might benefit from an earlier execution similar to OP_TEX, maybe stuff like VFETCH?
04:13urjaman: guh mmt is.... slow...
04:14urjaman: and with my luck this firefox will load the page just fine now
04:14karolherbst: urjaman: as long as you don't have to wait 10 seconds for one frame
04:14urjaman: didnt o.o
04:15karolherbst: well I mmt traced saints row IV, and I had to wait 10 seconds :D
04:15karolherbst: ohh wait, that was apitrace :O
04:15karolherbst: so with mmt it should be even worse
04:20karolherbst: yeah benefit! finally :/
04:21karolherbst: more fps in unigine without harming gpr count or causing spilling
04:24urjaman: so ok i got 352M of binary log ...
04:25karolherbst: demmt is your friend now
04:25urjaman: that i'm both compressing (xz -9) and trying to look at the end of (less is taking a lot of memory :P) at the same time
04:26urjaman: previously i was building demmt (and folks)
04:26urjaman: oh, only 5gb to view this file ...
04:26urjaman: Calculating line numbers... (interrupt to abort)
04:27urjaman: dont need those anyways...
04:28urjaman: yeah i dont know what to say about this ... it's G84 stuff :P
04:28urjaman: only 20M compressed
04:29urjaman: i'll upload that if somebody is interested.
04:34urjaman: ends with LOG: MSG: ==27926==
04:34urjaman: and i have no clue what it's trying to say
05:14imirkin: urjaman: do you have the dmesg errors that go with it?
05:47imirkin: jeremySal: looks like it
05:47imirkin: jeremySal: or extend shader_runner
06:04karolherbst: imirkin: these lines have to be changed for kepler gen2, right? https://github.com/karolherbst/mesa/blob/c8e5a796cf099703ee05b2d82003a5be7e29ce41/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp#L2871-L2873
06:05imirkin: those lines only apply to nv50 btw
06:05imirkin: not to fermi at all
06:06imirkin: nv50 has 128 regs, but regs >= 64 require alternate encoding
06:06imirkin: you could add a target callback of like ->getMaxShortReg()
06:09imirkin: it just so happens you get to float by on fermi/kepler1 since they only have 64 regs to begin with...
06:09imirkin: but that check is wrong
06:11karolherbst: imirkin: is it also wrong for those pre fermi cards?
06:12imirkin: no, it's right for nv50
06:13karolherbst: imirkin: okay, so for fermi+ this new function should return the number of regs or is there something else I have to respect?
06:13imirkin: (nv50 is what comes before fermi)
06:14imirkin: maybe just add a bit to the target that's like "cares about short encodings"
06:14imirkin: basically i hate having chipset checks in common code
06:14imirkin: sometimes it's difficult to avoid
06:14imirkin: but i try to keep it to a minimum
06:14karlmag: (nv50? Tesla IIRC)
06:15imirkin: nv50 = tesla
06:15karlmag:have looked too much on the card lists :-P
06:16RSpliet: not to be confused with the Tesla range of HPC cards
06:16karlmag: or the car :-P
06:16RSpliet: hence we usually refer to it as NV50
06:17karlmag:tries to take note
06:18karlmag: the pr people should get some whipping for that, actually...
06:18karlmag: "let's make some extra confusion about our own products"
06:19imirkin: meh wtvr. the assumption should be that there's no link between marketing and engineering names
06:19karlmag: I'll shut up about that for a bit now.. I can rant for a looong time.
06:37karolherbst: imirkin: yeah okay, I know it is safe for now, but maybe I just add the function already and let it return something meaningful, 64 for nv50 and the max reg count for fermi+
06:37karolherbst: or would that be wrong?
06:37imirkin: it'd be fine
06:37imirkin: there's a ->getFileSize() already
06:37imirkin: you should do something similar
06:38imirkin: the thing is that this specifically relates to short encodings on tesla :)
06:38imirkin: and the pass is called Nv50Stuff
06:39karolherbst: yeah I know
06:39karolherbst: but it makes sense for other chips too
06:41imirkin: yeah, but that's not something i realized when i recommended roy make it into a nv50-specific pass :)
06:41imirkin: btw, iirc i have a change for gk110 emitting this junk
06:41imirkin: so just need to fix up gm107 a bit
06:41RSpliet: nor did I for that matter
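The target-callback idea discussed above (a per-target answer instead of a chipset check in common code) could be sketched roughly as follows. This is a Python illustration only, not Mesa code: the `Target` class, `get_max_short_reg` and `needs_alternate_encoding` are hypothetical names, and the register counts are the ones quoted in the chat (128 regs on nv50 with a different encoding from 64 up, 64 regs on fermi/kepler1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical per-chipset description; not Mesa's real Target class."""
    name: str
    gpr_file_size: int    # total general-purpose registers (numbers from the chat)
    short_reg_limit: int  # how many registers fit the short encoding

    def get_max_short_reg(self) -> int:
        # Highest register index that still fits the short encoding.
        return self.short_reg_limit - 1


# nv50: 128 regs, but regs >= 64 require the alternate encoding.
NV50 = Target("nv50", gpr_file_size=128, short_reg_limit=64)
# fermi/kepler1 (per the chat): only 64 regs, so the limit is the whole file.
FERMI = Target("fermi", gpr_file_size=64, short_reg_limit=64)


def needs_alternate_encoding(target: Target, reg_index: int) -> bool:
    # Common code asks the target instead of checking the chipset itself.
    return reg_index > target.get_max_short_reg()


print(needs_alternate_encoding(NV50, 64))   # above the short range on nv50
print(needs_alternate_encoding(FERMI, 63))  # last register on fermi, still short
```

The point of the design is that the optimization pass stays free of chipset checks; each target only reports its own limit.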
06:42mlankhorst: who's going to fosdem this year?
06:43RSpliet: also: Hans de Goede, mupuf, pmoreau, karolherbst
06:47RSpliet: oh, also effractur :-P
06:56imirkin: some day a conference will be held in a place i'm willing to go to
07:04karolherbst: RSpliet: yeah? :D
07:05karolherbst: mhh I probably should have a look at the schedule before I leave :/
07:09pmoreau: I'm planning to attend jekstrand's talk about Vulkan (and all the graphics talk), but that's about it IIRC.
07:09pmoreau: I should have another look
07:10karolherbst: yeah I think I will hear all the graphics stuff too
07:10karolherbst: but you know what, the gaming stuff is at the same time :/
07:10pmoreau: Oh, right…
07:11pmoreau: I think there were a few confs I wanted to see in the gaming stuff, but the timing wasn't great
07:24karolherbst: imirkin: so is this 64 limit more about ... I don't know, maybe there should be something like canImmediate(OP, id, id, id, id)?
07:25karolherbst: and it either returns 0 or the src position where the immediate can be
07:25karolherbst: mhh but that also isn't the thing we want :/
07:33Tom^: karolherbst: you mean the flickering?
07:33karolherbst: Tom^: no, that it stuck at a clock
07:33karolherbst: the flickering is something I can't fix
07:33karolherbst: well I could try, but I can't test it
07:34Tom^: nice, i also just got my gpu cooler about to unbox it and im hoping they sent some thermal paste with it or im without it another day :p
07:34xexaxo: pmoreau: there's a talk about Vulkan... when is that ?
07:34pmoreau: On Saturday, around 12 IIRC
07:35xexaxo: heh I was looking at the graphics room :\
07:35pmoreau: Yeah, they put it in the hardware room…
07:35Tom^: karolherbst: i wonder if you maybe can
07:35Tom^: karolherbst: isnt your hdmi port wired to the nvidia gpu?
07:36Tom^: karolherbst: what happens if you plug in a monitor and start reclock?
07:38karolherbst: all wired to intel
08:41karolherbst: imirkin: I made the one patch more clear now: https://github.com/karolherbst/mesa/commit/13cae37ea3740cf5078b335904bfd36af0983ca2
08:42karolherbst: but as you may notice the lower condition is always fulfilled
08:42karolherbst: the old one
08:42imirkin_: karolherbst: thanks :)
08:43karolherbst: it is a pain that the patches are optimized for size :/
08:44imirkin_: karolherbst: that's not generally the case
08:44imirkin_: karolherbst: however in this one i think this way is clearer
08:45karolherbst: yeah, but the old condition is always true at that point :/
08:45imirkin_: that's ok
08:45imirkin_: karolherbst: btw you might be interested in i->swapSources()
08:46karolherbst: mhh yeah, makes sense
08:56karolherbst: imirkin_: I think I also want to add this: https://github.com/karolherbst/mesa/commit/2815826bd88e0ab4695e7bef0d7bf3bf8e839370
08:56karolherbst: the hurt things are caused by other runs moving stuff around a little
08:56imirkin_: yeah. i'm probably going to re-run it all on my shaderdb to "normalize" the results
08:57karolherbst: ohh on the default shader-db?
08:57karolherbst: or what do you include?
08:57karolherbst: because if you rerun all that stuff I won't bother updating the statistics in my commit messages :D
08:57imirkin_: well, you should def have *something* in there
08:59karolherbst: imirkin_: I also want to finish up the dual issue stuff or shall I do that in another series?
09:00imirkin_: i don't really care how you split up your patches
09:00imirkin_: i consider them on a basically one-by-one basis anyways, unless they're in a logical group
09:01imirkin_: since these opts can be a bit tricky, i need to do a bunch of regression testing on your changes
09:01imirkin_: on at least my fermi and tesla boards
09:06karolherbst: I think I leave the dual issue patch for now, because I still want to play around there a little
09:28karolherbst: imirkin_: I sent the stuff out
09:28imirkin_: yeah, i saw
09:31jeremySal: imirkin do you want me to write tests of the new features, that fail if it's not working, or is the point just to get the traces
09:32imirkin_: jeremySal: well, it'd be SUPER awesome if you could write *actual* tests of functionality. However i don't really expect you to do that - it's difficult and a pain and i hate it, and see no reason to make you do it
09:33imirkin_: so... it has to be enough to be able to implement the feature, but not necessarily to fully test the impl
09:35jeremySal: imirkin: so what do you think about testing the conservative_raster extension? would you just test a single pixel in a single setup?
09:37karolherbst: imirkin_: these patches: 972 -> 978 points in pixmark_piano for 1024x640
09:38karolherbst: it is awesome that this benchmark just returns the same number every run :D
09:38imirkin_: jeremySal: quite honestly, i haven't looked at it in detail yet.
09:38imirkin_: jeremySal: try it, and look at the trace, and see if you can make out what's going on
09:39jeremySal: imirkin: it's pretty simple: the idea is that every pixel that overlaps the triangle is rasterized, not just those whose center is inside the triangle
09:39jeremySal: imirkin: but it gives the implementation some leeway to also include extraneous pixels
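The difference jeremySal describes (center-inside coverage versus "any overlap" coverage) can be illustrated with a small geometric sketch. This is plain Python, not piglit or GL code; the function names are made up for the example, pixels are assumed to be unit squares with their center at `(x + 0.5, y + 0.5)`, and the overlap check uses a separating-axis test. A real implementation is allowed to report extra pixels, so a test built on this would only check that every overlapping pixel is covered.

```python
def edge(a, b, p):
    # Signed area term: which side of the edge a->b the point p lies on.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers_center(tri, px, py):
    # Standard rasterization rule (simplified): the pixel center is inside.
    c = (px + 0.5, py + 0.5)
    w = [edge(tri[i], tri[(i + 1) % 3], c) for i in range(3)]
    return all(v >= 0 for v in w) or all(v <= 0 for v in w)


def overlaps_pixel(tri, px, py):
    # Conservative rule: the pixel square and the triangle intersect at all.
    square = [(px, py), (px + 1, py), (px + 1, py + 1), (px, py + 1)]
    axes = [(1.0, 0.0), (0.0, 1.0)]
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        axes.append((-(b[1] - a[1]), b[0] - a[0]))  # normal of the edge
    for ax in axes:
        t = [p[0] * ax[0] + p[1] * ax[1] for p in tri]
        s = [p[0] * ax[0] + p[1] * ax[1] for p in square]
        if max(t) < min(s) or max(s) < min(t):
            return False  # found a separating axis, so no overlap
    return True


tri = [(0.0, 0.0), (4.5, 0.0), (0.0, 4.5)]
print(covers_center(tri, 3, 1), overlaps_pixel(tri, 3, 1))
```

Pixel (3, 1) in this example is the interesting case: its center lies outside the triangle, but one corner lies inside, so only the conservative rule covers it.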
09:40imirkin_: oh i see
09:40jeremySal: so for something like that, would you just test a single pixel, or would you try to reproduce every single pixel
09:40imirkin_: so that's probably just a method somewhere that gets flipped from 0 to 1
09:40jeremySal: that is independently calculated to intersect
09:40jeremySal: yes exactly, it uses gl
09:40imirkin_: i don't mean a gl method
09:40imirkin_: i mean a gpu method
09:40imirkin_: like GM204.BLA_BLA_BLA
09:41imirkin_: you can think of them as "register writes"
09:41imirkin_: but they're actually method calls
09:41imirkin_: and can do crazy things. although most of them just store the value somewhere.
09:42jeremySal: I see
09:42jeremySal: I'm trying to write *actual* tests of the functionality, at least for this feature
09:42jeremySal: I'm trying to ask what the best way from piglit's perspective would be
09:43imirkin_: from the piglit perspective you want to have a test that verifies functionality
09:43jeremySal: like to test multiple scenarios or just one
09:43imirkin_: from the nouveau perspective, as long as we know how to teach the GPU about it, we're done
09:43jeremySal: and whether you check if the entire image is to spec, or just a single pixel
09:43imirkin_: btw is that an ARB ext, or a NV one?
09:44jeremySal: there was another one which looks like both? Do they like transition?
09:44jeremySal: from NV extension to ARB extension?
09:44imirkin_: well, something may be promoted to EXT or ARB yes
09:45imirkin_: ARB_post_depth_coverage should be a really easy one to trace as well
09:46jeremySal: okay, does that affect nouveau at all? Like does it create two mechanisms to use the feature?
09:46imirkin_: either it reads the sample mask from a different place, or it flips a method (like forcing early frag tests does)
09:46imirkin_: not at the hardware level
09:46imirkin_: mesa might need adjustment
09:46Tom^: karolherbst: woop woop, this is what i call proper gpu cooling, im not going above 64C ever now.
09:47Tom^: and the fans never go above 40% :P
09:48jeremySal: imirkin: about testing the conservative_raster functionality, I feel like I'm not really getting an answer from you. Is it because that's something you don't really know about? I'm trying to write an actual test of functionality, and I'm trying to understand what the best practices are.
09:48imirkin_: if you're trying to write an _actual_ test of functionality
09:48imirkin_: then you need to make sure that it does what you expect
09:49imirkin_: i doubt you can do that with just a single pixel
09:49imirkin_: but if you can, that's great
09:49jeremySal: imirkin: I'm asking because for example the existing tests might draw a green square over the entire screen and check if it's all green
09:49jeremySal: but they don't check that the boundary of the square is the exact pixel
09:49jeremySal: if you resize
09:50imirkin_: resizing is not a thing to worry about
09:50imirkin_: most such tests basically have a check in the fragment shader where it always outputs green or always red
09:50karolherbst: Tom^: nice
09:50imirkin_: it's a common way to write tests, but hardly the only one
09:52karolherbst: Tom^: if you want you can try out this: https://github.com/karolherbst/nouveau/commit/a8a2cf5c4852b8e2d0db24dcefa6891343114846
09:52karolherbst: should be much better
09:55jeremySal: I guess what I'm asking is how much assumption the test will make in favor of the driver? Like should the test check that unrelated features don't break when you enable the conservative rasterization? like checking that the center of the triangle is still filled when you enable conservative rasterization?
09:56imirkin_: jeremySal: usually you just test that enabling it does the thing it's supposed to do and disabling disables that
09:56imirkin_: jeremySal: sometimes you test interactions between features, but only if you think they're directly relevant
09:57karolherbst: nice with pixmark_piano I am now at 76.8% blob speed, and I began at 72.5%
10:03karolherbst: ohhh, I am nearly at the maximum regarding dual issuing already :O
10:04karolherbst: right because that counter value needs to be doubled
10:05imirkin_: karolherbst: time to triple issue!
10:05karolherbst: imirkin_: yeah well, my patch still does better than stock nouveau: 978->998 points
10:06karolherbst: I just thought I could improve it a lot more :/
10:06imirkin_: there's probably more going on
10:06imirkin_: i'd look at buffer movement
10:06imirkin_: and wait times
10:06karolherbst: there is nearly none I think
10:07imirkin_: so then it has to either be that nvidia optimizes their shaders a *lot* better
10:07imirkin_: or some feature like zcull
10:07karolherbst: yeah well, memory clock doesn't matter for that benchmark
10:07imirkin_: if i wasn't already doing 75 things i'd look at it
10:07karolherbst: not even a tiny bit
10:08karolherbst: yeah, I was hoping to push this benchmark to 90%
10:08karolherbst: so that we know that at least the compiler is doing a really good job already for computational tasks
10:09karolherbst: imirkin_: 1620MHz memory clock vs 4008 MHz memory clock: 998 points to 1002 points ...
10:09karolherbst: okay, there is a tiny difference
10:09karolherbst: well there are also some tex calls
10:10karolherbst: imirkin_: I think I'll try to optimize dual issuing a bit more, so that I don't check just the next three instructions but keep going until I can't swap anymore in post-ra
10:11karolherbst: if that means perfect dual issuing I am happy :D
10:19karolherbst: uhh now I messed up :/
10:28karolherbst: imirkin_: I guess there is no range based isCommutationLegal which checks a chain of instructions?
10:29imirkin_: not aware of one
10:29karolherbst: k, should be rather simple to write a util function for that though
10:41karolherbst: odd, my better algorithm creates worse result :/
10:42karolherbst: ohh for plot3d the better algorithm is better
10:42karolherbst: I guess that is the result of having a smart algorithm based on a dumb assumption
10:44urjaman: imirkin_: http://d11mgdpsdcgrvc.cloudfront.net/meh2.txt (i'm not 100% sure that these were caused by that crash, but maybe)
10:44karolherbst: 41.5% dual issuing is good though
10:46urjaman: and yeah i fell asleep...
10:49urjaman: oh and for some context: i found a website (jolla.com of all...) that ~immediately crashed firefox even on restoring tabs... thats a log of launching a firefox and restoring a lots of tabs (4 windows i think) with jolla.com being one of those thats actually loaded & visible
10:51urjaman: and that is crashed with that GLcontext crashed or something error that i showed previously (that first a few times then segfault... the valgrind run said "Killed" though but i guess that might be some effect of the emulation?)
10:52karolherbst: imirkin_: stock mesa: 34.7%, 38.4% with this: https://github.com/karolherbst/mesa/commit/0ad47523b6b7c0016af6b8c61f5494f6340e596c theoretical possible: 42,9%
10:59karolherbst: imirkin_: any objections to the idea itself? Or is that pass too expensive?
11:01karolherbst: impressive, in unigine heaven I have a dual issued rate of above 40%
11:02imirkin_: that pass has nothing to do with dual-issue right?
11:02imirkin_: it just moves instructions down as far as it can, seems like it'd benefit texbar's
11:02karolherbst: mhh no?
11:02karolherbst: see those "target->canDualIssue" calls?
11:02imirkin_: i do
11:03imirkin_: but then i see chained commutation
11:03imirkin_: maybe i'm not understanding what it's doing
11:03karolherbst: take instruction A
11:03imirkin_: ohhhh wait
11:03imirkin_: i see
11:03karolherbst: and see if it can dual issue with A->next
11:03imirkin_: i read it backwards
11:03imirkin_: if it can dual-issue, then you hit continue
11:04karolherbst: otherwise I search for the nearest instruction it can dual issue with
11:04karolherbst: and swap it step by step
11:04karolherbst: and isChainedCommutationLegal just checks if I can do such swap
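The pairing idea described above (try A with A->next, otherwise pull the nearest compatible instruction up next to A if every swap on the way is legal) can be sketched as a toy model. This is not the actual Mesa patch: instructions are plain dicts, `can_dual_issue` uses a made-up rule (different execution units), and the legality check only looks at register defs and uses.

```python
def can_dual_issue(a, b):
    # Hypothetical rule for illustration: pair only across different units.
    return a["unit"] != b["unit"]


def is_commutation_legal(a, b):
    # Swapping is legal when there is no RAW, WAR or WAW dependency.
    return not (a["defs"] & b["uses"] or b["defs"] & a["uses"] or a["defs"] & b["defs"])


def is_chained_commutation_legal(insns, first, last):
    # insns[last] must commute with everything in insns[first:last].
    return all(is_commutation_legal(insns[k], insns[last]) for k in range(first, last))


def pair_for_dual_issue(insns):
    insns = list(insns)
    i = 0
    while i + 1 < len(insns):
        if not can_dual_issue(insns[i], insns[i + 1]):
            # Search for the nearest later instruction that can pair with insns[i].
            for j in range(i + 2, len(insns)):
                if can_dual_issue(insns[i], insns[j]) and \
                        is_chained_commutation_legal(insns, i + 1, j):
                    insns.insert(i + 1, insns.pop(j))  # same as swapping step by step
                    break
        i += 2 if can_dual_issue(insns[i], insns[i + 1]) else 1
    return insns


prog = [
    {"name": "add0", "unit": "alu", "defs": {"r0"}, "uses": {"r1", "r2"}},
    {"name": "add1", "unit": "alu", "defs": {"r3"}, "uses": {"r1", "r4"}},
    {"name": "tex0", "unit": "tex", "defs": {"r5"}, "uses": {"r6"}},
]
print([i["name"] for i in pair_for_dual_issue(prog)])
```

In the example `tex0` has no dependency on `add1`, so it is moved up to sit directly after `add0`; if `add1` wrote `r6`, the chained check would fail and the order would stay as it was.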
11:27karolherbst: imirkin_: adding those SET variants doesn't change a thing though :/
11:28karolherbst: at least not in my shader-db
11:49urjaman: imirkin_: it might also be that the firefox crash left no dmesg traces and that log was from the chrome i was using to read the mmt & demmt instructions...
11:58karolherbst: imirkin_: fixed neg-set patch: https://github.com/karolherbst/mesa/commit/d1b7ca3bd631a8cb7af5e251850af1b13e88995a
12:05imirkin_: karolherbst: doesn't matter if it affects anything... it's the right thing to do
12:07karolherbst: ohh rcp is 1/a not 1/sqrt(a) :/ silly me
12:08imirkin_: rcp vs rsq
12:12karolherbst: imirkin_: because I think I might have overlooked some rcp(mul) stuff :/
12:20karolherbst: imirkin_: mul(rsq(abs(b)), b)
12:21imirkin_: sqrt(x) :)
12:21karolherbst: take care of the sign
12:21karolherbst: imagine b = -6
12:21karolherbst: but yeah
12:21imirkin_: the abs is there to avoid NaN's
12:22karolherbst: well it turns a negative b to be positive in the sqrt
12:22karolherbst: but we can't just mul(rsq(abs(b)), b) => sqrt(b)
12:22imirkin_: wtvr, as long as it's not NaN it's fine
12:22imirkin_: who cares
12:22imirkin_: there's no sqrt op anyways
12:22imirkin_: only rsq
12:24karolherbst: imirkin_: and I guess there is no instruction for b^a as well?
12:24imirkin_: which does 2^x
12:24imirkin_: pow() gets lowered to that + log to fix it all up
12:25karolherbst: and I guess that's not cheaper than mul(rsq)
12:25imirkin_: lol no
12:25imirkin_: and a lot less accurate
12:26karolherbst: the hell :/ is it common that gpus don't have pow or sqrt?
12:26imirkin_: almost none do
12:28glennk: radeons have fp32 sqrt
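The two identities in this exchange can be checked numerically. The sketch below is ordinary Python floating point, not GPU code; `rsq`, `ex2` and `lg2` are stand-ins named after the ops mentioned in the chat, and nothing here models hardware precision or NaN handling.

```python
import math


def rsq(x):
    return 1.0 / math.sqrt(x)   # reciprocal square root


def ex2(x):
    return 2.0 ** x             # the "2^x" op mentioned above


def lg2(x):
    return math.log2(x)


def mul_rsq_abs(b):
    # mul(rsq(abs(b)), b): sqrt(b) for b > 0, but -sqrt(|b|) for b < 0,
    # which is why it cannot simply be rewritten as sqrt(b).
    return rsq(abs(b)) * b


def pow_via_ex2(base, exponent):
    # pow(b, a) = 2^(a * log2(b)), only meaningful for base > 0.
    return ex2(exponent * lg2(base))


print(mul_rsq_abs(9.0), mul_rsq_abs(-6.0), pow_via_ex2(3.0, 2.0))
```

The `b = -6` case is karolherbst's sign objection: the result is negative rather than NaN, which matches imirkin's remark that the `abs` is there to avoid NaNs.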
12:30karolherbst: imirkin_: is mov cheaper than add?
12:31imirkin_: i should hope so
12:31karolherbst: then I found a crazy opt which wouldn't be worth implementing :D
12:32karolherbst: imirkin_: https://gist.github.com/karolherbst/7b2a4fb8ea638d256e60
12:34karolherbst: I totally don't see how it would be worth the time writing a good pass for this :D
12:37karolherbst: imirkin_: and then comes this: mul ftz f32 $r9 $r5 0.500000 ...
12:39imirkin_: what's wrong with that?
12:40karolherbst: ohh wait, I thought it could be also moved in, but I was wrong
12:48karolherbst: imirkin_: I am thinking if it makes sense to run passes again for changed instruction in the passes, like if a mad gets converted to an add, an AlgebraicOpt could be run for this add again
12:48karolherbst: or would you prefer a loop of the passes instead?
12:48imirkin_: karolherbst: yes, that's known as running optimizations to a fixed point
12:48imirkin_: we don't do that because a single run is faster and we get 99% of the benefits with just one run through the opts
12:49imirkin_: but perhaps it's worth it doing it up to a fixed point or a max of N times
12:49karolherbst: well I was thinking to do that only if i->op changes, or could that in theory change all over again through other passes?
12:50imirkin_: that's the "fixed point" bit of it
12:50imirkin_: i.e. run the opts until nothing changes
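"Fixed point or a max of N times" can be sketched as a small driver loop. The passes here are toy stand-ins invented for the example (a mad-by-1 turning into an add, and an add-of-0 turning into a mov); they are not nouveau's passes, and instructions are just tuples.

```python
def fold_mad(prog):
    # Toy pass: mad(a, 1, c) -> add(a, c)
    return [("add", i[1], i[3]) if i[0] == "mad" and i[2] == 1 else i for i in prog]


def fold_add(prog):
    # Toy pass: add(a, 0) -> mov(a)
    return [("mov", i[1]) if i[0] == "add" and i[2] == 0 else i for i in prog]


def run_to_fixed_point(program, passes, max_rounds=4):
    # Run all passes repeatedly until a whole round changes nothing,
    # or until max_rounds is reached (a guard against opts that fight).
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        before = program
        for opt in passes:
            program = opt(program)
        if program == before:
            break
    return program, rounds


# fold_add runs first, so the add created by fold_mad is only seen in round 2.
result, rounds = run_to_fixed_point([("mad", "x", 1, 0)], [fold_add, fold_mad])
print(result, rounds)
```

With a single round the add produced by `fold_mad` is left behind; the second round picks it up, and the third round only confirms that nothing changes anymore.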
12:54karolherbst: imirkin_: or I just write an opt for mad(mul(a, a), mul(a, a), b)
12:54karolherbst: ohh wait
12:55karolherbst: I meant
12:57karolherbst: mad(imm, imm, mul(a,b)) => mad(a, b, imm0*immp)
12:58imirkin_: ideally it'd be add(imm, mul(a,b)) which would then get picked up by algebraicopt on a future run
12:58karolherbst: whats wrong with me today.. I think my last thing is also wrong
12:58karolherbst: but I hope you got what I meant
12:59imirkin_: please try to avoid super-specialized opts
12:59imirkin_: and instead try to do things generically
12:59imirkin_: that will improve lots of situations
12:59imirkin_: rather than one very specific one
12:59karolherbst: yeah I know, but the question is, do we really want to run some stuff multiple times?
13:27karolherbst: imirkin_: I guess more locals used is a bad thing
13:28imirkin_: it generally means more spilling
13:29karolherbst: what the hell is this shader? :O
13:32karolherbst: imirkin_: https://gist.github.com/karolherbst/ffb2eeb1a9e754925ef4 :/
13:32karolherbst: there is more though
13:34imirkin_: eventually you'll find a l[$r0] type of thing
13:34imirkin_: and that's why it uses the stupid local memory
13:34imirkin_: coz it tries to store to it
13:34imirkin_: with an indirect
13:34karolherbst: no l[ in there
13:35imirkin_: oh wait
13:35imirkin_: those stores are just spills
13:35imirkin_: of constbuf loads
13:35karolherbst: ohh wait, wrong shader
13:35imirkin_: which is the dumbest thing ever
13:35imirkin_: because those can just be rematerialized later
13:35imirkin_: but the spill code inserter doesn't know about that
13:35karolherbst: k, I put the entire output in the gist
13:36karolherbst: there are l[ thingies
13:36karolherbst: I just searched in the wrong binary
13:36karolherbst: but, the hell?
13:36karolherbst: ld u32 $r0 c0[0xdc]; st u32 # l[0x1c] $r0
13:37imirkin_: like i said
13:37imirkin_: it's idiotic
13:37imirkin_: but it's because the spill code inserter doesn't know about constbufs and rematerialization
13:37karolherbst: ohh the shader_test file is also like 4300 loc, soo
13:37karolherbst: can there be anything done about that?
13:38karolherbst: now this thing is just super weird
13:39karolherbst: imirkin_: especially this: https://gist.github.com/karolherbst/ffb2eeb1a9e754925ef4#file-gistfile1-txt-L526-L536
13:39karolherbst: I know that the 0x0 is from RA and such
13:39karolherbst: it's not like $r6 is somewhat usefully used
13:42imirkin_: actually the second move is from RA
13:43imirkin_: the $r6 is used twice
13:43imirkin_: $r4t = r4:r5:r6
13:43imirkin_: r8t = r8:r9:r10
13:43imirkin_: the 0 has to be in both places
13:56karolherbst: imirkin_: mul(neg(mul(a,rsq(b))), neg(mul(a,rsq(b)))) => div(mul(a,a),b)?
13:57imirkin_: in algebraic opt...
13:57imirkin_: let's see
13:57imirkin_: mul(neg, neg) -> mul
13:57imirkin_: then you want some sort of expression rebalancing to notice that you're multiplying rsq(a) * rsq(a) = rcp(a)
13:57karolherbst: mhh okay
13:58karolherbst: the mul(neg,neg) -> mul thing should be easy :)
13:58imirkin_: but also not too useful
13:58karolherbst: but it makes other optimisations easier
13:59imirkin_: let's say you had like
13:59imirkin_: mul(mul(a,neg(rsq(b))), neg(mul(a,rsq(b)))
14:00imirkin_: you'd still want to detect it
14:00imirkin_: so what you really want is the expression analyzer
14:00imirkin_: aka the chainedmul thing we try to do
14:00imirkin_: except don't do a great job at
14:01karolherbst: oh damn, the result of the rsq is used elsewhere :/
14:02karolherbst: ohh but in the same way in the end
14:03karolherbst: just mul(mul(a,rsq(b)), mul(a,rsq(b)))
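The easy half of this, mul(neg, neg) -> mul, can be shown on a toy expression tree. The nested-tuple representation and the `simplify` function are invented for this sketch and have nothing to do with nv50_ir's data structures.

```python
def simplify(expr):
    # Expressions are nested tuples like ("mul", x, y); leaves are strings.
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [simplify(a) for a in args]
    # mul(neg(x), neg(y)) -> mul(x, y): the two sign flips cancel.
    if op == "mul" and all(isinstance(a, tuple) and a[0] == "neg" for a in args):
        return ("mul", args[0][1], args[1][1])
    return (op, *args)


term = ("mul", "a", ("rsq", "b"))
print(simplify(("mul", ("neg", term), ("neg", term))))
```

This only fires when both operands are a `neg` at the top level. imirkin's counter-example, where one `neg` is buried inside the inner `mul`, is left untouched, which is the argument for a more general expression analyzer.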
14:32karolherbst: imirkin_: just ran my loop of passes another time:
14:32karolherbst: helped 0 0 0 0
14:32karolherbst: hurt 0 320 447 447
14:32imirkin_: sounds like you did something wrong
14:33karolherbst: imirkin_: https://github.com/karolherbst/mesa/commit/403e373d7a3279fb8a7435f11089d5440340236b
14:33karolherbst: this helps
14:33karolherbst: but changing the 2 to a 3 has the above additional effect
14:33imirkin_: sounds like you have opts that are fighting
14:34karolherbst: or some opt is just too optimistic
14:35imirkin_: probably want to do a CopyPropagation pass in there too
14:36imirkin_: since ConstantFolding can generate stupid mov's all over
14:37karolherbst: after or before dce?
14:39imirkin_: mmmm... doesn't matter
14:40karolherbst: well then after DCE
14:42karolherbst: mhh, that also somehow hurts more than it does good :/
14:53karolherbst: I think I will investigate tomorrow what each pass might mess up. I bet there might be something where something gets optimized, but a part of the stuff gets used somewhere and we end up with more instructions in the end or some other pass isn't smart enough yet...
15:25imperator: Hi, I'm using Parabola with OpenRC, my games like Minetest and 0ad go at 5 fps. I went to Parabola and they sent me here to ask for help.
15:25imperator: this is the result of dmesg|grep -i nouveau
15:48imperator: So, any of you know how to get more speed without using an older card?
15:48imperator: I forgot to say it's a GTX 980.
15:51imirkin_: imperator: GM20x does not have any acceleration supported
15:51imirkin_: imperator: to get more speed, use the nvidia blob, or get a non-GM20x gpu.
15:52skeggsb: not to mention, it *never* will if you use a crazy-person's kernel like that one ;)
15:52imirkin_: imperator: given that you're on a "deblobbed" kernel that won't work with request_firmware, i don't think that GM20x will *ever* work, since it requires signed firmware from nvidia.
15:52imirkin_: even if nvidia were kind enough to sign our firmware (seems quite unlikely), i doubt it'd be shipped inside of the kernel like the "deblobbed" kernels require
15:53airlied: definitely sounds like a lets ship it and find out :-P
15:57imirkin_: skeggsb: btw did you see the issue on nv40 + agp + g5? i wonder if nv4x + no agp is broken somehow
15:58imirkin_: all that stuff is way out of my comfort zone
16:04imperator: So, I'll have to wait until NVIDIA gets kind.
16:04imperator: Oh well.
16:04imperator: I'll do more productive things like reading my pdfs in this time.
16:05imperator: (Hopefully not 5 years)
17:07airlied: skeggsb: http://paste.fedoraproject.org/315577/53943207/
17:07airlied: do those PDISP things make any sense?
17:10imirkin: airlied: iirc ben fixed one class of those where the cursor was being enabled without the crtc being on
17:11airlied: this is the retina mbp, it gets lost sometimes in bringing up the panel
17:11airlied: works okay if I reboot into OSx and back
17:14skeggsb: it's not the cursor bug that got fixed
17:16skeggsb: it's a core/base channel inconsistency that's triggering that one.. i thought those were all fixed these days though, haven't seen it for a while
17:16airlied: this laptop gets very flaky when I jump between osx/linux and intel/nvidia
17:17imirkin: skeggsb: someone's getting something like that on boot
17:17imirkin: skeggsb: i have no idea how you read those reports...
17:17imirkin: skeggsb: https://bugs.freedesktop.org/show_bug.cgi?id=93834
17:17imirkin: er actually i guess that one was a bit different - an actual crash in the evo stuff
17:18airlied: skeggsb: I'm not sure if it was the edp panel or external dp monitor that pissed it off
17:19skeggsb: airlied: whatever is attached to head 0
17:19skeggsb: imirkin: yeah, that's not the same :)
17:19airlied: okay probably the edp then
17:19glennk: isn't edp muxed on those machines?
17:20skeggsb: it's somewhat weird that rebooting after osx works fine
17:20airlied: glennk: yes, but the mux should be pointing in the right direction
17:20skeggsb: i'm not really sure what state we'd shove through evo differently
17:33mwk: mmh fun
17:33mwk: lots of G80's insides are directly visible through the MP debugging area
17:34mwk: scheduler data, for one
17:36mwk: "don't execute any instruction requiring $r2 or $c1 contents on this warp until some execution unit reports instruction #2 as finished"
17:38mwk: I was hoping to find the place where s lines are marked as locked, but no such luck...
17:43mwk: maybe if I look at the other RAMs...
18:09imirkin: skeggsb: glad you switched to github?
18:15imirkin: mwk: what does this mean: "000000f8: 0433dc43 190e0000 set $p1 0x1 eq u32 $r3 $r1 $c"
18:15imirkin: aka "ISETP.EQ.U32.X.AND P1, PT, R3, R1, PT"
18:15mwk: it's the second part of a 64-bit compare
18:16imirkin: how does the .X play into it?
18:16mwk: I don't remember, but
18:16mwk: IIRC the $c should be set by a previous sub instruction
18:16imirkin: it is
18:17mwk: then you do the set on the high parts
18:17mwk: if they're different, it behaves like a normal set
18:17mwk: but if they're equal, it uses the sub's $c output to determine the comparison result
18:17imirkin: i see
18:17imirkin: so it's like SET_AND
18:17imirkin: er, more like SET_AND_NOT :)
18:17mwk: not really
18:18mwk: it's... well, it's just an arbitrary-precision compare
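The general multi-word compare technique mwk is describing can be sketched for an unsigned 64-bit "less than" built from 32-bit halves. This illustrates the idea only: it does not claim to match the exact semantics or flag polarity of the hardware's `.X` modifier, and the helper names are made up.

```python
MASK32 = 0xFFFFFFFF


def sub_with_borrow(a, b):
    # 32-bit subtract; the flag records whether the low parts borrowed (a < b).
    return (a - b) & MASK32, a < b


def less_than_u64(a, b):
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    # Step 1: subtract the low halves, keep only the flag.
    _, borrow = sub_with_borrow(a_lo, b_lo)
    # Step 2: compare the high halves. If they differ, this is a normal
    # compare; if they are equal, the flag from step 1 decides.
    if a_hi != b_hi:
        return a_hi < b_hi
    return borrow


print(less_than_u64(1 << 32, (1 << 32) + 1))
```

The same two-step shape extends to wider integers by chaining the flag through more words, which is why mwk calls it an arbitrary-precision compare.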
19:12lanteau: imirkin: did skeggsb ever weigh in on my lovely G5 + AGP + NV40 issue?
19:22lanteau: a debian-powerpc mailing list member asked me to try nouveau.duallink=0, just tried that, did not help
19:26imirkin: your GPU is hanging... this isn't about dual-link :)
19:26imirkin: and you're getting a *weird* protection fault when doing a blit
19:27imirkin: one thing you could try, which might be a disaster, is actually *enabling* agp
19:27imirkin: afaik it's disabled on ppc by default due to ... issues
19:28imirkin: but you could boot with nouveau.config=NvAGP=4 (for 4x agp) on a 4.3+ kernel, or nouveau.agpmode=4 on a pre-4.3 kernel
19:28imirkin: i *highly* doubt this will fix anything
19:28imirkin: but if you're desperate, it's something to try
19:28lanteau: he thought it was weird that my Xorg.log showed [ 31.182] (II) NOUVEAU(0): Output DVI-I-1 connected
19:28lanteau: [ 31.182] (II) NOUVEAU(0): Output DVI-I-2 connected
19:28lanteau: when I only have one monitor connected
19:29lanteau: I know this is the GeForce 6800 Ultra DDL...which I believe the DDL was a Power Mac specific card at the time due to having dual dual-link DVI ports, idk if that would affect anything
19:29imirkin: hmmmm... it thinks both outputs are hooked up to the same monitor?
19:30lanteau: anyway, disabling duallink, I still receive the message of both Output DVI-I-1 connected and Output DVI-I-2 connected
19:30imirkin: that's probably not helping matters
19:30imirkin: yeah, duallink is about max pixel clock allowed
19:31lanteau: This machine is kind of a unicorn, hence why I think getting nouveau working would be cool. But it doesn't seem to want to play along lol
19:43lanteau: imirkin: just tried booting with nouveau.config=NvAGP=4 to no avail...so yep I'm desperate
19:43imirkin: did you end up trying the PCIe variant?
19:44lanteau: Not yet, I can do that. That should be a 6600GT PCIe, let me switch that one in and get the same 4.4.0 kernel on it
19:45imirkin: i'm fairly sure i've heard of people with pcie nv40's getting it to work
19:45imirkin: my nv34 agp works ok too
21:04mwk: launching a compute warp and setting its "warp type" to pixel shader apparently doesn't result in a working pixel shader :(
21:04mwk: oh well, worth a shot
21:23Javantea: My backlight doesn't work, is this a nouveau issue or something else?
21:27Javantea: Laptop: System 76 Oryx Pro, Card: GM204M GTX 970M.
21:28skeggsb: it's entirely possible that something has changed there we haven't seen yet, i've not dealt with a mobile GM20x yet
21:30skeggsb: Javantea: if you could file a bug, and attach a vbios image (while running nouveau, /sys/kernel/debug/dri/0/vbios.rom should have it), we can start to investigate
21:30Javantea: great, will do
21:31skeggsb: bonus points for trying the nvidia binary driver, and seeing if it works there - even more points for grabbing a mmiotrace of the nvidia binary driver
21:32Javantea: skeggsb: I tested the nvidia driver as it came with the System 76, backlight worked fine.
21:32lanteau: imirkin: so interestingly on the PCIe 6600GT G5...Xorg starts, lightdm *kind of* appears, half the screen is there, half the screen is black, but the mouse works and everything
21:34lanteau: the nouveau framebuffer console doesn't seem to work, the screen flickers during the boot and I lose the text console until lightdm appears
21:37skeggsb: Javantea: ok cool, then with the right info, we can probably make it work too :)
23:09Jayhost: pstate frequency change on maxwell pretty cool. Trying to see if cooler room = no gpu lockout.
23:19mwk: imirkin: more useless stuff; found a hidden switch that selects whether .sat on a NaN is a NaN or 0
23:19mwk: hidden, as in you have to poke MMIO, there's no method
23:20mwk: and it's rather buggy...
23:27karolherbst: imirkin: seems like the extra instructions come from mul+add => mul+mov+mad conversions where the source mul is still used elsewhere
23:27karolherbst: so instead of doing mul,add we end up doing mul,mov,mad :/
23:32karolherbst: though the mov only comes from the add having an immediate value
23:32karolherbst: but still
23:32karolherbst: but we want that mul+add => mad conversion, but only if we know that the mul will disappear
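The "only if we know that the mul will disappear" condition amounts to a use-count check before fusing. Below is a toy sketch under stated assumptions: instructions are `(op, dst, srcs)` tuples in SSA form (each `dst` written once), and `fuse_mul_add` is a made-up name, not a nouveau function.

```python
from collections import Counter


def fuse_mul_add(insns):
    # Count how often each value is read, and remember who produces it.
    uses = Counter(s for _, _, srcs in insns for s in srcs)
    producers = {dst: (op, srcs) for op, dst, srcs in insns}
    fused_away = set()
    out = []
    for op, dst, srcs in insns:
        if op == "add":
            for k, s in enumerate(srcs):
                prod = producers.get(s)
                # Fuse only when the mul result has exactly one use (this add),
                # so the mul really disappears afterwards.
                if prod and prod[0] == "mul" and uses[s] == 1:
                    out.append(("mad", dst, (*prod[1], srcs[1 - k])))
                    fused_away.add(s)
                    break
            else:
                out.append((op, dst, srcs))
        else:
            out.append((op, dst, srcs))
    return [i for i in out if i[1] not in fused_away]


single = [("mul", "t", ("a", "b")), ("add", "r", ("t", "c"))]
shared = single + [("add", "s", ("t", "d"))]
print(fuse_mul_add(single))
print(fuse_mul_add(shared))
```

In the `shared` case the mul feeds two adds, so nothing is fused and the instruction count does not grow; in the `single` case the mul and add collapse into one mad.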
23:51karolherbst: imirkin_: I found it: LocalCSE splits mad(a, i0, i1) into add(mul(a, i0), i1) :/
23:52karolherbst: but that is only part one of the problem