00:52 mooch3: is it possible for you to use any of the mesa hardware drivers on windows nt 4? i need to know because that's the only way i can currently test the nv4 with mesa drivers
00:53 airlied: no
00:55 mooch3: on any windows at all?
00:55 airlied: nope
00:55 mooch3: ugh, that's dumb
00:55 airlied: why?
00:55 mooch3: because i can't use current linux
00:55 mooch3: in the emulator
00:55 airlied: fix that might be a better plan
00:55 mooch3: it's too complicated for me to fix
00:56 mooch3: and i'm kinda banned on the original maintainer's forum for making fun of a patch for being extremely bare bones
00:56 airlied: the mesa drivers are only a small userspace piece of code
00:56 mooch3: so bare bones that i don't see how it works at all
00:56 airlied: the kernel drivers are the ones that drive the hw
00:56 mooch3: yeah, but what if you ported that to windows?
00:57 airlied: you'd have to write a whole new windows kernel driver
00:57 airlied: it wouldn't really be porting
00:57 airlied: more like writing from scratch with guidance
00:57 mooch3: why can't you just port the existing code?
00:58 airlied: because the kernel APIs/runtime environments are very different
00:58 mooch3: well, shyit
00:58 mooch3: *shit
00:58 mooch3: i can't get the nt4 drivers to work because of their weird pfifo usage
00:59 mooch3: and i can't install later drivers because they all require sp4
00:59 mooch3: and sp4 and up doesn't let you login
00:59 mooch3: afaik
01:27 mwk: alright, having a __falcon_xdst builtin that takes care of $xdbase, $xtargets, ... on its own turned out to be a real bad idea
01:28 mwk: it sounds good until you think of the interaction with interrupt handling
01:30 mwk: I could just ignore the issue, but that's extremely poor form
01:31 mwk: I could make interrupt handlers save/restore $xdbase etc. if they call any functions, but that's completely unnecessary most of the time
01:32 mwk: I could make them callee-saved, but that adds mostly completely unnecessary code to any function involving a transfer...
01:33 mwk: sounds like it'd be better to just make the user deal with it
01:35 mwk: and of course the situation is even worse for transfers involving the crypt unit, as you can't save/restore the crypto transfer control register
01:37 mwk: well, shit. I really wanted the compiler to take care of that mess.
02:48 mooch: where in envytools are the docs about nv40 shaders?
07:58 mwk: mooch2: nobody ever got around to document the nv40 shaders in envytools, sorry
08:24 mac-: hey guys
08:24 mac-: I have nvs280 PCI card
08:24 mac-: because need only 2D and xvideo
08:24 mac-: acceleration
08:25 mac-: could you tell me if there is any power management in nouveau driver? because the chip becomes extremely hot after a while
08:42 darlac: RSpliet: I enabled ddr2 reclocking on nv96 and it works. Unigine Heaven score without memory reclocking -> 86, with memory reclocking -> 138
08:43 darlac: medium settings
08:44 mac-: https://nouveau.freedesktop.org/wiki/KernelModuleParameters/
08:44 mac-: the parameter runpm
08:44 mac-: is 1 or 0 by default ?
08:53 RSpliet: darlac: I suspected it doesn't take long, but never had the opportunity to test
08:53 RSpliet: patches welcome, but... high doesn't work?
08:57 darlac: it should work
08:58 darlac: I will check
09:15 darlac: RSpliet: 80 on high
09:23 darlac: [ 3017.410667] nouveau 0000:02:00.0: heaven_x64[5853]: nv50cal_space: -16
09:57 mwk: mooch2: you're the one who did the Falcon instruction timing work, right?
09:58 mwk: may I ask how you measured it?
09:58 mwk: I'm particularly interested in iord, 1 cycle sounds impossible
09:58 mwk: errr I meant mupuf :)
10:01 mwk: I can't figure out how to set up the measurement neatly
10:02 mwk: all the obvious ways of getting timing information involve the io space
10:03 mwk: which will obviously interact with the measurement for iord/iowr/...
10:03 mwk: the other obvious way is to export timing markers from Falcon to PCOUNTER, but again all the obvious ways involve doing io
10:04 mwk: I wonder if I could make use of some signal exported by the Falcon processor itself
10:05 mwk: there might be a "waiting for io space" signal, that'd be nice
11:14 mupuf: mwk: I stored the current ptimer time, executed the same command something like a 100 times and then read back the time and print
11:15 mupuf: so, it is not good because it does not take pipelining in consideration
11:15 mupuf: iord may have taken one cycle in average because of internal caches
11:28 mwk: io shouldn't ever be cached, though
11:28 mwk: I'm suspecting a different problem, you probably didn't take ILP into account
11:29 mwk: as in, iord "completes" but the result is not present in the register until some other operation actually needs to use it
11:29 mwk: but... yeah, that's kind of crude measurement :)
11:30 * mwk would at the very least use one of the timers actually tied to the Falcon clock
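The measurement concern raised above (timing 100 back-to-back instructions and dividing) can be illustrated with a toy pipeline model. This is only a sketch: the one-issue-per-cycle rule and the latency of 8 cycles are made-up assumptions, not measured Falcon behaviour.

```python
def total_cycles(n_ops, latency, dependent):
    """Cycles needed to run n_ops identical instructions on a toy pipeline.

    Assumptions (hypothetical): one instruction can issue per cycle, and a
    result becomes available `latency` cycles after its instruction issues.
    """
    if dependent:
        # Each op consumes the previous result, so the latencies add up.
        return n_ops * latency
    # Independent ops overlap: one issue per cycle, then the last op drains.
    return n_ops + latency - 1


N = 100      # same repeat count as in the measurement described above
LATENCY = 8  # made-up value, purely for illustration

independent_avg = total_cycles(N, LATENCY, dependent=False) / N
dependent_avg = total_cycles(N, LATENCY, dependent=True) / N
print(independent_avg, dependent_avg)
```

Under these assumptions the independent loop averages close to one cycle per instruction even though each result takes 8 cycles to arrive, which is one way a "1 cycle" iord figure could show up without any caching.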
12:05 mwk: eh, Falcon really seems to be designed for asm syntax with the destination on the right
12:06 mwk: if I just gave in to that, I could simplify so many patterns in llvm...
12:06 mwk: all the register fields would be properly ordered
12:23 Calinou: Falcon… CLOCK!
12:23 * Calinou runs
12:26 mupuf: Calinou: /me does not get the joke
15:58 karolherbst: does anybody know a better solution for this? https://github.com/karolherbst/mesa/commit/9246a9b043128616b7264c14a817834108a0e415
15:59 karolherbst: I could check for isChainedCommutationLegal ...
15:59 karolherbst: because that's what dual issueing is somewhat about anyway
15:59 imirkin_: mooch2: the only "docs" for nv40 shader ISA are in mesa... look for nvfx_*
15:59 karolherbst: we can't dual issue instructions you can't swap anyway
15:59 imirkin_: mooch2: i started writing a disasm for it, but didn't get far.
16:00 imirkin_: karolherbst: first off, that's wrong
16:00 imirkin_: karolherbst: getUniqueInsn is def not what you want
16:01 imirkin_: karolherbst: you should look at the registers that a defines
16:01 imirkin_: karolherbst: and look at whether they overlap
16:02 karolherbst: ohh I meant if I could use isCommutationLegal
16:02 imirkin_: karolherbst: i don't remember what that does
16:03 imirkin_: mac-: i also have a NVS 280 PCIe card, and it also heats up quite a bit. i don't think nouveau has any pm support for those. but i'm not sure that there's any pm support to be had on them in the first place.
16:03 karolherbst: imirkin_: it checks insnCheckCommutationDef*
16:03 imirkin_: mac-: runpm is for optimus gpu's that can turn themselves off via platform (e.g. acpi) helpers
16:04 karolherbst: imirkin_: basically it checks srcs and defs via interfers in a very smart way
16:04 karolherbst: well
16:04 karolherbst: brute force way, but that doesnt matter
16:04 imirkin_: that sounds right.
16:04 karolherbst: k, I use isCommutationLegal in my Pass anyway, so it makes sense to use it there too
16:08 karolherbst: mhh
16:08 karolherbst: using isCommutationLegal now makes dual issueing worse :/
16:10 karolherbst: ohh I know
16:10 imirkin_: argument order matters? :)
16:10 karolherbst: mul $r1 $r2 $r3 + mul $r2 $r2 $r3 // this should be possible to dual issue
16:10 karolherbst: but we can't swap them
16:10 imirkin_: right
16:11 karolherbst: so I only do a part of what isCommutationLegal does
16:11 imirkin_: right. only half :)
16:11 karolherbst: insnCheckCommutationDefDef and insnCheckCommutationDefSrc(a,b) should be enough
16:11 karolherbst: ...
16:11 karolherbst: or b,a? :D
16:12 imirkin_: i don't remember what those functions do
16:12 imirkin_: rtfs :p
16:17 karolherbst: nope, I only have to use check for insnCheckCommutationDefDef
16:17 karolherbst: Src doesn't matter here
16:18 imirkin_: well, src of insn2 needs to be compared to def of insn1
16:18 karolherbst: ohh
16:18 karolherbst: right
16:18 imirkin_: whatever function does that is the one you want
16:18 karolherbst: I was too fast
16:18 imirkin_: i don't think you want defdef
16:18 imirkin_: but i haven't read the code
16:18 karolherbst: mhh defdef checks for a->getDef(d)->interfers(b->getDef(c))
16:19 karolherbst: which basically means if false, that a produces a def which is a src of b
16:19 karolherbst: because otherwise b couldn't replace any def of a
16:20 karolherbst: so I ned defa Srcb
16:20 karolherbst: *need
16:20 karolherbst: like mul $r1 $r2 $r3 + mul $r1 $r4 $r5 is something which should never exist anyway
16:21 imirkin_: rarely.
16:21 karolherbst: well in postRA?
16:21 imirkin_: it could exist for silly reasons.
16:21 karolherbst: mhhh
16:21 karolherbst: right
16:21 imirkin_: (i'll have to think about what those reasons might be)
16:21 karolherbst: well to be safe I could also check for that
16:22 imirkin_: but i can't immediately prove to myself that they wouldn't exist
16:22 imirkin_: definitely shouldn't :)
16:22 karolherbst: is there something like Instruction::ResultDependsOn(Instruction*) ?
16:22 karolherbst: ..
16:22 karolherbst: Source
16:22 imirkin_: rtfs? :p
16:22 karolherbst: :D
16:23 karolherbst: nope, doesn't exist
16:43 karolherbst: https://github.com/karolherbst/mesa/commit/00c7a626be40f5da47d8f25ced2f3001a5f58f95
16:43 imirkin_: why do you check DefDef?
16:44 imirkin_: you need it for commutation
16:44 imirkin_: but not for dual-issue
16:44 karolherbst: dependency of order
16:44 karolherbst: well
16:44 karolherbst: for dual issueing we are screwed here anyway
16:44 karolherbst: " mul $r1 $r2 $r3 + mul $r1 $r4 $r5 " <== I really don't think we can dual issue that
16:45 imirkin_: i think we can.
16:45 karolherbst: if both write into the same reg?
16:45 imirkin_: i assume results are stored "in order"
16:45 imirkin_: it'd be architecturally weird if they weren't
16:45 karolherbst: mhh
16:45 karolherbst: true to that
16:45 imirkin_: at least based on the fact that even when we tell it to dual issue when we shouldn't, it still deals with it just fine
16:45 karolherbst: but usually DCE should have eliminated the former instruction by the way...
16:45 imirkin_: yes.
16:46 karolherbst: anyway, it sounds wrong to dual issue that
16:46 karolherbst: well
16:46 imirkin_: does this canDualIssue run on the full instruction stream
16:46 karolherbst: some amd guy told me, you can dual issue "mov $r1 $r2 + mov $r2 $r1" to swap the regs
16:46 karolherbst: on amd hardware
16:46 imirkin_: or does it run one BB at a time?
16:46 karolherbst: BB at a time
16:46 imirkin_: k
16:47 imirkin_: i think that was on VLIW architecture
16:47 imirkin_: which is different than dual-issue
16:47 karolherbst: well here is my pass: https://github.com/karolherbst/mesa/commit/cb01f5c18c212f4e1b466ffd374645e2a216cdc4
16:47 karolherbst: I hope there are enough comments now
16:47 karolherbst: imirkin_: yeah, I know, it is something else
16:47 karolherbst: but he called it that way
16:47 karolherbst: ahh glennk was it I think :D
16:48 karolherbst: and yeah, it was vliw
16:49 imirkin_: this is like a N^4 th algorithm
16:49 imirkin_: i don't like it :p
16:49 karolherbst: well
16:49 karolherbst: in the worst case we can't dual issue anyway
16:50 karolherbst: so perf is bad
16:50 karolherbst: well
16:50 karolherbst: not that bad, but bad
16:50 karolherbst: in pixmark piano we dual issue around 40% stock
16:50 imirkin_: there's already some case where airlied said CTS generates a test that makes nouveau compiler run forever.
16:50 karolherbst: uhhh
16:50 imirkin_: yeah, i don't care about that. i want to make sure nouveau _works_ all the time
16:50 imirkin_: performance is nice, but shouldn't come in between that concept of working
16:50 karolherbst: well
16:51 karolherbst: but it is N^2 as far as I see it
16:51 imirkin_: outer loop = N
16:51 imirkin_: yes?
16:51 karolherbst: for each instruction we check each instruction at most
16:51 imirkin_: then for the case where next = first instruction
16:51 imirkin_: and check = last instruction
16:52 imirkin_: you do the chained thing, which is another N
16:52 karolherbst: not really
16:52 karolherbst: well
16:52 karolherbst: mhh
16:52 karolherbst: it kind of is
16:52 karolherbst: right
16:52 imirkin_: and you do that check for every pair of (next, check) instructions
16:52 imirkin_: so that's N^3
16:53 imirkin_: ok. i guess i went overboard with N^4th, my bad :)
16:53 karolherbst: yeah, but I can't think of a smarter way currently
16:53 mupuf: imirkin_: how about limiting to 10 instructions before and after?
16:53 karolherbst: and it really improves performance where we have an issue slot bottleneck
16:53 imirkin_: that's an example of something that seems quite reasonable
16:54 mupuf: and actually, you may avoid doing the backward search
16:54 karolherbst: mupuf: sure?
16:54 imirkin_: yes, i think that the chained thing can be done much smarter
16:54 mupuf: because it should be commutative, right
16:54 karolherbst: mhh
16:54 imirkin_: given that you do it over the same instructions
16:54 karolherbst: this is PostRA
16:54 imirkin_: you could cache the things that the intermediate instructions overwrite
16:54 imirkin_: and then you'd know in O(1) whether you can move it up or not
16:54 mupuf: if A + B can be dual issued, then so should B + A, right?
16:55 karolherbst: mupuf: nope
16:55 imirkin_: mupuf: well, he doesn't do it for literally every pair
16:55 imirkin_: only for half the pairs
16:55 imirkin_: which is still N^2 :)
16:55 mupuf: imirkin_: ok :D
16:55 karolherbst: "mul $r1 $r1 $r1 + mul $r2 $r1 $r1"
16:55 karolherbst: mupuf: a+b => no dual issue, b+a => dual issue
16:56 mupuf: yeah, but if the second mul was later in the program, you cannot put it first
16:56 karolherbst: right
16:56 mupuf: so, non-issue here
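The asymmetry discussed above can be sketched with register sets. This is a minimal Python model, not the nv50_ir code: `Insn`, `can_dual_issue` and `can_commute` are hypothetical names, and the only rule modelled for dual issue is "the second instruction must not read what the first writes". Whether a def/def overlap also matters for dual issue is left open in the discussion, so it is only checked for commutation here.

```python
from collections import namedtuple

# Hypothetical stand-in for an instruction: opcode plus written/read registers.
Insn = namedtuple("Insn", ["op", "defs", "srcs"])


def insn(op, defs, srcs):
    return Insn(op, frozenset(defs), frozenset(srcs))


def can_dual_issue(first, second):
    """Assumed rule: `second` may not read a register that `first` writes."""
    return not (first.defs & second.srcs)


def can_commute(a, b):
    """Swapping needs all three checks: def/src both ways, plus def/def."""
    return not (a.defs & b.srcs or b.defs & a.srcs or a.defs & b.defs)


# mul $r1 $r2 $r3 + mul $r2 $r2 $r3: pairable in this order, but not swappable.
m1 = insn("mul", {"r1"}, {"r2", "r3"})
m2 = insn("mul", {"r2"}, {"r2", "r3"})

# mul $r1 $r1 $r1 + mul $r2 $r1 $r1: a+b is not pairable, b+a is.
a = insn("mul", {"r1"}, {"r1"})
b = insn("mul", {"r2"}, {"r1"})

print(can_dual_issue(m1, m2), can_commute(m1, m2))
print(can_dual_issue(a, b), can_dual_issue(b, a))
```

The dual-issue test is only "half" of the commutation test, which is why reusing isCommutationLegal as-is rejects pairs that could still be issued together.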
16:57 karolherbst: but I don't think that b+a can be dual issued whenever you can dual issue a+b
16:58 karolherbst: ahh
16:58 karolherbst: now I know why I added that chained thing
16:58 karolherbst: "bool isCommutationLegal(const Instruction *) const; // must be adjacent !"
16:58 imirkin_: you need it. you just end up repeating a TON of work every time you call it.
16:58 karolherbst: ahh okay
16:58 karolherbst: ohhh
16:58 karolherbst: right
16:58 imirkin_: so you could reimplement it in a much more efficient way
16:58 karolherbst: I tried to be smart about that
16:58 karolherbst: but I somehow failed
16:58 karolherbst: now I know what you mean
16:59 imirkin_: keep a set of defs written by all the instructions
16:59 karolherbst: I already checked the previous pairs whenever I call it
16:59 imirkin_: in between next and check
16:59 imirkin_: exactly.
16:59 imirkin_: well, it's not about pairs
16:59 imirkin_: coz you can't reuse THAT work
16:59 imirkin_: but you could reuse the work done as part of checking those pairs
17:01 karolherbst: well we have something like that: i -> next -> a -> b -> c... when I move the check pointer from b to c I already checked that next-a and a-b can be swapped, so I can skip those checks and just check for b-c
17:02 karolherbst: aka check->prev and check
17:03 imirkin_: no
17:03 imirkin_: because you're now commuting a different instruction through the stream
17:03 karolherbst: ohh right...
17:03 imirkin_: but
17:03 imirkin_: the instructions you're commuting it "through" haven't changed
17:03 imirkin_: they still have the same defs
17:03 imirkin_: and the same srcs
17:04 imirkin_: so you could build up a set of all those
17:04 imirkin_: and check for regs in those sets
17:04 karolherbst: ohh
17:04 karolherbst: so I just collect some information and check against that pile of defs and srcs
17:04 imirkin_: and then just increment those sets every time you do check = check->next
17:04 imirkin_: exactly.
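The idea of accumulating defs and srcs while the check pointer advances can be sketched as follows. This is an illustrative Python model under stated assumptions, not the actual pass: `Insn` and `find_hoistable` are hypothetical names, the window of 10 follows the limit suggested earlier in the discussion, and checks for fixed, join or flow instructions are left out.

```python
from collections import namedtuple

# Hypothetical stand-in for an instruction: written and read registers.
Insn = namedtuple("Insn", ["name", "defs", "srcs"])


def find_hoistable(insns, i, window=10):
    """Index of the first later instruction that could be moved up to sit
    right after insns[i] and be paired with it, or None.

    Instead of re-checking every intermediate pair for each candidate, keep
    the union of registers written and read by the instructions already
    walked past, and test the candidate against those two sets.
    """
    first = insns[i]
    crossed_defs, crossed_srcs = set(), set()
    for j in range(i + 1, min(len(insns), i + 1 + window)):
        cand = insns[j]
        movable = not (cand.srcs & crossed_defs      # would read a stale value
                       or cand.defs & crossed_srcs   # would clobber an input
                       or cand.defs & crossed_defs)  # would reorder two writes
        if movable and not (first.defs & cand.srcs):
            return j
        # Grow the sets by one instruction, i.e. check = check->next.
        crossed_defs |= cand.defs
        crossed_srcs |= cand.srcs
    return None


prog = [
    Insn("mul r1 r2 r3", {"r1"}, {"r2", "r3"}),
    Insn("add r4 r1 r2", {"r4"}, {"r1", "r2"}),
    Insn("mul r5 r4 r4", {"r5"}, {"r4"}),
    Insn("mov r6 r7",    {"r6"}, {"r7"}),
]
print(find_hoistable(prog, 0))
```

Each candidate now costs a few set intersections rather than a walk over the whole chain, so the per-block cost drops from roughly N^3 to N^2, or to N times the window size when the search is bounded.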
17:06 karolherbst: well we will need a target->hasDualIssueing() anyway
17:06 karolherbst: or should I check for that differently?
17:06 imirkin_: that too.
17:06 imirkin_: mmmmm
17:06 imirkin_: well the thing is that nvc0 also has dual issue
17:07 imirkin_: it's just done implicitly
17:07 karolherbst: right
17:07 imirkin_: someone should probably play around with it
17:07 imirkin_: except there's no reclocking
17:07 karolherbst: but we don't implement it yet anyway
17:07 imirkin_: and so it's ... not very rewarding work
17:07 karolherbst: right
17:07 karolherbst: because if we really need perf, we could just upclock the engines and dual issueing is unimportant because we will be bottlenecked through memory
17:08 karolherbst: but the pass doesn't need to run on those chipsets where we don't know how to dual issue anyway
17:15 karolherbst: I like how the goal is to have inst_issued1 == 0 :D
17:25 karolherbst: mhh
17:25 karolherbst: is the blurry effect in unigine heaven for you also sometimes "broken"?
17:25 karolherbst: as in there are some edges which look odd?
17:25 imirkin_: well, up until recently, i only had like 1 fps on heaven
17:26 imirkin_: so ... it was hard to tell
17:26 karolherbst: uhh
17:26 imirkin_: :)
17:26 karolherbst: I guess that was also not with 8xmsaa and ultra quality? :D
17:26 imirkin_: the GK208 does a lot better
17:26 imirkin_: no... 4x msaa iirc
17:26 karolherbst: mhh
17:26 imirkin_: ultra quality, but 640x480
17:26 karolherbst: I think either I or somebody else broke it
17:26 karolherbst: :D
17:26 karolherbst: well
17:29 karolherbst: yeah, I think I broke it
17:31 karolherbst: https://i.imgur.com/lkVw6b5.jpg
17:32 imirkin_: i think you did.
17:32 karolherbst: I guess I swap something I am not allowed to swap...
17:32 imirkin_: seems eminently likely
17:32 karolherbst: for pixmark_piano I had to add a check against !check->join, because it also messed something up
17:32 imirkin_: like ... are you swapping something with a texbar?
17:32 karolherbst: mhh
17:33 imirkin_: hmmmm... join should be at the very end of a BB
17:33 karolherbst: no idea, but I can check
17:33 imirkin_: and yeah, you can't move an instruction with a join on it
17:33 imirkin_: it's effectively control flow
17:40 karolherbst: well, I don't move a texbar
17:40 imirkin_: could you move an op across a texbar?
17:41 imirkin_: you should treat texbars as fixed
17:41 karolherbst: very unlikely, because I do: bb->permuteAdjacent(check->prev, check);
17:41 imirkin_: in fact... you can change the place that adds them to mark them as fixed
17:42 karolherbst: well, I don't move joins and no texbar around :/ anything else, which might be bad?
17:42 imirkin_: you're not checking flags somehow?
17:43 karolherbst: not directly at least
17:43 karolherbst: I check for join and fixed
17:43 karolherbst: and for asFlow()
17:43 imirkin_: i mean like carry flags, etc
17:43 karolherbst: nope
17:43 imirkin_: those should be regular sources/defs though
17:44 imirkin_: Instruction *bar = new_Instruction(func, OP_TEXBAR, TYPE_NONE);
17:44 imirkin_: bar->fixed = 1;
17:44 imirkin_: hm, we already make texbar's fixed
17:44 karolherbst: it will be something silly, I am sure of it
17:44 imirkin_: always is.
17:44 karolherbst: anyway, I have like 11k lines printed out ... maybe I find it in there
17:44 imirkin_: good luck :)
17:45 karolherbst: what does dfdy and dfdx do by the way?
17:45 imirkin_: derivatives
17:45 imirkin_: pixel shaders execute in quads
17:45 imirkin_: dfdx does a discrete derivative of the value along the x axis, dfdy does it along the y axis
17:46 karolherbst: mhh I am more confused that you only put one src in those
17:46 imirkin_: right
17:46 karolherbst: ahh
17:46 imirkin_: it looks at the value of that register in different lanes
17:47 imirkin_: it's not really a derivative
17:47 karolherbst: tex 2D $r9 $s0 f32 $r0d $r0d
17:47 imirkin_: it's actually a difference
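The "difference across lanes" description can be sketched for a single 2x2 quad. This is only an illustration: the lane layout, the sign convention, and the choice to compute one difference per row or column are assumptions, not a description of what the hardware does.

```python
def dfdx(quad):
    """Horizontal difference per row of a 2x2 quad.

    `quad` is [[top_left, top_right], [bottom_left, bottom_right]], holding
    the value of one register in each of the four lanes (assumed layout).
    """
    (tl, tr), (bl, br) = quad
    return [[tr - tl, tr - tl], [br - bl, br - bl]]


def dfdy(quad):
    """Vertical difference per column of the same 2x2 quad."""
    (tl, tr), (bl, br) = quad
    return [[bl - tl, br - tr], [bl - tl, br - tr]]


quad = [[1, 4],
        [2, 8]]
print(dfdx(quad))
print(dfdy(quad))
```

This is why the instruction takes a single source: the second operand of the subtraction is the same register as seen by a neighbouring lane.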
17:47 karolherbst: is $s0 changed in any way?
17:47 imirkin_: remember that $r9 and $s0 are ... bullshit
17:47 karolherbst: okay
17:47 imirkin_: it should actually be c[0x9*4]
17:47 imirkin_: just a printing artifact
17:47 karolherbst: okay so I can swap it with "tex 2D $r8 $s0 f32 $r4q $r2d" ?
17:48 imirkin_: ahhhhh
17:48 imirkin_: heh
17:48 imirkin_: that's what's going wrong
17:48 karolherbst: :D
17:48 imirkin_: that's pretty insidious
17:48 imirkin_: so... technically yes.
17:48 imirkin_: BUT
17:48 imirkin_: in practice you might have like
17:48 imirkin_: a = tex()
17:48 imirkin_: b = tex()
17:48 imirkin_: texbar 1
17:48 imirkin_: use(a)
17:48 karolherbst: ohhhh
17:48 imirkin_: texbar 0
17:49 imirkin_: use(b)
17:49 karolherbst: ...
17:49 karolherbst: so for now: don't touch tex
17:49 imirkin_: you can touch tex
17:49 imirkin_: but you can't swap the order of tex's
17:49 karolherbst: mhhh
17:49 imirkin_: i.e. never commute a tex with another tex
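Why the order of tex instructions matters can be shown with a small simulation of the example above. It is a sketch under an assumption stated here: `texbar n` is modelled as "wait until at most n texture fetches are still outstanding", with fetches completing in issue order. The `run` helper and the tuple encoding of the program are hypothetical.

```python
from collections import deque


def run(program):
    """Return the names that are used before their tex result is ready."""
    outstanding = deque()  # tex results still in flight, oldest first
    ready = set()
    used_too_early = []
    for op, arg in program:
        if op == "tex":
            outstanding.append(arg)
        elif op == "texbar":
            # Assumed semantics: block until at most `arg` fetches remain.
            while len(outstanding) > arg:
                ready.add(outstanding.popleft())
        elif op == "use" and arg not in ready:
            used_too_early.append(arg)
    return used_too_early


original = [("tex", "a"), ("tex", "b"), ("texbar", 1), ("use", "a"),
            ("texbar", 0), ("use", "b")]
# Same program with the two tex instructions commuted.
swapped = [("tex", "b"), ("tex", "a"), ("texbar", 1), ("use", "a"),
           ("texbar", 0), ("use", "b")]

print(run(original))
print(run(swapped))
```

In the swapped program `texbar 1` only guarantees that `b` has arrived, so `use(a)` reads a result that may not be there yet. Other instructions can still be moved across a tex, as long as two tex instructions never change their relative order.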
17:51 karolherbst: messy
17:53 karolherbst: yep
17:53 karolherbst: that was it
17:53 karolherbst: I ignored TEX instructions completely now, and it is fixed
17:53 imirkin_: =]
17:54 karolherbst: this will be a bit messy then
17:54 karolherbst: because this changed the chain handling I have currently :/
17:54 imirkin_: you can check in your chained commutation dealie
17:54 karolherbst: if the end is tex and one instruction in the chain is tex, I can't swap anymore
17:54 imirkin_: yes.
17:54 imirkin_: but you can keep going
17:54 imirkin_: other ops can be swapped across a tex
17:55 karolherbst: right
18:00 karolherbst: but heaven feels kind of laggy :/
18:00 karolherbst: or "jumpy"
18:22 karolherbst: huh
18:23 karolherbst: mhh
18:23 karolherbst: nice
18:24 karolherbst: I found a program where swapping two pairs of instructions worsens dual issueing a bit
18:24 imirkin_: makes sense.
18:24 karolherbst: but it shouldn't :D
18:25 imirkin_: how so?
18:25 imirkin_: you're using a heuristic
18:25 imirkin_: every move you make isn't guaranteed to improve things.
18:26 karolherbst: both times: https://gist.github.com/karolherbst/dc47741be09ab2c944dc73bc9feda944
18:27 imirkin_: i would have guessed the second one would be better
18:27 imirkin_: but what do i know
18:27 karolherbst: yeah
18:27 karolherbst: exactly
18:27 karolherbst: but it is worse
18:27 karolherbst: ohhhhh
18:28 karolherbst: maybe the issue slots are uber smart...
18:28 karolherbst: and merge the pinterp and the add into a single thing
18:28 karolherbst: because the result is just -0.500000 of the pinterp
18:28 karolherbst: and that might explain why the author of the code thought dual issueing with overlapping def/src is fine
18:28 karolherbst: and maybe it is... in special cases
18:31 karolherbst: and that would mean we broke a dual issue pair (pinterp+mul // ld +mad) and now (pinterp+ld // add // mad)
18:32 imirkin_: that seems eminently unlikely
18:32 imirkin_: i'm sure it's a much simpler explanation
18:33 karolherbst: like the opclasses are different on the hw?
18:33 imirkin_: no.
18:36 karolherbst: pinterp is sfu and add is arith... maybe ... maybe some classes can be dual issued even when they overlap?
18:36 karolherbst: but it doesn't make sense
18:37 imirkin_: chances are it's nothing to do with those
18:38 imirkin_: but rather the surrounding logic
18:38 karolherbst: ohh
18:38 imirkin_: like the fact that it's in position 7 of the group
18:38 karolherbst: in the emitted binary the dual issueing is gone
18:38 imirkin_: that's a problem :)
18:38 karolherbst: but the emitted binary looks odd anyway
18:38 karolherbst: 42e282e7 228042e0 -> 42e282e7 2042d200
18:39 karolherbst: and 00428047 2202c272 -> 00428207 2202c272
18:39 karolherbst: in the first case a dual issueing is added
18:39 karolherbst: and in the second one one is gone
18:39 karolherbst: ohh wait
18:39 karolherbst: no
18:39 karolherbst: in the first one, it is moved
18:39 karolherbst: to an earlier position, which is good
18:42 karolherbst: ohh I know now
18:42 karolherbst: some unrelated pair was un dual issued
18:46 karolherbst: imirkin_: by the way, envydis can't handle the scheds for odd reasons :/
18:46 karolherbst: but maybe passing -m nvc0 is also wrong
18:46 imirkin_: ??
18:46 imirkin_: -m gf100 -V gk104
18:46 karolherbst: ahhh
18:46 karolherbst: thanks
18:47 karolherbst: imirkin_: is there an easy way to feed envydis with the binary data from NV50_PROG_DEBUG?
18:47 imirkin_: yea
18:47 imirkin_: envydis -m gf100 -V gk104 -w
18:47 imirkin_: and then paste the binary
18:48 karolherbst: nice, awesome
18:48 karolherbst: thanks
18:49 imirkin_: it's almost like i do that all the time :)
18:52 karolherbst: "fma ftz rn f32 $r2 $r1 c0[0x50] $r2" + "mov b32 $r1 c0[0x10]" we should be able to dual issue those, right?
18:53 karolherbst: because after my fix patch, it doesn't anymore
18:53 imirkin_: some fix :p
18:53 imirkin_: in that order, it seems perfectly fine
18:54 karolherbst: ohhh
18:54 karolherbst: now I see it
18:54 karolherbst: previous block ends with a dual issue
18:56 karolherbst: interp mul f32 $r0 a[0x84] $r0 0x0 + add ftz rn f32 $r0 $r0 0xbf000000 // mov b32 $r2 c0[0x0] // fma ftz rn f32 $r2 $r1 c0[0x50] $r2 + mov b32 $r1 c0[0x10]
18:56 karolherbst: becames
18:56 karolherbst: interp mul f32 $r0 a[0x84] $r0 0x0 // mov b32 $r2 c0[0x0] // add ftz rn f32 $r0 $r0 0xbf000000 + fma ftz rn f32 $r2 $r1 c0[0x50] $r2 // mov b32 $r1 c0[0x10]
18:57 karolherbst: *became...
18:57 imirkin_: optimization is hard.
18:57 imirkin_: let's go to the beach instead
18:57 karolherbst: mhh
18:57 karolherbst: I think I know what might happen
18:57 karolherbst: but that's odd
18:57 karolherbst: is there something which happens after the postra opts which still changes the code?
18:58 karolherbst: although...
18:58 karolherbst: I think I know what the issue is
18:58 karolherbst: the interp is dual issued with its previous instruction
18:59 karolherbst: he...
18:59 imirkin_: the issue is you're trying to use local decisions to solve a global optimization problem
18:59 karolherbst: yeah
18:59 karolherbst: and I don't respect the pairs
19:00 hakzsam_: imirkin_, is there a way to expose ARB_compute_shader with the compat profile? (for those NVIDIA GL samples)
19:00 karolherbst: so I hope by creating an endless chain of dual issueable instructions, it will be better in the end
19:01 karolherbst: imirkin_: maybe I should move that pass directly before the emit stage? or is there a better place to do that and to be sure, that nothing changes anymore
19:01 imirkin_: hakzsam_: sure, just futz with extension_table.h
19:01 karolherbst: imirkin_: because then I can do it more optimized
19:01 imirkin_: karolherbst: every pass wants to go last
19:01 hakzsam_: imirkin_, okay
19:01 karolherbst: imirkin_: nah, mine really wants to be last :p
19:01 imirkin_: karolherbst: because it is clearly smarter than all previous passes before it
19:02 hakzsam_: imirkin_, well, maybe we will expose descriptions through the HUD in the future :)
19:02 imirkin_: karolherbst: but in the end, there can be only one...
19:02 karolherbst: yeah well. my pass is a stupid scheduling pass
19:02 karolherbst: somewhat
19:03 imirkin_: https://www.youtube.com/watch?v=_J3VeogFUOs
19:03 imirkin_: wow, i didn't remember it being quite so dumb
19:03 Yoshimo: how does the nvidia driver know which optimizations it has to use for a specific game? Does it check the name of the exe file on windows?
19:04 imirkin_: Yoshimo: they have a 100-1000 person-strong compiler team who writes a pretty solid compiler.
19:04 imirkin_: [probably closer to 100 than 1000]
19:04 hakzsam_: and they have documentation :)
19:04 imirkin_: eh
19:05 imirkin_: that also helps
19:05 hakzsam_: (and more time than us)
19:05 imirkin_: but having full-time developers working on serious compiler optimizations seems like the bigger win
19:05 hakzsam_: I do agree
19:05 karolherbst: Yoshimo: in the end you can hash and replace with hand optimized stuff ;)
19:06 karolherbst: or do hot optimizations and stuff like that
19:06 karolherbst: profile guided opts in gcc also make a big difference
19:06 imirkin_: i've heard that they have manual shader replacement, where they just recognize a shader pattern and plop down a hand-optimized one
19:06 imirkin_: but i have to imagine that's relatively rare
19:06 karolherbst: yeah, that too
19:06 karolherbst: mhh
19:06 Yoshimo: i was just wondering which part of the linux graphic stack would be in charge of game specific improvements
19:06 karolherbst: well
19:06 karolherbst: imirkin_: expect them to do that for every AAA title ;)
19:07 karolherbst: sometimes the windows driver updates say "+30% with that game and this driver" :D
19:07 mooch2: Yoshimo, why would you want that
19:07 imirkin_: otoh, they have full-time developers who work alongside AAA teams to help them optimize their titles for nvidia hw
19:07 mooch2: over opengl conformance
19:07 imirkin_: karolherbst: yeah, but could just be due to a generic compiler improvement that happens to really help some specific game's shaders
19:07 karolherbst: well if you know nothing breaks, you can violate OpenGL specs for games
19:08 mooch2: well, emulators need hard spec conformance
19:08 karolherbst: imirkin_: might be, but that's usually close to game realses
19:08 karolherbst: *releases
19:08 imirkin_: karolherbst: well, perhaps they only looked at making that opt coz of that game :)
19:08 imirkin_: but that doesn't make it a non-general opt
19:08 karolherbst: right
19:08 mooch2: that's why rpcs3's vulkan backend, which started as a carbon copy of the gl backend rose in accuracy simply due to the drivers being less hacky
19:09 karolherbst: :D
19:09 karolherbst: well
19:09 karolherbst: I think nvidia also maps some spec functions to others if they know it is faster on their hardware
19:09 karolherbst: or something like that
19:09 mooch2: oh god
19:09 mooch2: this is why we need mesa on windows
19:09 mooch2: then at least, we can fucking have spec conformance for gl
19:09 imirkin_: microsoft prevents unlicensed people from writing windows drivers
19:10 mooch2: what
19:10 karolherbst: yep
19:10 karolherbst: of course they do :D
19:10 mooch2: don't they have a public ddk tho?
19:10 karolherbst: otherwise secure boot would be like useless?
19:10 karolherbst: :D
19:10 hakzsam_: imirkin_, mmh, ARB_compute_shader seems to be already exposed with the compat profile
19:10 imirkin_: (a) not for WDDM or whatever, and (b) drivers have to be signed
19:10 mooch2: ugh that's right
19:10 mooch2: on win8+ drivers have to be signed
19:11 mooch2: i encountered that problem with an el cheapo capture card
19:11 mooch2: had to override a bunch of stuff to install the drivers
19:11 hakzsam_: imirkin_, at least, glxinfo tells me that
19:11 imirkin_: hakzsam_: hm, so it is.
19:11 karolherbst: imirkin_: anyway, know any place where I could do the opt more efficiently?
19:11 mooch2: i at least want to use mesa drivers on older windows like 9x and nt4
19:11 imirkin_: hakzsam_: fix this: https://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/main/context.h#n343
19:12 karolherbst: mooch2: well, I am sure you can as a software driver
19:12 karolherbst: *software gl
19:12 hakzsam_: imirkin_, right
19:12 imirkin_: hakzsam_: change ctx->API == API_OPENGL_CORE && ctx->Extensions.ARB_compute_shader into _mesa_has_ARB_compute_shader(ctx)
19:12 karolherbst: well
19:12 karolherbst: that will be really hacky with windows
19:12 mooch2: TO TEST MY NV4 EMULATION
19:12 imirkin_: mooch2: use linux.
19:13 karolherbst: because I am sure you have to implement the d3d bits for that thing
19:13 karolherbst: ohh wait
19:13 karolherbst: maybe on those old windows it is actually fine
19:13 imirkin_: i've already pointed you to the exact bmdma setting you need to flip
19:13 mooch2: imirkin_, yeah, but i need a live cd with that already flipped
19:13 imirkin_: so make a livecd with that flipped
19:13 mooch2: how?
19:13 mooch2: also, i'm devving from windows here
19:13 imirkin_: how do you make a livecd without that flipped? :)
19:13 mooch2: i dunno!
19:13 imirkin_: well that's a solvable issue
19:13 imirkin_: then read up about it
19:14 mooch2: well, linux and 64-bit especially are second-class citizens for this emulator
19:14 hakzsam_: karolherbst, I just pushed the patch which adds descriptions of hw events
19:14 mooch2: maybe i should just rewrite this for qemu and shut up :^)
19:14 imirkin_: then stop asking questions about it in a linux-focused chan
19:15 mooch2: sorry, i found pcem easiest to dev for
19:15 imirkin_: i have no problems with that
19:15 hakzsam_: imirkin_, thanks, it works. Do you think we should the patch?
19:15 hakzsam_: +send
19:15 imirkin_: hakzsam_: yes
19:15 hakzsam_: imirkin_, actually, it's more yours than mine :)
19:15 imirkin_: hakzsam_: i don't care :)
19:15 hakzsam_: :)
19:15 imirkin_: you wrote the literal code, tested it, you send it.
19:17 hakzsam_: sent
19:17 hakzsam_: this ComputeBasicGLSL demo now works perfectly :)
19:18 hakzsam_: I will try the other ones
19:22 imirkin_: karolherbst: do you have any real games that use UE4?
19:22 imirkin_: not the UE4 demos
19:23 mac-: imirkin_: thx for the clarification
19:24 mac-: imirkin_: I can see very low performance when sliding up and down web pages on Firefox, is that normal ?
19:24 imirkin_: probably... try running firefox with LIBGL_ALWAYS_SOFTWARE=1 or disabling its GL-based logic
19:25 imirkin_: nv34 exposes GL 1.5, so i doubt firefox is well-tuned towards that
19:25 imirkin_: i'm thinking of making it expose GLES2 though
19:26 mac-: uhm, but with LIBGL_ALWAYS_SOFTWARE=1 I won't have any GPU acceleration in Firefox and everything will be done by CPU ?
19:26 imirkin_: right
19:26 imirkin_: which will work way better :)
19:26 mac-: will try
19:26 imirkin_: those gpu's weren't designed for the desktop-style effects that GL is being used for now
19:27 karolherbst: imirkin_: maybe?
19:27 mac-: the flag can be provided just as a parameter during execution ?
19:27 imirkin_: mac-: only at process start
19:27 imirkin_: mac-: e.g. run LIBGL_ALWAYS_SOFTWARE=1 firefox
19:27 mac-: ok
19:27 imirkin_: (in such a way that a new firefox process actually starts... if you have one running, it'll end up doing nothing)
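The point above is that an environment variable such as LIBGL_ALWAYS_SOFTWARE only takes effect if it is in the environment when the process starts. A minimal Python sketch of that, assuming a small Python child process as a stand-in for firefox so the example runs anywhere; the helper name `run_with_software_gl` is hypothetical and not part of any tool mentioned in the log:

```python
import os
import subprocess
import sys

def run_with_software_gl(argv):
    """Start argv with LIBGL_ALWAYS_SOFTWARE=1 in the child's environment.

    The variable is set for the new process only; the parent's
    environment is copied, not modified.
    """
    env = dict(os.environ)
    env["LIBGL_ALWAYS_SOFTWARE"] = "1"
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)

# Stand-in child (assumption: replaces firefox for illustration).
# It simply prints the value of the variable it sees at startup.
CHILD = [sys.executable, "-c",
         "import os; print(os.environ.get('LIBGL_ALWAYS_SOFTWARE'))"]

result = run_with_software_gl(CHILD)
child_value = result.stdout.strip()
print(child_value)
```

With a real browser the argument list would be something like `["firefox"]`, and, as noted above, it only helps if a new process is actually started.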
19:28 mac-: if I loaded the framebuffer (nouveau in this case, via modprobe), how can I turn it off to get back to the real text console ?
19:28 hakzsam: imirkin_, ComputeWaterSimulation just crashed my compute... and it uses compute shaders
19:29 mac-: imirkin_: which GPU is supported enough and provides hw accel for such things like ff ?
19:29 imirkin_: mac-: you probably don't want that... you still want to use the X-based acceleration
19:29 imirkin_: mac-: but there's a note on how to remove the nouveau kernel module here: https://nouveau.freedesktop.org/wiki/KernelModeSetting/
19:29 imirkin_: mac-: any G80 or later gpu should be fine
19:30 karolherbst: imirkin_: doesn't seem that way
19:30 imirkin_: mac-: but note you'll get way better support if you go with an amd gpu
19:30 mac-: you think?
19:31 imirkin_: well, i know that amd has a team of full-time engineers supporting their hw
19:31 imirkin_: while nouveau is in large part volunteer-supported, with 1 full-timer at redhat
19:32 imirkin_: and the amd guys have docs, we don't
19:32 mac-: I have a netbook based on the E-350 APU, had to upgrade my Slackware64 on it and ended up fixing the AMD driver source code because it didn't even want to compile ...
19:32 Calinou: recent GPUs have decent Nouveau support
19:32 Calinou: I wouldn't be surprised if Kepler cards were at AMD level in 2 years
19:32 imirkin_: well, no one's perfect, but i doubt they had such an issue for a long period of time
19:33 Calinou: maybe Maxwell, too
19:33 imirkin_: Calinou: i would be :)
19:33 Calinou: anyway, AMD cards are nowhere to be found in laptops (high-end ones), and on desktops, they aren't as power-efficient
19:33 Calinou: that's why I still prefer nvidia :/
19:34 imirkin_: but you don't care about open-source drivers and use the proprietary blob
19:34 imirkin_: if you cared about using open-source software, amd is the only real choice
19:34 imirkin_: unless you just want to hack on stuff, in which case nouveau can be fun
19:34 Calinou: AMD has proprietary firmware :D
19:34 Calinou: Nouveau doesn't, at least for pre-Maxwell
19:34 imirkin_: so does everything now
19:35 imirkin_: including your cpu
19:35 hakzsam: imirkin_, here's the report for NVIDIA samples http://hastebin.com/iqibiwegoq.md
19:36 hakzsam: I guess some extensions are not exposed with the compat profile
19:36 imirkin_: hakzsam: right
19:36 hakzsam: like ARB_compute_shader which we fixed a few minutes ago
19:36 imirkin_: hakzsam: well, that one WAS exposed
19:36 hakzsam: imirkin_, not correctly
19:36 imirkin_: but then mesa rejected commands randomly
19:37 imirkin_: hence the issue
19:37 hakzsam: yep
19:38 imirkin_: but e.g. indirect draws are just not exposed at all in compat
19:38 imirkin_: iirc there's some annoying interactions with client-side arrays
19:38 imirkin_: or something like that
19:38 imirkin_: or maybe something related to the default vao? dunno
19:38 hakzsam: imirkin_, so, we can't do that easily?
19:38 imirkin_: i dunno
19:38 imirkin_: maybe we can
19:39 imirkin_: just not in a spec-conformant fashion
19:39 imirkin_: feel free to futz with the extension_table.h
19:39 hakzsam: will do
19:43 hakzsam: imirkin_, SoftShadows actually needs 4.4, but it works too
19:43 imirkin_: ah cool
19:44 imirkin_: presumably it doesn't use the unimplemented bits of ARB_enhanced_layouts :)
19:44 hakzsam: most likely
19:55 karolherbst: imirkin_: okay, I found no evidence that SFU interactions can be dual issued with add $r0 $r0 immediate
20:02 hakzsam: imirkin_, well, except the missing extensions, all the other fails seem to be related to the GLSL compiler. Usually, syntax error.
20:08 hakzsam: imirkin_, I'll investigate about ComputeWaterSimulation on reator because I don't want to reboot my own machine ;)
20:11 imirkin_: hakzsam: syntax error like what?
20:16 karolherbst: this is odd. when we emit 1066 dual issues out of 3773 sched tags, we should have inst_issued2 28.25% of inst_issued1
20:17 hakzsam: imirkin_, http://hastebin.com/ekezuzavir.vbs
20:17 karolherbst: but it is more like 24.62%
20:17 karolherbst: maybe I find where the difference comes from too :D
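The arithmetic in the lines above can be checked with a short Python sketch. The numbers come from the log; the variable names are made up for illustration, and the comparison assumes (as the message does) that the static share of dual-issue sched tags should match the inst_issued2 / inst_issued1 counter ratio:

```python
# Figures quoted in the log.
dual_issues = 1066      # sched tags marked as dual issue
sched_tags = 3773       # total sched tags emitted
measured_pct = 24.62    # observed inst_issued2 as a percentage of inst_issued1

# Expected share if every emitted tag counted equally.
expected_pct = dual_issues / sched_tags * 100

# Gap between the static estimate and the hardware counters,
# in percentage points.
gap = expected_pct - measured_pct

print(f"expected {expected_pct:.2f}%, measured {measured_pct:.2f}%, gap {gap:.2f} points")
```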
20:18 imirkin_: hakzsam: full shader would be good :p
20:18 imirkin_: probably they do something dumb like void main() { };
20:18 hakzsam: http://hastebin.com/elupefutam
20:18 hakzsam: dumped with ST_DUMP_SHADERS
20:19 imirkin_: a shader that fails to compile doesn't make it to that stage
20:19 imirkin_: you want MESA_GLSL=dump or something
20:20 hakzsam: http://hastebin.com/agemezoqic.vhdl
20:20 imirkin_: ok, it's what i assumed.
20:21 imirkin_: void main { };
20:21 imirkin_: airlied: --^
20:21 imirkin_: didn't you say CTS was doing that a lot?
20:21 airlied: yup trailing semicolon fail
20:21 hakzsam: yeah, removing the semicolon fixes the issue
20:22 imirkin_: airlied: did you determine this was illegal according to the spec?
20:22 airlied: I couldn't determine it was definitely legal
20:22 airlied: I'm not the greatest spec grammar reader
20:23 imirkin_: :)
20:23 imirkin_: clearly you can have like int x = 5; at global scope
20:24 imirkin_: the question is whether you can have an empty statement at global scope
20:24 airlied: but looking at the SEMICOLONs in the grammar it appears you can't
20:24 imirkin_: afaik that grammar is non-normative though
20:24 imirkin_: it's just there for amusement value
20:25 airlied: no they fixed it up in GLSL4.50
20:26 airlied: so it at least made sense
20:26 imirkin_: ah cool. looking no
20:26 imirkin_: now
20:27 hakzsam: imirkin_, other one http://hastebin.com/ilesizohas.coffee
20:27 imirkin_: airlied: looks like expression_statement can be SEMICOLON
20:28 imirkin_: and expression_statement is one of the things for simple_statement
20:28 hakzsam: [ 3939.191984] nouveau 0000:01:00.0: gr: TRAP ch 7 [007f9ed000 FeedbackParticl[15310]]
20:28 hakzsam: [ 3939.191992] nouveau 0000:01:00.0: gr: SHADER 90000100
20:28 hakzsam: uhu
20:28 airlied: imirkin_: start from translation_unit
20:28 airlied: and see if you can find a path down
20:29 karolherbst: mhh odd, we schedule too many dual issues
20:29 imirkin_: airlied: hm, i see. it can only be a declaration (or function def)
20:29 airlied: yup and declaration has all the SEMICOLON cases
20:29 imirkin_: function_prototype can't be empty
20:30 imirkin_: type_qualifier can't be empty
20:31 karolherbst: and I am stupid
20:31 karolherbst: does anybody know a core limited benchmark with no code branching?
20:31 imirkin_: airlied: yeah ok, i think you're right
20:32 imirkin_: airlied: inside a function you can have as many ;;;; as you want
20:32 imirkin_: airlied: but not at global scope
20:37 hakzsam: imirkin_, did you already see such error?
20:37 imirkin_: hakzsam: it doesn't sound OVERLY familiar... but not totally unfamiliar
20:37 imirkin_: unfortunately i don't remember specifics
20:37 hakzsam: okay
20:38 imirkin_: iirc i got a similar type of error when i tried using fp64 from a shader without flipping the fp64 shader header bit on
20:38 imirkin_: probably not that exact code
20:38 imirkin_: perhaps there's more shader header bit flipping that needs to happen?
20:39 hakzsam: maybe, I need to trace blob
20:39 hakzsam: thanks for the hint
20:39 imirkin_: looks like we don't set anything for fp64 in the descriptor
20:40 karolherbst: ha, we can dual issue flow instructions :)
20:40 karolherbst: but... why? :D
20:40 imirkin_: why not?
20:41 karolherbst: I don't know, it was explicitly disabled in the code
20:41 karolherbst: inst_issued2: 256M -> 259M after I removed the opclass check against FLOW
20:41 karolherbst: maybe bra can't be dual issued
20:41 karolherbst: but stuff like join and whatelse
20:41 imirkin_: hakzsam: i'm guessing nve4_cp_launch_desc_init_default sets up some of those defaults
20:42 imirkin_: hakzsam: like gmem store/etc
20:42 imirkin_: hakzsam: but i'm guessing not fp64
20:42 imirkin_: hakzsam: do you know if that app tries to use fp64?
20:42 hakzsam: imirkin_, yeah, but lot of bits are unknown
20:42 karolherbst: yeah, no change when I exclude OP_BRA
20:42 hakzsam: imirkin_, no idea
20:43 karolherbst: well, also no emitted binary change...
20:43 karolherbst: ahh right, because that gets handled in the sched code calculator already
20:46 hakzsam: imirkin_, will have a look later, I'm fixing the mp perf counters stuff
20:46 karolherbst: ...
20:46 karolherbst: now I have this: if b depends on a false else true: inst_issued2 increased again
20:49 karolherbst: perf also increased
20:50 hakzsam: imirkin_, ahah, using c7[] on fermi while the driver cb is on c15[] is not good :)
20:50 imirkin_: oops
20:51 imirkin_: that's the "c7 is not bound, you idiot" error?
20:52 hakzsam: yeah, works better now
20:54 hakzsam: imirkin_, I will try with a UE4 demo now
20:57 hakzsam: imirkin_, yeah, works fine :)
20:57 hakzsam: we can now use MP perf counters with compute shaders
20:57 imirkin_: hakzsam: huh?
20:57 imirkin_: oh, you mean MP counters work fine. but UE4 still broken without hud?
20:58 hakzsam: oh yeah
20:58 hakzsam: sorry for the confusion
20:58 hakzsam: I was talking about that patch http://hastebin.com/omixoyeyep
20:58 imirkin_: cool
20:58 imirkin_: might want to hold off on the last bit :)
21:01 hakzsam: yeah, but I need to do the same change for kepler first
21:01 imirkin_: yep
21:03 karolherbst: ohh nice, two compare operations can be dual issued
21:04 imirkin_: karolherbst: please look at nvidia blob-generated shaders to see what they dual-issue
21:04 imirkin_: just coz you can set the dual-issue bit doesn't mean it's correct
21:04 karolherbst: the counters tell me what the hardware does
21:04 karolherbst: when we dual issue wrong, the counter respects that
21:05 imirkin_: and just coz you can fool the hw sometimes doesn't mean you're right
21:05 karolherbst: and actually shows what the hardware really dual issues
21:05 karolherbst: well, higher inst_issued2 also means more perf, never saw a counter example
21:05 karolherbst: but yeah, I will also check that with the nvidia generated binaries
21:05 imirkin_: but doesn't necessarily mean it's correct
21:05 imirkin_: randomly removing every other op leads to more perf too, i bet
21:06 karolherbst: well, it still looks the same
21:06 imirkin_: :p
21:06 imirkin_: do you not understand my point?
21:06 karolherbst: and as long as there are no more piglit fails, I don't care :D
21:06 karolherbst: yeah, I know what you mean
21:06 imirkin_: that's coz you're not the one debugging weird-ass fails
21:07 karolherbst: well git bisect should put the blame on me real fast
21:07 mac-: imirkin_: I have changed to nvs285 now
21:07 mac-: and it works many times better
21:07 imirkin_: karolherbst: until it's some combination with another futurely-new optimization that triggers the issue
21:07 karolherbst: right
21:08 imirkin_: mac-: what a world of difference a 5 can make :)
21:08 mac-: impressive :p
21:08 imirkin_: and that's only a NV44
21:08 imirkin_: i assume that the fact that it exposes GL 2.1 makes firefox much happier
21:08 mac-: finally nvs280 hung the whole machine
21:08 mac-: I'm afraid that it overheated
21:09 imirkin_: mac-: if you're willing to add another 5 to it and make it a NVS 290, that'll get you into G8x territory
21:13 karolherbst: gnurou: do you think we could get information about which instructions can be dual issued? :/
21:15 karolherbst: nice, I still have that mmt of pixmark_piano :)
21:20 karolherbst: yep
21:20 karolherbst: imirkin_: nvidia dual issues mins :)
21:20 karolherbst: even two mins
21:21 imirkin_: why wouldn't it?
21:21 imirkin_: we don't?
21:21 karolherbst: as I said: we didn't we?
21:21 karolherbst: right
21:21 karolherbst: only two ARITH
21:21 imirkin_: and min isn't arith?
21:21 karolherbst: nope
21:21 karolherbst: it is compare
21:21 imirkin_: weird
21:21 karolherbst: and max too
21:21 imirkin_: yeah, they're the same op
21:21 karolherbst: mhh
21:21 karolherbst: but yeah, it is odd
21:22 karolherbst: well
21:22 imirkin_: the op is actually FMNMX
21:22 karolherbst: selp is also compare
21:22 imirkin_: and it takes an extra predicate arg at the end
21:22 imirkin_: which tells it whether to take the min or the max
21:22 imirkin_: nfc why that op exists
21:22 imirkin_: and not have it be 2 ops
21:22 imirkin_: but ... whatever
21:23 karolherbst: :D
21:23 karolherbst: mhh
21:23 karolherbst: stupid blob
21:23 karolherbst: I found two set following each other, but the first one is dual issued with the instruction before :/
21:24 karolherbst: what does B mean here? "B set ftz $p0 0x1 gt f32 $r2 0x41c80000"
21:24 imirkin_: that means something branches there
21:24 imirkin_: it's just a reading aid
21:24 karolherbst: ahh okay
21:24 karolherbst: well the second set ors the result of the first one :/
21:24 karolherbst: ...
21:24 imirkin_: i.e. there's a branch somewhere with that set as the destination
21:25 karolherbst: maybe I find something good :D
21:25 karolherbst: yeah,
21:25 karolherbst: there was a joinat before too
21:25 imirkin_: right. which means that joins thereafter will jump to that set
21:26 karolherbst: okay, can't find any evidence that we can dual issue two sets
21:26 karolherbst: but two mins is fine
21:29 karolherbst: and yeah, it was also min/max for me
21:29 karolherbst: imirkin_: should min/max be arithmetic then? no idea why they are marked as compare
21:29 imirkin_: well, FMNMX is very similar to SELP
21:29 karolherbst: they compare stuff for sure, but mhhh
21:30 imirkin_: which in turn isn't too different from FSETP
21:30 imirkin_: er, FSET
21:30 karolherbst: what is the name for slct?
21:30 imirkin_: i dunno tbh
21:31 karolherbst: shouldn't those opclasses tell us which part of the cores are used on the hardware?
21:31 imirkin_: no clue.
21:31 karolherbst: we will need that maybe anyway
21:32 imirkin_: and no clue how classes are used throughout the compiler
21:32 imirkin_: i assumed it was mostly for sched purposes
21:32 karolherbst: like with the metrics we can find out which parts may be a bottleneck
21:32 karolherbst: well
21:33 karolherbst: those are all on the alu anyway, right?
21:33 imirkin_: are they? dunno.
21:33 karolherbst: well all the others won't fit
21:33 imirkin_: i wouldn't make such wild assumptions :)
21:33 karolherbst: well, although those metrics don't tell us if the cores are separated this way
21:34 karolherbst: anyway, we can dual issue min/max together as pairs
21:35 karolherbst: mhh
21:35 karolherbst: maybe I should write a program where we can throw in binary code and it tells us how we would fill the scheds and it tells us every difference :D
21:37 hakzsam: imirkin_, just sent out
21:38 imirkin_: hakzsam: no maxwell counters yet right?
21:38 hakzsam: nope
21:38 hakzsam: logic has changed a bit IIRC
21:38 hakzsam: I didn't RE it
21:38 karolherbst: 52.5 minimal frame time by the way... this is frigging close to nvidia now
21:38 hakzsam: imirkin_, we should also make this change for nv50, but heh no compute shaders and no OpenCL currently
21:39 hakzsam: imirkin_, before pushing, I'll give a little test on gk107 tomorrow
21:40 imirkin_: meh
21:40 imirkin_: you're being overly cautious
21:40 imirkin_: ship it ;)
21:40 imirkin_: have you tested it WITHOUT compute shaders btw?
21:40 imirkin_: like glxgears/etc
21:40 hakzsam: meh, this takes 2 minutes :)
21:40 hakzsam: yes
21:40 imirkin_: k
21:41 karolherbst: 82% compared to nvidia, only 18% to go now
21:44 karolherbst: stock nouveau: 74.4%
21:48 hakzsam: imirkin_, do you know if the UE4 demos replay fine on radeonsi?
21:48 imirkin_: i'm told - yes.
21:48 hakzsam: so, it's not an application bug
21:49 hakzsam: a missing barrier or something like you guessed
21:54 imirkin_: why not?
21:55 hakzsam: well, if they replay fine with other drivers I would say that we have an issue
21:55 hakzsam: anyways, something is wrong between 3d and compute
21:56 hakzsam: we already know that
21:56 karolherbst: mhh break can be dual issued with the next instruction
21:57 imirkin_: hakzsam: that's not sound logic... it could just happen to work fine due to how the hw works
21:57 hakzsam: mmh yeah
22:00 hakzsam: and unfortunately NVIDIA doesn't use the 4.3 renderer...
22:00 hakzsam: that might explain something too
22:00 karolherbst: uhhh, why doesn't nvidia use breaks... or are they called differently?
22:00 karolherbst: I guess nvidia optimizes just smarter than we do
22:02 imirkin_: probably as a result of how their compiler works
22:02 imirkin_: i don't think it preserves things like loops
22:02 imirkin_: and when they restructurize, they just end up using branches
22:03 karolherbst: mhh
22:03 karolherbst: maybe brk is really bad to use?
22:03 karolherbst: :/
22:03 imirkin_: no, just probably difficult to figure out when to use it in their compiler
22:03 karolherbst: ohh okay
22:03 karolherbst: well I still don't get why we can dual issue breaks + something
22:04 karolherbst: mhh
22:04 karolherbst: maybe it depends on the predicate
22:06 karolherbst: uhh
22:06 karolherbst: this is odd
22:06 karolherbst: inst_issued1 _and_ inst_issued2 are increasing
22:07 karolherbst: and performance got worse
22:09 karolherbst: okay, enough dual issuing fun for today :)
22:10 karolherbst: 1019->1054 score increase is pretty solid already
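For scale, the score change quoted above works out to roughly a 3.4% relative gain. A minimal Python sketch of that calculation; the function name `relative_gain_pct` is hypothetical, and only the two scores come from the log:

```python
def relative_gain_pct(before, after):
    """Relative change from before to after, as a percentage of before."""
    return (after - before) / before * 100

# Benchmark scores quoted in the log.
score_before = 1019
score_after = 1054

gain = relative_gain_pct(score_before, score_after)
print(f"{score_before} -> {score_after}: +{gain:.2f}%")
```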
22:10 mac-: hm, can the system hang after:
22:10 mac-: kernel: [ 3500.410315] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 1 [X[2264]] get beef0200 put 0001d4c8 state c002018c (err: MEM_FAULT) push 00000000
22:10 mac-: ?
22:10 imirkin_: mac-: irrespective of what you write next, the answer is always "yes"
22:10 karolherbst: :D
22:10 karolherbst: well
22:10 karolherbst: mac-: X got messed up
22:11 karolherbst: which means, yeah, your desktop is kind of frozen now :p
22:11 mac-: it was
22:11 mac-: I had to reboot
22:11 karolherbst: well
22:11 karolherbst: the more annoying thing would be to figure out what caused that
22:12 mac-: sth I can fix or it just crashes
22:12 mac-: ?
22:12 imirkin_: you could get an amd video card
22:12 mac-: I have compiled kernel 4.6, maybe thats why
22:12 mac-: will try to work on default one after next crash
22:13 mac-: nouveau-pci-0100 temp1: +58.0°C
22:13 mac-: nothing bad so far
22:18 mac-: but this is strange because nvs280 froze in the same way