06:55 hakzsam: orbea, imirkin I will have a look today, maybe it's related, maybe not
07:56 hakzsam: imirkin, just saw your fix for the gk110 emitter... my bad! I guess this fixes orbea's issue?
10:36 RSpliet: hakzsam: just to confirm that with an older version of the blob, demmt correctly decodes traces
10:36 RSpliet: so I guess the ioctls have changed in their latest driver :-)
10:37 RSpliet:stares at the difference between hypot() and sqrt((a * a) + (b * b))
10:38 RSpliet: fascinating how many more operations NVIDIA emits for hypot()
10:43 RSpliet: with interesting sequences like rsqrt $r2, $r2; rcp $r2, $r2
11:02 mlankhorst: looks like workarounds for hw bugs :P
11:11 hakzsam: RSpliet, yeah, that makes sense
11:16 RSpliet: mlankhorst: or extreme optimisation for precision knowing the inner working of hw rounding :-P
11:19 mlankhorst: possibly
14:28 imirkin: RSpliet: rcp(rsq) is a *really* imprecise way of calculating sqrt()
14:28 imirkin: RSpliet: chances are the extra code is newton-raphson steps
14:30 RSpliet: imirkin: without seriously inspecting code, they must then be unrolled newton-raphson steps. There are no branches
14:30 imirkin: yeah, you usually do like 2 steps and call it a day :)
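The "two steps and call it a day" remark can be illustrated with a minimal sketch of Newton-Raphson refinement for a reciprocal square root. The function names, the deliberately imprecise seed, and the 1% starting error are assumptions made for illustration; this is not the sequence NVIDIA's compiler emits.

```python
import math

def refine_rsqrt(x, y):
    """One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y^2)."""
    return y * (1.5 - 0.5 * x * y * y)

def approx_sqrt(x, steps=2, rel_error=0.01):
    """sqrt(x) for x > 0 from a rough rsqrt seed refined `steps` times.

    The seed is off by `rel_error` on purpose, standing in for an
    imprecise hardware rsq (hypothetical error figure).
    """
    y = (1.0 + rel_error) / math.sqrt(x)
    for _ in range(steps):
        y = refine_rsqrt(x, y)  # error roughly squares on every step
    return x * y

if __name__ == "__main__":
    for steps in range(3):
        err = abs(approx_sqrt(2.0, steps) - math.sqrt(2.0))
        print(steps, err)
```

Because the error roughly squares per step, a fixed small number of unrolled steps is enough, which fits the observation that the emitted code has no branches.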
14:31 RSpliet: fair enough :-) oddly though, even the base case I look at (without the use of hypot(), thus without the demand for precision), there's a rsqrt, rcp sequence
14:31 RSpliet: there's no reason to do so on Kepler I presume?
14:31 imirkin: also if you happen to use doubles, rcp/rsq only compute the upper 32 bits of the 64-bit value
14:32 imirkin: hm?
14:32 imirkin: well there's no native sqrt()
14:32 imirkin: only rsq and rcp
14:32 RSpliet: ah
14:32 imirkin: no fdiv either
14:32 imirkin: (but there's a fmul + frcp)
14:32 RSpliet: that explains some things too :-)
14:33 imirkin: x * rsq(x) should have better precision than rcp(rsq(x)) but it doesn't behave correctly around 0 =/
14:34 imirkin: and it turns out that people *really* want sqrt(0) == 0
14:34 imirkin: [0 * infinity = nan]
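The zero case described above can be reproduced with a small model. `rsq` and `rcp` here are hypothetical Python stand-ins that follow IEEE conventions for 0 and infinity (defined only for x >= 0), not the hardware instructions themselves.

```python
import math

def rsq(x):
    """Model of a reciprocal square root: rsq(0) = +inf."""
    return math.inf if x == 0.0 else 1.0 / math.sqrt(x)

def rcp(x):
    """Model of a reciprocal: rcp(inf) = 0, rcp(0) = +inf."""
    if math.isinf(x):
        return math.copysign(0.0, x)
    return math.inf if x == 0.0 else 1.0 / x

def sqrt_via_mul(x):
    # x * rsq(x): 0 * inf is NaN, so sqrt(0) comes out wrong.
    return x * rsq(x)

def sqrt_via_rcp(x):
    # rcp(rsq(x)): rcp(inf) is 0, so sqrt(0) == 0 as people expect.
    return rcp(rsq(x))

if __name__ == "__main__":
    print(sqrt_via_mul(0.0), sqrt_via_rcp(0.0))
    print(sqrt_via_mul(4.0), sqrt_via_rcp(4.0))
```

Both forms agree away from zero; only the multiply form turns sqrt(0) into NaN.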
14:34 RSpliet: hehe, I can see why
14:34 Tom^: is there some known issue with vsync when multiple GL applications try to vsync? seems a friend runs both the GL applications with over 400fps both, and if he vsyncs it only runs at like ~40fps
14:35 imirkin: that's quite the monitor if he runs either one at 400fps vsync'd
14:35 Tom^: with it off.
14:35 Tom^: :P
14:35 imirkin: where do i go buy one like that?
14:35 Tom^: dont drink and type
14:36 imirkin: it's the law!
14:36 Tom^: so it doesnt seem like the gpu is hogged down when it can run both at ~400fps at the same time without vsync.
14:36 imirkin: so what fps does *one* application get vsync'd?
14:38 imirkin: http://i.imgur.com/3z14hMH.jpg
14:38 Tom^: =D
14:38 RSpliet: imirkin: in case you were interested, the hypot version of euclidean distance (a trivial thing) is on http://hastebin.com/ivodiqehic.avrasm
14:38 Tom^: didnt the later generation of crts reach like 220hz?
14:38 imirkin: for some reason i thought the original was "don't drink and drive, it's the law". but no - texting.
14:40 hakzsam: imirkin, does your gk110 emitter change fix orbea's issue?
14:40 imirkin: RSpliet: lots of fun games going on in there
14:40 imirkin: hakzsam: yes
14:40 hakzsam: imirkin, cool, and sorry :)
14:40 imirkin: np. i missed it too.
14:40 imirkin: apparently the index of an array matters. who knew.
14:40 hakzsam: yeah, apparently
14:52 imirkin: RSpliet: yeah, this is nice, and looks semi-familiar
14:52 imirkin: 00000110: 0021dc80 208ec020 set $p0 0x1 lt f32 abs $r2 0x800000
14:52 imirkin: 00000118: 00208000 5800d2e0 $p0 mul rn f32 $r2 $r2 0x4b800000
14:52 imirkin: ...
14:52 imirkin: 00000138: 00208000 5800ce60 $p0 mul rn f32 $r2 $r2 0x39800000
14:52 imirkin: "if it's too small, just make it bigger!"
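The three hex constants in the quoted instructions are float32 bit patterns, and decoding them shows what "make it bigger" means. This is a sketch using only the constants from the log; the reading that a square root sits between the two multiplies is an inference from the numbers, not something stated in the conversation.

```python
import struct

def bits_to_f32(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

SMALLEST_NORMAL = bits_to_f32(0x00800000)  # threshold in the `set ... lt` compare
SCALE_UP = bits_to_f32(0x4B800000)         # first predicated multiply
SCALE_DOWN = bits_to_f32(0x39800000)       # second predicated multiply

if __name__ == "__main__":
    print(SMALLEST_NORMAL == 2.0 ** -126)  # below this, values are denormal
    print(SCALE_UP == 2.0 ** 24)
    print(SCALE_DOWN == 2.0 ** -12)
```

Since sqrt(x * 2^24) = sqrt(x) * 2^12, multiplying by 2^-12 afterwards would undo the scaling exactly, which is consistent with the down-scale being the inverse square root of the up-scale.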
14:57 RSpliet: hehe, effective though
14:58 imirkin: anyways, you think this is crazy, look at sqrt(double)
14:59 RSpliet: oh I don't think it's particularly crazy... just interesting to see how precision tuning can lead to such "obfuscation" and added instructions
14:59 RSpliet: hadn't looked at that before in detail
15:00 RSpliet: but explains why the hypot() version of that rodinia benchmark was so much slower :-)
15:00 RSpliet: despite low reg usage
15:20 imirkin: orbea: very timely bug notification btw - the fix was able to make it into 12.0.2 (where the commit that broke it already was queued up). if you had delayed a day, we would have had a bad release for GK110. thanks =]
15:21 hakzsam: that's nice :)
15:27 orbea: :)
15:35 imirkin: hakzsam: did the feral guy ever get back to you about F1?
15:35 hakzsam: not yet
16:34 kloofy: one more comment i am now implementing the design, i am not so speechy anymore, so if imirkin remembers, then for gpu i do not definitely use this noc method, it was just demonstration purposes, i use manual multithreaded fsm logics
16:35 kloofy: but for cpu firmware i yet do not have the clarity, it reeally depends on yet little kernel research what design i will vote for
16:45 kloofy: need to bring some clarity of the issue as i never followed the schedulers work and cpu hw stuff about that, it's just done inspecting some of the Con Kolivas patchsets and probably i can as always read it out, but yeah the idea is to put gpu and cpu all onto one chip with two roms
16:45 kloofy: of the/to the
16:48 kloofy: i'd love to leave my e-mail address if someone would want to discuss the stuff, so i'd have more interesting time to work, but am generally afraid if here are yet some conspiracists and unstable folks
16:56 kloofy: so not much more to talk about, i'd wish you good bug hunting and keeping up the good work, i'd generally would answer to some more questions to kickstart someone elses efforts too, just don't wanna deal with angry pickets towards me
17:55 kloofy: ouh jesus, again me, well i'm aware that things can be done in one address space like edk2, and not running linux kernel ontop of x86 firmware, i.e one feeds the rom with userspace binaries and builds axi based contiguous memory hexmem, then yeah theoretically i could do all in one address space on fpga, even the device drivers but, then the scheduling is done in hw
18:13 kloofy: oh come dudes, it's possible no scam, but the application has to be relinked , either at runtime or so called on disk
18:14 karolherbst1: what is wrong with doing this? "echo "fma f32 $r0 $r1 0x0 $r2" | envyas -w -m gf100"
18:14 kloofy: you can use a linker to generate whole kernel module as shared library which you link into the binary
18:14 imirkin_: rounding mode missing iirc
18:14 karolherbst1: imirkin_: it works if I write it without a pipe
18:14 imirkin_: oh
18:14 imirkin_: sec
18:15 imirkin_: oh lol
18:15 imirkin_: s/"/'/g
18:15 karolherbst1: ...
18:15 karolherbst: k
18:15 karolherbst: :D
18:15 karolherbst: thanks
18:15 hakzsam: and add -V gk104 if you want kepler
18:15 imirkin_: so... i just ran this
18:15 imirkin_: and it outputted 0xfc10020e
18:15 imirkin_: which is, sadly, wrong
18:16 karolherbst: hakzsam: I target the entire gf100 isa first
18:16 karolherbst: mhh
18:16 imirkin_: this is the short encoding
18:16 imirkin_: which doesn't really work
18:16 karolherbst: echo 'fma f32 $r0 $r1 0x0 $r0' | envyas -w -m gf100
18:16 imirkin_: and/or neither we nor nvidia actually make use of it
18:16 imirkin_: so you have to do
18:16 karolherbst: 0xfc10000e
18:16 karolherbst: is this right?
18:16 imirkin_: it's 4 bytes
18:16 imirkin_: it needs to be 8 bytes
18:16 imirkin_: you need the long encodings
18:16 imirkin_: let me remember how to get those... sec
18:16 hakzsam: -W then :)
18:16 imirkin_: lol, no
18:17 karolherbst: ohh right
18:17 karolherbst: still looks wrong with -W
18:17 karolherbst: :D
18:17 imirkin_: hmmmmmm
18:17 imirkin_: so normally we do it by prepending "long"
18:17 imirkin_: ah right
18:17 imirkin_: so you do need the rounding mode =]
18:17 imirkin_: echo 'long fma rn f32 $r0 $r1 0x0 $r2'
18:17 imirkin_: try that
18:18 karolherbst: 0x30040000fc101c00
18:18 imirkin_: also note that you generally need to add either -V gf100 or -V gk104
18:18 imirkin_: leaving the -V off will get you wrong results sometimes
18:18 imirkin_: but not for fma
18:18 imirkin_: -W vs -w just changes whether it outputs a list of u64's or u32's
18:18 imirkin_: it doesn't change the actual data being output
18:18 karolherbst: envydis: 00000000: fc101c00 30040000 fma rn f32 $r0 $r1 0x0 $r2 :)
18:19 karolherbst: okay
18:19 hakzsam: I always use -W actually
18:19 karolherbst: 'long fma rn f32 $r0 $r1 0xc000 $r2' also works
18:20 imirkin_: karolherbst: make sure to double-check those with nvdisasm
18:20 imirkin_: envydis can be wrong
18:20 karolherbst: k
18:20 imirkin_: esp in such matters
18:20 karolherbst: at least envyas supports everything from 0x0 to 0xf000
18:20 imirkin_: makes sense.
18:20 imirkin_: it's a short float
18:21 karolherbst: well only changing the high bits
18:21 karolherbst: 0xef00 doesn't work
18:21 imirkin_: right
18:21 imirkin_: only get 20 bits
18:21 imirkin_: 0x3f800000 should work though
18:21 imirkin_: (it's the upper 20 bits of a float)
18:21 karolherbst: uhhh
18:21 karolherbst: right
18:22 karolherbst: 0xffff0000 also works
18:22 imirkin_: as should 0xfffff000
18:22 imirkin_: as long as the low 12 bits are all 0's
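The rule stated here (a short float immediate is the upper 20 bits of a 32-bit float, so the low 12 bits must be zero) can be written down as a check. The helper names are hypothetical; the test values are the ones tried in the conversation.

```python
def fits_short_f32_imm(bits):
    """True if a 32-bit float pattern survives being cut to its upper 20 bits."""
    return 0 <= bits <= 0xFFFFFFFF and (bits & 0xFFF) == 0

def short_f32_imm_field(bits):
    """Return the 20-bit field for a pattern that fits, else raise."""
    if not fits_short_f32_imm(bits):
        raise ValueError(f"{bits:#x} needs the low 12 bits, use a long immediate")
    return bits >> 12

if __name__ == "__main__":
    for value in (0x0, 0xC000, 0xF000, 0xEF00, 0x3F800000, 0xFFFF0000, 0xFFFFF000):
        print(f"{value:#010x}", fits_short_f32_imm(value))
```

This matches what was observed: 0xc000 and 0xf000 assemble, 0xef00 does not because bits 8 to 11 are set.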
18:22 karolherbst: ahh okay
18:22 karolherbst: I see
18:22 karolherbst: k
18:22 karolherbst: now the limm mode
18:23 karolherbst: mhh
18:23 karolherbst: how do I use those nvidia tools again?
18:23 imirkin_: oh btw, you don't need 'long' - having the 'rn' in there forces the long encoding :)
18:23 hakzsam: karolherbst, what are you trying to achieve btw?
18:23 karolherbst: hakzsam: post ra constant folding
18:23 imirkin_: perl -ane 'foreach (@F) { print pack "I", hex($_) }' > tt; nvdisasm -b SM20 tt
18:24 imirkin_: that takes a sequence of 32-bit hex numbers
18:24 imirkin_: writes them out as "binary"
18:24 imirkin_: and feeds that binary into nvdisasm
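For readers who prefer Python, a rough equivalent of the packing half of that one-liner could look like the sketch below. It assumes a little-endian host, since Perl's `pack "I"` uses native byte order; the function name is made up for illustration, and the nvdisasm invocation itself is left to the shell.

```python
import struct

def words_to_bytes(text):
    """Pack whitespace-separated 32-bit hex words as little-endian bytes."""
    return b"".join(struct.pack("<I", int(tok, 16)) for tok in text.split())

if __name__ == "__main__":
    # The two 32-bit words envydis printed for the fma example in the log.
    blob = words_to_bytes("fc101c00 30040000")
    print(blob.hex())
```

The resulting bytes are what would be written to the `tt` file before running the disassembler on it.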
18:24 hakzsam: karolherbst, cool
18:24 karolherbst: hakzsam: mov $r2 0xf000; mad $r0 $r1 $r2 $r3 -> mad $r0 $r1 0xf000 $r3
18:24 karolherbst: yeah
18:24 karolherbst: although
18:24 karolherbst: we can do that actually pre ra
18:24 karolherbst: because the regs don't matter
18:24 imirkin_: we should already be using the short immediate version
18:24 karolherbst: just the position
18:25 imirkin_: we just never use the limm one
18:25 imirkin_: although i dunno, maybe we just nuked the whole thing.
18:25 karolherbst: well, on kepler we don't do such thing at all
18:25 imirkin_: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/codegen/nv50_ir_target_nvc0.cpp#n374
18:26 karolherbst: mhh
18:26 karolherbst: right, the check is there at least
18:26 imirkin_: we don't do it for fma
18:26 imirkin_: but we do it for mad
18:26 imirkin_: (the distinction between the two is purely internal to nv50 ir, the emitted opcode is the same)
18:26 karolherbst: there is actually a difference there?
18:27 imirkin_: mad = splittable, fma = no touch
18:27 karolherbst: mhh right
18:27 karolherbst: do we emit fma at all?
18:27 karolherbst: never saw it
18:27 imirkin_: only if a shader uses it
18:27 karolherbst: I see
18:27 imirkin_: http://docs.gl/sl4/fma
18:27 karolherbst: well as far as I know, we don't do it for mad either on kepler/fermi
18:27 imirkin_: hm. well we should be doing it. i dunno why that wouldn't work.
18:28 karolherbst: mhh
18:28 karolherbst: let me check
18:28 imirkin_: but that's not a direct claim saying that it *does* work ;)
18:28 imirkin_: just that i dunno why it wouldn't :)
18:29 kloofy: but now yeah off to bed, there are so many methods in the cpu design, that is a bit occurate, wery many options and different ways possible to do stuff
18:33 karolherbst: imirkin: mhh, you are actually right
18:33 karolherbst: it is done already
18:33 imirkin_: :)
18:33 kloofy: whether you compile in a native fpga circuit x86 linker + userspace scheduler, doing the scheduling from scratch on fpga or using noc based stuff, using linker on hard core and scheduler in fpga, or running the whole kernel , stuff like that
18:33 karolherbst: only the limm thing is missing then
18:35 karolherbst: imirkin_: well, there is no limm support in envyas currently, and somehow mad doesn't work at all
18:35 imirkin_: could be.
18:36 karolherbst: "echo '0x300ccfc4f0b19c00' | perl -ane 'foreach (@F) { print pack "I", hex($_) }' > tt; nvdisasm -b SM20 tt" gives me "nvdisasm error : Unrecognized operation at address 0x00000000"
18:37 imirkin_: right, that won't work
18:37 imirkin_: read the instructions i gave
18:37 imirkin_: they're not THAT long :p
18:37 karolherbst: ohh 32bit values
18:37 imirkin_: if you want it to take 64-bit values, change the thing in the pack
18:37 imirkin_: i dunno what the 64bit int thing is offhand
18:38 karolherbst: "@P3 VMAD.S8.U8.PO.SHR_7.SAT R51.CC, R0.B1, R12, R24;" ...
18:38 karolherbst: very easy to understand :D
18:38 imirkin_: "this is the wrong order"
18:38 karolherbst: ohh
18:38 imirkin_: this is why you should just use -w
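The word-order problem comes from splitting a 64-bit value into 32-bit words: the low word has to come first. A minimal sketch, using the 64-bit value and the 32-bit pair that both appear earlier in the log for the same fma instruction (the function name is hypothetical):

```python
import struct

def split_u64(value):
    """Split a 64-bit instruction into (low word, high word)."""
    return value & 0xFFFFFFFF, value >> 32

if __name__ == "__main__":
    lo, hi = split_u64(0x30040000FC101C00)  # the -W style output from the log
    print(f"{lo:08x} {hi:08x}")             # same order as the 32-bit word list
    # Little-endian u64 and (lo, hi) u32 pair give identical bytes.
    print(struct.pack("<Q", 0x30040000FC101C00) == struct.pack("<II", lo, hi))
```

Feeding the high word first swaps the two halves, which is how a plain fma can end up decoding as something unrelated.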
18:39 karolherbst: now it makes more sense :D
18:39 imirkin_: the V* ops are pretty epic though
18:39 imirkin_: i don't think i was even aware of the SHR_* options on them
18:39 karolherbst: simd stuff?
18:39 imirkin_: yea
18:40 karolherbst: can we actually use them for shaders?
18:41 karolherbst: any reason why that won't work? echo 'mad ftz f32 $r8 $r9 $r15 $r8' | envyas -w -m gf100 -V gk104
18:44 imirkin_: no such op
18:44 imirkin_: (a) fma, not mad, (b) you're missing rounding mode
18:45 karolherbst: okay, but even with rn it won't work and I though mad != fma
18:45 karolherbst: *thought
18:45 imirkin_: in nv50 ir, they're different
18:45 karolherbst: ahhh I see
18:45 imirkin_: but when emitted, they're the same op
18:45 imirkin_: it's just a bookkeeping measure
18:45 karolherbst: okay, I understand
18:45 karolherbst: so on hw there is only one op for both
18:46 karolherbst: and we have fma just for not splitting it up
18:46 karolherbst: okay
18:46 imirkin_: of course we could run into the idiotic situation that we have both fma and mad computing the same result and don't realize they're the same, but ... meh.
18:46 karolherbst: now there is the thing, nv50ir can emit the limm mode, but envyas can't assemble it :)
18:47 imirkin_: all this would require properly caring about the 'precise' qualifier
18:47 imirkin_: and i'm too old for that shit :)
18:49 tobijk: imirkin_: so you'll do it anyway? (as the inspiration for your last sentence always does) :D
18:50 imirkin_: unlikely
18:57 karolherbst: imirkin_: is that -b SM20 for gf100 and gk104?
18:57 imirkin_: it's for SM20 - shader model 2.0
18:57 imirkin_: which is fermi
18:57 imirkin_: kepler1 is SM30
18:57 imirkin_: kepler2 is SM35
18:57 imirkin_: maxwell is SM50
18:57 imirkin_: pascal is SM60
18:58 karolherbst: do you have a handy command for pushing the emitted binary from NV50_PROG_DEBUG into nvdisasm?
18:59 imirkin_: copy + paste
18:59 karolherbst: ohh wait right, I thought "IPA" is something odd
18:59 imirkin_: you just run the command i gave you
18:59 karolherbst: ...
18:59 imirkin_: and paste it in
18:59 imirkin_: and then hit ^D
18:59 imirkin_: and you're done
18:59 imirkin_: A = shader inputs. IPA = InterPolate A. i think.
19:00 imirkin_: (actually A is also shader outputs)
19:00 imirkin_: so you have stuff like ALD/AST in pre-frag shaders
19:00 karolherbst: can I somehow tell nvdisasm to not print those offsets?
19:00 imirkin_: you could make your terminal wider
19:01 karolherbst: no, I don't want them at all
19:01 imirkin_: i understand.
19:01 imirkin_: there's -raw i think
19:01 imirkin_: look at the help
19:01 imirkin_: iirc there's an option that reduces the output a little, but for some reason i didn't like it
19:02 karolherbst: raw still does print the offset
19:04 imirkin_: ok
19:06 karolherbst: it seems like the emitted code from nv50ir is indeed right
19:07 karolherbst: although FFMA -> FFMA32I
19:07 karolherbst: but I doubt that matters much
19:07 imirkin_: that's what all their limm ops are called
19:07 imirkin_: *32I
19:07 karolherbst: I see
19:07 imirkin_: even MOV32I
19:08 karolherbst: okay, so I guess envydis needs to be told about those
19:09 karolherbst: and with those I mean ffma32i
19:09 imirkin_: i think there's a comment in there about it missing
19:49 karolherbst: imirkin_: isn't that the right thing? "{ 0x2000000000000002ull, 0xf800000000000007ull, N("fma"), T(ftz6), T(fmz7), T(sat5), T(farm), N("f32"), DST, T(acout3a), T(neg9), SRC1, LIMM, T(neg8), DST },"
19:49 imirkin_: ah yea
19:50 karolherbst: mhh, I am wondering why that doesn't get picked up then
19:50 imirkin_: go the other way
19:50 imirkin_: get it to decode
19:50 karolherbst: ohh okay
19:51 karolherbst: like that? "00000000: 3628dc62 20fd3333 fma ftz sat rm f32 $r35 $r34 0x3f4ccccd $r35"
19:51 imirkin_: like that.
19:52 karolherbst: mhh
19:52 imirkin_: it wants the rounding mode after ftz i guess
19:52 karolherbst: assembling works with that now
19:52 karolherbst: the rm thing?
19:52 imirkin_: yes.
19:52 karolherbst: I just checked, rn doesn't work
19:53 imirkin_: erm
19:53 karolherbst: bu rm
19:53 karolherbst: *but
19:53 imirkin_: check what tabfarm is
19:53 karolherbst: rn, rm, rp, rz
19:54 karolherbst: rz doesn't work either
19:54 karolherbst: and rp too
19:54 karolherbst: so just rm
19:54 imirkin_: dunno why that'd be the case.
19:55 karolherbst: mhh interesting
19:56 karolherbst: "{ 0x2000000000000001ull, 0xf800000000000007ull, N("fma"), T(farm), N("f64"), DSTD, T(acout30), T(neg9), SRC1D, T(ds2w3), T(neg8), T(ds3) }"
19:56 karolherbst: "{ 0x2000000000000002ull, 0xf800000000000007ull, N("fma"), T(ftz6), T(fmz7), T(sat5), T(farm), N("f32"), DST, T(acout3a), T(neg9), SRC1, LIMM, T(neg8), DST }"
19:56 RSpliet: round nearest, round zero, round plus, round minus?
19:56 imirkin_: that's f64
19:56 imirkin_: RSpliet: yes
19:56 karolherbst: ohh right
19:56 karolherbst: "{ 0x3000000000000000ull, 0xf800000000000007ull, N("fma"), T(ftz6), T(fmz7), T(sat5), T(farm), N("f32"), DST, T(acout30), T(neg9), SRC1, T(fs2w3), T(neg8), T(is3) }"
19:56 imirkin_: plus being "towards plus infinity"
19:57 imirkin_: karolherbst: i guess it gets confused, but i don't see how the rounding mode selection matters for this
19:57 karolherbst: maybe the acout30 and acout3a somehow interfere?
19:57 karolherbst: that's the only obvious difference between those
19:58 imirkin_: they should both have the same token though
19:58 imirkin_: (but different bits set)
19:58 imirkin_: should just be a $c register
19:59 karolherbst: will try to get nvidia print those other rounding modes
20:02 karolherbst: mhh odd
20:02 imirkin_: i believe it doesn't print RN
20:02 karolherbst: right
20:02 imirkin_: it should print the other ones though
20:02 karolherbst: it doesn't
20:03 karolherbst: well
20:03 karolherbst: there is something odd though
20:03 karolherbst: will show you
20:03 karolherbst: 0x20fd33333628dc02 & 0x0180000000000000 => 0x0080000000000000 == rm, right?
20:03 karolherbst: so 0x21fd33333628dc02 should be rz
20:03 karolherbst: 18 is rz, where 08 is rm
20:04 karolherbst: "0x3628dc02 0x20fd3333" -> FFMA32I R35, R34, 0.80000001192092895508, R35;
20:04 karolherbst: "0x3628dc02 0x21fd3333" -> FFMA32I R35, R34, 0.80000001192092895508, R35;
20:04 karolherbst: ...
20:04 karolherbst: "0x3628dc02 0x21fd3333" -> FFMA32I R35, R34, 2.72225897593232691501e+38, R35;
20:04 karolherbst: so
20:05 karolherbst: tell me if I did anything wrong here
20:05 imirkin_: could be that flag doesn't exist for the FFMA32I version
20:05 karolherbst: yeah, I think so too
20:05 imirkin_: or it could be that it's somewhere else
20:05 karolherbst: mhh good idea
20:05 karolherbst: will search for it
20:05 imirkin_: usually i start with all 0's
20:05 imirkin_: and then stick a bunch of f's in some places
20:06 imirkin_: until i see the desired flag appear
20:06 karolherbst: k
20:07 karolherbst: k, found the .CC thing at 0x0400000000000000
20:08 karolherbst: what is that? "FFMA32I.INVALIDFMZ3.SAT.S R0, R0, 0, R0;"
20:09 imirkin_: FMZ is something else
20:09 imirkin_: .CC = acout
20:09 karolherbst: .S ?
20:09 imirkin_: join
20:10 karolherbst: okay, so it doesn't seem to be there at all
20:10 imirkin_: check on the FMZ thing
20:10 karolherbst: fmz is there
20:10 imirkin_: there could be another bitfield that's kinda part of it
20:10 karolherbst: @P0 FFMA32I.FMZ R0, R0, 0, R0; /* 0x2000000000000082 */
20:10 karolherbst: @P0 FFMA32I.FTZ R0, R0, 0, R0; /* 0x2000000000000042 */
20:10 imirkin_: oh
20:10 karolherbst: @P0 FFMA32I.SAT R0, R0, 0, R0; /* 0x2000000000000022 */
20:10 imirkin_: but you can't have FTZ + FMZ
20:10 imirkin_: right
20:10 karolherbst: 0x10 is .S
20:10 imirkin_: which is what gets you the INVALIDFMZ3 thingie
20:10 karolherbst: and 0x100 is the reg
20:11 imirkin_: what about 0x27.....
20:11 karolherbst: nope
20:11 karolherbst: 0x100 is a neg mod
20:11 karolherbst: imirkin_: immediate value
20:11 karolherbst: @P0 FFMA32I R0.CC, R0, -2, R0; /* 0x2700000000000002 */
20:11 imirkin_: ?
20:12 imirkin_: 0x2ff
20:12 karolherbst: @P0 FADD32I R0.CC, R0, -2.65845599156983174581e+36; /* 0x2ff0000000000002 */
20:12 imirkin_: er, right, i meant 0x27f
20:13 karolherbst: @P0 FFMA32I R0.CC, R0, -2.65845599156983174581e+36, R0; /* 0x27f0000000000002 */
20:13 imirkin_: keep adding f's until registers start changing
20:13 karolherbst: I already did
20:13 imirkin_: kk
20:13 imirkin_: so the T(farm) is probably not there
20:13 imirkin_: tbh i'm not *100%* sure what all this rounding business with floats is
20:14 karolherbst: @P0 FFMA32I R0.CC, R48, -QNAN , R0; /* 0x27ffffffff000002 */
20:14 karolherbst: one odd thing though
20:15 karolherbst: 0x27fffffffffffffa doesn't work
20:15 karolherbst: seems like one bit does nothing on the limm fma version then
20:15 karolherbst: uhh
20:15 karolherbst: that looks odd: @!PT FFMA32I.INVALIDFMZ3.SAT.S RZ.CC, -RZ, -QNAN , -RZ; /* 0x27fffffffffffff2 */
20:16 imirkin_: why?
20:16 karolherbst: those -RZ
20:16 imirkin_: negative flags
20:16 karolherbst: k, then the RZ looks odd
20:16 karolherbst: or is it zero reg?
20:16 imirkin_: RZ = R255
20:16 karolherbst: k
20:16 imirkin_: (or R63)
20:16 karolherbst: yeah, understood
20:16 imirkin_: (depending on the isa)
20:17 karolherbst: should I just remove T(farm) then?
20:17 imirkin_: i think so
20:17 karolherbst: shall I check fermi before?
20:17 imirkin_: SM20 tends to == SM30
20:17 imirkin_: except around images/texturing
20:18 karolherbst: yeah, looks pretty much the same
20:18 imirkin_: and kepler also gained a few new ops, like AL2P
20:19 karolherbst: imirkin_: is this fine? https://github.com/karolherbst/envytools/commit/8a7f7d2800ec95279c6884b2591c45fdb3e70b0f
20:20 imirkin_: let's see... farm is on bits 55:56
20:20 karolherbst: yeah, the limm is there
20:20 imirkin_: static struct rbitfield limmoff = { { 0x1a, 32 }, .wrapok = 1 };
20:21 imirkin_: yes, so T(farm) can't possibly exist there
20:21 imirkin_: so that change lgtm
20:21 imirkin_: it's also why parsing most likely failed
20:21 karolherbst: envydis does the right thing too
20:21 imirkin_: the two fields were fighting
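The overlap can be checked numerically. The sketch below takes the field positions quoted above (limm at bit 0x1a, 32 bits wide; farm on bits 55:56) and applies them to the two FFMA32I encodings tried earlier; the helper names are made up for illustration.

```python
import struct

LIMM_SHIFT, LIMM_WIDTH = 0x1A, 32          # from `limmoff = { { 0x1a, 32 }, ... }`
LIMM_MASK = ((1 << LIMM_WIDTH) - 1) << LIMM_SHIFT
FARM_MASK = 0x3 << 55                      # farm on bits 55:56

def limm_bits(insn):
    """Extract the 32-bit long immediate from a 64-bit instruction."""
    return (insn & LIMM_MASK) >> LIMM_SHIFT

def limm_f32(insn):
    """Interpret the long immediate as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", limm_bits(insn)))[0]

if __name__ == "__main__":
    print(hex(FARM_MASK), (FARM_MASK & LIMM_MASK) == FARM_MASK)
    for insn in (0x20FD33333628DC02, 0x21FD33333628DC02):
        print(f"{insn:#018x} limm={limm_bits(insn):#010x} value={limm_f32(insn)!r}")
```

Setting bit 56 to select a different "rounding mode" really sets bit 30 of the immediate, which is why the disassembly jumped from 0.8 to about 2.7e38 instead of showing a rounding suffix.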
20:21 karolherbst: 00000000: 00000002 2f000000 $p0 add f32 $r0 $c $r0 0xc0000000
20:21 imirkin_: did you mean to do 27?
20:21 karolherbst: ...
20:22 imirkin_: 28 = add, 20 = fma
20:22 karolherbst: yes
20:22 karolherbst: 00000000: 00000002 27f00000 $p0 fma rz f32 $r0 $c $r0 0xfc000000 $r0
20:22 imirkin_: where'd the rz come from?
20:22 karolherbst: @P0 FFMA32I R0.CC, R0, -2.65845599156983174581e+36, R0; /* 0x27f0000000000002 */
20:22 karolherbst: from the tabfarm, I didn't rebuild
20:22 imirkin_: oh :)
20:22 karolherbst: I just edited it on github
20:23 imirkin_: pick an immediate like 0x3f800000 so it isn't astronomical
20:23 karolherbst: @P0 FFMA32I R0.CC, R0, -512, R0; /* 0x2710000000000002 */
20:23 karolherbst: $p0 fma rp f32 $r0 $c $r0 0xc4000000 $r0
20:23 imirkin_: that sounds like it could be right
20:24 karolherbst: guess those things are fighting real bad in envydis
20:25 karolherbst: with the fixed one:
20:25 karolherbst: 00000000: 00000002 27100000 $p0 fma f32 $r0 $c $r0 0xc4000000 $r0
20:25 imirkin_: cool
20:25 karolherbst: 0xc4000000 is -512 indeed
20:26 karolherbst: so, what is the default rounding mode then? .D
20:26 imirkin_: rz
20:27 karolherbst: k
20:28 karolherbst: mhh how should this be handled within nv50ir then?
20:28 karolherbst: just ignore those bits?
20:28 imirkin_: ignore
20:28 imirkin_: we never touch those
20:28 karolherbst: okay
20:28 karolherbst: I guess they only affect things really slightly
20:28 imirkin_: yeah
20:29 imirkin_: they obviously matter for float -> int conversions, but that's about it
20:51 karolherbst: imirkin_: do you think it would be good enough to just check against the chipset in NV50PostRaConstantFolding or is there an actually better solution to this?
20:51 karolherbst: like moving it into the target
20:52 imirkin_: i haven't thought about it
20:53 karolherbst: well there could be a target function like "constantFold(insn)" or something
20:53 karolherbst: mhh
20:53 karolherbst: looks ugly
20:54 karolherbst: thing is, for gf100 there is indeed a predicate on the limm fma
20:54 karolherbst: @P3 FFMA32I R0.CC, -R48, -512.00018310546875, -R0; /* 0x271000000f000f02 */
20:54 imirkin_: huh?
20:55 karolherbst: @P1 FFMA32I R0.CC, R0, 0, R0; /* 0x2400000000000402 */
20:55 karolherbst: P1
20:55 karolherbst: p2: @P1 FFMA32I R0.CC, R0, 0, R0; /* 0x2400000000000402 */
20:55 imirkin_: that's the same for all instructions
20:55 karolherbst: ...
20:55 karolherbst: @P2 FFMA32I R0.CC, R0, 0, R0; /* 0x2400000000000802 */
20:55 karolherbst: ahh I see
20:55 karolherbst: because the current const fold pass checks for that
20:55 imirkin_: yeah, it's a restriction on nv50
20:55 imirkin_: on nv50 not all instruction forms can be predicated
20:56 karolherbst: I see
20:56 imirkin_: fermi+ is really a much more regular isa
20:56 karolherbst: more sane as well I guess
20:56 imirkin_: still pretty irregular
20:56 imirkin_: but more regular ;)
20:56 karolherbst: mhh
20:57 karolherbst: a gf100 version of that pass would be a lot more simple
20:57 karolherbst: the >=64 reg change still needs to be done for fermi I think?
20:58 karolherbst: wow, with my patch echo '$p2 fma f32 $r0 $c $r0 0x0 $r0' | envyas works now :)
21:00 karolherbst: or was fermi also 63 reg?
21:01 imirkin_: fermi has 63 regs
21:01 imirkin_: kepler1 has 63
21:01 imirkin_: kepler2+ has 255
21:01 karolherbst: k
21:01 karolherbst: tesla has 127?
21:01 imirkin_: ya
21:01 karolherbst: okay
21:01 imirkin_: 254 half-regs, depending how you count
21:01 karolherbst: k
21:01 karolherbst: now let me check SM35
21:02 imirkin_: iirc i had a change fixing it up for SM35
21:02 imirkin_: i forget if i pushed it
21:02 imirkin_: (to emit_gk110.cpp that is)
21:02 imirkin_: nope
21:03 imirkin_: https://github.com/imirkin/mesa/commit/6233d4b6e761c8ba6831b42835478c527617687a
21:03 karolherbst: k, well first I will play with nvdisasm
21:03 imirkin_: this was the result of playing with nvdisasm :)
21:04 karolherbst: the todo for the limm is within gk110 by the way
21:06 karolherbst: uhh nice, looks pretty much the same over all
21:12 karolherbst: odd
21:12 karolherbst: I've added "{ 0x6000000000000000ull, 0xf800000000000003ull, N("fma"), N("f32"), DST, SRC1, LIMM, DST },"
21:12 karolherbst: shouldn't fma f32 $r0 $r0 0x0 $r0 work with this?
21:16 imirkin_: ought to, yea
21:16 karolherbst: 00000000: 00000000 60000000 ??? [unknown: 00000000 60000000] [unknown instruction]
21:16 karolherbst: mhh
21:16 imirkin_: although i think SRC1 is wrong
21:17 imirkin_: might need to be SRC3, i forget
21:17 karolherbst: still nothing though
21:17 karolherbst: do I have to add it to tabm?
21:18 imirkin_: uhh
21:18 imirkin_: dunno
21:18 karolherbst: the other fma is in tabm
21:19 karolherbst: ohh
21:19 karolherbst: added it to tabc
21:19 karolherbst: works
21:19 imirkin_: i don't remember how the dispatch happens between the various tables
21:19 imirkin_: have a look at tabroot
21:20 karolherbst: yeah, tabc seems to work for now
21:21 karolherbst: SRC1 is right by the way
21:22 imirkin_: ok
21:22 karolherbst: what is dnz in codegen?
21:22 imirkin_: FMZ
21:22 karolherbst: k
21:26 karolherbst: odd
21:26 karolherbst: there seems to be no neg modifier
21:26 imirkin_: i found it
21:26 imirkin_: look at my commit
21:27 karolherbst: I know
21:27 karolherbst: but nvdisasm doesn't think so :/
21:27 karolherbst: maybe I do something wrong
21:27 imirkin_: i think you do.
21:27 karolherbst: @!PT FFMA32I.INVALIDFMZ3.SAT.S RZ.CC, RZ, -QNAN , RZ; /* 0x67fffffffffffffc */
21:28 imirkin_: right
21:28 karolherbst: ahh
21:28 karolherbst: now I got it
21:28 imirkin_: so the sign on the immediate is flipped
21:28 karolherbst: @P0 FFMA32I R0, -R0, 0, R0; /* 0x6800000000000000 */
21:28 imirkin_: :)
21:29 imirkin_:  if (neg1)
21:29 imirkin_:  code[1] ^= 1 << 27;
21:29 imirkin_: that'd be that one.
21:29 karolherbst: yeah
21:29 imirkin_: and on src it's at
21:29 imirkin_: NEG_(3c, 2);
21:29 karolherbst: mhhh
21:29 karolherbst: something is odd
21:29 imirkin_: so the next bit over
21:30 imirkin_: er, on src2
21:30 karolherbst: ohh 3c
21:30 imirkin_: i.e. try 7000000000
21:30 karolherbst: wait a sec
21:30 karolherbst: "{ 0x6000000000000000ull, 0xf800000000000003ull, N("fma"), T(ftz38), T(fmz39), T(sat3a), N("f32"), DST, T(neg3b), SRC1, LIMM, DST },"
21:31 karolherbst: shouldn't fma sat f32 $r0 neg $r1 0x0 $r0 assemble then?
21:32 imirkin_: seems reasonable
21:32 karolherbst: mhhh
21:32 karolherbst: ohh wait
21:33 karolherbst: mhh
21:33 karolherbst: no should be right
21:33 imirkin_: might want to make the mask be e000000000000000
21:33 imirkin_: instead of f80000000000
21:33 karolherbst: ohhh I see
21:34 karolherbst: yep, 0x7.. is neg on the last src
21:34 imirkin_: and make sure to throw in a T(neg3c) before the final src
21:34 imirkin_: (and add a neg3c if it doesn't already exist)
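Why the mask has to shrink can be shown with a few lines of arithmetic. This sketch assumes the two leading constants of a table entry act as a match value and a mask, with an entry matching when `(insn & mask) == value`; the constant names are hypothetical, while the bit positions 0x3b and 0x3c come from the conversation.

```python
OPCODE = 0x6000000000000000
MASK_F8 = 0xF800000000000003   # original mask: covers bits 59..63
MASK_E0 = 0xE000000000000003   # suggested mask: covers bits 61..63
NEG_3B = 1 << 0x3B             # same bit as `code[1] ^= 1 << 27`
NEG_3C = 1 << 0x3C             # neg on the last source

def matches(insn, value, mask):
    """Table-style match: compare only the bits selected by the mask."""
    return (insn & mask) == value

if __name__ == "__main__":
    for insn in (OPCODE, OPCODE | NEG_3B, OPCODE | NEG_3C):
        print(f"{insn:#018x}", matches(insn, OPCODE, MASK_F8), matches(insn, OPCODE, MASK_E0))
```

With the wider mask, setting either neg bit changes bits that are treated as part of the opcode, so the entry stops matching; the narrower mask leaves bits 59 and 60 free to act as modifiers.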
21:34 karolherbst: yep, it doesn't exist
21:35 karolherbst: it starts to look good
21:36 karolherbst: cc is at 37
21:40 karolherbst: how do I get that .CC thing into envydis?
21:41 imirkin_: acout
21:41 imirkin_: $c register
21:41 karolherbst: ahh right
21:42 imirkin_: on nv50 there were actually a bunch of them
21:42 imirkin_: and they could hold funky values
21:42 imirkin_: a lot closer to EFLAGS on x86
21:42 karolherbst: uhh
21:42 imirkin_: but nvc0 it's just a single-bit value
21:43 karolherbst: "{ 0x6000000000000000ull, 0xe000000000000003ull, N("fma"), T(p), T(ftz38), T(fmz39), T(sat3a), N("f32"), DST, T(acout37), T(neg3b), SRC1, LIMM, T(neg3c), DST }," anything missing?
21:43 imirkin_: seems reasonable
21:44 karolherbst: there is a frm36 on the fimm one
21:44 karolherbst: but I guess that is gone for the limm
21:44 imirkin_: yes
21:44 imirkin_: i think the LIMM hits that bit
21:44 karolherbst: right
21:44 imirkin_: the acout37 definitely does
21:45 karolherbst: nope, the acout37 is fine :p
21:45 imirkin_: i think it conflicts with frm36, no?
21:45 karolherbst: yes
21:45 karolherbst: @P0 FFMA32I R0.CC, R0, 0, -R0; /* 0x7080000000000000 */
21:45 karolherbst: ...
21:46 karolherbst: @P0 FFMA32I R0, R0, -0, -R0; /* 0x7040000000000000 */
21:46 karolherbst: or is 36 just the immediate neg?
21:46 karolherbst: would be odd if something like that exists
21:46 karolherbst: ehm
21:46 karolherbst: I am silly
21:47 imirkin_: it's just the high bit of the immed
21:47 karolherbst: yeah
21:48 karolherbst: mhhh, I am wondering
21:48 karolherbst: @P3 FFMA32I R0, R192, 2, -R0; /* 0x70200000000f0000 */
21:48 karolherbst: ohh wait
21:48 karolherbst: it looks fine actually
21:49 karolherbst: k, I think this is right then :)
21:56 karolherbst: the gm107 stuff looks strange
21:56 imirkin_: it's closer to how nvdisasm does things
21:56 karolherbst: yeah, I noticed
21:56 imirkin_: i think ben wanted to avoid thinking as much as possible
21:56 karolherbst: mhh
21:57 karolherbst: now we have to think more
21:57 karolherbst: or does NV50_PROG_DEBUG also look more like disasm for maxwell?
21:57 karolherbst: I doubt that though
21:58 imirkin_: heh
21:58 imirkin_: NV50_PROG_DEBUG is printing the nv50 ir
21:58 imirkin_: and nothing else
21:58 karolherbst: but there is a nice limm fma for maxwell already
21:59 karolherbst: 53 bit isn't set though
21:59 karolherbst: let's see
22:00 karolherbst: ohh wait, it is ftz
22:00 karolherbst: but why
22:04 karolherbst: odd
22:04 karolherbst: "Unaligned instruction found"
22:04 karolherbst: does nvdisasm actually require those sched things?
22:05 imirkin_: SM50 requires a full block at a time
22:05 imirkin_: sched + 3 instructions
22:05 imirkin_: super-annoying
22:05 karolherbst: mhh
22:05 karolherbst: segfault
22:05 imirkin_: can't just feed it crap
22:05 imirkin_: has to all be real
22:05 karolherbst: messy
22:05 imirkin_: very.
22:06 karolherbst: like I care about maxwell with conditions like that
22:06 imirkin_: now you know how i feel :)
22:07 karolherbst: doesn't "0x00000000 0x00000000 0x00000000 0x0c000000 0x00000000 0x0c000000 0x00000000 0x0c000000" look sane to you?
22:07 imirkin_: sorry
22:07 imirkin_: grab the first 2 words from a real thing
22:08 imirkin_: try fc0007e0 001f8000
22:08 karolherbst: thanks
22:08 karolherbst: it works
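The "sched + 3 instructions" layout can be assembled with a small helper. This is a sketch that assumes little-endian 32-bit words with the low half of each instruction first, and it reuses the two scheduling words suggested above; the function name and the placeholder instruction value are for illustration only.

```python
import struct

SCHED_WORDS = (0xFC0007E0, 0x001F8000)  # scheduling words taken from the log

def sm50_block(instructions, sched=SCHED_WORDS):
    """Build one 32-byte block: 2 sched words followed by 3 instructions."""
    if len(instructions) != 3:
        raise ValueError("a block holds exactly three 64-bit instructions")
    words = list(sched)
    for insn in instructions:
        words += [insn & 0xFFFFFFFF, insn >> 32]  # low word first
    return struct.pack("<8I", *words)

if __name__ == "__main__":
    block = sm50_block([0x0C00000000000000] * 3)
    print(len(block), block.hex())
```

The output is 32 bytes, which is the unit the disassembler expects for this ISA according to the conversation.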
22:10 karolherbst: anyway, I did enough for today :)
22:10 karolherbst: it will get messy to make the pass in a way that I don't mess up the mods and other things
22:11 karolherbst: maybe I should just write one version for each isa or so
22:14 imirkin_: thus far we've avoided that sort of thing
22:14 karolherbst: maybe I just write a new one for gf100+
22:14 karolherbst: tesla has all that messy things, so the code would just look even wrose
22:14 karolherbst: *worse
22:23 RSpliet:grmbls a bit
22:23 RSpliet: whatever that thing is at the position in the PDAEMONs channel info at the PDE offset, it's not a PDE
22:24 imirkin_: you've stared it square in the face, and it doesn't look like a PDE to you? :)
22:24 RSpliet: it has all the hallmarks of a PDE from a distance, but it points to something way beyond my physmem
22:26 imirkin_: high bit means secure?
22:26 imirkin_: i dunno how big PA is
22:26 imirkin_: VA is 40-bit
22:26 imirkin_: where are you looking at this?
22:28 RSpliet: ok, retracing my steps. It's a fermi machine, and if I interpret nvkm/subdev/mmu/gf100 correctly, the high 28 bits should be the high 28 bits of the 40-bit ptr
22:29 RSpliet: two words (don't know why; one big pages, one small pages?) should be at offset 0x1000 and 0x1004
22:31 RSpliet: I get the channel info ptr from reg 0x10a47c ( 7022166d -> 0x22166d000, PDEs should then be at 22166e000)
22:32 imirkin_: keep in mind that there are also hugepages
22:33 imirkin_: i think.
22:33 RSpliet: according to that logic, my PDEs are 44453401, 4d204341
22:33 RSpliet: their enable bit nicely set
22:35 RSpliet: but they appear to point to the moon
22:35 RSpliet: skeggsb: is there an obvious fallacy in my logic?
22:37 RSpliet: (because last time I checked, I didn't have ~270GB of RAM)
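The arithmetic behind "pointing to the moon" can be replayed as follows. This sketch only reproduces RSpliet's reading of the numbers: the 28-bit mask on the register value and the use of bit 0 as the enable bit are assumptions inferred from the values he quotes, not a verified description of the hardware layout.

```python
def channel_base(reg):
    """Channel info address, assuming the low 28 bits hold address >> 12."""
    return (reg & 0x0FFFFFFF) << 12

def pde_target(word):
    """Target address, assuming the high 28 bits are bits 12..39 of the pointer."""
    return (word >> 4) << 12

if __name__ == "__main__":
    base = channel_base(0x7022166D)
    print(hex(base), hex(base + 0x1000))  # channel info, then the PDE location
    for word in (0x44453401, 0x4D204341):
        addr = pde_target(word)
        print(hex(word), "enable" if word & 1 else "off", hex(addr), addr / 2**30, "GiB")
```

Under these assumptions the first entry lands at roughly 273 GiB, which is where the "~270GB" figure comes from and why the value is unlikely to be a real page directory entry.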
23:59 kloofy: seems like you know it uses context, and kernel sets some sort of interrupt/exception to change context on cpu's