00:12 karolherbst: ehh.. now demmt segfaults... very nice
08:48 karolherbst: mhhh
08:48 karolherbst: we need a solution for this uvm stuff :/
14:20 random-nick: will nouveau ever have vulkan drivers for Kepler cards?
14:23 HdkR: Better question is if there will ever be Nouveau based vulkan drivers at all
14:29 gnarface: don't hold your breath
14:29 gnarface: unless AMD decides to pony up for that
14:31 HdkR: hah. AMD paying for Nouveau stack? Sounds counterproductive
15:12 karolherbst: uff, new methods
15:12 karolherbst: 0x00001001
15:14 HdkR: :D
15:15 karolherbst: only 12 bytes, so... manageable
15:15 karolherbst: and the first uint32_t is a handle
15:15 karolherbst: and the third is an out bool for success
15:20 RSpliet: HdkR: actually AMD has contributed quite a lot to nouveau, through investment in common DRM and mesa infrastructure.
15:20 wrl: AMD funding nouveau would be so cheeky and I would enjoy it so much
15:21 karolherbst: RSpliet: well, only because they have to
15:21 HdkR: Common code is understandable
15:21 karolherbst: right, but I never got the feeling AMD itself is dedicated to common code
15:22 karolherbst: it was just there and it made sense to improve it
15:22 RSpliet: every business pursues a selfish agenda. That's not always incompatible with the common good :-)
15:22 HdkR: You could say that Intel is doing the same thing :P
15:22 karolherbst: intel has their classic driver
15:22 karolherbst: although at least from intel devs I get the feeling they are more interested in sharing code than AMD
15:23 RSpliet: I bet AMD's team is just significantly smaller. Engineers are under pressure to make things work, not to make things necessarily right. I felt a lot of them try hard :-)
15:26 karolherbst: probably yes
15:29 karolherbst: wished that nvidia-uvm would be under a proper open source license ....
15:29 karolherbst: ohh wait.. maybe it is
15:30 karolherbst: it's MIT
15:31 karolherbst: mhhhh
15:31 karolherbst: cool
15:31 karolherbst: I guess I can add uvm support to demmt then
15:31 karolherbst: we just have to respect the version used somehow
15:36 karolherbst: mhh
15:36 karolherbst: HdkR: I like when values just appear out of thin air :/
15:37 HdkR: Oh?
15:38 karolherbst: https://gist.githubusercontent.com/karolherbst/1f7a30baea2894f872db0be4928f5456/raw/f0f59fdfe47d745bb670d89720b9d2ca20cdc96e/gistfile1.txt
15:38 karolherbst: "0xcaf00005"
15:38 karolherbst: mhh, although
15:38 karolherbst: maybe that's a result of NVRM_IOCTL_VSPACE_MAP
15:39 HdkR: Maybe
19:28 karolherbst: found it!
19:28 karolherbst: https://gist.githubusercontent.com/karolherbst/b90ade9c36c27867b8560867ba1ac4b6/raw/c5b50f8ff13fb57701fffb71e92e167304eca8f1/gistfile1.txt
19:30 karolherbst: okay.. now what kind of weirdo ioctl is that
19:47 karolherbst: uff, a new nvrm_ioctl_memory variant
19:47 karolherbst: great
20:08 karolherbst: \o/
20:08 karolherbst: it is working
20:12 karolherbst: huh
20:13 karolherbst: heh.. github has deactivated our IRC bot :/
20:14 imirkin: that was ages ago
20:14 karolherbst: uff
20:14 karolherbst: being the bot myself: https://github.com/envytools/envytools/pull/192
20:15 karolherbst: imirkin: and we have to get that nvidia-uvm stuff working. As it seems the nvidia header files are MIT licensed, so I guess we could just copy those into envytools
20:15 karolherbst: (locally I just add the unpacked driver as an include directory)
20:15 imirkin: worksforme
20:15 karolherbst: yeah.. no idea what to do about versioning though
20:15 karolherbst: we kind of have to keep track of changes
20:17 karolherbst: mhh, meh
20:17 karolherbst: mupuf: do you know the pw of the envytools user?
20:17 karolherbst: or... did we use something else actually?
20:20 imirkin: envytools user?
20:20 imirkin: it's a group
20:20 imirkin: that you are most likely a member of
20:20 karolherbst: no, the IRC user
20:20 imirkin: oh
20:20 karolherbst: anyway... github made it a pain to hook into IRC
20:21 karolherbst: but.. do we care?
20:21 karolherbst: ufff
20:21 karolherbst: eh
20:21 karolherbst: crappy github
20:24 karolherbst: mhh, let's see if I find this shader...
20:24 karolherbst: UFFFF
20:24 karolherbst: UFFF
20:24 karolherbst: imirkin: remember _our_ shaders running out of stack space?
20:25 karolherbst: guess what nvidia does
20:25 karolherbst: https://gist.githubusercontent.com/karolherbst/176443454e1f402cf51f9d2bb0630fec/raw/850050629b64822f02702d3fa79e528740317595/gistfile1.txt
20:25 karolherbst: is it actually the same or just a random shader...
20:28 karolherbst: mhhh. I think this is the shader
20:28 karolherbst: RSpliet: ^^
20:28 karolherbst: sooo, that's what we have to do
20:31 karolherbst: that's insane
20:32 karolherbst: ohh, and they even optimize this useless loop away
20:32 karolherbst: oh bother
20:33 karolherbst: uhm, if
20:33 karolherbst: not loop
20:33 karolherbst: heh.. "0x00000600 3 = { WARP_CSTACK_SIZE = 1536 }"
20:33 karolherbst: and this as well
20:33 karolherbst: weird
20:34 karolherbst: 999 -> 1 iterations, and the stack size is 0
20:36 karolherbst: with 100 iterations that shader doesn't look that bad
20:38 karolherbst: mhh, 25 iterations, just 1kb
20:44 karolherbst: imirkin: okay... so I guess we just detect silly nested loops and bail to the off-chip stack...
21:21 RSpliet: "ipa pass"... never realised Maxwell was such a hipster
21:22 karolherbst: mhh, still some errors in demmt... maybe I figure those out as well
21:22 karolherbst: at least with those patches I am able to decode traces from my pascal gpu
21:22 RSpliet: But... essentially they completely got rid of the break logic. Predicated sync sounds odd to me. What does that mean?
21:23 karolherbst: RSpliet: predicated branch
21:23 karolherbst: either you jump and sync, or you don't
21:24 RSpliet: That... means it's a divergence point?
21:24 karolherbst: yes
21:24 karolherbst: sync == bra + thread sync
21:24 karolherbst: well, with the target pushed through ssy
21:25 RSpliet: I always considered sync as a pop off the stack. restoring all threads that were active at ssy. I guess $p1 sync is the same as "disable all threads for which !$p1 and continue"
21:25 karolherbst: like a normal $p1 bra
21:26 RSpliet: Only I see no reason to branch...
21:26 karolherbst: maybe not enough predicates
21:26 karolherbst: or ...
21:26 karolherbst: dunno
21:26 karolherbst: I am sure there is a reason
21:27 RSpliet: I'm reluctant to talk, because we've already shown my/the patents' understanding of HW is lacking
21:27 RSpliet: But
21:27 RSpliet: :-D
21:27 karolherbst: that shader is insane, isn't it?
21:28 RSpliet: It doesn't shock me
21:29 karolherbst: well, I am surprised by how much nvidia tries to flatten the loops
21:29 karolherbst: no idea what they do if there are deeper loops
21:29 RSpliet: ahh, yeah I'm pretty sure $p1 sync should mean "disable all threads for which $p1 is true". The pop off the stack (the actual branch) will occur somewhere in the future
21:29 karolherbst: ohhhh
21:29 karolherbst: I know what's happening
21:29 RSpliet: what do you mean with flatten? Unroll?
21:29 karolherbst: flatten
21:30 karolherbst: so you don't have nested loops anymore
21:30 karolherbst: sooo
21:30 karolherbst: okay, I see what's going on here
21:30 RSpliet: What, like merging invariants?
21:30 karolherbst: RSpliet: they make it so, that all threads go through the same path, but through predicates they select which of the original path each thread goes through
21:30 karolherbst: and the threads get synced on each loop iteration
21:31 karolherbst: so instead of having a loop { loop { loop { ... } } } thing
21:31 RSpliet: that's... the nature of SIMD execution
21:31 karolherbst: they do a loop { $p0 = path0_selector; $p1 = path1_selector; .... }
21:32 karolherbst: essentially all loops merged into one
21:32 karolherbst: with tons of predicated instructions
21:32 karolherbst: (that's why this shader also uses _all_ predicates available)
21:33 karolherbst: but that has one huge benefit: no diverged threads
21:33 karolherbst: (if we ignore this little sync stuff at the top)
21:34 karolherbst: mhh and $p0 is even uniform
21:34 karolherbst: as it's the outer loop
21:34 karolherbst: so there are those "not $p6 bra 0x40... not $p0 bra 0x18" at the end of the shader
21:34 karolherbst: or well, loop
21:35 RSpliet: In any SIMD processor, divergence is *always* handled by predicated execution. The only choice you have is implicit (ssy+sync, prebrk+brk, call+ret, exit) or explicit using predicate registers...
21:35 karolherbst: but the threads sync up immediately at the loop header
21:35 karolherbst: RSpliet: sure.. but the original shader is like 3 nested loops
21:35 karolherbst: this binary has one loop
21:36 karolherbst: funny enough, each loop has a constant iteration count.. wondering what happens if we make them all runtime variable
21:36 RSpliet: I wonder how they re-worked the loop invariant then. I recall it being outermost from 0 to 3, inbetween from 0 to 3, innermost 0 upto 1000
21:36 karolherbst: yeah
21:37 karolherbst: well, they have 7 predicates available
21:37 karolherbst: should be enough
21:37 karolherbst: well, it is obviously
21:37 RSpliet: What I'm saying is: you can't just iterate from 0 to 16000, as you'd lose the innermost subtlety that you can break out early
21:37 karolherbst: outer loop: "isetp ge and $p0 0x1 $r54 0x4 0x1"
21:37 RSpliet: you can merge the outermost two loops quite easily...
21:38 karolherbst: middle loop: isetp ne and $p6 0x1 $r56 0x4 0x1
21:38 karolherbst: inner loop: "isetp ge and $p1 0x1 $r58 0x3e8 0x1" and "isetp lt and $p1 0x1 $r53 0x3e8 0x1"
21:38 karolherbst: mhh
21:38 karolherbst: the middle loop was swapped with the inner one, it seems
21:39 karolherbst: RSpliet: thing is, those loops aren't actually "merged"
21:39 karolherbst: they are "flattened"
21:39 karolherbst: although maybe in the end it's the same
21:40 karolherbst: but the idea is not "how do I make one loop out of it", but "how do I implement loops as predication"
21:40 RSpliet: Yeah I got that
21:41 karolherbst: it's essentially the way how you have to implement crypto :D
21:41 RSpliet: I think I'm beginning to see what you could do. Not sure if that's what really happens... but you can update all three loop invariants sort-of-conditionally (only update one, not all three) at the end of each iteration of this loop, and then you only need a single branch
21:42 karolherbst: each thread takes all the paths, and you select the result at the end
21:42 RSpliet: branching is just an easy mechanism to avoid the other two invariants from being updated
21:42 karolherbst: only that nv hw has predicates, which makes the selecting a bit easier
21:42 RSpliet: Yeah
21:43 RSpliet: The coolness here is that threads can actually diverge _more_ without a penalty
21:43 RSpliet: What I mean by that is... eh
21:43 karolherbst: _but_
21:43 karolherbst: this is a speed opt
21:43 karolherbst: nv also enables the off-chip cache
21:43 karolherbst: 1.5 kB
21:44 karolherbst: inner loop with 25 iterations leads to 1kB
21:44 RSpliet: if we represent the loop counters as a three-tuple (outermost, middle, innermost), one thread in a WG could be in iteration (0,1,15) and another one in iteration (0,0,27)
21:44 RSpliet: that first thread doesn't have to wait for all threads in the WG to reach (0,1,0)
21:45 karolherbst: thing is, they won't be out of sync
21:45 RSpliet: So you kind of cancel out the penalty paid by the "random" breaking out of the inner loop
21:45 RSpliet: or... amortise it rather
21:45 karolherbst: or mhhh, maybe they could be actually?
21:45 karolherbst: shouldn't matter anyway
21:46 karolherbst: yeah...
21:46 RSpliet: They can be, there are per-thread iterators. No, this matters, it's a really awesome optimisation :-D
21:46 karolherbst: yeah
21:46 karolherbst: anyway, they still enable the off-chip cache, so we have to do it as well :p
21:47 karolherbst: 1.5kB seems to be the most nvidia goes with though
21:48 RSpliet: That's already quite a lot of stack
21:48 karolherbst: especially because it's VRAM
21:49 RSpliet: (thanks for pointing out "what NVIDIA does" by the way. Understanding that made me feel clever yeah :-P)
21:49 karolherbst: but I guess the L1/L2 cache will take care of that so it doesn't suck
21:50 RSpliet: Oh 100%. Besides, we should keep in mind "enabling" and "using" are two different things. Being on the safe side isn't harmful unless it means disabling the on-chip stack altogether
21:50 karolherbst: I am sure it disables the on-chip stack
21:50 RSpliet: Oh...
21:50 karolherbst: yeah...
21:50 karolherbst: because with 512 bytes off-chip the issue is usually worse
21:51 karolherbst: or it was with the one shader I found
21:51 karolherbst: but this could also be caused by the off-chip latency
21:51 karolherbst: anyway, nvidia doesn't use 512 bytes either
21:51 karolherbst: either 0, 1kB or 1.5kB as it seems
21:51 karolherbst: the value has to be a multiple of 512B anyway
23:11 karolherbst: ehh, "ERROR: trying to destroy object 0xc1d00059 / 0xc1d00059 which does not exist!" annoying :/ can't just grep that stuff
23:12 karolherbst: ohh, uh, that's child/parent... ehh
23:12 karolherbst: that's weird
23:13 karolherbst: I think I leave that one alone
23:13 karolherbst: okay.. so that's the only error left I think on my pascal with the 430 driver