01:48 television: hi
13:00 imirkin_: karolherbst: you good to handle https://bugs.freedesktop.org/show_bug.cgi?id=111006 ? or need me to have a look?
13:01 karolherbst: uff
13:01 imirkin_: looks like something VERY weird is going on
13:01 karolherbst: I can look into that
13:01 karolherbst: yeah...
13:01 karolherbst: I think I know what
13:01 imirkin_: and probably something quite dumb.
13:02 karolherbst: probably NaN
13:02 karolherbst: random floats are always messy
13:03 karolherbst: ohh wait
13:03 karolherbst: it isn't random
13:03 karolherbst: oh well
13:03 karolherbst: I will take a look
13:10 karolherbst: imirkin_: can't reproduce
13:10 RSpliet: imirkin_: does the first "_GLF_outVarBackup_GLF_color = _GLF_color;" even have defined behaviour when it's defined as "out vec4 _GLF_color" ? Looks to me like a use before assignment, and that could mess with liveness analysis in strange ways...
13:10 karolherbst: mhh
13:10 karolherbst: will check with the mesa commit specified in the bug
13:10 imirkin_: karolherbst: which gpu?
13:10 karolherbst: 1050
13:11 imirkin_: shouldn't be materially different from the GTX 970
13:11 karolherbst: uhh
13:11 karolherbst: ahh, no, it is a 1050
13:11 karolherbst: yeah
13:11 karolherbst: I have a 970M here though as well. So I could check there
13:12 imirkin_: like i said though - shouldn't matter
13:12 karolherbst: RSpliet: it's dead code, but we don't do those opts yet
13:13 karolherbst: ohhh
13:13 karolherbst: now I get something funky
13:13 karolherbst: ahhh
13:13 karolherbst: that explains
13:13 imirkin_: "now" as in "on that commit"?
13:13 karolherbst: I think it's some memory stuff
13:14 imirkin_: erm
13:14 karolherbst: the correct stuff was already in memory
13:14 karolherbst: "[137017.538280] nouveau 0000:01:00.0: gr: GPC0/TPC0/MP trap: global 00000000 [] warp 10016 []" :)
13:14 karolherbst: ohh, for fuck sakes.. this recursive fault bug is super annoying
13:15 RSpliet: karolherbst: it's only dead code if that injectionSwitch uniform was a constant rather than a shader param.
13:15 karolherbst: oh, right, we don't do shader variants
13:16 karolherbst: ehh, I think this "Fixing recursive fault but reboot is needed!" triggers only with runpm
13:26 karolherbst: imirkin: yep.. after a fresh reboot I also get those errors.. so let's recheck master
13:27 karolherbst: also, the reference doesn't trigger traps, but the variant does
13:27 imirkin_: good
13:27 imirkin_: that means we're doing something demonstrably dumb
13:28 karolherbst: yeah
13:28 karolherbst: should be easy to figure out
13:28 karolherbst: the shader is quite simple
13:28 imirkin_: unless it's a stack thing
13:28 karolherbst: yeah...
13:28 karolherbst: I feared as much, but I doubt that, kind of
13:28 karolherbst: ohh, but there is a loop
13:28 karolherbst: mhh, so maybe
13:28 karolherbst: I still have this other shader which causes that stack issue
13:29 karolherbst: and this was actually my first guess on the issue
13:29 karolherbst: okay, happens on master as well
13:34 karolherbst: uf, how did I enable this off chip stack again
13:34 imirkin_: shader header
13:35 karolherbst: sure... but I was more wondering about which byte that was
13:35 imirkin_: shader header is documented
13:35 imirkin_: both in rnndb and on nvidia's open-gpu-doc
13:35 karolherbst: ohh, true
13:35 HdkR: memset the whole header to 0xff and just hope for the best :P
13:35 imirkin_: *finally* a voice of reason :p
13:36 HdkR: What generations is the header documented for?
13:36 karolherbst: ShaderLocalMemoryCrsSize?
13:36 karolherbst: I think it was that
13:37 imirkin_: HdkR: tnt2 :p
13:37 HdkR: lol
13:37 imirkin_: https://nvidia.github.io/open-gpu-doc/Shader-Program-Header/Shader-Program-Header.html
13:37 imirkin_: afaik that holds up through pascal, at least
13:37 HdkR: Oh
13:37 imirkin_: and until there's a gt 1630 released, no later generation exists to me :)
13:38 HdkR: :D
13:38 karolherbst: "The SPH field ShaderLocalMemoryCrsSize sets the additional (off chip) call/return stack size (CRS_SZ). Units are in Bytes/Warp. Minimum value 0, maximum 1 megabyte. Must be multiples of 512 bytes." yeah, that was it
13:38 HdkR: So nobody gets to get about Turing SPH then
13:39 HdkR: Such a simple thing to document and isn't available :/
13:39 imirkin_: no accel firmware for turing either
13:39 imirkin_: so it works out nicely
13:39 HdkR: Just means nobody /has/ to care about Turing
13:39 imirkin_: i have a hard time caring past kepler.
13:40 HdkR: :)
13:40 imirkin_: only reason i'll be plugging my gt1030 in is to play with the hdr infoframe
13:40 karolherbst: *sigh*
13:40 imirkin_: i guess i could do it on kepler, but then i don't get the vendor infoframe...
13:40 karolherbst: less errors with a small off chip stack
13:41 karolherbst: gone with a 1kb stack
13:41 imirkin_: ok. that's annoying.
13:41 karolherbst: so we probably want to fix this issue :/
13:41 imirkin_: can you report your findings in the bug?
13:41 karolherbst: yep
13:43 karolherbst: imirkin_: and we might want to make the dmesg print more meaningful :)
13:43 karolherbst: gr: GPC0/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 300016 []
13:44 karolherbst: 0x16 is probably this call/return stack overflow stuff
13:44 imirkin_: probably.
13:44 HdkR: imirkin_: Turing changes HDR interactions as well. Will be some more fun things there :P
13:44 imirkin_: at least on that shader gen
13:44 imirkin_: HdkR: until dell starts shipping gt 1630's in all their new pc's for free, i doubt i'll have a look.
13:45 imirkin_: HdkR: it's just an infoframe, no?
13:45 imirkin_: (the "vendor" infoframe)
13:45 imirkin_: or are you talking about CTM/etc?
13:46 imirkin_: HdkR: i think we already know that turing's internal processing is basically all done in fp16 now
13:46 HdkR: I don't actually know, I just remember HDR bits changing there
13:46 karolherbst: imirkin_: anyway, any "smart" ideas on how to fix it?
13:46 HdkR: Sounds correct
13:46 imirkin_: karolherbst: https://previews.123rf.com/images/andreykuzmin/andreykuzmin1409/andreykuzmin140900082/31721808-scared-ostrich-burying-its-head-in-sand.jpg
13:47 karolherbst: mhhh
13:47 imirkin_: it's a fool-proof technique.
13:47 karolherbst: we could use a 1kb stack whenever we find a loop in a shader and just go on with live...
13:47 karolherbst: or well..
13:47 HdkR: karolherbst: Make the CRS like 8MB and call it good :D
13:47 karolherbst: a loop with some if or another loop
13:47 imirkin_: karolherbst: or we could rewrite the loop to not be dumb like this, iirc
13:47 karolherbst: HdkR: mhh, yeah, doesn't matter anyway
13:48 imirkin_: i think if we just flip the "if" and "else" bits, the problem goes away
13:48 karolherbst: imirkin_: thing is, the loop doesn't do anything dumb
13:48 imirkin_: the glsl code is fine. we're the idiots.
13:48 karolherbst: there is just the normal prebreak + break stuff
13:48 RSpliet: karolherbst: do we have a compiler pass that can determine the maximum stack depth statically?
13:48 karolherbst: + the joinat+join stuff from the if
13:48 karolherbst: nothing more
13:48 karolherbst: RSpliet: no, as we have only a poor understanding of this
13:48 karolherbst: we talked about it before
13:48 imirkin_: where by "we" karolherbst means him and me.
13:49 imirkin_: whereas RSpliet and everyone else has a great understanding of it
13:49 karolherbst: and our estimate was multiple orders of magnitude to low
13:49 imirkin_: for some reason i can't seem to wrap my head around it
13:49 karolherbst: imirkin_: not good enough to make an estimation of how much of the stack we use
13:49 imirkin_: even though in principle the concept should be simple.
13:49 karolherbst: that's also what RSpliet failed at
13:50 karolherbst: we could poke nvidia about it though and get an answer in a year :p
13:50 RSpliet: karolherbst: to be fair, I never tried. The concept of the stack is quite straightforward and there's even some patents explaining how they work ;-)
13:50 karolherbst: sure
13:50 karolherbst: but again, we didn't understand what was going on with that issue when I asked the last time about it
13:51 karolherbst: anyway, if a loop + if is able to trigger this issue, we are kind of screwed anyway
13:51 RSpliet: Not denying that. I also only threw limited hours against it.
13:51 karolherbst: what is there really to do besides just enabling the off-chip stack?
13:51 karolherbst: sure.. we could try to figure out the amount of iterations of the loop and stuff
13:52 karolherbst: but doesn't help in the case we don't
13:52 RSpliet: karolherbst: a well written loop is one for which the number of iterations doesn't matter for the stack depth
13:52 karolherbst: RSpliet: so.. the thing is, whenever we do an iteration, we push to the stack via a joinat
13:52 karolherbst: there is no other explanation
13:52 RSpliet: Yeah, that's my definition of a poorly written loop :-D
13:53 karolherbst: well
13:53 karolherbst: how else should we handle the if then?
13:53 karolherbst: for the general case
13:53 RSpliet: Oh let me rephrase that. In the body of a well-written loop you push as many items as you pop
13:53 karolherbst: this is the shader btw: https://gist.githubusercontent.com/karolherbst/8ed234cd9e3e0e51c90524d67f9488ee/raw/48624aca37767209bd11d6c5d3b82c278affe58f/gistfile1.txt
13:54 karolherbst: BB:[10,17] is the loop
13:54 karolherbst: mhh + 21
13:54 RSpliet: Yeah, I haven't got time to dive into specifics
13:54 RSpliet: Sorry man, I'd love to
13:55 karolherbst: mhh, anyway, what is happening is, that we do joinat; join; bra; joinat; join; bra; joinat; ....
13:55 karolherbst: and this somehow messes things up
13:55 RSpliet: The conditional bra pushes.
13:55 karolherbst: mhh, true
13:55 karolherbst: I forgot about that
13:56 karolherbst: when does the condition bra pops?
13:56 RSpliet: Either with an explicit join, or whenever the control mask is all-0
13:56 karolherbst: mhh, okay, but we have that join
13:56 karolherbst: loop: joinat; $p0 bra; join; bra
13:57 karolherbst: RSpliet: mhh, but what happens if one thread breaks?
13:57 karolherbst: but the break is after the join
13:57 karolherbst: weird
13:58 RSpliet: to the best of my understanding (!), the breaking thread sets it's break mask bit to 0. the control mask is like an AND of like a jump/bra mask, a break mask, a ret mask and an exit mask
13:59 RSpliet: when the control mask (the bitwise AND) is all-0, you pop, which restores one of those four masks depending on the stack entry type
13:59 karolherbst: mhh
13:59 karolherbst: maybe we have to write an emulator....
13:59 karolherbst: and try to reproduce the issue
14:00 RSpliet: I actually implemented a very similar scheme in my Sim-D simulator
14:00 RSpliet: which is still behind closed doors, because academia sucks
14:00 karolherbst: I am sure it's not academia which sucks here, but the way universities decide to publish stuff :p
14:01 karolherbst: RSpliet: I highly doubt that with open access you'd have the same issues
14:01 RSpliet: I hope to be able to publish source code along with my dissertation, but right now I'm still code-monkeying away
14:10 imirkin_: RSpliet: you've been doing that for a while now, no?
14:10 imirkin_: [your PhD, that is]
14:10 RSpliet: too long
14:14 imirkin_: that's a universal answer... :)
14:29 RSpliet: imirkin_: currently edging towards the end of my 4th year
14:29 RSpliet: Whoaaaaa, we're halfway there!
14:31 imirkin_: hopefully a bit more than half-way...
14:31 imirkin_: i think 5y is a common length for people who don't have to wait for the LHC to get built
14:32 RSpliet: It's officially 3 years over here
14:32 RSpliet: But it's also somewhat common to get all your results in the final two months, so I'm still hopeful
14:32 imirkin_: yeah, the 5y often involves getting a MS first
14:32 imirkin_: hehe
17:03 karolherbst: imirkin_: we can't really calculate how many iterations we do inside a loop, can we?
17:03 karolherbst: at least not with the current codegen
17:03 karolherbst: or well, we could, but I imagine this to be very painful
17:05 karolherbst: so I don't see any other solution here than to say: if we have conditional branches inside a loop, we have to use the off-chip stack with some high enough value.
17:07 karolherbst: RSpliet: would you agree with that or not? I think we could try to get rid of some trivial control flow inside a loop (flattening ifs to predicated execution instead of brancing), but generally we can't really do much here, can we?
17:09 karolherbst: like we could convert this code like this: https://gist.github.com/karolherbst/18a17d3f57e4a38ce00b7dd0a8437308
17:09 imirkin_: karolherbst: there's a loop-variable-finding thing in glsl somewhere
17:09 karolherbst: ohh, even the joinats could be removed
17:09 imirkin_: could be passed through to tgsi i suspect
17:09 imirkin_: basically unless we can prove that the loop happens a deterministic number of times, it could be huge
17:10 imirkin_: (or has an upper bound...)
17:11 karolherbst: yeah, that's what I am worrying about as well
17:11 karolherbst: mhh, with nir we could just say we unroll up to 64 iterations and then we know for sure we have to assume a ridiculous high amount of iterations... but with TGSI we couldn't probably to that without adding more stuff to TGSI
17:11 karolherbst: radeon has LLVM, so they have that covered there :/
17:16 imirkin_: we can do that with tgsi too.
17:16 imirkin_: there's various unroll controls in PIPE_CAP
17:16 imirkin_: (and if there aren't enough, we can add)
17:17 karolherbst: mhh, interesting
17:17 imirkin_: there's a max thing, which if you set to 0, you get forced unrolling
17:17 imirkin_: perhaps if we set it to like 64, then it can be ok?
17:18 imirkin_: or 32 or whatever
17:18 karolherbst: imirkin_: heh.. that's not a TGSI thing
17:18 karolherbst: it's glsl ir
17:18 karolherbst: but yeah, would work as well
17:18 imirkin_: sure
17:18 imirkin_: is there a difference?
17:19 karolherbst: in regards to TGSI I don't think so
17:19 imirkin_: :)
17:19 karolherbst: normally you could unroll more with optimizations, but I guess we can ignore that for the TGSI path
17:20 imirkin_: what we really want though
17:20 karolherbst: okay.. so if we assume that all the loops are unrolled, we just need to detect the pattern and just enable the off-chip stack...
17:20 imirkin_: is that imputed count to be passed through tgsi
17:20 karolherbst: mhh, yeah
17:20 imirkin_: which shouldn't be difficult
17:20 karolherbst: but what are the more annoying cases is where it depends on an uniform
17:21 imirkin_: then we have to do the stack
17:21 imirkin_: that's basically it -- if we have the count and it's less than N, no stack
17:21 karolherbst: well.. we could do a shader variant instead
17:21 imirkin_: otherwise off-chip stack
17:21 imirkin_: how does that help?
17:21 karolherbst: might unroll
17:21 imirkin_: and which one to execute?
17:21 imirkin_: based on what?
17:22 karolherbst: well, I don't know how other drivers do variants, but that kind of depends on the uniforms
17:22 karolherbst: not that it would always work
17:22 karolherbst: but shader variants might be a valid perf opt we might want to do anyway
17:22 imirkin_: at that point, we should do a recompilation pass with uniforms fed in
17:22 imirkin_: why just this? :)
17:23 imirkin_: anti-uber-shaders
17:23 karolherbst: well, that's a shader variant, no?
17:23 imirkin_: sure
17:23 imirkin_: basically you're adding the uniform state into your key
17:23 karolherbst: yeah
17:23 karolherbst: but this could lead to a lot of shaders
17:23 imirkin_: might as well not precompile
17:23 karolherbst: right
17:24 karolherbst: but we want our native shader cache for this before doing that :D
17:24 karolherbst: anyway, this should make this situation less painful overall
17:24 karolherbst: but that's for later I guess
17:25 karolherbst: mhh, could also lead to ridiculous VRAM consumption
17:25 karolherbst: soo maybe not a good idea
17:26 karolherbst: anyway. I guess we could detect the pattern for now (condition branches inside a loop) and just use the off-chip stack with some sane value
17:26 karolherbst: 1kb seemed to be enough for all shaders I've encountered so far, so maybe do 2kb and move on
17:31 imirkin_: sounds reasonable
17:31 imirkin_: this is an issue on nv50 as well btw
17:31 imirkin_: i tried to set up the off-chip stack, but was less-than-successful
17:31 imirkin_: (i think the on-chip stack is smaller there, so the problem is more visible)
17:39 karolherbst: btw.. who are always those people replying random stuff on bugs who have obviously no clue :/
17:39 karolherbst: and asking something from the reporter
17:39 imirkin_: i hate that shit
17:39 karolherbst: yeah...
17:41 karolherbst: imirkin_: maybe we should get official contributer marked in bugzilla, so a reporter knows who might have more reasonable requests than others :/ but that's also painful to handle :/
17:44 imirkin_: or even better - disallow random people from commenting in the first place
17:44 imirkin_: that way you also avoid the problem of "i am having the same issue, except i have a totally unrelated issue"
17:47 karolherbst: yeah.. but that's a bit harsh generally. Sometimes people also just want to ask for a status or something
17:47 imirkin_: sometimes.
17:47 karolherbst: or maybe even commenting on potential fixes or something (although the latter is less likely with nouveau)
17:47 karolherbst: I'd rather blobk annoying people from commenting on random bugs :p
17:48 imirkin_: you'll become the Grand Arbiter of Annoyance
17:48 karolherbst: yep
19:01 RSpliet: karolherbst: There should always be a way to transform a loops such that you need a constant number of control stack entries.
19:02 karolherbst: RSpliet: sure, but not without hurting perf
19:02 RSpliet: karolherbst: unrolling control stacks hurts perf quite badly
19:03 RSpliet: every pop should be treated as a branch, flushing the pipeline
19:04 karolherbst: RSpliet: sure.. but how else to deal with conditional breaks out of loops?
19:04 RSpliet: same way you deal with early returns from function calls
19:04 RSpliet: sorry, that's a lame answer :-P
19:04 karolherbst: and not even correct :p
19:05 RSpliet: it is actually
19:05 RSpliet: same mechanism, different mask
19:05 karolherbst: early returns you eliminate by reworking the function so that you only have one return, but that's not the issue we have here
19:05 imirkin_: as i understand it
19:05 karolherbst: and our loop only has one exit
19:05 karolherbst: not two
19:06 imirkin_: instead of pushing in the uncommon case
19:06 imirkin_: we push in the common case
19:06 karolherbst: we push in every case
19:06 imirkin_: flipping the way the if statement works should Just Work (tm)
19:06 karolherbst: the predicated bra itself already pushes
19:06 imirkin_: either the bra or the break has to be predicated
19:07 karolherbst: ahh, so you would rather predicate the break.. right
19:07 karolherbst: that should make the situation less bad
19:07 karolherbst: but, not in this case
19:07 karolherbst: here you have this issue that some threads exit very early, where other threads have tons of iterations
19:08 RSpliet: brk; doesn't push to the control stack
19:08 karolherbst: (which was similiar to the other shader I found having this issue)
19:09 karolherbst: RSpliet: so the issue we have here is that we essentially have a divergency hell and a lot of entries pushed onto the stack, while nothing cleaning that up as we never converge except every thread exited the loop, no?
19:09 karolherbst: on a higher level
19:09 karolherbst: if all threads follow the same path on a predicated bra, there is no need to push onto the stack, right?
19:09 RSpliet: karolherbst: I get that, but that's a flaw in the way our compiler transforms and issues control flow.
19:09 karolherbst: okay, then what would be the better way to deal with this?
19:10 RSpliet: I haven't looked at the code in great detail. But in short. The prebrk pushes a convergence point (PC) along with the current break mask and it's type "brk"
19:11 karolherbst: we only call prebrk once
19:11 RSpliet: that's good!
19:11 RSpliet: Inside the loop, brk just manipulates the break mask, disabling the threads for which the predicate is true
19:12 karolherbst: uff, wait
19:12 karolherbst: ehh...
19:12 karolherbst: why didn't I see it before
19:12 karolherbst: no, we have nested loops again
19:12 karolherbst: why?
19:12 karolherbst: the hell?
19:12 karolherbst: ohh, uhhh
19:12 karolherbst:should have read the entire glsl code
19:13 RSpliet: nesting loops is absolutely fine too...
19:13 RSpliet: Still no uncontrollable number of stack entries
19:13 karolherbst: mhh, we actually have three level of loops
19:13 karolherbst: RSpliet: okay.. so before we enter a loop, we call prebrk
19:13 karolherbst: and we do so for nested loops for each iteration
19:14 karolherbst: that sounds like a huge issue, no?
19:14 RSpliet: Shouldn't be
19:14 karolherbst: well.... the hardware disagrees
19:14 karolherbst: or maybe it does
19:14 karolherbst: we don't know
19:14 karolherbst: what would happen if we move those prebrks outside of the loops?
19:14 karolherbst: and have three right before entering the first?
19:14 karolherbst: (given that the entering of the deeper loop isn't conditional)
19:14 RSpliet: as long as the prebrk _inside_ the outer loop is matched with a pop when exiting the inner loop, it should be absolutely fine
19:15 karolherbst: ohh, I was talking about moveing the prebreaks outside of all loops
19:15 RSpliet: that pop could be triggered by a brk or a join or... relatively arbitrary
19:15 RSpliet: not sure if that's a good idea
19:16 karolherbst: if the prebreak is in the first block inside the loop, it should be safe, no?
19:17 karolherbst: oh ehh
19:17 karolherbst: actually not
19:17 karolherbst: that's a stupid idea
19:20 RSpliet: karolherbst: here's what a single loop with early break should conceptually look like - it gets more hairy if you also need "continue", but this is what NVIDIA transforms to
19:20 RSpliet: https://paste.fedoraproject.org/paste/QNtpn8YbvXBGLj3gdICMxA
19:20 RSpliet: prebrk pushes, the pop happens "automatically" after the last thread hit the brk
19:21 karolherbst: okay, but then our problem is most likely not, that we build the loops incorrectly, but that we have three loops nested and completly divergent threads
19:21 RSpliet: Let me quickly give an example of what a nested loop should look like
19:24 RSpliet: https://paste.fedoraproject.org/paste/CyOu-6dUVgzdEI-hAXyarA
19:24 RSpliet: two control stack entries max, guaranteed, no matter what the loop invariants amount to
19:25 karolherbst: fun thing is, that shader looks like this as well
19:25 RSpliet: you can't and shouldn't move around the "prebrk end-inner", because you need one convergence point for the inner loop on _every_ iteration.
19:25 karolherbst: or mhh
19:25 karolherbst: maybe that predicated bra is the actual issue
19:25 karolherbst: and we should never do that inside loops
19:26 karolherbst: mhh, but that predicated bra should be harmless
19:27 karolherbst: as both paths converge quite fast
19:27 RSpliet: I find this textual representation very hard to reason about.
19:27 karolherbst: RSpliet: no, I think your statement about two entries max isn't true
19:27 karolherbst: otherwise it wouldn't make sense that our shader messes up
19:27 karolherbst: we had this the last time
19:27 RSpliet: in the example pseudo-code I gave you, absolutely 2 entries max
19:27 karolherbst: and you applied this rule as well and came to the conclusion that the hw is lying, which nobody believes anyway
19:28 karolherbst: I mean, it sounds reasonable
19:28 karolherbst: but it doesn't help if the hw disagrees
19:28 karolherbst: and that's probably due to something we don't know
19:29 karolherbst: mhh, let me remove everything irrelevant from the shader, otherwise it's hard to analyze this
19:29 RSpliet: That'd be very helpful, thanks
19:34 RSpliet: Do you know why predicated breaks are annotated with the break target? That info shouldn't end up in the final opcode
19:35 karolherbst: because it ends up in the opcode
19:36 karolherbst: ohh
19:36 karolherbst: breaks
19:36 karolherbst: uhm...
19:36 karolherbst: I read bras
19:36 karolherbst: I guess for convienience
19:36 imirkin_: RSpliet: needs to be there for ssa linkage
19:36 imirkin_: er, control flow linkage
19:36 imirkin_: which is needed for proper phi's/etc
19:36 RSpliet: imirkin_: I guess that makes sense :-)
19:36 imirkin_: but you're right - it's not in the final opcode
19:37 imirkin_: neither is it for join
19:37 imirkin_: or anything else except joinat and bra
19:37 imirkin_: (and call, i guess)
19:37 RSpliet: I hope so :-D
19:39 karolherbst: RSpliet: https://gist.githubusercontent.com/karolherbst/464600985b3b84ffdcdcf54daae2a3d5/raw/1511e9a8ba6b7adfc5300f765d1ce19da02933e3/gistfile1.txt
19:40 karolherbst: intendation is according to loop depth
19:47 RSpliet: Ok, so you'd expect a maximum stack depth of 6
19:50 RSpliet: prebrk, prebrk, prebrk, $p0 bra, joinat, $p0 bra
19:51 karolherbst: yeah
19:51 karolherbst: but I doubt the hw is playing a trick on us
19:51 karolherbst: so there has to be more to it
19:51 karolherbst: RSpliet: keep in mind, that 512 bytes per group isn't enough either
19:52 karolherbst: uhm, warp is the nv term
19:53 RSpliet: although you'd expect 48B/warp to be sufficient
19:53 karolherbst: so.. yeah
19:53 karolherbst: but you how that is about theories vs hardware :p
19:53 RSpliet: yep
19:58 RSpliet: So I guess an interesting question would be "what happens if we hit exit while the stack isn't empty"
19:58 RSpliet: Is the next work-group going to end up with a clean stack and a freshly reset SP. Or will it be smaller
19:59 imirkin_: RSpliet: there's an error for that
19:59 RSpliet: Not that I spotted an exit path that could leave part of the stack intact, but heck..
20:00 imirkin_: https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/nvkm/engine/gr/gf100.c#L1195
20:00 imirkin_: hm, maybe not.
20:02 RSpliet: karolherbst: I don't know the NVIDIA ISA well enough... but. In theory I'd expect that break BB:11 to trigger a pop. Should/could we attach a join flag to it to make sure it pops the RHS of "not $p0 bra BB:17"?
20:03 RSpliet: In theory that'd be superfluous :-P
20:03 karolherbst: there is no join flag on the maxwell ISA anymore ;)
20:03 RSpliet: Then how do you join?
20:03 karolherbst: it's a SSY/SYNC pair
20:04 RSpliet: Ohh ok
20:04 karolherbst: joinat/join in our isa
20:04 RSpliet: but that's an instruction rather than a flag
20:04 imirkin_: same effect.
20:05 imirkin_: it's called "SYNC" now vs ".S". same diff.
20:05 karolherbst: well, it might be a flag on the bra instruction in the end... not 100% sure, but the difference is, we have to push the target now
20:05 karolherbst: or did we had to do that before as well?
20:05 imirkin_: also on nv50, the join doesn't do a branch, it's an indicator that things are now joined
20:05 karolherbst: ahh
20:05 imirkin_: on nvc0, the join itself is a branch to the joinat address.
20:27 RSpliet: karolherbst: is this the final code by the way?
20:27 karolherbst: what do you mean by final?
20:28 RSpliet: As in: no more optimisations
20:28 karolherbst: yeah, the last one printed
20:28 karolherbst: imirkin_: although with this shader, some threads could exit earlier
20:28 imirkin_: only pessimizations after that.
20:28 karolherbst: no idea if that matters though
20:28 karolherbst: imirkin_: what kind?
20:28 imirkin_: the joke kind?
20:28 karolherbst: ahh
20:29 karolherbst: ...
20:29 imirkin_: obviously not a very good one =/
20:29 RSpliet: karolherbst: if you look at this snippet: https://paste.fedoraproject.org/paste/zCQO8dvKcEL2vndDgWBqBg
20:29 karolherbst: yeah, that one was quite weak :p
20:29 RSpliet: if you turn around the condition on the bra, then both BB:15 and the joinat BB:16 can be eliminated
20:29 karolherbst: RSpliet: we could also predicate the branch entirely
20:29 karolherbst: but mhh, maybe we don't do this with phis?
20:30 karolherbst: dunno
20:30 karolherbst: RSpliet: but that doesn't help us with the worst case scenario
20:30 imirkin_: yeah, weird that didn't get picked up
20:30 karolherbst: I am sure there are cases we really needs thi off-chip stack, otherwise nvidia wouldn't have added it in the first place
20:31 karolherbst: at least we should add the code to enable it if the shader reports it is needed
20:31 karolherbst: we don't have yet to turn it on though
20:31 RSpliet: imirkin_: which is why I want to be sure 100% that this is the final. If a dodgy post-this optimisation just blandly removes the join you end up with broken code :-P
20:31 karolherbst: no, it's after the post RA opts
20:31 imirkin_: RSpliet: i dunno if what karol showed you is final
20:31 imirkin_: but we do post-ra opts, one called "flattening"
20:31 imirkin_: which takes care of stuff like this
20:32 RSpliet: Can flattening be broken? :-P
20:32 karolherbst: yeah.. I should check why flattening bails here
20:32 karolherbst: RSpliet: it's just very careful
20:32 imirkin_: RSpliet: it uses a big hammer. it's unbreakable!
20:32 RSpliet: What I'm looking at still has BB info and labels rather than absolute or relative addresses
20:32 RSpliet: imirkin_: did it snap twice?
20:33 imirkin_: don't get that reference...
20:33 karolherbst: RSpliet: we never print addresses
20:33 imirkin_: the addresses are filled in very very very late
20:33 imirkin_: esp on nv50 where they have to be absolute :)
20:33 imirkin_: [relative to the code segment]
20:35 RSpliet: karolherbst: opcodes don't contain BB addresses :-P is it easy for you to fetch this assembly by mmt or otherwise in a form where it's really just linear code, no BBs no annotations no nothing?
20:35 karolherbst: imirkin_: forcing flattening fixes the issue btw
20:35 karolherbst: bug uhmmm
20:35 karolherbst: it flattened in the middle loop
20:35 karolherbst: wait...
20:36 karolherbst: ehh
20:36 karolherbst: I still use the off-chip stack
20:58 RSpliet: karolherbst: can you disable the propagateJoin pass and see if it makes a difference?
20:59 RSpliet: in nv50_ir_lowering_nvc0.cpp, NVC0LegalizePostRA::visit()
21:04 RSpliet: That pass looks like it has the potential to incorrectly remove BB:15
21:07 RSpliet: Or do worse things to the conditional branch in BB:14 :-P
21:50 karolherbst: RSpliet: this pass is actually required, no?
21:51 RSpliet: From what I can tell it looks like an optimisation...
21:51 imirkin_: it's an opt.
21:53 karolherbst: well, disabling the code doesn't make it better
21:53 karolherbst: if at all, it completly breaks it
21:53 RSpliet: Uh-oh
21:53 karolherbst: like even the prior working shader is now broken
21:53 karolherbst: and that even completly as well
21:54 karolherbst: so even if that's an opt, I'd say it's really required to do so :p
21:54 RSpliet: Opts aren't supposed to paper over bugs like that
21:54 karolherbst: no, but here we are
21:55 imirkin_: afaik it's totally an opt
21:55 imirkin_: unless i'm thinking of different code
21:55 imirkin_: oh wait
21:55 imirkin_: hold on
21:55 imirkin_: is this the thing that flips the anterior joins to be interior joins?
21:55 imirkin_: then that's required ;)
21:55 karolherbst: yeah
21:55 karolherbst: it makes the bra -> join replacement
21:55 imirkin_: yeah, that's to flip between nv50-style and nvc0-style
21:56 imirkin_: yeah. that's required.
21:56 imirkin_: can't not have that.
21:56 imirkin_: "join" is basically a branch on nvc0+
21:56 imirkin_: while the IR has it at the join site.
21:56 imirkin_: so it has to be moved to each branch site.
22:00 karolherbst: RSpliet: also, the bug here is that we don't switch over to the off-chip stack, but maybe we can do better to avoid that
22:00 imirkin_: i thought you guys were talking about flattening
22:00 imirkin_: sorry for flip-flopping on that =/
22:01 RSpliet: imirkin_: no worries, I'd just like to understand that function in greater detail then
22:08 imirkin_: RSpliet: so the IR basically has like joinat; $p0 bra; ... ; bra end; ... bra end; end: join
22:08 imirkin_: which is the correct thing for nv50
22:08 imirkin_: however for nvc0, we want
22:08 imirkin_: joinat; $p0 bra; ... join; ... join; end: nothing.
22:09 imirkin_: so this function makes it so.
22:09 RSpliet: yeah that makes perfect sense
22:10 karolherbst: imirkin_: mhh, a break doesn't sync, does it?
22:10 imirkin_: a break jumps to the prebrk location
22:10 karolherbst: maybe we have to do the prejoin/join dance for each loop?
22:10 karolherbst: otherwise we have all threads divergent at some point
22:10 karolherbst: and this might cause this issue?
22:11 RSpliet: imirkin_: well, from what I gathered from the patent not quite
22:11 imirkin_: "has the effect of ... " :)
22:11 imirkin_: since it clears the mask
22:11 RSpliet: a break should just disable threads. If all threads are disabled it pops an entry off the cstack, which may or may not be a prebrk entry
22:11 imirkin_: hm. so we need more joins? or pops or whatever?
22:11 imirkin_: dunno
22:52 karolherbst: RSpliet: I am sure that since that patent was created the hardware changed quite a bit
22:52 karolherbst: and if I have to decide on wheather to trust what the hardware behaves like or the patent, I always choose the hardware
22:53 karolherbst: the patent is worthless if it's not able to explain what's happening
22:55 RSpliet: karolherbst: the patent explains the concept, implementation details vary. Problem is that hardware isn't telling you how it behaves. Do you have a way to step through shaders and inquire the number of control stack entries?
22:55 karolherbst: RSpliet: potentially yes
22:56 karolherbst: any idea where to read them out?
22:56 karolherbst: we have a debugging interface we could make use of to debug shaders
22:56 karolherbst: I just didn't work out how to fully use it
22:56 RSpliet: "them" being the stack pointer? Phoa, no never looked at that
22:57 karolherbst: RSpliet: mhh, but we could dump the stack at least...
22:57 karolherbst: I just wait on some documentation for this to make it less painful
22:58 karolherbst: but I think I know how to enable debug mode, we just have to tell the kernel to pass the traps through
22:58 RSpliet: Yeah understandable. Being able to dump the stack seems valuable. You can always launch a 1-warp program to reduce noise :-P
22:59 RSpliet: Although having a stack pointer seems useful - it's not as if entries are scrubbed...
22:59 karolherbst: mhh, although we should be able to single step as well
22:59 karolherbst: it's... just a lot of work in order to debug this
23:01 karolherbst: RSpliet: I think it's easier to come up with a model which behaves the same actually
23:01 karolherbst: RSpliet: okay. but here is something interesting: the warps where threads diverge more are more likely to trap
23:01 karolherbst: than the other ones
23:03 RSpliet: karolherbst: why? are they not reconverging?
23:03 karolherbst: or maybe it's just warps with more loop iterations
23:03 karolherbst: RSpliet: well, it depends if they run out of stack before converging or not :p
23:03 karolherbst: but anyway.. amount of iterations do matter here
23:03 karolherbst: for... whatever reason
23:04 karolherbst: so it is something which pushes to the stack on each iteration of some of the loops
23:04 karolherbst: otherwise that wouldn't make sense, correct?
23:05 karolherbst: mhh, I should probably run that through frameretracer
23:05 RSpliet: Oh agreed. And that's bound to be your bug. The code shouldn't push on every iteration of a loop
23:05 karolherbst: much simplier to investigate such issues
23:05 karolherbst: yeah.. question is just, what pushes?
23:05 RSpliet: Sorry, let me rephrase that. An iteration shouldn't push more than it pops :-P
23:06 karolherbst: mhhh
23:06 karolherbst: I am wondering
23:06 RSpliet: Do you run into the same problems on Fermi/Kepler?
23:07 karolherbst: I am not sure
23:09 karolherbst: I think I want to blame that not $p0 bra BB:17
23:09 karolherbst: but no idea why
23:10 karolherbst: RSpliet: yeah... I kind of think we have to sync nested loops...
23:10 karolherbst: that "not $p0 bra BB:17" can lead to diverged execution of the entire loop
23:11 karolherbst: and there is no sync point
23:12 karolherbst: RSpliet: so... let's say we have 32 diverged threads... this means that a stack of 512 would be reduced to 16 bytes per thread, right?
23:13 RSpliet: My knowledge on NV50/Fermi/Kepler dictates that the control stack contains per-warp entries, not per-thread
23:14 RSpliet: an entry being a (type, mask, pc) 3-tuple. in envydocs this is documented as a 64-bit entry. The mask being 32, the type probably 3 or 4 bits, the remainder reserved for PC
23:14 karolherbst: so we have 64 entries / warp?
23:15 RSpliet: is the size per-warp or per... eh. SM?
23:17 RSpliet: I'd highly suspect that on-chip would be limited to a figure like 8 rather than 64. But that's just from staring at OpenCL assembly and looking at how complex control flow generally really is in kernels on average
23:18 karolherbst: per warp
23:18 karolherbst: well the off chip is per warp at least
23:18 karolherbst: and 512 bytes wasn't enough
23:19 karolherbst: but 1024 bytes is
23:23 RSpliet: Right, if 8 bytes per entry is still valid that means the stack grows to 128+whatever on-chip space there is
23:36 RSpliet: I can defo see it go to shite on a first gen Maxwell here on stock Fedora
23:36 karolherbst: yeah, but what about kepler?
23:36 karolherbst: or fermi
23:37 RSpliet: I thought GM10x had the same assembly as GK110?
23:37 karolherbst: no
23:38 RSpliet: Oh, I was wrong
23:39 RSpliet: I already questioned myself as I was typing, as I could swear it had over twice as many scheduling headers :-P
23:40 RSpliet: Anyway, only have a 940M at my fingertips...
23:41 karolherbst: for now it doesn't matter anyway
23:41 RSpliet: and too incompetent to even build mesa these days
23:41 RSpliet: ../src/gallium/auxiliary/util/u_helpers.c:140:46: fout: ‘struct util_cpu_caps’ has no member named ‘cores_per_L3’
23:41 RSpliet: I'm pretty sure that field is there
23:41 karolherbst: RSpliet: you know that autotools got removed?
23:42 RSpliet: using meson... I think
23:42 karolherbst: just in case you were still using make
23:42 karolherbst: ahh
23:42 RSpliet: but my build directory is a mess
23:48 RSpliet: git clean did the trick