13:00imirkin_: karolherbst: you good to handle https://bugs.freedesktop.org/show_bug.cgi?id=111006 ? or need me to have a look?
13:01imirkin_: looks like something VERY weird is going on
13:01karolherbst: I can look into that
13:01karolherbst: I think I know what
13:01imirkin_: and probably something quite dumb.
13:02karolherbst: probably NaN
13:02karolherbst: random floats are always messy
13:03karolherbst: ohh wait
13:03karolherbst: it isn't random
13:03karolherbst: oh well
13:03karolherbst: I will take a look
13:10karolherbst: imirkin_: can't reproduce
13:10RSpliet: imirkin_: does the first "_GLF_outVarBackup_GLF_color = _GLF_color;" even have defined behaviour when it's defined as "out vec4 _GLF_color" ? Looks to me like a use before assignment, and that could mess with liveness analysis in strange ways...
13:10karolherbst: will check with the mesa commit specified in the bug
13:10imirkin_: karolherbst: which gpu?
13:11imirkin_: shouldn't be materially different from the GTX 970
13:11karolherbst: ahh, no, it is a 1050
13:11karolherbst: I have a 970M here though as well. So I could check there
13:12imirkin_: like i said though - shouldn't matter
13:12karolherbst: RSpliet: it's dead code, but we don't do those opts yet
13:13karolherbst: now I get something funky
13:13karolherbst: that explains
13:13imirkin_: "now" as in "on that commit"?
13:13karolherbst: I think it's some memory stuff
13:14karolherbst: the correct stuff was already in memory
13:14karolherbst: "[137017.538280] nouveau 0000:01:00.0: gr: GPC0/TPC0/MP trap: global 00000000  warp 10016 " :)
13:14karolherbst: ohh, for fuck's sake.. this recursive fault bug is super annoying
13:15RSpliet: karolherbst: it's only dead code if that injectionSwitch uniform was a constant rather than a shader param.
13:15karolherbst: oh, right, we don't do shader variants
13:16karolherbst: ehh, I think this "Fixing recursive fault but reboot is needed!" triggers only with runpm
13:26karolherbst: imirkin: yep.. after a fresh reboot I also get those errors.. so let's recheck master
13:27karolherbst: also, the reference doesn't trigger traps, but the variant does
13:27imirkin_: that means we're doing something demonstrably dumb
13:28karolherbst: should be easy to figure out
13:28karolherbst: the shader is quite simple
13:28imirkin_: unless it's a stack thing
13:28karolherbst: I feared as much, but I doubt that, kind of
13:28karolherbst: ohh, but there is a loop
13:28karolherbst: mhh, so maybe
13:28karolherbst: I still have this other shader which causes that stack issue
13:29karolherbst: and this was actually my first guess on the issue
13:29karolherbst: okay, happens on master as well
13:34karolherbst: uf, how did I enable this off chip stack again
13:34imirkin_: shader header
13:35karolherbst: sure... but I was more wondering about which byte that was
13:35imirkin_: shader header is documented
13:35imirkin_: both in rnndb and on nvidia's open-gpu-doc
13:35karolherbst: ohh, true
13:35HdkR: memset the whole header to 0xff and just hope for the best :P
13:35imirkin_: *finally* a voice of reason :p
13:36HdkR: What generations is the header documented for?
13:36karolherbst: I think it was that
13:37imirkin_: HdkR: tnt2 :p
13:37imirkin_: afaik that holds up through pascal, at least
13:37imirkin_: and until there's a gt 1630 released, no later generation exists to me :)
13:38karolherbst: "The SPH field ShaderLocalMemoryCrsSize sets the additional (off chip) call/return stack size (CRS_SZ). Units are in Bytes/Warp. Minimum value 0, maximum 1 megabyte. Must be multiples of 512 bytes." yeah, that was it
13:38HdkR: So nobody gets to care about Turing SPH then
13:39HdkR: Such a simple thing to document and isn't available :/
13:39imirkin_: no accel firmware for turing either
13:39imirkin_: so it works out nicely
13:39HdkR: Just means nobody /has/ to care about Turing
13:39imirkin_: i have a hard time caring past kepler.
13:40imirkin_: only reason i'll be plugging my gt1030 in is to play with the hdr infoframe
13:40imirkin_: i guess i could do it on kepler, but then i don't get the vendor infoframe...
13:40karolherbst: fewer errors with a small off-chip stack
13:41karolherbst: gone with a 1kb stack
13:41imirkin_: ok. that's annoying.
13:41karolherbst: so we probably want to fix this issue :/
13:41imirkin_: can you report your findings in the bug?
13:43karolherbst: imirkin_: and we might want to make the dmesg print more meaningful :)
13:43karolherbst: gr: GPC0/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 300016 
13:44karolherbst: 0x16 is probably this call/return stack overflow stuff
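If that guess is right, the trap word would split like this (the bit layout is pure speculation from the two dumps in this log, not from documentation):

```python
def split_warp_trap(value):
    # assumption: the low 16 bits carry the warp error code (0x16 in both
    # dumps above, guessed to be the call/return stack overflow) and the
    # upper bits identify the warp; this layout is not documented here
    return value >> 16, value & 0xffff
```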
13:44HdkR: imirkin_: Turing changes HDR interactions as well. Will be some more fun things there :P
13:44imirkin_: at least on that shader gen
13:44imirkin_: HdkR: until dell starts shipping gt 1630's in all their new pc's for free, i doubt i'll have a look.
13:45imirkin_: HdkR: it's just an infoframe, no?
13:45imirkin_: (the "vendor" infoframe)
13:45imirkin_: or are you talking about CTM/etc?
13:46imirkin_: HdkR: i think we already know that turing's internal processing is basically all done in fp16 now
13:46HdkR: I don't actually know, I just remember HDR bits changing there
13:46karolherbst: imirkin_: anyway, any "smart" ideas on how to fix it?
13:46HdkR: Sounds correct
13:46imirkin_: karolherbst: https://previews.123rf.com/images/andreykuzmin/andreykuzmin1409/andreykuzmin140900082/31721808-scared-ostrich-burying-its-head-in-sand.jpg
13:47imirkin_: it's a fool-proof technique.
13:47karolherbst: we could use a 1kb stack whenever we find a loop in a shader and just go on with life...
13:47karolherbst: or well..
13:47HdkR: karolherbst: Make the CRS like 8MB and call it good :D
13:47karolherbst: a loop with some if or another loop
13:47imirkin_: karolherbst: or we could rewrite the loop to not be dumb like this, iirc
13:47karolherbst: HdkR: mhh, yeah, doesn't matter anyway
13:48imirkin_: i think if we just flip the "if" and "else" bits, the problem goes away
13:48karolherbst: imirkin_: thing is, the loop doesn't do anything dumb
13:48imirkin_: the glsl code is fine. we're the idiots.
13:48karolherbst: there is just the normal prebreak + break stuff
13:48RSpliet: karolherbst: do we have a compiler pass that can determine the maximum stack depth statically?
13:48karolherbst: + the joinat+join stuff from the if
13:48karolherbst: nothing more
13:48karolherbst: RSpliet: no, as we have only a poor understanding of this
13:48karolherbst: we talked about it before
13:48imirkin_: where by "we" karolherbst means him and me.
13:49imirkin_: whereas RSpliet and everyone else has a great understanding of it
13:49karolherbst: and our estimate was multiple orders of magnitude too low
13:49imirkin_: for some reason i can't seem to wrap my head around it
13:49karolherbst: imirkin_: not good enough to estimate how much of the stack we use
13:49imirkin_: even though in principle the concept should be simple.
13:49karolherbst: that's also what RSpliet failed at
13:50karolherbst: we could poke nvidia about it though and get an answer in a year :p
13:50RSpliet: karolherbst: to be fair, I never tried. The concept of the stack is quite straightforward and there's even some patents explaining how they work ;-)
13:50karolherbst: but again, we didn't understand what was going on with that issue when I asked the last time about it
13:51karolherbst: anyway, if a loop + if is able to trigger this issue, we are kind of screwed anyway
13:51RSpliet: Not denying that. I also only threw limited hours against it.
13:51karolherbst: what is there really to do besides just enabling the off-chip stack?
13:51karolherbst: sure.. we could try to figure out the amount of iterations of the loop and stuff
13:52karolherbst: but that doesn't help in the cases where we can't
13:52RSpliet: karolherbst: a well written loop is one for which the number of iterations doesn't matter for the stack depth
13:52karolherbst: RSpliet: so.. the thing is, whenever we do an iteration, we push to the stack via a joinat
13:52karolherbst: there is no other explanation
13:52RSpliet: Yeah, that's my definition of a poorly written loop :-D
13:53karolherbst: how else should we handle the if then?
13:53karolherbst: for the general case
13:53RSpliet: Oh let me rephrase that. In the body of a well-written loop you push as many items as you pop
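RSpliet's balance rule can be phrased as a static check (a toy sketch; the op names and push/pop weights are illustrative guesses, not real codegen opcodes):

```python
# Per-iteration stack discipline: ops that push (joinat, prebrk, a divergent
# bra) must be balanced by ops that pop (join) inside the loop body, so the
# net delta over one trip around the loop is zero.
PUSHES = {"joinat": 1, "prebrk": 1, "bra_divergent": 1}
POPS = {"join": 1}

def iteration_stack_delta(body_ops):
    return sum(PUSHES.get(op, 0) - POPS.get(op, 0) for op in body_ops)
```

A well-written loop body gives 0 here; anything positive grows the stack with the iteration count, which is exactly the failure mode being discussed.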
13:53karolherbst: this is the shader btw: https://gist.githubusercontent.com/karolherbst/8ed234cd9e3e0e51c90524d67f9488ee/raw/48624aca37767209bd11d6c5d3b82c278affe58f/gistfile1.txt
13:54karolherbst: BB:[10,17] is the loop
13:54karolherbst: mhh + 21
13:54RSpliet: Yeah, I haven't got time to dive into specifics
13:54RSpliet: Sorry man, I'd love to
13:55karolherbst: mhh, anyway, what is happening is, that we do joinat; join; bra; joinat; join; bra; joinat; ....
13:55karolherbst: and this somehow messes things up
13:55RSpliet: The conditional bra pushes.
13:55karolherbst: mhh, true
13:55karolherbst: I forgot about that
13:56karolherbst: when does the conditional bra pop?
13:56RSpliet: Either with an explicit join, or whenever the control mask is all-0
13:56karolherbst: mhh, okay, but we have that join
13:56karolherbst: loop: joinat; $p0 bra; join; bra
13:57karolherbst: RSpliet: mhh, but what happens if one thread breaks?
13:57karolherbst: but the break is after the join
13:58RSpliet: to the best of my understanding (!), the breaking thread sets its break mask bit to 0. the control mask is an AND of a jump/bra mask, a break mask, a ret mask and an exit mask
13:59RSpliet: when the control mask (the bitwise AND) is all-0, you pop, which restores one of those four masks depending on the stack entry type
13:59karolherbst: maybe we have to write an emulator....
13:59karolherbst: and try to reproduce the issue
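A minimal emulator of the mask/stack scheme RSpliet describes above (the thread-mask semantics are reconstructed from the discussion, not verified hardware behavior):

```python
class WarpStack:
    """Toy model: the active mask is the AND of per-kind masks (bra, brk,
    ret, exit); a push saves the current mask of its kind, and once the
    AND goes all-0 a pop restores the mask named by the entry type."""
    def __init__(self, nthreads):
        self.full = (1 << nthreads) - 1
        self.masks = {"bra": self.full, "brk": self.full,
                      "ret": self.full, "exit": self.full}
        self.stack = []

    def active(self):
        m = self.full
        for v in self.masks.values():
            m &= v
        return m

    def push(self, kind, new_mask):
        # e.g. a divergent bra saves the old bra mask and keeps only the
        # threads taking one side; prebrk does the same for the brk mask
        self.stack.append((kind, self.masks[kind]))
        self.masks[kind] = new_mask

    def pop_if_converged(self):
        # "when the control mask (the bitwise AND) is all-0, you pop"
        if self.active() == 0 and self.stack:
            kind, saved = self.stack.pop()
            self.masks[kind] = saved
            return True
        return False
```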
14:00RSpliet: I actually implemented a very similar scheme in my Sim-D simulator
14:00RSpliet: which is still behind closed doors, because academia sucks
14:00karolherbst: I am sure it's not academia which sucks here, but the way universities decide to publish stuff :p
14:01karolherbst: RSpliet: I highly doubt that with open access you'd have the same issues
14:01RSpliet: I hope to be able to publish source code along with my dissertation, but right now I'm still code-monkeying away
14:10imirkin_: RSpliet: you've been doing that for a while now, no?
14:10imirkin_: [your PhD, that is]
14:10RSpliet: too long
14:14imirkin_: that's a universal answer... :)
14:29RSpliet: imirkin_: currently edging towards the end of my 4th year
14:29RSpliet: Whoaaaaa, we're halfway there!
14:31imirkin_: hopefully a bit more than half-way...
14:31imirkin_: i think 5y is a common length for people who don't have to wait for the LHC to get built
14:32RSpliet: It's officially 3 years over here
14:32RSpliet: But it's also somewhat common to get all your results in the final two months, so I'm still hopeful
14:32imirkin_: yeah, the 5y often involves getting a MS first
17:03karolherbst: imirkin_: we can't really calculate how many iterations we do inside a loop, can we?
17:03karolherbst: at least not with the current codegen
17:03karolherbst: or well, we could, but I imagine this to be very painful
17:05karolherbst: so I don't see any other solution here than to say: if we have conditional branches inside a loop, we have to use the off-chip stack with some high enough value.
17:07karolherbst: RSpliet: would you agree with that or not? I think we could try to get rid of some trivial control flow inside a loop (flattening ifs to predicated execution instead of branching), but generally we can't really do much here, can we?
17:09karolherbst: like we could convert this code like this: https://gist.github.com/karolherbst/18a17d3f57e4a38ce00b7dd0a8437308
17:09imirkin_: karolherbst: there's a loop-variable-finding thing in glsl somewhere
17:09karolherbst: ohh, even the joinats could be removed
17:09imirkin_: could be passed through to tgsi i suspect
17:09imirkin_: basically unless we can prove that the loop happens a deterministic number of times, it could be huge
17:10imirkin_: (or has an upper bound...)
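The "deterministic number of times" case is the classic constant trip count; a sketch of what such a pass would compute for a canonical `for (i = start; i < bound; i += step)` loop (assumed loop shape, not actual mesa code):

```python
def const_trip_count(start, bound, step):
    """Trip count when start/bound/step are compile-time constants;
    None when the induction variable doesn't provably reach the bound."""
    if step <= 0:
        return None                      # non-increasing: can't bound it
    if bound <= start:
        return 0                         # body never runs
    return -((start - bound) // step)    # ceil((bound - start) / step)
```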
17:11karolherbst: yeah, that's what I am worrying about as well
17:11karolherbst: mhh, with nir we could just say we unroll up to 64 iterations and then we know for sure we have to assume a ridiculously high amount of iterations... but with TGSI we probably couldn't do that without adding more stuff to TGSI
17:11karolherbst: radeon has LLVM, so they have that covered there :/
17:16imirkin_: we can do that with tgsi too.
17:16imirkin_: there's various unroll controls in PIPE_CAP
17:16imirkin_: (and if there aren't enough, we can add)
17:17karolherbst: mhh, interesting
17:17imirkin_: there's a max thing, which if you set to 0, you get forced unrolling
17:17imirkin_: perhaps if we set it to like 64, then it can be ok?
17:18imirkin_: or 32 or whatever
17:18karolherbst: imirkin_: heh.. that's not a TGSI thing
17:18karolherbst: it's glsl ir
17:18karolherbst: but yeah, would work as well
17:18imirkin_: is there a difference?
17:19karolherbst: in regards to TGSI I don't think so
17:19karolherbst: normally you could unroll more with optimizations, but I guess we can ignore that for the TGSI path
17:20imirkin_: what we really want though
17:20karolherbst: okay.. so if we assume that all the loops are unrolled, we just need to detect the pattern and just enable the off-chip stack...
17:20imirkin_: is that imputed count to be passed through tgsi
17:20karolherbst: mhh, yeah
17:20imirkin_: which shouldn't be difficult
17:20karolherbst: but the more annoying case is where it depends on a uniform
17:21imirkin_: then we have to do the stack
17:21imirkin_: that's basically it -- if we have the count and it's less than N, no stack
17:21karolherbst: well.. we could do a shader variant instead
17:21imirkin_: otherwise off-chip stack
17:21imirkin_: how does that help?
17:21karolherbst: might unroll
17:21imirkin_: and which one to execute?
17:21imirkin_: based on what?
17:22karolherbst: well, I don't know how other drivers do variants, but that kind of depends on the uniforms
17:22karolherbst: not that it would always work
17:22karolherbst: but shader variants might be a valid perf opt we might want to do anyway
17:22imirkin_: at that point, we should do a recompilation pass with uniforms fed in
17:22imirkin_: why just this? :)
17:23karolherbst: well, that's a shader variant, no?
17:23imirkin_: basically you're adding the uniform state into your key
17:23karolherbst: but this could lead to a lot of shaders
17:23imirkin_: might as well not precompile
17:24karolherbst: but we want our native shader cache for this before doing that :D
17:24karolherbst: anyway, this should make this situation less painful overall
17:24karolherbst: but that's for later I guess
17:25karolherbst: mhh, could also lead to ridiculous VRAM consumption
17:25karolherbst: soo maybe not a good idea
17:26karolherbst: anyway. I guess we could detect the pattern for now (conditional branches inside a loop) and just use the off-chip stack with some sane value
17:26karolherbst: 1kb seemed to be enough for all shaders I've encountered so far, so maybe do 2kb and move on
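The stopgap karolherbst proposes could look roughly like this (a sketch over a made-up flattened IR dump; the loop markers and tuple shape are invented, not the real nv50 IR structures):

```python
OFF_CHIP_CRS_BYTES = 2048  # "1kb seemed to be enough ... maybe do 2kb"

def needs_off_chip_stack(instrs):
    """instrs: (op, predicated) pairs; loop extent is marked with
    hypothetical 'loop_begin'/'loop_end' ops. Any conditional branch at
    loop depth > 0 conservatively requests the off-chip stack."""
    depth = 0
    for op, predicated in instrs:
        if op == "loop_begin":
            depth += 1
        elif op == "loop_end":
            depth -= 1
        elif op == "bra" and predicated and depth > 0:
            return True
    return False
```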
17:31imirkin_: sounds reasonable
17:31imirkin_: this is an issue on nv50 as well btw
17:31imirkin_: i tried to set up the off-chip stack, but was less-than-successful
17:31imirkin_: (i think the on-chip stack is smaller there, so the problem is more visible)
17:39karolherbst: btw.. who are these people who always reply with random stuff on bugs and obviously have no clue :/
17:39karolherbst: and asking something from the reporter
17:39imirkin_: i hate that shit
17:41karolherbst: imirkin_: maybe we should get official contributors marked in bugzilla, so a reporter knows who might have more reasonable requests than others :/ but that's also painful to handle :/
17:44imirkin_: or even better - disallow random people from commenting in the first place
17:44imirkin_: that way you also avoid the problem of "i am having the same issue, except i have a totally unrelated issue"
17:47karolherbst: yeah.. but that's a bit harsh generally. Sometimes people also just want to ask for a status or something
17:47karolherbst: or maybe even commenting on potential fixes or something (although the latter is less likely with nouveau)
17:47karolherbst: I'd rather block annoying people from commenting on random bugs :p
17:48imirkin_: you'll become the Grand Arbiter of Annoyance
19:01RSpliet: karolherbst: There should always be a way to transform a loop such that you need a constant number of control stack entries.
19:02karolherbst: RSpliet: sure, but not without hurting perf
19:02RSpliet: karolherbst: unrolling control stacks hurts perf quite badly
19:03RSpliet: every pop should be treated as a branch, flushing the pipeline
19:04karolherbst: RSpliet: sure.. but how else to deal with conditional breaks out of loops?
19:04RSpliet: same way you deal with early returns from function calls
19:04RSpliet: sorry, that's a lame answer :-P
19:04karolherbst: and not even correct :p
19:05RSpliet: it is actually
19:05RSpliet: same mechanism, different mask
19:05karolherbst: early returns you eliminate by reworking the function so that you only have one return, but that's not the issue we have here
19:05imirkin_: as i understand it
19:05karolherbst: and our loop only has one exit
19:05karolherbst: not two
19:06imirkin_: instead of pushing in the uncommon case
19:06imirkin_: we push in the common case
19:06karolherbst: we push in every case
19:06imirkin_: flipping the way the if statement works should Just Work (tm)
19:06karolherbst: the predicated bra itself already pushes
19:06imirkin_: either the bra or the break has to be predicated
19:07karolherbst: ahh, so you would rather predicate the break.. right
19:07karolherbst: that should make the situation less bad
19:07karolherbst: but, not in this case
19:07karolherbst: here you have this issue that some threads exit very early, while other threads have tons of iterations
19:08RSpliet: brk; doesn't push to the control stack
19:08karolherbst: (which was similar to the other shader I found having this issue)
19:09karolherbst: RSpliet: so the issue we have here is that we're essentially in divergence hell with a lot of entries pushed onto the stack, and nothing cleans that up because we never converge until every thread has exited the loop, no?
19:09karolherbst: on a higher level
19:09karolherbst: if all threads follow the same path on a predicated bra, there is no need to push onto the stack, right?
19:09RSpliet: karolherbst: I get that, but that's a flaw in the way our compiler transforms and issues control flow.
19:09karolherbst: okay, then what would be the better way to deal with this?
19:10RSpliet: I haven't looked at the code in great detail. But in short: the prebrk pushes a convergence point (PC) along with the current break mask and its type "brk"
19:11karolherbst: we only call prebrk once
19:11RSpliet: that's good!
19:11RSpliet: Inside the loop, brk just manipulates the break mask, disabling the threads for which the predicate is true
19:12karolherbst: uff, wait
19:12karolherbst: why didn't I see it before
19:12karolherbst: no, we have nested loops again
19:12karolherbst: the hell?
19:12karolherbst: ohh, uhhh
19:12karolherbst: should have read the entire glsl code
19:13RSpliet: nesting loops is absolutely fine too...
19:13RSpliet: Still no uncontrollable number of stack entries
19:13karolherbst: mhh, we actually have three levels of loops
19:13karolherbst: RSpliet: okay.. so before we enter a loop, we call prebrk
19:13karolherbst: and we do so for nested loops for each iteration
19:14karolherbst: that sounds like a huge issue, no?
19:14RSpliet: Shouldn't be
19:14karolherbst: well.... the hardware disagrees
19:14karolherbst: or maybe it does
19:14karolherbst: we don't know
19:14karolherbst: what would happen if we move those prebrks outside of the loops?
19:14karolherbst: and have three right before entering the first?
19:14karolherbst: (given that the entering of the deeper loop isn't conditional)
19:14RSpliet: as long as the prebrk _inside_ the outer loop is matched with a pop when exiting the inner loop, it should be absolutely fine
19:15karolherbst: ohh, I was talking about moving the prebreaks outside of all loops
19:15RSpliet: that pop could be triggered by a brk or a join or... relatively arbitrary
19:15RSpliet: not sure if that's a good idea
19:16karolherbst: if the prebreak is in the first block inside the loop, it should be safe, no?
19:17karolherbst: oh ehh
19:17karolherbst: actually not
19:17karolherbst: that's a stupid idea
19:20RSpliet: karolherbst: here's what a single loop with early break should conceptually look like - it gets more hairy if you also need "continue", but this is what NVIDIA transforms to
19:20RSpliet: prebrk pushes, the pop happens "automatically" after the last thread hit the brk
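RSpliet's pasted example isn't captured in the log; reconstructed from his description (prebrk pushes exactly once, brk only clears mask bits, the pop happens automatically once every thread has broken), the single-loop shape would be roughly:

```
        prebrk  END        // pushes once: convergence point + break mask
LOOP:   ...
        $p0 brk            // predicated break: clears mask bits, no push
        ...
        bra     LOOP       // back-edge for the still-active threads
END:    ...                // entry popped automatically after the last brk
```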
19:21karolherbst: okay, but then our problem is most likely not, that we build the loops incorrectly, but that we have three loops nested and completly divergent threads
19:21RSpliet: Let me quickly give an example of what a nested loop should look like
19:24RSpliet: two control stack entries max, guaranteed, no matter what the loop invariants amount to
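The nested-loop example is likewise missing from the log; a reconstruction consistent with the "prebrk end-inner" remark that follows (labels are invented):

```
        prebrk  END_OUTER      // entry 1, pushed once for the whole nest
OUTER:  ...
        prebrk  END_INNER      // entry 2, pushed on every outer iteration...
INNER:  ...
        $p0 brk
        bra     INNER
END_INNER:                     // ...and popped again before the next one
        ...
        $p1 brk
        bra     OUTER
END_OUTER:
```

Because the inner entry is always popped before the next outer iteration pushes a fresh one, the stack never holds more than two of these entries at once.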
19:25karolherbst: fun thing is, that shader looks like this as well
19:25RSpliet: you can't and shouldn't move around the "prebrk end-inner", because you need one convergence point for the inner loop on _every_ iteration.
19:25karolherbst: or mhh
19:25karolherbst: maybe that predicated bra is the actual issue
19:25karolherbst: and we should never do that inside loops
19:26karolherbst: mhh, but that predicated bra should be harmless
19:27karolherbst: as both paths converge quite fast
19:27RSpliet: I find this textual representation very hard to reason about.
19:27karolherbst: RSpliet: no, I think your statement about two entries max isn't true
19:27karolherbst: otherwise it wouldn't make sense that our shader messes up
19:27karolherbst: we had this the last time
19:27RSpliet: in the example pseudo-code I gave you, absolutely 2 entries max
19:27karolherbst: and you applied this rule as well and came to the conclusion that the hw is lying, which nobody believes anyway
19:28karolherbst: I mean, it sounds reasonable
19:28karolherbst: but it doesn't help if the hw disagrees
19:28karolherbst: and that's probably due to something we don't know
19:29karolherbst: mhh, let me remove everything irrelevant from the shader, otherwise it's hard to analyze this
19:29RSpliet: That'd be very helpful, thanks
19:34RSpliet: Do you know why predicated breaks are annotated with the break target? That info shouldn't end up in the final opcode
19:35karolherbst: because it ends up in the opcode
19:36karolherbst: I read bras
19:36karolherbst: I guess for convenience
19:36imirkin_: RSpliet: needs to be there for ssa linkage
19:36imirkin_: er, control flow linkage
19:36imirkin_: which is needed for proper phi's/etc
19:36RSpliet: imirkin_: I guess that makes sense :-)
19:36imirkin_: but you're right - it's not in the final opcode
19:37imirkin_: neither is it for join
19:37imirkin_: or anything else except joinat and bra
19:37imirkin_: (and call, i guess)
19:37RSpliet: I hope so :-D
19:39karolherbst: RSpliet: https://gist.githubusercontent.com/karolherbst/464600985b3b84ffdcdcf54daae2a3d5/raw/1511e9a8ba6b7adfc5300f765d1ce19da02933e3/gistfile1.txt
19:40karolherbst: indentation is according to loop depth
19:47RSpliet: Ok, so you'd expect a maximum stack depth of 6
19:50RSpliet: prebrk, prebrk, prebrk, $p0 bra, joinat, $p0 bra
19:51karolherbst: but I doubt the hw is playing a trick on us
19:51karolherbst: so there has to be more to it
19:51karolherbst: RSpliet: keep in mind, that 512 bytes per group isn't enough either
19:52karolherbst: uhm, warp is the nv term
19:53RSpliet: although you'd expect 48B/warp to be sufficient
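The 48B figure follows from the six statically visible push sites if one assumes 8 bytes per control stack entry (the entry size is an inference from this exchange, not documented here):

```python
ENTRY_BYTES = 8                  # assumed per-entry size on the control stack
static_entries = 3 + 1 + 1 + 1   # prebrk x3, $p0 bra, joinat, $p0 bra
assert static_entries == 6
assert static_entries * ENTRY_BYTES == 48  # matches the 48B/warp estimate
```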
19:53karolherbst: so.. yeah
19:53karolherbst: but you know how that is, theories vs hardware :p
19:58RSpliet: So I guess an interesting question would be "what happens if we hit exit while the stack isn't empty"
19:58RSpliet: Is the next work-group going to end up with a clean stack and a freshly reset SP, or will it be smaller?
19:59imirkin_: RSpliet: there's an error for that
19:59RSpliet: Not that I spotted an exit path that could leave part of the stack intact, but heck..
20:00imirkin_: hm, maybe not.
20:02RSpliet: karolherbst: I don't know the NVIDIA ISA well enough... but. In theory I'd expect that break BB:11 to trigger a pop. Should/could we attach a join flag to it to make sure it pops the RHS of "not $p0 bra BB:17"?
20:03RSpliet: In theory that'd be superfluous :-P
20:03karolherbst: there is no join flag on the maxwell ISA anymore ;)
20:03RSpliet: Then how do you join?
20:03karolherbst: it's a SSY/SYNC pair
20:04RSpliet: Ohh ok
20:04karolherbst: joinat/join in our isa
20:04RSpliet: but that's an instruction rather than a flag
20:04imirkin_: same effect.
20:05imirkin_: it's called "SYNC" now vs ".S". same diff.
20:05karolherbst: well, it might be a flag on the bra instruction in the end... not 100% sure, but the difference is, we have to push the target now
20:05karolherbst: or did we have to do that before as well?
20:05imirkin_: also on nv50, the join doesn't do a branch, it's an indicator that things are now joined
20:05imirkin_: on nvc0, the join itself is a branch to the joinat address.
20:27RSpliet: karolherbst: is this the final code by the way?
20:27karolherbst: what do you mean by final?
20:28RSpliet: As in: no more optimisations
20:28karolherbst: yeah, the last one printed
20:28karolherbst: imirkin_: although with this shader, some threads could exit earlier
20:28imirkin_: only pessimizations after that.
20:28karolherbst: no idea if that matters though
20:28karolherbst: imirkin_: what kind?
20:28imirkin_: the joke kind?
20:29imirkin_: obviously not a very good one =/
20:29RSpliet: karolherbst: if you look at this snippet: https://paste.fedoraproject.org/paste/zCQO8dvKcEL2vndDgWBqBg
20:29karolherbst: yeah, that one was quite weak :p
20:29RSpliet: if you turn around the condition on the bra, then both BB:15 and the joinat BB:16 can be eliminated
20:29karolherbst: RSpliet: we could also predicate the branch entirely
20:29karolherbst: but mhh, maybe we don't do this with phis?
20:30karolherbst: RSpliet: but that doesn't help us with the worst case scenario
20:30imirkin_: yeah, weird that didn't get picked up
20:30karolherbst: I am sure there are cases where we really need the off-chip stack, otherwise nvidia wouldn't have added it in the first place
20:31karolherbst: at least we should add the code to enable it if the shader reports it is needed
20:31karolherbst: we don't have to turn it on yet though
20:31RSpliet: imirkin_: which is why I want to be 100% sure that this is the final. If a dodgy post-this optimisation just blindly removes the join you end up with broken code :-P
20:31karolherbst: no, it's after the post RA opts
20:31imirkin_: RSpliet: i dunno if what karol showed you is final
20:31imirkin_: but we do post-ra opts, one called "flattening"
20:31imirkin_: which takes care of stuff like this
20:32RSpliet: Can flattening be broken? :-P
20:32karolherbst: yeah.. I should check why flattening bails here
20:32karolherbst: RSpliet: it's just very careful
20:32imirkin_: RSpliet: it uses a big hammer. it's unbreakable!
20:32RSpliet: What I'm looking at still has BB info and labels rather than absolute or relative addresses
20:32RSpliet: imirkin_: did it snap twice?
20:33imirkin_: don't get that reference...
20:33karolherbst: RSpliet: we never print addresses
20:33imirkin_: the addresses are filled in very very very late
20:33imirkin_: esp on nv50 where they have to be absolute :)
20:33imirkin_: [relative to the code segment]
20:35RSpliet: karolherbst: opcodes don't contain BB addresses :-P is it easy for you to fetch this assembly by mmt or otherwise in a form where it's really just linear code, no BBs no annotations no nothing?
20:35karolherbst: imirkin_: forcing flattening fixes the issue btw
20:35karolherbst: but uhmmm
20:35karolherbst: it flattened in the middle loop
20:36karolherbst: I still use the off-chip stack
20:58RSpliet: karolherbst: can you disable the propagateJoin pass and see if it makes a difference?
20:59RSpliet: in nv50_ir_lowering_nvc0.cpp, NVC0LegalizePostRA::visit()
21:04RSpliet: That pass looks like it has the potential to incorrectly remove BB:15
21:07RSpliet: Or do worse things to the conditional branch in BB:14 :-P
21:50karolherbst: RSpliet: this pass is actually required, no?
21:51RSpliet: From what I can tell it looks like an optimisation...
21:51imirkin_: it's an opt.
21:53karolherbst: well, disabling the code doesn't make it better
21:53karolherbst: if at all, it completely breaks it
21:53karolherbst: like even the prior working shader is now broken
21:53karolherbst: and completely at that
21:54karolherbst: so even if that's an opt, I'd say it's really required to do so :p
21:54RSpliet: Opts aren't supposed to paper over bugs like that
21:54karolherbst: no, but here we are
21:55imirkin_: afaik it's totally an opt
21:55imirkin_: unless i'm thinking of different code
21:55imirkin_: oh wait
21:55imirkin_: hold on
21:55imirkin_: is this the thing that flips the anterior joins to be interior joins?
21:55imirkin_: then that's required ;)
21:55karolherbst: it makes the bra -> join replacement
21:55imirkin_: yeah, that's to flip between nv50-style and nvc0-style
21:56imirkin_: yeah. that's required.
21:56imirkin_: can't not have that.
21:56imirkin_: "join" is basically a branch on nvc0+
21:56imirkin_: while the IR has it at the join site.
21:56imirkin_: so it has to be moved to each branch site.
22:00karolherbst: RSpliet: also, the bug here is that we don't switch over to the off-chip stack, but maybe we can do better to avoid that
22:00imirkin_: i thought you guys were talking about flattening
22:00imirkin_: sorry for flip-flopping on that =/
22:01RSpliet: imirkin_: no worries, I'd just like to understand that function in greater detail then
22:08imirkin_: RSpliet: so the IR basically has like joinat; $p0 bra; ... ; bra end; ... bra end; end: join
22:08imirkin_: which is the correct thing for nv50
22:08imirkin_: however for nvc0, we want
22:08imirkin_: joinat; $p0 bra; ... join; ... join; end: nothing.
22:09imirkin_: so this function makes it so.
22:09RSpliet: yeah that makes perfect sense
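The rewrite imirkin_ spells out can be mimicked on a toy instruction list (hypothetical tuples; this is not the actual propagateJoin code in nv50_ir_lowering_nvc0.cpp):

```python
def nv50_to_nvc0_joins(instrs, join_label):
    """Turn nv50-style 'bra end; ...; end: join' into nvc0-style code:
    every branch to the join site becomes a join (itself a branch on
    nvc0+) and nothing remains at the join label."""
    out = []
    for op, target in instrs:
        if op == "bra" and target == join_label:
            out.append(("join", None))   # branch site becomes the join
        elif op == "join" and target == join_label:
            continue                     # drop the join at the join site
        else:
            out.append((op, target))
    return out
```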
22:10karolherbst: imirkin_: mhh, a break doesn't sync, does it?
22:10imirkin_: a break jumps to the prebrk location
22:10karolherbst: maybe we have to do the prejoin/join dance for each loop?
22:10karolherbst: otherwise we have all threads divergent at some point
22:10karolherbst: and this might cause this issue?
22:11RSpliet: imirkin_: well, from what I gathered from the patent not quite
22:11imirkin_: "has the effect of ... " :)
22:11imirkin_: since it clears the mask
22:11RSpliet: a break should just disable threads. If all threads are disabled it pops an entry off the cstack, which may or may not be a prebrk entry
22:11imirkin_: hm. so we need more joins? or pops or whatever?
22:52karolherbst: RSpliet: I am sure that since that patent was created the hardware changed quite a bit
22:52karolherbst: and if I have to decide whether to trust how the hardware behaves or the patent, I always choose the hardware
22:53karolherbst: the patent is worthless if it's not able to explain what's happening
22:55RSpliet: karolherbst: the patent explains the concept, implementation details vary. Problem is that hardware isn't telling you how it behaves. Do you have a way to step through shaders and inquire the number of control stack entries?
22:55karolherbst: RSpliet: potentially yes
22:56karolherbst: any idea where to read them out?
22:56karolherbst: we have a debugging interface we could make use of to debug shaders
22:56karolherbst: I just didn't work out how to fully use it
22:56RSpliet: "them" being the stack pointer? Phoa, no never looked at that
22:57karolherbst: RSpliet: mhh, but we could dump the stack at least...
22:57karolherbst: I'm just waiting on some documentation for this to make it less painful
22:58karolherbst: but I think I know how to enable debug mode, we just have to tell the kernel to pass the traps through
22:58RSpliet: Yeah understandable. Being able to dump the stack seems valuable. You can always launch a 1-warp program to reduce noise :-P
22:59RSpliet: Although having a stack pointer seems useful - it's not as if entries are scrubbed...
22:59karolherbst: mhh, although we should be able to single step as well
22:59karolherbst: it's... just a lot of work in order to debug this
23:01karolherbst: RSpliet: I think it's easier to come up with a model which behaves the same actually
23:01karolherbst: RSpliet: okay. but here is something interesting: the warps where threads diverge more are more likely to trap
23:01karolherbst: than the other ones
23:03RSpliet: karolherbst: why? are they not reconverging?
23:03karolherbst: or maybe it's just warps with more loop iterations
23:03karolherbst: RSpliet: well, it depends if they run out of stack before converging or not :p
23:03karolherbst: but anyway.. the number of iterations does matter here
23:03karolherbst: for... whatever reason
23:04karolherbst: so it is something which pushes to the stack on each iteration of some of the loops
23:04karolherbst: otherwise that wouldn't make sense, correct?
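[editor's note] The hypothesis above can be illustrated numerically: if a loop pushes a cstack entry per iteration without popping it, depth grows with iteration count, so warps running more iterations overflow the fixed-size stack first. The 512-byte spill size and 8-byte entry size come from later in the chat; the function is purely illustrative.

```python
# Illustration of the hypothesis: a net push per loop iteration means stack
# depth is proportional to iteration count, so more iterations -> trap sooner.
STACK_BYTES = 512   # per-warp spill size mentioned later in the chat
ENTRY_BYTES = 8     # 64-bit (type, mask, pc) entry per envydocs

def iterations_until_overflow(pushes_per_iter, pops_per_iter):
    depth, capacity = 0, STACK_BYTES // ENTRY_BYTES
    for i in range(1, 10000):
        depth += pushes_per_iter - pops_per_iter
        if depth > capacity:
            return i
    return None

print(iterations_until_overflow(1, 0))  # → 65: one net push/iter traps quickly
print(iterations_until_overflow(1, 1))  # → None: balanced push/pop never overflows
```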
23:05karolherbst: mhh, I should probably run that through frameretracer
23:05RSpliet: Oh agreed. And that's bound to be your bug. The code shouldn't push on every iteration of a loop
23:05karolherbst: much simpler to investigate such issues
23:05karolherbst: yeah.. question is just, what pushes?
23:05RSpliet: Sorry, let me rephrase that. An iteration shouldn't push more than it pops :-P
23:06karolherbst: I am wondering
23:06RSpliet: Do you run into the same problems on Fermi/Kepler?
23:07karolherbst: I am not sure
23:09karolherbst: I think I want to blame that not $p0 bra BB:17
23:09karolherbst: but no idea why
23:10karolherbst: RSpliet: yeah... I kind of think we have to sync nested loops...
23:10karolherbst: that "not $p0 bra BB:17" can lead to diverged execution of the entire loop
23:11karolherbst: and there is no sync point
23:12karolherbst: RSpliet: so... let's say we have 32 diverged threads... this means that a stack of 512 would be reduced to 16 bytes per thread, right?
23:13RSpliet: My knowledge on NV50/Fermi/Kepler dictates that the control stack contains per-warp entries, not per-thread
23:14RSpliet: an entry being a (type, mask, pc) 3-tuple. in envydocs this is documented as a 64-bit entry. The mask being 32, the type probably 3 or 4 bits, the remainder reserved for PC
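[editor's note] The 64-bit entry layout RSpliet describes (32-bit mask, a few type bits, remainder for the PC) can be sketched as a pack/unpack pair. The exact bit positions and the 4-bit type field width are assumptions, not documented.

```python
# Hedged sketch of the cstack entry: (type, mask, pc) packed into 64 bits.
# Field ordering and the 4-bit type width are assumed, not from envydocs.
MASK_BITS, TYPE_BITS = 32, 4

def pack_entry(etype, mask, pc):
    assert etype < (1 << TYPE_BITS) and mask < (1 << MASK_BITS)
    return (pc << (MASK_BITS + TYPE_BITS)) | (etype << MASK_BITS) | mask

def unpack_entry(e):
    mask = e & ((1 << MASK_BITS) - 1)
    etype = (e >> MASK_BITS) & ((1 << TYPE_BITS) - 1)
    pc = e >> (MASK_BITS + TYPE_BITS)
    return etype, mask, pc
```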
23:14karolherbst: so we have 64 entries / warp?
23:15RSpliet: is the size per-warp or per... eh. SM?
23:17RSpliet: I'd highly suspect that on-chip would be limited to a figure like 8 rather than 64. But that's just from staring at OpenCL assembly and looking at how complex control flow generally really is in kernels on average
23:18karolherbst: per warp
23:18karolherbst: well the off chip is per warp at least
23:18karolherbst: and 512 bytes wasn't enough
23:19karolherbst: but 1024 bytes is
23:23RSpliet: Right, if 8 bytes per entry is still valid that means the stack grows to 128+whatever on-chip space there is
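[editor's note] The sizing arithmetic in this exchange, spelled out under the assumed 8-bytes-per-entry figure:

```python
# 512 bytes of per-warp spill space wasn't enough, 1024 bytes was;
# at 8 bytes/entry that is 64 vs. 128 entries (plus any on-chip space).
for spill_bytes in (512, 1024):
    print(spill_bytes, "bytes →", spill_bytes // 8, "entries per warp")
```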
23:36RSpliet: I can defo see it go to shite on a first gen Maxwell here on stock Fedora
23:36karolherbst: yeah, but what about kepler?
23:36karolherbst: or fermi
23:37RSpliet: I thought GM10x had the same assembly as GK110?
23:38RSpliet: Oh, I was wrong
23:39RSpliet: I already questioned myself as I was typing, as I could swear it had over twice as many scheduling headers :-P
23:40RSpliet: Anyway, only have a 940M at my fingertips...
23:41karolherbst: for now it doesn't matter anyway
23:41RSpliet: and too incompetent to even build mesa these days
23:41RSpliet: ../src/gallium/auxiliary/util/u_helpers.c:140:46: error: ‘struct util_cpu_caps’ has no member named ‘cores_per_L3’ [translated from Dutch locale]
23:41RSpliet: I'm pretty sure that field is there
23:41karolherbst: RSpliet: you know that autotools got removed?
23:42RSpliet: using meson... I think
23:42karolherbst: just in case you were still using make
23:42RSpliet: but my build directory is a mess
23:48RSpliet: git clean did the trick