00:16 imirkin: mwk: if i prebreak, how do i "cancel" it? (i.e. pop it off the call stack?)
00:38 imirkin: also wtf is STACK_MISMATCH?
05:25 mwk: imirkin: by break, obviously
12:04 imirkin: mwk: and "bra" has no effect on the call stack one way or the other, right?
12:05 mwk: imirkin: it has
12:05 mwk: if you use a divergent bra, it chooses one branch, and pushes the other onto the stack
12:05 imirkin: ok, this sounds like my case
12:06 imirkin: i don't have any joinat's/join's (this is tesla btw)
12:07 imirkin: mwk: and what's the proper way of dealing with that?
12:07 mwk: I assume the bras are nested within prebrk/break?
12:08 imirkin: you assume correctly
12:08 mwk: it should work
12:09 mwk: could you paste the code?
12:09 imirkin: but the bra's are divergent inside the prebreak/break
12:09 imirkin: yeah, hold on
12:09 mwk: right, that should be fine
12:09 mwk: it's quite expected even
12:09 mwk: for (...) { ... if (x) break; ... }
12:09 mwk: it's the whole point of having break
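A minimal C sketch of the loop shape mwk is describing, with comments mapping each construct onto the Tesla-era control-flow ops named above (prebreak/breakaddr, divergent bra, break); the mapping is an assumption about how this gets lowered, not taken from the bug:

```c
/* Hedged sketch: source-level view of the loop pattern under discussion.
 * The comments map each construct onto the control-flow ops named in the
 * conversation; the exact op names are an assumption. */
void loop_with_break(int n, int *data)
{
    /* prebreak/breakaddr: push the loop-exit address onto the stack */
    for (int i = 0; i < n; i++) {
        /* divergent bra: one side of the warp runs first, the other side
         * is pushed onto the stack as a divergence entry */
        if (data[i] < 0)
            break;      /* break: the thread leaves the loop and waits at
                         * the pushed exit address for the other threads */
        data[i] *= 2;   /* the "continue loop" path */
    }
    /* threads reconverge here once all of them have broken or the loop
     * condition has failed */
}
```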
12:10 imirkin: :)
12:11 imirkin: mwk: here are a bunch of code samples: https://bugs.freedesktop.org/show_bug.cgi?id=78161
12:11 imirkin: the one in comment 6 is nvidia's (and works ok on their blob)
12:11 imirkin: the rest are nouveau-generated
13:10 pendingchaos: imirkin: when assigning a method for a macro in nvc0_macros.h, should I leave space for the conservative rasterization macro?
13:17 imirkin_: pendingchaos: nnnot sure what the question is
13:18 imirkin_: mwk: i guess an interesting exercise would be to take nvidia's code and see if nouveau is ok with it
13:18 imirkin_: unfortunately we don't have an easy way of doing that (esp including relocations)
13:25 pendingchaos: in the conservative rasterization patches, a macro is assigned to the next available method
13:25 pendingchaos: while creating a new macro in some other patch, should I leave the method used in the conservative rasterization patches available or should I use the same one
13:27 imirkin_: pendingchaos: your call. in the end, there can be only one :)
13:27 imirkin_: so you'll save yourself some rebase issues
13:34 imirkin_: pendingchaos: your 1/3 is questionable. i'm going to have to do some digging.
13:34 pendingchaos: nods
13:34 imirkin_: (not necessarily wrong, just not obviously correct)
13:35 imirkin_: i'm impressed you were able to grok the query_buffer_write macro...
14:14 mwk: imirkin_: and what's the result? a trap, or just visual glitches?
14:15 mwk: ah, there is a trap
14:15 mwk: hm
14:19 mwk: ah
14:19 mwk: imirkin_: it seems the error involved is STACK_LIMIT
14:19 mwk: so the fix is rather obvious
14:20 mwk: the thing is, you have a branch in a loop
14:20 mwk: and the "continue loop" path is executed first
14:20 mwk: so you end up with a divergent bra stack entry for every iteration that involved a breaking thread
14:21 mwk: so, up to 31
14:21 mwk: + 1 stack entry for the breakaddr
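Rough arithmetic for the worst case mwk describes, assuming a 32-thread warp and the "continue loop" side executing first:

```c
/* Hedged back-of-the-envelope: each iteration in which at least one still
 * active thread breaks leaves one divergent-bra entry on the stack, and at
 * most 31 threads can break "early" before the last one ends the loop. */
enum {
    WARP_SIZE          = 32,
    MAX_DIVERGENT_BRAS = WARP_SIZE - 1,                       /* 31 */
    BREAKADDR_ENTRIES  = 1,                /* pushed by prebreak/breakaddr */
    WORST_CASE_STACK   = MAX_DIVERGENT_BRAS + BREAKADDR_ENTRIES,   /* 32 */
};
```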
14:21 mwk: imirkin_: how much stack does nouveau allocate?
14:24 imirkin_: mwk: i played around with bumping the stack
14:24 imirkin_: perhaps i did it wrong
14:24 imirkin_: currently we allocate 8 iirc
14:24 imirkin_: i tried bumping it to 4096 but that wasn't happy either
14:24 imirkin_: perhaps i should do an in-between with 32 :)
14:24 imirkin_: which apparently is the DX10 setting
14:25 imirkin_: mwk: can i reorder it somehow (in a predictable way) so that this doesn't happen?
14:31 mwk: imirkin_: that really should be fixed
14:32 mwk: basically you need 1 for every nesting level of call/prebreak/... + 31 for divergent bras
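The same sizing rule as a sketch, assuming a 32-thread warp; min_stack_entries() is an illustrative name, not a nouveau function:

```c
/* Hedged sizing rule from mwk's comment: one entry per live nesting level
 * of call/prebreak/etc., plus up to warp_size - 1 divergent bras. */
static unsigned min_stack_entries(unsigned nesting_levels)
{
    const unsigned warp_size = 32;
    return nesting_levels + (warp_size - 1);
}
/* e.g. one prebreak level: 1 + 31 = 32 entries, so an allocation of 8
 * would not be enough for this shader */
```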
14:32 RSpliet: imirkin_, mwk: break shouldn't result in increasing the size of the stack though? But perhaps the conditional branch in #5 offset e0 does?
14:32 mwk: RSpliet: yeah, it's the conditional branches
14:32 mwk: right, the e0 branch
14:32 imirkin_: RSpliet: breakaddr pushes onto the stack
14:32 mwk: right, but only once
14:33 imirkin_: right.
14:33 imirkin_: but you could have loop-inside-loop
14:33 imirkin_: [not in this shader, but in general]
14:33 mwk: imirkin_: swap the exit paths of e0 branch
14:33 mwk: that should lower stack usage to 2 entries max
14:33 mwk: since then the breaking part will be executed first
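A hedged C-level sketch of the swap mwk suggests; C control flow is uniform, so the code only shows the branch layout, and the comments carry what the divergence stack would do in each ordering (want_break()/body() are placeholders):

```c
static int  want_break(int i) { return i > 3; }   /* placeholder condition */
static void body(int i)       { (void)i; }        /* placeholder loop body */

static void before(void)        /* continue path executed first */
{
    for (int i = 0;; i++) {
        if (!want_break(i)) {   /* divergent bra: break-side threads pushed,  */
            body(i);            /* and their entry stays live while the       */
            continue;           /* continuing threads go around again, so one */
        }                       /* entry can pile up per iteration            */
        break;                  /* only now does the stack start to unwind    */
    }
}

static void after(void)         /* break path executed first */
{
    for (int i = 0;; i++) {
        if (want_break(i))      /* divergent bra: continue-side threads pushed */
            break;              /* breaking threads retire immediately, so the */
                                /* pushed entry is popped right away: at most  */
        body(i);                /* breakaddr + 1 divergent entry is live       */
    }
}
```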
14:37 RSpliet: I guess e8-108 could(/should) be "lg $c0" conditional, then the branch from e0 can be eliminated altogether. Question is how to write a generic pass to always do this right
14:38 mwk: that would work too, right
14:38 mwk: though as imirkin_ notes, with 5 instructions that are rarely executed, this slows down the shader
14:39 RSpliet: Right
14:43 RSpliet: Shame the movs leading up to the break aren't all equal, otherwise it could have been a trivial case for CSE, hoisting that code after the break
14:43 RSpliet: *after the loop
14:46 RSpliet: Guess a limited-scope option is to set a condition variable in the "else" clause of the tests (the BB ending with a break), and test that condition post-loop to perform that body. That way you execute those insns once instead of once per iteration
14:47 RSpliet: But... that means defeating the logic that in-lines the movs to $r[0-2] *into* the loop.
14:47 mwk: eh? just use a branch
14:49 RSpliet: The point is that a BB ending with a break will only be executed *once* per thread. There's no point in keeping that code inside a loop body, because you can also conditionally execute that once all threads have synchronised again
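A hedged C sketch of that approach; the names and values are illustrative, not taken from the shader in the bug:

```c
/* Instead of running the break-path movs inside the loop, record a condition
 * and run that body once after the loop, when all threads have reconverged. */
static void flag_then_postloop(int n, const int *in, int out[3])
{
    int hit = 0, which = 0;

    for (int i = 0; i < n; i++) {
        if (in[i] < 0) {        /* the BB that used to end with a break */
            hit = 1;            /* just set a flag...                   */
            which = i;
            break;              /* ...and leave the loop                */
        }
        /* rest of the loop body */
    }

    if (hit) {                  /* executed once, post-reconvergence,   */
        out[0] = which;         /* instead of once per iteration        */
        out[1] = which * 2;
        out[2] = which * 3;
    }
}
```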
14:53 mwk: right
14:54 mwk: but the simplest way is to just move the code out of the loop body and branch to it
14:55 mwk: CSE is not that important
14:55 mwk: not outside of the loop, anyway
15:08 RSpliet: Ah yes, I think the CSE point I wanted to make is "if all those BBs ending with a break have the same body, we can stick that code (minus the break) at the start of the break target BB and not worry about branches at all"
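A hedged sketch of that tail merging, assuming the loop only exits via breaks; names and values are illustrative:

```c
/* Every BB that ends in a break has the same body, so that body (minus the
 * break) can be placed once at the start of the break target instead of once
 * per break site, with no extra branch or flag needed. */
static void merged_break_tails(int n, const int *in, int r[3])
{
    int i;

    for (i = 0;; i++) {
        if (i >= n)
            break;              /* the identical movs that used to precede... */
        if (in[i] < 0)
            break;              /* ...each of these breaks...                  */
        /* regular loop body */
    }

    r[0] = i;                   /* ...now live once, at the break target BB    */
    r[1] = 0;
    r[2] = 0;
}
```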
15:12 RSpliet: Semi-related, but consider it unrelated: here we have the interesting situation where a default init of $r[0-2] is copied *into* the loop body (judging by the TGSI it used to reside outside) - which sounds clever because it avoids doubly-assigning to those regs, but is presumably detrimental to performance given the lock-step SIMD execution model of threads, which means the code might now be executed 32 times
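And a before/after sketch of that init placement, assuming n > 0 and with illustrative values; the per-iteration copy in the second form is the cost under lock-step SIMD execution:

```c
/* Shape of the original TGSI: defaults written once, before the loop. */
static void defaults_before_loop(int n, const int *in, int r[3])
{
    r[0] = r[1] = r[2] = 0;            /* default init, executed once */
    for (int i = 0; i < n; i++) {
        if (in[i] < 0) {
            r[0] = in[i];              /* overwritten on the break path */
            break;
        }
    }
}

/* Shape after the init is copied into the loop body: no path assigns a reg
 * twice, but the warp may now execute the default movs every iteration. */
static void defaults_inside_loop(int n, const int *in, int r[3])
{
    for (int i = 0; i < n; i++) {
        if (in[i] < 0) {
            r[0] = in[i];              /* break path: its own values */
            r[1] = r[2] = 0;
            break;
        }
        r[0] = r[1] = r[2] = 0;        /* non-break path: the defaults */
    }
}
```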
19:31 Lyude: time to play more with clockgating/maybe iso hub :)
19:33 mupuf: Lyude: cool!
19:56 Lyude: (also, could someone make sure that the patches I just sent to nouveau's mailing list get approved? i'm on the ml, but it seems they got rejected because my cc_cmd added too many people)
19:57 imirkin_: i dunno who would have such powers
21:16 stoatwblr: hi imirkin, just dropped the radeon card in, worked first time, no config needed. :)
21:28 imirkin_: stoatwblr: welcome to open-source support. enjoy.
21:36 stoatwblr: imirkin_, thanks :)
23:08 Lyude: mupuf: poke, a while ago while I was still stuck in an MST rabbit hole you mentioned that there had been some clockgating stuff included in the vbios documentation, did anyone ever take a look at that?