00:16 karolherbst: imirkin: sure, but does this set even compare anything?
00:17 karolherbst: ehh.. in envydis it's "l" not "lt" :/ uhh
00:17 imirkin: karolherbst: nv50 is different from nvc0
00:17 imirkin: it's not a one-shot predicate reg
00:18 karolherbst: yeah, I am aware
00:18 imirkin: but a full "eflags" type flags register
07:33 pmoreau: imirkin: I’ll check on how branching works.
07:45 pmoreau: imirkin: And looks like you were right with your second suggestion: the timeout seems to be coming from executing too many instructions or something. If I reduce the amount of work done per thread, the test passes and does not trigger any timeouts.
07:50 pmoreau: The limit seems to be between 160 and 240 loop iterations for that test, with the loop body being about 15 instructions long.
08:43 pmoreau: With a simpler kernel I can get to about 100,000 iterations without any timeouts (the inner loop is still about 12 instructions). This is annoying as there does not seem to be a good way to predict those timeouts, and no way to report an error back to the user either AFAICT (well, besides them looking at dmesg), so they end up with code that is correct, no errors reported by the API, but the output is wrong.
10:08 glennk: pmoreau, NOUVEAU_SHADER_WATCHDOG=0 ?
10:11 glennk: appears to be a separate flag for compute on nv50, nouveau has the reg def but isn't setting it
11:27 pmoreau: glennk: Ah thanks, I’ll look into it.
14:29 pmoreau: glennk: Good call! It seems to be enabled by default for compute, so adding some code to disable it solved it.
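(A minimal sketch of the kind of change being described, honoring the NOUVEAU_SHADER_WATCHDOG environment variable glennk mentions; the method define, `push_method()`, and `struct pushbuf` are invented stand-ins, not the actual nouveau defines. pmoreau's real patch is linked further down in the log.)

```c
#include <stdbool.h>
#include <stdlib.h>

#define NV50_COMPUTE_WATCHDOG_ENABLE 0x0000 /* placeholder method, not the real define */

struct pushbuf;                                   /* stand-in for the real pushbuffer type */
void push_method(struct pushbuf *push, unsigned mthd, unsigned data);

static bool watchdog_requested(void)
{
   /* NOUVEAU_SHADER_WATCHDOG=0 turns the watchdog off. */
   const char *env = getenv("NOUVEAU_SHADER_WATCHDOG");
   return !env || atoi(env) != 0;
}

static void nv50_compute_setup_watchdog(struct pushbuf *push)
{
   /* The watchdog is enabled by default for compute, so explicitly
    * clear it when the user asked for that. */
   push_method(push, NV50_COMPUTE_WATCHDOG_ENABLE,
               watchdog_requested() ? 1 : 0);
}
```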
15:36 epsilon_0: hello
15:36 epsilon_0: am not sure if I am audible
15:38 epsilon_0: nouveau doesn't work for my display, the dmesg keeps on erroring out with -
15:38 epsilon_0: < https://matrix.org/_matrix/media/r0/download/matrix.org/CfVoxETeOLHIrphsYNWVRmuk/message.txt >
15:38 imirkin: pmoreau: oh, is it a loop with variable iterations?
15:38 imirkin: pmoreau: we do something very wrong on nv50 for those
15:39 imirkin: pmoreau: there's a bug about it
15:39 karolherbst: uhh.. nested loops?
15:39 karolherbst:runs away
15:39 imirkin: karolherbst: no
15:39 imirkin: nested loops don't help
15:39 imirkin: but this issue is nest-less
15:39 karolherbst: ahh
15:39 karolherbst: I guess RA being RA
15:39 imirkin: no
15:40 imirkin: RA is fine
15:41 imirkin: pmoreau: https://bugs.freedesktop.org/show_bug.cgi?id=78161
15:42 imirkin: the conclusion was that we were ordering stuff wrong wrt the side that jumps vs doesn't jump
15:42 imirkin: but i never quite grasped the issue completely myself
15:43 karolherbst: yeah...
15:43 karolherbst: understanding divergency is a pain :p
15:44 karolherbst: conditional bras are pushing to the stack
15:45 imirkin: right, and it hits on every iteration
15:45 karolherbst: yep
15:45 karolherbst: that's very bad :D
15:45 karolherbst: so you run out of stack sooner or later
15:46 karolherbst: on kepler we always use the memory stack for compute
15:46 karolherbst: because... compute
15:46 imirkin: i don't think you run out of stack
15:46 imirkin: but it just ends up executing highly suboptimally
15:46 imirkin: or something
15:46 karolherbst: it might be that running out of stack is broken on nv50
15:47 karolherbst: because at some point the shader should stop, especially given how limited nv50 is
15:47 karolherbst: I wouldn't be surprised if that's causing huge issues
15:48 karolherbst: pmoreau: mind reducing the number of threads and seeing if lower local sizes "help"?
15:48 karolherbst: mainly curious
15:48 karolherbst: or run single threaded and see if that works :D
16:06 imirkin: anyways, it was never extremely clear what exactly we need to do differently for that case
16:06 imirkin: feels like a BB layout issue
16:07 imirkin: epsilon_0: your display is not responding with EDID.
16:08 epsilon_0: Imirkin can I fix that somehow?
16:08 imirkin: epsilon_0: tell me more about the setup...
16:09 imirkin: epsilon_0: and most importantly, does it work under any other conditions
16:09 epsilon_0: Display works with Nvidia drivers
16:09 karolherbst: imirkin: doubtful. Our model is just broken
16:09 karolherbst: doesn't even work for nvc0
16:10 karolherbst: or later even
16:10 epsilon_0: This is an acer xb270h monitor 1920x1080
16:10 epsilon_0: Kernel version 5.10.33
16:10 imirkin: epsilon_0: ok, so it works with blob drivers with the exact same cable plugged into the same port?
16:10 epsilon_0: Yep
16:11 imirkin: :(
16:11 imirkin:hates DP
16:11 epsilon_0: 😔
16:11 karolherbst: Lyude might know something
16:12 pmoreau: karolherbst: Sure, could try that.
16:29 pmoreau: karolherbst: Looks like it helps.
16:30 imirkin: pmoreau: iirc RSpliet explained what was happening at the time
16:30 imirkin: although i still didn't fully understand.
16:30 imirkin:is a bit dense
16:33 pmoreau: If it was trivial, it would probably have been fixed by now, so it most likely isn’t. :-)
16:50 karolherbst: imirkin: yeah.. RSpliet's explanation still conflicted with what I figured out on nvc0
16:50 karolherbst: pmoreau: well...
16:50 karolherbst: I have a solution but I don't like it
16:50 imirkin: pmoreau: i suspect it's not that hard if one has a crystal clear understanding of the problem
16:50 imirkin: i never did.
16:50 imirkin: karolherbst: your solution is for loop-within-loop
16:50 karolherbst: imirkin: the issue is, this general problem is not fixable inside shaders
16:50 imirkin: does your thing do anything for this case?
16:50 karolherbst: imirkin: nope
16:50 karolherbst: it's more general
16:50 imirkin: ah
16:50 karolherbst: it can also happen with one loop
16:51 karolherbst: just.. it's less likely
16:51 imirkin: i mean ... nvidia's generated shader worked fine
16:51 karolherbst: two loops after each other can also cause such issues quickly
16:51 imirkin: it's almost like they have a better compiler :)
16:51 karolherbst: so here is what nvidia does: they have some algorithms to calculate the most likely stack depth reached by a given shader
16:51 karolherbst: post optimization
16:51 karolherbst: and according to that, they either use the on chip cache or RAM
16:52 imirkin: karolherbst: no, they lay out their shader differently
16:52 karolherbst: nvidia does also do loop merging so nested loops are just single loops
16:52 karolherbst: they still use the RAM stack
16:52 imirkin: for this case
16:52 karolherbst: imirkin: irrelevant
16:52 imirkin: their BB layout = fine
16:52 imirkin: our BB layout = not fine
16:52 karolherbst: it only reduces the likelihood of the GPU running out of space
16:52 imirkin: quite successfully it would seem
16:52 karolherbst: yeah
16:52 karolherbst: loop merging also helps a ton
16:53 karolherbst: but they still configure the stack to be on the RAM if the loop is complex enough
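(To make the heuristic karolherbst describes concrete, here is an illustrative sketch, not NVIDIA's actual algorithm: walk the post-optimization CFG, estimate the worst-case reconvergence-stack depth, and pick the stack backing from that. All names and the on-chip capacity are invented.)

```c
#include <stdbool.h>

struct basic_block {
   struct basic_block **successors;
   int num_successors;
   bool ends_in_conditional;  /* a conditional bra can push a stack entry */
   bool visited;
};

enum stack_backing { STACK_ON_CHIP, STACK_IN_RAM };

#define ON_CHIP_STACK_ENTRIES 16 /* invented per-warp capacity */

/* Deepest chain of unsynced conditional branches reachable from bb.
 * A well-formed loop pops what it pushed before iterating again, so
 * visiting each block once is good enough for this sketch. */
static int estimate_stack_depth(struct basic_block *bb, int depth)
{
   int max = depth;
   bb->visited = true;
   for (int i = 0; i < bb->num_successors; i++) {
      struct basic_block *succ = bb->successors[i];
      if (!succ->visited) {
         int d = estimate_stack_depth(succ,
                    depth + (bb->ends_in_conditional ? 1 : 0));
         if (d > max)
            max = d;
      }
   }
   return max;
}

static enum stack_backing pick_stack_backing(struct basic_block *entry)
{
   /* Shallow shaders fit in the on-chip stack; complex control flow
    * gets the slower but bigger stack in RAM. */
   return estimate_stack_depth(entry, 0) <= ON_CHIP_STACK_ENTRIES
        ? STACK_ON_CHIP : STACK_IN_RAM;
}
```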
16:53 Lyude: epsilon_0: does this happen on the latest drm-tip btw?
16:53 karolherbst: pmoreau: you could try with 2+ threads and see if you see a count where it starts happening
16:53 karolherbst: I suspect it will be in the 20+ range, but we never know
16:54 karolherbst: I also suspect nv50 to be a bit broken in those areas, because compute was quite new
16:54 karolherbst: and the hw might just get stuck instead of erroring out
16:55 imirkin: karolherbst: this happens in frag shader
16:56 karolherbst: ehh.. sure, I meant more "general programming" than compute really
16:56 karolherbst: like unrestricted control flow and stuff
16:56 karolherbst: so what nvidia's shader is doing differently is that threads exiting the loop are also stopped later on
16:57 karolherbst: and that allows freeing some resources
16:57 imirkin: the frag shader just exits at some point
16:57 imirkin: silently
16:57 imirkin: so you end up with random colors
16:57 imirkin: since it's whatever happens to be in r0..r3 at the time
16:57 karolherbst: ahh, okay, so this is for graphics and pmoreau's issue was on compute, hitting traps, no?
16:57 imirkin: yeah, i think he was getting a TIMEOUT
16:57 karolherbst: yeah...
16:57 imirkin: in frag it was just totally silent exit
16:58 karolherbst: yep
16:58 imirkin: "this seems boring, what else ya got"
16:58 karolherbst: fyi, debugging compute shaders is also straightforward, so I wouldn't be surprised that error handling differs
16:58 imirkin: there might also be some different configuration we have between 3d and compute
16:58 imirkin: which causes the difference
16:58 karolherbst: yeah, might be
16:59 karolherbst: so what helps is, if you break out of a loop, to do it in such a way that the thread is done quickly
16:59 karolherbst: and doesn't enter a new iteration or something
16:59 karolherbst: because that would split the stack again
17:00 karolherbst: and there is a small depth of splitting you can do
17:00 karolherbst: so essentially if you conditionally jump, you halve your stack
17:00 karolherbst: until you can't
17:01 karolherbst: hence you want to merge in a way that all threads reach that point without running loose for too long
17:02 karolherbst: we only insert those merge points for if/then/else trees and not for loops btw
17:02 karolherbst: or was it the other way around?
17:03 karolherbst: ahh only for loops
17:03 karolherbst: and I think there was some magic for ifs within a loop, or that's something I came up with at some point
17:03 karolherbst: ahh no, it's inside from_tgsi
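(A toy model of the divergence stack under discussion, to make "if you conditionally jump, you halve your stack" concrete: every divergent conditional branch parks the not-taken lanes on a finite stack until a sync/join pulls them back in. Everything here is invented for illustration.)

```c
#include <stdbool.h>
#include <stdint.h>

#define STACK_ENTRIES 16 /* invented: the point is that it is finite */

struct stack_entry { uint32_t mask; int rejoin_pc; };

struct warp {
   uint32_t active;                        /* lanes currently running */
   struct stack_entry stack[STACK_ENTRIES];
   int sp;
};

/* Divergent conditional branch: park the lanes that go the other way
 * until the rejoin point. Every unsynced split costs a stack entry. */
static bool branch_cond(struct warp *w, uint32_t taken_mask, int rejoin_pc)
{
   uint32_t parked = w->active & ~taken_mask;
   if (parked && parked != w->active) {
      if (w->sp == STACK_ENTRIES)
         return false; /* out of stack: the failure mode being discussed */
      w->stack[w->sp++] = (struct stack_entry){ parked, rejoin_pc };
      w->active &= taken_mask;
   }
   return true;
}

/* Merge point (joinat/sync): reconverge with the parked lanes. Without
 * these, loop iterations keep splitting and entries pile up. */
static void join(struct warp *w)
{
   if (w->sp > 0)
      w->active |= w->stack[--w->sp].mask;
}
```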
17:06 karolherbst: imirkin: and I think break implicitly syncs
17:06 karolherbst: that was something I assume for nvc0+ as well
17:07 imirkin: karolherbst: look at the bug
17:07 imirkin: there are two versions
17:07 imirkin: i did one without break
17:07 karolherbst: the second one, right?
17:07 imirkin: ya
17:07 imirkin: and then look at the nvidia one
17:07 karolherbst: but there you never sync
17:07 imirkin: and cry :)
17:07 karolherbst: no :D
17:08 karolherbst: breakaddr + break != bra
17:08 imirkin: uhhhh wut
17:08 karolherbst: yep
17:08 karolherbst: break does some stack magic as well
17:08 karolherbst: so if you use bra instead for breaks it should just make our shader work
17:08 imirkin: yeah, you're right, there's no joinat/join
17:08 karolherbst: the first one
17:09 imirkin: what's wrong with breakaddr + break exactly?
17:09 karolherbst: not quite sure yet
17:09 karolherbst: but it does adjust the stack in a weird way
17:09 imirkin: look at the one in comment 7
17:09 karolherbst: I really have to find some time to model all of that correctly
17:09 imirkin: where i just got rid of the break entirely
17:09 imirkin: but yeah, no joinat/join there
17:10 imirkin: i never noticed that before
17:10 imirkin: but i don't think it would have helped immensely
17:10 karolherbst: I can only really tell from my experience with nvc0+ where all of this is a bit more advanced
17:11 karolherbst: so what the nvidia shader does is, they have one bra jumping to the end
17:11 karolherbst: in comment 7 we have two of those
17:11 karolherbst: ehh wait
17:11 karolherbst: nvidia also has two
17:12 karolherbst: ohhhh
17:12 karolherbst: imirkin: okay
17:12 karolherbst: we have 3
17:12 karolherbst: I missed the last one
17:12 karolherbst: imirkin: the last conditional bra is the issue
17:12 karolherbst: so we conditionally iterate again
17:12 imirkin: right
17:13 imirkin: whereas we need to conditionally bail out right?
17:13 karolherbst: yes
17:13 imirkin: and unconditionally continue?
17:13 imirkin: yes
17:13 imirkin: that's what RSpliet said before too iirc
17:13 karolherbst: okay
17:13 imirkin: so it's basically a BB layout issue
17:13 imirkin: but how to detect that properly.... not clear to me
17:13 karolherbst: in this case it can be fixed by a different BB layout, yes :)
17:13 karolherbst: yeah...
17:13 imirkin: like i can sit here and analyze the program in my head and come to that conclusion
17:14 imirkin: but i'm not sure how a simpler piece of code than whatever firmware's in my head can achieve this
17:14 karolherbst: well.. in nir we have those helpers to figure all of that out. TGSI misses them
17:14 imirkin: this needs to be figured out at the nv50 ir level
17:14 karolherbst: we can't
17:14 karolherbst: because unstructured
17:15 imirkin: ah, nir does it by structure. i see.
17:15 karolherbst: yeah
17:15 karolherbst: the helpers assume structure
17:15 imirkin: i think it should still be achievable
17:15 karolherbst: so you can know all the structure :)
17:15 karolherbst: yeah sure
17:15 imirkin: i'll try to think about it...
17:15 imirkin: at some point
17:15 karolherbst: it's just not as easy
17:15 karolherbst: I think the best idea would be to just remember the layout from the structured IR
17:15 karolherbst: and deal with it this way
17:15 imirkin: yeah, maybe
17:15 imirkin: which we can do from tgsi too
17:16 imirkin: since the unconditional is just the "end" of the loop
17:16 karolherbst: sure
17:16 karolherbst: just TGSI doesn't have those nice helpers :p
17:16 imirkin: sure it does. if op == "END": unconditional
17:16 karolherbst: so in nir you can just get all "breaks" pointing towards the end of the loop
17:16 imirkin: er, not end, but you know what i mean
17:16 karolherbst: and stuff
17:16 karolherbst: not saying it's impossible in TGSI
17:17 karolherbst: just that we already have those helper functions in nir :)
17:17 imirkin: trivial to determine in tgsi
17:17 imirkin: no helper function needed
17:17 imirkin: anyways, just need to figure out what to do with that info
17:17 karolherbst: imirkin: the problem starts with structure inside the loop etc..
17:17 karolherbst: but yeah
17:18 karolherbst: what you want to do is to convert the loop into a latch-closed one
17:18 pmoreau: Right, “gr: TRAP_MP_EXEC - TP 0 MP 0: 00000008 [TIMEOUT] at 0000d0 warp 10, opcode 30020001 c4100780” was the error
17:19 pmoreau: And disabling the watchdog does help up to a certain degree, as does reducing the size of a block.
17:19 karolherbst: pmoreau: interesting
17:19 karolherbst: so maybe we just have a divergency explosion in your case
17:19 imirkin: btw, i'm not like the only one allowed to fix these issues, so if you guys are interested in this, be my guests :)
17:19 karolherbst: imirkin: yeah.. I want to fix all of this anyway at some point
17:20 karolherbst: it's just such an annoying issue
17:20 pmoreau: As long as there is no RE needed, I’m happy to look into it. :-)
17:20 karolherbst: and doesn't really break all that much
17:20 imirkin: well, i think you're trying to fix the larger thing
17:20 karolherbst: yep
17:20 imirkin: which is ... larger
17:20 imirkin: i think this is a nice tractable little bit
17:20 imirkin: and it does come up on nv50 a lot in practice
17:20 karolherbst: okay
17:21 imirkin: pmoreau: so the thing is that the "jump back to top of loop" needs to be unconditional
17:21 imirkin: pmoreau: and the "jump out of loop" stuff is what needs to be conditional
17:21 karolherbst: I have to fix that stuff for volta+ anyway
17:21 pmoreau: There is some divergence but it should be confined to each loop iteration, so it should not be exploding.
17:21 karolherbst: uhh.. volta...
17:21 imirkin: from a raw "program execution" standpoint, it shouldn't matter, but from a practical standpoint, it does.
17:21 pmoreau: Mmh, okay
17:22 karolherbst: pmoreau: you diverge inside the inner loop, which is... painful
17:22 imirkin: pmoreau: the observation being that a lot more threads jump back to the top of the loop
17:22 imirkin: than ones that exit the loop
17:22 imirkin: so if you keep getting it wrong, it's not good.
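(The layout distinction imirkin is describing, sketched as C gotos; note the discussion further down revisits whether this is exactly what NVIDIA emits. The sketch is just to make the two shapes explicit.)

```c
/* Shape under suspicion: the back edge is the conditional jump, so the
 * many lanes that iterate again diverge from the few that fall through
 * and exit, on every single iteration. */
void loop_conditional_backedge(int n)
{
   int i = 0;
top:
   /* ... loop body ... */
   if (++i < n)
      goto top;    /* conditional jump back to the top */
   /* fallthrough = loop exit */
}

/* Suggested shape: bail out conditionally, loop back unconditionally,
 * so the common path (iterate again) never splits the warp. */
void loop_conditional_exit(int n)
{
   int i = 0;
top:
   /* ... loop body ... */
   if (++i >= n)
      goto done;   /* conditional jump out of the loop */
   goto top;       /* unconditional jump back to the top */
done:
   ;
}
```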
17:23 pmoreau: In this case, all threads should jump back to the top of the loop: the number of iterations is the exact same for all threads. But some might do more work than others inside the loop.
17:23 karolherbst: imirkin: it's something more annoying than that
17:23 karolherbst: but yeah
17:23 imirkin: pmoreau: that may be
17:23 imirkin: but that's not how the hardware sees it i think
17:23 imirkin: pmoreau: i think it first pushes all the threads onto the stack
17:23 karolherbst: pmoreau: you have to look at it from a divergency perspective
17:23 imirkin: and then tries to execute the 0 threads which don't jump to the top
17:23 imirkin: etc
17:23 imirkin: or ... something
17:24 karolherbst: at some point, they all run single threaded, because you split too much
17:24 TimurTabi: skeggsb
17:24 karolherbst: so if you just run one thread, execution quits quite quickly
17:24 pmoreau: But it should sync back at the end of the conditional before jumping back to the top of the loop, no?
17:24 karolherbst: pmoreau: why would it?
17:24 pmoreau: It would be stupid to let it diverge until the end of time for no reason.
17:25 karolherbst: pmoreau: well, that's what is happening
17:25 karolherbst: so.. if you conditionally jump, you split
17:26 karolherbst: and if this group conditionally jumps again without syncing, you split again
17:26 pmoreau: Right, but then the compiler should add a joinat when both paths merge again.
17:26 karolherbst: pmoreau: you could sync around inner ifs, yes
17:27 pmoreau: But apparently it does not, since I have the issue?
17:27 karolherbst: yeah..... the nir path is a bit broken in this area
17:27 karolherbst: I have patches (tm)
17:27 karolherbst: let's see
17:29 karolherbst: pmoreau: https://github.com/karolherbst/mesa/commit/0b73ea24ecd3345c8389022bc99c07cbb762197f
17:29 karolherbst: try this maybe?
17:29 karolherbst: or did I merge it already
17:29 pmoreau: Thanks, will try
17:30 karolherbst: but.. it's a bit limiting
17:30 karolherbst: might cause other issues
17:30 karolherbst: and not fix yours
17:31 pmoreau: We will see, currently compiling.
17:32 karolherbst: pmoreau: you can also tweak the conditionals and let the code insert more sync points, you just need to be a bit careful about those things
17:33 karolherbst: and I might have to see your shader to give a proper hint
17:34 pmoreau: It did not help; I’ll share a trace in a bit.
17:54 karolherbst: I actually had a different patch in mind anyway, it's just that I can't find it :D
17:55 pmoreau: karolherbst: Here, with your patch applied: https://gitlab.freedesktop.org/pmoreau/mesa/-/snippets/2036
17:55 pmoreau: Direct link to the logs: https://gitlab.freedesktop.org/pmoreau/mesa/-/snippets/2036/raw/main/test.log
17:55 pmoreau: (The snippet also includes the input CL kernel, and what NVIDIA generates for it.)
17:56 karolherbst: pmoreau: uhh
17:56 karolherbst: yeah.. the last loop
17:56 karolherbst: so this one splits
17:57 karolherbst: mhhh
17:59 karolherbst: pmoreau: we might just want to run without a timeout for CL :p
17:59 karolherbst: the shader is fine.. it just diverges a lot
17:59 pmoreau: :-D
17:59 karolherbst: but it shouldn't run out of stack
17:59 karolherbst: you only have one live thread group
17:59 karolherbst: and this is fine
17:59 karolherbst: as split-off threads immediately sync
18:00 pmoreau: Even without the timeout, dmesg wasn’t free of complaints:
18:00 karolherbst: pmoreau: is the watchdog configured from userspace or inside the kernel?
18:01 pmoreau: < https://matrix.org/_matrix/media/r0/download/matrix.org/OnVEqahByHumUzNyhqPigFyK/message.txt >
18:01 pmoreau: Userspace
18:01 karolherbst: pmoreau: what the hell is even that error :D
18:02 pmoreau: 🤷
18:02 karolherbst: I guess it just WFI and aborts or something :/ mhh
18:02 pmoreau: And the laptop/vgaswitcheroo/whoever was trying to put the card back to sleep while it was still running.
18:02 karolherbst: ehh...
18:02 karolherbst: yeah well...
18:03 karolherbst: crap
18:03 karolherbst: I guess this can happen if we don't do enough ioctls...
18:03 karolherbst: pmoreau: we might need a "please don't turn off this GPU" ioctl
18:03 karolherbst: as... I can see this happening with CL a lot
18:04 karolherbst: mind testing with runpm disabled? :D
18:04 pmoreau: This is the quick patch I added for the watchdog for compute: https://gitlab.freedesktop.org/pmoreau/mesa/-/snippets/2037
18:05 pmoreau: Okay one sec, need to remember how to turn it off without rebooting.
18:05 karolherbst: but you said single threaded it was okay?
18:06 karolherbst: brb
18:16 pmoreau: Okay, no watchdog and a local size of 64, and the test passes (though takes like 5–10 seconds). If I go for the full amount of VRAM, it takes about 40 seconds, triggers the TLB flush messages I mentioned earlier, but completes successfully.
18:18 pmoreau: Even going back to the computed local size (of 512), it still works though does take ages to complete.
19:15 karolherbst: mhh, interesting that it gets slower the more threads you have
19:16 karolherbst: but yeah.. at least it does work
19:56 pmoreau: Well, not all threads will be able to run concurrently if the grid size is high enough.
19:58 pmoreau: imirkin: “so the thing is that the "jump back to top of loop" needs to be unconditional” that seems to be what Nouveau is doing, while NVIDIA has a conditional jump back to the top of the loop that also acts (by falling through) as the loop exit.
20:05 imirkin: you calling me a liar? :)
20:05 imirkin: hrmph
20:05 imirkin: i am a liar.
20:05 imirkin: oops
20:06 pmoreau: You could have gotten it backwards by mistake. And I can see a reason for it being the way you described it.
20:06 pmoreau: Oh
20:07 imirkin: ok, so
20:07 imirkin: technically you're right
20:07 imirkin: but
20:07 imirkin: 000000d8: 10021003 00000100 (e $c0) bra 0x108
20:07 imirkin: still sucks.
20:07 imirkin: but maybe you're right, and that's not the actual problem
20:09 pmoreau: I need to look at how those pre-break, break, and stuff work out since NVIDIA only seemed to use branches + joinat in that particular instance.
20:09 pmoreau: I might do so later this weekend, but we will see; I’ll take care of the remaining comments on the MR first.
20:10 pmoreau: Regarding your comment on not needing to purge records for loads, why is that so? It does indeed work, but won't purging only for stores prevent loads from being combined?
20:11 imirkin: pmoreau: the records are keeping track of "cached" loads
20:11 imirkin: if you happen to do a load which "interferes" with a cached load and you don't take advantage of the cache - that's your loss, but no big deal
20:11 imirkin: if you store something without the cache knowing, then ... it's a big deal :)
20:16 karolherbst: pmoreau, imirkin: you are looking at it a bit wrong, what matters is if the split-off part is running into sync points or not. so if you branch conditionally and both groups end up doing stuff like breaks or syncs or something, then you can run into huge pain
20:17 karolherbst: so one group will be stuck at the break, while the other iterates further
20:18 pmoreau: Oh, I thought that pass was about combining 2 consecutive 32-bit loads into a single 64-bit load (and same for stores) for example, so by purging the records you were effectively preventing the pass from doing that, as it would have no memory of a previous 32-bit load that it could merge the current load with.
20:18 imirkin: pmoreau: it does a few things
20:18 imirkin: it also combines stuff
20:19 imirkin: but the records are about maintaining the cache
20:20 pmoreau: Okay
20:20 imirkin: (and knowing what you can combine with)
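(A rough sketch of the record-keeping imirkin describes; the shape is loosely modeled on the idea, not on the actual nv50_ir pass, and all names are invented. The asymmetry is the point: a store that might alias a record must purge it for correctness, while an aliasing load only costs a missed optimization.)

```c
#include <stddef.h>
#include <stdbool.h>

struct value; /* stand-in for an SSA value */

struct record {
   int base, offset, size;  /* memory range a past load covered */
   struct value *val;       /* value holding the loaded data */
   struct record *next;
};

static bool overlaps(const struct record *r, int base, int off, int size)
{
   return r->base == base &&
          off < r->offset + r->size && r->offset < off + size;
}

/* Store: cached copies of the stored-over range are now stale, so the
 * overlapping records must go. Missing this would be "a big deal". */
static void on_store(struct record **list, int base, int off, int size)
{
   while (*list) {
      if (overlaps(*list, base, off, size))
         *list = (*list)->next;  /* drop stale record */
      else
         list = &(*list)->next;
   }
}

/* Load: reuse a record if one fully covers the range; a miss just means
 * issuing the load again ("that's your loss, but no big deal"). */
static struct value *on_load(struct record *list, int base, int off, int size)
{
   for (struct record *r = list; r; r = r->next)
      if (r->base == base && r->offset <= off &&
          off + size <= r->offset + r->size)
         return r->val;
   return NULL;
}
```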
20:21 karolherbst: imirkin: ohhh.... I have a stupid way of reverse engineering this...
20:22 karolherbst: we just map the off-chip cache as a buffer into the kernel and dump it after certain instructions without doing any CF stuff
20:23 pmoreau: “so one group will be stuck at the break, while the other iterates further” so in this specific scenario since we only sync for the loop and not the inside if, half the warp will end up performing all its iterations until it reaches the sync outside the loop, and then the other half can finally progress further. :-/
20:23 karolherbst: pmoreau: yeah
20:24 karolherbst: but
20:24 karolherbst: I am not 100% sure if a break syncs
20:24 karolherbst: I have some hints towards this as this does explain things like this one
20:25 pmoreau: Why not use a joinat/SSY?
20:25 karolherbst: well
20:25 karolherbst: you'd have to insert them in a way that helps
20:26 karolherbst: but usually if you know that after the break you just quit the shader you really don't need this sync overhead
20:26 karolherbst: might be that for this one bug we also just timeout...
20:27 karolherbst: not entirely sure what's happening there
20:27 pmoreau: Okay, but here the missing synchronisation would be inside the loop, and not outside right before leaving.
20:27 karolherbst: what's "Unhandled ustatus 0x00000040" anyway
20:27 karolherbst: pmoreau: why do you need to sync in the first place?
20:28 karolherbst: those threads exit the loop for a reason
20:28 karolherbst: and some threads will be finished, some not
20:28 karolherbst: you might sync after the loop and that's what the break is doing (probably)
20:28 karolherbst: but how can you sync inside it?
20:28 pmoreau: So that all lanes in the thread can go back to doing iterations rather than having one half doing all its iterations while the other does nothing.
20:28 karolherbst: there is no real structure besides an "I am done with the loop" thing
20:29 karolherbst: pmoreau: but they don't
20:29 karolherbst: if you branch, that block will quit
20:30 karolherbst: at least in your kernel
20:30 pmoreau: That block yes, but it still remains inside the loop, and continues to the next iteration.
20:30 karolherbst: yeah and the other waits after the loop
20:30 pmoreau: Why after? It still is within the loop too
20:31 karolherbst: it hits the break, no?
20:31 karolherbst: or the synced instruction in your case
20:31 pmoreau: The code is `for () { if () do something }`. There is no early exit from the loop for anyone.
20:32 karolherbst: hu? but the shader you posted says something different
20:32 pmoreau: IIRC, the break is hit once you exit the loop.
20:33 karolherbst: well, I am looking at your ISA output
20:33 karolherbst: and there is an early exit as you still have to do some SSA stuff
20:33 karolherbst: but that's besides the point anyway
20:33 karolherbst: you split the block and one part will exit, while the other continues
20:34 karolherbst: there is no way those can get merged unless you are done with the loop
20:35 pmoreau: The break is on line 287, which is when you exit the loop.
20:35 karolherbst: you mean 288, no?
20:35 pmoreau: Or are you looking at the .asm file, which is what NVIDIA produces for that kernel?
20:36 pmoreau: Yes, 288, sorry
20:37 karolherbst: okay.. let's look at it from top to bottom
20:37 karolherbst: the first point of splitting is "27: eq $c0 bra BB:8 (8)", right?
20:37 karolherbst: ejhh
20:37 karolherbst: 23: eq $c0 bra BB:5 (8) actually
20:38 karolherbst: so, what do the blocks do? the one jumps to BB:5 exiting the loop, while the other continues, right?
20:39 pmoreau: Right
20:39 karolherbst: but yeah.. right, the second split point is indeed problematic...
20:39 karolherbst: so you diverge within the loop
20:39 karolherbst: kind of missed that
20:39 karolherbst: mhhhh
20:39 karolherbst: there is an easy solution here, which might just work
20:39 pmoreau: That’s the initial check whether to enter the loop at all or not (23).
20:40 karolherbst: pmoreau: soo... there is one thing you can try
20:40 karolherbst: move the "--curIfDepth" out of the if
20:40 karolherbst: and remove the curIfDepth checks
20:40 pmoreau: I’ll try that tomorrow, I need to get some sleep.
20:40 karolherbst: okay
20:40 karolherbst: I think that's enough to sync on nested ifs
20:41 karolherbst: but.. that can cause other issues :D
20:41 karolherbst: ohh wait
20:41 karolherbst: that won't work, because you are still inside the loop
20:41 karolherbst: mhh
20:41 karolherbst: ping me tomorrow and I'll write a patch
20:42 pmoreau: On a slightly different note, do you know if there is a way for Nouveau to signal that a kernel experienced a runtime issue (like instructions getting timed out, for example) back to the calling API? So both 1) if the Nouveau bits in Mesa have a way to get that info, and 2) if there is a way to pass it back via OpenGL/OpenCL.
20:42 pmoreau: I can definitely do that, thanks!
20:42 karolherbst: pmoreau: there is a way to signal traps to userspace, but... it's annoying
20:43 karolherbst: soo..
20:43 karolherbst: you can install a trap handler, and for every signal the kernel doesn't catch the trap handler gets triggered and you have a builtin kernel handle it
20:43 karolherbst: maybe there are other ways
20:45 pmoreau: But in this case I’m guessing the kernel catches it as I get an error in dmesg. My question would be more if the kernel can propagate that info to Mesa (or Mesa can query for it), so that it can set some GL_ERROR_RUNTIME or whatever for the compute dispatch call or kernel enqueue or clFinish() call.
20:45 pmoreau: So that the user knows that something wrong happened, without having to check dmesg.
20:45 karolherbst: pmoreau: ehh no, so you can configure in the hardware which traps to send an IRQ for
20:45 karolherbst: and which to handle internally
20:46 karolherbst: we can of course let the kernel process the IRQ in a way that it gives this information back to userspace
20:46 karolherbst: but...
20:46 karolherbst: that's annoying to do :p
20:46 pmoreau: :-D
20:47 karolherbst: well.. either a blocking ioctl, but then you still have to handle it for the API
20:47 karolherbst: or.. you return an error when submitting command buffers
20:47 karolherbst: also annoying
20:47 karolherbst: the problem really is this async nature of everything
20:48 karolherbst: and the trap handler also being an overhead I guess
20:48 karolherbst: maybe it's for free.. dunno
20:48 karolherbst: I looked into all of that for nvc0
20:48 karolherbst: it's... annoying
20:48 karolherbst: and maybe I missed something
20:51 pmoreau: No idea. It would not have to be about what went wrong, but at least knowing that something went wrong. Like if I trigger the TDR on Windows, the device gets removed for my program and the API will return the DEVICE_REMOVED error code or something. So I don’t know exactly what happened (did I crash the GPU, did the kernel run for too long, or which part of my program did something wrong), but at least I know that
20:51 pmoreau: something went wrong and that my computations being off could be due to that.
20:51 karolherbst: pmoreau: yeah. I am planning to do that for nouveau as well :D
20:51 pmoreau: Ah, awesome!
20:51 karolherbst: but that's the "context went boom" scenario
20:51 karolherbst: we already have the kernel bits in place
20:52 karolherbst: so ioctls return -ENODEV on a channel when it got destroyed in the kernel
20:52 pmoreau: Still better than silently failing, imho. :-)
20:52 karolherbst: yeah.. we still don't handle it inside mesa though
20:52 karolherbst: we could just recreate the channel and hope for the best
20:52 karolherbst: or... fail
20:52 pmoreau: Ah, okay.
20:52 karolherbst: there is also the robustness context thing where we can handle it explicitly and report back
20:52 karolherbst: kind of on my todo list after fixing mt
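(A sketch of the reporting path being discussed, assuming the channel ioctls return -ENODEV as described above; the context struct and plumbing here are invented, loosely following the robustness-context idea of a queryable reset status.)

```c
#include <errno.h>

enum pipe_reset_status {
   PIPE_NO_RESET,
   PIPE_UNKNOWN_CONTEXT_RESET,
};

struct nouveau_context {
   enum pipe_reset_status reset_status;
};

/* Called on every pushbuf-submit ioctl result. Once the kernel has
 * destroyed the channel after a fault, submissions return -ENODEV. */
static void note_submit_result(struct nouveau_context *ctx, int ret)
{
   if (ret == -ENODEV)
      ctx->reset_status = PIPE_UNKNOWN_CONTEXT_RESET;
}

/* With a robustness-style context, the API layer polls this and hands
 * the application an error instead of leaving it to read dmesg. */
static enum pipe_reset_status
get_device_reset_status(struct nouveau_context *ctx)
{
   return ctx->reset_status;
}
```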
20:53 pmoreau: Anyway, I’ll head out for now as otherwise I feel like I’ll be staying until 4am. 🙃
20:53 karolherbst: and I think for CL that's even simple to implement as well
20:53 karolherbst: :D
20:53 karolherbst: yeah, have fun
20:53 pmoreau: Thanks!