00:00karolherbst: or ...
00:00karolherbst: but why is it faulting anyway
00:00karolherbst: and also so randomly
00:00karolherbst: anyway.. might be able to figure out what's up
00:21jekstrand: HdkR: Good call! size += 4096 inside nouveau_bo_new fixes it. :D
00:21jekstrand: Now I just have to figure out what's overflowing.
00:23jekstrand: The next question is: What?
00:23HdkR: do you have padding at the end of shader programs?
00:23HdkR: You need to have the end of shader programs padded to 64 byte alignment
00:23jekstrand: Nope, not a shader program.
00:24HdkR: That's good at least
00:24jekstrand: We're currently always allocating a new BO for shaders (yeah, that's terrible) so they'll always be 64B aligned.
00:26jekstrand: Ok, if I pad all the things (push, upload, and shader) by 4K, no fail. But that could just be re-arranging addresses such that it works. :-/
00:27jekstrand: Ok, if I only pad shaders, it works.
00:29jekstrand: Yup, it's the shader.
00:29jekstrand: HdkR: Any idea how much it over-fetches by?
00:29jekstrand: 256 isn't enough
00:30HdkR: Maybe the fetch amount has changed from what I know :)
00:31jekstrand: That's seeming likely. :)
00:31jekstrand: I'm gonna just throw 4k on the end and leave a TODO. :D
00:31HdkR: What family generation is this that has the issue?
00:32HdkR: Welp, guess I misremembered the amount :P
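
The workaround discussed above (keep shader code 64-byte aligned, then throw 4K on the end of the shader allocation because the real over-fetch amount is unknown) could be sketched roughly as follows. This is only an illustration: the names `align_up`, `padded_shader_bo_size`, `SHADER_ALIGN_BYTES`, and `SHADER_OVERFETCH_PAD` are hypothetical and are not taken from nouveau or any driver. The 4096 value is the empirical pad from the log, not a documented hardware limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants, named here only for illustration. */
#define SHADER_ALIGN_BYTES     64u   /* alignment HdkR mentions for the end of shader code */
#define SHADER_OVERFETCH_PAD 4096u   /* empirical pad from the log; 256 was not enough */

/* Round value up to the next multiple of a power-of-two alignment. */
static uint64_t align_up(uint64_t value, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/*
 * Size to request for a buffer holding code_size bytes of shader code.
 * TODO: replace the 4K pad once the real instruction over-fetch
 * distance is known.
 */
static uint64_t padded_shader_bo_size(uint64_t code_size)
{
    return align_up(code_size, SHADER_ALIGN_BYTES) + SHADER_OVERFETCH_PAD;
}
```

Under these assumptions, a 100-byte shader would be rounded up to 128 bytes and then padded, for a 4224-byte allocation. Padding only the shader allocation, rather than every buffer, matches the narrowing-down in the log where padding shaders alone was enough to make the fault go away.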
15:18karolherbst: HdkR: nvidia inserts this weirdo BRA instruction after an EXIT, could that be somewhat related to that i-cache issue?
15:49mupuf: karolherbst: where does it branch to :o?
15:49mupuf: branch on the exit?
15:52karolherbst: mupuf: https://gist.githubusercontent.com/karolherbst/b0e225adfed73232d34e81fc848376e5/raw/197277eb05965567077cc8a648a2f92cf06f0318/gistfile1.txt
15:53mupuf: I see, it jumps to itself
15:56mupuf: could it be that some blending operations (or pending writes) are still going on when hitting exit, and the instruction would not block until they succeed but rather just set a signal that we have reached the end of the shader?
15:57karolherbst: well... but how do you reach that part?
15:58mupuf: and so, the last jmp would be busy-waiting until the wave (or whatever the nvidia name is) is completely done (including any potential memory operation not being completely done)
15:59mupuf: Maybe exit just raises a flag (ready-to-kill), and the killing condition would be "ready-to-kill && all-background-operations-done"
15:59karolherbst: nope, exit means exit
15:59karolherbst: you can install a post exit handler, but you need to do that explicitly
15:59karolherbst: and the exit in there would still just exit
16:00mupuf: well, obviously, there are some conditions where the $ip keeps going, otherwise they would not insert useless instructions which pollute the $I
16:00karolherbst: mhhhh _could_ be
16:02mupuf: hence why I was thinking that this would have to be something happening asynchronously of the instruction pointer... like memory-related things
16:02karolherbst: the more I think about it and read "stuff" the more it makes sense
16:03mupuf: it may also be a workaround for early silicon, and they just apply it for everything... just in case
16:03karolherbst: yeah.. my assumption was that they add it for the prefetcher
16:03karolherbst: but what _if_ the program just keeps going and for heavily diverged things it could do random crap
16:04mupuf: isn't it the job of the compiler to prevent this situation?
16:06karolherbst: yeah well.. apparently
16:06mupuf: is there a discard instruction, or should one use exit?
16:07mupuf: or discarding is just disabling some lanes?
16:08karolherbst: there was something but I don't know what it's called anymore.. but probably you could also just disable some lanes
20:02HdkR: karolherbst: Yes :)
20:02karolherbst: HdkR: mhhh I tried that, but didn't fix the issue we were seeing
20:03karolherbst: maybe the branch did something terribly wrong...
20:16HdkR: karolherbst: There should be some nop padding after that BRA as well
20:16HdkR: disassembler might drop it from being visible?
20:17HdkR: Or maybe the blob only uploads that after loading it in to vram
20:17karolherbst: ahh.. that might be it
20:25karolherbst: HdkR: okay.. that doesn't work either
20:26HdkR: Smells like a couple of edge cases might be stacking here
20:26karolherbst: or codegen doing stupid things
20:32karolherbst: HdkR: mhhh https://gist.githubusercontent.com/karolherbst/46941d86fe538c7de9783ffd439d79d4/raw/836a186d573ca42916ec5ab9182a5db8decebcdd/gistfile1.txt
20:32karolherbst: everything looks fine
20:32karolherbst: maybe not the alignment.. but...
20:33HdkR: It's at least a lot of nop there
20:34karolherbst: I wanted to be sure I add enough
20:36HdkR: Probably some edge that I'm not remembering
20:36HdkR: I don't think the scoreboarding in the nops and bra need to be anything special
20:37karolherbst: that's what codegen is doing
22:38karolherbst: HdkR: one thing we are wondering about is.. what's like the perf implications of using mme? Is it considered faster than normal command submission or what's like the main reason to use them? (I know practical reasons, like indirects)
22:39karolherbst: but maybe it makes sense to have a macro for something which could be done in a loop, because mme are blazing fast
22:42HdkR: Not sure about the perf implications, just that it is worth using